I started grilling the idea before writing the execution plan

There’s this pattern I kept running into while working with OpenAI Codex, and it took me longer than I’d like to admit to realize what was actually going wrong. On the surface everything looked fine: the plans were clean, structured, even slightly over-engineered in a good way, the kind where you feel like you’ve thought things through. You hit run, and then nothing really breaks, which almost makes it worse.

Because things don’t fail loudly.

They just drift.

And that drift is subtle. It’s the kind where you’re three files deep, scrolling back and forth, and you can feel that something is off but you don’t have a concrete failure to point to. So you start tweaking prompts, adjusting wording, maybe adding more context, and it works a bit better, but not in a way that actually gives you confidence, more like you’re nudging something unstable into temporarily behaving.

I used to think that was just part of the process.

It isn’t.

What was actually happening, at least in my case, was that the idea underneath the plan had never been fully specified. It just looked like it had, which is a much more annoying failure mode because it passes the initial smell test.

The shift wasn’t about better prompts

At some point I leaned more seriously into execution plans, specifically the approach described in OpenAI’s Codex execution plans cookbook, and that already changed quite a bit because instead of vague instructions I started writing things that had actual shape to them, steps that were explicit, outcomes that were scoped, fewer places where the model had to guess what I meant.

And that helped.

But not in the way I expected.

Because even a very clean plan can still be fundamentally wrong if the idea underneath it was never interrogated, just wrong in a way that isn’t obvious until halfway through execution, which is exactly when you don’t want to discover it.

So I added friction on purpose

This is where Matt Pocock’s grill-me skill comes in, and I’m aware that “adding friction” sounds like the opposite of what you want when everything is already fast, but the interesting part is that this friction shows up earlier, which means it removes a much more expensive kind of friction later.

The idea is simple to describe and slightly uncomfortable to actually do.

You don’t jump straight into execution or treat the initial feature request as enough.

You interrogate it.

Relentlessly.

One question at a time, no skipping ahead, no hand-waving answers, and importantly, no pretending you know something when you don’t, because the whole thing falls apart if you do.

What that looks like in practice

I’ll usually kick off both parts, the grilling and the request for an execution plan, in a single prompt.

I reference the skill, describe the feature I want to implement, and make it clear that the execution plan should come after the grilling pass, not before it.

Something like this:

$grill-me I want to add request-level caching to the blog post list page so repeated navigation does not refetch markdown metadata, but the cache still updates correctly when posts change. Create the execution plan after the grilling.

So instead of first writing a plan and then asking the skill to critique it, I start with the feature itself, let the grilling expose all the hidden decisions, and only then turn the answers into an execution plan.

It starts as something that sounds reasonable, like “add caching,” which is exactly the kind of sentence that feels clear until you realize how many decisions are hidden inside it.

Then the grilling starts and the surface area of that sentence explodes almost immediately.

You go from one sentence to a graph of questions.

Cache invalidation alone branches into timing, triggers, scope, storage, and suddenly you’re not talking about “adding caching” anymore, you’re defining a system.
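
To make that concrete, here is a minimal Python sketch of what “add caching” can turn into once those four branches have answers. Everything in it is an assumption made for illustration: the names CachePolicy and MetadataCache, the 60-second TTL, and the choice of a version token (for example a file’s modification time) as the invalidation trigger are hypothetical, not something the skill or Codex prescribes.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class CachePolicy:
    """Each field records one answer from the grilling pass (values are hypothetical)."""
    ttl_seconds: float = 60.0   # timing: how long an entry may be served
    check_version: bool = True  # trigger: reload when the source version changes
    scope: str = "per-post"     # scope: recorded only; this sketch caches per post
    storage: str = "memory"     # storage: recorded only; this sketch uses a dict


class MetadataCache:
    """In-memory cache keyed by post slug. A sketch, not a production cache."""

    def __init__(
        self,
        policy: CachePolicy,
        load: Callable[[str], dict],        # slug -> metadata
        version_of: Callable[[str], Any],   # slug -> version token, e.g. file mtime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._policy = policy
        self._load = load
        self._version_of = version_of
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any, dict]] = {}

    def get(self, slug: str) -> dict:
        now = self._clock()
        entry = self._entries.get(slug)
        if entry is not None:
            stored_at, version, metadata = entry
            fresh = now - stored_at < self._policy.ttl_seconds
            unchanged = (
                not self._policy.check_version
                or version == self._version_of(slug)
            )
            if fresh and unchanged:
                return metadata
        # Miss, expired, or source changed: reload and store with the current version.
        metadata = self._load(slug)
        self._entries[slug] = (now, self._version_of(slug), metadata)
        return metadata
```

The point of the sketch is less the cache itself than the policy object: every field is a decision that was invisible in the original one-line request.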

Same with failure modes, same with state ownership, same with API stability, everything that looked like a single decision turns out to be five or six that depend on each other in ways that are very easy to ignore if nobody is forcing you to spell them out.
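
One way to picture those dependencies is as a small graph. The sketch below is only illustrative: the decision names and the edges between them are hypothetical examples for the caching request, and it relies on Python’s standard graphlib.TopologicalSorter to produce an order in which each question is asked only after the ones it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical decisions; each one maps to the decisions it depends on.
DEPENDS_ON = {
    "storage": set(),
    "scope": {"storage"},
    "invalidation trigger": {"scope"},
    "invalidation timing": {"invalidation trigger"},
    "failure mode on stale data": {"invalidation timing", "storage"},
}


def question_order(depends_on):
    """Return decisions so that every dependency comes before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for number, decision in enumerate(question_order(DEPENDS_ON), start=1):
        print(f"{number}. {decision}")
```

If two decisions end up depending on each other, static_order() raises graphlib.CycleError, which is a useful signal in itself: those two questions probably need to be answered together.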

At some point one of two things happens.

Either the feature becomes much more concrete, in which case the execution plan is almost boring because everything has already been decided, or the idea collapses because it was relying on assumptions that don’t hold, which is also a good outcome, just earlier than it would have been otherwise.

The unexpected part

What changed isn’t that the code got dramatically better in some abstract sense.

What changed is that I stopped producing things that were “almost correct,” which is a category of output that is incredibly expensive because it looks finished until you start touching edge cases.

Before, I would get something working on the happy path, then spend time chasing inconsistencies, patching behavior, slowly realizing that the original request didn’t account for half the scenarios it needed to.

Now those scenarios show up during the interrogation phase, which means they get resolved before anything is written.

Which in practice means fewer rewrites, fewer weird bugs that only appear under specific conditions, and a lot less of that quiet technical debt that accumulates when you move too fast without actually being precise.

Codex feels different when you do this

Without this step, OpenAI Codex behaves like a very fast executor, which is already useful, but also a bit dangerous because it will happily run with whatever you give it.

With grilling in the loop, it starts to feel more like a design partner that refuses to move forward until the shape of the problem actually makes sense.

And that changes the interaction.

You stop asking “can you build this” and start asking “does this even hold up,” which is a very different question and tends to lead to very different outcomes.

It feels slower, right up until it isn’t

There’s definitely a moment where this workflow feels inefficient, mostly because you’re doing more thinking upfront, answering more questions, sometimes going down branches that don’t lead anywhere.

But what disappears almost entirely is the second phase, the one where you debug something that was never fully specified in the first place.

That phase used to take longer than I realized.

It just didn’t feel like it because it was spread out and disguised as iteration.

Where this breaks

This isn’t something I apply to everything.

If the problem is trivial, this is overkill.

If the context is incomplete, the questions won’t go deep enough to matter.

And if you rush through the answers just to get to execution, then you’re effectively skipping the whole point and just adding ceremony.

The value comes from actually engaging with the questions, not just responding to them.

Resources

The workflow is mostly these two pieces held together.

“Using PLANS.md for multi-hour problem solving,” the OpenAI cookbook piece on execution plans, is the execution-plan side of the loop. It gives the work a concrete shape before Codex starts moving, especially when the task is large enough that “just implement this” would hide too many decisions.

grill-me is the pressure-testing side. It turns the feature request into a sequence of questions, walks the decision tree branch by branch, and forces vague parts of the design to become explicit before they become a plan.

What my loop looks like now

At this point my workflow is fairly consistent.

I start with the grill-me skill and the feature request in the same prompt, I answer the grilling questions until the shape of the work is clear, then I have Codex create the execution plan from those answers.

Implementation starts only after I’ve reviewed that plan.
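
As a rough model of that ordering, the loop can be sketched as a gate in plain Python. The class and method names (GrillingSession, answer, build_plan) are hypothetical and only mirror the steps described above; this is not how Codex or the grill-me skill is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GrillingSession:
    """Hypothetical model: no plan until every question has a real answer."""
    feature: str
    questions: List[str]
    answers: Dict[str, str] = field(default_factory=dict)

    def next_question(self) -> Optional[str]:
        # One question at a time, in order, no skipping ahead.
        for question in self.questions:
            if question not in self.answers:
                return question
        return None

    def answer(self, text: str) -> None:
        question = self.next_question()
        if question is None:
            raise RuntimeError("nothing left to answer")
        if not text.strip():
            raise ValueError("empty answers do not count")
        self.answers[question] = text.strip()

    def build_plan(self) -> List[str]:
        pending = self.next_question()
        if pending is not None:
            # The gate: planning is refused while any decision is still open.
            raise RuntimeError(f"still open: {pending}")
        # Each resolved decision becomes an explicit line in the plan.
        return [f"{q} -> {self.answers[q]}" for q in self.questions]
```

Calling build_plan() before the last question is answered raises an error, which is the whole point: the plan is derived from the answers, never written ahead of them.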

It’s stricter than what I used to do.

But the plans that come out of this survive contact with reality much more often, and that’s the only thing I really care about.