What I Look For When Evaluating Code
A practical guide to code evaluation. The edge cases I’ve learned to spot. The patterns that emerge.
My job is to look at code that an AI wrote and figure out if it’s any good.
That sounds straightforward. It’s not.
The code usually compiles. It often runs. It sometimes does what you asked. But “works” and “good” aren’t the same thing—and the gap between them is where the real work lives.
Here’s what I’ve learned to look for.
First: Does it actually solve the problem?
This is the first question, and often the hardest.
AI-generated code is plausible. It looks right. It uses the right patterns, the right syntax, the right variable names. But plausibility isn’t correctness.
I read the requirements carefully. Then I read the code. Then I ask: does this actually do what was asked? Not “does it look like it does”—does it?
Often, the answer is subtle. It handles the main case but misses an edge case. It returns the right type but the wrong value. It solves a related problem, but not the one that was asked.
The model doesn’t know what you meant. It only knows what you said. And sometimes those are different things.
The confidence problem
Here’s what makes AI evaluation different from traditional code review: the code is always confident.
A human developer might hedge. Might add a TODO. Might say “I’m not sure about this part.” Might leave a comment flagging uncertainty.
AI-generated code never hedges. It presents everything with equal confidence. The correct solution and the subtle bug look identical.
This means you can’t rely on the code to signal its own uncertainty. You have to bring the skepticism yourself.
Every time.
Where the bugs hide
After months of sustained evaluation work, I’ve noticed patterns. The failures aren’t random. They cluster.
Off-by-one errors. Classic in any code, but AI seems particularly prone. Loop boundaries, array indices, range calculations. Always check the edges.
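A minimal sketch of the kind of off-by-one bug this paragraph describes, using a hypothetical helper that returns the last n items of a list. The slice looks obviously correct, and it is—except at one boundary:

```python
def last_n_items(items, n):
    # Plausible version: items[-n:] reads naturally as "the last n items".
    # But when n == 0, items[-0:] is items[0:], which is the WHOLE list.
    return items[-n:]

def last_n_items_fixed(items, n):
    # Guard the zero boundary explicitly instead of trusting the slice.
    return items[len(items) - n:] if n > 0 else []
```

The buggy version passes every test except n = 0—exactly the edge a quick read skips over.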
Null and empty handling. What happens when the input is null? An empty string? An empty array? AI often handles the happy path beautifully and falls apart on edge cases.
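To make the null-and-empty point concrete, here's a hypothetical averaging function. The happy-path version is one line; the guards against None and the empty list are the part that tends to be missing:

```python
def average(values):
    # Unguarded, sum(values) / len(values) raises TypeError on None
    # and ZeroDivisionError on []. Both guards are easy to forget.
    if values is None or len(values) == 0:
        return 0.0  # or raise ValueError, depending on what the spec says
    return sum(values) / len(values)
```

Whether to return a default or raise on empty input is itself a spec question—another place where the model guesses and the evaluator has to check.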
Type mismatches. The code looks right, but there’s a subtle type coercion happening that changes behavior. Especially common in loosely typed languages.
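A small illustration of how a type mismatch changes behavior without changing how the code looks—here, numbers read from a file as strings, sorted lexicographically instead of numerically:

```python
readings = ["10", "9", "2"]           # numbers that arrived as strings, e.g. from a CSV

wrong = sorted(readings)              # lexicographic: "10" < "2" < "9"
right = sorted(readings, key=int)     # convert before comparing
```

Both lines run cleanly and return a sorted list. Only one of them is sorted in the sense the requirements meant.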
Resource management. Files opened but not closed. Connections not released. Memory allocated but not freed. AI often forgets the cleanup.
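The resource-management failure usually isn't a missing close call—it's a close call that never runs when something raises partway through. A sketch with a hypothetical config loader:

```python
def load_config_leaky(path):
    # If read() raises, the handle stays open until the garbage
    # collector happens to reclaim it—if it ever does.
    f = open(path)
    return f.read()

def load_config(path):
    # The context manager closes the file on success AND on error.
    with open(path) as f:
        return f.read()
```

The same pattern applies to connections, locks, and temporary files: look for acquisition without a `with`, `finally`, or equivalent.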
Boundary conditions. Maximum values, minimum values, zero, negative numbers. The interesting behavior is always at the boundaries.
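As one boundary-condition example, consider a hypothetical pagination helper. Ceiling division is the standard trick, but each boundary—zero items, an exact multiple, an invalid page size—deserves its own check:

```python
def page_count(total_items, per_page):
    # Ceiling division: how many pages to show total_items items.
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return (total_items + per_page - 1) // per_page
```

Zero items should give zero pages, ten items at ten per page should give one page, eleven should give two. Each of those is a one-line test, and each catches a different wrong formula.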
Concurrency issues. Race conditions, deadlocks, shared state problems. AI-generated code often assumes a single-threaded world.
Security blind spots. Unsanitized inputs, SQL injection vulnerabilities, hardcoded secrets. AI often generates functional code that’s insecure by default.
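SQL injection is the canonical case of functional-but-insecure: both versions below return the right rows for honest input, and only one survives a hostile one. A sketch using Python's built-in sqlite3:

```python
import sqlite3

def find_user_unsafe(conn, name):
    # Vulnerable: name is interpolated straight into the SQL.
    # name = "x' OR '1'='1" makes the WHERE clause match every row.
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user(conn, name):
    # Parameterized query: the driver passes the value separately,
    # so it can never be parsed as SQL.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()
```

When reviewing, any string formatting that touches a query, a shell command, or an HTML template is worth a second look.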
These aren’t surprising bugs. They’re the same bugs humans write. But AI writes them with perfect confidence, and that’s what makes them dangerous.
The “almost right” problem
The hardest bugs to catch aren’t the obvious failures. They’re the almost-right solutions.
Code that works for most inputs but fails on a specific edge case. Code that produces correct results but with terrible performance. Code that solves the problem but introduces a security vulnerability.
These require more than reading. They require thinking. What inputs would break this? What assumptions is this code making? What would a malicious user try?
The model doesn’t think adversarially. That’s your job.
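One lightweight way to operationalize that adversarial thinking is a fixed battery of hostile inputs you run against any string-handling function under review. A sketch (the input list and the `probe` helper are hypothetical, not a real library):

```python
# A small battery of inputs chosen to break common assumptions.
HOSTILE_INPUTS = [
    "",                        # empty
    " " * 10_000,              # enormous whitespace
    "null",                    # the string "null", not None
    "'; DROP TABLE users;--",  # injection-shaped
    "\u202e\u0000",            # control and direction-override characters
]

def probe(fn):
    """Run fn over the hostile inputs; collect (input, exception name) pairs."""
    failures = []
    for s in HOSTILE_INPUTS:
        try:
            fn(s)
        except Exception as exc:
            failures.append((s, type(exc).__name__))
    return failures
```

An empty failure list doesn't prove the function is safe—but a non-empty one hands you a concrete bug in seconds, before the careful reading even starts.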
The fatigue problem
Here’s the asymmetry that keeps me honest: AI never gets tired. I do.
In 2011, researchers Danziger, Levav, and Avnaim-Pesso published a study in the Proceedings of the National Academy of Sciences that revealed something uncomfortable about expert decision-making.¹
They analyzed 1,112 sequential parole decisions made by eight Israeli judges over ten months. These were experienced professionals—22 years on the bench, on average—making high-stakes decisions they had trained their entire careers to make.
The finding: at the start of each session, judges granted parole about 65% of the time. As the session continued, that rate dropped steadily, falling to nearly zero by the end. After a food break, it reset to 65%. Then declined again.
The pattern was consistent across all eight judges.
The researchers’ explanation: decision fatigue. Each ruling requires cognitive effort—weighing evidence, assessing risk, making a judgment. That effort depletes a limited mental resource. As it drains, the brain seeks shortcuts. The easiest shortcut is the status quo. For parole decisions, the status quo is denial. No further analysis required.
What made this study significant wasn’t just the finding—it was the context. These weren’t careless decisions. They affected whether a person walked free or stayed in prison. The judges were experienced. The stakes were real. And still, the time of day influenced the outcome more than it should have.
Daniel Kahneman cited this study in Thinking, Fast and Slow: “Tired and hungry judges tend to fall back on the easier default position of denying requests for parole.”
Here’s why this matters for AI code evaluation:
I make the same kind of sequential decisions. Code sample after code sample. Each one requires focus—reading carefully, tracing logic, hunting for edge cases, checking for security issues. Each evaluation depletes the same cognitive resource the judges were depleting.
But my status quo is different. For a parole judge, fatigue leads to denial—the safer, more conservative choice. For a code evaluator, fatigue leads to acceptance. The code looks fine. It runs. It probably works. Move on.
That’s the dangerous default. Acceptance when scrutiny was required.
AI doesn’t share this limitation. It produces the fiftieth code sample with the same computational resources as the first. No depletion. No shortcuts. No creeping bias toward “good enough.”
I’m the variable. My attention declines. My skepticism erodes. The bugs that require real effort to catch—the subtle ones, the almost-right ones—those are exactly what I miss when I’m depleted.
The research makes this personal. If experienced judges, making life-altering decisions they’ve spent decades training for, are susceptible to sequential fatigue—why would I be immune?
I’m not.
So I’ve built this into how I work. I take breaks before I need them. I watch for the moment when reading becomes scanning. I stop when I notice I’m approving code faster than I can actually evaluate it.
The model’s confidence never wavers. My job is to make sure my skepticism doesn’t either.
What I actually do
My process, if you can call it that:
First pass: understand intent. What is this code supposed to do? What’s the problem it’s solving? I can’t evaluate correctness without knowing what correct means.
Second pass: trace the logic. I walk through the code mentally. What happens step by step? Where does the data flow? What state changes along the way?
Third pass: hunt for edges. Now I go adversarial. What inputs would break this? What if the data is missing? What if it’s malformed? What if it’s enormous?
Fourth pass: look beyond correctness. Is this code maintainable? Is it readable? Does it follow conventions? Would another developer understand it six months from now?
Sometimes I find issues immediately. Sometimes I stare for an hour and find nothing. Sometimes I find something that looked fine on the first three passes but reveals itself on the fourth.
The staring is part of the job.
Patterns that emerge
Certain types of prompts produce certain types of failures. After enough evaluations, you start to see the patterns.
Complex conditionals. The more branching logic required, the more likely something goes wrong. Nested if-statements are where AI gets lost.
Stateful operations. Code that needs to track state across multiple steps is harder for AI to get right. The model sometimes loses the thread.
Domain-specific logic. Generic coding tasks are usually fine. Domain-specific requirements—financial calculations, scientific formulas, business rules—are where errors creep in.
Ambiguous specifications. When the prompt is unclear, the code is unpredictable. AI fills in the gaps with reasonable-sounding guesses that may not match your intent.
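The complex-conditional failure is worth seeing side by side. Below, a hypothetical shipping-cost rule (the prices are invented for illustration) written first as the kind of nested structure where branching logic drifts from the spec, then flattened with guard clauses so every branch is visible at a glance:

```python
def shipping_cost_nested(weight, express, member):
    # Deep nesting: each extra level is a place for a branch to go wrong.
    if weight > 0:
        if express:
            if member:
                return 5.0
            else:
                return 10.0
        else:
            if member:
                return 0.0
            else:
                return 4.0
    else:
        raise ValueError("weight must be positive")

def shipping_cost(weight, express, member):
    # Same rules, flattened: the invalid case exits first,
    # and each remaining branch fits on one line.
    if weight <= 0:
        raise ValueError("weight must be positive")
    if express:
        return 5.0 if member else 10.0
    return 0.0 if member else 4.0
```

When a review has to verify nested logic like the first version, enumerating the input combinations and checking both shapes agree is often faster than tracing the branches by eye.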
Knowing these patterns doesn’t mean I can predict every failure. But it tells me where to look harder.
The human part
Here’s what I’ve come to believe: evaluating AI code isn’t that different from evaluating human code. The fundamentals are the same.
Read carefully. Think critically. Check the edges. Don’t trust confidence.
The difference is volume and consistency. AI can produce a lot of code very fast, and it makes the same kinds of mistakes over and over. Human errors are more varied, more creative. AI errors are more... predictable.
But AI never tires. And that’s the asymmetry that matters most.
Once you know what to look for, you see it everywhere. The challenge is still seeing it on the thirtieth review of the day.
What this work teaches
Every code sample I evaluate teaches me something. About how AI reasons. About where it struggles. About the gap between fluent-sounding code and actually-correct code.
I document what I find. Partly to track the patterns. Partly because the discipline of writing it down forces me to understand it better.
The work is slow. Detailed. Unglamorous.
But every failure mode I understand is one I can watch for next time. One I can help others recognize. One small step toward knowing how these systems actually behave.
That’s the work. One code sample at a time. One break when I need it. One honest assessment of whether I’m still sharp enough to catch what needs catching.
Still looking. Still learning. Still asking: does this actually work?
And sometimes: am I still alert enough to know?
¹ Danziger, S., Levav, J., & Avnaim-Pesso, L. (2011). Extraneous factors in judicial decisions. Proceedings of the National Academy of Sciences, 108(17), 6889–6892. https://doi.org/10.1073/pnas.1018033108