In April 2026, Mitchell Hashimoto, co-founder of HashiCorp and creator of Terraform, posted a thread that has been circulating in engineering circles ever since. The thread describes an experiment he ran with an AI agent loop tasked with optimizing a renderer written in Go. The results looked extraordinary. They were not.
This story is a precise illustration of a problem that does not yet have a widely agreed-upon name, but is increasingly called agent psychosis: the condition of blindly trusting agent output without the domain expertise to evaluate whether the output is actually good.
The Experiment
Mitchell set up a REPL (Read-Eval-Print Loop) agent, a loop that runs until a goal condition is met, with the following setup:
- Goal: Minimize frame render time in a Go renderer
- Rules: Cannot modify input data structures, public API, or tests
- Freedom: Can do anything else it wants
Four hours later, after spending approximately $350 in API costs, the agent delivered results:
| Metric | Before | After |
|---|---|---|
| Frame time | 88ms | 1.5ms |
| Allocations per update | 150,000 | 500 |
These numbers look exceptional. Going from 88ms to 1.5ms is roughly a 58x improvement in frame time, from about 11 frames per second to about 667. Allocations dropped by 99.7%. By any standard dashboard metric, this is a success.
Mitchell's own handwritten implementation runs at 20 microseconds with zero allocations in the update path.
20 microseconds is 0.02 milliseconds. The agent's "optimized" renderer at 1.5ms is still 75 times slower than what an experienced engineer wrote without an AI loop, without $350, and presumably in less than four hours.
Why the Agent Failed
The agent did not fail because it is bad at optimization. It failed because optimizing well is not just a matter of driving a measured number down. It requires a mental model of the system, including what the hardware makes possible, and the agent does not have one.
The agent found local optima. It improved the code within its understanding of what the code was doing. It did not understand:
- The memory access patterns of the renderer
- The cache behavior of the target hardware
- The allocation cost model at the Go runtime level
- What the theoretical minimum for this class of problem looks like
These things are not in the source code. They exist in the head of an experienced systems programmer. The agent had no access to that knowledge. So it optimized against the metrics it could measure, declared victory, and stopped — because by every measurable signal, it had done well.
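An "allocations per update" figure like the one in the table is something Go can measure directly. The sketch below uses `testing.AllocsPerRun` from the standard library on two hypothetical update functions. Neither is from Hashimoto's renderer; they only contrast an update path that allocates on every call with one that reuses a caller-owned buffer.

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical stand-ins for a renderer's update path; neither function is
// from the original experiment.

// updateAllocating builds a fresh slice on every call.
func updateAllocating(n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = float64(i) * 0.5
	}
	return out
}

// updateInPlace reuses a caller-owned buffer, so repeated calls allocate nothing.
func updateInPlace(buf []float64) {
	for i := range buf {
		buf[i] = float64(i) * 0.5
	}
}

// sink keeps the allocating result reachable so the slice must live on the heap.
var sink []float64

func main() {
	const n = 1024
	buf := make([]float64, n)

	allocating := testing.AllocsPerRun(100, func() { sink = updateAllocating(n) })
	inPlace := testing.AllocsPerRun(100, func() { updateInPlace(buf) })

	fmt.Printf("allocating update: %.0f allocs/run\n", allocating)
	fmt.Printf("in-place update:   %.0f allocs/run\n", inPlace)
}
```

The measurement itself is the easy part. Knowing that zero is achievable for a given update path, and that 500 is therefore not a finish line, is the part that is not in the source code.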
An engineer with systems experience would have looked at 1.5ms and known immediately it was wrong. Not wrong as in "contains a bug." Wrong as in "this is still nearly two orders of magnitude away from what this hardware is capable of."
The Real Problem: You Cannot Detect What You Cannot Recognize
The deeper issue in Hashimoto's story is not about AI capability. It is about verification.
If you do not know that 20 microseconds is possible, then 1.5ms looks like a triumph. You will ship it. You will celebrate it. You will write the blog post about how your AI agent achieved a 58x speedup. You will not know that you left 75x more performance on the table.
This is the agent psychosis problem in its clearest form: the inability to distinguish between good enough to fool the evaluator and actually good.
Consider how this compounds at scale. If you have a product with 200 features, and each feature was partially built or optimized by an AI agent running without expert supervision, and each agent delivered results that looked correct and measured well — you now have a system where every component is operating at a fraction of its theoretical efficiency. The whole is worse than the sum of its parts, because the failure modes of each component interact in ways no single agent pass would surface.
This is arguably what is happening to the software industry right now. More software is being produced than ever before in history. The quality of that software — measured in performance characteristics, memory safety, architectural coherence, maintainability — is declining in aggregate, not because AI tools are bad, but because the people using them often lack the expertise to evaluate their output.
The Cloudflare Intern Bet
This leads directly to the structural risk embedded in the Cloudflare layoff strategy discussed in our previous post.
Cloudflare laid off 1,100 employees and hired approximately 1,100 interns around the same time. The implicit thesis is that an AI-augmented intern can produce the same output as a mid-to-senior engineer at a fraction of the cost.
The Hashimoto experiment suggests this is a bet with a specific failure mode. An intern with Claude or GPT-4o running in agent mode can absolutely produce working code. They can close tickets, fix bugs, build features. The agent will optimize their renderer from 88ms to 1.5ms and the intern will ship it, because 1.5ms is faster than 88ms and everything tests green.
The problem surfaces six months later when the system that is running 1.5ms renderers everywhere is slower than it should be, uses more memory than it should, and costs more to run than necessary. The cost is diffuse, invisible in any single commit, and only apparent to someone with the systems knowledge to recognize what the baseline should have been.
This is not a theoretical risk. It is the predictable outcome of optimizing a process that requires expert judgment by replacing the expert.
Where AI Agents Actually Perform Well
It is worth being precise about where the failure mode applies, because the Hashimoto case is at one extreme of the spectrum.
AI agents are genuinely excellent at:
- High-volume, low-stakes generation: Internal tooling, throwaway scripts, test data generators, boilerplate scaffolding. If the output does not need to be best-in-class and you are not running it on a performance-critical path, an agent delivering 80% of what a senior engineer would write is a significant productivity win.
- Well-specified, bounded tasks: Writing unit tests for a function with a clear spec. Converting a data format. Reformatting documentation. Tasks where "correct" is unambiguous and measurable.
- Exploration and prototyping: Generating five different approaches to a problem quickly so a human can evaluate them. The human still needs to evaluate, but the cost of generating candidates drops to near-zero.
- Augmenting experts, not replacing them: A senior engineer with an agent loop moves faster than a senior engineer without one. The expert still knows what good looks like and catches the 75x failure. The agent accelerates the work that does not require judgment. The judgment remains human.
Where agents struggle:
- Performance-critical systems engineering: Latency, throughput, memory footprint. The theoretical floor for any given problem is defined by physics and hardware architecture, not by what current code can measure.
- Security-critical decisions: Authentication flows, cryptography implementations, permission models. The agent can write code that passes tests and still introduce a subtle vulnerability that only appears under a specific sequence of inputs.
- Architectural coherence: An agent optimizing one component in isolation will not know that the optimization is incorrect for the system it sits in. It does not have a mental model of the whole.
What the Hashimoto Number Means for Hiring
The uncomfortable implication of this analysis for the current job market is straightforward: the cost savings from replacing experienced engineers with AI-augmented junior developers are real and immediate. The costs are deferred and diffuse.
A 1.5ms renderer passes every test. It ships. It costs less to build than the 20-microsecond version. For six months, everything is fine. The performance debt accumulates invisibly. The moment it surfaces — in an infrastructure bill, a latency regression, a scaling failure — the people who could have prevented it are no longer there.
This is not to argue that experienced engineers are irreplaceable across all domains. There are large categories of software where 1.5ms and 20 microseconds are equally irrelevant. The distinction matters where it matters, and the problem with the current market is that the companies betting on AI-augmented juniors often do not have the technical leadership to know which category their systems fall into.
Conclusion
The Hashimoto experiment is not a cautionary tale about AI. It is a cautionary tale about evaluation. The agent did exactly what it was asked to do. It optimized the renderer. It just did not know that 1.5ms was nowhere near optimal because it had no reference for what optimal looked like.
The skill that prevents agent psychosis is not the ability to use AI tools. It is the ability to look at the output of an AI tool and know, before running a benchmark or asking for a second opinion, whether the result is correct, good, excellent, or mediocre. That skill comes from years of building and debugging systems, at a level of understanding the agent cannot reach through source code alone.
In a job market where companies are betting on AI leverage to replace that experience, the question is not whether the bet will fail. The question is how long it takes to notice.