Thanks for attending my AI Engineer Melbourne 2026 talk, "Fail fast, fix faster". Below you'll find the slides, benchmark materials, model references and extra reading around the core idea: once a model is competent enough, loop speed and validation quality start to matter as much as raw intelligence.
Talk slides and notes
You can view the slides below, or visit the slide site for the large version.
Slides made with presso
The rest of this page collects the context, benchmark materials, model references and follow-on reading from the talk.
Resources
Start with the context links if you want the narrative, or jump to the benchmark materials if you want the code, results and model references.
Talk context
These links give the framing for the talk, including the conference info, the article that started the experiment, and the loop-oriented thinking behind the benchmark design.
- Web Directions AI Engineer listing for the talk and conference details.
- "Mercury 2 won't outthink frontier models but diffusion might out-iterate them", the original article that kicked off the talk.
- Everything is a Ralph loop by Geoffrey Huntley, which framed the persistent agent loop pattern used in the benchmark.
- AutoResearch by Andrej Karpathy, which explores a similar cheap-experiment, measurable-feedback loop in model training.
Benchmark materials
The benchmark uses a small quote approval API task with an OpenAPI contract and a fast validation harness. Each model gets up to 15 turns, receives targeted failure feedback, and is scored across repeated runs.
- Benchmark repository: ajfisher/demo-dumb-fast-agents
- Talk benchmark summary
- Final talk benchmark matrix
- Quote approval API task specification
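The loop described above (up to 15 turns, targeted failure feedback, repeated runs) can be sketched roughly as follows. This is a minimal illustration and not the code from the benchmark repository: `generate`, `validate`, `run_once` and `run_repeated` are hypothetical names, and the feedback format is an assumption. Only the 15-turn cap comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_TURNS = 15  # turn cap stated in the benchmark description


@dataclass
class RunResult:
    passed: bool
    turns: int  # turns used, whether or not the run passed


def run_once(
    generate: Callable[[str], str],        # hypothetical: prompt -> candidate code
    validate: Callable[[str], List[str]],  # hypothetical: candidate -> failure messages
    task: str,
    max_turns: int = MAX_TURNS,
) -> RunResult:
    """Drive one model through generate -> validate -> feedback until it passes."""
    prompt = task
    for turn in range(1, max_turns + 1):
        candidate = generate(prompt)
        failures = validate(candidate)
        if not failures:
            return RunResult(passed=True, turns=turn)
        # Targeted feedback: restate the task and list only the checks that failed.
        feedback = "\n".join(f"- {failure}" for failure in failures)
        prompt = f"{task}\n\nFailing checks:\n{feedback}"
    return RunResult(passed=False, turns=max_turns)


def run_repeated(generate, validate, task: str, runs: int = 10) -> List[RunResult]:
    """Repeat the whole loop so that one lucky or unlucky run does not decide the score."""
    return [run_once(generate, validate, task) for _ in range(runs)]
```

The point of the shape is that the validator, not the model, decides when a run is finished, so a faster `generate` call shortens every turn of the loop.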
Headline benchmark results
The headline result is not that Mercury 2 is the smartest model, but that a fast enough model can compound improvement loops very quickly when the validation harness is tight.
| Model | Runs passed | Median turns | Mean time to pass |
|---|---|---|---|
| Mercury 2 | 10/10 | 2 | 6s |
| GPT-5.4 mini | 10/10 | 1 | 41s |
| GPT-5.4 | 10/10 | 1 | 88s |
| GPT-4.1 mini | 10/10 | 2 | 54s |
| Gemini 2.5 Flash | 8/10 | 4 | 77s |
| Gemma 4 31B cloud | 10/10 | 5 | 133s |
| Qwen3 Coder 480B cloud | 7/10 | 5 | 611s |
| Orthrus Qwen3 8B MLX | 0/5 | n/a | n/a |
| Phi3 | 0/5 | n/a | n/a |
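To make the speed gap concrete, the short sketch below expresses each model's mean time to pass as a multiple of the fastest model's. The numbers are copied from the table; the `RESULTS` layout and the `relative_to_fastest` helper are illustrative names only. It assumes the mean time is comparable across models, which the table does not spell out for models that failed some runs.

```python
# (model, runs passed, total runs, mean seconds to pass) copied from the table.
# None marks models that never passed and so have no time to compare.
RESULTS = [
    ("Mercury 2", 10, 10, 6),
    ("GPT-5.4 mini", 10, 10, 41),
    ("GPT-5.4", 10, 10, 88),
    ("GPT-4.1 mini", 10, 10, 54),
    ("Gemini 2.5 Flash", 8, 10, 77),
    ("Gemma 4 31B cloud", 10, 10, 133),
    ("Qwen3 Coder 480B cloud", 7, 10, 611),
    ("Orthrus Qwen3 8B MLX", 0, 5, None),
    ("Phi3", 0, 5, None),
]


def relative_to_fastest(results):
    """Return each timed model's mean time as a multiple of the fastest mean time."""
    timed = [(model, secs) for model, _, _, secs in results if secs is not None]
    fastest = min(secs for _, secs in timed)
    return {model: round(secs / fastest, 1) for model, secs in timed}


if __name__ == "__main__":
    for model, ratio in relative_to_fastest(RESULTS).items():
        print(f"{model}: {ratio}x")
```

On these figures GPT-5.4 mini needs a median of one turn against Mercury 2's two, yet its mean time to pass is roughly 6.8 times longer.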
Model cards and provider docs
These are the model and provider references for the systems included in the benchmark runs.
- Mercury 2 from Inception Labs.
- GPT-5.4 and the GPT-5.4 Thinking system card.
- GPT-5.4 mini.
- GPT-4.1 mini.
- Gemini 2.5 Flash.
- Gemma 4 31B cloud on Ollama and the Gemma 4 model page.
- Qwen3 Coder 480B cloud on Ollama and the Qwen3 Coder launch post.
- Orthrus Qwen3 8B, plus the Orthrus code and paper.
- Phi3 on Ollama and the Phi-3 Mini 128K Instruct model card.
Mercury and diffusion LLM references
These cover the Mercury launch material and the underlying diffusion language model work that makes the fast iteration loop interesting.
- Introducing Mercury 2 from Inception Labs.
- Introducing Mercury, the earlier Mercury Coder launch post.
- Mercury: Ultra-Fast Language Models Based on Diffusion.
- Mercury 2 and the Rise of Real-time Subagents.
Extra references worth reading
These are useful additions if you want to go deeper on agent evaluation, repeatable workflows and feedback-driven optimisation.
- Inspect, the UK AI Security Institute evaluation framework, for thinking about repeatable LLM and agent evals.
- SWE-bench and SWE-bench Verified, useful references for repository-level coding agent evaluation.
- OpenAI Evals, useful as a reference point for building repeatable, code-backed eval suites.
- Claude Code best practices, especially the parts on context, verification and repeatable workflows.
- DSPy, which is relevant when you want to optimise an LLM workflow against a metric rather than hand-tune prompts forever.