Fail fast, fix faster: AI Engineer Melbourne 2026 resources

Slides, benchmark code, model cards and references for the talk "Fail fast, fix faster" at AI Engineer Melbourne 2026.

Published: Wednesday, June 3rd 2026

Thanks for attending my AI Engineer Melbourne 2026 talk, "Fail fast, fix faster". Below you'll find the slides, benchmark materials, model references and extra reading around the core idea: once a model is competent enough, loop speed and validation quality start to matter as much as raw intelligence.

Talk slides and notes

You can view the slides below, or visit the slide site for the large version.

Slides made with presso

The rest of this page collects the context, benchmark materials, model references and follow-on reading from the talk.

Resources

Start with the context links if you want the narrative, or jump to the benchmark materials if you want the code, results and model references.

Talk context

These links give the framing for the talk, including the conference info, the article that started the experiment, and the loop-oriented thinking behind the benchmark design.

Benchmark materials

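The benchmark uses a small quote-approval API task with an OpenAPI contract and a fast validation harness. Each model gets up to 15 turns per run and receives targeted failure feedback after every failed attempt. Scores are aggregated across repeated runs: 10 for most models, and 5 for Orthrus Qwen3 8B MLX and Phi3, as the results table shows.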
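The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `generate`, `validate`, `RunResult` and the fake model and validator are all hypothetical names, and a real harness would call a model API and check responses against the OpenAPI contract.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_TURNS = 15  # turn budget taken from the benchmark description


@dataclass
class RunResult:
    passed: bool
    turns: int
    seconds: float


def run_once(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], List[str]],
    task: str,
    max_turns: int = MAX_TURNS,
) -> RunResult:
    """Attempt, validate, feed failures back; stop on a pass or when turns run out."""
    feedback: Optional[str] = None
    start = time.perf_counter()
    for turn in range(1, max_turns + 1):
        candidate = generate(task, feedback)
        failures = validate(candidate)
        if not failures:
            return RunResult(True, turn, time.perf_counter() - start)
        # Targeted feedback: only the concrete failures go back to the model.
        feedback = "\n".join(failures)
    return RunResult(False, max_turns, time.perf_counter() - start)


def run_repeated(generate, validate, task: str, runs: int = 10) -> List[RunResult]:
    """Repeat the whole loop so a single lucky or unlucky run does not decide the score."""
    return [run_once(generate, validate, task) for _ in range(runs)]


# Hypothetical stand-ins so the sketch runs without a model or a real harness.
def fake_generate(task: str, feedback: Optional[str]) -> str:
    return "fixed" if feedback else "broken"


def fake_validate(candidate: str) -> List[str]:
    return [] if candidate == "fixed" else ["status code mismatch: expected 200, got 500"]


if __name__ == "__main__":
    print(run_once(fake_generate, fake_validate, "implement the quote approval API"))
```

With the fake model, the first attempt fails, the failure message is fed back, and the second attempt passes. The faster `generate` and `validate` are, the cheaper each extra turn becomes.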

Headline benchmark results

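The headline result is not that Mercury 2 is the smartest model: it needed a median of 2 turns where GPT-5.4 and GPT-5.4 mini needed 1. The point is that a fast enough model can compound improvement loops very quickly when the validation harness is tight. Mercury 2 passed all 10 runs with a mean time to pass of 6s, against 41s for the next fastest model.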

Model                     Runs passed   Median turns   Mean time to pass
Mercury 2                 10/10         2              6s
GPT-5.4 mini              10/10         1              41s
GPT-5.4                   10/10         1              88s
GPT-4.1 mini              10/10         2              54s
Gemini 2.5 Flash          8/10          4              77s
Gemma 4 31B cloud         10/10         5              133s
Qwen3 Coder 480B cloud    7/10          5              611s
Orthrus Qwen3 8B MLX      0/5           n/a            n/a
Phi3                      0/5           n/a            n/a
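
For anyone reproducing the table, the pass count, median turns and mean time to pass can be computed from per-run records along these lines. This is a sketch under assumptions: the `Run` record shape is hypothetical, the sample numbers are invented for illustration and are not benchmark data, and it assumes the median and mean are taken over passing runs only, which this page does not state.

```python
from statistics import mean, median
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    passed: bool
    turns: int
    seconds: float


class Summary(NamedTuple):
    runs_passed: str
    median_turns: Optional[float]
    mean_seconds: Optional[float]


def summarise(runs: List[Run]) -> Summary:
    """Aggregate repeated runs into one table row."""
    passing = [r for r in runs if r.passed]
    runs_passed = f"{len(passing)}/{len(runs)}"
    if not passing:
        # Nothing passed, so turns and time are reported as n/a (None here).
        return Summary(runs_passed, None, None)
    return Summary(
        runs_passed,
        median([r.turns for r in passing]),
        mean([r.seconds for r in passing]),
    )


if __name__ == "__main__":
    # Invented sample data, not taken from the benchmark.
    sample = [Run(True, 2, 5.0), Run(True, 3, 7.0), Run(False, 15, 60.0)]
    print(summarise(sample))
```

Excluding failed runs from the time column is a design choice: it keeps "mean time to pass" honest about what it measures, while the pass count carries the reliability signal.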

Model cards and provider docs

These are the model and provider references for the systems included in the benchmark runs.

Mercury and diffusion LLM references

These cover the Mercury launch material and the underlying diffusion language model work that makes the fast iteration loop interesting.

Extra references worth reading

These are useful additions if you want to go deeper on agent evaluation, repeatable workflows and feedback-driven optimisation.

  • Inspect, the UK AI Security Institute evaluation framework, for thinking about repeatable LLM and agent evals.
  • SWE-bench and SWE-bench Verified, useful references for repository-level coding agent evaluation.
  • OpenAI Evals, useful as a reference point for building repeatable, code-backed eval suites.
  • Claude Code best practices, especially the parts on context, verification and repeatable workflows.
  • DSPy, which is relevant when you want to optimise an LLM workflow against a metric rather than hand-tune prompts forever.
