Discussion about this post

Pawel Jozefiak

The Nature Medicine diagnostic accuracy finding buried here is more important than the Codex benchmark numbers. A two-thirds miss rate on informal symptom descriptions isn't an AI problem - it's a deployment context problem. The gap between "works in a structured test" and "works in real-world use" is the entire reliability challenge for agentic systems. The same pattern shows up in coding agents: benchmark performance doesn't translate linearly to production reliability.

I've run Claude Code for months on autonomous builds and the failure modes are almost never capability gaps - they're context and specification gaps. Benchmarks measure the former, not the latter.
