How Smart Are Today’s Coding Agents?
AI Inside for Wednesday, February 11, 2026
This episode is sponsored by Airia. Get started today at airia.com.
Lots of interesting stuff this week on the AI Inside podcast, including Anthropic’s Claude 4.6 rattling Wall Street, OpenAI’s GPT‑5.3 Codex coding like a full dev team, Super Bowl’s AI ad blitz, and Waymo’s wild Genie 3 driving sims.
But first, big thanks to those who support us directly via Patreon at patreon.com/aiinsideshow like Charles Gillogly, Lindethier, and Steve Isaacson!!
Anthropic’s Claude 4.6
Alright, buckle up, this top block is a doozy. First, there’s Anthropic’s new Claude 4.6, positioned as an analyst-grade financial research engine (Bloomberg). It can dig through company data, regulatory filings, and market info and produce detailed financial research. It creates spreadsheets and presentations, handles software development, yadda yadda. It also raises the context window to 1 million tokens and introduces “agent teams” for splitting agentic workflows among different agents (TechCrunch). We saw a return of the Wall Street concern over software, with some financial services software stocks dropping 10 percent on the idea that this model was coming for their work (Yahoo Finance).
OpenAI’s GPT‑5.3 Codex
Second, there’s OpenAI’s GPT‑5.3 Codex. It sits on a very similar trend line, but on the software side. This new agentic model merges GPT‑5.2 Codex’s coding strength with GPT‑5.2’s reasoning and knowledge (OpenAI). It can write and debug code and manage full software lifecycles. And yes, spreadsheet creation, slide decks, reports, etc. OpenAI says 5.3 Codex was instrumental in creating itself: not building itself from scratch, but serving as an assistant throughout the development process.
Matt Shumer says AI is coming for nearly every job
Making the rounds is a long post by Matt Shumer, CEO of an AI company called HyperWrite. In it, he shares the by-now-familiar story of “AI is coming for your job too.” He admits that yes, he is part of the AI train, as he has a company invested in it. He argues that the release of Opus 4.6 and GPT‑5.3 Codex hints at what is coming down the pipeline for nearly every job within a few years, considering how rapidly models have evolved (Something Big Is Happening). He describes asking an agent to create an app: it writes tens of thousands of lines of code, opens and tests the app, iterates like a developer, and comes back when it is ready. Evolution is inevitable, so why is this essay picking up so much steam?
Super Bowl’s AI-heavy ad blitz
So the Super Bowl was last Sunday. I definitely watched, as did many of you. Lots of AI ads, as expected. iSpot reported that 23 percent of Super Bowl ads, 15 out of 66, featured AI in some way (Adweek). Yes, there was OpenAI vs. Anthropic, a pitch to consumers in both cases (CNBC). Google showed how AI in its Photos product can help new home buyers picture their stuff in a new house. AI.com really wants you to know that AGI is coming, but first, you should grab your AI.com handle (Tom’s Hardware). I definitely noticed an uptick in ads that included genAI video, like Dunkin’ de-aging 90s actors (Creative Bloq) and Svedka’s weird AI spot (Yahoo). Can these Super Bowl ads make Americans love AI, or do they just piss people off (Washington Post)?
OpenAI turns on ads in ChatGPT
OpenAI is activating ads inside of ChatGPT Free and Go accounts (OpenAI). That includes sponsored placements that sit alongside the answers, though those ads do not change the answers themselves. Ads target logged-in US adults only for now and avoid sensitive topics like health, mental health, and politics (Yahoo Finance). They rely on conversational themes, past chats, and prior ad interactions, though some of that is controllable in settings. Free users can hide ads in exchange for fewer daily messages. OpenAI says advertisers see none of the message content in their metrics, only views and clicks. All of this lands alongside criticism like “Why I Quit My Job at OpenAI” in the New York Times (NYT).
Waymo and DeepMind’s Genie 3 world model
Waymo is using DeepMind’s Genie 3 world model to spin up realistic driving scenarios for its vehicles (Bloomberg). Certain scenarios are difficult, if not impossible, to encounter or replicate in real-world training, so making sure the vehicles know what to do in the small chance they happen is really important. Simulations let Waymo train on those edge cases, and the model can generate depth data the vehicles learn from. It is a great example of the tangible benefits of world model systems like Genie 3, and of how synthetic data can become incredibly useful and potentially life-saving.
Generative AI speeds up work and expands it
An HBR study finds that generative AI does not actually lighten workloads. It speeds up work and expands it (Harvard Business Review). People move faster, which means they take on more kinds of tasks and stretch work into evenings and breaks because “doing work” feels possible and rewarding. They run multiple threads in parallel and squeeze in a quick prompt here and there. In time that erodes mental and physical recovery, raises the expectations bar, and builds a self-reinforcing cycle of more output. The report suggests companies should build recovery time and reflection time into their processes. I feel this one hard. I can’t even begin to tell you how many tabs with chatbots I have open at any given time, and the urge to open one more window during times of rest happens to me ALL. THE. TIME.
AI and medical advice gone wrong
Reuters is reporting on a Nature Medicine study showing how badly AI medical advice can go wrong when real people use it at home (Reuters). The study looked at 1,298 people in the UK who were randomly assigned GPT‑4o, Llama 3, or Cohere’s Command R and asked to “make decisions about a medical scenario as though they had encountered it at home.” The models missed two-thirds of relevant conditions when real people described their symptoms in their own unstructured words. When fed structured, clinician-grade data, they identified 94.9 percent of the cases. Very different outcomes depending on how people use the tools.
Meta’s Vibes app and waning Sora hype
Meta is spinning Vibes, its AI-generated video feed, out of the Meta AI app and into its own standalone app (Engadget). Meta is competing with Sora in this department, but maybe that is not a smart idea. OpenAI’s Sora app started strong when it launched in late 2025, but installs fell 32 percent in December and 45 percent in January (TechCrunch). So maybe people do not actually care that much about genAI content yet.
Alphabet’s 100-year AI bonds
Alphabet is tapping ultra-long, ultra-cheap debt to lock in funding for its AI and data center ambitions. It is issuing rare 100-year bonds at rates lower than a typical 30-year mortgage, raising nearly 32 billion dollars in two days (Semafor). That is a long investment horizon, perfect for ultra-conservative investors who believe Google will still be around 100 years from now. Google already sits on more than 100 billion dollars in cash, which it can use for buybacks, deals, and cushioning shocks to the business as they arise.
OpenAI’s delayed hardware device
OpenAI has dropped the “io” name for its upcoming hardware device, developed in collaboration with Jony Ive’s design firm (Wired). This was revealed in a court filing related to a trademark lawsuit brought by audio startup iyO. OpenAI now expects the hardware to hit the market around February 2027, after originally targeting late 2026. So the timeline is slipping a bit.
AI.com, the 70 million dollar domain
We spoke about AI.com earlier; let’s go again. Apparently, Crypto.com founder Kris Marszalek bought the AI.com domain for around 70 million dollars in crypto, the largest disclosed domain sale ever (FT). The second-largest was Voice.com at 30 million dollars in 2019, and the third, as far as I could find, was Chat.com at 15.5 million dollars in 2023. That is a lot of money for a short URL and a signal of just how hot the AI branding race has become. AI.com’s Super Bowl push did not go smoothly either, with its 85 million dollar campaign crashing servers (Tom’s Hardware).
Programming note
There will be NO episode next week. Honestly, y’all: I need a break. I’ve been grinding for months on end and, with a full-week trip to Tahoe for some family snow time, I’m taking the opportunity to truly disconnect. Apologies for the episode gap, but hey, gotta take care of my mental health and be with the fam! Thanks for understanding.
HUGE thank you to Executive Producers
HUGE thank you to Executive Producers on the Patreon: DrDew, Jeffrey Marraccini, Radio Asheville 103.7, Dante St James, Bono De Rick, Jason Neiffer, Jason Brady, Anthony Downs, Mark Starcher, Karsten Samaschke!!
Thank you for watching and reading.


The Nature Medicine diagnostic accuracy finding buried here is more important than the Codex benchmark numbers. A two-thirds miss rate on informal symptom descriptions isn’t an AI capability problem; it’s a deployment-context problem. The gap between “works in a structured test” and “works in real-world use” is the entire reliability challenge for agentic systems. The same pattern shows up in coding agents: benchmark performance doesn’t translate linearly to production reliability.
I’ve run Claude Code for months on autonomous builds, and the failure modes are almost never capability gaps; they’re context and specification gaps. Benchmarks measure the former, not the latter.