Google Reclaims the AI Race: Deep Think, Aletheia, Codex-Spark, and the Infrastructure Power Shift in 2026
I had that weird, slightly uncomfortable realization this week that the AI race isn’t slowing down — it’s accelerating in directions that feel less like product launches and more like scientific inflection points.
And I don’t mean flashy demos. I mean math proofs. Physics Olympiads. Benchmark charts that don’t look incremental anymore — they look like cliffs.
For most of early 2026, the headlines felt predictable. OpenAI. Anthropic. Funding rounds the size of small countries. New model tiers. Bigger context windows. Faster inference. Same rhythm, louder volume.
And then Google walked back in.
Not with marketing fluff. With numbers.
And the numbers are not subtle.
The Week Google Stopped Playing Quiet
Google has been oddly restrained this year compared to the constant noise from OpenAI and Anthropic, which made this latest drop feel less like an update and more like a reminder.
Gemini 3 Deep Think just posted an 84.6% score on ARC-AGI-2.
Let that sit for a second.
That’s not a tiny improvement over competitors — it cleared Opus 4.6 at 68.8% and GPT-5.2 at 52.9%. That’s a gap you don’t smooth over with spin. That’s a gap that reshuffles the pecking order.
And then there’s Humanity’s Last Exam — Deep Think hit 48.4%, a new high.
Gold-medal-level performance on the 2025 Physics and Chemistry Olympiads.
A 3,455 Elo on Codeforces, nearly a thousand points above Opus 4.6.
At some point the tone changes from “competitive” to “dominant.”
I didn’t expect Google to reclaim the narrative this bluntly. There was a quiet assumption forming — maybe unfair — that it was trailing in the reasoning race.
That assumption just died.
This Isn’t Just About Benchmarks
Benchmarks can be gamed. We all know that. Companies optimize for leaderboards all the time.
But here’s the part that made me pause longer than the percentages.
Google revealed a math research agent called Aletheia. It doesn’t just solve known problems. It tackles open ones. It verifies proofs. It pushes domain benchmarks higher.
That moves the conversation from “assistant” to “research collaborator.”
It’s a subtle shift, but it matters.
Because when an AI can independently explore open mathematical space and validate its own reasoning chains, we’re not talking about autocomplete anymore.
We’re talking about something closer to structured discovery.
That’s a different category of power.
And it’s live — at least for Google AI Ultra users in the Gemini app, with early API access for researchers.
Not hypothetical. Not a lab-only teaser.
Live.
Meanwhile, OpenAI Decided Speed Is the Real Battlefield
While Google flexed on reasoning depth, OpenAI attacked a different pain point entirely.
OpenAI launched GPT-5.3-Codex-Spark — a speed-optimized coding model running on Cerebras hardware.
And here’s the headline stat: 1,000+ tokens per second.
That’s not incremental acceleration. That’s “blink and your diff is done” territory.
Spark isn’t as strong as full Codex on benchmarks like SWE-Bench Pro. It trades some intelligence for speed.
But that trade-off is strategic.
Because real-time coding workflows aren’t about autonomous long-horizon planning. They’re about rapid edits. Feedback loops. Flow state.
And until now, Codex’s speed bottleneck was a real friction point.
This feels like OpenAI solving for usability rather than leaderboard dominance.
There’s another layer here too.
This model runs on Cerebras hardware — marking OpenAI’s first product outside its Nvidia stack.
After the $10B+ Cerebras deal and partnerships with AMD and Broadcom, this doesn’t feel experimental. It feels deliberate.
The Nvidia dependency question has been hanging over the industry for years.
Now it’s being actively dismantled.
The Quiet Infrastructure War
If you zoom out, something bigger is happening.
Google is pushing deeper reasoning.
OpenAI is diversifying hardware and optimizing speed.
And in the background, cost curves are getting punched in the face by Chinese labs.
Enter MiniMax.
MiniMax just released M2.5 — an open-source model rivaling Opus 4.6 and GPT-5.2 on agentic coding.
At a fraction of the cost.
M2.5-Lightning: $2.40 per million output tokens.
M2.5 Standard: $1.20 per million.
Those numbers matter more than people want to admit.
Because if frontier-level coding intelligence becomes economically cheap, the conversation shifts from “Can we deploy agents?” to “Why aren’t we deploying agents everywhere?”
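Back-of-envelope arithmetic makes that shift concrete. Here is a minimal sketch in Python using the M2.5 output-token prices quoted above; the per-task token count is an illustrative assumption, not a MiniMax figure:

```python
# Rough cost-per-task estimate at MiniMax's quoted output-token prices.
# The 50k-token task size is an illustrative assumption, not a vendor figure.

PRICE_PER_MTOK = {
    "M2.5-Lightning": 2.40,  # USD per 1M output tokens (quoted above)
    "M2.5 Standard": 1.20,
}

def task_cost(model: str, output_tokens: int) -> float:
    """Dollar cost of one agentic task emitting `output_tokens` output tokens."""
    return PRICE_PER_MTOK[model] * output_tokens / 1_000_000

# Suppose a single agentic coding task emits ~50k output tokens (assumption):
print(f"{task_cost('M2.5 Standard', 50_000):.3f}")   # six cents per task
print(f"{task_cost('M2.5-Lightning', 50_000):.3f}")  # twelve cents per task
```

At pennies per task, the deployment question really does flip from permission to ubiquity.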
MiniMax says M2.5 already powers 30% of its internal daily tasks and handles 80% of new code commits.
That’s not marketing fluff — that’s operational replacement.
Open-source weights and licensing details are still pending, but the direction is obvious.
The cost floor is dropping.
Again.
The Part That Feels Slightly Uncomfortable
Here’s where my optimism starts to blur into something else.
When reasoning models start hitting Olympiad gold levels…
When math agents solve open problems autonomously…
When coding models stream at 1,000 tokens per second…
When open-source labs undercut pricing by huge margins…
The barrier between “AI as tool” and “AI as infrastructure layer” dissolves.
And infrastructure layers don’t just assist industries — they reshape them.
There’s a quiet centralization risk here.
Google controls distribution through Gemini.
OpenAI controls distribution through ChatGPT and enterprise APIs.
MiniMax is pushing open weights, but cost efficiency at scale still requires serious compute backing.
We’re not decentralizing intelligence. We’re consolidating it — just across a slightly wider set of players.
And I don’t think we’re emotionally processing that fast enough.
Even the “Small” Stories Aren’t Small
Look at the rest of the week.
ByteDance launched Seedance 2.0 — another leap in video generation.
Mustafa Suleyman suggested most white-collar work could be automated within 12–18 months.
OpenAI is retiring GPT-4o, GPT-4.1, and o4-mini.
Anthropic closed a $30B funding round at a $380B valuation.
An OpenAI researcher resigned over concerns about ads and manipulation.
Individually, these are headlines.
Collectively, they form a pattern.
The models are getting stronger.
The capital pools are getting deeper.
The internal tensions are getting louder.
And the timeline is compressing.
The Democratization Illusion
There’s something ironic happening at the same time.
We’re getting tutorials on how to generate full TV commercials with AI — 20-second polished ads, built from Gemini prompts, Higgsfield frames, Kling 3.0 clips, stitched in a free editor.
Anyone can do it.
That feels empowering.
But the tools powering that workflow are built on enormous centralized models trained on enormous centralized datasets, running on enormous centralized compute clusters.
It’s creative democratization on top of industrial concentration.
That tension is going to define the next few years.
So Where Does This Leave Us?
Google just proved it’s still a frontier leader in reasoning.
OpenAI just proved speed and hardware independence are strategic priorities.
MiniMax just proved cost disruption isn’t slowing down.
And the rest of the ecosystem is layering applications on top of increasingly powerful foundations.
The AI race isn’t just about who has the smartest model anymore.
It’s about who controls the reasoning layer, the infrastructure layer, and the cost layer simultaneously.
That’s a different game.
And it feels less like a product cycle and more like the early days of cloud computing — except faster, and with higher stakes.
Chapter-by-Chapter Outline
Chapter 1 – Google Reclaims the Narrative
Deep Think’s benchmark dominance and what it signals about frontier reasoning.
Chapter 2 – From Assistant to Research Partner
Aletheia and the shift from tool to autonomous scientific agent.
Chapter 3 – Speed as Strategy
Why OpenAI’s Codex-Spark matters more than its raw benchmark power.
Chapter 4 – The Cost Collapse
MiniMax M2.5 and the economics of agentic coding.
Chapter 5 – The Centralization Paradox
The uncomfortable consolidation beneath “AI democratization.”
Chapter 6 – The Background Noise That Isn’t Noise
White-collar automation claims and the retirement of older models.
Conclusion – Intelligence as a Utility
A quiet look at where consumer AI may be heading next.
Chapter 1 – Google Reclaims the Narrative
For a while, it felt like Google was watching the AI race from the sidelines.
Not absent — just less noisy.
OpenAI dominated headlines with hardware deals and product refreshes. Anthropic kept tightening its positioning around safety and enterprise alignment. Funding rounds ballooned.
Google, meanwhile, seemed oddly calm.
Then Gemini 3 Deep Think dropped its new numbers.
And the calm evaporated.
An 84.6% score on ARC-AGI-2 isn’t incremental progress. It’s a leap that forces comparison. Opus 4.6 at 68.8%. GPT-5.2 at 52.9%. Those aren’t rounding errors.
That’s separation.
When Deep Think posts 48.4% on Humanity’s Last Exam and achieves gold-medal-level performance on the 2025 Physics and Chemistry Olympiads, it’s not just claiming intelligence — it’s demonstrating structured reasoning depth.
And the 3,455 Elo on Codeforces — nearly 1,000 points above Opus 4.6 — makes this even harder to downplay.
Google didn’t just catch up.
It reasserted itself.
The timing matters too.
After a quieter start to 2026, this feels less like iteration and more like strategic patience paying off.
The AI race has always been cyclical — one lab surges, another recalibrates, then someone lands a breakthrough.
But this surge feels heavier.
Because it isn’t just about chat fluency or creative writing polish.
It’s about math.
It’s about science.
It’s about structured reasoning at levels that historically required years of human specialization.
And once reasoning quality jumps like this, the downstream effects ripple into everything else — agents, automation, research tooling, education.
Google didn’t just post good numbers.
It changed the tone of the race again.
And if this is what a “quiet period” produces, the next phase might not feel quiet at all.
Chapter 2 – From Assistant to Research Partner
The benchmark numbers were loud.
But Aletheia is what keeps echoing in my head.
Because benchmarks are controlled environments — curated questions, measurable outputs, a scoreboard everyone agrees on. That’s comfortable. That’s competitive sport.
Open mathematical problems are not comfortable.
When Google revealed Aletheia — a math agent that can autonomously tackle open problems, verify proofs, and push domain benchmarks higher — the framing subtly changed. It stopped sounding like “look how smart our chatbot is” and started sounding like “this thing can explore the unknown.”
And that’s not marketing language. That’s a structural shift.
We’ve been used to AI as augmentation. Draft this. Suggest that. Refactor this function. Summarize that research paper. Helpful. Efficient. Still subordinate.
But an agent that independently navigates proof structures and validates its own reasoning chains? That’s creeping toward research collaboration.
I hesitate even writing that because it sounds dramatic. But the implication is baked into the capability.
If a system can explore open mathematical space — not just replicate known solutions — then we’re edging into discovery territory. Even if the discoveries are incremental. Even if humans still guide the process.
And it’s live for Google AI Ultra users in the Gemini app, with early API access for researchers.
That detail matters.
This isn’t a sealed lab experiment. It’s partially productized. That changes the tempo.
Because once researchers start integrating this into workflows, even cautiously, the feedback loop accelerates. The model improves. The dependency deepens. The line between tool and collaborator blurs.
I’m not saying it replaces mathematicians.
I am saying it changes what being a mathematician might look like.
And that’s a subtler disruption than mass automation headlines — but arguably more profound.
Chapter 3 – Speed as Strategy, Not Vanity
While Google was flexing depth, OpenAI pivoted to something that feels almost mundane in comparison.
Speed.
But not trivial speed. Structural speed.
GPT-5.3-Codex-Spark pushing over 1,000 tokens per second on Cerebras hardware isn’t about bragging rights — it’s about workflow friction disappearing.
Because here’s the truth: most developers don’t need Olympiad-level reasoning every second. They need fast, reliable edits. They need iteration without latency breaking their concentration.
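That claim is easy to sanity-check with arithmetic. A minimal sketch; the 100 tokens/sec baseline and the edit size are my assumptions for illustration, not published Codex figures:

```python
# How long does a code edit take to stream at different decode rates?
# The baseline rate and edit size are illustrative assumptions; only the
# 1,000 tokens/sec Spark figure comes from the article above.

def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to stream a completion at a given decode rate."""
    return output_tokens / tokens_per_second

EDIT_TOKENS = 300      # a modest diff (assumption)
SPARK_RATE = 1_000     # tokens/sec, the figure cited for Codex-Spark
BASELINE_RATE = 100    # tokens/sec, assumed rate for a slower frontier model

print(stream_seconds(EDIT_TOKENS, SPARK_RATE))     # sub-second: feels instant
print(stream_seconds(EDIT_TOKENS, BASELINE_RATE))  # several seconds: breaks flow
```

The difference between a sub-second response and a multi-second one is exactly the difference between staying in flow and waiting on a spinner.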
Spark trades some raw intelligence for responsiveness. It’s slightly weaker than full Codex on benchmarks like SWE-Bench Pro. And that trade-off feels intentional, almost surgical.
Full Codex handles long autonomous tasks. Spark handles rapid interaction.
That division of labor tells you something about how OpenAI sees the near-term future: hybrid workflows. Humans steering. Models accelerating.
And then there’s the hardware shift.
Running on Cerebras chips marks OpenAI’s first product outside its Nvidia stack. After the $10B+ Cerebras deal and partnerships with AMD and Broadcom, this doesn’t look like experimentation. It looks like infrastructure hedging.
For years, Nvidia has been the gravitational center of AI compute.
Now we’re seeing deliberate diversification.
And that’s not just technical — it’s geopolitical, economic, strategic.
If intelligence becomes infrastructure, the chips underneath it become leverage points.
OpenAI knows that.
Google knows that.
Everyone at scale knows that.
This isn’t just model competition anymore. It’s supply chain chess.
Chapter 4 – The Cost Collapse Nobody Can Ignore
Then there’s MiniMax.
MiniMax dropping M2.5 into the ecosystem feels like another one of those quiet detonations.
On performance, it rivals Opus 4.6 and GPT-5.2 for agentic coding.
On price, it undercuts dramatically.
M2.5-Lightning at $2.40 per million output tokens.
M2.5 Standard at $1.20.
That’s not symbolic. That’s operational.
MiniMax says M2.5 powers 30% of its daily internal tasks and handles 80% of new code commits.
That’s internal reliance at scale.
And if open-source weights and licensing details materialize as expected, the downstream impact could be larger than any single benchmark win.
Because cost determines deployment.
If agentic coding becomes cheap enough to run 24/7 without executive anxiety about burn rate, experimentation explodes.
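The "24/7 without burn-rate anxiety" framing is checkable with a rough sketch at the quoted M2.5 Standard price; the sustained decode rate for a single agent is an illustrative assumption on my part:

```python
# Annualized cost of one agent decoding continuously at the quoted price.
# The sustained 50 tokens/sec output rate is an illustrative assumption.

PRICE_PER_MTOK = 1.20   # USD per 1M output tokens, M2.5 Standard (quoted above)
TOKENS_PER_SEC = 50     # assumed sustained output rate for one always-on agent

SECONDS_PER_YEAR = 365 * 24 * 3600
tokens_per_year = TOKENS_PER_SEC * SECONDS_PER_YEAR
annual_cost = PRICE_PER_MTOK * tokens_per_year / 1_000_000

print(round(annual_cost, 2))  # roughly $1,892 per agent-year
```

Under these assumptions, an agent generating code around the clock costs less per year than a single conference ticket, which is precisely why the experimentation floodgates open.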
And when experimentation explodes, unexpected automation patterns emerge.
The cost curve resetting every few months is destabilizing in its own way.
It erodes moats.
But it also forces escalation.
Closed labs must justify premium pricing with differentiated capability.
Open labs must prove sustainability without massive enterprise lock-in.
Meanwhile, enterprises quietly test everything.
Chapter 5 – The Centralization Paradox
This is the part where my excitement turns slightly uneasy.
We talk constantly about AI democratization.
Anyone can build a TV commercial now using Gemini prompts, Higgsfield frames, Kling 3.0 clips, stitched in a free editor. Add music from Suno or Eleven Labs. Twenty seconds of polished content without a production crew.
That’s real empowerment.
But underneath that creative surface is massive centralized infrastructure.
The reasoning models? Controlled by a handful of companies.
The compute clusters? Concentrated in even fewer.
The capital funding this? Measured in tens of billions.
Even MiniMax’s cost disruption relies on substantial backing and infrastructure.
So yes, access is widening.
But control is not dispersing at the same rate.
And that asymmetry matters.
Because infrastructure layers shape markets quietly. They set pricing norms. They influence standards. They determine who builds on top of whom.
We’ve seen this movie before with cloud computing.
It starts with agility and innovation.
It ends with dependence.
I’m not saying we’re doomed to monopoly dynamics.
I am saying consolidation pressure is real — and accelerating.
Chapter 6 – The Background Noise That Isn’t Noise
Then there are the surrounding signals.
ByteDance launching Seedance 2.0.
Mustafa Suleyman suggesting most white-collar work could be automated within 12–18 months.
OpenAI retiring GPT-4o, GPT-4.1, and o4-mini.
Anthropic closing a $30B funding round at a $380B valuation.
An OpenAI researcher resigning over concerns about ads and manipulation.
Individually, these feel like news cycle fragments.
Collectively, they form a pattern of acceleration and tension.
Models are improving rapidly.
Funding is ballooning.
Internal ethical debates are surfacing publicly.
And timelines for automation are being discussed in months, not decades.
That compression changes psychology.
It forces companies to move faster than they might otherwise choose.
It forces workers to reevaluate skills faster than feels comfortable.
It forces regulators to play catch-up at impossible speed.
And somewhere inside all that momentum, consumer expectations quietly recalibrate.
We start expecting intelligence on demand.
Instant. Cheap. Integrated everywhere.
Conclusion – Intelligence as a Utility, and the Quiet Trade-Off
When I zoom out from this week — Google’s Deep Think surge, OpenAI’s hardware diversification, MiniMax’s cost disruption — it feels less like isolated product news and more like infrastructure solidifying.
Reasoning quality is climbing.
Inference speed is accelerating.
Costs are dropping.
Access is expanding.
Those are objectively positive vectors.
But infrastructure layers always come with trade-offs.
They make things easier.
They also make dependency invisible.
If intelligence becomes a utility — like electricity, like cloud storage — we’ll barely notice the shift day to day. We’ll just plug in. Generate. Iterate. Automate.
And slowly, decision-making scaffolding moves outward from individuals to systems optimized by companies whose incentives don’t always align perfectly with ours.
That’s not dystopian speculation.
It’s how infrastructure works.
The uncomfortable truth isn’t that AI is getting smarter.
It’s that it’s getting embedded.
And embedded systems are hardest to question once they’re everywhere.
This week wasn’t just about benchmarks.
It was about the quiet realization that the AI race isn’t merely competitive anymore.
It’s foundational.
And foundations, once poured, are very hard to reshape.
