The news, 365 days behind — on purpose Delayed live · replaying 2025

One Year Ago.AI

Remember how fast this is.

22 May 2025, replayed one year on

Model launch: Anthropic · GitHub · Cursor · Replit · Block · Rakuten · Cognition · Manus · iGent · Sourcegraph · Augment Code

Anthropic unveils Claude Opus 4 and Sonnet 4 at its first developer conference

The new flagship model posts a state-of-the-art 72.5% on SWE-bench Verified and is the first to trigger Anthropic's ASL-3 safety tier over bio-risk concerns, while Claude Code reaches general availability with IDE integrations.

At its inaugural Code with Claude developer conference today, Anthropic launched Claude Opus 4 and Claude Sonnet 4, claiming the models set new standards for coding, advanced reasoning, and AI agents. Opus 4 scores 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, outperforming prior models on sustained multi-step tasks that can run for hours. Sonnet 4, a drop-in replacement for Sonnet 3.7, achieves 72.7% on SWE-bench Verified while balancing performance and efficiency.

Both models are hybrid, offering near-instant responses and an extended thinking mode that shows user-friendly summaries of reasoning — Anthropic acknowledged it withholds full chain-of-thought to protect competitive advantages. The models support parallel tool use, memory via local file access, and are 65% less likely to engage in reward hacking than Sonnet 3.7. Claude Code, previously in research preview, reached general availability with integrations for VS Code, JetBrains, and GitHub Actions, plus an SDK for building custom agents.

Notably, Opus 4 is the first model to trigger Anthropic’s ASL-3 safety tier: internal testing found it may substantially increase the ability of someone with a STEM background to obtain or deploy chemical, biological, or nuclear weapons. The company is rolling out expanded safeguards including harmful content detectors and cybersecurity defenses. Pricing holds at $15/$75 per million tokens for Opus 4 and $3/$15 for Sonnet 4.

Anthropic says Opus 4 beats Gemini 2.5 Pro and OpenAI’s o3 on SWE-bench Verified, but trails o3 on MMMU and GPQA Diamond. Partners including Cursor and GitHub praised the coding gains.

Cursor

Called Opus 4 state-of-the-art for coding and a leap forward in complex codebase understanding.

Replit

Reported improved precision and dramatic advancements for complex changes across multiple files.

Block

Said it was the first model to boost code quality during editing and debugging in its agent, codenamed goose.

Rakuten

Validated Opus 4 on an open-source refactor that ran independently for 7 hours with sustained performance.

Cognition

Noted that Opus 4 excels at complex challenges other models can't solve.

GitHub

Said Sonnet 4 soars in agentic scenarios and will power the new coding agent in GitHub Copilot.

Manus

Highlighted improvements in following complex instructions, clear reasoning, and aesthetic outputs.

iGent

Reported that Sonnet 4 cut codebase navigation errors from 20% to near zero.

Sourcegraph

Said the model shows a substantial leap in software development, staying on track longer and understanding problems more deeply.

Augment Code

Reported higher success rates and more surgical code edits, making Sonnet 4 its top choice.

One year later — open only if you can handle spoilers

Within months, Sonnet 4 became a default model in major coding assistants, while Opus 4's safety tier shaped industry conversations on bio-risk evaluation. Anthropic's shift to more frequent model updates accelerated later in 2025 with smaller iterative releases.
