one year on
OpenAI unveils o1 models that 'think' before answering, solving 83% of problems on an IMO qualifying exam
The new Strawberry models use hidden chain-of-thought reasoning and test-time compute, outperforming GPT-4o on complex tasks but costing roughly six times more and sparking debate over transparency.
OpenAI today released o1-preview and o1-mini, the company’s first models that use hidden chain-of-thought reasoning before answering. Code-named Strawberry internally, the models mark a new approach: they spend additional compute time at inference to fact-check themselves and plan responses holistically.
In benchmarks, o1 scores 83% on a qualifying exam for the International Mathematical Olympiad, compared to GPT-4o’s 13%. OpenAI also says o1 reached the 89th percentile of participants in Codeforces programming challenges. However, Pablo Arredondo, a VP at Thomson Reuters, says o1 can take over 10 seconds to answer some questions. It is also significantly more expensive: $15 per million input tokens and $60 per million output tokens, roughly six times GPT-4o’s cost.
The most controversial aspect is that o1’s chain-of-thought is hidden from users at the API level. OpenAI shows only summaries, citing competitive advantage. Users pay for the reasoning tokens but cannot inspect them, prompting immediate debate about transparency and pricing. o1 can’t yet browse the web or analyze files, and its image-analyzing features are disabled pending additional testing. Weekly rate limits are 30 messages for o1-preview and 50 for o1-mini.
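To make the pricing concern concrete, here is a rough sketch of how a single request's cost could be estimated from the prices quoted above. It assumes hidden reasoning tokens are billed at the output rate; the function name and the token counts in the example are hypothetical, chosen only for illustration.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_PRICE_PER_M = 15.00
OUTPUT_PRICE_PER_M = 60.00


def estimate_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    """Estimate the cost of one request in dollars.

    Assumption: hidden reasoning tokens are billed at the output rate,
    even though the user never sees them.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * INPUT_PRICE_PER_M + billed_output * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: 1,000 input tokens, 500 visible output tokens,
# and 4,000 hidden reasoning tokens.
cost = estimate_cost(1_000, 500, 4_000)
print(f"${cost:.4f}")
```

In this made-up example the hidden reasoning accounts for most of the bill, which is the scenario critics of the pricing model have in mind.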
Reactions are mixed. Early testers praise its reasoning depth in law, science, and code, but note it still hallucinates — perhaps more than GPT-4o, per OpenAI’s own paper. The company says it will experiment with models that ‘reason for hours, days, or even weeks’ in future releases.
The record
A VP at Thomson Reuters said o1 is better than previous models at analyzing legal briefs and identifying solutions to LSAT logic games, describing its output as 'more substantive, multi-faceted analysis'.
A Wharton professor who tested o1 for a month wrote that it solved a challenging crossword puzzle correctly but still hallucinated a new clue, and noted 'Errors and hallucinations still happen.'
An OpenAI research scientist said on X that o1 is trained with reinforcement learning and 'the longer it thinks, the better it does'.
One year later — open only if you can handle spoilers
o1-preview set a new paradigm for reasoning models, with competitors like Google and Anthropic quickly launching similar 'thinking' variants. The hidden chain-of-thought debate continued for months, eventually leading OpenAI to offer opt-in visibility in later versions.