What happened, in plain English
DeepSeek has released an open‑weight, math‑specialist model called DeepSeekMath‑V2 that reaches gold‑medal level on the 2025 International Mathematical Olympiad (IMO), matches gold level on the 2024 Chinese Mathematical Olympiad (CMO), and scores 118/120 on Putnam 2024 with scaled test‑time compute. The company published the weights and a detailed technical report explaining the training and evaluation recipe—the “playbook”—on GitHub and Hugging Face under a permissive license.

This matters because until now, gold‑level IMO results were achieved by proprietary systems from Google DeepMind and OpenAI. DeepSeek’s release is the first time a model at that level is available as open weights for anyone to run and study. Google and OpenAI disclosed their gold‑level results in July 2025; DeepSeek’s open release arrived on November 27–29, 2025 (China/US time zones), closing the openness gap.
What’s actually new here
- Open weights at gold‑level IMO performance: The DeepSeekMath‑V2 model card and repo state gold‑level scores on IMO 2025 and CMO 2024, plus a near‑perfect Putnam 2024 run. The Hugging Face page lists the license as Apache‑2.0 and exposes the full weight files. Total downloadable size is roughly 689 GB.
- A published “playbook” for self‑verifiable proofs: The technical report describes training an LLM‑based verifier (to check proof rigor), then using that verifier as a reward model to train a proof generator. DeepSeek scales “verification compute” to auto‑label hard cases and improves both verifier and generator in a loop. The repo includes predictions/outputs for inspection.
- Built on a modern open backbone: DeepSeekMath‑V2 sits on DeepSeek‑V3.2‑Exp‑Base, an experimental MoE backbone related to DeepSeek‑V3 (671B total parameters with ~37B activated per token, plus a 14B‑parameter Multi‑Token Prediction module; 685B total on Hugging Face). The V3.2 release ships kernels and recipes and is itself open‑weight under MIT.
The playbook: How DeepSeek trained a model to check its own math
DeepSeek’s report frames a shift from “final‑answer correctness” to “proof rigor,” aiming for self‑verifiable reasoning:
- Train a verifier: Build an LLM‑based verifier that scores proofs for completeness and rigor—even without a reference solution.
- Reward the generator with the verifier: Use the verifier as a reward model in reinforcement learning so the generator is incentivized to detect and fix issues in its own proofs before finalizing.
- Scale verification compute: As the generator improves, scale up verification to automatically label harder proofs and feed that back to strengthen the verifier.
- Spend compute at test time: Use “scaled test‑time compute” (multi‑sample, verify‑and‑select) to search for a correct, rigorous proof path.
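The loop above can be sketched in a few lines. This is a minimal illustration of verify‑and‑select search, not DeepSeek's actual pipeline: `generate_proof` and `verify` are hypothetical stand‑ins for what are, in practice, large language models, and the scoring and threshold values are invented for the example.

```python
import random

def generate_proof(problem, seed):
    """Hypothetical stand-in for the proof generator (an LLM in practice)."""
    random.seed(hash((problem, seed)))
    return {"problem": problem, "text": f"proof attempt #{seed}",
            "quality": random.random()}

def verify(proof):
    """Hypothetical stand-in for the LLM-based verifier: returns a
    rigor score in [0, 1] rather than a binary final-answer check."""
    return proof["quality"]

def verify_and_select(problem, n_samples=16, threshold=0.9):
    """Scaled test-time compute: sample many candidate proofs,
    score each with the verifier, keep the most rigorous one."""
    candidates = [generate_proof(problem, s) for s in range(n_samples)]
    best = max(candidates, key=verify)
    return best, verify(best) >= threshold

best, accepted = verify_and_select("some olympiad problem")
print(round(verify(best), 3), accepted)
```

In training, the same verifier score would serve as the reinforcement‑learning reward, so the generator learns to produce proofs that survive verification rather than proofs that merely reach the right final answer.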
This verifier‑generator loop is the core intellectual contribution—the part many practitioners have wanted to study closely. Unlike prior “answer‑only” reinforcement learning, it trains the model to care about the steps, not just the score at the end.
How it stacks up
- IMO 2025: Gold‑level performance, matching the closed‑system results Google DeepMind and OpenAI announced in July 2025. DeepMind's result was officially graded by IMO coordinators, a useful reference for the bar DeepSeek is targeting.
- Putnam 2024: 118/120 with scaled test‑time compute, according to DeepSeek’s report and model card.
- Proof‑focused benchmarks: The repo cites strong performance on DeepMind’s IMO‑ProofBench and publishes model predictions for review.
DeepSeekMath‑V2 at a glance
| Item | What DeepSeek published | Where |
|---|---|---|
| License | Apache‑2.0 (open weights) | Hugging Face model card |
| Weights size | ~689 GB across 160+ safetensors | Hugging Face file tree |
| Base model | DeepSeek‑V3.2‑Exp‑Base (MoE, related to V3’s 671B total / ~37B active; +14B MTP = 685B on HF) | V3 HF card, V3.2 GitHub |
| Headline results | IMO 2025 and CMO 2024 at gold level; Putnam 2024 at 118/120 | DeepSeekMath‑V2 repo |
Why this is a big deal for builders
- Reproducibility and auditing: Publishing outputs and a clear training/evaluation narrative lets teams audit behavior and attempt ablations, not just take a benchmark number on faith.
- On‑prem and regulated use: Open weights under Apache‑2.0 enable private deployments, which matters for research groups and enterprises that can’t send data to a third‑party API.
- A foundation for automation: Proof‑style verification maps to real tasks that require step‑wise rigor—formal checks, scientific derivations, safety‑critical control logic, and compliance drafting—areas where “answer‑only” models often stumble.
What to try first
- Start with smaller slices: Recreate DeepSeek’s verify‑and‑select loop on your own domain proofs (e.g., data‑pipeline invariants, symbolic math steps) before scaling up inference samples.
- Use the backbone recipes: V3.2‑Exp ships kernels and example inference recipes (vLLM, SGLang). Even if you don’t need 685B weights end‑to‑end, you can adopt the serving patterns.
- Budget for test‑time compute: Gold‑level results lean on multi‑sample search and verification. Plan latency/throughput trade‑offs up front.
Important caveats and open questions
- Official grading context: DeepMind’s IMO submission was graded by IMO coordinators; OpenAI also reported gold‑level results. DeepSeek’s report provides detailed outputs and methodology, but (as with many AI Olympiad evaluations) independent certification and compute budgets remain areas to watch.
- Open‑weight ≠ fully open: The weights and report are public, but the entire training corpus and pipeline aren’t fully released—par for the course in 2025. Still, Apache‑2.0 licensing is among the most permissive for running and adapting the model.
- Hardware reality check: The full checkpoint is massive (≈689 GB). Many teams will experiment via sharded, accelerated inference stacks or distilled variants rather than loading the whole model on a single machine.
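A quick sanity check on that last caveat: a lower bound on GPU count just to hold the checkpoint in memory. The 20% overhead reservation is a hypothetical planning number; real deployments need additional headroom for KV cache, activations, and parallelism inefficiencies:

```python
import math

def min_gpus(checkpoint_gb, gpu_mem_gb, overhead_frac=0.2):
    """Rough floor on GPUs needed to hold the weights alone, reserving
    a fraction of each GPU's memory for KV cache and activations."""
    usable = gpu_mem_gb * (1 - overhead_frac)
    return math.ceil(checkpoint_gb / usable)

print(min_gpus(689, 80))  # 11 x 80 GB GPUs as an absolute floor
```

This is why most teams will reach for sharded inference stacks or distilled variants rather than a single machine.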
The bigger picture
DeepSeek’s move validates a direction many in the field have been pushing toward: models that can articulate, check, and refine their own reasoning, not just emit a final answer. Making that capability openly inspectable—weights, outputs, and a training playbook—should accelerate independent research on verifier quality, test‑time scaling, and the limits of self‑correction. If the last wave was about “thinking longer,” the next one looks like “thinking better—with receipts.”
Sources
- DeepSeekMath‑V2 GitHub: README, tech report PDF, and outputs. DeepSeek‑AI on GitHub.
- DeepSeekMath‑V2 model card and license. Hugging Face.
- File manifest and total download size (~689 GB). Hugging Face tree.
- Context: DeepMind’s officially graded gold‑level IMO result (July 21, 2025). Google DeepMind blog.
- Context: Reuters roundup on Google/OpenAI gold‑level IMO results. Reuters.
- Overview coverage and Delangue quote on the significance of open availability. South China Morning Post.
- Backbone details: DeepSeek‑V3 parameters (671B total/37B active, +14B MTP). Hugging Face: DeepSeek‑V3.
- V3.2‑Exp release (sparse attention, kernels, licenses). GitHub: DeepSeek‑V3.2‑Exp.