Tracking humanity's last invention — in real time.
General fluid intelligence via abstract pattern tasks. Closest proxy to true general reasoning.
Real-world GitHub bug fixes requiring full codebase understanding and autonomous execution.
Expert-crafted mathematical problems beyond current textbooks. AI already exceeds expert humans.
PhD-level questions in biology, chemistry, and physics written and validated by domain experts.
Professional-level knowledge across 14 academic disciplines. Harder variant of the original MMLU, which covers 57 subjects.
Competitive programming problems from Codeforces, LeetCode, and AtCoder, tracked by release date to limit contamination.
Jan 2023 → Projected 2028
Six benchmarks, each covering a different dimension of intelligence: abstract reasoning (ARC-AGI), real-world engineering (SWE-Bench), mathematics (FrontierMath), scientific knowledge (GPQA), broad knowledge (MMLU-Pro), and coding (LiveCodeBench).
Each benchmark has a published "expert human" baseline set by PhD researchers, senior engineers, or competitive programmers. This tracker defines AGI as surpassing that baseline on all tracked benchmarks simultaneously.
The overall score is a weighted average of per-benchmark progress, with each benchmark's progress capped at 100%. Weights reflect each benchmark's uniqueness and replacement difficulty. Data is fetched daily from the llm-stats.com API.
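The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the tracker's actual implementation. It assumes that "progress" means the best model score divided by the expert-human baseline, and that the cap applies to each benchmark before averaging. The `Benchmark` class, the function names, and every number in `DEMO` are hypothetical placeholders, not real benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    score: float      # best model score, in percent (placeholder)
    baseline: float   # expert-human baseline, in percent (placeholder)
    weight: float     # relative weight (placeholder)


def progress(b: Benchmark) -> float:
    """Fraction of the baseline reached, capped at 1.0 (assumed definition)."""
    return min(b.score / b.baseline, 1.0)


def overall_progress(benchmarks: list[Benchmark]) -> float:
    """Weighted average of the capped per-benchmark progress values."""
    total_weight = sum(b.weight for b in benchmarks)
    return sum(b.weight * progress(b) for b in benchmarks) / total_weight


def agi_reached(benchmarks: list[Benchmark]) -> bool:
    """True only if every benchmark's score exceeds its baseline."""
    return all(b.score > b.baseline for b in benchmarks)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
DEMO = [
    Benchmark("bench_a", score=50.0, baseline=100.0, weight=2.0),
    Benchmark("bench_b", score=90.0, baseline=80.0, weight=1.0),
]

if __name__ == "__main__":
    print(f"overall progress: {overall_progress(DEMO):.1%}")
    print(f"AGI threshold met: {agi_reached(DEMO)}")
```

With these placeholder inputs, `bench_b` is capped at 100% even though its score is above the baseline, so the weighted average is (2 × 0.5 + 1 × 1.0) / 3, or about 66.7%. The cap keeps one saturated benchmark from masking a shortfall on another, which matches the requirement that all baselines be surpassed at once.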
Contributions, corrections, and new benchmark proposals welcome.