I think jagged intelligence is one of the most important concepts in the discussion about AI capabilities, and one of the things most heavily coloring how those capabilities are perceived. In short, AI's skills are implicitly compared to human ones, but that is misleading because its failure modes are very inhuman. Bluntly, an AI gets things wrong that only a very stupid person would get wrong, and so its overall ability is sometimes judged at that level.
Jagged intelligence is a loosely defined term coined by Karpathy for the phenomenon of LLMs performing some very complex tasks remarkably well while failing at simple ones.
AnachronisticPenguin:
Here is a little chart I had Gemini come up with a few weeks back. If you compare the different models across the different benchmarks, you can see where the largest jump occurs for each form of understanding.
|**Model (Release Era)**|**Technological Milestone**|**MMLU (General Knowledge)**|**GPQA Diamond (PhD Science)**|**AIME (Competition Math)**|**SWE-Bench Verified (Real-World Coding)**|**Humanity’s Last Exam (Deep Reasoning)**|
|:-|:-|:-|:-|:-|:-|:-|
|**GPT-3** (2020)|The Scaling Proof|~43.9%|~25.0% *(Random)*|0.0%|0.0%|0.0%|
|**GPT-4/4o** (2023/24)|Dense Parameter Peak|86.4%|53.6%|9.3%|4.4%|< 2.0%|
|**o1** (Sep 2024)|Process Reward Models|92.3%|77.3%|74.4%|~40.0%|~8.0%|
|**o3** (Dec 2024)|Agentic CoT Maturation|93.3%|83.4%|96.7%|71.7%|20.3%|
|**GPT-5.2 Thinking** (Dec 2025)|Unified Agentic Intelligence|*Saturated*|92.4%|100.0%|80.0%|35.2%|
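
If you want to check the "largest jump" reading yourself, here is a rough sketch that just subtracts consecutive rows of the chart. The numbers are copied from the table as-is (so they inherit whatever errors Gemini made); treating the "~" values as exact, reading "< 2.0%" as 2.0, and skipping the "Saturated" MMLU cell are my assumptions, and `largest_jump` is just a helper name made up for this example.

```python
# Scores copied from the chart above, in percent.
# Assumptions: "~" values are treated as exact, "< 2.0%" is read as 2.0,
# and the "Saturated" MMLU cell is stored as None and skipped.
MODELS = ["GPT-3", "GPT-4/4o", "o1", "o3", "GPT-5.2 Thinking"]
SCORES = {
    "MMLU": [43.9, 86.4, 92.3, 93.3, None],
    "GPQA Diamond": [25.0, 53.6, 77.3, 83.4, 92.4],
    "AIME": [0.0, 9.3, 74.4, 96.7, 100.0],
    "SWE-Bench Verified": [0.0, 4.4, 40.0, 71.7, 80.0],
    "Humanity's Last Exam": [0.0, 2.0, 8.0, 20.3, 35.2],
}


def largest_jump(scores):
    """Return (gain, from_model, to_model) for the biggest step between consecutive models."""
    best = None
    for i in range(1, len(scores)):
        prev, cur = scores[i - 1], scores[i]
        if prev is None or cur is None:
            continue  # no number to compare against
        gain = cur - prev
        if best is None or gain > best[0]:
            best = (gain, MODELS[i - 1], MODELS[i])
    return best


for bench, scores in SCORES.items():
    gain, src, dst = largest_jump(scores)
    print(f"{bench}: +{gain:.1f} points from {src} to {dst}")
```

On these numbers the biggest step is not at the same transition for every benchmark: MMLU and GPQA Diamond jump most from GPT-3 to GPT-4/4o, AIME and SWE-Bench Verified from GPT-4/4o to o1, and Humanity's Last Exam only at the last step. That unevenness is the jaggedness being discussed.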