The leaked MMLU-Pro score rising only marginally, from 86.5% to 89.2%, while estimated training costs run 8-10× GPT-4 levels, clearly demonstrates sharply diminishing returns to scaling. More tellingly, GPQA Diamond improved only 2.1 points to 78.4%, indicating that parameter and data scaling alone is approaching its marginal-utility limit for deep scientific reasoning. The reported delays to OpenAI's internal 'Project Orion' corroborate this interpretation: the pivot from pre-training to inference-time compute scaling is a strategic admission that foundation-model scaling is decelerating.
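A minimal sketch of that arithmetic, assuming the leaked scores are accurate, that the 8-10× cost estimate is a rough proxy for the compute ratio, and that the prior-generation baselines are as stated (the GPQA Diamond baseline is inferred as 78.4 − 2.1 = 76.3):

```python
import math

# Leaked / estimated figures from the argument above (treated as given).
# GPQA Diamond baseline is inferred from "improved only 2.1 points to 78.4%".
scores = {
    "MMLU-Pro":     {"baseline": 86.5, "gpt5_4": 89.2},
    "GPQA Diamond": {"baseline": 76.3, "gpt5_4": 78.4},
}
cost_multipliers = (8.0, 10.0)  # estimated training-cost ratio vs. GPT-4

def points_per_doubling(before: float, after: float, cost_ratio: float) -> float:
    """Accuracy points gained per doubling of estimated training compute."""
    return (after - before) / math.log2(cost_ratio)

for bench, s in scores.items():
    gains = [points_per_doubling(s["baseline"], s["gpt5_4"], m) for m in cost_multipliers]
    print(f"{bench}: {min(gains):.2f}-{max(gains):.2f} points per compute doubling")
```

Under those assumptions, each doubling of training compute buys under one accuracy point on either benchmark, which is the diminishing-returns pattern this reading points to.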
The 'ceiling' narrative ignores benchmark saturation effects. MMLU-Pro was never designed for 90%+ accuracy; its discriminative power collapses against frontier models. The genuinely critical metric is GPT-5.4's 81.7% on SWE-bench, a qualitative leap from GPT-4o's 53.1% on a benchmark that is still rapidly evolving. The leaked data also show multimodal understanding jumping from 72.3% to 89.6% on Video-MME, indicating that capability growth is migrating from text to cross-modal domains. Condemning overall progress on the basis of stagnation along a single dimension is methodologically fallacious.
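One way to make the saturation point concrete, again taking the leaked figures at face value, is to recast each gain as the fraction of remaining error eliminated rather than as absolute points; this reframing is illustrative, not part of the leak itself:

```python
def error_reduction(before: float, after: float) -> float:
    """Fraction of the remaining error (100 - before) closed by the newer model."""
    return (after - before) / (100.0 - before)

# (previous score, GPT-5.4 score) from the leaked figures cited above
leaked = {
    "MMLU-Pro":  (86.5, 89.2),
    "SWE-bench": (53.1, 81.7),
    "Video-MME": (72.3, 89.6),
}

for bench, (before, after) in leaked.items():
    print(f"{bench}: +{after - before:.1f} pts, "
          f"{error_reduction(before, after):.0%} of remaining errors closed")
```

On this framing the "marginal" MMLU-Pro gain still closes roughly a fifth of the remaining errors, while SWE-bench and Video-MME each close about three fifths, which is the contrast the saturation argument rests on.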
We must distinguish 'capability growth slowing' from 'research paradigm shifting.' While the leaked HumanEval score rose modestly from 92% to 94.5%, OpenAI's parallel o3 reasoning model hit 96.8% on the same test, indicating that resources are migrating from general pre-training to specialized inference architectures. The deeper problem is evaluative lag: current benchmarks cannot capture genuine progress on open-ended creative tasks, long-horizon planning, or value alignment. GPT-5.4's rumored performance on internal 'hidden benchmarks,' particularly multi-round negotiation and scientific research assistance, likely far exceeds what the public numbers suggest.