GPT-5: the current winner in the Frontier Model Race

The frontier AI race never slows down. Just a month after xAI’s Grok-4 launch, OpenAI unveiled GPT-5 on August 7th, pushing the competitive bar higher yet again.

While the online buzz around GPT-5 has been mixed—some early video reviewers called it “less revolutionary”—my own testing and research paint a different picture. GPT-5 is a significant leap forward, not just an incremental update. Its combination of higher accuracy (lower hallucination rates), cheaper token pricing, massive context windows, and multimodal capabilities has the potential to usher in a new era for AI application development.

Before I explain why, let’s break down the key new features and see how GPT-5 stacks up against Grok-4 across benchmarks and real-world use cases. I especially encourage you to take a close look at the benchmark comparisons.

Feature Highlights of GPT-5

OpenAI describes GPT-5 as a unified, smarter, and more broadly useful system—and much of that marketing talk holds up in practice. Here’s what stands out:

1. Unified Model with Broad Strengths

GPT-5 isn’t just an upgrade in one area—it’s been tuned to excel in writing, coding, and health-related tasks, three major ChatGPT use cases. The improvements aren’t flashy gimmicks but targeted upgrades that matter for professional and enterprise users.

2. Faster, More Efficient Reasoning

The model responds more quickly, handles long and complex prompts better, and produces more accurate, safer answers across a wider range of queries. 

OpenAI stated that GPT-5 (with thinking) performs better than OpenAI o3 while using 50-80% fewer output tokens across capabilities, including visual reasoning, agentic coding, and graduate-level scientific problem solving.
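
If you want to sanity-check that token-efficiency claim on your own prompts, here is a minimal sketch using the OpenAI Python SDK. It measures raw completion tokens on a single prompt, which is not OpenAI's exact methodology, and the model IDs are assumptions based on this article.

```python
# Minimal sketch: compare output-token usage across models on one prompt.
# Assumes the OpenAI Python SDK (v1.x) and that OPENAI_API_KEY is set;
# the model IDs "gpt-5" and "o3" are assumptions, not verified here.
from openai import OpenAI

client = OpenAI()

def output_tokens(model: str, prompt: str) -> int:
    """Run one prompt and return the number of completion tokens used."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.usage.completion_tokens

prompt = "Explain why the sky appears blue, in two short paragraphs."
for model in ("gpt-5", "o3"):
    print(model, output_tokens(model, prompt))
```

A single prompt is noisy, of course; averaging over a representative prompt set would say much more.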

3. Accuracy Improvements

OpenAI claims GPT-5 is:

  • ~45% less likely to hallucinate than GPT-4o
  • ~80% less likely to hallucinate than OpenAI’s o3 model (even with web search enabled)
  • On fact-seeking benchmarks like LongFact and FActScore, GPT-5 thinking produces 6× fewer hallucinations than o3

Not everyone agrees—my friend Puneet Anand at AIMon found reduced accuracy compared to GPT-4.5 in their SimpleQA tests. But on aggregate, GPT-5 appears more reliable on open-ended and real-world knowledge queries.

4. More Honest Responses

The model is also less likely to mislead. For example:

  • On the CharXiv benchmark (with missing images), GPT-5’s deception rate was just 9% versus 86.7% for o3.
  • In real ChatGPT traffic, deception dropped from 4.8% (o3) to 2.1% (GPT-5 reasoning).

This honesty translates to better trust and transparency—crucial for enterprise AI adoption.

5. Safer and More Helpful in Sensitive Domains

A new safe completions training approach means GPT-5 can answer nuanced questions in dual-use domains like virology, where older models might default to rigid refusals. Instead, GPT-5 offers safe, well-framed alternatives.

6. Less Sycophantic, More Professional

GPT-5 reduces over-agreement and “yes-man” behavior:

  • Sycophancy dropped from 14.5% to <6%
  • Responses are less emoji-heavy and feel more like talking to a PhD-level colleague than a friendly assistant.
  • Over-agreement rates fell without hurting user satisfaction.

7. Biological Risk Safeguards

OpenAI treats GPT-5 thinking as high capability in biological and chemical domains. It underwent 5,000+ hours of red-teaming, multi-layered safeguards, and thorough risk modeling—while showing no evidence of enabling severe harm.

8. GPT-5 Pro for Advanced Reasoning

Replacing OpenAI’s o3-pro, GPT-5 Pro adds extended reasoning and parallel compute:

  • 88.4% on GPQA (state-of-the-art)
  • Outperforms standard GPT-5 reasoning in 67.8% of complex prompts
  • 22% fewer major errors on real-world reasoning tasks

Benchmark Comparison: GPT-5 vs. Grok-4

Not all benchmark sites have tested GPT-5 yet, and results vary by methodology. I’m especially curious about Andon Lab’s Vending-Bench, which hasn’t published its GPT-5 results yet. Still, here’s what’s clear from available data:

  • GPT-5: 94.6% on AIME 2025 (math), 88.4% on GPQA (science, with GPT-5 Pro), 42% on Humanity’s Last Exam
  • Grok-4: 98.4% on AIME 2025, 87.5% on GPQA, 40% on Humanity’s Last Exam

Grok-4 holds a math edge, but GPT-5 leads in science and the more holistic Humanity’s Last Exam (HLE).

The Artificial Analysis leaderboard likewise ranks GPT-5 as the most intelligent model, followed by Grok-4.


[Screenshot: Artificial Analysis intelligence leaderboard]

The Humanity’s Last Exam (HLE) Factor

AI capability is evaluated with benchmarks, yet as progress accelerates, benchmarks saturate quickly and lose their utility as measurement tools. Performing well on formerly frontier benchmarks such as MMLU and GPQA is no longer a strong signal of progress, because frontier models now reach or exceed human-level performance on them.

The HLE, created by Scale AI and the Center for AI Safety, is designed to combat exactly this saturation.

HLE is different:

  • 2,500 multi-modal, subject-diverse, cutting-edge questions
  • Tests both depth of reasoning and breadth of knowledge
  • Forces models to operate at the frontier of academic and scientific capability

High accuracy on HLE would demonstrate AI has achieved expert-level performance on closed-ended cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or “artificial general intelligence.”

GPT-5 leads on the HLE benchmark. Even the best models score low here, and GPT-5’s 42% is only a modest lead, but in such a difficult setting those extra points matter.


[Screenshot: Humanity’s Last Exam leaderboard]

Human Preference Ratings: LMSYS Chatbot Arena

The LMSYS Chatbot Arena is particularly valuable because it ranks models based on crowdsourced, head-to-head comparisons. Users interact with two anonymous models and vote for the better experience.
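
To make that mechanism concrete, here is a toy sketch of how pairwise votes can become a ranking. It uses a plain Elo update as a stand-in for the Bradley-Terry-style statistics LMSYS actually computes, and the votes below are illustrative, not real Arena data.

```python
# Toy sketch: turn head-to-head votes into Elo-style ratings.
from collections import defaultdict

K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome."""
    surprise = 1.0 - expected(ratings[winner], ratings[loser])
    ratings[winner] += K * surprise
    ratings[loser] -= K * surprise

ratings = defaultdict(lambda: 1000.0)  # every model starts equal
votes = [("gpt-5", "grok-4"), ("gpt-5", "gemini-2.5-pro"),
         ("grok-4", "gemini-2.5-pro"), ("gpt-5", "grok-4")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```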

From the latest Arena overview:

  • GPT-5 dominates—#1 in every subcategory: Hard Prompts, Coding, Math, Creative Writing, Instruction Following, Longer Query handling, and Multi-Turn conversations.

  • Gemini 2.5 Pro sits at #2.


[Screenshot: LMSYS Chatbot Arena category rankings]

However, when I dropped the above screenshot into Gemini 2.5 Pro, Grok-4, and GPT-5 to see how each interpreted it:

  • Gemini 2.5 Pro hallucinated, talking about GPT-4.0 instead of GPT-5—likely due to outdated knowledge.

  • Grok-4 and GPT-5 both accurately reported GPT-5’s leadership.

You can see Gemini 2.5 Pro's answer below:


[Screenshot: Gemini 2.5 Pro’s response]

Mixed Signals from LLM-Stats

Interestingly, LLM-Stats ranked GPT-5 slightly behind Grok-4, likely due to weighting differences or incomplete benchmark coverage. If the still-unpublished Vending-Bench results are added later, GPT-5’s standing there could improve.


[Screenshot: LLM-Stats model rankings]

Different Strength Profiles

In my own simple tests, GPT-5 and Grok-4 were the top performers, while Gemini 2.5 Pro disappointed me.

  • GPT-5: Best choice for broad, reliable, and safe enterprise-grade usage—especially where speed, accuracy, and long-context handling are essential.

  • Grok-4 (Heavy variant): Shines in complex, dynamic reasoning tasks, particularly those involving real-time tool integration and creative problem-solving—though at higher complexity and cost.

Concerns About GPT-5’s Architecture

While not naming Grok-4 specifically, my friend at AIMon suspects GPT-5’s occasional accuracy drops come from its acting as a smart switchboard that routes queries to specialized models. In their tests, this sometimes reduced performance compared to a single, specialized GPT-4.5.
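
To illustrate the pattern being described, here is a hypothetical sketch of a query switchboard. This is not OpenAI's actual internals: the routing logic is a naive keyword heuristic, and the model IDs are illustrative assumptions.

```python
# Hypothetical sketch of a "smart switchboard": route each query to a
# specialized model, then call it. Not OpenAI's actual architecture.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def route(query: str) -> str:
    """Naive keyword router; a real switchboard would use a trained classifier."""
    if any(word in query.lower() for word in ("code", "bug", "function")):
        return "gpt-5"        # illustrative: a coding-oriented route
    if len(query) < 80:
        return "gpt-5-mini"   # illustrative: a cheaper tier for quick chat
    return "gpt-5"            # illustrative: a heavier reasoning route

def answer(query: str) -> str:
    response = client.chat.completions.create(
        model=route(query),
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

print(answer("Why does my Python function return None?"))
```

The failure mode AIMon describes would live in the router: if a query lands on the wrong specialist, the answer can come out worse than what a single strong model would have produced.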

My Take: Subtle but Significant

GPT-5’s launch may not have the “wow” factor of its 2022 debut, but that’s partly because the low-hanging fruit is gone. The gains now come from nuanced improvements that deeply impact real-world usability.

From my perspective, GPT-5 represents:

  • Higher trustworthiness through reduced hallucination and deception
  • Lower costs with cheaper tokens, enabling broader application deployment
  • Better safety handling, opening up responsible use in sensitive fields
  • Enterprise readiness with massive context windows and multimodal capabilities

AI is only as valuable as it is trustworthy, and that matters most for the end users of AI applications and agents. For developers, token costs are another defining factor.

GPT-5 delivers unmatched API value at $1.25/M input tokens and $10.00/M output tokens: half the input cost of GPT-4o at the same output rate. By comparison, Grok 4 charges $3.00/M input and $15.00/M output. Gemini 2.5 Pro matches GPT-5 on output, and on input for prompts ≤200k tokens, but for longer prompts its input cost rises to $2.50/M. Gemini’s edge is its much larger context window of 1M input tokens versus GPT-5’s 400K. How critical is that? I would not rate it above trustworthiness and token costs.
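
To put those prices in workload terms, here is a quick sketch that computes the bill for a hypothetical workload of 2M input and 0.5M output tokens, using only the per-million prices quoted above.

```python
# Back-of-the-envelope API cost comparison. Prices ($ per million tokens)
# are taken from the article; the workload size is a made-up assumption.
PRICES = {
    "GPT-5": (1.25, 10.00),
    "Grok 4": (3.00, 15.00),
    "Gemini 2.5 Pro (prompts <= 200k)": (1.25, 10.00),
    "Gemini 2.5 Pro (prompts > 200k)": (2.50, 10.00),
}

input_m, output_m = 2.0, 0.5  # millions of tokens in the hypothetical workload

for model, (price_in, price_out) in PRICES.items():
    cost = input_m * price_in + output_m * price_out
    print(f"{model}: ${cost:.2f}")
```

On those assumptions GPT-5 comes out at $7.50 versus $13.50 for Grok 4, the kind of gap that compounds quickly at production volume.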

With GPT-5, developers can build apps and agents that are not just smarter and faster, but also more dependable and cost-effective. If I am developing a general AI application, which model should I choose? It seems like a no-brainer.


Federico Oldoni, Ph.D.

MBA Candidate | Rady Venture Capital Club & Venture Institute by VC Lab | Ex-Amgen | Healthcare & Biopharma | M&A, Finance & Strategy | Startup Ecosystem | Consulting

4w

Thanks for sharing, Nicole

Alexandra Rogers

Investment Director | CPG, Health & Wellness, Technology | Passionate about Startups, Innovation & Early-Stage Investing | Venture Institute VC Lab fellow Cohort V

4w

We’re entering the new era of AI utility with multimodal capabilities leading the charge. The bar just got higher!

Nicole Hu

Investor, Experimenter, Author, Xoogler, Ex-Appdynamics

1mo

I’ve heard mixed experiences now. One friend shared that his experience made him feel GPT-5 was "super non-functional, and surprisingly incapable." Others complained it took 15 minutes to generate a spreadsheet and couldn’t wait to switch back to GPT-4. Benchmark metrics are one thing; an individual’s experience is another. There were some good experiences with GPT-5 too. I wonder what made the difference: is it the use case, or the prompt?

Riccardo Petrantoni

Tech Entrepreneur | Startup Advisor | VC Lab Fellow

1mo

Thanks for sharing your research, Nicole. Definitely worth a read.
