The latest GDPval leaderboard suggests frontier AI models are approaching industry-expert performance. But what does “parity” actually mean in practice, and do the results overstate real-world impact?
Introduction
GDPval is an evaluation benchmark for assessing AI model capabilities on real-world, economically valuable tasks. It covers the majority of U.S. Bureau of Labor Statistics work activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience.
I covered GDPval in late September in my post, “Evaluating AI Models for Real-World Tasks”, and as three months is a long time in the world of AI, I wanted to look at how the latest AI models have performed on this evaluation benchmark.
It is important to remember that, unlike benchmarks analogous to competitions (e.g. Maths Olympiads and the like), GDPval is designed to evaluate real-world tasks, and the results are graded by humans to determine whether they are better than, the same as, or worse than a human expert’s work.
Leaderboard
Let’s start with the current GDPval Leaderboard.

- The chart shows the percentage of Wins only and Wins+Ties for each LLM versus an industry expert, with 50% representing parity (vertical red dotted line).
- OpenAI GPT-5.2 is the current top model with 49.7% Wins (essentially at parity with industry experts) and 70.9% Wins+Ties.
- On September 25, 2025, the top model was Claude Opus 4.1; since then there have been new releases of Claude, Gemini and GPT, which now occupy the top three positions.
GPT-5.2 has now been assessed as better than or the same as an industry expert in 7 out of 10 real-world tasks, which is impressive to say the least.
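To make the arithmetic behind these percentages concrete, here is a minimal sketch using hypothetical per-task grades (not actual GDPval data): a model at 50% Wins is at parity, and a Wins+Ties figure around 70% corresponds to being at least as good as the expert in roughly 7 of 10 tasks.

```python
# Minimal sketch (hypothetical data): how Wins and Wins+Ties percentages are
# derived from per-task grader verdicts of "win", "tie" or "loss" vs the expert.
from collections import Counter

grades = ["win"] * 5 + ["tie"] * 2 + ["loss"] * 3   # hypothetical, not real GDPval data
counts = Counter(grades)
n = len(grades)

wins_pct = 100 * counts["win"] / n                          # Wins only
wins_ties_pct = 100 * (counts["win"] + counts["tie"]) / n   # Wins + Ties

print(f"Wins: {wins_pct:.1f}%  Wins+Ties: {wins_ties_pct:.1f}%")
# Wins: 50.0%  Wins+Ties: 70.0% -> parity on Wins, at least as good in ~7 of 10 tasks
```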
Occupations
The Leaderboard has further charts for each Sector and Occupation:

In this case the highlighted blue bar is GPT-5.2, and we can see that performance varies significantly for each model across sectors and occupations (each of which contains multiple tasks).
Of the ones shown above, the Financial Managers occupation has by far the lowest percentage of Wins+Ties at 24%, while Administrative Services has the highest at 87%; not a surprise to anyone, I assume.
Financial and Investment Analysts
The occupation of most interest to readers of this blog is Financial and Investment Analysts (Wins+Ties 62%), along with Securities, Commodities and Financial Services Sales Agents (Wins+Ties 78%).
The Leaderboard charts provide a convenient way to click on a chart and go straight to the repository of public tasks on Hugging Face, which I did for Financial and Investment Analysts:

This shows the five public tasks for this occupation with part of each prompt string. Expanding the highlighted one, we see the full prompt:

This is the Quant task I looked at in my earlier blog and ran in Gemini 2.5 Pro to get a feel for how well it was handled.
Let’s do so again, but with GPT-5.2, which has a 62% Wins+Ties rating for this occupation compared to 29% for Gemini 2.5 Pro, i.e. more than twice as good.
Quant Task
I entered the above prompt into ChatGPT with GPT-5.2 Auto selected, and it started Thinking and then Analyzing.
After 3 minutes, it completed and returned with the following:

I have saved the Python Notebook here for you to download.
Comparing the above summary with the one in my September blog, I would say this one is superior and more expert.
Similarly, scanning the new Python code, it is better than the earlier version for a number of reasons (a minimal sketch of this kind of pricing code follows the list):
- it considers both call and put examples, while the earlier version only ran a put example through the three different methods
- better parameter ranges are tested: more steps for the binomial tree, and wider space and time coverage for the finite-difference and Monte Carlo simulation methods
- better visuals for convergence and benchmark comparisons
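For context on what is being graded, here is a minimal sketch of my own (not the GPT-5.2 notebook) of this kind of pricing code: a Cox-Ross-Rubinstein binomial tree for European calls and puts, checked against the Black-Scholes closed form as the number of tree steps grows. The function names and parameters are mine, chosen for illustration.

```python
# A minimal sketch (author's own, not the GPT-5.2 notebook): price European
# calls and puts with a CRR binomial tree and compare against Black-Scholes
# as the number of tree steps grows.
import numpy as np
from scipy.stats import norm

def black_scholes(S, K, T, r, sigma, kind="call"):
    """Closed-form Black-Scholes price, used as the convergence benchmark."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    if kind == "call":
        return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
    return K * np.exp(-r * T) * norm.cdf(-d2) - S * norm.cdf(-d1)

def crr_binomial(S, K, T, r, sigma, steps, kind="call"):
    """Cox-Ross-Rubinstein binomial tree for a European option."""
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))
    d = 1.0 / u
    p = (np.exp(r * dt) - d) / (u - d)              # risk-neutral up probability
    # terminal asset prices (most up-moves first) and payoffs
    ST = S * u ** np.arange(steps, -1, -1) * d ** np.arange(0, steps + 1)
    values = np.maximum(ST - K, 0.0) if kind == "call" else np.maximum(K - ST, 0.0)
    # backward induction: discounted risk-neutral expectation at each step
    disc = np.exp(-r * dt)
    for _ in range(steps):
        values = disc * (p * values[:-1] + (1 - p) * values[1:])
    return values[0]

S, K, T, r, sigma = 100.0, 100.0, 1.0, 0.05, 0.2    # ATM example parameters
for kind in ("call", "put"):
    bs = black_scholes(S, K, T, r, sigma, kind)
    for steps in (50, 200, 800):                     # convergence in tree steps
        tree = crr_binomial(S, K, T, r, sigma, steps, kind)
        print(f"{kind:>4} steps={steps:>4}  tree={tree:.4f}  BS={bs:.4f}")
```

The same Black-Scholes benchmark can be reused for the finite-difference and Monte Carlo methods, which is essentially what the convergence charts in the notebook compare against.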
Still, there are aspects I would have preferred to see, such as a deeper analysis covering more than at-the-money (ATM) option examples, as well as calculations of the Greeks. (Note that I could have asked for these in a follow-on prompt.)
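As a hypothetical illustration of the follow-on work I had in mind, the Greeks can be estimated by bump-and-revalue around a pricer. This sketch reuses the black_scholes function from the snippet above for clarity and is my own illustration, not part of the GPT-5.2 output; the same approach applies to the tree, finite-difference and Monte Carlo pricers, with care over bump sizes.

```python
# Hypothetical follow-on sketch: Delta, Gamma and Vega by central-difference
# bump-and-revalue, demonstrated against the closed-form black_scholes pricer
# defined in the snippet above.
def bump_greeks(S, K, T, r, sigma, kind="call", dS=0.01, dvol=1e-4):
    price = lambda s, v: black_scholes(s, K, T, r, v, kind)
    p0 = price(S, sigma)
    delta = (price(S + dS, sigma) - price(S - dS, sigma)) / (2 * dS)
    gamma = (price(S + dS, sigma) - 2 * p0 + price(S - dS, sigma)) / dS**2
    vega = (price(S, sigma + dvol) - price(S, sigma - dvol)) / (2 * dvol)
    return delta, gamma, vega

# For the ATM call parameters above, this gives roughly 0.64, 0.019 and 37.5
delta, gamma, vega = bump_greeks(100.0, 100.0, 1.0, 0.05, 0.2)
print(f"delta={delta:.4f}  gamma={gamma:.4f}  vega={vega:.2f}")
```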
Overall, the output is pretty decent.
Results
Is the output better or the same as an Industry expert?
It depends on both the time available for the task and the skill level of the industry expert. We know some experts are more capable than others.
I would say the GPT-5.2 output is better than or the same as that produced by an expert given one day to work on the task.
Given a week, I am sure that a senior quant could produce a better analysis.
However, the fact that GPT-5.2 took just three minutes is important.
The same quant, starting from this prompt rather than a blank page, iterating with follow-on prompts and reviewing and modifying the code, could produce an even better result, and could do so in a few days as opposed to a week.
A productivity gain and improvement in quality.
Thoughts
Frontier AI models continue to improve on evaluations, and the results are very impressive across benchmarks, including GDPval.
Major vendors have similar capability, presumably because their pre-training datasets and compute are not that different.
Reinforcement learning (RL) in post-training differs between vendors and is a focus of competitive gain.
The danger is overfitting to the evaluations: if the evaluations are included in the RL process, the model will overfit to them and not generalise sufficiently.
Is this why actual model impact in the real-world is far less than the benchmark scores would suggest?
Perhaps too early to tell.
I listened to an interesting discussion on YouTube between Dwarkesh Patel and Ilya Sutskever. About six minutes in, Ilya uses the analogy of a student who spends 10,000 hours practising to get the best score in coding competitions and another who spends just 100 hours but still does really well; which one do you think will do better in their career? Dwarkesh replied the second, and Ilya said the models are like the first and, with this level of training, have less generalisable skills.
One to think about, and there is a lot more thought-provoking discussion in the video, which I recommend watching.
In Summary
- Question: What is GDPval and what does it measure?
- Answer: GDPval is a human-graded benchmark that evaluates how well AI models perform real-world, economically valuable tasks based on the work of experienced professionals across major sectors of U.S. GDP. Outputs are judged as better, the same, or worse than an industry expert, making it a practical measure of on-the-job capability rather than abstract problem-solving.
- Question: How should GDPval leaderboard scores be interpreted?
- Answer: A Wins score near 50% indicates parity with an industry expert, while Wins+Ties shows how often a model is at least as good as a human. These scores should be interpreted by occupation and task, as aggregate results can hide large performance differences across sectors.
- Question: Why is GDPval useful for assessing LLMs in real-world use?
- Answer: GDPval highlights where LLMs can deliver expert-level or near-expert results in minutes, creating meaningful productivity gains. It also helps distinguish genuine real-world impact from inflated benchmark performance due to overfitting.

