GDPval is a new benchmark from OpenAI that evaluates AI models on economically valuable real-world tasks. In this article I take a look at the findings of GDPval and drill down into the Finance sector tasks.
Introduction
OpenAI recently introduced GDPval, a new evaluation that measures model performance on economically valuable real-world tasks across 44 occupations in the top 9 sectors contributing to U.S. GDP.
For each occupation, the OpenAI team worked with experienced professionals to create representative tasks that reflect their day-to-day work. Each task has two primary components: a request (often with reference files) and a deliverable work product.
Human experts then graded blindly, comparing AI-model-generated deliverables with those produced by the task writers in pairwise comparisons, ranking each deliverable as “better” than, “as good as”, or “worse” than the other.
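As a concrete illustration of how such pairwise grades roll up into a headline number, here is a minimal sketch. The three labels are the ones used above; the aggregation rule (counting “as good as” alongside “better”, averaged over reviews) is my assumption of a simple win-or-tie rate, not code from the paper.

```python
from collections import Counter

def win_or_tie_rate(grades: list[str]) -> float:
    """Fraction of blinded pairwise reviews in which the model's
    deliverable was rated 'better' than, or 'as good as', the expert's."""
    counts = Counter(grades)
    return (counts["better"] + counts["as good as"]) / len(grades)

# Hypothetical grades from five blinded expert reviews of a single task
grades = ["better", "worse", "as good as", "worse", "better"]
print(f"{win_or_tie_rate(grades):.0%}")  # 60%
```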
And the results?

The top model produced deliverables rated as good as or better than the human experts’ in just under half the tasks.
Which is hugely impressive.
And model performance is increasing roughly linearly over time, as the chart below shows, from GPT-4o through to GPT-5 and an internal, experimental model.

(OpenAI’s introduction to GDPval is here, and the full paper here).
Advantages
GDPval differs from existing AI model evaluations, which are generally in the style of an academic test (see my blog on LLM Comparisons – Benchmarks and Leaderboards), in a number of ways:
- Realistic: tasks are based on actual work products from industry experts
- Representative: breadth across the U.S. economy, covering 9 sectors and 44 occupations that collectively earn $3 trillion
- Multi-modal: tasks involve manipulation of a variety of file formats and computer use
- Subjective: expert graders consider correctness as well as subjective factors, e.g. style, aesthetics and relevance
- Long-horizon: tasks required an average of 7 hours for an expert to complete
In total there are 1,320 tasks in the full GDPval set and each task (request and work product) received an average of 5 human reviews.
Clearly, constructing and running the evaluation was an expensive exercise, particularly given the industry-expert graders, who were compensated for their time.
Open-Sourcing
OpenAI has open-sourced a gold subset of 220 tasks; the dataset (prompts and reference files) is available at evals.openai.com.
There is also an experimental automated grader.
Let’s take a detailed look at a few of the tasks in the Finance and Insurance sector.
Finance sector
The Finance and Insurance sector included the following occupations:
- Financial and Investment Analysts
- Securities, Commodities and Financial Services Sales Agents
- Customer Service Representatives
- Personal Financial Advisors
- Financial Managers
Each occupation has a number of tasks, 25 in total in the gold subset.
Let’s look at a few of these, to get a sense of their “real-world” nature.

Interestingly, the prompt is a task for a Quant, with clear deliverables requiring the creation of Python code, visualisations and a summary of key findings, with the goal of determining the most appropriate pricing method for American-style options.
By no means a simple task, and one that would take a real Quant a few days of work.
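To make the task concrete, here is a minimal sketch of just one of the three methods a Quant would compare: a Cox-Ross-Rubinstein binomial tree for an American put. The parameter values at the bottom are illustrative, not taken from the task’s reference files.

```python
import numpy as np

def american_put_binomial(S, K, T, r, q, sigma, steps=500):
    """Price an American put on a Cox-Ross-Rubinstein binomial tree."""
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))            # up move
    d = 1.0 / u                                # down move
    p = (np.exp((r - q) * dt) - d) / (u - d)   # risk-neutral up probability
    disc = np.exp(-r * dt)

    # Option values at expiry for every terminal node
    j = np.arange(steps + 1)
    value = np.maximum(K - S * u**j * d**(steps - j), 0.0)

    # Roll back through the tree, checking early exercise at each node
    for i in range(steps - 1, -1, -1):
        value = disc * (p * value[1:] + (1 - p) * value[:-1])
        j = np.arange(i + 1)
        value = np.maximum(value, K - S * u**j * d**(i - j))
    return value[0]

# Illustrative parameters only, not from the task's reference files
print(american_put_binomial(S=100, K=100, T=1.0, r=0.05, q=0.0, sigma=0.20))
```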
Let’s look at another task.

This one requires construction of a correlation matrix in Excel using historical return series of MSCI indices and then an analysis document with conclusions on why and how to diversify exposure.
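The task itself asks for Excel, but to give a sense of the underlying calculation, here is a pandas equivalent; the filename msci_levels.xlsx and the column layout are my assumptions, not the task’s actual reference files.

```python
import pandas as pd

# Hypothetical file: monthly index levels, one column per MSCI index
levels = pd.read_excel("msci_levels.xlsx", index_col=0, parse_dates=True)

returns = levels.pct_change().dropna()   # simple monthly returns
corr = returns.corr()                    # pairwise correlation matrix

corr.to_excel("msci_correlations.xlsx")  # hand back to Excel for the write-up
print(corr.round(2))
```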
And one more, again for a Quant (my own bias here):

This task also includes reference files to use and requests a trading and sales strategy for the energy market focusing on oil and gas for 1H 2025.
Non-trivial indeed and many days of work for an expert in this field.
Time and Cost
Which brings me on to another aspect: the time taken (speed) and the cost of getting to a quality deliverable. The full paper covers this below:


These show a 1.25x improvement in speed and a 1.5x improvement in cost for o3, while GPT-5 is close to 1.5x for both speed and cost.
To see this in action for myself, I took the first prompt from above, “You are a Quantitative Researcher at a proprietary trading firm. Historically, your desk has….” and entered it into Gemini:

And in less than a minute I had an output with python code and a final recommendation.

While I would need to check this recommendation with a human quant, it seems good to me:
- The code includes all three methods (binomial, FDM, Monte Carlo) with charts, tables of results and performance comparisons.
- However, it only has one set of input values for the option parameters (S, K, T, r, q, sigma), when I would have expected many sets from a real quant analysis, covering a range of strikes (ATM, ITM, OTM), a range of expiries, and both low-vol and high-vol regimes.
- In addition to price, the three methods should also have been evaluated on the calculation of Greeks (delta, gamma, vega, theta); see the sketch after this list.
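To illustrate the kind of broader analysis I had in mind, here is a sketch that reuses the hypothetical american_put_binomial pricer from the earlier example and adds bump-and-reprice Greeks; the grids of strikes, expiries and vols are illustrative.

```python
import pandas as pd

# Reuses american_put_binomial() from the earlier sketch.
S, r, q = 100.0, 0.05, 0.0
strikes = [80.0, 100.0, 120.0]   # OTM / ATM / ITM for a put
expiries = [0.25, 1.0, 2.0]      # years
vols = [0.10, 0.40]              # low- and high-vol regimes

rows = []
for K in strikes:
    for T in expiries:
        for sigma in vols:
            price = american_put_binomial(S, K, T, r, q, sigma)
            # Greeks via central finite differences (bump-and-reprice);
            # theta would follow the same pattern with a bump in T.
            dS, dv = 0.01 * S, 0.01
            up = american_put_binomial(S + dS, K, T, r, q, sigma)
            dn = american_put_binomial(S - dS, K, T, r, q, sigma)
            vu = american_put_binomial(S, K, T, r, q, sigma + dv)
            vd = american_put_binomial(S, K, T, r, q, sigma - dv)
            rows.append(dict(K=K, T=T, sigma=sigma, price=price,
                             delta=(up - dn) / (2 * dS),
                             gamma=(up - 2 * price + dn) / dS**2,
                             vega=(vu - vd) / (2 * dv)))

print(pd.DataFrame(rows).round(4))
```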
Still, further prompts could have addressed each of these points.
So, using the model and iterating to a good solution would still save time and cost.
Model Weaknesses
The paper also looks into why experts preferred or rejected the deliverables produced by the AI models.

Failing to fully follow instructions was the most common reason, followed by formatting errors and then accuracy.
The paper also looked at increasing reasoning effort and improving prompts, and found that both improved win-rate performance.

Improving prompts to check deliverables for correctness, and to check layouts by rendering images, eliminated errors such as the black-square artefacts you may have come across when asking LLMs to produce PDFs.
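Purely as an illustration of that kind of prompt hardening (the paper’s actual wording is not reproduced here), a self-check suffix might look like this:

```python
# Hypothetical self-check suffix; the paper's actual prompt wording differs.
SELF_CHECK = """
Before returning the deliverable:
1. Re-read the request and confirm every instruction has been followed.
2. Verify all facts and numbers in the output.
3. Render any produced documents or slides to images and inspect the
   layout: no black squares, clipped text or broken tables.
"""

task_prompt = "..."  # the original GDPval request text would go here
prompt = task_prompt + SELF_CHECK
```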
Future work
To further improve GDPval, the OpenAI researchers plan to:
- Expand the dataset beyond the current 44 occupations and 30 tasks per occupation
- Extend the focus from self-contained knowledge work to tasks that involve extensive tacit knowledge, access to personally identifiable information, use of proprietary software tools, or communication between individuals
- Add more interactivity and contextual realism, as the current tasks are precisely defined and one-shot, while in real life it often takes effort to figure out the full context of a task and understand what to work on
- Improve grader performance, so the automated grader has fewer limitations
Summary
The conclusions, directly from the paper:

There is a great expectation that the massive investment in AI will lead to significant gains in productivity, resulting in real GDP growth.
As the name GDPval suggests, this effort to evaluate AI model performance in real-world tasks is a welcome step in measuring progress to that goal.

