GDPval is a new benchmark from OpenAI that evaluates AI models on economically valuable real-world tasks. In this article I take a look at the findings of GDPval and drill down into the Finance sector tasks.
Introduction
OpenAI recently introduced GDPval, a new evaluation that measures model performance on economically valuable real-world tasks across 44 occupations in the top 9 sectors contributing to U.S. GDP.
For each occupation, the OpenAI team worked with experienced professionals to create representative tasks that reflect their day-to-day work. Each task has two primary components: a request (often with reference files) and a deliverable work product.
Human experts then graded blindly, comparing AI-model-generated deliverables with those produced by the task writers in pairwise comparisons, ranking each deliverable as “better” than, “as good as”, or “worse” than the other.
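As a concrete illustration of how such pairwise grades roll up into a headline number, here is a minimal sketch. The three labels are the ones used above; the aggregation rule (counting “as good as” alongside “better”, averaged over reviews) is my assumption of a simple win-or-tie rate, not code from the paper.

```python
from collections import Counter

def win_or_tie_rate(grades: list[str]) -> float:
    """Fraction of blinded pairwise reviews in which the model's
    deliverable was rated 'better' than, or 'as good as', the expert's."""
    counts = Counter(grades)
    return (counts["better"] + counts["as good as"]) / len(grades)

# Hypothetical grades from five blinded expert reviews of a single task
grades = ["better", "worse", "as good as", "worse", "better"]
print(f"{win_or_tie_rate(grades):.0%}")  # 60%
```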
And the results?

The top model produced deliverables rated as good as or better than the human experts’ in just under half the tasks.
Which is hugely impressive.
And model performance is increasing roughly linearly over time, as the chart below shows, from GPT-4o through to GPT-5 and an internal, experimental model.

(OpenAI’s introduction to GDPval is here, and the full paper here).
Advantages
GDPval differs from existing AI model evaluations, which are generally in the style of an academic test (see my blog on LLM Comparisons – Benchmarks and Leaderboards), in a number of ways:
- Realistic: tasks are based on actual work products from industry experts
- Representative: breadth across the U.S. economy, covering 9 sectors and 44 occupations that collectively earn $3 trillion
- Multi-modal: tasks involve manipulation of a variety of file formats and computer use
- Subjective: expert graders consider correctness as well as subjective factors, e.g. style, aesthetics and relevance
- Long-horizon: tasks required an average of 7 hours for an expert to complete
In total there are 1,320 tasks in the full GDPval set and each task (request and work product) received an average of 5 human reviews.
Clearly, constructing and running the evaluation was an expensive exercise, particularly given the industry-expert graders, who were compensated for their time.
Open-Sourcing
OpenAI has open-sourced a gold subset of 220 tasks; the dataset (prompts and reference files) is available at evals.openai.com.
There is also an experimental automated grader.
Let’s take a detailed look at a few of the tasks in the Finance and Insurance sector.
Finance sector
The Finance and Insurance sector included the following occupations:
- Financial and Investment Analysts
- Securities, Commodities and Financial Services Sales Agents
- Customer Service Representatives
- Personal Financial Advisors
- Financial Managers
Each occupation has a number of tasks, 25 in total in the gold subset.
Let’s look at a few of these, to get a sense of their “real-world” nature.

Interestingly, the prompt is a task for a Quant, with clear deliverables requiring the creation of Python code, visualisations and a summary of key findings, with the goal of determining the most appropriate pricing method for American-style options.
By no means a simple task, and one that would take a real Quant a few days of work.
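To make the task concrete, here is a minimal sketch of just one of the three methods a Quant would compare: a Cox-Ross-Rubinstein binomial tree for an American put. The parameter values at the bottom are illustrative, not taken from the task’s reference files.

```python
import numpy as np

def american_put_binomial(S, K, T, r, q, sigma, steps=500):
    """Price an American put on a Cox-Ross-Rubinstein binomial tree."""
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))            # up move
    d = 1.0 / u                                # down move
    p = (np.exp((r - q) * dt) - d) / (u - d)   # risk-neutral up probability
    disc = np.exp(-r * dt)

    # Option values at expiry for every terminal node
    j = np.arange(steps + 1)
    value = np.maximum(K - S * u**j * d**(steps - j), 0.0)

    # Roll back through the tree, checking early exercise at each node
    for i in range(steps - 1, -1, -1):
        value = disc * (p * value[1:] + (1 - p) * value[:-1])
        j = np.arange(i + 1)
        value = np.maximum(value, K - S * u**j * d**(i - j))
    return value[0]

# Illustrative parameters only, not from the task's reference files
print(american_put_binomial(S=100, K=100, T=1.0, r=0.05, q=0.0, sigma=0.20))
```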
Let’s look at another task.

This one requires construction of a correlation matrix in Excel using historical return series of MSCI indices and then an analysis document with conclusions on why and how to diversify exposure.
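The task itself asks for Excel, but to give a sense of the underlying calculation, here is a pandas equivalent; the filename msci_levels.xlsx and the column layout are my assumptions, not the task’s actual reference files.

```python
import pandas as pd

# Hypothetical file: monthly index levels, one column per MSCI index
levels = pd.read_excel("msci_levels.xlsx", index_col=0, parse_dates=True)

returns = levels.pct_change().dropna()   # simple monthly returns
corr = returns.corr()                    # pairwise correlation matrix

corr.to_excel("msci_correlations.xlsx")  # hand back to Excel for the write-up
print(corr.round(2))
```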
And one more, again for a Quant (my own bias here):

This task also includes reference files to use and requests a trading and sales strategy for the energy market focusing on oil and gas for 1H 2025.
Non-trivial indeed and many days of work for an expert in this field.
Time and Cost
Which brings me on to another aspect: the time taken (speed) and the cost of getting to a quality deliverable. The full paper covers this below:


These show a 1.25x improvement in speed and a 1.5x improvement in cost for o3, while GPT-5 is close to 1.5x for both speed and cost.
To see this in action for myself, I took the first prompt from above, “You are a Quantitative Researcher at a proprietary trading firm. Historically, your desk has….” and entered it into Gemini:

And in less than a minute I had an output with python code and a final recommendation.

While I would need to check this recommendation with a human quant, it seems good to me:
- The code includes all three methods (binomial, FDM, Monte Carlo) with charts, tables of results and performance comparisons.
- However, it only has one set of input values for the option parameters (S, K, T, r, q, sigma), when I would have expected many sets from a real quant analysis, covering a range of strikes (ATM, ITM, OTM), a range of expiries, and both low-vol and high-vol regimes.
- In addition to price, the three methods should also have been evaluated on the calculation of Greeks (delta, gamma, vega, theta); see the sketch after this list.
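To illustrate the kind of broader analysis I had in mind, here is a sketch that reuses the hypothetical american_put_binomial pricer from the earlier example and adds bump-and-reprice Greeks; the grids of strikes, expiries and vols are illustrative.

```python
import pandas as pd

# Reuses american_put_binomial() from the earlier sketch.
S, r, q = 100.0, 0.05, 0.0
strikes = [80.0, 100.0, 120.0]   # OTM / ATM / ITM for a put
expiries = [0.25, 1.0, 2.0]      # years
vols = [0.10, 0.40]              # low- and high-vol regimes

rows = []
for K in strikes:
    for T in expiries:
        for sigma in vols:
            price = american_put_binomial(S, K, T, r, q, sigma)
            # Greeks via central finite differences (bump-and-reprice);
            # theta would follow the same pattern with a bump in T.
            dS, dv = 0.01 * S, 0.01
            up = american_put_binomial(S + dS, K, T, r, q, sigma)
            dn = american_put_binomial(S - dS, K, T, r, q, sigma)
            vu = american_put_binomial(S, K, T, r, q, sigma + dv)
            vd = american_put_binomial(S, K, T, r, q, sigma - dv)
            rows.append(dict(K=K, T=T, sigma=sigma, price=price,
                             delta=(up - dn) / (2 * dS),
                             gamma=(up - 2 * price + dn) / dS**2,
                             vega=(vu - vd) / (2 * dv)))

print(pd.DataFrame(rows).round(4))
```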
Still, further prompts could have addressed each of these points.
So, using the model and iterating to a good solution would still save time and cost.
Model Weaknesses
The paper also looks into why experts preferred or rejected the deliverables produced by the AI models.

Failing to fully follow instructions was the most common reason, followed by formatting errors and then accuracy.
The paper also looked at increasing reasoning effort and improving prompts, and found that both improved win-rate performance.

Improving prompts to check deliverables for correctness, and to check layouts by rendering images, eliminated errors such as the black-square artefacts you may have come across when asking LLMs to produce PDFs.
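Purely as an illustration of that kind of prompt hardening (the paper’s actual wording is not reproduced here), a self-check suffix might look like this:

```python
# Hypothetical self-check suffix; the paper's actual prompt wording differs.
SELF_CHECK = """
Before returning the deliverable:
1. Re-read the request and confirm every instruction has been followed.
2. Verify all facts and numbers in the output.
3. Render any produced documents or slides to images and inspect the
   layout: no black squares, clipped text or broken tables.
"""

task_prompt = "..."  # the original GDPval request text would go here
prompt = task_prompt + SELF_CHECK
```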
Future work
To further improve GDPval, the OpenAI researchers plan to:
- Expand the dataset beyond the current 44 occupations and 30 tasks per occupation
- Extend the focus from self-contained knowledge work to tasks that involve extensive tacit knowledge, access to personally identifiable information, use of proprietary software tools, or communication between individuals
- Add more interactivity and contextual realism, as the current tasks are precisely defined and one-shot, while in real life it often takes effort to figure out the full context of a task and understand what to work on
- Improve grader performance, so the automated grader has fewer limitations
Summary
The conclusions, directly from the paper:

There is a great expectation that the massive investment in AI will lead to significant gains in productivity, resulting in real GDP growth.
As the name GDPval suggests, this effort to evaluate AI model performance in real-world tasks is a welcome step in measuring progress to that goal.

