diff --git a/_blog.yml b/_blog.yml
index 1c971912c9..38830b5764 100644
--- a/_blog.yml
+++ b/_blog.yml
@@ -5437,3 +5437,15 @@
- aiart
- ai art
- community
+
+- local: dabstep
+ title: "DABStep: Data Agent Benchmark for Multi-step Reasoning"
+ thumbnail: /blog/assets/dabstep/thumbnail.png
+ author: eggie5
+ guest: True
+ date: Feb 4, 2025
+ tags:
+ - llms
+ - reasoning
+ - research
+ - evaluation
\ No newline at end of file
diff --git a/assets/dabstep/thumbnail.png b/assets/dabstep/thumbnail.png
new file mode 100644
index 0000000000..11d9daef9f
Binary files /dev/null and b/assets/dabstep/thumbnail.png differ
diff --git a/dabstep.md b/dabstep.md
new file mode 100644
index 0000000000..28505a104a
--- /dev/null
+++ b/dabstep.md
@@ -0,0 +1,236 @@
+---
+title: "DABStep: Data Agent Benchmark for Multi-step Reasoning"
+thumbnail: /blog/assets/dabstep/thumbnail.png
+authors:
+- user: eggie5
+ guest: True
+- user: martinigoyanes
+ guest: True
+- user: Friso Kingma
+ guest: True
+- user: andreu-adyen
+ guest: True
+- user: lvwerra
+- user: thomwolf
+---
+
+# DABStep: Data Agent Benchmark for Multi-step Reasoning
+
+Language models are becoming increasingly capable and can solve tasks autonomously as agents. There are many exciting use cases, especially at the intersection of reasoning, code, and data. However, proper evaluation benchmarks on real-world problems are lacking, which hinders progress in the field.
+
+To tackle this challenge, Adyen and Hugging Face built the Data Agent Benchmark for Multi-step Reasoning (DABstep) together. DABstep consists of over 450 data analysis tasks designed to evaluate the capabilities of state-of-the-art LLMs and AI agents.
+
+> *Our findings reveal that DABstep presents a significant challenge for current AI models, with the most capable reasoning-based agents achieving only 16% accuracy, highlighting significant progress to be made in the field.*
+
+
+DABstep requires AI models to:
+
+ - dive into the details of the data and be rigorous (no hallucinations)
+ - reason over free-form text and databases
+ - connect with real-life use cases (not just math or code)
+
+In this blog post, we'll cover the design and construction of the benchmark, explore evaluation results, and discuss the significant gap between what current models can do and what is needed to solve complex data analysis tasks effectively.
+
+## Motivation
+
+Data analysis is both an art and a science that requires technical skill, domain knowledge and creativity, and thus, it’s rarely straightforward. Even seasoned data analysts face challenges like:
+
+- **Simple but time-consuming tasks**: The sheer volume of even simple tasks often turns straightforward analysis into hours of repetitive work.
+- **Complex context and high cognitive load**: Some tasks require analysts to juggle intricate domain-specific knowledge, making them both time-intensive and mentally draining. For example, (1) reading distributed, nested, and complicated documentation; (2) analyzing data; (3) reasoning over results; and finally, providing recommendations that will steer the direction of the business.
+- **Technical acumen**: Analyzing data can be a simple task provided the data is highly available, of high quality, and ready to serve. Unfortunately, this is rarely the case, and analysts need technical depth to create pipelines that consume, transform, and serve data. Data analysts often take on tasks that formally belong to data engineering.
+
+At companies like Adyen, analysts tackle a spectrum of problems, from routine queries to complex workflows requiring creativity, precision, and iterative reasoning. Access to a capable data analysis agent that can automate simple and repetitive tasks and assist with complex tasks would allow analysts to work faster, reduce mental strain, and focus on solving more impactful problems. That would be a pivotal moment for many industries that need data analysis and insights, such as finance.
+
+Recent advancements in *agentic workflows* — where LLMs equipped with tools independently execute multi-step tasks — have shown tremendous promise across domains like coding, [open QA](https://openai.com/index/introducing-deep-research/), [software engineering](https://www.swebench.com), and even [Kaggle competitions](https://openai.com/index/mle-bench/). These systems aren’t just theoretical; they've been driving real-world productivity gains.
+
+So, the question becomes: **Can agentic workflows reshape the way we approach data analysis?**
+
+## Introducing DABstep
+
+Progress in machine learning is fueled by high quality benchmarks that yield reliable progress signals. Thus, we are excited to introduce the Data Agent Benchmark for Multi-step Reasoning (DABstep), a new benchmark for evaluating and advancing agentic workflows in data analysis.
+
+Here's what makes DABstep unique:
+
+ - **Real-world use cases**: Built on 450+ real-world tasks extracted from Adyen’s actual workloads. These tasks are not synthetic toy problems; they reflect challenges analysts face daily, setting DABstep apart from other benchmarks like DS-1000 or DS Bench [^1].
+ - **Balancing structured and unstructured data**: These tasks require advanced data analysis skills to navigate structured data and understand multiple datasets and documents captured in unstructured data.
+ - **Simple setup**: Unlike benchmarks such as SWE-bench or MLE-bench, which require complex configurations, DABstep is simple to use. Generating answers with a model only requires access to a code execution environment, and participants can submit answers directly to a leaderboard for automatic evaluation (a minimal loading sketch follows this list).
+ - **Factoid evaluation**: Tasks have been designed to be evaluated objectively, and as such, the evaluation of the task output will always map to a binary outcome, right or wrong, without interpretation.
+ - **Multi-step complexity**: DABstep tests systems across a spectrum of analytical tasks, from routine queries to multi-step, iterative workflows. Unlike benchmarks focused on isolated questions, DABstep challenges models to engage in end-to-end agentic reasoning across diverse, practical tasks.
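+
+As a rough sketch of how getting started might look, the snippet below loads the benchmark tasks with the `datasets` library. The dataset ID, config name, split name, and field names here are assumptions for illustration only; refer to the official leaderboard and dataset card for the exact identifiers.
+
+```python
+# A minimal sketch: load DABstep tasks from the Hugging Face Hub.
+# NOTE: the dataset ID, config, split, and field names below are assumptions
+# for illustration; check the dataset card for the real ones.
+from datasets import load_dataset
+
+tasks = load_dataset("adyen/DABstep", "tasks", split="default")
+
+for task in tasks.select(range(3)):
+    print(task["question"])    # the challenge posed to the agent
+    print(task["level"])       # difficulty: easy or hard
+    print(task["guidelines"])  # how to format the answer for factoid evaluation
+```
+
+Answers can then be collected, for example as one task/answer pair per task, and submitted to the leaderboard, which scores them automatically against the withheld reference answers.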
+
+How does DABstep achieve all this and remain a simple-to-run benchmark? Let's take a look at its design!
+
+## What's inside DABstep?
+
+DABstep has been designed for low-barrier usage, high-quality evaluation, and increasing difficulty levels. To this end, we are opening up the following items as part of DABstep: datasets, tasks, evals, a real-time leaderboard, and baselines.
+
+### Data
+
+One of the biggest challenges analysts must overcome when working on real-world problems is balancing domain knowledge and technical skills. To this end, DABstep contains both unstructured and structured data to measure domain knowledge and technical skills, respectively.
+
+Table 1 shows a snapshot of some of the datasets we are releasing with the benchmark.
+
+
+| Name | Description |
+| :---- | :---- |
+| payments.csv | Payments dataset of 138k (anonymized) transactions with various signals around fraud and risk use cases. |
+| payments-readme.md | Documentation for the payments dataset |
+| acquirer\_countries.csv | Table of acquiring banks and their respective countries |
+| fees.json | Extensive dataset composed of 1000 Scheme Fee structures. |
+| merchant\_category\_codes.csv | Table of Merchant Category Codes (MCCs) |
+| merchant\_data.json | Table describing merchants |
+| manual.md | In finance, business contexts are often outlined in extensive handbooks from networks, regulators, and processors. For the first version of this benchmark, we have created a markdown file (manual.md) that distills essential business knowledge into a precise yet simplified format for solving tasks accurately. |
+
+*Table 1: The benchmark is composed of various datasets used across a variety of tasks from the financial payments sector.*
+
+Some of the structured datasets include CSV and JSON files representing real-world data, such as transaction telemetry and business metadata (e.g., merchant category codes). Additionally, we have unstructured data such as documentation, lengthy manuals, and detailed handbooks that, for example, are issued by networks, regulators, and processors.
+
+All of these datasets were extracted from real-world tasks at Adyen.
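+
+To make the mix of structured and unstructured data concrete, here is a minimal exploration sketch. The file names come from Table 1; the local `./data` paths are an assumption about where the files have been downloaded, not a path provided by the benchmark.
+
+```python
+# A minimal sketch of exploring the benchmark's structured and unstructured files.
+# File names are from Table 1; the ./data directory is an assumed download location.
+import json
+
+import pandas as pd
+
+payments = pd.read_csv("data/payments.csv")
+print(payments.shape)      # ~138k anonymized transactions
+print(payments.columns)    # inspect the available signals before any analysis
+
+with open("data/fees.json") as f:
+    fees = json.load(f)    # ~1000 scheme fee structures
+print(len(fees))
+
+with open("data/manual.md") as f:
+    manual = f.read()      # unstructured domain knowledge needed to solve tasks
+print(manual[:500])        # skim the start of the manual
+```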
+
+### Tasks
+
+Based on the new datasets included in DABstep, we are releasing several tasks with increasing difficulty levels designed to test an AI agent’s accuracy.
+
+Each task contains the following items:
+
+1. A **question** that poses a challenge to the analyst.
+2. A **level** encapsulating the difficulty of the task.
+3. **Guidelines** on how to format the answer to meet the specifications of the factoid evaluation.
+
+None of the tasks can be solved with a single shot of code; in other words, they cannot be solved by reasoning alone, but rather require sequential steps of iterative problem-solving. For example, at a minimum, the agent must know which columns exist in the respective dataset before it can answer a question. This contrasts with popular benchmarks like GAIA, MATH, and SimpleQA, where many questions can be answered correctly with a single shot of code.
+
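+To build intuition for what "iterative" means here, the sketch below separates schema discovery from the actual computation. The question being answered and the column names are hypothetical placeholders, not the benchmark's actual schema.
+
+```python
+# A hypothetical two-step loop illustrating why a single shot of code is not enough.
+# Step 1: the agent first inspects data it has never seen before.
+import pandas as pd
+
+payments = pd.read_csv("data/payments.csv")
+print(payments.columns.tolist())  # the agent reads this output before writing more code
+
+# Step 2 (written only after observing the output of step 1): answer the question.
+# "card_scheme" and "has_fraudulent_dispute" are hypothetical column names.
+fraud_rate_by_scheme = payments.groupby("card_scheme")["has_fraudulent_dispute"].mean()
+print(fraud_rate_by_scheme.idxmax())
+```
+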
+Two example tasks are shown in Figure 1, and an example human-made reference solution is shown in Figure 2.
+
+| Easy Set example | Hard Set example |
+| :---- | :---- |
+| **Question:** Which card scheme had the highest average fraud rate in 2023? <br> **Guidance:** Answer must be the name of the scheme. <br> *\[LLM/Agent Loop…\]* <br> **Answer:** SwiftCharge | **Question:** For the year 2023, focusing on the merchant Crossfit Hanna, if we aimed to reduce fraudulent transactions by encouraging users to switch to a different Authorization Characteristics Indicator through incentives, which option would be the most cost-effective based on the lowest possible fees? <br> **Guidance:** Answer must be the selected ACI to incentive and the associated cost rounded to 2 decimals in this format: {card\_scheme}:{fee}. <br> *\[LLM/Agent Loop…\]* <br> **Answer:** E:346.49 |
+
+*Figure 1: On the left is an example Risk/Fraud question from the Easy Set; its solution requires referencing at least 2 data sources and 3 shots of code. On the right is an example Scheme Fees question from the Hard Set; its solution requires referencing at least 2 data sources and multiple shots of code. The included answers are shown for demonstration purposes only and are withheld from the dataset.*
+
+#### Levels
+
+The benchmark consists of two difficulty levels:
+
+- **Easy Level**: These tasks serve as warm-ups, helping to verify setups, integrations, and research direction. They typically require only a single structured dataset and minimal contextual knowledge. On average, humans achieve a 62% baseline on these tasks after 3+ hours of work, while a Llama 70B zero-shot prompt can exceed 90% accuracy (a rough sketch of such a baseline follows this list).
+- **Hard Level**: These tasks demand a more complex approach, involving multiple structured datasets and domain-specific knowledge. Unlike the easy level, they typically cannot be solved with a single-shot code generation and require multiple steps of reasoning.
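+
+For illustration, a zero-shot baseline of this kind could look roughly like the sketch below: the task's question and guidelines are packed into a single prompt and sent to an instruction-tuned model. The model ID, prompt wording, and client setup are assumptions, not the benchmark's official baseline code.
+
+```python
+# A rough sketch of a zero-shot baseline (assumed setup, not the official baseline).
+from huggingface_hub import InferenceClient
+
+client = InferenceClient()
+
+def zero_shot_answer(question: str, guidelines: str, context: str) -> str:
+    # Pack everything into one prompt and ask for the final answer directly.
+    prompt = (
+        "You are a data analyst.\n\n"
+        f"Context:\n{context}\n\n"
+        f"Question: {question}\n"
+        f"Answer guidelines: {guidelines}\n"
+        "Respond with only the final answer."
+    )
+    response = client.chat_completion(
+        messages=[{"role": "user", "content": prompt}],
+        model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model ID
+        max_tokens=256,
+    )
+    return response.choices[0].message.content
+```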
+
+As an example of a multi-step reasoning problem, the following code shows a snippet of the human-made reference solution to a Hard Level task. Overall, it is broken down into four (4) sequential steps, including the development of various supporting macros. To write this solution, an agent needs specific domain knowledge and the ability to work through sequential steps of iterative reasoning.
+
+