1. AGIEval
AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
It is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as college entrance exams (e.g. SAT and China’s Gaokao), law school admission tests (e.g. LSAT), math competitions, lawyer qualification tests, and national civil service exams.
The goal of AGIEval is to provide a more meaningful and robust evaluation of how well AI foundation models can handle complex, real-world tasks that require human-like understanding, knowledge, reasoning and decision-making. This contrasts with traditional AI benchmarks that often rely on artificial datasets and do not accurately represent the full scope of human-level capabilities.
By evaluating models on exams and problems originally designed for humans, AGIEval aims to better assess the progress of AI systems towards artificial general intelligence (AGI).
2. MMLU (Massive Multitask Language Understanding)
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate the multitask accuracy and knowledge of large language models across a wide range of academic and professional domains. It consists of approximately 16,000 multiple-choice questions spanning 57 diverse subjects including STEM fields, humanities, social sciences, law, medicine and more.
The primary goal of MMLU is to test both the depth and breadth of knowledge acquired by AI models during pretraining, focusing on their ability to understand and solve problems in various areas without requiring task-specific fine-tuning. This zero-shot and few-shot evaluation approach makes MMLU more challenging and akin to how humans are assessed on their multidisciplinary knowledge.
A vital feature of MMLU is its emphasis on granularity – the subjects range from elementary-level topics to advanced professional material, allowing fine-grained analysis of a model’s strengths and weaknesses. The benchmark is carefully curated to include questions that test not just factual knowledge but also critical thinking and reasoning skills essential for each domain.
MMLU enables fair comparison of different language models and identification of areas for improvement. Recent large models like GPT-4, Claude 3 Opus and Gemini Ultra have made significant progress on MMLU, scoring in the high 80s and approaching the estimated human expert-level accuracy of roughly 90%. Notably, even state-of-the-art models struggle with subjects such as professional law and moral scenarios, highlighting the need for further advancements.
3. BIG-Bench Hard (BBH)
BIG-Bench Hard (BBH) is a carefully curated subset of 23 particularly challenging tasks (comprising 27 subtasks) selected from the larger BIG-Bench benchmark. These tasks were chosen because no prior language model evaluated on BIG-Bench was able to outperform the average human-rater score on them.
The primary goal of BBH is to focus on the current limitations and failure modes of large language models, highlighting areas where they still fall short of human-level performance. By concentrating on this set of difficult tasks, BBH aims to push the boundaries of model capabilities and steer research towards addressing these shortcomings.
Many of the tasks in BBH require complex reasoning, domain expertise, and multi-step problem-solving – capabilities that are essential for general intelligence but have proven challenging for AI systems. The task diversity in BBH spans traditional NLP, mathematics, common sense reasoning, question-answering, and more.
Interestingly, while standard “answer-only” prompting on BBH tasks underperformed the average human rater, researchers found that “chain-of-thought” (CoT) prompting enabled models like Codex and PaLM to surpass the average human-rater score on many BBH tasks. This suggests that some of the apparent limitations exposed by BBH are a matter of prompting rather than of inherent model shortcomings.
Nevertheless, BBH remains a valuable benchmark for advancing language model performance on the most challenging language tasks. By tracking progress on this difficult subset, researchers can identify key areas for improvement and develop new techniques to close the gap between machine and human language understanding.
4. ANLI (Adversarial Natural Language Inference)
ANLI (Adversarial Natural Language Inference) is a large-scale benchmark dataset designed to test the performance of state-of-the-art models on the challenging task of natural language inference. It was created using an iterative, adversarial human-and-model-in-the-loop approach to collect examples that are particularly difficult for AI systems.
The idea behind ANLI is to dynamically create a dataset where humans actively try to find examples that fool the best current models, thus pushing the boundaries of model performance. This contrasts with static datasets like SNLI and MNLI, which can saturate as models become increasingly sophisticated.
ANLI was collected over three rounds, each targeting a progressively stronger model adversary. In each round, human annotators were tasked with writing “hypotheses” that would trick the target model, given a “premise” and a desired inference label (entailment, contradiction, or neutral). If the model misclassified the example, it was validated by other annotators and added to the dataset.
The rounds differed in the complexity of the target models and the diversity of the premise sources. Round 1 used a BERT-Large model and premises from Wikipedia, while later rounds incorporated stronger RoBERTa ensembles and premises from various domains. This iterative process resulted in a dataset of roughly 169,000 examples in total, 162,865 of them for training, that are highly challenging for NLI models.
ANLI has emerged as a valuable benchmark for evaluating the robustness of NLI systems and their ability to handle adversarially-selected examples. Analysis of model performance on ANLI has yielded insights into the types of inferences that remain difficult, such as numerical reasoning and lexical inference.
5. HellaSwag
HellaSwag is a dataset and benchmark for evaluating commonsense natural language inference (NLI) in AI systems. It was introduced in 2019 by researchers at the University of Washington and Allen Institute for AI as a more challenging alternative to existing NLI datasets like SWAG.
The idea behind HellaSwag is to test machines on questions that are easy for humans but difficult for even state-of-the-art language models. While humans achieve over 95% accuracy on HellaSwag, the best models at the time of its release struggled to surpass 48%.
HellaSwag consists of around 70,000 multiple-choice questions about grounded situations, with contexts drawn from ActivityNet video captions and WikiHow articles. Each question presents a context and four possible continuations, only one of which is correct based on commonsense reasoning.
To make the dataset especially challenging, the incorrect choices are generated using “Adversarial Filtering” – an iterative process where a series of discriminative models select machine-generated endings that are likely to fool NLI systems while being clearly implausible to humans. This results in questions that are trivial for humans but very difficult for AI.
The name HellaSwag stands for “Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations”. It reflects the dataset’s emphasis on complex contexts, uncommon scenarios, and adversarially selected endings compared to previous benchmarks.
By posing a rigorous challenge for commonsense inference, HellaSwag has driven significant progress in NLI capabilities. While humans still outperform machines, the gap has narrowed substantially since 2019, with models like GPT-4 now approaching human-level accuracy. Nevertheless, HellaSwag remains a valuable benchmark for advancing and evaluating commonsense reasoning in AI systems.
6. ARC Challenge
ARC Challenge is the harder portion of the AI2 Reasoning Challenge (ARC), a benchmark of grade-school-level, multiple-choice science questions introduced in 2018 by researchers at the Allen Institute for AI. Despite sharing an acronym, it is distinct from François Chollet’s Abstraction and Reasoning Corpus, which tests abstract reasoning over colored grids.
The full ARC benchmark contains 7,787 natural science questions drawn from standardized tests aimed at students from roughly grade 3 to grade 9, each question typically offering four answer options. The Challenge set comprises 2,590 of these questions, selected because both a retrieval-based algorithm and a word co-occurrence algorithm answered them incorrectly.
This selection procedure filters out questions that can be solved by simple surface-level matching, so Challenge questions tend to require combining scientific knowledge with reasoning: multi-step inference, understanding of processes and definitions, or applying principles to unfamiliar situations.
Alongside the questions, ARC provides a corpus of roughly 14 million science-related sentences that systems can use as supporting knowledge, although models are free to draw on other sources.
At the time of release, none of the baseline systems performed much better than random chance on the Challenge set, even though they did well on the accompanying Easy set. Modern large language models score far higher, and ARC Challenge has become a standard component of LLM evaluation suites for measuring scientific knowledge and reasoning.
7. ARC Easy
ARC Easy is the companion set to ARC Challenge within the AI2 Reasoning Challenge (ARC), the grade-school science question benchmark released by the Allen Institute for AI in 2018.
Like the Challenge set, ARC Easy consists of multiple-choice natural science questions drawn from standardized tests. It contains 5,197 questions, roughly twice the size of the 2,590-question Challenge set.
The defining difference lies in how the questions were partitioned: questions answered correctly by either a simple retrieval-based algorithm or a word co-occurrence algorithm were placed in the Easy set, while the remainder formed the Challenge set. As a result, Easy questions can more often be solved through factual recall or surface-level matching between the question and relevant text.
Language models accordingly score considerably higher on ARC Easy than on ARC Challenge, and recent large models exceed 90% accuracy on it.
Even so, ARC Easy remains useful as a broad check of basic scientific knowledge and reading comprehension, and it is routinely reported alongside ARC Challenge in LLM evaluation suites.
Together, the two sets let researchers separate progress on straightforward knowledge retrieval from progress on questions that demand deeper reasoning.
8. BoolQ
BoolQ is a question answering dataset for yes/no questions, containing 15,942 examples. The main feature of BoolQ is that the questions are naturally occurring, meaning they are generated in unprompted and unconstrained settings, rather than being artificially created by annotators.
Each example in BoolQ is a triplet of (question, passage, answer), with the title of the source Wikipedia page provided as optional additional context. The questions cover a wide range of topics and require reading comprehension skills to answer based on the given passage.
The dataset is designed for a text-pair classification setup, similar to existing natural language inference (NLI) tasks. However, the naturally sourced questions in BoolQ tend to be more challenging than traditional NLI datasets, often involving complex reasoning and entailment-like inference.
BoolQ was introduced by researchers at Google in 2019 as a benchmark to test the reading comprehension and inference capabilities of question answering systems. The full dataset contains 9,427 training examples, 3,270 validation examples, and 3,245 test examples.
BoolQ has become a popular benchmark in the NLP community, frequently used to evaluate and compare question answering models. It is released under the Creative Commons Attribution-ShareAlike 3.0 license and has been integrated into various machine learning platforms and tools.
9. CommonsenseQA
CommonsenseQA is a question answering dataset that targets commonsense knowledge, containing 12,247 multiple-choice questions. The goal of CommonsenseQA is to test the ability of AI systems to reason about everyday situations and events using general background knowledge, rather than just relying on a specific given context.
The questions in CommonsenseQA are created by extracting concepts from ConceptNet, a knowledge graph of commonsense facts. For each question, a source concept is chosen, and several target concepts that have the same semantic relation to the source are selected. Crowdworkers are then asked to write questions that mention the source concept and require commonsense reasoning to discriminate between the target concepts as possible answers.
This process results in questions that tap into complex reasoning and background knowledge, going beyond simple factual or associative information. The questions cover a wide range of commonsense topics, such as object properties, causes and effects, human behaviors, and social conventions.
CommonsenseQA serves as a challenging benchmark to evaluate the commonsense reasoning capabilities of AI systems. While humans achieve around 89% accuracy on the dataset, the best AI models at the time of the dataset’s release in 2019 only reached 56% accuracy, highlighting a significant gap between human and machine commonsense.
The full dataset contains 12,247 questions split into a training set (9,741), a validation set (1,221) and a test set (1,140). Each question has one correct answer and four distractor answers. The dataset is provided under two different split settings – a random split and a question concept split.
By posing a difficult challenge for AI systems, CommonsenseQA has spurred research into improving machines’ commonsense reasoning abilities and bringing them closer to human-level understanding of everyday concepts and situations. The dataset is widely used to benchmark progress on this key aspect of general intelligence.
10. MedQA
MedQA is a large-scale, open-domain question answering dataset designed to test the ability of AI systems to solve real-world medical problems. It consists of multiple-choice questions collected from professional medical board exams such as the United States Medical Licensing Examination (USMLE).
The dataset covers three languages: English, simplified Chinese, and traditional Chinese. It contains 12,723 questions in English, 34,251 in simplified Chinese, and 14,123 in traditional Chinese. Each question is accompanied by several answer options, exactly one of which is correct.
A main feature of MedQA is that the questions are posed as free-form text, rather than being limited to a specific format or knowledge domain. This makes the task more challenging and closer to real-world medical question answering, where questions can be asked in many different ways.
To help AI systems answer the questions, MedQA also includes a large corpus of medical textbooks and other reference materials. The goal is for machine learning models to leverage this background knowledge to find the correct answers, similar to how a human medical professional would.
MedQA was introduced to spur research on open-domain question answering in the medical field. Answering medical exam questions requires a deep understanding of complex medical concepts and the ability to reason over multiple pieces of information. As such, MedQA serves as a challenging benchmark to evaluate the reading comprehension and inference capabilities of AI systems in the medical domain.
11. OpenBookQA
OpenBookQA is a question answering dataset that aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an “open book”, also provided with the dataset) and the language it is expressed in.
The dataset contains 5,957 multiple-choice elementary-level science questions, split into 4,957 train, 500 dev, and 500 test questions. Each question is accompanied by 4 answer options, of which exactly one is correct.
A main feature of OpenBookQA is that the questions are designed to require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. This contrasts with many existing QA datasets where questions can be answered using retrieval or fact-matching techniques.
To help with answering the questions, OpenBookQA also provides an “open book” of 1,326 core science facts. For the training set, a mapping is given from each question to the core fact it was designed to probe. However, answering the questions requires additional broad common knowledge not contained in the book.
In addition to the questions, OpenBookQA includes a collection of 5,167 crowd-sourced common knowledge facts. An expanded version of the dataset provides additional information for each question, including the originating core fact, a human accuracy score, a clarity score, and an anonymized crowd-worker ID.
The questions in OpenBookQA are designed to be answerable by humans while challenging for existing QA systems. The dataset was motivated by the idea of open book exams for assessing human understanding of a subject. By requiring multi-hop reasoning and common sense, OpenBookQA aims to push the boundaries of machine reading comprehension and question-answering.
Experiments have shown that both retrieval-based and co-occurrence based algorithms perform poorly on OpenBookQA, highlighting the need for more sophisticated reasoning techniques. The dataset has become a popular benchmark for evaluating models’ reasoning and common sense abilities.
12. PIQA (Physical Interaction: Question Answering)
PIQA (Physical Interaction: Question Answering) is a question answering dataset that focuses on testing physical commonsense reasoning in natural language. It was introduced by researchers at the Allen Institute for AI, University of Washington, Microsoft Research, and Carnegie Mellon University in 2019.
The idea behind PIQA is to probe whether AI systems can reason about the physical world in the way humans do, by asking questions about everyday situations and interactions with objects. For example, “To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?”
PIQA contains 16,000 training examples, 2,000 development examples, and 3,000 test examples. Each example consists of a goal (the question), two solutions (possible answers), and a label indicating the correct solution. The dataset is set up as a binary choice task, where models must select the most appropriate solution based on physical commonsense reasoning.
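The sketch below shows one common way such binary-choice benchmarks are scored with a causal language model: compare the log-likelihood each candidate solution receives given the goal as context, and pick the higher-scoring one. It is a minimal illustration, not PIQA’s official evaluation code; it assumes the Hugging Face transformers library and uses the small gpt2 model purely as a stand-in.

```python
# Minimal sketch of likelihood-based scoring for a binary-choice task such as PIQA.
# Assumes the Hugging Face transformers library; "gpt2" is just an illustrative model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def solution_logprob(goal: str, solution: str) -> float:
    """Sum the log-probabilities of the solution tokens, given the goal as context."""
    context_len = tokenizer(goal, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(goal + " " + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Score only the solution tokens; each token is predicted from the previous position.
    # (Simplification: assumes the goal tokenizes identically with and without the solution.)
    for pos in range(context_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

goal = "To apply eyeshadow without a brush,"
solutions = ["use a cotton swab.", "use a toothpick."]
scores = [solution_logprob(goal, s) for s in solutions]
print("chosen solution:", solutions[scores.index(max(scores))])
```

A real evaluation harness would batch these computations and often length-normalize the scores, but the underlying idea is the same.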
The questions in PIQA cover a wide range of physical phenomena and commonsense knowledge, such as object properties, affordances, and plausible interactions. They often focus on atypical uses of objects in addition to their prototypical uses. The dataset was inspired by instructables.com, a website containing crowdsourced instructions for various everyday tasks.
A key challenge PIQA highlights is reporting bias in natural language data: knowledge about the physical world is often implicit or left unstated in text. This makes it difficult for AI systems to learn reliable physical commonsense from reading text alone, without embodied interaction.
Indeed, while humans find PIQA relatively easy (95% accuracy), large pretrained language models struggle on this task (around 75% accuracy). Analysis of model performance on PIQA has yielded insights into the types of physical knowledge that AI systems currently lack, such as understanding object affordances and commonsense interactions.
PIQA has become an important benchmark for evaluating and advancing physical commonsense reasoning capabilities in natural language AI. It complements other commonsense reasoning datasets by focusing specifically on intuitive physics and physical interactions. Improving performance on PIQA is seen as a key challenge on the path to building AI systems with more general, human-like intelligence.
13. Social IQA
Social IQA (Social Interaction QA) is a large-scale question-answering benchmark dataset designed to test social and emotional intelligence in AI systems. It was introduced in 2019 by researchers from the Allen Institute for AI, University of Washington, and other institutions.
The idea behind Social IQA is to probe machines’ commonsense reasoning about social situations and interactions, an important aspect of human intelligence that is often overlooked in other QA benchmarks. The dataset contains 38,000 multiple-choice questions about a wide variety of everyday social scenarios.
Each question in Social IQA is accompanied by a context describing a social situation, and three possible answers. The task is framed as a multiple-choice challenge – models must select the correct answer that demonstrates an understanding of the social dynamics and implications of the situation.
For example, given a context like “Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy”, models are asked questions like “Why did Jordan do this?”. The correct answer “Make sure no one else could hear” requires inferring Jordan’s intent from the physical action of leaning in, while incorrect choices might describe irrelevant or implausible motivations.
The questions in Social IQA cover various social and emotional concepts, such as motivations, emotional reactions, social perceptions, and pragmatic inferences. They are authored through crowdsourcing, using a novel framework to collect incorrect answers that mitigates stylistic artifacts arising from cognitive biases.
Social IQA is split into training (33,410 questions), validation (1,954 questions) and test (2,224 questions) sets. Human accuracy on the benchmark is roughly 87%, while the best models at the time of its release scored in the mid-60s, revealing a significant gap in social reasoning capabilities.
Social IQA has also been established as a useful resource for transfer learning, with models fine-tuned on the dataset showing improved performance on other social reasoning tasks like the Winograd Schema Challenge. This highlights the importance of social intelligence as a key aspect of general AI.
14. TruthfulQA (MC2)
TruthfulQA (MC2) refers to a multiple-choice version of the TruthfulQA benchmark, which measures whether a language model can discern truthful answers from false ones. In the MC2 setup, each question is presented with several true and several false reference answers, and the model is judged on how much probability it assigns to the true ones.
The main difference between MC2 and the standard (generative) TruthfulQA task is that MC2 tests a model’s ability to recognize truthful statements rather than generate them. This makes evaluation more straightforward, since scores can be computed directly from the model’s probabilities over labeled answers, without the need for human judgments or similarity metrics.
MC2 uses the same set of 817 questions as the original TruthfulQA benchmark, spanning 38 categories including health, law, finance, and politics. For each question, there are multiple candidate answers, some of which are true and some of which are false. The false answers generally reflect common misconceptions or falsehoods that humans might believe.
To score well on MC2, a model must be able to distinguish true statements from plausible-sounding but false ones. This requires not just factual knowledge, but also the ability to spot and avoid the kinds of misconceptions that are prevalent in human-written text on the internet.
The MC2 version of TruthfulQA is also known as the “multi-true” variant, in contrast to the “single-true” MC1 variant where there is only one correct answer per question. Having multiple true answers per question makes MC2 somewhat more challenging, as the model must consider each choice independently rather than simply picking the best one.
Performance on MC2 is computed from the probabilities a model assigns to the candidate answers: for each question, the score is the total probability mass placed on the true answers, normalized by the mass placed on all candidates, and these per-question scores are averaged across the benchmark. A model’s MC2 score thus measures its overall truthfulness and resistance to common false beliefs, which matters for ensuring that language models provide reliable, factual information to users.
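As a sketch of that calculation, with made-up probabilities standing in for the (typically length-normalized) likelihoods a model assigns to each candidate answer:

```python
# Minimal sketch of an MC2-style score; the probabilities below are invented for illustration.
def mc2_score(true_answer_probs, false_answer_probs):
    """Normalized probability mass assigned to the true reference answers."""
    total_true = sum(true_answer_probs)
    total_false = sum(false_answer_probs)
    return total_true / (total_true + total_false)

# Hypothetical per-answer probabilities for one question: two true and two false candidates.
print(mc2_score(true_answer_probs=[0.30, 0.10], false_answer_probs=[0.45, 0.15]))  # 0.4
```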
Although easier to evaluate than the generation version of TruthfulQA, MC2 is still a challenging benchmark, with the best models achieving significantly lower accuracy than humans. Improving MC2 performance is an important goal for making language models more truthful and trustworthy.
15. WinoGrande
WinoGrande is a large-scale dataset of 44,000 problems designed to test commonsense reasoning in AI systems, particularly in the context of pronoun resolution. It was introduced in 2019 by researchers from the Allen Institute for AI and the University of Washington.
WinoGrande is inspired by the classic Winograd Schema Challenge (WSC), but adjusted to improve both the scale and the difficulty of the task. Like WSC, each WinoGrande problem is a sentence containing an ambiguous reference, formulated as a fill-in-the-blank question with two candidate answers.
For example: “The trophy doesn’t fit in the suitcase because _ is too large. Option 1: the trophy, Option 2: the suitcase”. Answering correctly (“the trophy”) requires commonsense knowledge about the typical sizes of objects.
To construct WinoGrande, the authors employed a carefully designed crowdsourcing procedure to collect sentence-question pairs, followed by a novel adversarial filtering algorithm called AFLite to reduce dataset-specific biases and ensure high quality.
The resulting dataset contains 44,000 problems, split into training sets of varying sizes (160 to 40,398 examples), a validation set (1,267 examples), and a test set (1,767 examples). The different training set sizes allow for evaluating model performance under different data availability conditions.
WinoGrande problems cover a wide range of commonsense reasoning scenarios, including social contexts, physical properties, temporal ordering, and abstract ideas. Solving them requires deep language understanding and inference abilities that go beyond surface-level cues.
In addition to its value as a standalone benchmark, WinoGrande has proven useful as a resource for transfer learning, with models fine-tuned on WinoGrande showing improved performance on related tasks like WSC, DPR, and COPA. This highlights the importance of WinoGrande in driving progress on commonsense reasoning.
16. TriviaQA
TriviaQA is a large-scale reading comprehension dataset containing over 650,000 question-answer-evidence triples. It was introduced in 2017 by researchers from the University of Washington and the Allen Institute for AI.
The main feature of TriviaQA is that it includes 95,000 question-answer pairs authored by trivia enthusiasts, along with independently gathered evidence documents (around six per question on average) that provide high quality distant supervision for answering the questions. This distinguishes TriviaQA from other reading comprehension datasets that rely on crowdsourced or automatically generated questions.
The evidence documents in TriviaQA come from two sources: Wikipedia articles and Web search results. For each question, relevant Wikipedia pages were identified using an entity linking approach, while Web search results were obtained by querying the question text and filtering out trivia websites.
TriviaQA is designed to be a challenging benchmark for reading comprehension systems. Compared to other datasets, TriviaQA questions are more compositional, have greater syntactic and lexical variability between the question and answer-evidence sentences, and require more cross-sentence reasoning to solve. Human performance on TriviaQA is around 80%, while state-of-the-art models at the time of the dataset’s release achieved only 40% accuracy.
The full dataset contains over 650,000 question-answer-evidence triples built from the 95,000 question-answer pairs. It is organized into a Wikipedia domain (with Wikipedia evidence) and a Web domain (with web search evidence), each with its own training, development, and test splits.
TriviaQA has been widely adopted as a benchmark in the QA community and has driven significant advances in reading comprehension models. The dataset is available under the Apache 2.0 open-source license, making it easy for researchers to use and build upon.
17. GSM8K Chain of Thought
GSM8K Chain of Thought is a benchmark for testing the mathematical reasoning and problem-solving abilities of large language models. The underlying GSM8K dataset was introduced in 2021 by researchers at OpenAI, and the “chain of thought” name refers to evaluating models by having them write out intermediate reasoning steps before giving a final answer.
The main idea behind GSM8K Chain of Thought is to evaluate a model’s ability to generate a coherent chain of thought or solution steps for solving grade school math word problems. This contrasts with traditional math benchmarks that only assess the final answer, without considering the reasoning process.
The dataset consists of roughly 8,500 high-quality, linguistically diverse grade school math word problems (about 7,500 for training and 1,000 for testing), along with reference solutions that show the step-by-step reasoning required to arrive at the final answer. The problems take between two and eight steps to solve and rely primarily on basic arithmetic operations.
To perform well on GSM8K Chain of Thought, models must be able to understand the natural language problem statement, identify the relevant mathematical concepts and operations, and generate a logically coherent sequence of solution steps that lead to the correct answer. This requires a deep understanding of both language and mathematics, as well as strong reasoning and problem-solving skills.
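As an illustration, here is a toy problem in the style of GSM8K (invented for this article, not taken from the dataset), with the chain of thought written as comments and the arithmetic checked in Python:

```python
# Toy GSM8K-style problem (invented for illustration, not from the dataset):
# "A baker makes 24 muffins. She sells 3 boxes of 4 muffins each and then bakes
#  10 more. How many muffins does she have now?"

muffins_sold = 3 * 4              # step 1: 3 boxes of 4 muffins are sold -> 12
muffins_left = 24 - muffins_sold  # step 2: 24 - 12 = 12 muffins remain
final_count = muffins_left + 10   # step 3: 12 + 10 = 22 muffins

print(final_count)  # final answer: 22
```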
The benchmark is designed to be challenging for large language models, which often struggle with multi-step arithmetic despite their impressive performance on other natural language tasks. The problems are intended to be solvable by a capable middle school student, yet models at the time of the dataset’s release solved only a minority of them, even with fine-tuning and the use of trained verifiers; chain-of-thought prompting later improved results substantially.
GSM8K has quickly become an important benchmark for evaluating the mathematical reasoning capabilities of large language models and driving progress in this area. The dataset is openly available on GitHub.
In addition to its value as a standalone benchmark, GSM8K Chain of Thought has also been used to study the emergent abilities of large language models, such as their capacity for few-shot learning and their ability to generate diverse solution strategies. This has led to new insights into the strengths and limitations of current models and has helped guide the development of more capable and robust systems.
18. HumanEval
HumanEval is a benchmark dataset designed to evaluate the functional correctness of code generated by large language models (LLMs). It was introduced by OpenAI in 2021 as part of their research on the Codex model.
The dataset consists of 164 hand-crafted programming problems, each including a function signature, docstring, body, and several unit tests. On average, there are 7.7 tests per problem. The problems cover a range of difficulty levels, from simple string manipulation to more complex algorithms and data structures.
A main feature of HumanEval is that the problems were carefully curated to avoid overlap with the training data of code generation models like Codex. This allows for a fair evaluation of the model’s ability to generate functionally correct code based on problem descriptions, without relying on memorization.
To evaluate a model on HumanEval, the prompt (function signature and docstring) is fed into the model to generate a candidate solution. The generated code is then run against the corresponding unit tests to assess its functional correctness. The evaluation metric used is pass@k, which measures the percentage of problems for which at least one of k generated samples passes all unit tests.
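In practice, pass@k is usually computed with the unbiased estimator described in the Codex paper: generate n ≥ k samples per problem, count the number c that pass all tests, and average 1 − C(n−c, k)/C(n, k) over problems. A minimal sketch:

```python
# Unbiased pass@k estimator from the Codex paper: n samples per problem, c of which pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples passes all unit tests."""
    if n - c < k:  # every size-k draw must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical counts: 200 samples generated, 12 of them pass, estimating pass@10.
print(round(pass_at_k(n=200, c=12, k=10), 3))
```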
HumanEval has become a widely adopted benchmark in the code generation community, with an active leaderboard tracking the performance of various LLMs. It has played a crucial role in driving the development of more capable code generation models, such as Meta’s InCoder and DeepMind’s AlphaCode.
One limitation of HumanEval is its focus on a single programming language (Python) and a relatively small set of problems. Subsequent work has extended the dataset to other languages (HumanEval-X) and increased the number and diversity of problems.
Despite its limitations, HumanEval remains an important benchmark for assessing the functional correctness of generated code. By providing a standardized set of problems and evaluation metrics, it enables meaningful comparisons between different code generation approaches and tracks progress towards more reliable and robust LLMs for programming tasks.
19. MBPP (Mostly Basic Programming Problems)
MBPP (Mostly Basic Programming Problems) is a large-scale dataset and benchmark for evaluating the code generation capabilities of AI models, particularly on entry-level programming tasks. It was introduced by researchers at Google in 2021 as part of their work on program synthesis with large language models.
The dataset consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers. These problems cover a wide range of programming fundamentals, standard library functionality, and basic algorithms.
Each problem in MBPP includes a natural language task description, a reference solution in Python, and three automated test cases to check the functional correctness of generated code. The natural language descriptions are written to be concrete enough that a human could translate them into code without additional clarification.
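As an illustration of that setup (not the official MBPP harness; the problem and tests below are invented), a generated candidate can be executed and then checked against assert-style test cases:

```python
# Minimal sketch of checking a generated solution against assert-style tests (MBPP-style).
# The candidate code and test cases here are invented for illustration.
candidate_code = """
def remove_duplicates(items):
    seen, result = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
"""

test_cases = [
    "assert remove_duplicates([1, 2, 2, 3]) == [1, 2, 3]",
    "assert remove_duplicates([]) == []",
    "assert remove_duplicates(['a', 'a']) == ['a']",
]

namespace = {}
exec(candidate_code, namespace)  # define the candidate function
passed = 0
for test in test_cases:
    try:
        exec(test, namespace)    # run each assert against the candidate
        passed += 1
    except AssertionError:
        pass
print(f"{passed}/{len(test_cases)} tests passed")
```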
MBPP was created to enable systematic evaluation of code generation models on a diverse set of realistic programming tasks. By targeting entry-level problems, MBPP aims to assess the models’ ability to synthesize code for common programming patterns and idioms.
A main feature of MBPP is that the problems were carefully constructed to minimize overlap with the training data of existing code generation models. This helps ensure a fair evaluation of the models’ generalization capabilities, rather than their ability to memorize seen solutions.
The full MBPP dataset is openly available and has been integrated into popular machine learning platforms like Hugging Face Datasets. This has made it easy for researchers to use MBPP for evaluating and comparing different code generation approaches.
While MBPP has proven valuable, it also has some limitations. The dataset only covers Python programming, and the problems are mostly standalone functions rather than full programs. Subsequent work has extended MBPP to other programming languages and more complex coding tasks.
Despite its limitations, MBPP remains an important resource for assessing the progress of AI code generation technology. By providing a standardized benchmark of entry-level programming problems, it enables meaningful comparisons between models and helps identify areas for further research and improvement.
What Is a Large Language Model?
A large language model (LLM) is a deep learning algorithm that can understand, summarize, translate, predict and generate human-like text and other content. LLMs learn by analyzing massive text datasets, often scraped from the internet, to recognize patterns and relationships between words and concepts.
The characteristics of LLMs are:
- Trained on huge text datasets and typically have billions of parameters
- Use transformer neural network architectures to process text
- Can be adapted to many language tasks through prompt engineering or fine-tuning
- Exhibit emergent abilities like reasoning, math, and coding
Popular examples of LLMs include OpenAI’s GPT series (GPT-3, GPT-4), Google’s PaLM and Gemini, Meta’s LLaMA, and Anthropic’s Claude. These foundation models power applications like AI chatbots, writing assistants, and code generators.
How Do Large Language Models Work?
LLMs operate using three main components:
- Data: Enormous text corpora, often terabytes of text spanning hundreds of billions of tokens, used to train the model. Higher quality data leads to better language understanding.
- Architecture: Typically a transformer neural network with many layers that can process text sequences in parallel and learn contextual relationships. More parameters enable greater complexity.
- Training: Self-supervised learning on the unlabeled text data, allowing the model to acquire linguistic knowledge and reasoning abilities. Fine-tuning adapts the model to specific tasks.
By analyzing statistical patterns across vast amounts of natural language, LLMs build a probabilistic understanding of how words are used in context. This allows them to predict the most likely next word in a sequence and generate coherent text that matches the patterns in their training data.
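A minimal sketch of this next-token prediction, assuming the Hugging Face transformers library and using the small gpt2 model purely as an illustrative example:

```python
# Minimal sketch of next-token prediction with a pretrained causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most likely next tokens and their probabilities.
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  p={prob.item():.3f}")
```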
What Is A Large Language Model benchmark?
A large language model benchmark is a standardized test or set of tasks used to evaluate and compare the performance of different LLMs. Benchmarks provide an objective way to measure a model’s accuracy, capabilities, and limitations on various language understanding and generation skills.
LLM benchmarks typically consist of:
- A dataset of test examples
- Specific tasks or questions to complete
- Metrics for scoring model outputs
- Leaderboards ranking models by performance
For example, a question-answering benchmark might provide a bank of factual questions, grade models on how many they answer correctly, and rank them by accuracy score. A summarization benchmark could evaluate models on their ability to condense articles into key points using metrics like ROUGE.
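As a toy illustration of that setup (the questions and model outputs below are invented), grading a question-answering benchmark by exact-match accuracy can be as simple as:

```python
# Minimal sketch of exact-match accuracy scoring; examples and outputs are invented.
examples = [
    {"question": "What is the capital of France?", "gold": "Paris"},
    {"question": "How many legs does a spider have?", "gold": "8"},
]
model_outputs = ["Paris", "six"]

def normalize(text: str) -> str:
    return text.strip().lower()

correct = sum(
    normalize(pred) == normalize(ex["gold"])
    for pred, ex in zip(model_outputs, examples)
)
print(f"accuracy = {correct / len(examples):.2f}")  # 0.50 in this toy example
```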
Why are LLM benchmarks important?
LLM benchmarks serve several key purposes:
- Comparing models to select the best one for a use case
- Tracking progress and identifying areas for improvement
- Holding models accountable to claims and surfacing limitations
- Enabling apples-to-apples evaluation using standard tests
Without benchmarks, it would be difficult to cut through marketing hype and determine which LLMs are actually state-of-the-art. Rigorous testing ensures models are consistently getting better at providing truthful, reliable, and beneficial outputs.
How do LLM benchmarks work?
The process of benchmarking an LLM typically involves:
- Task definition: Selecting language skills to test, like question-answering, language understanding, reasoning, or generation. Tasks should be challenging but solvable.
- Data preparation: Curating a high-quality test dataset that is representative, unbiased, and unseen by the models. Data may come from real-world sources or be purposefully adversarial.
- Prompt design: Crafting instructions for querying the model and eliciting desired behaviors. Prompts are often standardized to ensure fair comparison.
- Model evaluation: Running the LLM on the test examples and measuring performance using relevant metrics. Evaluation may be automated or involve human judgment.
- Results analysis: Examining scores to identify strengths, weaknesses, and differences between models. Analysis can yield insights for improving models and benchmarks.
Benchmarks may assess models in a “zero-shot” manner, with no task examples included in the prompt, or in a “few-shot” setting, where a handful of worked examples are provided. Models are usually restricted from accessing external information during testing.
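As a small illustration (the examples are invented), the same test question can be posed zero-shot or few-shot simply by changing how the prompt is assembled:

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction; examples are invented.
test_question = "Q: What is 17 + 26?\nA:"

# Zero-shot: the model sees only the test question.
zero_shot_prompt = test_question

# Few-shot: a handful of worked examples are prepended to the test question.
few_shot_examples = [
    "Q: What is 3 + 4?\nA: 7",
    "Q: What is 12 + 9?\nA: 21",
]
few_shot_prompt = "\n\n".join(few_shot_examples + [test_question])

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```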
Performance Metrics and Evaluation Criteria of LLM Benchmarks
LLM benchmarks employ various quantitative and qualitative metrics to evaluate model performance, such as:
| Metric | Description |
|---|---|
| Accuracy | Percentage of responses exactly matching the expected answer |
| F1 Score | Harmonic mean of precision and recall |
| BLEU | Similarity between generated and reference text |
| Perplexity | How “surprised” the model is by the test data; lower is better |
| Exact Match | Binary score for whether the entire output matches the expected answer |
| ROUGE | Overlap between generated and reference summaries |
| Human Evaluation | Qualitative judgment of outputs by human raters |
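As a sketch of how two of these metrics can be computed (exact match, and token-level F1 in the style of SQuAD evaluation; the example strings are invented):

```python
# Minimal sketch of exact match and token-level F1 between a prediction and a reference.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # shared tokens with multiplicity
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "The Eiffel Tower"))        # 1.0
print(round(token_f1("a tall iron tower", "the iron tower"), 2))  # precision 0.5, recall 0.67
```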
Benchmarks often employ multiple complementary metrics to capture different aspects of model behavior. Evaluation criteria may also consider factors like:
- Truthfulness – Providing factual, reliable information
- Coherence – Staying on topic and making sense
- Fluency – Using natural, grammatical language
- Relevance – Directly addressing the prompt
- Reasoning – Exhibiting logical thinking and inference
- Robustness – Performing well on adversarial or out-of-distribution examples
The most comprehensive benchmarks use a diverse set of tasks and metrics to stress test LLMs and get a complete picture of their skills and boundaries.
How should I read the results of LLM benchmarks?
When interpreting LLM benchmark results, keep the following tips in mind:
- Compare to baselines – Look at how models perform relative to simple baselines, previous SOTAs, and human-level scores. An accuracy of 60% is poor if humans get 90%.
- Consider the task – Some skills are more challenging or impactful than others. An LLM that excels at trivia might still fail at coding or empathy.
- Look at the data – Check how large and diverse the test set is. Results on narrow academic datasets may not transfer to messy real-world applications.
- Examine error cases – Don’t just focus on overall scores. Analyze where models make mistakes to identify weaknesses and potential for harm.
- Beware of hype – Be skeptical of cherry-picked results that make models seem more capable than they are. Demand comprehensive benchmarking.
- Acknowledge limitations – Even SOTA models are narrow and brittle. Benchmarks don’t perfectly reflect real-world performance and can be gamed.
- Use multiple benchmarks – Don’t rely on a single score. Triangulate a model’s true abilities by looking at its full portfolio of benchmark results.
- Think critically – Go beyond leaderboards. Consider the ecological validity, robustness, and ethical implications of benchmarks. Hold models accountable.
Benchmark results provide valuable information but are not the full picture. Use them as a starting point for probing LLMs’ strengths, weaknesses, and suitability for different applications. The most informative evaluation often comes from testing models on your own domain-specific data and use cases.