AI Benchmarks

By Tycho Broux

Advancements in artificial intelligence follow a dynamic cycle of development and dissemination, which often leads to a gap between expert knowledge and public understanding. AI innovations typically originate in academic institutions, corporate labs, or open-source communities, where researchers develop new models, algorithms, or applications. The results are then presented at conferences and published in journals, often with links between the papers and their code implementations. Eventually a breakthrough may be reported in the media, but coverage tends to focus on sensational aspects and oversimplify complex research.

Problem Statement

In order to push their product forward, companies tend to emphasize dramatic outcomes or risks, which leads to misconceptions about an AI system’s capabilities and limitations. One common way of presenting dramatic outcomes is to show seemingly drastic improvements in benchmark scores. The AI benchmarking scene is developing rapidly and can be confusing to newcomers, and there are multiple reasons why benchmark results might not paint an accurate picture. These include, but are not limited to:

  • Benchmark Saturation: As models improve, many benchmarks are “saturated” (i.e., nearly all models score very high), making it harder to differentiate improvements.
  • Data Contamination: Models may have seen parts of the test data during training, skewing their results (a crude overlap check is sketched after this list).
  • Narrow Focus: Standard benchmarks may not capture performance in real-world scenarios or specialized domains. Different categories of tasks require different benchmarks, which are typically not comparable, so choose benchmarks based on the problem you are trying to solve. Think 🍎🍎 to 🍎🍎.
  • Cost: Published results often don’t include training or inference costs. Take these factors into account when choosing a model, especially if you wish to create a modified version of someone else’s model.
  • Selective Reporting: The displayed results are often specifically chosen to paint the model in question in a positive light, which may not be desirable or representative of the model’s real-world performance.
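
As a rough illustration of the contamination point above, the sketch below flags test items whose word n-grams overlap heavily with a chunk of training text. The function names and the n-gram length are my own choices, and real contamination audits are considerably more involved; treat this as an idea sketch, not a tool.

```python
# Crude data-contamination check: how much of a test item's 8-gram
# vocabulary also appears verbatim in a chunk of training text?

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def overlap_ratio(test_item: str, training_text: str, n: int = 8) -> float:
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(test_grams & ngrams(training_text, n)) / len(test_grams)

# A ratio close to 1.0 suggests the benchmark item was likely seen during training.
```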

This article aims to give you a few pointers for navigating this scene.

Choose your battleground

Clear-cut correctness: In AI, especially within generative models, “clear-cut correctness” refers to outputs that are objectively verifiable—such as factual accuracy in responses or the correctness of a solution to a math problem. However, many AI applications involve subjective elements where correctness isn’t binary. For instance, in scenarios involving tone, style, or specific content preferences, there’s no single correct answer.
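
A minimal sketch of what clear-cut correctness looks like in practice: exact-match scoring against a single reference answer (the function names and sample data are illustrative). The last item in the usage line shows exactly where this breaks down for subjective outputs.

```python
# Exact-match accuracy: only meaningful when there is one objectively
# correct output per item (arithmetic, short factual answers, etc.).

def exact_match(prediction: str, reference: str) -> bool:
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(prediction) == normalize(reference)

def accuracy(predictions: list[str], references: list[str]) -> float:
    pairs = list(zip(predictions, references, strict=True))
    return sum(exact_match(p, r) for p, r in pairs) / len(pairs)

print(accuracy(["Paris", " 42 ", "a calm, friendly tone"],
               ["paris", "42", "friendly"]))  # 0.666... (the third item has no single right answer)
```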

Multi-step problem solving: AI models are increasingly tackling tasks that require multi-step reasoning, akin to human problem-solving processes. This involves breaking down complex problems into sequential steps, allowing for more accurate solutions. Techniques like Chain-of-Thought (CoT) prompting encourage models to articulate their reasoning processes step by step. Additionally, frameworks like Model-induced Process Supervision (MiPS) use trained verifiers to evaluate each intermediate step, enhancing the model’s ability to handle complex tasks.
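
As a concrete, deliberately provider-agnostic illustration of CoT prompting, the sketch below only builds the prompt; `call_model` is a hypothetical placeholder for whichever LLM API you actually use.

```python
# Chain-of-Thought prompting: nudge the model to show intermediate steps
# before committing to a final answer.

def build_cot_prompt(question: str) -> str:
    return (
        f"Question: {question}\n"
        "Work through the problem step by step, then give the final answer "
        "on its own line starting with 'Answer:'."
    )

def call_model(prompt: str) -> str:
    raise NotImplementedError("Replace with your provider's completion call.")

prompt = build_cot_prompt("A train covers 60 km in 45 minutes. What is its average speed in km/h?")
print(prompt)  # Inspect the prompt, then pass it to call_model() with a real client.
```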

Domain-specific challenges: While general-purpose AI models have made significant strides, they often encounter challenges when applied to specialized domains:

  • Mathematics: AI models can struggle with advanced mathematical reasoning, especially when problems involve complex, domain-specific rules. Research indicates that integrating domain-specific knowledge and structured reasoning processes can enhance performance in mathematical tasks.
  • Programming: Generating functional code requires understanding programming logic, syntax, and context. Models like OpenAI’s o1 have shown proficiency in coding tasks by employing step-by-step reasoning, which helps in debugging and optimizing algorithms (a sketch of the pass@k metric used by coding benchmarks follows this list).
  • Expert-Level Domain Knowledge: In fields like healthcare, legal analysis, or chemistry, AI models must comprehend and apply specialized knowledge. To address this, researchers are developing domain-specific large language models (LLMs) that incorporate expert knowledge, enhancing accuracy and relevance in specialized applications.
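
On the programming point above: coding benchmarks are commonly scored with pass@k, the probability that at least one of k sampled solutions passes the unit tests. The sketch below is the standard unbiased estimator (n samples per problem, c of which pass); the numbers in the usage lines are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15 (simply c/n)
print(pass_at_k(n=20, c=3, k=5))  # roughly 0.60
```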

Choose your fighter

Using quadrant charts can be a great way of comparing similar models; a minimal plotting sketch follows the list below. Keep in mind that the quadrant chart needs to reflect the real-world use case. Commonly used axes might be:

  • Training cost (lower is better)
  • Running cost (lower is better)
  • Token Throughput (higher is better)
  • Time to first token (lower is better)
  • Latency (lower is better)
  • Memory requirement (lower is better)
  • Fit for purpose / answer quality (higher is better)
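
Here is a minimal sketch of such a quadrant chart using matplotlib, with made-up model names and numbers purely for illustration; swap in the two axes that matter for your use case.

```python
import statistics
import matplotlib.pyplot as plt

# Hypothetical measurements: latency in ms (lower is better),
# answer quality on a 0-100 scale (higher is better).
models = {"Model A": (120, 85), "Model B": (45, 70),
          "Model C": (300, 92), "Model D": (60, 55)}

fig, ax = plt.subplots()
for name, (latency, quality) in models.items():
    ax.scatter(latency, quality)
    ax.annotate(name, (latency, quality), xytext=(5, 5), textcoords="offset points")

# Split the plot into quadrants at the median of each axis.
ax.axvline(statistics.median(v[0] for v in models.values()), linestyle="--")
ax.axhline(statistics.median(v[1] for v in models.values()), linestyle="--")

ax.set_xlabel("Latency (ms, lower is better)")
ax.set_ylabel("Answer quality (higher is better)")
ax.set_title("Fit-for-purpose comparison")
plt.show()
```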

What about distillation?

In the context of artificial intelligence, model distillation refers to a technique where a smaller, more efficient model (the student) is trained to replicate the behavior of a larger, more complex model (the teacher). This process enables the deployment of AI systems that are faster, less resource-intensive, and more suitable for environments with limited computational capabilities, such as home computers, mobile devices or edge computing platforms.
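
For readers who want to see the mechanics, here is a minimal sketch of the classic distillation objective in PyTorch: the student is trained to match the teacher’s softened output distribution alongside the ordinary loss on the true labels. The temperature and weighting values are arbitrary placeholders.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# In the training loop: run each batch through the frozen teacher (no_grad)
# and the student, then backpropagate this loss through the student only.
```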

You can often recognize a distilled or otherwise compressed model by its name: it will include something along the lines of fp8 or fp4 instead of fp16. This refers to the floating-point precision of the model’s weights (strictly speaking a form of quantization, which is often applied alongside distillation). The higher the number, the better the model’s output tends to be, at the cost of more compute, and vice versa. A model at a different precision will score differently on the same benchmark, even though you could consider it to be the same model.
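
To see why precision alone changes behaviour, the tiny NumPy example below casts one weight value from fp32 to fp16 and prints the rounding error (NumPy has no fp8/fp4 types, so fp16 stands in for “lower precision”). Accumulated over billions of weights and activations, such errors are why the “same” model scores differently across precisions.

```python
import numpy as np

w32 = np.float32(0.1234567)
w16 = np.float16(w32)           # cast to half precision

print(w32)                      # 0.1234567
print(w16)                      # roughly 0.1235 (detail is lost in the cast)
print(float(w32) - float(w16))  # the per-weight rounding error
```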

Where to go

The website Papers with Code serves as a comprehensive resource for tracking the latest advancements in machine learning and artificial intelligence. It compiles a vast collection of research papers, each accompanied by their corresponding code implementations, and organizes them according to specific tasks and benchmarks.

Benchmark Tracking: The site provides up-to-date leaderboards for over 5,700 tasks across various domains, including computer vision, natural language processing, and medical imaging. This allows users to monitor which models are currently achieving top performance in specific areas.

Research Accessibility: By linking research papers with their code repositories, the platform facilitates easier reproduction of results and fosters transparency in the research community.

Community Engagement: Researchers, developers, and enthusiasts use the site to stay informed about cutting-edge methods, compare model performances, and identify trends in machine learning research.

Conclusion

The outcome alone doesn’t tell the whole story — take the time to understand the journey, the reasoning, and the process that led to it. Remember that there is no such thing as the best tool, only the right tool for the job.
