EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh

2025-03-19

Summary

This paper introduces EvalTree, a method for pinpointing where an AI language model is weak and for suggesting how to improve it.

What's the problem?

It is hard to determine exactly which capabilities a language model lacks, and so it is hard to know how to improve it.

What's the solution?

EvalTree builds a tree of capabilities, each described in natural language and linked to the benchmark questions that test it, and then picks out the branches where the model performs poorly.

Why does it matter?

Knowing a model's specific weaknesses lets researchers and developers collect training data that targets them, which the paper reports improves performance more than other data collection strategies.

Abstract

An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for Language Model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose a weakness profiling method EvalTree. It constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
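
The two steps described in the abstract (link each capability node to benchmark instances, then extract the nodes where the LM performs poorly) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the extraction criterion, so a plain accuracy threshold is assumed here, and the names `CapabilityNode`, `accuracy`, `extract_weaknesses`, `threshold`, and `min_size`, along with the toy tree and results, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityNode:
    """One capability, described in natural language, with its linked instances."""
    description: str
    instance_ids: list[int]
    children: list["CapabilityNode"] = field(default_factory=list)


def accuracy(node: CapabilityNode, correct: dict[int, bool]) -> float:
    """Fraction of the node's linked instances that the LM answered correctly."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def extract_weaknesses(node, correct, threshold=0.5, min_size=2):
    """Return descriptions of the highest nodes whose accuracy is below threshold.

    Assumption: a simple accuracy cutoff stands in for whatever criterion
    the paper actually uses; nodes with too few instances are skipped.
    """
    if len(node.instance_ids) < min_size:
        return []
    if accuracy(node, correct) < threshold:
        return [node.description]  # report this node, do not descend further
    found = []
    for child in node.children:
        found.extend(extract_weaknesses(child, correct, threshold, min_size))
    return found


# Toy per-instance results: instance id -> whether the LM was correct.
correct = {0: True, 1: True, 2: True, 3: False, 4: False, 5: True}

# Toy capability tree (hypothetical descriptions).
root = CapabilityNode(
    "Mathematical reasoning",
    [0, 1, 2, 3, 4, 5],
    [
        CapabilityNode("Solving algebraic equations", [0, 1, 2]),
        CapabilityNode("Reasoning about geometric figures", [3, 4, 5]),
    ],
)

profile = extract_weaknesses(root, correct)
print(profile)  # ['Reasoning about geometric figures']
```

In this toy run the root passes the threshold (4 of 6 correct), so the search descends and reports only the geometry node (1 of 3 correct). Stopping at the highest failing node keeps the profile short; a real system would also need a way to build the tree and write the descriptions, which this sketch leaves out.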
