TabReX : Tabular Referenceless eXplainable Evaluation
Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
2025-12-19
Summary
This paper introduces a new way to check how well large language models create tables, going beyond simple text comparisons to assess whether the table's structure and information are actually correct.
What's the problem?
Currently, evaluating tables created by AI is difficult because existing methods either treat tables like plain text, missing important relationships between cells, or require a 'perfect answer' table to compare against. The latter isn't realistic, since there are often multiple correct ways to present the same information. As a result, it is hard to reliably assess how well these models are doing or to pinpoint specific errors.
What's the solution?
The researchers developed a system called TabReX that represents both the original information and the generated table as knowledge graphs, networks of entities and the relationships between them. It then uses another AI to match up these graphs and score the table based on how well it preserves both the structure and the facts of the original source. This method allows for flexible, reference-free evaluation and can even highlight specific cells that are incorrect, offering a detailed error analysis.
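The core idea can be illustrated with a minimal sketch. This is not the paper's implementation: TabReX uses an LLM to extract and align graphs, whereas here both the source triples and the table are supplied directly, and the graphs are plain sets of (entity, attribute, value) triples. The scoring functions and names below are hypothetical, chosen only to show how structural matching can yield both a fidelity score and a cell-level error trace.

```python
# Hypothetical sketch of graph-based, referenceless table scoring.
# Assumption: the table's first column names the row's entity, and the
# source text has already been reduced to (entity, attribute, value)
# triples (in TabReX, an LLM performs this extraction and alignment).

def table_to_triples(header, rows):
    """Convert a table into a set of (entity, attribute, value) triples."""
    triples = set()
    for row in rows:
        entity = row[0]  # first column treated as the row entity
        for attr, value in zip(header[1:], row[1:]):
            triples.add((entity, attr, value))
    return triples

def fidelity_scores(source_triples, table_triples):
    """Factual fidelity: fraction of table triples supported by the source.
    Coverage: fraction of source triples represented in the table."""
    if not table_triples:
        return 0.0, 0.0
    supported = table_triples & source_triples
    factual = len(supported) / len(table_triples)
    coverage = len(supported) / len(source_triples) if source_triples else 1.0
    return factual, coverage

def error_trace(source_triples, table_triples):
    """Cell-level trace: table triples with no support in the source."""
    return sorted(table_triples - source_triples)

# Triples an extractor might pull from the source text (assumed here).
source = {("Paris", "country", "France"),
          ("Paris", "population", "2.1M"),
          ("Berlin", "country", "Germany")}

header = ["city", "country", "population"]
rows = [["Paris", "France", "2.1M"],
        ["Berlin", "Germany", "3.6M"]]  # 3.6M is unsupported by the source

table = table_to_triples(header, rows)
factual, coverage = fidelity_scores(source, table)
trace = error_trace(source, table)
```

Here the unsupported Berlin population shows up in `trace`, pointing at the exact offending cell, while `factual` and `coverage` give the aggregate scores. The real framework replaces the exact-match set intersection with LLM-guided alignment, which tolerates paraphrases and alternative table layouts.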
Why it matters?
This work is important because it provides a more trustworthy and explainable way to evaluate AI-generated tables. By moving beyond simple comparisons, it allows developers to better understand where models succeed and fail, leading to improvements in the quality and reliability of these systems, especially as they are used in more complex applications requiring structured data.
Abstract
Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a referenceless, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.