SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
2024-12-06
Summary
This paper introduces SynFinTabs, a dataset of synthetic financial tables built to improve how AI models extract tables and information from financial documents.
What's the problem?
Extracting tables from images of documents is difficult for AI models because labeled examples are scarce, especially in the financial domain. Most existing datasets focus on scientific tables, whose layout and typography differ from those of financial tables. Many also omit the words inside each table and their positions, so models have to fall back on unreliable OCR. Together, these gaps make it hard for models to learn to recognize and interpret financial tables accurately.
What's the solution?
The authors created SynFinTabs, which includes 100,000 synthetic financial tables with detailed annotations. Because the tables are generated rather than collected, the labels come with the data, and the authors hope the generation method can be transferred to other domains. To show that the dataset is useful for training, they also built FinTabQA, a layout large language model trained on an extractive question-answering task. They tested it on real-world financial tables and compared it with a state-of-the-art generative model.
Why does it matter?
This research is important because it provides a large, labeled dataset for a domain where such data is scarce. The dataset, the model, and the generation code are all publicly available, so others can train on high-quality examples or adapt the method. This can improve the accuracy of table extraction and information retrieval, which is crucial for analyzing financial data and making informed decisions.
Abstract
Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.
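To make the idea of "words, and their positions" and of an extractive question-answering task more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Word` and `Cell` classes, the `make_cell` and `answer_span` helpers, the bounding-box geometry, and the table values are hypothetical and do not reflect the actual SynFinTabs annotation schema. The lookup is a plain rule-based stand-in, not FinTabQA. It only shows why an extractive answer can be returned as a span of annotated words together with their boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; hypothetical convention


@dataclass
class Word:
    text: str
    bbox: BBox


@dataclass
class Cell:
    row: int
    col: int
    words: List[Word]

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


def make_cell(row: int, col: int, text: str, x0: int, y0: int) -> Cell:
    """Build a cell with one box per word, using made-up geometry
    (10 px per character, 20 px line height, 10 px between words)."""
    words, x = [], x0
    for token in text.split():
        width = 10 * len(token)
        words.append(Word(token, (x, y0, x + width, y0 + 20)))
        x += width + 10
    return Cell(row, col, words)


def answer_span(cells: List[Cell], row_label: str, col_label: str):
    """Return (answer_text, word_boxes) for the cell at the intersection of
    the row whose first column equals row_label and the column whose header
    equals col_label. A rule-based stand-in for a trained model."""
    row = next(c.row for c in cells if c.col == 0 and c.text == row_label)
    col = next(c.col for c in cells if c.row == 0 and c.text == col_label)
    target = next(c for c in cells if c.row == row and c.col == col)
    return target.text, [w.bbox for w in target.words]


# Invented example table; the figures are not taken from the dataset.
rows = [
    ["", "2023", "2022"],
    ["Turnover", "1,200", "1,050"],
    ["Cost of sales", "(700)", "(640)"],
]
cells = [
    make_cell(r, c, text, 200 * c, 30 * r)
    for r, row in enumerate(rows)
    for c, text in enumerate(row)
]

answer, boxes = answer_span(cells, "Cost of sales", "2022")
print(answer, boxes)
```

Because the answer is a span of words that already exist in the annotation, it can be checked exactly against ground truth. That is one reason reliable word-level labels matter more here than OCR output of uncertain quality.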