Generating Skyline Datasets for Data Science Models

Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu

2025-02-21

Summary

This paper introduces MODis, a framework that builds better training datasets for AI models by optimizing for several user-defined goals at once, rather than for a single measure of quality. It's like designing study materials that help a student do well in every subject, not just one.

What's the problem?

When preparing datasets for AI models, conventional data discovery methods integrate data toward a single predefined quality measure. That can bias the resulting dataset and leave the model weaker on downstream tasks the measure does not capture. This is like preparing a student for a math test while ignoring science and history, which limits their overall abilities.

What's the solution?

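The researchers developed MODis, a framework that selects and integrates data sources into 'skyline datasets': datasets on which a given model is expected to reach the desired performance on all of the user-defined measures, not just one. MODis formulates the search as a multi-goal finite state transducer and derives three algorithms from it. The first starts from a universal schema and iteratively prunes unpromising data, the second lowers the cost by interleaving data augmentation and reduction, and the third diversifies the results to mitigate bias in the skyline datasets.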
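
To make the idea of a skyline concrete, here is a minimal Python sketch. It assumes each candidate dataset has already been scored on every measure, with higher values being better. The names `dominates` and `skyline`, and the candidates D1 to D4 with their scores, are invented for illustration and are not taken from the paper. A candidate belongs to the skyline if no other candidate is at least as good on every measure and strictly better on at least one.

```python
from typing import Dict, List, Tuple

# One value per performance measure; higher is better (illustrative assumption).
Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


# Hypothetical candidates scored as (accuracy, negated training cost).
candidates = {
    "D1": (0.90, -5.0),
    "D2": (0.85, -2.0),
    "D3": (0.80, -4.0),  # dominated by D2
    "D4": (0.90, -6.0),  # dominated by D1
}

print(skyline(candidates))  # prints ['D1', 'D2']
```

D1 and D2 both survive because neither beats the other on both measures: D1 is more accurate, D2 is cheaper to train. That trade-off is what a single quality measure would hide.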

Why does it matter?

This matters because a model trained on data tuned for a single measure can fall short on the others. By generating datasets that balance several performance measures, MODis aims to make models more versatile and reliable in real-world applications. The authors report experiments verifying the efficiency and effectiveness of their algorithms, and showcase how they can be used to optimize data science pipelines.

Abstract

Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single pre-defined quality measure, which may lead to bias for downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined, model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to have the desired performance in all the performance measures. We formulate MODis as a multi-goal finite state transducer, and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms, and showcase their applications in optimizing data science pipelines.
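
The "reduce-from-universal" strategy can be illustrated with a toy search in Python. This is only a sketch under stated assumptions, not the paper's algorithm: the abstract does not specify the transducer, the pruning rule, or how performance is estimated. Here a dataset is represented by its set of attributes, `evaluate` is a made-up stand-in for training and testing a model, the `SIGNAL` values are invented, and a reduced dataset is expanded further only if nothing evaluated so far dominates it (a heuristic that could miss skyline members in general).

```python
from typing import Callable, Dict, FrozenSet, Tuple

Scores = Tuple[int, ...]   # higher is better on every measure
State = FrozenSet[str]     # a dataset, represented by its attribute set


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def reduce_from_universal(
    universal: State, evaluate: Callable[[State], Scores], min_size: int = 1
) -> Dict[State, Scores]:
    """Start from all attributes, drop one at a time, return the non-dominated states."""
    seen: Dict[State, Scores] = {universal: evaluate(universal)}
    frontier = [universal]
    while frontier:
        next_frontier = []
        for state in frontier:
            if len(state) <= min_size:
                continue  # do not reduce below the minimum size
            for attr in state:
                child = state - {attr}
                if child in seen:
                    continue  # already reached through another path
                seen[child] = evaluate(child)
                # Heuristic pruning (assumption): expand only non-dominated children.
                if not any(dominates(s, seen[child]) for s in seen.values()):
                    next_frontier.append(child)
        frontier = next_frontier
    return {
        state: scores
        for state, scores in seen.items()
        if not any(dominates(other, scores) for other in seen.values())
    }


# Invented contribution of each attribute to accuracy, in percentage points.
SIGNAL = {"age": 30, "income": 40, "zip": 0, "noise": -10}


def evaluate(attrs: State) -> Scores:
    """Hypothetical stand-in for model evaluation: (accuracy %, negated attribute count)."""
    return (20 + sum(SIGNAL[a] for a in attrs), -len(attrs))


result = reduce_from_universal(frozenset(SIGNAL), evaluate)
for attrs, scores in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(sorted(attrs), scores)
```

In this toy setting the search ends with two skyline datasets, one keeping `age` and `income` and one keeping only `income`, which trade accuracy against size. The bi-directional variant described in the abstract would also add data back during the search, which this sketch does not attempt.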
