GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data

Research Project

GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data

Abstract

Given a dataset with incomplete data (e.g., missing values), training a machine learning model over the incomplete data requires two steps. First, it requires a data-effective step that cleans the data in order to improve the data quality (and the model quality on the cleaned data). Second, it requires a data-efficient step that selects a core subset of the data (called coreset) such that the trained models on the entire data and the coreset have similar model quality, in order to improve the training efficiency. The first-data-effective then-data-efficient methods are too costly, because they are expensive to clean the whole data; while the first-data-efficient-then-data-effective methods have low model quality, because they cannot select high-quality coreset for incomplete data.

In this paper, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data for selecting high-quality coreset. To this end, we propose the GoodCore framework towards selecting a good coreset over incomplete data with low cost. To model the unknown complete data, we utilize the combinations of possible repairs as possible worlds of the incomplete data. Based on possible worlds, GoodCore selects an expected optimal coreset through gradient approximation without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we further propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation method into our framework. Experimental results show the effectiveness and efficiency of our framework with low cost.

Project members

Nan TANG

Associate Professor

Yuyu LUO

Assistant Professor

Publications

GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data. Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, and Guoliang Li. [Best of SIGMOD 2023 Papers]

Project Period

2023

Research Area

Data-driven AI & Machine Learning

Keywords

coreset selection, data-centric AI, machine learning; data cleaning