GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data
Abstract
Given a dataset with incomplete data (e.g., missing values), training a machine learning model over the incomplete data requires two steps. First, it requires a data-effective step that cleans the data in order to improve the data quality (and the model quality on the cleaned data). Second, it requires a data-efficient step that selects a core subset of the data (called coreset) such that the trained models on the entire data and the coreset have similar model quality, in order to improve the training efficiency. The first-data-effective then-data-efficient methods are too costly, because they are expensive to clean the whole data; while the first-data-efficient-then-data-effective methods have low model quality, because they cannot select high-quality coreset for incomplete data.
In this paper, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data for selecting high-quality coreset. To this end, we propose the GoodCore framework towards selecting a good coreset over incomplete data with low cost. To model the unknown complete data, we utilize the combinations of possible repairs as possible worlds of the incomplete data. Based on possible worlds, GoodCore selects an expected optimal coreset through gradient approximation without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we further propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation method into our framework. Experimental results show the effectiveness and efficiency of our framework with low cost.
Publications
GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data. Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, and Guoliang Li. [Best of SIGMOD 2023 Papers]
Project Period
2023
Research Area
Data-driven AI & Machine Learning
Keywords
coreset selection, data-centric AI, machine learning; data cleaning