Data-effective and data-efficient ML
Abstract
Data-effective machine learning (ML), also known as data-centric AI, aims to obtain high-quality training data to unlock the value of AI, because dirty data can severely degrade the performance of ML models. Data-efficient ML focuses on making the training process more efficient. A commonly used strategy is to select a small representative subset of the training data, called a coreset, such that ML models trained on the coreset achieve performance similar to models trained on the entire dataset. Users naturally want both: data-effective ML for training better models, and data-efficient ML for reducing training cost.
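To make the coreset idea concrete, the sketch below selects a small representative subset with greedy k-center (farthest-point) selection. This is a generic, well-known heuristic chosen for illustration only; it is not the method of GoodCore or the other publications listed here, and it assumes complete numeric data. The function name `k_center_coreset`, its parameters, and the toy dataset are all hypothetical.

```python
import math
import random


def k_center_coreset(points, k, seed=0):
    """Greedy k-center (farthest-point) selection.

    Illustrative stand-in for a coreset selector (an assumption, not the
    project's algorithm). Returns the indices of the chosen points.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < k:
        # Pick the point that the current coreset covers worst.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return selected


# Toy data (hypothetical): three well-separated 2-D clusters of 50 points each.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in centers
    for _ in range(50)
]

coreset_idx = k_center_coreset(data, k=3)
coreset = [data[i] for i in coreset_idx]
```

On this toy data the three selected points fall in three different clusters, so the coreset touches every region of the dataset. A real pipeline would then train a model on `coreset` and compare its accuracy with a model trained on all of `data`.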
Publications
1. GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data. Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, and Guoliang Li.
2. Efficient Coreset Selection with Cluster-based Methods. Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, and Guoren Wang.
3. Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning. Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, and Guoliang Li.
Project Period
2023-Present
Research Area
Data-centric AI
Keywords
data quality, data-centric AI