Research Project

Data-effective and data-efficient ML

Abstract

Data-effective machine learning (ML), also known as data-centric AI, aims to obtain high-quality training data to unlock the value of AI, since dirty data is well known to severely degrade the performance of ML models. Data-efficient ML focuses on making the training process more efficient. A commonly used strategy is to select a core subset of the training data (a coreset) that represents the entire dataset, such that ML models trained on the coreset achieve performance similar to models trained on the entire dataset. Users naturally want both: data-effective ML for training better models, and data-efficient ML for reducing training cost.
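To make the coreset idea concrete, the following is a minimal sketch of one generic selection heuristic, greedy k-center (farthest-point) selection. It is not the method used in the publications listed below; the function name `greedy_k_center`, its parameters, and the synthetic toy data are assumptions introduced purely for illustration.

```python
import math
import random


def greedy_k_center(points, k, seed=0):
    """Return indices of k points chosen by farthest-point selection.

    Illustrative baseline only (hypothetical helper, not the project's method).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    first = rng.randrange(n)  # arbitrary starting point
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    for _ in range(k - 1):
        # Pick the point that is currently worst covered by the coreset.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]
    return selected


# Synthetic toy data: three well-separated 2-D clusters of 100 points each.
data_rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in centers
    for _ in range(100)
]

coreset_idx = greedy_k_center(points, k=3)
coreset = [points[i] for i in coreset_idx]
```

On this toy data, the three selected points fall in three different clusters, which is the sense in which a small subset can "represent" the full dataset. A model would then be trained on `coreset` instead of `points`; whether its accuracy stays close to full-data training depends on the selection method and the task.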

Project members

Nan TANG

Associate Professor

Yuyu LUO

Assistant Professor

Publications

1. GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data. Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, and Guoliang Li.
2. Efficient Coreset Selection with Cluster-based Methods. Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, and Guoren Wang.
3. Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning. Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, and Guoliang Li.

Project Period

2023-Present

Research Area

Data-centric AI

Keywords

data quality, data-centric AI