Fill in the Missing: Toward Efficient and Effective Imputation of Large-Scale Incomplete Datasets
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Thesis Examination
By Mr. Jun WANG
ABSTRACT
In the rapidly evolving landscape of data-driven industries, missing values are ubiquitous: they leave collected datasets incomplete, compromise data integrity, and pose significant challenges to accurate downstream analysis and decision-making. This thesis systematically identifies three principal challenges in missing data imputation for large-scale incomplete datasets: the scalability limitations of deep generative imputation algorithms, the selection bias introduced by Missing Not At Random (MNAR) data, and the degraded performance of downstream tasks, such as classification, on incomplete datasets.
To address these challenges, we present three methodological innovations. First, the Scalable Imputation System (SCIS) estimates an appropriate sample size to speed up the training of deep generative imputation models on large-scale incomplete datasets while preserving accuracy guarantees. This is particularly advantageous for large-scale applications where deep generative imputation models falter due to computational demands. Second, we introduce the Counterfactual Contrastive Learning (CounterCLR) framework, designed to combat the selection bias caused by MNAR data in recommender systems. By integrating a causality-based prediction network with a contrastive learning objective, CounterCLR mitigates bias while improving the model’s generalization on sparse data. Lastly, we theoretically investigate the generalization error, robustness, and model calibration of square loss for classification tasks on fully observed data. This establishes a baseline and provides insights into model behavior under ideal conditions before the analysis is extended to incomplete data scenarios.
Extensive experiments demonstrate the effectiveness of these methodologies. By advancing the state of missing data imputation, this thesis aims to enhance the accuracy and reliability of data-driven decision-making across various sectors, ultimately contributing to more robust analytical practices on incomplete datasets.
Thesis Examination Committee (TEC)
Chairperson: Prof Hai-Ning LIANG
Prime Supervisor: Prof Wenjia WANG
Co-Supervisor: Prof Fugee TSUNG
Examiners:
Prof Dianpeng WANG
Prof Wei WANG
Prof Xinlei HE
Prof Zhilu LAI
Date
10 October 2024
Time
14:30 - 16:30
Venue
E3-201, GZ Campus
Join Link
Zoom Meeting ID: 940 9567 8179
Passcode: dsa2024