Fill in the Missing: Toward Efficient and Effective Imputation of Large-Scale Incomplete Datasets

Final Defense

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

PhD Thesis Examination

By Mr. Jun WANG

ABSTRACT

In data-driven industries, missing values are ubiquitous: they leave collected datasets incomplete, compromise data integrity, and pose significant challenges to accurate downstream analysis and decision-making. This thesis identifies three principal challenges in missing data imputation for large-scale incomplete datasets: the scalability limitations of deep generative imputation algorithms, the selection bias introduced by Missing Not At Random (MNAR) data, and the degraded performance of downstream tasks, such as classification, on incomplete datasets.

To address these challenges, we present three methodological contributions. First, the Scalable Imputation System (SCIS) estimates an appropriate sample size to speed up the training of deep generative imputation models under accuracy guarantees for large-scale incomplete datasets. This is particularly advantageous for large-scale applications where deep generative imputation models falter due to computational demands. Second, we introduce the Counterfactual Contrastive Learning (CounterCLR) framework, designed to combat the selection bias caused by MNAR data in recommender systems. By integrating a causality-based prediction network with a contrastive learning objective, CounterCLR mitigates bias while improving the model’s generalization under sparse data. Lastly, we theoretically investigate the generalization error, robustness, and model calibration of square loss for classification tasks on fully observed data. This establishes a baseline and provides insights into model behavior under ideal conditions before the analysis is extended to incomplete data scenarios.

Extensive experiments demonstrate the effectiveness of these methodologies. By advancing the state of missing data imputation, this thesis aims to enhance the accuracy and reliability of data-driven decision-making across various sectors, ultimately contributing to more robust analytical practices on incomplete datasets.

Thesis Examination Committee (TEC)

Chairperson: Prof Hai-Ning LIANG

Prime Supervisor: Prof Wenjia WANG

Co-Supervisor: Prof Fugee TSUNG

Examiners:

Prof Dianpeng WANG

Prof Wei WANG

Prof Xinlei HE

Prof Zhilu LAI

Date

10 October 2024

Time

14:30 - 16:30

Location

E3-201, GZ Campus

Join Link

Zoom Meeting ID: 940 9567 8179

Passcode: dsa2024