Fill in the Missing: Toward Efficient and Effective Imputation of Large-Scale Incomplete Datasets

Thesis Proposal Examination

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

PhD Thesis Proposal Examination

By Mr. Jun WANG

Abstract

In data-driven industries, missing values are ubiquitous: they leave collected datasets incomplete, compromise data integrity, and pose significant challenges to accurate downstream analysis and decision-making. This proposal identifies three principal challenges in missing data imputation for large-scale incomplete datasets: the scalability limitations of deep generative imputation algorithms, the selection bias introduced by Missing Not At Random (MNAR) data, and the degraded performance of downstream tasks, such as classification, on incomplete datasets.

To address these challenges, we present three methodological innovations. First, the Scalable Imputation System (SCIS) estimates an appropriate sample size to speed up the training of deep generative imputation models under accuracy guarantees for large-scale incomplete datasets. This is particularly advantageous for large-scale applications where deep generative imputation models falter due to computational demands. Second, we introduce the Counterfactual Contrastive Learning (CounterCLR) framework, designed to combat the selection bias caused by MNAR data in recommender systems. By integrating a causality-based prediction network with a contrastive learning objective, CounterCLR mitigates bias while improving the model’s generalization under sparse data. Lastly, we theoretically investigate the generalization error, robustness, and model calibration of square loss for classification tasks on fully observed data. This establishes a baseline and provides insights into model behavior under ideal conditions before we extend the analysis to incomplete data scenarios.

Preliminary experiments demonstrate the effectiveness of these methodologies, and comprehensive future work plans are outlined to refine and validate them further. By advancing the state of missing data imputation, this proposal aims to enhance the accuracy and reliability of data-driven decision-making across various sectors, ultimately contributing to more robust analytical practices on incomplete datasets.

TPE Committee

Chairperson: Prof. Xiaowen CHU

Prime Supervisor: Prof. Wenjia WANG

Co-Supervisor: Prof. Fugee TSUNG

Examiner: Prof. Jia LI

Date

12 June 2024

Time

13:30 - 14:45

Location

E1-150