A Study on Robustness and Generalization for Visual and Multimodal Learning
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
Thesis Proposal Examination
By Mr WANG, Xiasi
Abstract
Recent advances in visual and multimodal learning have yielded impressive performance on standard benchmarks, yet the reliability of these models in real-world applications is still limited by weak robustness and generalization. This thesis addresses these challenges through three pillars of reliable deep learning: (1) robustness to adversarial attacks, (2) adaptation to domain shift and unseen categories, and (3) generalization via representation learning. First, we identify a previously overlooked vulnerability in Vision-Language Models (VLMs): their susceptibility to adversarial manipulation of inference efficiency. We introduce VLMInferSlow, a black-box evaluation framework for assessing this efficiency robustness. Second, we address open-set domain adaptation, where models encounter both distribution shifts and unknown categories. We develop Activate and Adapt (ADA), a two-stage framework that adapts models to classify known categories while identifying unknown ones. Third, we propose Multi-View Entropy Bottleneck (MVEB), a self-supervised learning objective that improves generalization by learning minimal sufficient representations, eliminating superfluous information between views. Together, these works offer complementary solutions for building more reliable visual and multimodal learning systems.
TPE Committee
Chair of Committee: Prof. Sihong Xie
Prime Supervisor: Prof. Yuan Yao
Co-Supervisor: Prof. Nevin L. Zhang
Examiner: Prof. Wenjia Wang
Date
25 September 2025
Time
15:00 - 16:00
Location
W2-202 (HKUST-GZ)