A Study on Robustness and Generalization for Visual and Multimodal Learning

Thesis Proposal Examination

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

By Mr WANG, Xiasi

Abstract

Recent advances in visual and multimodal learning have achieved impressive performance on standard benchmarks, yet their reliability in real-world applications is challenged by issues of robustness and generalization. This thesis addresses these challenges through three integral pillars of reliable deep learning: (1) robustness to adversarial attacks, (2) adaptation to domain shift and unseen categories, and (3) generalization via representation learning. First, we identify a previously overlooked vulnerability in Vision-Language Models (VLMs)—their susceptibility to adversarial manipulation of inference efficiency. We introduce VLMInferSlow, a black-box evaluation framework for assessing such efficiency robustness. Second, we address the problem of open-set domain adaptation, where models encounter both distribution shifts and unknown categories. We develop Activate and Adapt (ADA), a two-stage framework that adapts models to classify known categories while identifying unknown ones. Third, we propose Multi-View Entropy Bottleneck (MVEB), an objective for self-supervised learning that improves generalization by learning minimal sufficient representations through the elimination of superfluous information between views. Collectively, these works provide multifaceted solutions for building more reliable visual and multimodal learning systems.

TPE Committee

Chair of Committee: Prof. Sihong Xie

Prime Supervisor: Prof. Yuan Yao

Co-Supervisor: Prof. Nevin L. Zhang

Examiner: Prof. Wenjia Wang

Date

25 September 2025

Time

15:00 - 16:00

Venue

W2-202 (HKUST-GZ)