AI-Powered Data Systems for Multimodal Analytics

Abstract
We live in a world overflowing with data, and the emergence of AI, such as Large Language Models (LLMs), is revolutionizing data analytics. However, directly using AI to process massive and complex data is neither effective nor scalable.
In this talk, I share my work on building AI-native systems to analyze multimodal data at scale, focusing on tables and complex documents. On one hand, when analyzing tables, AI is often used to prepare data, for example by cleaning and enriching it, and this becomes prohibitively expensive when the data is large. I present a set of database techniques that support scalable AI computation without sacrificing accuracy. On the other hand, when analyzing documents, current approaches typically treat them as plain text and ignore their underlying structure, which limits both accuracy and performance. To address this, I present our work on data structuring, which explores the varying degrees of structure in unstructured documents and uses them to optimize query processing for efficient document analytics. Finally, I’ll share my vision for building data systems for multimodal analytics, including trustworthy systems, optimization with hardware, and co-optimization across different data modalities.
Speaker Bio
Yiming Lin is a postdoctoral researcher at the University of California, Berkeley. He received his Ph.D. from the University of California, Irvine. His research interests span document analytics, query processing and optimization, and data preparation, with a current focus on building AI-powered data systems for multimodal analytics. His work has had real-world impact: his document analytics work helps public defenders, journalists, and the California Police Department process over 30,000 pages. His work on scalable table ingestion drives multiple high-quality smart space applications and has been deployed at six sites for five years, including universities, industry, nursing homes, and the U.S. Navy. He has a number of publications and serves on the program committees of the premier database conferences VLDB, SIGMOD, and ICDE.
Date
06 February 2026
Time
10:00 - 11:00
Venue
E3-314
Join Link
Zoom Meeting ID: 635 003 6325
Passcode: dsat