Hang Hua, Advancing Generative AI for Multimodal Intelligence

When: Friday, 3/7, 11 am – 12 pm; Where: Tyler Hall 055.
Abstract:
Generative AI is transforming how machines interact with and augment human capabilities. However, achieving artificial general intelligence (AGI) requires addressing significant challenges in pretrained language models (PLMs) and multimodal large language models (MLLMs), including the brittleness of language model fine-tuning, imbalanced vision-language capabilities, limited fine-grained visual perception, and inefficiencies in processing long video sequences. In this talk, I will present my efforts to tackle these challenges through innovative models and benchmarks. First, I will introduce LNSR, an efficient language model fine-tuning framework that improves language models’ robustness and generalization. Second, I will present PromptCap, a task-aware captioning framework that enables pure language models to perform complex visual reasoning by converting visual content into task-specific language representations. Next, I will discuss FineCaption, a mask-aware encoder designed for fine-grained regional image understanding and compositional captioning, which has been successfully applied in industry production systems. I will also highlight benchmarks such as MMComposition, which evaluates the compositional perception, reasoning, and probing capabilities of MLLMs, providing insights into their limitations and potential directions for improvement.

Looking ahead, I aim to advance multimodal LLMs as embodied agents capable of integrating text, images, audio, and video to interact naturally and effectively within real-world settings. Furthermore, I plan to explore the use of LLMs to enhance accessibility, such as improving mobile user interfaces for individuals with disabilities by providing adaptive, assistive navigation and voice-guided interactions. By addressing these critical areas, my research seeks to bridge gaps in reasoning, accessibility, and safety, enabling AI systems to better serve society and unlock transformative applications across diverse domains.

Bio:
Hang Hua is a Ph.D. candidate in Computer Science at the University of Rochester, advised by Professor Jiebo Luo. His research focuses on advancing generative AI, with a particular emphasis on multimodality, large language models (LLMs), diffusion models, and video-language understanding. Hang has published extensively in top-tier venues such as NeurIPS, CVPR, ICCV, AAAI, NAACL, and ECCV, and has developed state-of-the-art models like PromptCap and FineCaption, which address critical challenges in vision-language reasoning and fine-grained regional image understanding. His contributions also include building influential benchmarks like MMComposition and CompositionCap, advancing the evaluation of multimodal large language models. Beyond academia, Hang has rich industrial experience from research internships at Adobe Research, Microsoft Research, and Snap Research. In addition to his research, he actively reviews for leading conferences and co-organizes AI workshops. He holds an M.S. from Peking University and a B.S. from South China University of Technology.