[Talk] Chenyang Zhong: Faithful and Efficient Synthetic Data Generation via Penalized Optimal Transport Network

When: Wednesday, March 4, 3:00 PM
Where: Pharmacy 240

Abstract
The generation of synthetic data whose distributions faithfully emulate the true data-generating mechanism is of critical importance in modern statistics and data science, with applications ranging from systematic model evaluation to augmenting limited datasets. While Wasserstein Generative Adversarial Networks (WGANs) have shown promise in this area, they often suffer from mode collapse. This pathological phenomenon results in generated samples that fail to capture the full complexity of the true data distribution, particularly in the tails and minor modes. Such limitations can lead to serious consequences for downstream analyses and decision-making processes.

To address this challenge, we propose the Penalized Optimal Transport Network (POTNet), a novel deep generative model that effectively mitigates mode collapse. POTNet utilizes a robust and interpretable marginally-penalized Wasserstein distance, leveraging low-dimensional marginal information to guide the alignment of joint distributions. By employing a primal-based framework, our approach eliminates the need for a critic network, thereby circumventing the training instabilities inherent in adversarial approaches and avoiding extensive parameter tuning. We demonstrate, both theoretically and empirically, that POTNet achieves superior performance in accurately capturing underlying data structures and attenuating mode collapse compared to existing methods. Furthermore, POTNet exhibits remarkable computational efficiency, enabling scalable synthetic data generation for large-scale applications.
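To give a flavor of the idea, the marginally-penalized distance described in the abstract can be sketched at the level of empirical batches: a joint optimal-transport term plus penalties on each one-dimensional marginal, weighted by a penalty parameter `lam`. This is a rough illustration under assumed conventions (squared-Euclidean joint cost, equal-size batches, 1-Wasserstein marginal penalties), not the exact objective from the talk.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import wasserstein_distance


def marginally_penalized_w_loss(real, fake, lam=1.0):
    """Illustrative marginally-penalized Wasserstein loss between two
    equal-size empirical batches (rows = samples, columns = features).

    NOTE: a hypothetical sketch, not POTNet's exact objective; cost
    choices and the penalty weight `lam` are assumptions.
    """
    # Joint term: exact optimal transport between two uniform empirical
    # measures of equal size reduces to an assignment problem on the
    # pairwise squared-Euclidean cost matrix.
    cost = np.sum((real[:, None, :] - fake[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    joint = np.sqrt(cost[rows, cols].mean())
    # Marginal penalty: one-dimensional Wasserstein distance computed
    # coordinate-by-coordinate (cheap, since 1-D OT is a sorting problem).
    marginal = sum(
        wasserstein_distance(real[:, j], fake[:, j])
        for j in range(real.shape[1])
    )
    return joint + lam * marginal
```

Solving the primal transport problem directly, as in the assignment step above, is what removes the need for a critic network: the loss is computed in closed form from the two batches rather than estimated adversarially.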

Bio
Chenyang Zhong is a Term Assistant Professor in the Department of Statistics at Columbia University. He received his PhD in Statistics from Stanford University, where he was advised by Persi Diaconis. His research develops inference methodologies with statistical and computational guarantees to address emerging challenges in statistics, data science, and machine learning. His work focuses on generative modeling, optimal transport, Bayesian modeling and inference, variational inference, Markov chain Monte Carlo, and statistical machine learning.