A Distributed and Scalable Approach for Global Representation Learning with EHR Applications

Speaker: Zebin Wang

Date: Sept 30, 11:45 am – 12:45 pm

Abstract: Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.

Biographical Sketch: I am a fifth-year Ph.D. student advised by Professor Tianxi Cai, ScD, in the Department of Biomedical Informatics at Harvard Medical School, Boston, Massachusetts. I received my Bachelor of Science degree in Mathematics and Applied Mathematics from Fudan University, China, in 2018. After graduating, I obtained a Master of Science (ScM) degree in Biostatistics from the Johns Hopkins School of Public Health, Baltimore, Maryland, in 2020. My recent research focuses on discovering scalable, privacy-preserving, and semantically rich electronic health record (EHR) representations in multi-modal and multi-institutional settings. I have also conducted several studies on disease phenotyping and patient prognosis classification, with particular emphasis on Alzheimer’s disease (AD) and renal cell carcinoma (RCC) cohorts. In my spare time, I enjoy watching and playing a variety of sports.

Location: LOV 307 (In Person Only)
