Presentations – MSSISS 2026

2026 Presentations


Order of presentation may vary; alphabetical order is shown.

Session I

March 27th, 9:00am – 10:30am
West conference rooms

Moderated by
Milad Hoseinpour
PhD Student, EECS


Allison Grimsted

PhD Student
Industrial & Operations Engineering

Sample-Path Clustering Optimization for Context-Sensitive Markov Decision Processes

Abstract

We address the problem of estimating transition probability matrices from longitudinal sample-path data generated by multiple heterogeneous stochastic systems with latent class structure. Existing sequence clustering methods generally rely on estimating a transition probability matrix for each sample-path to define a similarity measure. However, in many applications, the sample-paths are too short to reliably determine their transition structure, which can result in misleading similarity measures. To address this problem, we propose an unsupervised clustering approach to partition sample-paths into a parsimonious set of clusters, each associated with its own transition probability matrix, capturing contextual variation in the system state transition probabilities. Our clustering method uses a variant of the set partitioning model to maximize the likelihood that the sample-paths were generated by the assigned class transition probability matrices. Statistical variation is controlled using minimum sample size constraints for each cluster derived from simultaneous confidence intervals for multinomial proportions. The resulting optimization model is a mixed-integer nonlinear program that is NP-hard. We propose a decomposition-based approach (column generation) with provable optimality bounds. The resulting clusters of sample-paths are used to construct context-specific transition probability matrices for Markov decision processes (MDPs). As a benchmark, we use a single “mean-value” MDP based on pooling all sample-paths. Computational experiments on a U.S. highway bridge maintenance case study illustrate the benefits of context-sensitive MDPs over their mean-value counterparts.
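
As a rough illustration of the two quantities the abstract builds on, the sketch below estimates a pooled ("mean-value") transition probability matrix from sample paths and scores a path's log-likelihood under a given matrix. The function names, the toy paths, and the uniform fallback for unobserved rows are assumptions made for illustration; the clustering model, sample-size constraints, and column generation procedure are not reproduced here.

```python
import math

def estimate_transition_matrix(paths, n_states):
    """Maximum-likelihood estimate from pooled transition counts."""
    counts = [[0] * n_states for _ in range(n_states)]
    for path in paths:
        for s, t in zip(path, path[1:]):
            counts[s][t] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: rows with no observed transitions fall back to uniform.
        matrix.append([c / total if total else 1.0 / n_states for c in row])
    return matrix

def path_log_likelihood(path, matrix):
    """Log-likelihood of one path's transitions under a given matrix."""
    ll = 0.0
    for s, t in zip(path, path[1:]):
        p = matrix[s][t]
        if p == 0.0:
            return float("-inf")
        ll += math.log(p)
    return ll

# Hypothetical two-state sample paths (states are 0 and 1).
paths = [[0, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
pooled = estimate_transition_matrix(paths, n_states=2)
```

A clustering approach in this spirit would assign each path to the class matrix under which `path_log_likelihood` is high, rather than estimating a separate matrix from each short path.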


Curtiss Engstrom

PhD Student
Survey and Data Science

Combining Information from Multiple National Health Surveys: Comparing Multilevel Regression with Poststratification Modeling and the Finite-Population Bayesian Bootstrap

Abstract

Survey researchers often collect data on the same population by drawing multiple samples from that population. There may be substantial overlap in the questions used in each of the surveys administered to these samples. When researchers attempt to combine multiple survey datasets with significant overlap in the questions used, they must incorporate the complex sample design features from each sample into their combining procedures. Failure to account for these varying design features could introduce bias into the estimates and incorrect population inferences based on the combined datasets.
Recent literature proposes multiple methods for incorporating survey-specific complex sample design features and weights when combining information from multiple surveys. One such method uses multilevel regression and poststratification (MRP) modeling to incorporate the complex sample design features when obtaining estimates from each survey dataset, and then applies a combining rule to those estimates. A second approach uses the weighted finite-population Bayesian bootstrap (WFPBB) to incorporate the complex sample design features, generating multiple synthetic populations from each survey dataset and combining the resulting estimates.
Using data from the 2018 National Survey on Drug Use and Health (NSDUH) and the 2018 National Health Interview Survey (NHIS), we compared combined estimates of the prevalence of lifetime use of 100 cigarettes, past-month cigarette use, and lung cancer screening eligibility rates among sexual identity subgroups generated from both methods. We found that MRP generates combined estimates with narrower confidence intervals than WFPBB. We used benchmarks from the 2018 Behavioral Risk Factor Surveillance System (BRFSS), as well as from published estimates, to estimate the bias in the combined estimates of proportions and means for the smoking outcomes. We discuss the trade-offs and practical applications of each method before concluding with a recommendation for the more suitable of the two combining methods.
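
The poststratification step in MRP can be sketched as a population-weighted average of cell-level model estimates. The cell labels, prevalences, and population counts below are hypothetical, and the multilevel regression that would produce the cell estimates, as well as the combining rule across surveys, is omitted.

```python
def poststratify(cell_estimates, cell_population_counts):
    """Population-weighted average of cell-level estimates."""
    total = sum(cell_population_counts.values())
    weighted = sum(cell_estimates[c] * n for c, n in cell_population_counts.items())
    return weighted / total

# Hypothetical modeled prevalences and known population counts per cell.
cell_estimates = {"18-34": 0.10, "35-64": 0.20, "65+": 0.30}
cell_counts = {"18-34": 300, "35-64": 500, "65+": 200}
estimate = poststratify(cell_estimates, cell_counts)
```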


Yue Yu

PhD Student
Statistics

Undersmooth Black-box Model for Functional Estimation

Abstract

We study functional estimation using black-box models through a model-agnostic undersmoothing framework. The proposed procedure Rep operates by augmenting the original dataset through replicating a proportion of samples multiple times, and subsequently applying the black-box algorithm to the augmented dataset. This construction automatically induces undersmoothing and removes the need for manual hyperparameter tuning.
We provide several empirical demonstrations (including neural network based learners) showing that compared to the plug-in estimator, the proposed algorithm Rep improves the estimation accuracy of functional estimation without requiring explicit expressions for the associated influence functions. Furthermore, we develop a theoretical analysis in two representative settings, the Nadaraya–Watson estimator and the random feature model, establishing that replication provides explicit prescriptions for the replication proportion and number of copies, and yields optimal convergence rates for functional estimation. In the classical nonparametric regression setting, we extend Rep with a Lepski-style method that adapts to unknown structural features of the regression function.
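
A minimal sketch of the replication step described above, assuming that "replicating" means appending a fixed number of extra copies of a randomly chosen subset of rows. The function name, its parameters, and the choice of random selection are illustrative assumptions; the abstract's prescriptions for the proportion and number of copies are not reproduced.

```python
import random

def replicate_augment(data, proportion, copies, seed=0):
    """Return data plus `copies` extra copies of a random `proportion` of rows."""
    rng = random.Random(seed)
    k = round(proportion * len(data))
    chosen = rng.sample(range(len(data)), k)
    augmented = list(data)
    for i in chosen:
        augmented.extend([data[i]] * copies)
    return augmented

# Hypothetical (x, y) samples.
data = [(x, 2 * x) for x in range(10)]
augmented = replicate_augment(data, proportion=0.3, copies=2)
# A black-box learner would then be fit on `augmented` instead of `data`.
```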


Matt Raymond

PhD Student
Electrical Engineering & Computer Science

Positive-Unlabeled Guidance for Diffusion Sampling

Abstract

Diffusion models are often trained on data that can be categorized into classes, and a common challenge is guided sampling – drawing new samples from a specific class. This problem is well-studied when labeled training data is available, with classifier guidance and classifier-free guidance being two popular approaches. In this work, we consider the more challenging setting where only positive and unlabeled (PU) data are available, where the positive class is the one to be sampled. We introduce “prior-free” PU (PFPU) guidance, a sampling approach that can be implemented without knowledge of the positive proportion π in the unlabeled data. Compared to an existing method, PFPU guidance achieves comparable or better performance, is invariant to π, requires less positive data for accurate sampling, does not require retraining the full diffusion model, and still works when the classes have overlapping support.


Hanbin Lee

PhD Student
Statistics

Parameterizing the genetic architecture under stabilizing selection

Abstract

Recently, stabilizing selection has emerged as a central means of understanding complex trait architecture. A typical approach is to derive the distribution of genome-wide association study (GWAS) summary statistics from a population genetic model and offer a post-hoc interpretation of GWAS results that are derived from standard pipelines agnostic to the underlying evolutionary process. In this paper, we work in the opposite direction: we study how complex traits should be analyzed in the first place under stabilizing selection. To this end, we propose a general framework to derive variance component models from evolutionary principles. We show that this framework reduces to a linear mixed model with a novel frequency-dependent architecture under the equilibrium limit of stabilizing selection. Next, we account for pleiotropy and show that functional annotations act on variance components through trait-fitness genetic correlation.

Session II

March 27th, 9:00am – 10:30am
East conference rooms

Moderated by
Yuting Duan
PhD Student, Biostatistics


Chengyu Cui

PhD Student
Statistics

Beyond Vintage Rotation: Bias-Free Sparse Representation Learning with Oracle Inference

Abstract

Rotation methods are an important tool for learning sparse and interpretable low-dimensional latent representations. Despite nearly a century of widespread use across many fields, rigorous guarantees for valid inference for the learned representation remain lacking. In this paper, we identify a surprisingly prevalent phenomenon that suggests a reason for this gap: for a broad class of vintage rotations, the resulting estimators exhibit a non-estimable bias. Because this bias is independent of the data, it fundamentally precludes the development of valid inferential procedures, including the construction of confidence intervals and hypothesis testing.
To address this challenge, we propose a novel bias-free rotation method within a general representation learning framework based on latent variables.
We establish an oracle inference property for the learned sparse representations: the estimators achieve the same asymptotic variance as in the ideal setting where the latent variables are observed. To bridge the gap between theory and computation, we develop an efficient computational framework and prove that its output estimators retain the same oracle property.
Our results provide a rigorous inference procedure for the rotated estimators, yielding statistically valid and interpretable representation learning.


Abhiti Mishra

PhD Student
Statistics

Overlapping Clustering for Multivariate Functional Data

Abstract

Data from fields such as neuroscience and environmental sciences are often modeled as multivariate functional data and may belong to several clusters simultaneously. Existing clustering techniques in functional data analysis typically do not account for overlapping cluster memberships. We propose an approach based on latent factor models, with functional factors and a real-valued loading matrix that encodes cluster memberships. Under mild conditions, we establish identifiability of the loading matrix up to permutation, ensuring that the overlapping cluster structure is recoverable up to label switching. We develop a procedure for estimating both the number of clusters and the associated cluster memberships. This involves solving an infinite-dimensional regression problem in operators, whose solution can be characterized using the inner product on the space of Hilbert–Schmidt operators and expressed in terms of real-valued matrices. This framework allows asymptotic analysis, and we establish a central limit theorem to facilitate statistical inference on overlapping cluster memberships. We demonstrate the performance of our method using numerical studies and an application to functional magnetic resonance imaging data.


Tim White

PhD Student
Statistics

Distributionally Robust Neural Posterior Estimation

Abstract

Neural posterior estimation (NPE) is an amortized variational inference procedure in which a deep neural network is trained to map data from a stochastic simulator to a distribution over the simulator’s parameters. Although NPE produces accurate and well-calibrated posterior approximations when validated on data produced by the simulator, its inferences for real observations can be unreliable if the simulator’s implicit generative model is misspecified with respect to the real data-generating process. To date, efforts to address model misspecification have focused on reconciling the marginal distributions of the simulated and real observations, either by penalizing discrepancies between the two or relating them through an explicit error model. As an alternative, we frame NPE under model misspecification as a distributionally robust optimization problem that considers perturbations to the joint distribution over parameters and observations. This distributionally robust neural posterior estimation (DRNPE) procedure minimizes the worst-case loss over joint distributions that lie within an ambiguity set around the simulator. The resulting objective takes the form of an exponential tilting of the NPE loss that automatically upweights training examples with low variational density. At the DRNPE optimum, the expected forward Kullback-Leibler divergence between the true and approximate posteriors is bounded above when the true joint distribution lies within the ambiguity set, a result not generally guaranteed by NPE under model misspecification. We demonstrate that DRNPE produces credible intervals with nominal or conservative coverage on several benchmark problems where NPE is overconfident, and we characterize the robustness-efficiency tradeoff controlled by the size of the ambiguity set.
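
To make the idea of an exponentially tilted loss concrete, the sketch below uses one common form, a temperature-scaled log-mean-exp of per-example losses, which arises when the ambiguity set is defined by a KL divergence. The abstract does not state which ambiguity set or exact objective DRNPE uses, so this form, the function names, and the toy losses are assumptions for illustration only.

```python
import math

def tilted_loss(losses, temperature):
    """Exponentially tilted average: temperature * log(mean(exp(loss / temperature)))."""
    m = max(losses)
    # Subtract the max before exponentiating for numerical stability.
    mean_exp = sum(math.exp((l - m) / temperature) for l in losses) / len(losses)
    return m + temperature * math.log(mean_exp)

def tilting_weights(losses, temperature):
    """Implied per-example weights, proportional to exp(loss / temperature)."""
    m = max(losses)
    raw = [math.exp((l - m) / temperature) for l in losses]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical per-example NPE losses (negative log variational densities).
losses = [0.5, 1.0, 4.0]
weights = tilting_weights(losses, temperature=1.0)
robust = tilted_loss(losses, temperature=1.0)
```

Examples with a high loss, meaning a low variational density, receive the largest weights, and the tilted loss lies between the plain average and the maximum loss.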


Jaeshin Park

PhD Student
Industrial & Operations Engineering

A Surrogate-Based Dynamic Parameter Calibration Framework for Digital Twins

Abstract

Digital twins require accurate and timely parameter calibration to remain consistent with their physical systems under evolving operating conditions. In large-scale deployments, repeated recalibration using high-fidelity simulators is often infeasible due to computational, communication, and privacy constraints. We propose a surrogate-based offline–online dynamic parameter calibration framework that enables uncertainty-aware sequential updating without repeated simulator evaluations. The framework separates offline surrogate construction from online parameter updating, formulates calibration as a state-space estimation problem, and explicitly incorporates a discrepancy component to address offline–online mismatch. A concrete realization using B-spline regression and a discrepancy-aware dual Kalman filter is presented and demonstrated through numerical experiments and a real-world case study involving an electric vehicle climate subsystem digital twin.


Peiyao Cai

PhD Student
Statistics

Expected Shortfall Regression with Deep Neural Networks under Dependence

Abstract

Expected shortfall (ES), defined as the conditional mean of a random variable beyond a given quantile level, is a coherent and informative measure of tail risk and plays a central role in downside risk modeling for financial time series. We develop a flexible two-step deep learning framework for estimating conditional ES in nonlinear and temporally dependent settings. The procedure first estimates the conditional quantile using a general nonparametric or machine-learning method, and then fits a deep neural network to estimate the conditional ES via least squares regression with surrogate responses constructed from the first stage. We study the theoretical properties of the resulting ES estimators based on both feedforward and recurrent neural networks, allowing for weakly dependent time series data. We establish the $\ell_2$ generalization error bounds and show that the proposed estimators achieve near-optimal convergence rates under $\beta$-mixing conditions. Simulation studies and empirical applications to stock index returns and macroeconomic time series demonstrate that the proposed approach effectively captures complex tail dynamics and delivers accurate downside risk forecasts.
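
For reference, the unconditional, empirical version of expected shortfall in the lower tail can be computed as below. This is only a sketch of the definition given in the first sentence; the quantile convention, function names, and toy returns are assumptions, and the two-step neural network estimator for conditional ES is not shown.

```python
import math

def empirical_quantile(values, alpha):
    """Smallest observation with at least an alpha fraction of the data at or below it."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]

def empirical_expected_shortfall(values, alpha):
    """Mean of the observations at or below the empirical alpha-quantile."""
    q = empirical_quantile(values, alpha)
    tail = [v for v in values if v <= q]
    return sum(tail) / len(tail)

# Hypothetical daily returns.
returns = [-0.08, -0.03, -0.01, 0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.05]
es_20 = empirical_expected_shortfall(returns, alpha=0.2)
```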


Session III

March 27th, 1:00pm – 2:40pm
West conference rooms

Moderated by
Rongbo Zhu
PhD Student, IOE


Stefan Eng

PhD Student
Biostatistics

Causal Network Discovery Using Network Empirical Shrinkage Mendelian Randomization (NESMR)

Abstract

Mendelian Randomization (MR) is a causal inference method in which genetic variants are used as instrumental variables. Network MR extends the principles of MR to inference of linear structural equations describing relationships between multiple traits. Often in network MR, both network structure and effect sizes must be estimated. Previous network MR methods rely on genetic correlation (GenomicSEM and MrDAG), or take a two-stage approach, first estimating total effects and then transforming these to direct effects (Graph-cML and inspre). Neither of these approaches provides an overall model likelihood that can be used for model selection. Additionally, genetic correlation based methods cannot distinguish graph structures within the same Markov class, and it is challenging to estimate the variability of direct effect estimates using two-stage methods. We developed network empirical shrinkage Mendelian randomization (NESMR), which models the entire network simultaneously, allowing for model-based comparison of different graph structures. We identify the best fitting graph structure using an efficient Metropolis-Hastings algorithm. We show in simulations that NESMR estimates direct effects consistently, while controlling false positives at the nominal rate. Our MH algorithm reliably identifies the highest likelihood graph, with the full algorithm requiring on average 50 iterations to find the best graph for 10 traits. We apply NESMR to a set of 10 coronary artery disease (CAD) related traits. The network discovered by NESMR is consistent with results from clinical trials and physiological understanding of CAD risk. Importantly, NESMR accurately recovers known temporal ordering, placing childhood BMI earliest in the network and CAD last.


Gabriel Durham

PhD Student
Statistics

Assessing Time-Varying Peer Effects in Mobile Health Studies

Abstract

We study adaptive interventions in which group formation itself is a treatment component. Motivated by competition-based mobile health applications, we consider settings where participants are repeatedly reshuffled into new groups over time. Optimizing such interventions requires understanding how outcomes depend on the evolving composition of peers assigned at each decision point. We introduce a design-based framework for defining and assessing time-varying peer effects under repeated group assignment, providing estimands and inferential tools that directly support principled group-formation design.


Tai Yang

PhD Student
Biostatistics

PsyPRS: A Repository with Polygenic Risk Scores for Mental Health Phenotypes

Abstract

Background
Polygenic risk scores (PRS) attempt to summarize genetic susceptibility for complex traits in a single score. Their predictive performance depends strongly on both the discovery GWAS and the PRS construction method. As the number of available GWAS per phenotype grows, a key practical challenge is systematically identifying the top-performing GWAS–method combination for prediction or risk stratification models in a target cohort.

Method
We established PsyPRS, a PRS generation and evaluation workflow that integrates LLM-assisted GWAS metadata curation with downstream PRS construction built upon the GenoPred pipeline (Pain et al., Bioinformatics, 2024). From a curated panel of over 350 GWAS spanning over 40 mental health and related phenotypes, PsyPRS integrates 10 widely used PRS methods (incl. LDpred2, MegaPRS, and PRS-CSx). The workflow includes reference-based genetic ancestry inference of the target cohort and PRS adjustment to improve portability across ancestry groups and datasets. We evaluate PRS in the Michigan Genomics Initiative (N=88,971) and report standardized comparative metrics for benchmarking performance.

Results
Preliminary benchmarking for major depressive disorder (MDD) indicates that the best, though still moderate, discrimination was achieved by a PRS using MegaPRS and the 2025 Psychiatric Genomics Consortium MDD meta-analysis GWAS, attaining an AUC of 0.648 (95% CI 0.644–0.653). This PRS showed a graded risk pattern, with higher odds of MDD in the top 1% relative to the 40–60% reference stratum (OR = 2.28). Ongoing work extends this benchmarking to the full GWAS panel, producing more than 3,000 PRSs and a comprehensive set of performance metrics.

Conclusion
PsyPRS enables scalable benchmarking of GWAS–method combinations for psychiatric PRS and helps evaluate their suitability for risk prediction and stratification. We expect this resource to accelerate reproducible PRS research and cross-cohort validation.


Neo Kok

Master’s Student
Biostatistics

Impact of Missing Data and Monitoring Duration on Downstream Analyses in Continuous Glucose Monitoring

Abstract

Objective: Consensus guidelines recommend at least 14 consecutive days of CGM monitoring with 70% completeness to represent 90-day glycemic exposure. This study quantifies bias and uncertainty introduced into downstream analyses by using CGM metrics from incomplete or reduced monitoring, relative to a 90-day complete profile.
Research Design and Methods: Based on 1,010 complete 90-day CGM profiles from individuals with type 1 diabetes, we simulated incomplete profiles by varying monitoring duration (7–90 days) and data completeness (10%–100%). Consensus CGM metrics were computed on both incomplete and complete profiles to quantify measurement error. This error was propagated into two downstream regression models: (a) CGM metric is an outcome for a binary treatment (clinical trial setting); (b) CGM metric is an explanatory variable (covariate) for another continuous outcome. Bias was quantified using observed-to-true effect size ratios, and uncertainty by the sample size increase required to maintain precision.
Results: In the clinical trial setting, treatment effects remain unbiased but lose precision; for Time In Range (TIR), 14 days required 52 additional participants versus 90 days. When the CGM metric is a covariate, associations with outcome are attenuated (biased towards zero, up to 14% for TIR) and less precise (requiring 16% more participants for TIR).
Conclusions: Representing 90 days of glycemic exposure with 14 days can lead to bias and loss of precision in downstream analyses. We recommend study protocols require at least 30 days of CGM monitoring with 70% completeness. If 30 days is not feasible, studies should plan for increased sample sizes.


Jaylin Lowe

PhD Student
Statistics

Leveraging Large Language Models to Improve Precision in Randomized Controlled Trials

Abstract

Large language models (LLMs) are increasingly used in statistical research and applications. However, they are also notorious for producing unreliable or biased information. Here, we explore whether LLMs can be used to improve the precision of randomized controlled trials (RCTs) in a safe and rigorous way. Following similar work on leveraging observational data, we incorporate LLM predictions in an RCT analysis. While this method of improving precision is not new, the value of using LLM predictions in this manner is an open question. We discuss how useful LLM predictions are and how different datasets and prompts impact their usefulness.

LLM predictions add little value when the RCT already includes highly predictive covariates. However, if few such covariates exist or the data is well-suited for LLMs—like text—LLM predictions become more beneficial. Familiar, easy-to-predict outcome variables also help.

Our basic approach asks the LLM to predict outcomes for each observation, but this often produces overly similar results. Instead, we ask the LLM to compare pairs of observations and predict which will have a higher outcome. We use the selection frequency as a covariate. We can also extract additional covariates from the LLM, such as writing quality or creativity in text-based RCTs. We combine all covariates to generate a final prediction for each observation, achieving greater precision than either the single prediction or standard covariate adjustment without the LLM predictions.
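
The pairwise-comparison covariate can be sketched as follows. The `prefer_first` callable is a hypothetical stand-in for an LLM call, and the toy rule used here (longer text wins) is purely illustrative; the prompts, models, and downstream adjustment used in the work are not shown.

```python
from itertools import combinations

def selection_frequency(items, prefer_first):
    """Fraction of pairwise comparisons each item wins.

    `prefer_first(a, b)` is a hypothetical stand-in for an LLM call that returns
    True when the first observation is predicted to have the higher outcome.
    """
    wins = [0] * len(items)
    for i, j in combinations(range(len(items)), 2):
        winner = i if prefer_first(items[i], items[j]) else j
        wins[winner] += 1
    comparisons_per_item = len(items) - 1
    return [w / comparisons_per_item for w in wins]

# Toy stand-in for the LLM: "predict" that the longer essay scores higher.
def longer_text_first(a, b):
    return len(a) >= len(b)

essays = ["short", "a medium essay", "a considerably longer essay"]
freq = selection_frequency(essays, longer_text_first)
```

Each entry of `freq` could then enter the trial analysis as one covariate per observation, alongside any other LLM-derived covariates.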


Session IV

March 28th, 1:00pm – 2:40pm
East conference rooms

Moderated by
Weiyushi Tian
PhD Student, Survey and Data Science


Gabriel Ponte

PhD Student
Industrial & Operations Engineering

On the relationship between MESP and 0/1 D-Opt and their upper bounds

Abstract

We establish strong connections between two fundamental nonlinear 0/1 optimization problems coming from the area of experimental design, namely maximum entropy sampling and 0/1 D-Optimality. The connections are based on maps between instances, and we analyze the behavior of these maps.
Using these maps, we transport basic upper-bounding methods between these two problems, and we are able to establish new domination results and other inequalities relating various basic upper bounds.
Further, we establish results relating how different branch-and-bound schemes based on these maps compare.
Additionally, we observe some surprising numerical results: bounding methods that did not seem promising in their direct application to real-data MESP instances are now useful for MESP instances that come from 0/1 D-Optimality.


Prakhar Bansal

PhD Student
Physics

Statistical Tests of Cosmic Expansion Using DESI and Multi-Probe Data

Abstract

Recent results from the Dark Energy Spectroscopic Instrument (DESI) have generated excitement by providing hints of mild tensions between independent cosmological datasets and the standard cosmological model. These developments create an opportunity to apply modern statistical tools to test whether such discrepancies reflect new physical effects or arise from modeling assumptions and data systematics. In this talk, I present two complementary analyses that use Bayesian inference and likelihood-based model comparison to study late-time cosmic expansion using DESI DR2 baryon acoustic oscillation measurements combined with other cosmological datasets.
In the first project, we investigate evidence for departures from the standard $\Lambda$CDM expansion history. To improve computational efficiency, we validate a compressed representation of the CMB dataset, demonstrating that a likelihood originally defined on roughly 7,000 data points can be reduced to three summary statistics while still recovering parameter constraints with satisfactory accuracy. Using Bayesian model comparison based on maximum-a-posteriori estimates and likelihood-ratio statistics, we find a robust preference for a 3–4% enhancement in the expansion rate near redshift $z \simeq 0.7$. At the same time, we show that the current DESI data are well described by a simple two-parameter dark energy model, and that introducing a more flexible six-parameter parametrization does not lead to meaningful improvement in goodness of fit.
In the second project, we examine proposed late-time solutions to the Hubble tension by testing multiple phenomenological extensions to $\Lambda$CDM. We quantify their statistical support using Bayesian posterior sampling, goodness-of-fit metrics, and parameter-count-adjusted significance estimates. This framework not only enables principled model comparison, but also allows us to identify a common physical feature present in the best-performing models.
Together, these studies demonstrate how modern statistical tools, including Bayesian sampling, likelihood compression, and principled model comparison, are essential for extracting robust physical conclusions from next-generation cosmological datasets.


Chendi Zhao

PhD Student
Survey and Data Science

Balancing Fairness and Disclosure Limitation when Creating Synthetic Data for Survey Research

Abstract

The growing demand for open data has increased the need for data release strategies that protect respondent confidentiality while preserving statistical validity. Synthetic data, in which some original values are replaced with model-generated values, offers a promising solution by reducing re-identification risk. Partial synthetic data further targets privacy protection by selectively replacing high-risk records. However, standard synthetic data generation methods are typically optimized to preserve overall population-level statistics. When high-risk records disproportionately belong to minority or underrepresented groups, these methods can distort subgroup distributions, raising fairness concerns in downstream analyses.

This study proposes two fairness-constrained approaches for synthetic data generation within statistical modeling frameworks commonly used in survey research. The first approach uses a Bayesian framework that modifies prior distributions by penalizing discrepancies between observed and synthetic subgroup statistics. The second approach incorporates fairness constraints directly into the synthetic data generation model through offset terms. Both methods aim to improve subgroup-level accuracy while maintaining meaningful confidentiality protection.

We illustrate these methods using data from the 2023 American Community Survey Public Use Microdata Sample (ACS-PUMS), focusing on preserving health insurance coverage rates across gender, race/ethnicity, and education subgroups. We synthesize gender, race/ethnicity, and education while retaining the original insurance status, and compare the proposed methods to a standard synthetic model without fairness adjustments. Results show that both approaches substantially reduce subgroup disparities. The analyses reveal a clear trade-off between fairness and privacy: strengthening fairness constraints improves subgroup accuracy but gradually reduces privacy protection. We identify a threshold beyond which gains in fairness diminish while losses in privacy become more pronounced. Overall, the findings demonstrate that fairness-aware synthetic data generation can produce datasets that are more equitable while preserving confidentiality and analytic utility, offering practical guidance for data producers seeking to release secure and fair synthetic data.


Milad Hoseinpour Valoujaei

PhD Student
Electrical Engineering & Computer Science

Outage Identification from Electricity Market Data: Quickest Change Detection Approach

Abstract

Power system outages expose market participants to significant financial risk unless promptly detected and hedged. We develop an outage identification method from public market signals grounded in the parametric quickest change detection (QCD) theory. Parametric QCD operates on stochastic data streams, distinguishing pre- and post-change regimes using the ratio of their respective probability density functions. To derive the density functions for normal and post-outage market signals, we exploit multi-parametric programming to decompose complex market signals into parametric random variables with a known density. These densities are then used to construct a QCD-based statistic that triggers an alarm as soon as the statistic exceeds an appropriate threshold. Numerical experiments on a stylized PJM testbed demonstrate rapid line outage identification from public streams of electricity demand and price data.
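
The alarm rule described above can be illustrated with the classical CUSUM procedure, one standard quickest change detection statistic built from log-likelihood ratios. The abstract does not name the specific statistic used, and the Gaussian pre- and post-change densities, the threshold, and the toy stream below are assumptions; the densities derived via multi-parametric programming are not reproduced.

```python
import math

def gaussian_log_density(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def cusum_alarm_time(stream, pre, post, threshold):
    """Return the first index at which the CUSUM statistic exceeds the threshold, else None.

    `pre` and `post` are (mean, std) pairs for the assumed pre- and post-change densities.
    """
    statistic = 0.0
    for t, x in enumerate(stream):
        # Log-likelihood ratio of the post-change density to the pre-change density.
        llr = gaussian_log_density(x, *post) - gaussian_log_density(x, *pre)
        statistic = max(0.0, statistic + llr)
        if statistic > threshold:
            return t
    return None

# Hypothetical signal whose mean shifts from 0 to 2 at index 5.
stream = [0.1, -0.2, 0.0, 0.3, -0.1, 2.1, 1.9, 2.2, 2.0, 1.8]
alarm = cusum_alarm_time(stream, pre=(0.0, 1.0), post=(2.0, 1.0), threshold=5.0)
```

Raising the threshold lowers the false alarm rate at the cost of a longer detection delay, which is the basic trade-off in choosing an "appropriate threshold".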


Yilun Zhu

PhD Student
Electrical Engineering & Computer Science

Domain Generalization Under Posterior Drift

Abstract

Domain generalization (DG) is the problem of generalizing from several distributions (or domains), for which labeled training data are available, to a new test domain for which no labeled data are available.
For the prevailing benchmark datasets in DG, there exists a single classifier that performs well across all domains.
In this work, we study a fundamentally different regime where the domains satisfy a posterior drift assumption, in which the optimal classifier might vary substantially with domain. We establish a decision-theoretic framework for DG under posterior drift, and investigate the practical implications of this framework through experiments on language and vision tasks.


Session V

March 27th, 3:00pm – 4:30pm
West conference rooms

Moderated by
Milad Hoseinpour
PhD Student, EECS


Kate Jensen

PhD Student
Physics

Comparative Computational Models for Diffusion and Reconfiguration in DNA-Based Meso-structures

Abstract

We present a comparative computational analysis of transport and reconfiguration in programmable DNA-based meso-structures, emphasizing how distinct modeling frameworks capture different physical mechanisms. For multilayer core–shell DNA superlattices, finite element (FEM) simulations quantify mesoscale diffusion through anisotropic confinement, revealing shell-thickness–dependent kinetic barriers that decouple release rate from crystal size. Complementary coarse-grained Brownian dynamics simulations in HOOMD-Blue resolve nanoparticle trajectories and local mean-square displacements, linking microscopic random walks to effective diffusion coefficients. To probe structural reconfiguration, we apply a Monte Carlo framework for patchy hard tetrahedra with tunable color-encoded interactions and rotational degrees of freedom, capturing directional self-assembly and angular constraints. By juxtaposing continuum, particle-based, and discrete assembly models, this work highlights the nuanced strengths and limitations of each approach in describing programmable mesoscale architectures and guiding the design of reconfigurable soft-matter systems.


Jiwoo Han

PhD Student
Statistics

Maximin Relative Improvement: Fair Learning as a Bargaining Problem

Abstract

When deploying a single predictor across multiple subpopulations, we propose a fundamentally different approach: interpreting group fairness as a bargaining problem among subpopulations. This game-theoretic perspective reveals that existing robust optimization methods such as minimizing worst-group loss or regret correspond to classical bargaining solutions and embody different fairness principles. We propose relative improvement, the ratio of actual risk reduction to potential reduction from a baseline predictor, which recovers the Kalai–Smorodinsky solution. Unlike absolute-scale methods that may not be comparable when groups have different potential predictability, relative improvement provides axiomatic justification including scale invariance and individual monotonicity. We establish finite-sample convergence guarantees under mild conditions.
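As an illustration only, the relative improvement criterion described in this abstract can be written out as follows. The notation is an assumption, not taken from the talk: $R_g(f)$ denotes the risk of predictor $f$ on group $g$, $f_0$ the baseline predictor, and $R_g^\ast$ the best risk achievable for group $g$ alone.

```latex
% Relative improvement of predictor f for group g (assumed notation):
% actual risk reduction divided by the potential reduction from the baseline f_0.
\[
  \mathrm{RI}_g(f) \;=\; \frac{R_g(f_0) - R_g(f)}{R_g(f_0) - R_g^\ast},
  \qquad
  \hat f \;\in\; \arg\max_{f} \; \min_{g} \; \mathrm{RI}_g(f).
\]
```

Under this reading, multiplying a group's loss by a positive constant scales the numerator and the denominator equally, so $\mathrm{RI}_g$ is unchanged; this is one way to see the scale invariance mentioned in the abstract.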


Aaron Abkemeier

PhD Student
Statistics

pypomp: Inference for partially observed Markov process models in Python with JAX

Abstract

Methods for fitting and evaluating partially observed Markov process (POMP) mechanistic models are useful in a variety of fields such as epidemiology, ecology, and finance, but they are limited by their computational demands. In response, we introduce pypomp, a high-performance Python package for statistical inference using POMP models. Built to handle complex stochastic dynamic systems, it implements plug-and-play algorithms for likelihood-based inference including sequential Monte Carlo and iterated filtering. By utilizing parallel processing on graphics processing unit (GPU) hardware, pypomp is able to run these algorithms faster and more efficiently than its predecessor, the R package pomp. Models that took weeks to fit and evaluate in pomp now take only days in pypomp, facilitating the use of previously impractical models. The package offers a comprehensive interface that streamlines the entire pipeline of model construction, fitting, and analysis, allowing practitioners to develop and test large, complex models with minimal implementation overhead.


Ziheng Wei

Master’s Student
Statistics

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Abstract

We investigate off-policy evaluation in finite-horizon Markov decision processes when rewards are missing not at random, which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. Motivated by proximal causal inference, we further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure that avoids double sampling. Building on these identification results, we propose an FQE-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through simulations the strong performance of our method compared to existing benchmarks.


Xiong Zeng

PhD Student
Electrical Engineering & Computer Science

System Identification Under Bounded Noise: Optimal Rates Beyond Least Squares

Abstract

System identification is a fundamental problem in control and learning, particularly in high-stakes applications where data efficiency is critical. Classical approaches, such as the ordinary least squares estimator (OLS), achieve an $O(1/\sqrt{T})$ convergence rate under Gaussian noise assumptions, where $T$ is the number of samples. This rate has been shown to match the lower bound. However, in many practical scenarios, noise is known to be bounded, opening the possibility of improving sample complexity. In this work, we establish the minimax lower bound for system identification under bounded noise, proving that the $O(1/T)$ convergence rate is indeed optimal. We further demonstrate that OLS remains limited to an $\Omega(1/\sqrt{T})$ convergence rate, making it fundamentally suboptimal in the presence of bounded noise. Finally, we instantiate two natural variations of OLS that obtain the optimal sample complexity.


Session VI

March 27th, 3:00pm – 4:30pm
East conference rooms

Moderated by
Yanlin Tong
PhD Student, Biostatistics


Yalei Zhao

Master’s Student
Biostatistics

Kernel-Weighted Spline Smooth Models for Spatial Cell Colocalization

Abstract

Understanding how spatial proximity between cell types influences cellular behavior is a central question in spatial transcriptomics, yet many existing approaches rely on fixed distance thresholds or coarse summary measures. We develop a kernel-weighted spline modeling framework to quantify distance-dependent spatial cell colocalization in a flexible and interpretable manner. For each focal cell, kernel-smoothed local densities of neighboring cell types are constructed as continuous functions of intercellular distance and incorporated into a generalized linear model via penalized spline smooths. This formulation allows spatial effects to vary smoothly across distance while controlling model complexity. Applying the proposed method to cerebellar spatial transcriptomics data reveals structured, non-monotonic distance effects that are not captured by traditional colocalization metrics.


Mengqi Lin

PhD Student
Statistics

Characterizing Identifiability in Boolean Graphical Models

Abstract

Boolean graphical models, including prominent subfamilies such as Boolean matrix decompositions and cognitive diagnosis models, find broad applications ranging from social sciences to engineering. Despite their flexibility, a key challenge lies in establishing the identifiability of their graphical structures, which specify how latent variables influence observed variables. Existing identifiability conditions typically rely on the strong assumption of pure nodes, which may be unrealistic in many applications. We develop a novel approach leveraging the Hasse diagram to represent the distribution of observed variables and transform identifiability into a graph isomorphism challenge. Based on this, we establish sufficient and necessary graphical identifiability conditions that do not require pure nodes. We further derive equivalent algebraic conditions and develop an efficient Boolean satisfiability (SAT)-based verification algorithm. Our results substantially broaden the class of identifiable and interpretable Boolean graphical models by removing the pure-node requirement, yielding new theoretical insights, while also providing practitioners with a concrete and easily implementable tool to assess model identifiability.


Mohammad Aamir Sohail

PhD Student
Electrical Engineering & Computer Science

A Quantum-Like Framework for Modeling and Inferring Gene Regulatory Networks

Abstract

We introduce a novel approach for modeling and inferring gene regulatory networks from gene expression data, called quantum-like gene regulatory network inference, grounded in the axioms of quantum information science. We present a new Hamiltonian-learning framework based on time-resolved measurement data from a fixed local informationally complete positive-operator-valued measure (IC-POVM) quantum measurement, and its application to inferring gene regulatory networks (GRNs). We introduce the quantum Hamiltonian-based gene-expression model (QHGM), in which gene interactions are encoded as a parameterized Hamiltonian that governs gene expression evolution over pseudotime, an approximate analogue of physical evolution time. We construct a Hamiltonian for GRNs by defining biologically interpretable interaction terms using tensor products of computational basis states. In addition, we design an IC-POVM whose outcomes provide a discrete representation of gene expression in cells. To recover the QHGM parameters, we develop a scalable variational learning algorithm based on empirical risk minimization. We derive finite-sample recovery guarantees and establish upper bounds on the number of time and measurement samples required for accurate inference, which scale polynomially with network size. Our algorithm efficiently recovers network structure with over 95% accuracy across synthetic benchmarks, spanning a wide range of sample sizes and network sparsity levels, outperforming classical state-of-the-art methods. Moreover, the QHGM reveals novel, biologically plausible regulatory connections in single-cell RNA sequencing data from glioblastoma, an aggressive form of brain cancer, highlighting its potential in cancer research. This framework opens new directions for applying quantum-like modeling to biological systems beyond the limits of classical inference.


Feifan Jiang

PhD Student
Statistics

Statistical Inference for Latent Space Models on Network Data with Edge Covariates

Abstract

Latent space models (LSMs) provide a powerful framework for analyzing network data by embedding nodes in a latent space. Incorporating covariate information via edge covariates offers an important generalization that strengthens both the interpretability and practical utility of the model. However, we find that coefficient estimates obtained through maximum likelihood estimation exhibit asymptotic bias due to third-order geometric effects and errors in latent variable estimation. To address this issue, we propose a plug-in bias correction estimator that enables valid, unbiased statistical inference for effects of edge covariates. We establish theoretical guarantees, including consistency and asymptotic normality, under various network structures. Extensive simulations and real-world data analysis demonstrate that our method effectively reduces estimation bias and improves the accuracy of inference. Our findings contribute to the statistical methodology for LSMs, providing a principled framework for unbiased parameter estimation in network models with edge covariates.


Chia-Yu Hsu

PhD Student
Electrical Engineering & Computer Science

Classifier-Based Anytime-Valid Hypothesis Testing

Abstract

In this paper, we propose a classifier-based anytime-valid test for sequential hypothesis testing. We study a composite problem with a simple null and multiple alternatives: \[H_0: P_0, \qquad H_1: \{P_\theta : \theta \in \{1,\ldots,L\}\},\] where $L\ge 1$. Our main contribution is an anytime-valid procedure that attains a level-$\alpha$, power-one guarantee in the sense of Darling and Robbins (1968) by leveraging an offline-trained classifier. The classifier is trained using samples drawn from $\{P_\theta\}_{\theta=0}^{L}$ and is then used to construct an anytime-valid sequential test. We also provide an analysis of the expected stopping time of the proposed procedure. Moreover, we show that our test is capable of identifying the true underlying distribution almost surely. Empirically, we verify our theoretical results and demonstrate that the test remains effective under mismatches between the training and testing data distributions. To the best of our knowledge, this is the first work to develop a single-stream sequential anytime-valid hypothesis test that does not require any structural assumption for the underlying data distributions.


Poster

March 27th, 5:00pm – 6:15pm
Assembly Hall


Chendi Zhao

PhD Student
Survey and Data Science

Impact of Data Disclosure Methods on Small Area Estimation

Abstract

The public release of microdata poses substantial privacy risks, prompting statistical agencies to adopt disclosure limitation methods such as data perturbation, which aims to reduce re-identification risk while preserving key statistical properties. Although prior research has examined the effects of perturbation on general analytic validity, much less is known about its implications for small area estimation (SAE), which relies heavily on auxiliary covariates and is therefore particularly sensitive to distortions introduced by perturbation. This study evaluates how three commonly used perturbation methods (random swapping, the post-randomization method, and synthetic data generation) affect the accuracy of area-level estimates produced by SAE. Using data from the American Community Survey Public Use Microdata Sample, we perturb key demographic and socioeconomic covariates and estimate average household income and poverty rates using the Fay–Herriot model. Perturbations are applied under multiple disclosure risk thresholds and at different geographic scales, and a 100% synthetic condition in which all covariates are synthesized is also examined. Results show that SAE accuracy depends strongly on both the proportion of perturbed records and the geographic scale at which perturbation is implemented. The post-randomization method provides the greatest confidentiality protection but is associated with notable losses in estimation accuracy, whereas random swapping and synthetic data generally achieve more favorable privacy–utility trade-offs. Notably, the 100% synthetic condition produces highly accurate and stable SAE estimates across geographic levels. These findings offer practical guidance for agencies seeking to balance respondent confidentiality and analytic utility when releasing microdata for small area estimation.


Tom Liu

PhD Student
Biostatistics

Spatially-Regularized Cell Type Deconvolution via Total Variation Penalized Likelihood

Abstract

Spatial transcriptomics profiles gene expression with spatial context, but per-spot cell-type deconvolution ignores spatial coherence and yields noisy maps. We formulate spatial deconvolution as an empirical Bayes penalized-likelihood problem, combining a flexible spot-level likelihood (building on our previous framework) with a weighted anisotropic total-variation penalty as a prior on spatial gradients, with edge-specific hyperparameters learned from data. To stabilize estimation, we optionally compress correlated cell types into a low-rank dictionary. We solve this via alternating optimization: FISTA for mixture coefficients and dictionary, and linear programming for adaptive edge weights. A smooth soft-L1 approximation to total variation yields efficient gradient-based updates. On mouse cerebellum Slide-seq data, our method recovers known laminar organization with sharp boundaries. We provide an open-source R/Rcpp implementation with practical runtimes and results closely matching convex baselines.


Owen Yoo

Undergraduate Student
Statistics

Temporally Preserving Wearable Heart Rate Metrics for Cardiovascular Outcomes

Abstract

Wearable devices, such as Fitbits and Apple Watches, enable the collection of heart rate (HR) data at frequent intervals, providing opportunities for both personal management and applications within clinical and research settings. However, common approaches tend to reduce these temporally rich data to simple aggregate summaries. To address the loss of temporal information in wearable HR data, we first introduce new temporal metrics and a workflow, motivated by continuous glucose monitoring, through the R package ihr. ihr provides quantification and visualization of HR trends and variability, developing a framework for investigating the complexity that arises from information collected from wearable devices. We then investigate the information that ihr metrics provide in the context of cardiovascular health in the NIH’s All of Us research program’s Fitbit data (n = 19038). Our results suggest that the Mean Amplitude of Heart Rate Excursion (MAHE), which averages fluctuation in an individual’s HR activity, is significantly associated (hazard ratio: 1.045, 95% CI: 1.013, 1.078) with the recurrent risk of myocardial infarction and stroke in a Cox proportional hazards model. Our analysis also models the historically utilized Heart Rate Reserve and Resting Heart Rate in separate Cox models, which show no significant association, illustrating the strength of our new metrics. By enabling a novel use of heart rate data, ihr and its associated temporally preserving metrics provide an opportunity to support new research directions in mobile health and improve the integration of wearable devices into clinical practice.


Evelyn Brodeur

Undergraduate Student
Statistics

Cumulative Gain Metrics for Survival Prediction in Organ Transplantation

Abstract

Cumulative gain metrics have long been used as measures of performance for classification models with binary health outcomes, describing the ability to identify high-risk groups from broader patient populations. Here, we develop novel extensions of cumulative gain metrics for time-to-event data structures to evaluate the efficacy of survival models in predicting the mortality risk of organ transplant candidates. Risk scores such as the Model for End-Stage Liver Disease (MELD) and Estimated Post-Transplant Survival score (EPTS) are utilized in clinical settings to decide the priority of organ recipients and predict long-term survival of patients with chronic disease. The validation of these models is essential for ensuring accurate treatments, reducing deaths on transplant waitlists, and reliably estimating patient outcomes. To assess these survival scores, we develop three versions of cumulative gains for survival data. The first method binarizes the survival outcome at a selected timepoint. Our second approach derives a new cumulative gains formula in terms of estimated survival functions. The third method uses a sequence of longitudinal binary outcomes to assess time-dependent risk scores. These methods vary in their handling of censored data, response to changes in risk, and ability to discriminate events from non-events. We apply each approach to national liver transplant registry data and calculate the proportion of total possible gains captured to analyze the performance of recently updated MELD scores and transplant policy.


Sam Rosenberg

PhD Student
Statistics

Improved Sensitivity Analysis of Weak Nulls in a Matched Observational Study of Gonadectomy Outcomes in Golden Retrievers

Abstract

Motivated by a matched observational study of the effect of early-life gonadectomy on musculoskeletal and sarcoma outcomes in golden retrievers, we develop a novel statistic to address a common problem in observational studies with binary outcomes: when testing a weak null hypothesis, the reported robustness of such an observational study drastically drops relative to testing a sharp null hypothesis. To address this, we present a reweighting technique in which we reweight a given statistic on a stratum-wise basis with weights dependent on a given level of unmeasured confounding. This reweighting procedure anticipates and mitigates the impact of pathological but feasible allocations of unobserved potential outcomes that can impede the robustness of a sensitivity analysis. For the setting of testing weak null hypotheses on the risk difference with matched pair data, we derive an analytic comparison between the design sensitivity of the prevailing approach and our proposed reweighting procedure, as well as a closed-form design sensitivity for the conventional test statistic. We demonstrate through a combination of real data examples and simulations that the robustness of the reweighting procedure carries across datasets, to the more general variable ratio matching setting, and to other parameters such as the risk ratio. By applying this reweighting procedure, one can often strengthen the reported robustness of results from matched observational studies with binary outcomes. Our analysis finds that many of the existing findings in the literature remain true after accounting for both observed covariates through matching and unobserved confounders through sensitivity analysis, providing further credence for the current clinical recommendations on age of gonadectomy in male and female golden retrievers.


Cecilia Pirozzi

Undergraduate Student
EEB/Math

Modeling the relationship between cognitive performance, behavior, and fitness in wild wasps

Abstract

Cognition has major effects on the lives of animals. We are all familiar with cognitive processes – learning, memory, and problem-solving often guide our own behavior. Importantly, individuals of every species vary in their cognition and behavior. Such variation is of interest to evolutionary biologists because it likely directly or indirectly influences survival and reproduction (i.e., fitness). Polistes paper wasps provide a unique opportunity to study the fitness consequences of individual variation in cognition because they have short lifespans, large social groups, and live on open nests, which together facilitate the study of natural behavior and reproductive success in the field. Polistes also show individual variation in multiple cognitive tests in the laboratory. However, establishing links between cognitive performance, behavior, and fitness is statistically challenging. Our lab measures multiple areas of cognition and documents natural behavior and reproductive success of hundreds of wild wasps. The resulting quantitative data are difficult to analyze and interpret because they include multiple independent and dependent variables. The aim of this project is to employ structural equation modeling (SEM) in a novel way to determine the strength and direction of the interactions between cognitive, behavioral, and fitness variables. Specifically, we will use path analysis to include the direct and indirect effects to better understand how variables may mediate each other. This model takes into account that variables can be both a predictor and an outcome, allowing for a more complex analysis of interactions. Our approach will help elucidate the effects of cognition and behavior on fitness, and it will provide context for understanding how cognitive traits evolve.


Anika Misra

Master’s Student
Computer Science

Multimodal Diffusion Classification for Hate Meme Detection

Abstract

In today’s digital world, memes are widely used to express humor and personal views, but they can also be used to spread hate and harm. Due to the complex interactions between textual and visual modalities, classifying and detecting hate memes is a challenging problem. In this project, we study the statistical contribution of visual information to hate meme classification in a multimodal setting. We frame hate meme detection as a binary classification problem (1 for hateful, 0 for non-hateful) and investigate how image-based signals can provide measurable predictive value beyond textual features alone. Specifically, we combine a diffusion-based probabilistic image classifier with a text-based binary classifier in a weighted ensemble framework.

Compared to a text-only baseline, we find a modest but consistent improvement in classification accuracy, despite the image classifier exhibiting relatively weak performance on its own. The result suggests that even weak individual classifiers can contribute to a stronger ensemble method when weighted correctly. Our findings highlight the importance of statistical model combination and weighting strategies in multimodal classification, as well as the dominance of textual information in hate meme detection. In general, these methods may be used for automated content moderation to promote safety in online spaces, and also for understanding interactions between modalities in statistical learning.
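For readers unfamiliar with weighted ensembles, the sketch below shows one simple way two classifier probabilities could be combined. It is not the authors' implementation: the function names, the convex-combination rule, the default weight of 0.8 on the text model, and the 0.5 decision threshold are all assumptions made for illustration.

```python
def ensemble_probability(p_text: float, p_image: float, weight_text: float = 0.8) -> float:
    """Convex combination of two predicted probabilities of the 'hateful' class.

    weight_text is a hypothetical tuning parameter; the remaining weight
    (1 - weight_text) goes to the image classifier.
    """
    if not 0.0 <= weight_text <= 1.0:
        raise ValueError("weight_text must lie in [0, 1]")
    return weight_text * p_text + (1.0 - weight_text) * p_image


def classify(p_text: float, p_image: float,
             weight_text: float = 0.8, threshold: float = 0.5) -> int:
    """Return 1 (hateful) if the combined probability reaches the threshold, else 0."""
    combined = ensemble_probability(p_text, p_image, weight_text)
    return 1 if combined >= threshold else 0
```

In this toy setup a borderline text score can be tipped either way by the image score, which is the sense in which a weak second classifier might still add predictive value; in practice the weight would be chosen on validation data.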


Ethan Schubert

PhD Student
Statistics

Efficient Weighted Low Rank Estimation with Multivariate Correlated Outcomes

Abstract

In multivariate regression, low-rank constraints are often applied when it is believed the conditional mean structure follows a low-dimensional subspace. Doing so is a form of regularization that improves estimation precision and facilitates interpretation. When the sampling covariance is separable between covariates and outcomes, the low-rank structure can be efficiently estimated via singular value decomposition (SVD) of the fitted values. However, in non-linear models or under heteroscedasticity or non-independence, separability does not hold and the SVD may become inefficient. We propose a two-stage framework for efficiently estimating the low-rank conditional mean structure under non-separability, with the mean and covariance structures estimated using generalized estimating equations (GEE) followed by one of two low-rank approximation algorithms. The first algorithm is a weighted low-rank approximation algorithm (WLRA) which directly optimizes a loss function using all covariance information. The second algorithm utilizes a Kronecker approximation to the non-separable sampling covariance matrix, allowing the SVD to be partially adapted to the covariance information. Through a simulation study, we demonstrate both approaches surpass existing methods when the observations are correlated. We also find that the Kronecker approximation algorithm is more stable and computationally efficient. We then apply these methods to growth data from the Dogon longitudinal study.


Joe Pennacchio

PhD Student
Statistics

On the Benefits of Long-Range Dependence and Multi-View Methods for Spectral Classification

Abstract


As microplastics become increasingly prevalent in our environment, it has become critical to identify and classify them so that society can work towards reducing their prevalence. Our focus in this work is on selecting the optimal model framework for spectral classification. Although we primarily focus on our simultaneously obtained IR and Raman lab spectra, we also plan to consider additional lab and/or environmental datasets. The present literature primarily focuses either on a few traditional approaches (KNN, PLS, SVM, etc.) or on a select deep learning approach (but not both); few works comprehensively compare across methods or emphasize the deeper intuition behind why specific methods are well suited to spectra. We first consider some of the traditional methods, starting with Hit-Quality Index, which is a library-matching approach that resembles KNN. Next, we consider deep learning architectures. A combination of long-range dependence, across wavenumbers, and locality, within flat regions and peaks, should be incorporated within the model for optimal classification. Transformers emphasize the former, and CNNs emphasize the latter. We demonstrate that combining the strengths of both within a reshaped 2D CNN or a Hybrid CNN/Transformer architecture performs best. We also consider a variety of other deep learning architectures (autoencoders, SWIN Transformer, etc.). Finally, we leverage multi-view methods that jointly consider IR and Raman spectra, and we obtain an incremental improvement using these methods. We consider all of the above models both when training solely with the original spectra and also when training with augmented copies thereof (which further improves classification accuracy).



Jialin He

PhD Student
Statistics

Modeling hyper-graphs via a two-stage model

Abstract

Recent research has shown growing interest in modeling hypergraphs, which capture polyadic interactions among entities beyond traditional dyadic relations. In many applications, entities in the hypergraph are often observed along with clustering structures and covariates, such as in clinical phenotyping and social network analysis. In these settings, jointly capturing group-level cluster patterns and individual attributes is essential for reliable inference. However, most existing latent space models treat latent positions as independent parameters and do not incorporate observed covariates into the latent space. This results in limited interpretability and poor generalization to new subjects.

To address these challenges, we propose a novel latent embedding framework. The model integrates cluster-specific intercepts and individual heterogeneity parameters with a bilinear mapping from observed covariates to latent embeddings. Parameters are estimated via maximum likelihood using an efficient projected gradient descent algorithm. We investigate the identifiability conditions for the latent embeddings and associated parameters, and we establish the consistency and asymptotic normality of the proposed estimators. Extensive simulation studies and real data demonstrate the computational efficiency of the algorithm and validate the theoretical results.


Yizhou Gu

PhD Student
Statistics

Subgroup Inference on Network-Linked Data

Abstract

We propose a novel approach to perform subgroup analysis for network-linked data. Subgroup analysis is an important task in clinical trials and social experiments. Especially when the experiment is costly and no global effect was found, it is of interest to identify a subgroup that potentially responds differently to the experimental treatment. In this paper, we develop a method to incorporate previously underutilized network information in subgroup analysis, conduct formal inference on subgroup identification, and find factors predictive of subgroup membership, be it covariates $X$ or network information. The utility of this method is demonstrated through an application to microfinance data, showing the effect of network information in subgroup analysis.


Daniel Zou

PhD Student
Statistics

Optimizing Score Thresholds in Score-Explained Heterogeneous Treatment Effect Models

Abstract

We investigate the setting of score-explained heterogeneous treatment effect models, specifically those where the treatment assignment is determined by a threshold on some score. Recent work in this area focuses on estimating average treatment effects or heterogeneous treatment effects. Instead, we are interested in optimizing the score threshold to maximize utility for the organization. We propose an algorithm for this optimization and establish asymptotic normality results for our estimator.


Maria Fields

PhD Student
IOE

Cultural Aspects of General Artificial Intelligence

Abstract

Cultural attitudes and societal norms greatly influence the acceptance and utilization of new technologies such as Generative Artificial Intelligence (GenAI). This study compares opinions toward GenAI between three large Spanish-speaking countries (Mexico, Spain, Argentina) on three continents and four large English-speaking countries (the US, UK, India, and Nigeria) on four continents. One hundred forty-two text sources discussing GenAI in these countries were collected via the internet. Thematic analysis of these texts was used to identify their major themes, and the results revealed the top 8 themes for the Spanish-speaking countries and the English-speaking countries, respectively. The results showed similarities and differences both between Spanish- and English-speaking countries and within each language group. This research provides concrete evidence that it is important to consider cultural aspects of GenAI as one of the human factors issues that need urgent attention.


Chang Li

PhD Student
IOE

Active Transfer Learning for Efficient Emulation of Vehicle Crash Computer Models

Abstract

Emulators of vehicle crash computer models are essential for evaluating injury responses across diverse crash scenarios. Injury responses often exhibit a shared, interpretable trend across vehicle types (e.g., SUVs, sedans) driven by underlying human biomechanics; however, substantial heterogeneity persists due to vehicle-specific designs (e.g., restraint systems) and structural differences. This combination of cross-vehicle subject-level commonality and vehicle-specific deviation motivates efficient emulation for new vehicles. Rather than rebuilding a surrogate and its design of experiments from scratch for each vehicle, we aim to reuse transferable information while strategically sampling only what is needed to learn the new vehicle’s unique effects. In this research, we develop an active transfer learning framework that leverages existing simulation data from a prior vehicle to accelerate emulation for a new vehicle. We adopt an additive model that decomposes the injury-response process into (i) a transferable subject-specific component that captures cross-vehicle physiological trends and (ii) a non-transferable vehicle-specific component that captures design-induced deviations. Building on this structure, we design a sampling strategy that prioritizes informative simulations for the new vehicle while reusing knowledge learned from the prior vehicle. A case study demonstrates that the proposed approach can substantially reduce the number of subject evaluations required to achieve a target prediction accuracy for the new vehicle. These results highlight the potential of transfer-enabled sampling for vehicle crash computer models to improve the efficiency and scalability of virtual testing in future vehicle safety development workflows.


Léo Laborieux

PhD Student
Ecology and Evolutionary Biology

Phylogeny, ecology, and allometry: hierarchical controls of insect elemental composition

Abstract

The elemental composition of organisms links physiology to biogeochemical cycles, reflecting interdependent ecological, physiological, and evolutionary determinants whose relative contributions remain poorly quantified. Integrating a global stoichiometric database of 223 species with a manually curated phylogenetic backbone, we demonstrate hierarchical control of body composition in insects. We use phylogenetic generalised least-squares regression averaged over 1000 alternate evolutionary scenarios and partition explained variance to show that phylogeny accounts for ~50% of variation in C:N:P ratios, followed by temperature, body mass, trophic group and habitat. Our work offers a general framework for the statistical disentanglement of evolutionary, ecological, and physiological influences on integrative organismal traits, highlighting new avenues to link macroevolutionary patterns to underlying processes.

Muskaan Mittal

Undergraduate Student
Computer Science and Electrical Engineering

SCaMEr: Identifying Strategic Misreporting Rates in the Absence of Ground Truth Data

Abstract

Strategic agents are often incentivized to misreport their features to obtain favorable outcomes from machine learning models. While prior research utilizes causal inference to estimate misreporting rates for binary features, these existing methods rely on the restrictive assumption of having access to a ground truth dataset. In this work, we introduce SCaMEr, a stitched causal misreporting estimator that allows us to relax this requirement by leveraging two datasets with directional misreporting: one where agents misreport features in only one direction, and another where they only misreport in the opposite direction. We give the conditions under which the misreporting rate is identifiable using causal effect estimation by integrating these two complementary data sources. For scenarios where exact identification is not possible, we provide sensitivity analysis bounds for the misreporting rate. We empirically validate our findings using a real-world Medicare dataset, demonstrating that our method can serve as an auditing tool to help the U.S. government recover billions of dollars annually from insurers’ misreporting of medical diagnoses.

Steven Leone

Master’s Student
Statistics

Bayesian Methods for Inference on Wildfires in Australia and the US

Abstract

Wildfires in the United States and Australia burn upwards of ten million acres of land annually. Emitted carbon dioxide and smoke aerosol from wildfires damage the environment, and more than 46 million acres burned in the 2019-2020 Australian wildfire season. Predicting and forecasting wildfires can allow for proactive resource allocation, better public safety, and better decision making to mitigate these damages. NASA’s Fire Information for Resource Management System (FIRMS) contains real-time data on fires across the globe, including Australia and other regions. 153,916 and 156,417 fires were captured through NASA’s MODIS instrument in 2020 in the United States and Australia, respectively. Statisticians and other researchers have shown that we can successfully predict wildfires from various predictors, including vegetation data. Other studies show that dry conditions and human activity are the leading causes of wildfires. We create hierarchical models to predict wildfires given land and atmospheric conditions to assess and compare likely causes across continents. Hierarchical models allow us to take advantage of interpretable coefficients, so that we can determine the best predictors of wildfires and study the transferability of wildfire-climate relationships across continents. We apply a small amendment to the Hamiltonian Monte Carlo method, introducing a severity weight to errors after taking the gradient to give us further insights into which predictors lead to the brightest wildfires. In total, we assess a Spatial Hierarchical Model, a Brightness Sensitive Spatial Hierarchical Model, and a Mixed Effects Hierarchical Model.
Finally, we assess our models against the state of the art, achieving 100% and 88.7% accuracy in wildfire prediction with a Bayesian multilevel (hierarchical) logistic regression model in Australia and the United States on complete data, respectively, an improvement over the 96.9% accurate Bayesian logistic regression model in Australia in 2022, and outperforming deep neural networks with up to 89.4% accuracy trained on similar data in 2025.

Yuqi Sun

Undergraduate Student
Computer Science and Electrical Engineering

Road Sign Classification with Denoising Pipeline Approach

Abstract

Reliable traffic sign recognition is critical for autonomous vehicles. However, real-world images are affected by noise, which degrades image quality. This research proposes a denoiser-classifier pipeline approach, which uses a pre-trained color image denoiser, and compares it against an end-to-end approach (a single classification model). This paper evaluates the image recognition performance of the pipeline approach and the single-model approach on the German Traffic Sign Recognition Benchmark (GTSRB) dataset under Gaussian noise with standard deviation in pixel value units from 0 to 100. Tests on CNN and ResNet-18 show that the pipeline improves CNN models’ performance under moderate noise but offers limited to no gains for ResNet-18, which generally performs better end-to-end. This study provides insights into the importance of model choice for noise-resilient recognition systems.
Keywords—Classification, Pipeline, Noise, Image Processing, Computer Vision
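
As a rough illustration of the noise setting in the abstract above, the sketch below adds zero-mean Gaussian noise with a standard deviation given in pixel value units. It assumes 8-bit pixels in the range 0 to 255 and clips the result back into that range; the function name `add_gaussian_noise`, the toy pixel list, and the fixed seed are hypothetical and not taken from the study.

```python
import random


def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise (std `sigma`, in pixel units) to a flat list of pixels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    noisy = []
    for value in pixels:
        perturbed = value + rng.gauss(0.0, sigma)
        # Assumption: 8-bit pixels, so round and clip back to [0, 255].
        noisy.append(min(255, max(0, round(perturbed))))
    return noisy


image = [0, 64, 128, 192, 255]                 # hypothetical toy "image"
clean = add_gaussian_noise(image, sigma=0)     # sigma = 0 leaves pixels unchanged
noisy = add_gaussian_noise(image, sigma=100)   # upper end of the range in the abstract
```

In a pipeline setting, `noisy` would be passed to the denoiser before classification, while the end-to-end model would classify it directly.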

Manan Arora

Master’s Student
College of Engineering

Physics-Informed Machine Learning in Environmental Modeling: A Review for Climate Change Mitigation

Abstract

The evolution of environmental modeling has seen a transformative shift with the advent of physics-informed machine learning (PIML). This approach bridges the gap between traditional numerical methods and pure data-driven models, offering enhanced accuracy and efficiency in simulating complex environmental systems. PIML incorporates physical laws into neural networks, addressing limitations of both conventional and purely statistical approaches.
This paper presents a comprehensive cross-domain analysis of PIML applications, from groundwater dynamics to wildfire propagation, air pollution prediction to ocean temperature modeling. Our review reveals three key findings: First, PIML approaches consistently demonstrate superior performance in extrapolation scenarios where future conditions differ from historical observations—a critical capability for climate change modeling. Second, the unified mathematical framework of Physics-Informed Neural Networks (PINNs) enables integration of processes operating at different spatial and temporal scales, creating more comprehensive models for climate assessment. Third, while conventional neural networks may achieve lower errors on specific training distributions, physics-informed models deliver superior noise rejection, physical consistency, and temporal stability—essential properties for reliable climate projections.
We place special emphasis on PIML’s crucial role in climate change mitigation efforts, an aspect often overlooked in individual studies. Our specific contributions include a detailed examination of PIML implementation across environmental domains, a practical workflow for implementing PINNs, and a roadmap addressing challenges in standardization, computational scalability, and multiscale modeling. Through case studies ranging from hydrological forecasting in ungauged regions to accelerated wildfire propagation simulation, we demonstrate how these hybrid approaches can significantly advance our environmental modeling capabilities while requiring less computational resources than traditional methods.

Haley Gipson

Undergraduate Student
Physics

Exploring Empirical Models for the Galaxy–Halo Connection

Abstract

We investigate empirical models of the galaxy–halo connection, which describes the relationship between galaxies and their host dark matter haloes. Dark matter haloes form hierarchically and define the large-scale structure of the universe. Because galaxies form and evolve within these haloes, their properties are expected to be closely correlated with halo properties. Galaxy clustering is therefore a key observational probe of the galaxy–halo connection. While cosmological theory robustly predicts the clustering of dark matter from first principles, no comparably predictive framework exists for galaxy clustering, necessitating additional modeling of how galaxies trace the underlying dark matter distribution.

Empirical models address this challenge by statistically linking observable galaxy properties to dark matter haloes in cosmological simulations. In this work, we focus on extensions to abundance matching, a widely used technique that reproduces galaxy clustering by matching the most massive galaxies to the most massive haloes. Conditional abundance matching extends this framework by incorporating a secondary halo and galaxy property, such as scale factor and galaxy color, that is known to strongly correlate with clustering. We explore the MultiCAM framework, which generalizes this approach by allowing multiple halo properties, including the full mass accretion history of each halo, to jointly inform galaxy–halo assignments. We assess the ability of MultiCAM to provide a more flexible and physically motivated model of galaxy clustering and evaluate its potential for improving constraints on the galaxy–halo connection.

Xiaoyu Qiu

PhD Student
Statistics

Uncertainty Quantification of Plug-and-Play Diffusion Priors for Inverse Problems

Abstract

Plug-and-play diffusion priors (PnPDP) have become a powerful paradigm for solving inverse problems in scientific and engineering domains. Yet, current evaluations of reconstruction quality emphasize point-estimate accuracy metrics on a single sample, which do not reflect the stochastic nature of PnPDP solvers and intrinsic uncertainty of inverse problems, critical for scientific tasks. This creates a fundamental mismatch: in inverse problems, the desired output is typically a posterior distribution and most PnPDP solvers induce a distribution over reconstructions, but existing benchmarks only evaluate a single reconstruction, ignoring distributional characterization such as uncertainty. To address this gap, we conduct a systematic study to benchmark the uncertainty quantification (UQ) of existing diffusion inverse solvers. Specifically, we design a rigorous toy model simulation to evaluate uncertainty behavior of various PnPDP solvers, and propose a UQ-driven categorization. Through extensive experiments on toy simulations and diverse real-world scientific inverse problems, we observe uncertainty behaviors consistent with our taxonomy and theoretical justification, providing new insights for evaluating and understanding the uncertainty for PnPDPs.

Santosh Desai

Master’s Student
LSA

State-Aware Sequential Conformal Prediction for Nonstationary Time Series

Abstract

Conformal prediction provides distribution-free coverage guarantees under exchangeability, but time series violate this through temporal dependence and abrupt regime shifts. Recent sequential methods—Adaptive Conformal Inference (ACI) and Sequential Predictive Conformal Inference (SPCI)—adapt via recency-based weighting, yet fail to leverage latent regime structure inherent in many datasets and degrade when structural breaks render temporal decay unreliable.

We introduce a state-aware sequential conformal framework integrating Hidden Markov Models (HMMs) with weighted empirical quantile estimation. Motivated by exact validity results for HMM-based conformal prediction (Nettasinghe et al., ICML 2023) and state-aware change-point adaptations (arXiv:2509.02844), our method pairs fast point forecasters (ETS/ARIMA, LightGBM on lags) with absolute residual scores $s_t = |y_t - \hat{y}_t|$. Prediction intervals use $C_{t+1}(\alpha) = [\hat{y}_{t+1} \pm q_{t+1}(1-\alpha)]$, where $q_{t+1}$ derives from a weighted empirical CDF with similarity-in-state weights: $w_i(t+1) \propto \sum_k \mathbb{P}(z_{t+1}=k \mid \mathcal{H}_t) \pi_i(k)$, emphasizing past residuals from regimes matching the predicted next state.
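
The interval construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the predicted next-state probabilities and the per-residual state posteriors $\pi_i(k)$ are already available from a fitted HMM, and it omits the extra point mass that some weighted conformal schemes place at infinity. All function names and the toy numbers are hypothetical.

```python
def state_weights(next_state_probs, past_state_posteriors):
    """w_i proportional to sum_k P(z_{t+1}=k | H_t) * pi_i(k), normalized to sum to 1."""
    raw = [
        sum(p * q for p, q in zip(next_state_probs, posterior))
        for posterior in past_state_posteriors
    ]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative weight reaches `level` (weighted empirical CDF)."""
    pairs = sorted(zip(scores, weights))
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative >= level - 1e-12:  # small tolerance for float round-off
            return score
    return pairs[-1][0]


def prediction_interval(y_hat, scores, weights, alpha):
    """Symmetric interval y_hat +/- q(1 - alpha) built from absolute residual scores."""
    q = weighted_quantile(scores, weights, 1.0 - alpha)
    return (y_hat - q, y_hat + q)


# Hypothetical toy data: two calm-regime residuals and two volatile-regime residuals.
scores = [0.1, 0.2, 3.0, 4.0]
posteriors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

calm = prediction_interval(10.0, scores, state_weights([1.0, 0.0], posteriors), alpha=0.1)
volatile = prediction_interval(10.0, scores, state_weights([0.0, 1.0], posteriors), alpha=0.1)
```

When the next state is predicted to be calm, only the small residuals carry weight and the interval is narrow; when it is predicted to be volatile, the interval widens to reflect the large residuals.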

To balance model-based HMM structure within model-agnostic conformal frameworks, we incorporate information-theoretic regularization via entropy-based weight tempering and effective sample size (ESS) monitoring for graceful degradation when state inference is unreliable. Under weak dependence, split conformal achieves approximate coverage with bounded error (Oliveira et al., JMLR 2024); we extend this by positing regime-conditional stationarity.
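
One way to picture the safeguards named above is the sketch below. The effective sample size uses the standard Kish formula; the tempering rule (raising weights to a power that shrinks as the state posterior's entropy grows) is only an assumed, illustrative choice, since the abstract does not specify the exact form. It also assumes at least two hidden states. All names are hypothetical.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def normalized_entropy(probs):
    """Shannon entropy of the state posterior divided by log(K), so it lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))  # assumes K >= 2 states


def temper(weights, state_probs):
    """Assumed rule: flatten weights toward uniform as state uncertainty grows."""
    exponent = 1.0 - normalized_entropy(state_probs)
    return normalize([w ** exponent for w in weights])


weights = [0.7, 0.1, 0.1, 0.1]            # hypothetical similarity-in-state weights
confident = temper(weights, [1.0, 0.0])   # zero entropy: weights kept as they are
unsure = temper(weights, [0.5, 0.5])      # maximal entropy: weights become uniform
ess_before = effective_sample_size(weights)
ess_after = effective_sample_size(unsure)
```

Under this rule, an uninformative state posterior recovers equal weights, which corresponds to the unweighted baseline, and the effective sample size rises to the number of stored residuals.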

We benchmark on synthetic HMM-switching processes and real datasets (UCI Electricity, ETT, Monash). Results show significantly reduced coverage lag post-break while maintaining tighter intervals during stable periods versus ACI/SPCI. Gains correlate with regime identifiability (low posterior entropy); high-entropy cases gracefully revert to baseline. Computational overhead remains viable for real-time forecasting via rolling-buffer optimizations.

This work demonstrates that state-aware relevance weighting improves the coverage–sharpness tradeoff under regime shifts, bridging HMM-conformal validity, dependent-data bounds, and change-point methods in a computationally efficient, diagnostically transparent framework.

Suvro Mukherjee

Master’s Student
Statistics

A Deep Learning Approach to Analyzing Attendance Patterns in Delhi Schools

Abstract

Air pollution’s effects on bureaucratic performance remain understudied, particularly in high-pollution contexts in the Global South. This project investigates the relationship between PM2.5 concentrations and teacher and student absenteeism in Delhi’s government schools, extending prior work through novel data infrastructure and advanced statistical methods. We developed automated scrapers that continuously collect historical student data and teacher attendance records, with monthly updates ensuring a growing longitudinal dataset. Methodologically, we employ time series clustering via variational autoencoders (VAEs) to uncover latent patterns in attendance behavior across schools and time periods. This unsupervised approach allows identification of heterogeneous responses to pollution exposure that conventional regression methods might obscure, potentially revealing distinct behavioral regimes or school-level typologies. By combining two decades of administrative data with deep learning techniques, this work contributes to literature on climate impacts, urban governance, and public service delivery while demonstrating how generative models can complement causal inference in policy research.

Jiaqi Huang

Other
Statistics

A Diffusion-Engression Framework for Irregular and Informative Longitudinal Data

Abstract

Longitudinal data such as Electronic Health Records (EHRs) are valuable yet privacy‑sensitive, and existing de‑identification methods remain prone to leakage, motivating the need for high‑fidelity synthetic data. However, generating realistic longitudinal records is challenging due to their irregular observation times and informative measurement patterns. We propose the Diffusion‑Engression Framework (DEF), a novel generative model that defines a finite‑step forward process from the data distribution to a known distribution and learns a reverse trajectory via sequential engression models. This approach enables exact sampling with a fixed, small number of steps—overcoming the long sampling chains of typical diffusion models—while providing theoretical guarantees through an error upper bound in energy distance between the learned and true distributions. Experiments on both simulated and real data demonstrate that DEF effectively captures complex temporal dependencies and preserves statistical fidelity, offering a provably sound and practical solution for privacy‑aware synthetic data generation.

Maksim Sviridov

Master’s Student
CSE

Learning Composable NBA Player Embeddings from Possession-Level Data

Abstract

In modern sports analytics, NBA player evaluation often relies on box score statistics, which can vary widely depending on role, lineup, and game context. This makes it difficult to separate individual player impact from surrounding factors. In this work, I explore a latent variable approach for learning NBA player embeddings from possession-level data, where player representations are learned only through their participation in multi-player lineups. Each possession is modeled using the offensive and defensive players on the court, along with contextual information such as shooter, assister, primary defender, and shot location. Player embeddings are constrained to compose additively into lineup-level representations, encouraging interpretability and enabling direct analysis of players, lineups, and substitutions. I evaluate the learned embeddings based on their ability to capture distinct offensive and defensive roles and their sensitivity to lineup changes. This framework provides a foundation for composable player evaluation using real NBA data.
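
The additive composition constraint can be illustrated with a small sketch. It assumes each player has a fixed-length vector and that a lineup representation is the element-wise sum of its players' vectors; the player names, the two-dimensional vectors, and the function names are hypothetical and only for illustration.

```python
def compose(lineup, embeddings):
    """Lineup representation as the element-wise sum of its players' embeddings."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for player in lineup:
        for j, x in enumerate(embeddings[player]):
            total[j] += x
    return total


def substitution_effect(lineup, out_player, in_player, embeddings):
    """Change in the lineup representation when `in_player` replaces `out_player`."""
    new_lineup = [in_player if p == out_player else p for p in lineup]
    before = compose(lineup, embeddings)
    after = compose(new_lineup, embeddings)
    return [a - b for a, b in zip(after, before)]


# Hypothetical two-dimensional embeddings.
embeddings = {
    "player_a": [1.0, 0.0],
    "player_b": [0.5, 0.5],
    "player_c": [0.0, 2.0],
}
lineup_vec = compose(["player_a", "player_b"], embeddings)
delta = substitution_effect(["player_a", "player_b"], "player_a", "player_c", embeddings)
```

Because composition is additive, the effect of a substitution is simply the difference between the incoming and outgoing players' vectors, which is what makes lineup changes directly interpretable.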

Colin

Undergraduate Student
Physics

3 Satellite Analysis of the Solar Wind

Abstract

Understanding changes in the magnetic field in the solar wind, known as discontinuities, is crucial to predicting how space weather events may affect life on Earth. Two types of discontinuities are widely recognized: tangential and rotational. The classification of discontinuities into these categories depends on the normal to the plane on which the discontinuity propagates, among other factors. Different methods for estimating the normal vectors of discontinuities have been proposed. This presentation examines the use of three spacecraft to estimate normals. Detection of discontinuities by three spacecraft in the solar wind allows for estimations of their orientation. The Tsurutani-Smith method was used to identify over 1500 discontinuities per spacecraft in the solar wind during periods when Wind, ACE and DSCOVR were less than 70 Earth radii (RE) apart. Several statistical methods, based on the work of Malaspina and Gosling, were then used to down-select events to include only those observed by all three spacecraft. Using this method, 36 events were down-selected. Discontinuity normals were estimated using exact spacecraft positions, solar wind velocity, and relative timing of the events. Minimum Variance Analysis (MVA) was also performed at the boundary of each event for each of the three spacecraft and compared to the timing-derived method. Disagreement between MVA and timing-derived normals is found. This discrepancy affects ratios of events classified as rotational discontinuities to events classified as tangential discontinuities, thus highlighting a potential misclassification of events under MVA-reliant methods.

Raghunath Ranga Sudharshan

Other
Computational Medicine and Bioinformatics

Quantum Hamiltonian Framework for Spatial Cell Interaction Inference in Colorectal Cancer Multiplex Imaging Data

Abstract

Understanding cellular communication networks in the tumor microenvironment is essential for decoding cancer progression dynamics. We present a quantum-inspired framework for inferring cell-cell interaction networks from multiplex immunofluorescence (mIF) imaging data of colorectal cancer tissues.
Our approach transforms mIF images into quantitative spatial data by partitioning tissue sections into patches and representing each cell type’s spatial distribution through kernel density estimation (KDE). This yields cell type density profiles capturing local microenvironment composition. We establish pseudo-temporal ordering using tumor progression quantiles, enabling dynamic network reconstruction across disease stages without requiring longitudinal measurements.
The method employs a 14-qubit parameterized quantum circuit with Hamiltonian evolution to model regulatory networks, where qubit interactions represent putative cell-cell communication pathways. Network weights are inferred by optimizing a negative log-likelihood objective using Positive Operator-Valued Measure (POVM) predictions against observed spatial configurations across 50 pseudo-time points and 2,499 samples. The framework leverages JAX for hardware-accelerated automatic differentiation and PennyLane for quantum circuit simulation, achieving scalable inference on high-dimensional spatial data.
Optimization converges within 3,000 steps using adaptive learning rates and stochastic mini-batch sampling strategies. The learned adjacency matrices reveal directional interaction strengths between cell types, potentially uncovering immune-tumor communication axes and stromal remodeling patterns along tumor progression trajectories.
This quantum-inspired probabilistic framework offers a principled approach for systems-level analysis of spatial cellular ecosystems in cancer. The methodology generalizes to other multiplexed imaging platforms and spatial transcriptomics technologies, providing new insights into tissue organization and cell communication networks driving disease progression.
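
The patch-level density step described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: it uses an isotropic two-dimensional Gaussian kernel with a fixed bandwidth and evaluates each cell type's density only at the patch center. The function names, coordinates, and cell labels are hypothetical.

```python
import math


def gaussian_kde_2d(points, query, bandwidth):
    """Kernel density estimate at `query` using an isotropic 2-D Gaussian kernel."""
    if not points:
        return 0.0
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    qx, qy = query
    total = 0.0
    for x, y in points:
        squared_distance = (x - qx) ** 2 + (y - qy) ** 2
        total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
    return norm * total


def patch_density_profile(cells, patch_center, bandwidth):
    """Map each cell type to its estimated density at the patch center."""
    by_type = {}
    for cell_type, x, y in cells:
        by_type.setdefault(cell_type, []).append((x, y))
    return {
        cell_type: gaussian_kde_2d(points, patch_center, bandwidth)
        for cell_type, points in by_type.items()
    }


# Hypothetical cells as (type, x, y) within one tissue patch.
cells = [("tumor", 0.0, 0.0), ("tumor", 0.5, 0.0), ("t_cell", 5.0, 5.0)]
profile = patch_density_profile(cells, patch_center=(0.0, 0.0), bandwidth=1.0)
single = gaussian_kde_2d([(0.0, 0.0)], (0.0, 0.0), 1.0)
```

A vector of such per-type densities for each patch would give one "density profile" sample of the kind the abstract feeds into network inference.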