Presentations – MSSISS 2025


2025 Presentations


The order of presentation may vary; presenters are listed in alphabetical order.

Session I

March 27th, 9:15am – 12:00pm
East Conference Room

Moderated by
Alexander Kagan
PhD Student, Department of Statistics


Tom Liu

PhD Student
Biostatistics

Test-negative designs with various reasons for testing: statistical bias and solution

Abstract

Test-negative designs are widely used for post-market evaluation of vaccine effectiveness, particularly in cases where randomized trials are not feasible. Differing from classical test-negative designs where only healthcare-seekers with symptoms are included, recent test-negative designs have involved individuals with various reasons for testing, especially in an outbreak setting. While including these data can increase sample size and hence improve precision, concerns have been raised about whether they introduce bias into the current framework of test-negative designs, thereby demanding a formal statistical examination of this modified design. In this article, using statistical derivations, causal graphs, and numerical demonstrations, we show that the standard odds ratio estimator may be biased if various reasons for testing are not accounted for. To eliminate this bias, we identify three categories of reasons for testing, including symptoms, mandatory screening, and case contact tracing, and characterize associated statistical properties and estimands. Based on our characterization, we show how to consistently estimate each estimand via stratification. Furthermore, we describe when these estimands correspond to the same vaccine effectiveness parameter, and, when appropriate, propose a stratified estimator that can incorporate multiple reasons for testing and improve precision. Lastly, we apply the method to data from Michigan Medicine, which reveals differences attributable to the reason for testing. The performance of our proposed method is demonstrated through simulation studies and a real-world data application.
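A minimal sketch of the kind of stratified estimator the abstract describes, pooling reason-for-testing strata in a Mantel-Haenszel fashion; the column names and the pooling rule are illustrative assumptions, not the authors' exact estimator.

```python
import pandas as pd

def stratified_ve(df: pd.DataFrame, stratum_col: str = "reason_for_testing"):
    """Mantel-Haenszel-style odds ratio pooled across testing-reason strata,
    and the implied vaccine effectiveness VE = 1 - OR (illustrative only)."""
    num, den = 0.0, 0.0
    for _, g in df.groupby(stratum_col):
        # 2x2 table within the stratum: vaccination status x test result
        a = ((g.vaccinated == 1) & (g.test_positive == 1)).sum()  # vaccinated cases
        b = ((g.vaccinated == 1) & (g.test_positive == 0)).sum()  # vaccinated controls
        c = ((g.vaccinated == 0) & (g.test_positive == 1)).sum()  # unvaccinated cases
        d = ((g.vaccinated == 0) & (g.test_positive == 0)).sum()  # unvaccinated controls
        n = len(g)
        num += a * d / n
        den += b * c / n
    or_mh = num / den
    return or_mh, 1.0 - or_mh
```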


Gabriel Patron

PhD Student
Statistics

Recommendations beyond catalogs: diffusion models for personalized generation

Abstract

Modern recommender systems follow the guiding principle of serving the right user the right item at the right time. One of their main limitations is that they can only recommend items already in the catalog. We propose Recommendations Beyond Catalogs (REBECA), a new class of probabilistic diffusion-based recommender systems that synthesize items rather than retrieving them from the catalog, creating new items tailored to individual tastes from scratch. REBECA combines efficient training in embedding space with a novel diffusion prior that only requires users’ past ratings of items. We evaluate REBECA using real data and established metrics for image generation, while also introducing new metrics that measure the degree of personalization in a generative recommender system.


Jaylin Lowe

PhD Student
Statistics

Power Calculations for Randomized Controlled Trials with Auxiliary Observational Data

Abstract

Recent methods have sought to improve precision in randomized controlled trials (RCTs) by utilizing data from large observational datasets for covariate adjustment. For example, consider an RCT aimed at evaluating a new algebra curriculum, in which a few dozen schools are randomly assigned to treatment (new curriculum) or control (standard curriculum), and are evaluated according to subsequent scores on a state standardized test. Suppose that in addition to the RCT data, standardized test scores are also publicly available for all other schools in the state. Although not part of the RCT, these observational test scores could be used to increase precision in the RCT. Specifically, an outcome prediction model can be trained on the auxiliary data and the resulting predictions can be used as an additional covariate. With these methods, the desired power is often achieved with a smaller RCT. The necessary sample size depends on how well a model trained on the observational data generalizes to the RCT, which is typically unknown. We discuss methods for obtaining a range of reasonable sample sizes for designing such an RCT and demonstrate their use on an example from education research. The range is created by dividing the observational data into subgroups, and calculating the necessary sample size if the RCT sample were to resemble each subgroup. These subgroups can be defined by covariate values or by how well the observational data is expected to help. In this way, we are able to generate a range of plausible sample sizes. Computing the auxiliary predictions can be computationally demanding, and we show how this issue can be addressed without significantly affecting the results.
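A minimal sketch of this kind of calculation, assuming a simple two-arm difference-in-means design in which the auxiliary prediction is used as a covariate; the variance-reduction formula and the subgroup split are illustrative, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import norm

def n_per_arm(sigma2_resid, delta, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-arm comparison of means when
    the auxiliary prediction is used as a covariate: the residual variance of
    the outcome around the prediction replaces the raw outcome variance."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(2 * z**2 * sigma2_resid / delta**2))

def sample_size_range(y_obs, y_hat_obs, subgroup_labels, delta):
    """Required per-arm sample size if the RCT sample resembled each
    observational subgroup (inputs are NumPy arrays of equal length)."""
    return {
        g: n_per_arm(np.var(y_obs[subgroup_labels == g]
                            - y_hat_obs[subgroup_labels == g], ddof=1), delta)
        for g in np.unique(subgroup_labels)
    }
```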


Natasha Boreyko

PhD Student
Finance

Do Bankers Provide Social Value? Evidence from Municipal Bonds

Abstract

This paper explores the value added by individual bankers in the municipal bond underwriting market. I examine whether the expertise and relationships that these bankers develop are easily substitutable or whether they provide unique advantages for underwriters, bond issuers, and investors. By focusing on career moves and job changes of individual bankers, I analyze how banker switches affect underwriting outcomes (e.g., bond yields, refinancing success, and credit downgrades). A key identification strategy uses the quasi-exogenous shock from the 2021 Texas underwriter ban, which barred five of the largest underwriters from issuing municipal bonds in the state, resulting in widespread banker movements from the affected underwriting firms.


Nolan Feeny

PhD Student
Finance

A Spatial Analysis of Mass Shootings in the Continental US

Abstract

Over 5,000 mass shootings* occurred in the continental United States over the past decade. Extant research has studied mass shootings at a national scale using criminological, social, and strain theories at state- or city-level granularity. Most of these theories have provided qualitative analyses with the goal of understanding factors associated with gun violence, in order to aid decision makers who could update policies on the issue. This project develops a quantitative statistical analysis using a multi-stage spatial approach, integrating data from various sources and scales. Demographic data are included at both county and census-tract levels, while state-level gun law data are also incorporated to account for multiple factors that could impact the amount of gun violence. The goal of this project is to gain greater understanding of the relationship between demographic, policy-based, and spatial covariates and the counts of mass shootings that occur. We also aim to improve existing predictive methods and provide results that can serve as a decision-making aid for policymakers regarding gun violence. This expands on work done in smaller-scale case studies of cities and neighborhoods, developing a more comprehensive framework applied across the entire continental United States.

*Using the Gun Violence Archive’s definition of a mass shooting: 4+ victims injured or killed.


Neil Anand Mankodi

Master’s Student
LSA

Machine Learning Models To Guide Preventive Interventions For Type 2 Diabetes In India

Abstract

Introduction and Objective: The rising global prevalence of type 2 diabetes (T2DM) is linked to obesity. However, in India, fewer than half of adults with T2DM are obese. We used machine learning (ML) to identify novel risk factors and targets for prevention in an Indian population.

Methods: We used multiple ML models (e.g., logistic regression, decision trees, XGBoost) to study T2DM using data on 7067 participants and 997 variables from the cross-sectional Indian Migration Study, conducted in four regions of India from 2005 to 2007. T2DM was defined per ADA criteria. The classification metrics (F1 and AUC) were chosen as primary optimization metrics due to the imbalanced nature of the diabetes classes. Key risk factors were identified through coefficient scores and feature importance analyses, and their effects were visualized using SHAP dependency plots.

Results: The most accurate model achieved an F1 score of 0.60 and an AUC of 0.885 for T2DM. Prior to the addition of genetic (SNP) data, F1 score was 0.48. Top features associated with T2DM were: regular medication use, prior tuberculosis, greater intake of glutamic acid, arginine and leucine, and lower intake of carbohydrates and protein. Having the rs10811661_TT SNP was also associated with T2DM. Thyroid disease and living in eastern India were inversely associated with T2DM.

Conclusion: High-dimensional ML models trained on epidemiological data may help guide preventive interventions in India and identify individuals at high risk for T2DM.
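A minimal sketch of the evaluation setup described in the Methods above, using placeholder data and a generic gradient-boosting classifier rather than the study's actual variables and tuned models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 7067 participants and their features.
rng = np.random.default_rng(0)
X = rng.normal(size=(7067, 50))
y = rng.binomial(1, 0.15, size=7067)  # imbalanced T2DM outcome (0/1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)

print("F1 :", f1_score(y_te, clf.predict(X_te)))                   # primary metric
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```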


Htay-Wah Saw

PhD Student
Survey and Data Science

Measuring Air Quality with Wearable Devices

Abstract

Publicly available pollution data are mostly regional-level data, such as those collected by the EPA’s weather stations. Such data are likely to miss substantial differences in individual exposures to pollution, whether inside the home, at work, or elsewhere. To address this lack of granularity, we have asked some 900 respondents to the Understanding America Study (UAS), balanced across education, race and ethnicity, and household income, to wear an air quality monitor (Atmotube; https://atmotube.com/atmotube-pro) continuously for at least one year. The air quality monitor collects pollution and weather data at 1-minute intervals and is Bluetooth enabled so that it communicates with a smartphone app.

In addition, we have conducted monthly surveys of the respondents’ home characteristics (heating and cooling types, cooking stoves, proximity to busy roads) and of their whereabouts in 30-minute episodes during the previous 24 hours (home, work, motor vehicle, other). More recently, we have also asked for consent to link the GPS coordinates recorded by the Atmotube to their survey and air quality data.

By merging in modeled pollution data at the 1 km² level, we are able to disentangle the effects of local air quality and micro-climates such as those inside one’s home, at work, or when traveling in a motor vehicle. At MSSISS 2025, we provide descriptive results of how air quality varies by respondents’ location, socio-economic, and housing characteristics. Furthermore, to gain insight into individual exposure to air quality, we will decompose individual pollution exposure into its various components: regional air quality and variation by individuals’ location during the day. We have rich background information on our respondents, including their health and cognitive outcomes, which allows us to analyze how exposure to pollution is associated with these substantive outcomes.

Session II

March 27th, 9:15am – 12:00pm
West Conference Room

Moderated by
Sergio Martinez
PhD Student, Survey and Data Science


Yueying Hu

PhD Student
Biostatistics

The Impact of Census-Tract Level Mortgage Discrimination on Cognitive Function: Accounting for Measurement Error in Small-Area Data via Joint Modeling

Abstract

Racial disparities in cognitive health reflect entrenched structural inequalities. This study investigates the association between census-tract level mortgage discrimination, operationalized as the Mortgage Density Index Ratio (MDIR), and cognitive outcomes among racially diverse older adults. Using data from the Michigan Cognitive Aging Project (MCAP), a cohort of 644 participants was analyzed across six cognitive domains, taking into account individual demographics and neighborhood characteristics.
Hypersegregation, driven in part by historical redlining and contemporary racial discrimination in housing and lending, introduces instability in ratio indices like MDIR, particularly in census tracts with extreme racial imbalances. To address this, we employed a joint modeling approach that simultaneously estimates cognitive outcomes and latent mortgage rates for Black and White households, effectively mitigating measurement error. This method identified a significant association between MDIR and processing speed only among Non-Hispanic Black participants, with a one-unit MDIR increase corresponding to a 0.47 SD improvement in processing speed (95% CI: 0.04-0.92). That is, a more equitable mortgage lending environment is associated with faster cognitive processing. Traditional regression methods, in contrast, failed to detect such effects.
Simulations further demonstrated the advantages of joint modeling in managing measurement error, showing notably lower bias and greater robustness in small- to moderate-sized census tracts compared to traditional regression approaches. These findings underscore the importance of advanced statistical methods in quantifying structural racism and highlight the disproportionate effects of mortgage discrimination on cognitive outcomes among Black adults.


Lingxuan Kong

PhD Student
Biostatistics

Adaptive Risk-Weighted Learning for Optimizing Kidney Transplant Decisions Under Resource Constraints

Abstract

Dynamic decision rules dictate individualized treatments based on evolving treatment and covariate history, optimizing clinical decision-making and forming a crucial part of personalized medicine. However, decision rules for maintained treatments are still underdeveloped. Organ transplantation, a maintained treatment with a significant impact on patient survival, lacks distinct and applicable decision rules at both the individual and population levels. The shortage of donor organs complicates the decision-making process, necessitating efficient stage-wise allocation amidst competition for donors among the at-risk population and balancing future treatment opportunities with currently available sub-optimal treatments. In this paper, we present Adaptive Risk-Weighted Learning (ARWL) as a novel stage-wise decision-making platform under resource constraints. ARWL constructs decision trees by imputing counterfactual outcomes of possible treatment combinations at the end of the study using inverse probability-weighted estimators. Pseudo-labels of treatment assignment are then generated to maximize the desired population outcomes. The algorithm is implemented in a one-step simultaneous fitting manner to account for correlations among decision rules at each stage, thereby avoiding the accumulation of fitting errors. ARWL effectively addresses the issue of death censoring at each stage caused by improper previous decisions. Overall, ARWL is efficient in allocating limited resources, robust to model misspecifications and variations in treatment availability, easy to interpret and apply, and adaptable to multiple scenarios. We apply ARWL to develop prioritizing rules for end-stage renal disease patients on the deceased donor kidney transplantation waitlist. This approach helps physicians identify urgent or suitable candidates for receiving a kidney transplant.


Mengqi Lin

PhD Student
Statistics

Controlling the False Discovery Proportion in Observational Studies with Hidden Bias

Abstract

We propose an approach to exploratory data analysis in matched observational studies. We consider the setting where a single intervention is thought to potentially impact multiple outcome variables, and the researcher would like to investigate which of these causal hypotheses come to bear while accounting not only for the possibility of false discoveries, but also the possibility that the study is plagued by unmeasured confounding. For any candidate set of rejected hypotheses, our method provides sensitivity intervals for the false discovery proportion (FDP), the proportion of rejected hypotheses that are actually true. For a set $\mathcal{R}$ containing $|\mathcal{R}|$ outcomes, the method describes how much unmeasured confounding would need to exist for us to believe that the proportion of true hypotheses is $0/|\mathcal{R}|$, $1/|\mathcal{R}|$, …, all the way to $|\mathcal{R}|/|\mathcal{R}|$. Moreover, the resulting confidence statements are valid simultaneously over all possible choices for the rejected set, allowing the researcher to look in an ad hoc manner for promising subsets of outcomes that maintain a large estimated fraction of correct discoveries even if a large degree of unmeasured confounding is present. The approach is particularly well suited to sensitivity analysis, as conclusions that some fraction of outcomes were affected by the treatment exhibit larger robustness to unmeasured confounding than the conclusion that any particular outcome was affected. In principle, the method requires solving a series of quadratically constrained integer programs. That said, we show not only that a solution can be obtained in reasonable run time, but also that one can avoid running the integer program altogether with high probability in large samples. We illustrate the practical utility of the method through simulation studies and a data example.


Marianthie Wank

PhD Student
Biostatistics

Bayesian Estimation of Dynamic Treatment Regimes from a Partially Randomized, Patient Preference, Sequential, Multiple Assignment, Randomized Trial

Abstract

As healthcare shifts towards patient-centered care, incorporating patient treatment preferences in clinical trials has become increasingly relevant. The Partially Randomized, Patient Preference, Sequential Multiple Assignment Randomized Trial (PRPP-SMART) combines a Partially Randomized Patient Preference (PRPP) trial with a Sequential, Multiple Assignment, Randomized Trial (SMART), allowing participants to either receive their preferred treatment or be randomized when no treatment preference exists, at multiple points in the trial. In this paper, we introduce a novel Bayesian method to estimate dynamic treatment regimes (DTRs), or tailored treatment guidelines over the course of care, embedded in PRPP-SMARTs. Our Bayesian Joint Stage Model (BJSM) leverages information sharing between preference and randomized participants and across stages of the trial to estimate DTR effects. We compare our BJSM method to weighted and replicated regression models (WRRM), the current standard for analyzing PRPP-SMART data, and show that our method provides more efficient DTR effect estimates with negligible bias. Our results indicate that BJSM is a promising alternative for analyzing PRPP-SMART data.


Yue Yu

PhD Student
Statistics

The Root Finding Problem Revisited: Beyond the Robbins-Monro procedure

Abstract

We introduce Sequential Probability Ratio Bisection (SPRB), a novel stochastic approximation algorithm that adapts to the local behavior of the (regression) function of interest around its root. We establish theoretical guarantees for SPRB’s asymptotic performance, showing that it achieves the optimal convergence rate and minimal asymptotic variance even when the target function’s derivative at the root is small (at most half the step size), a regime where the classical Robbins-Monro procedure typically suffers reduced convergence rates. Further, we show that if the regression function is discontinuous at the root, Robbins-Monro converges at a rate of 1/n whilst SPRB attains exponential convergence. As part of our analysis, we derive a non-asymptotic bound on the expected sample size and establish a generalized Central Limit Theorem under random stopping times. Remarkably, SPRB automatically provides confidence intervals that do not explicitly require knowledge of the convergence rate. We demonstrate the practical effectiveness of SPRB through simulation results.
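For context, the classical Robbins-Monro recursion that SPRB is compared against might look like the sketch below; the noisy oracle and step-size constant are illustrative, and this is not the authors' SPRB code.

```python
import numpy as np

def robbins_monro(noisy_g, x0=0.0, a=1.0, n_iter=10_000, rng=None):
    """Classical Robbins-Monro recursion x_{n+1} = x_n - (a/n) * Y_n,
    where Y_n is a noisy evaluation of g at x_n and we seek g(x*) = 0."""
    rng = rng or np.random.default_rng(0)
    x = x0
    for n in range(1, n_iter + 1):
        x -= (a / n) * noisy_g(x, rng)
    return x

# Example: g(x) = 0.1 * (x - 2) has a small derivative at the root x* = 2,
# the regime where Robbins-Monro slows down and SPRB is designed to help.
root = robbins_monro(lambda x, rng: 0.1 * (x - 2) + rng.normal(scale=0.5))
```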


Xiaoyu Qiu

PhD Student
Statistics

Clarifying the Role of the Mantel-Haenszel Risk Difference Estimator in Randomized Clinical Trials

Abstract

The Mantel-Haenszel (MH) risk difference estimator is widely used for binary outcomes in randomized clinical trials. This estimator computes a weighted average of stratum-specific risk differences and traditionally requires the stringent assumption of a homogeneous risk difference across strata. In our study, we relax this assumption and demonstrate that the MH risk difference estimator consistently estimates the average treatment effect. Furthermore, we rigorously study its properties under two asymptotic frameworks: one characterized by a small number of large strata and the other by a large number of small strata. Additionally, a unified robust variance estimator that improves over the popular Greenland and Sato variance estimators is proposed, and we prove that it is applicable across both asymptotic scenarios. Our findings are thoroughly validated through simulations and real data applications.
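For reference, the MH risk difference itself is a weighted average of stratum-specific risk differences; the sketch below shows the standard textbook form, not the authors' code or their proposed variance estimator.

```python
import numpy as np

def mh_risk_difference(events_t, n_t, events_c, n_c):
    """Mantel-Haenszel risk difference: a weighted average of stratum-specific
    risk differences with weights w_k = n_t[k] * n_c[k] / (n_t[k] + n_c[k])."""
    events_t, n_t = np.asarray(events_t, float), np.asarray(n_t, float)
    events_c, n_c = np.asarray(events_c, float), np.asarray(n_c, float)
    w = n_t * n_c / (n_t + n_c)
    rd = events_t / n_t - events_c / n_c
    return np.sum(w * rd) / np.sum(w)

# Two strata: (events, arm size) for treatment and control.
print(mh_risk_difference([30, 12], [100, 50], [20, 10], [100, 50]))
```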


Peiyao Cai

PhD Student
Statistics

Estimation and Inference for the Joint Autoregressive Quantile-Expected Shortfall Model

Abstract

Expected shortfall is defined as the truncated mean of a random variable that falls below a specified quantile level. This statistic is widely recognized as an important risk measure. Motivated by the empirical observation of clustering patterns in financial risks, we consider a joint autoregressive model for both conditional quantile and expected shortfall in this manuscript. Existing estimation methods for such models typically rely on minimizing a nonlinear and nonconvex joint loss function, which is challenging to solve and often yields inefficient estimators. We employ a weighted two-step estimation approach to estimate the proposed models. Our proposed estimator has greater efficiency compared to those obtained by existing methods, both theoretically and numerically, for a general class of location-scale family time series. Our empirical results on stock market data indicate that the proposed models effectively capture the clustering patterns and leverage effects on conditional expected shortfall.


Tara Radvand

PhD Student
Business

Who Wrote This? Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities

Abstract

Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly difficult as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by LLM A or B (where B can be a human)? We model LLM-generated text as a sequential stochastic process with complete dependence on history and design zero-shot statistical tests to distinguish between (i) text generated by two different sets of LLMs, A (in-house) and B (non-sanctioned), and (ii) LLM-generated and human-generated texts. We prove that the type I and type II errors for our tests decrease exponentially in the text length. In designing our tests, we derive concentration inequalities on the difference between log-perplexity and the average entropy of the string under A. Specifically, for a given string, we demonstrate that if the string is generated by A, the log-perplexity of the string under A converges to its average entropy under A, except with an exponentially small probability in string length. We also show that if B generates the text, then, except with an exponentially small probability in string length, the log-perplexity of the string under A converges to the average cross-entropy of B and A. Lastly, we present preliminary experimental results using open-source LLMs to support our theoretical results. Practically, our work enables finding, with guarantees, the origin of harmful LLM-generated text of arbitrary length, which can be useful for combating misinformation as well as compliance with emerging AI regulations.
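A minimal sketch of the detection statistic described in the abstract, assuming per-token log-probabilities and next-token entropies under model A are already available from the scoring model; the threshold rule is illustrative.

```python
import numpy as np

def detection_statistic(token_logprobs_A, token_entropies_A):
    """Gap between the string's log-perplexity under model A and the average
    entropy of A's next-token distributions along the string. Per the abstract,
    this gap concentrates near zero when A generated the text and away from
    zero (toward a cross-entropy gap) otherwise."""
    log_perplexity = -np.mean(token_logprobs_A)  # average negative log-probability
    return log_perplexity - np.mean(token_entropies_A)

def attribute_to_A(token_logprobs_A, token_entropies_A, tau=0.1):
    # Illustrative decision rule: attribute the text to A when the gap is small.
    return abs(detection_statistic(token_logprobs_A, token_entropies_A)) < tau
```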

Session III

March 27th, 1:00pm – 3:00pm
Amphitheater

Moderated by
Ziyu Liu
PhD Student, Biostatistics


Tim White

PhD Student
Statistics

Neural posterior estimation for inferring weak lensing shear and convergence from pixels

Abstract

Inferring the coherent distortion of imaged galaxies due to weak gravitational lensing is a challenging inverse problem involving pixelization, instrument bias, and a low signal-to-noise ratio. Most traditional approaches to this task produce point estimates of weak lensing shear and convergence by measuring, averaging, and calibrating galaxy ellipticities under an assumed parametric model of intrinsic morphologies, a procedure that is subject to image noise, selection bias, and model misspecification. We propose an alternative, Bayesian approach to weak lensing inference that aims to jointly estimate shear and convergence maps from multiband images using a type of amortized variational inference called neural posterior estimation (NPE). NPE is a likelihood-free method that is well suited for estimating shear and convergence because it implicitly marginalizes out nuisance latent variables that might render other posterior inference techniques intractable. It is computationally efficient due to its utilization of deep learning, and it provides estimates of posterior uncertainty that can be propagated to downstream cosmological analyses. When evaluated using synthetic images from the LSST-DESC DC2 Simulated Sky Survey, the proposed algorithm produces posterior shear and convergence maps that are consistent with the ground truth and more accurate than a baseline estimator based on weighted averages of galaxy ellipticities. These results demonstrate that NPE can be a viable alternative to existing weak lensing inference procedures, though efforts to characterize potential model misspecification and develop a more flexible family of variational distributions are necessary before it can be applied to Stage IV astronomical surveys.


Gabriel Durham

PhD Student
Statistics

Precision in Real Time: Leveraging Generative Models in Online Learning for Adaptive Interventions

Abstract

Intervention designers across disciplines use online reinforcement learning (RL) agents to personalize interventions by selecting actions and observing user responses/outcomes. The feedback generated from this action/response cycle informs more targeted, individualized interventions, whether in the form of mobile health nudges, marketing ads, or other personalized strategies. Recently, researchers have explored the use of generative models to augment such interventions by incorporating detailed contextual information regarding the targeted user. However, integrating generative models muddies the action/outcome feedback loop, as the generative model output is a noisy representation of the query that solicited it. In this presentation, we present Noisy Action Thompson Sampling (NATS), a novel framework for addressing such concerns. NATS is a two-stage modification of “standard” Thompson sampling and is inspired by classical instrumental variable-based approaches from causal inference. Our empirical results show that this new approach can outperform standard Thompson sampling by better leveraging the causal mechanics of the intervention. We also establish theoretical guarantees for this new approach, and consider incorporation of LLM-generated text into personalized Just-in-Time Adaptive Interventions (pJITAIs) as an illustrative example.
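For reference, the "standard" Beta-Bernoulli Thompson sampling loop that NATS modifies might look like the sketch below; the reward oracle is a placeholder, and this is not the NATS procedure itself.

```python
import numpy as np

def thompson_sampling(get_reward, n_actions, horizon, rng=None):
    """Beta-Bernoulli Thompson sampling: draw a success probability per action
    from its posterior, play the argmax, and update that action's posterior
    with the observed binary reward (e.g., whether a nudge was acted on)."""
    rng = rng or np.random.default_rng(0)
    alpha = np.ones(n_actions)
    beta = np.ones(n_actions)
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior draw per action
        a = int(np.argmax(theta))
        r = get_reward(a)               # user response, assumed in {0, 1}
        alpha[a] += r
        beta[a] += 1 - r
    return alpha, beta
```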


Mengqi Lin

PhD Student
Statistics

Identifiability of Boolean graphical models

Abstract

Boolean graphical models—including prominent subfamilies such as cognitive diagnosis models and Boolean matrix decompositions—find broad applications ranging from social sciences to engineering. Despite their flexibility, a key challenge lies in establishing the identifiability of their graphical structures, which determine how latent variables affect observed data. Existing work often relies on the strong assumption of pure nodes—observed variables that depend directly on only one latent variable. While mathematically convenient, these assumptions may be unrealistic in many real-world settings. To address this, we develop a novel graphical approach using Hasse diagrams, which transforms the identifiability problem into a graph isomorphism challenge. Building on this perspective, we propose sufficient and necessary conditions for identifiability that do not require pure nodes; rather, the graphical structure is identifiable precisely when the corresponding graphical representation is unique.


Daniele Bracale

PhD Student
Statistics

Optimal Non-Linear Online Learning under Sequential Price Competition

Abstract

We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices and subsequently observe their respective demand, which is unobservable to competitors. The realized demand of each seller depends on the prices of all sellers following a private unknown non-linear model. We propose a semi-parametric least-squares estimation, which does not require sellers to communicate demand information. We show that our policy, when employed by all sellers, converges at rate $O(T^{-2/7})$ to the Nash equilibrium prices that sellers would reach if they were fully informed. Meanwhile, each seller achieves regret of order $T^{5/7}$ relative to a dynamic benchmark policy.


Felipe Maia Polo

PhD Student
Statistics

Sloth: Scaling laws for LLM skills to predict multi-benchmark performance across families

Abstract

Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks, from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance efficiently and offers insights into scaling behaviors for downstream tasks such as coding and emotional intelligence applications.


Xiaoyang Song

PhD Student
Industrial & Operations Engineering

SEE-OOD: Supervised Exploration for Enhanced Out-of-Distribution Detection

Abstract

Current techniques for Out-of-Distribution (OoD) detection predominantly rely on quantifying predictive uncertainty and incorporating model regularization during the training phase, using either real or synthetic OoD samples. However, methods that utilize real OoD samples lack exploration and are prone to overfitting the OoD samples at hand, whereas synthetic samples are often generated based on features extracted from training data, rendering them less effective when the training and OoD data are highly overlapped in the feature space. In this work, we propose a Wasserstein-score-based generative adversarial training scheme to enhance OoD detection accuracy, which, for the first time, performs data augmentation and exploration simultaneously under the supervision of limited OoD samples. Specifically, the generator explores OoD spaces and generates synthetic OoD samples using feedback from the discriminator, while the discriminator exploits both the observed and synthesized samples for OoD detection using a predefined Wasserstein score. We provide theoretical guarantees that the optimal solutions of our generative scheme are statistically achievable through adversarial training in empirical settings. We demonstrate that the proposed method outperforms state-of-the-art techniques on various computer vision datasets and exhibits superior generalizability to unseen OoD data. Furthermore, we showcase its practical effectiveness in detecting new surface defects using a real-world 3D point cloud dataset of manufacturing defects.

Session IV

March 28th, 9:00am – 12:00pm
East Conference Room

Moderated by
Thejasvi Dhanireddy
PhD Student, Biostatistics


Leyuan Qian

PhD Student
Biostatistics

Smooth Tensor Decomposition for Ambulatory Blood Pressure Monitoring Data

Abstract

Ambulatory blood pressure monitoring (ABPM) is widely used to track blood pressure and heart rate over periods of 24 hours or more. Most existing studies rely on basic summary statistics of ABPM data, such as means or medians, which obscure temporal features like nocturnal dipping and individual chronotypes. To better characterize the temporal features of ABPM data, we propose a novel smooth tensor decomposition method. Built upon traditional low-rank tensor factorization techniques, our method incorporates a smoothing penalty to handle noise and employs an iterative algorithm to impute missing data. We also develop an automatic approach for the selection of optimal smoothing parameters and ranks. We apply our method to ABPM data from patients with concurrent obstructive sleep apnea and type II diabetes. Our method explains temporal components of data variation and outperforms the traditional approach of using summary statistics in capturing the associations between covariates and ABPM measurements. Notably, it distinguishes covariates that influence the overall levels of blood pressure and heart rate from those that affect the contrast between the two.


Kevin Christian

PhD Student
Statistics

Causal Inference with Text-Based Treatments and Outcomes

Abstract

Causal inference traditionally deals with categorical or scalar variables, but many fields (e.g., epidemiology) now face causal questions involving text data—from treatment variables like nurse applicant biographies to outcome variables like patient survey responses. Text-based causal inference poses unique challenges. The average treatment effect (ATE) query fails, as we cannot meaningfully subtract texts or satisfy the overlap assumption with high-dimensional text treatments. We propose new causal queries for causal inference with text-based treatments or outcomes. The key idea is to identify text features that maximize contrast between treatment (or outcome) values. For text-based outcomes, we measure the most salient differences between treated and control groups using a coding function that converts text to scalars. For text-based treatments, we identify treatment features with the strongest effect on outcomes using a specialized text classifier. We develop semiparametrically efficient estimators for the queries and their variants that model heterogeneity. Finally, empirical studies demonstrate that the contrast-maximizing queries effectively address text-based causal questions.


Hanbin Lee

PhD Student
Statistics

tslmm: complex trait modelling with ancestral recombination graphs

Abstract

Exemplified by the succinct tree sequence implemented in the tskit library, the ancestral recombination graph (ARG) is a powerful tool for storing and analyzing enormous genomic datasets. Tree sequences have become a standard tool in evolutionary genetics, and new applications are currently being explored to overcome analytic challenges posed by the rapid expansion of the data. To this end, we developed a novel complex trait model, ARG-LMM, to describe generations of complex traits on an ARG, and implemented the model in the tslmm library. ARG-LMM is based on evolutionary genetic principles, providing a simple but consistent description of how complex traits are distributed in a sample of individuals. The model subsumes existing quantitative genetics models by considering all past DNA coalescence, mutation, and recombination events encoded in an ARG. This model formulation provides a framework to study the effect of evolutionary forces on genomes and corresponding phenotypes. tslmm can work with ARGs encoding whole genomes of hundreds of thousands of individuals through efficient graph traversal algorithms with O(N) time complexity, a step change over existing algorithms that require at least O(N^2) operations. In summary, this work presents a new quantitative genetic model (ARG-LMM) based on evolutionary theory and new software (tslmm) for scalable analysis of complex traits with large genomic datasets.


Sahana Rayan

PhD Student
Statistics

Learning to Partially Defer for Sequences

Abstract

In the Learning to Defer (L2D) framework, a prediction model can either make a prediction or defer it to an expert, as determined by a rejector. Current L2D methods train the rejector to decide whether to reject the entire prediction, which is not desirable when the model predicts long sequences. We present an L2D setting for sequence outputs where the system can defer specific outputs of the whole model prediction to an expert in an effort to interleave the expert and machine throughout the prediction. We propose two types of model-based post-hoc rejectors for pre-trained predictors: a token-level rejector, which defers specific token predictions to experts with next-token prediction capabilities, and a one-time rejector for experts without such abilities, which defers the remaining sequence from a specific point onward. In experiments, we also empirically demonstrate that such granular deferrals achieve better cost-accuracy tradeoffs than whole deferrals on traveling salesman problem solvers and news summarization models.


Yilei Zhang

PhD Student
Statistics

A Bayesian Method for Learning Mixture Models of Non-parametric Components

Abstract

Mixture models are widely used in modeling heterogeneous subpopulations in data. Mixture models of parametric components (e.g., Gaussian mixture models) have been thoroughly studied on both statistical and algorithmic fronts. However, in the face of the increasing complexity of large-scale data, parametric assumptions such as Gaussianity are often unrealistic, and very little literature exists on learning mixture models of non-parametric components. In an effort to fill this gap, we first address the identifiability issue in mixture models of non-parametric components. Building on this, we establish a framework using mixtures of Dirichlet processes to learn such models, and develop an efficient MCMC algorithm to implement our method. Our method can learn each component density without resorting to solving the mixing measure, thus providing a sample-efficient framework for learning subpopulation properties from data. We also show that the posterior contraction rate of our component density estimator is of an almost polynomial order, a significant improvement over the logarithmic convergence rate of solving mixing measures. This substantiates the sample-efficiency and applicability of our method in learning non-parametric component densities.


Shihao Wu

PhD Student
Statistics

Denoising Diffused Embeddings: a Generative Approach for Hypergraphs

Abstract

Hypergraph data, which capture multi-way interactions among entities, are becoming increasingly prevalent in the big data era. Generating new hyperlinks from an observed, usually high-dimensional hypergraph is an important yet challenging task with diverse applications, such as electronic health record analysis and biological research. This task is fraught with several challenges. The discrete nature of hyperlinks renders many existing generative models inapplicable. Additionally, powerful machine learning-based generative models often operate as black boxes, providing limited interpretability. Key structural characteristics of hypergraphs, including node degree heterogeneity and hyperlink sparsity, further complicate the modeling process and must be carefully addressed. To tackle these challenges, we propose Denoising Diffused Embeddings (DDE), a general generative model architecture for hypergraphs. DDE exploits potential low-rank structures in high-dimensional hypergraphs and adopts the state-of-the-art diffusion model framework. Theoretically, we show that when true embeddings are accessible, DDE exactly reduces the task of generating new high-dimensional hyperlinks to generating new low-dimensional embeddings. Moreover, we analyze the implications of using estimated embeddings in DDE, revealing how hypergraph properties, such as dimensionality, node degree heterogeneity, and hyperlink sparsity, impact its generative performance. Simulation studies demonstrate the superiority of DDE over existing methods in terms of both computational efficiency and generative accuracy. Furthermore, an application to a symptom co-occurrence hypergraph derived from electronic medical records uncovers interesting findings and highlights the advantages of DDE.


Sayan Chakrabarty

PhD Student
Statistics

Network Bootstrap Using Overlapping Partitions

Abstract

Bootstrapping network data efficiently is a challenging task. The existing methods tend to make strong assumptions on both the network structure and the statistics being bootstrapped, and are computationally costly. This paper introduces a general algorithm, SSBoot, for network bootstrap that partitions the network into multiple overlapping subnetworks and then aggregates results from bootstrapping these subnetworks to generate a bootstrap sample of the network statistic of interest. This approach tends to be much faster than competing methods as most of the computations are done on smaller subnetworks. We show that SSBoot is consistent in distribution for a large class of network statistics under minimal assumptions on the network structure, and demonstrate with extensive numerical examples that the bootstrap confidence intervals produced by SSBoot attain good coverage without substantially increasing interval lengths in a fraction of the time needed for running competing methods.
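A hedged sketch of the general partition-bootstrap-aggregate pattern the abstract describes; the random overlapping subnetworks and simple pooling used here are illustrative placeholders, not SSBoot's actual partitioning or aggregation rules.

```python
import numpy as np

def subnetwork_bootstrap(A, statistic, n_subnets=10, subnet_frac=0.3,
                         n_boot_per_subnet=20, rng=None):
    """Bootstrap a network statistic by resampling nodes inside overlapping
    random subnetworks and pooling the draws across subnetworks."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[0]
    m = int(subnet_frac * n)
    draws = []
    for _ in range(n_subnets):
        nodes = rng.choice(n, size=m, replace=False)       # one overlapping subnetwork
        for _ in range(n_boot_per_subnet):
            resampled = rng.choice(nodes, size=m, replace=True)
            draws.append(statistic(A[np.ix_(resampled, resampled)]))
    return np.array(draws)                                  # bootstrap distribution

# Example statistic: edge density of an undirected adjacency matrix (no self-loops).
density = lambda B: B.sum() / (B.shape[0] * (B.shape[0] - 1))
```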


Gabriel Ponte

PhD Student
Industrial & Operations Engineering

Good and Fast Row-Sparse ah-Symmetric Reflexive Generalized Inverses

Abstract

We present several algorithms aimed at constructing sparse and structured sparse (row-sparse) generalized inverses, with application to the efficient computation of least-squares solutions for inconsistent systems of linear equations in the setting of multiple right-hand sides and a rank-deficient constraint matrix. Leveraging our earlier formulations to minimize the 1- and 2,1-norms of generalized inverses that satisfy important properties of the Moore-Penrose pseudoinverse, we develop efficient and scalable ADMM algorithms to address these norm-minimization problems and to limit the number of nonzero rows in the solution. We establish a 2,1-norm approximation result for a local-search procedure that was originally designed for 1-norm minimization, and we compare the ADMM algorithms with the local-search procedure and with general-purpose optimization solvers.


Session V

March 28th, 9:00am – 12:00pm
West Conference Room

Moderated by
Yuheng Huang
PhD Student, Industrial & Operations Engineering


Unique Subedi

PhD Student
Statistics

On the Benefits of Active Data Collection in Operator Learning

Abstract

We study active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. We can achieve arbitrarily fast error convergence rates with sufficiently rapid eigenvalue decay of the covariance kernels. This contrasts with passive (i.i.d.) data collection strategies, where the convergence rate is never faster than linear decay (~ 1/n). In fact, for our setting, we show a non-vanishing lower bound for any passive data collection strategy, regardless of the eigenvalue decay rate of the covariance kernel. Overall, our results show the benefit of active data collection strategies in operator learning over their passive counterparts.


Yichao Chen

PhD Student
Statistics

Community Detection for Signed Networks

Abstract

Community detection, discovering the underlying communities within a network from observed connections, is a fundamental problem in network analysis that has been extensively studied across various domains. In the context of signed networks, not only the connections but also their signs play a crucial role in community identification. In particular, the empirical evidence for balance theory in real-world signed networks makes it a compelling property for this purpose. In this work, we propose a novel balanced stochastic block model, which has a hierarchical community structure induced by balance theory. We also develop a fast maximum pseudo-likelihood estimation approach for community detection with exact recovery. Our proposed method can be used to detect meaningful node clusters that are of potential interest for downstream applications.


Seamus Somerstep

PhD Student
Statistics

CARROT: A Cost Aware Rate Optimal Router

Abstract

With the rapid growth in the number of Large Language Models (LLMs), there has been a recent interest in LLM routing, or directing queries to the cheapest LLM that can deliver a suitable response. Following this line of work, we introduce CARROT, a Cost AwaRe Rate Optimal rouTer that can select models based on any desired trade-off between performance and cost. Given a query, CARROT selects a model based on estimates of models’ cost and performance. Its simplicity lends CARROT computational efficiency, while our theoretical analysis demonstrates minimax rate-optimality in its routing performance. Alongside CARROT, we also introduce the Scalable Price-aware Routing (SPROUT) dataset to facilitate routing on a wide spectrum of queries with the latest state-of-the-art LLMs. Using SPROUT and prior benchmarks such as Routerbench and open-LLM-leaderboard-v2, we empirically validate CARROT’s performance against several alternative routers.
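A minimal sketch of the cost-aware selection rule described, with assumed performance and cost estimators and an illustrative trade-off parameter; this is not CARROT's actual estimator.

```python
from typing import Callable, Dict

def route(query: str,
          perf_estimators: Dict[str, Callable[[str], float]],
          cost_per_query: Dict[str, float],
          lam: float = 1.0) -> str:
    """Select the LLM maximizing (estimated performance) - lam * (cost);
    lam encodes the desired performance/cost trade-off."""
    scores = {name: est(query) - lam * cost_per_query[name]
              for name, est in perf_estimators.items()}
    return max(scores, key=scores.get)
```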


Yilun Zhu

PhD Student
Electrical Engineering and Computer Science

Classification Under Label Noise: Ignorance Is Bliss

Abstract

We establish a new theoretical framework for learning under multi-class, instance-dependent label noise. This framework casts learning with label noise as a form of domain adaptation, in particular, domain adaptation under posterior drift. We introduce the concept of relative signal strength (RSS), a pointwise measure that quantifies the transferability from noisy to clean posterior. Using RSS, we establish nearly matching upper and lower bounds on the excess risk. Our theoretical findings support the simple Noise Ignorant Empirical Risk Minimization (NI-ERM) principle, which minimizes empirical risk while ignoring label noise. Finally, we translate this theoretical insight into practice: by using NI-ERM to fit a linear classifier on top of a self-supervised feature extractor, we achieve state-of-the-art performance on the CIFAR-N data challenge.
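The NI-ERM recipe described at the end of the abstract might look like the sketch below, with random placeholders standing in for the self-supervised features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ni_erm(features, noisy_labels):
    """Noise Ignorant ERM: minimize ordinary empirical risk on the noisy labels,
    with no noise-correction term, on top of frozen features."""
    return LogisticRegression(max_iter=1000).fit(features, noisy_labels)

# `features` would be outputs of a pre-trained self-supervised encoder;
# random placeholders are used here for illustration.
rng = np.random.default_rng(0)
model = ni_erm(rng.normal(size=(1000, 128)), rng.integers(0, 10, size=1000))
```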


Mohammad Aamir Sohail

PhD Student
Electrical Engineering and Computer Science

Quantum Algorithm for Gene Regulatory Network Inference

Abstract

Large-scale biological networks, such as gene regulatory networks (GRNs), often exhibit non-classical statistical features that are not captured by conventional probabilistic models. This is analogous to the wave-particle duality in quantum mechanics, where interference phenomena violate the classical law of total probability. My research addresses these challenges through two thrusts. First, I develop quantum-like statistical methods that incorporate multi-way interactions, capturing both direction and strength of regulatory interactions using hypergraph-based modeling. Second, I leverage quantum computing to overcome the exponential computational complexity inherent in classical optimization approaches for network analysis. By harnessing superposition and entanglement, my work seeks to efficiently analyze these networks, enabling precise identification of key regulatory pathways. This approach promises robust insights into complex biological systems and potential breakthroughs in disease mechanisms and therapeutic strategies. Ultimately, these quantum-like methods hold promise for improving the accuracy and scalability of biological network inference, in fields such as cancer biology.


Qiyuan Chen

PhD Student
Industrial & Operations Engineering

Online Learning of Optimal Sequential Testing Policies

Abstract

This paper studies an online learning problem focused on finding optimal testing policies for a sequence of subjects. Although applying all candidate tests can lead to more informed decisions, it may be advantageous—especially when tests are correlated and costly—to perform only a subset of tests for a subject and make decisions with partial information. While one could treat this as a Markov Decision Process (MDP) if the joint distribution of test outcomes were known, in practice, that distribution is typically unknown and must be learned online during the testing of subjects. However, each subject that is not fully tested produces a record with missing data, which introduces bias in the learning of the joint distribution. As a result, this problem is fundamentally more challenging than conventional episodic MDP learning. Theoretically, we show that the minimax regret must scale at least as $\tilde{\Theta}(d T^{2/3})$, highlighting the intrinsic difficulty introduced by missing data. To gain both theoretical and practical insights, we then examine a special case of the problem—the cost-sensitive online Maximum Entropy Sampling Problem (MESP)—where the reward of one's decision is unaffected by missingness. Taking advantage of this special structure, we propose an iterative elimination algorithm that achieves a cumulative regret of $\tilde{O}(d^3 \sqrt{T})$, offering a notable improvement over the general case. Numerical experiments confirm these theoretical findings in both settings. Our work advances the understanding of the exploration-exploitation trade-off in the presence of missing data and provides useful guidance for practitioners seeking to optimize sequential testing policies online.


Junchao Tang

PhD Student
Sociology

Dagum Regressions for Studying Income Inequality

Abstract

In a highly influential article, “Variance Function Regressions for Studying Inequality,” Western and Bloome (2009) introduced variance-function regression (VFR), a two-equation regression model for the joint modeling of the conditional mean and variance of a variable, typically log income or log earnings. VFR has been extensively used to study the contribution of between-group and within-group income inequality to overall trends in income inequality, to conduct counterfactual analyses, and to decompose changes in inequality into the shares that may be attributed to various factors (e.g., increasing returns to education), among other purposes (e.g., Mouw and Kalleberg 2010; VanHeuvelen 2018a, 2018b, 2018c; Western and Rosenfeld 2011; Western et al. 2008; Williams 2013; Wilmers 2017; Wodtke 2016; Zhou 2014). In this paper, we present a new regression model, Dagum regression, which is motivated similarly to VFR but goes substantially beyond VFR in several dimensions.


Felipe Maia Polo

PhD Student
Statistics

Efficient multi-prompt evaluation of LLMs

Abstract

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs’ abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry. For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Our code and data can be found at https://github.com/felipemaiapolo/prompt-eval.

Session VI

March 28th, 2:00pm – 4:00pm
Amphitheater

Moderated by
Sergio Martinez
PhD Student, Survey and Data Science


Rona Hu

Master’s Student
Survey and Data Science

Predicting the Unobserved: Improving Sexual Identity Measures in Health Disparity Studies with Machine Learning and Resampling

Abstract

Survey research on sexual identity often categorizes respondents as heterosexual, homosexual, and bisexual, but may miss more nuanced identities. Recent Federal recommendations regarding best practices for the measurement of sexual identity have called for the inclusion of “something else” response options. Prior research has suggested that estimates of health disparities between sexual identity subgroups can be affected if a “something else” response option is not provided, given that respondents who do not identify as heterosexual, homosexual, or bisexual may be forced to select one of these options even if they do not see them as being relevant. Unfortunately, some surveys lack this option. We propose an innovative method using random forests to predict four-category sexual identity based on surveys that provide these options, followed by the prediction of four-category sexual identity responses in surveys that do not include these categories but do include common covariates to construct the “unobserved” identities retrospectively. We utilize bootstrap resampling to capture uncertainty with this prediction process. Leveraging a split-ballot experiment in the 2015-2019 National Survey of Family Growth, we first fit benchmark models of interest to selected health outcomes as a function of sexual identity in the half-sample including “something else” as a response category. Using this as a training data set, we then develop a classifier for four-category sexual identity, and use the half-sample excluding “something else” as a test data set, where we develop predictions of four-category sexual identity and then use these predictions to estimate the models of interest. We repeat this process for each of several hundred bootstrap samples, and evaluate the ability of this methodology to recover the model of interest based on four sexual identity categories in the data set that did not offer them.
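A minimal sketch of the predict-then-bootstrap loop described, with placeholder arrays standing in for the NSFG covariates and four-category identity labels; the downstream health-outcome models are omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def predict_with_bootstrap(train_X, train_y4, test_X, n_boot=500):
    """Train a four-category identity classifier on the half-sample that offered
    'something else', predict it in the half-sample that did not, and repeat over
    bootstrap samples to propagate prediction uncertainty."""
    preds = []
    for b in range(n_boot):
        Xb, yb = resample(train_X, train_y4, random_state=b)
        clf = RandomForestClassifier(n_estimators=200, random_state=b).fit(Xb, yb)
        preds.append(clf.predict(test_X))   # imputed four-category identity
    return np.array(preds)                  # shape: (n_boot, n_test)
```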


Chendi Zhao

PhD Student
Survey and Data Science

Navigating Data Privacy and Utility: A Study of the Impact of Data Perturbation on Small Area Estimation

Abstract

Microdata containing individual-level information poses significant privacy risks, particularly in small geographic areas. To mitigate these risks, various perturbation methods have been developed. However, balancing the trade-off between data utility and privacy remains a pressing challenge, especially in the context of Small Area Estimation (SAE), where systematic evaluations are scarce. This study investigates the impact of data perturbation on the accuracy of area-level SAE estimates, focusing on optimizing privacy preservation and data utility. We hypothesize that higher privacy levels and smaller geographic areas will lead to reduced utility, resulting in less accurate SAE estimates.
Using the 2018–2022 American Community Survey Public Use Microdata Sample (ACS-PUMS), we estimate average household income and poverty rates at the state and Public Use Microdata Area (PUMA) levels. Six covariates (age, gender, race/ethnicity, education, occupation, and health insurance status) are incorporated into the SAE models and subjected to perturbation.
We first assess disclosure risk and classify records based on four risk thresholds. Three perturbation methods are applied at the national, state, and PUMA levels: Random Data Swapping, the Post-Randomization Method (PRAM), and Multiple Imputation (MI). For Multiple Imputation, we explore random intercept and random coefficient models, as well as fully synthetic covariates. SAE estimates are generated using the Fay-Herriot model under each perturbation scenario. Utility is evaluated using mean absolute bias, relative bias, and Mean Squared Error (MSE) relative to unperturbed estimates. Propensity models are used to assess sample balance after perturbation. Privacy protection is measured by the extent of record alterations and Hellinger Distance. The findings provide practical guidance for data users and statistical agencies, informing the selection of privacy-preserving methods that maintain key data characteristics while ensuring robust privacy protection in SAE applications.
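A minimal sketch of one of the named perturbation methods, random data swapping, where a variable's values are exchanged between randomly paired records; the pairing rule and swap rate are illustrative, and the other methods (PRAM, multiple imputation) would replace the swap step.

```python
import numpy as np
import pandas as pd

def random_swap(df: pd.DataFrame, column: str, swap_rate: float = 0.1,
                rng=None) -> pd.DataFrame:
    """Randomly pair a fraction of records and exchange their values in `column`,
    preserving the marginal distribution while breaking the link between the
    swapped values and the rest of each record (assumes a unique index)."""
    rng = rng or np.random.default_rng(0)
    out = df.copy()
    n_swap = int(swap_rate * len(df)) // 2 * 2          # even number of records
    idx = rng.choice(np.asarray(df.index), size=n_swap, replace=False)
    half = n_swap // 2
    out.loc[idx[:half], column] = df.loc[idx[half:], column].to_numpy()
    out.loc[idx[half:], column] = df.loc[idx[:half], column].to_numpy()
    return out
```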


Sarahi Palma

Undergraduate
Astronomy

Analyzing Methane’s Vertical Chemical Gradient Profile on 51 Eridani b

Abstract

The James Webb Space Telescope (JWST) now enables the characterization of alien worlds in stupendous detail. Bayesian techniques, called atmospheric retrieval, play a crucial role in constraining the range of exoplanet atmosphere properties consistent with JWST spectra. However, the retrieval analyses must make several simplifying assumptions a priori to render the parameter space tractable for statistical exploration. In particular, it is common to assume the chemical composition does not vary with height, even though we know from the solar system planets and theoretical models that different gases exhibit vertical variations. Neglecting this vertical variability can bias the inferred atmospheric properties, potentially leading to erroneous conclusions about the nature of the atmosphere and how the planet formed. Here, we investigate a case study of the well-studied directly imaged planet, 51 Eridani b (51 Eri b), a ~ 700 K planet for which chemical models predict strong vertical variation in the CH4 abundance. We present different retrieval models for 51 Eri b, one with a parameterized vertical abundance profile and one without, demonstrating how varying the presence of molecules in certain parts of the atmosphere can vastly alter the measured properties of exoplanet atmospheres.


Deji Suolang

PhD Student
Survey and Data Science

Leveraging Wearable Sensor Data to Improve Self-Reports in Survey Research: An Imputation-Based Approach

Abstract

The integration of wearable sensor data in survey research has the potential to mitigate the recall and response errors that are typical in self-report data. However, such studies are often constrained in scale by implementation challenges and associated costs. This study used data from the National Health and Nutrition Examination Survey (NHANES), which includes both self-report responses and wearable sensor data measuring physical activity, to mass-impute sensor values for the National Health Interview Survey (NHIS), a larger survey relying solely on self-reports. Imputations were performed on synthetic populations to fully account for the complex sample design features.
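
A schematic of the mass-imputation idea on fabricated data appears below: fit a prediction model for the sensor measure in a donor survey that has both sensors and self-reports, then impute it for a recipient survey with self-reports only. Variable names are hypothetical, and the authors' synthetic-population step for handling complex sample designs is omitted.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fake_survey(n, with_sensor):
    df = pd.DataFrame({"self_report_minutes": rng.gamma(2.0, 30.0, n),
                       "age": rng.integers(18, 85, n),
                       "bmi": rng.normal(28, 5, n)})
    if with_sensor:   # donor survey: noisy sensor measure related to the self-report
        df["sensor_minutes"] = 0.6 * df["self_report_minutes"] + rng.normal(0, 15, n)
    return df

nhanes_like = fake_survey(4000, with_sensor=True)     # donor (sensors + self-reports)
nhis_like = fake_survey(20000, with_sensor=False)     # recipient (self-reports only)

X_cols = ["self_report_minutes", "age", "bmi"]
model = GradientBoostingRegressor(random_state=0)
print("cross-validated R^2:",
      cross_val_score(model, nhanes_like[X_cols], nhanes_like["sensor_minutes"], cv=5).mean())
model.fit(nhanes_like[X_cols], nhanes_like["sensor_minutes"])
nhis_like["sensor_minutes_imputed"] = model.predict(nhis_like[X_cols])   # mass imputation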

Cross-validation demonstrated the robust predictive performance of the imputation model. The results showed disparities between the sensor and self-reported values. Imputation estimates and standard errors resembled the NHANES estimates and were consistent across different subgroups. Self-reports and imputed sensor values were used to predict health conditions as a means for evaluating data quality. Models with sensor values exhibited higher R-squared values and smaller deviance. The study contributes to the existing literature on combining multiple data sources and provides insights into the use of wearable sensor data in survey research.


Htay-Wah Saw

PhD Student
Survey and Data Science

Analyzing the causal effect of survey frequency on nonresponse in probability-based online panels among new panel respondents

Abstract

We present the full results of a randomized controlled trial (RCT) that evaluated the causal effect of survey frequency on nonresponse in a probability-based online panel. The experiment was implemented within the Understanding America Study (UAS), a probability-based online panel representative of the U.S. adult population. We recruited 2,000 new participants for this experiment and randomly assigned half of the participants to a low survey frequency condition (n=1,000) and the remaining half to a high survey frequency condition (n=1,000). In the low frequency condition, participants received one survey invitation every four weeks, whereas in the high frequency condition, participants received one survey invitation every two weeks. The only difference between the two conditions was the frequency of survey invitations, with other design features such as survey topics and questionnaire length remaining the same in both conditions. The RCT began in February 2024 and ended in December 2024. We found that new panelists on a more frequent survey schedule had higher response rates than those on a less frequent schedule. Subgroup analyses revealed the largest treatment effects among high-education and high-income groups, with no effects found among low-education and low-income groups. Our findings suggest that the treatment effects were mainly driven by engagement rather than by incentive effects linked to receiving a more frequent survey schedule. We discuss the theoretical and practical implications of our findings for improving panel management and retention practices in future studies.


Curtiss Engstrom

PhD Student
Survey and Data Science

A Comparison of Collapsing and Bridging Methods for Measures of Sexual Identity Using Two National Health Surveys in the United States

Abstract

The harmonization of measures of sexual identity aids in the comparative measurement of health outcomes by sexual identity between surveys that use different measures of sexual identity. For example, multiple national health surveys in the United States collect information about sexual identity, but may differ in the response options presented, the offered question wording, or both. For instance, the National Survey of Drug Use and Health (NSDUH) utilizes a three-category measure of sexual identity, while the National Health Interview Survey (NHIS) uses a four-category measure. Comparing results between the NHIS and the NSDUH would be inappropriate as health-based estimates among sexual minority individuals could differ based on the sexual identity question used. This study evaluates two possible harmonization methods for these scenarios: bridging one sexual identity measure onto another survey using random forests and collapsing response options for sexual identity on both surveys to match one another. We will contrast estimates of cigarette smoking, the leading cause of preventable death, and of lung cancer screening eligibility for individuals aged 45 and older, a preventive measure against smoking-related mortality, using either the bridged or collapsed versions of sexual identity. We will compare the statistical power achieved using either the bridged or collapsed versions of sexual identity to determine which version achieves greater power for detecting differences between sexual minority subgroups in smoking-related outcomes. We will also use area under the curve metrics to examine possible differences in the predictive power of logistic regression models that regress past-year smoking, pack-a-day smoking, and lung cancer screening eligibility on sexual identity. Additionally, we will examine goodness-of-fit metrics for each logistic regression to compare model fit using the bridged and collapsed versions of sexual identity.


Sergio Martinez

PhD Student
Survey and Data Science

Responsive survey design strategies for difficult cases in a longitudinal panel study: The case of the 2022 Health and Retirement Study

Abstract

Declining response rates in longitudinal surveys, as seen in the Health and Retirement Study (HRS) where panel response rates fell from 88% in 2010 to 74% in 2020, threaten data quality. To combat this trend, this study evaluates two responsive survey design strategies in the 2022 HRS: case prioritization and an “endgame” incentive offer.

The case prioritization evaluated in this study used an influence measure (IM) calculated across key variables to assess each active case’s potential to reduce nonresponse bias. Indicator variables (-1, 0, or 1) were computed based on whether the IM was below zero (worsened estimates), exactly zero (no effect), or above zero (improved estimates), respectively. The net sum of these indicators was calculated, with active cases scoring above zero being prioritized, since they would be expected to improve more estimates than they worsen.
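
A toy rendering of this net-score rule, with simulated influence measures standing in for the survey-specific IM calculations:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_cases, n_estimates = 8, 5
im = pd.DataFrame(rng.normal(0, 1, (n_cases, n_estimates)),   # simulated influence measures
                  columns=[f"est_{j}" for j in range(n_estimates)])

indicators = np.sign(im)              # -1 below zero, 0 at zero, +1 above zero
net_score = indicators.sum(axis=1)    # net number of estimates a case would improve
prioritized = net_score > 0           # expected to improve more estimates than it worsens
print(pd.DataFrame({"net_score": net_score, "prioritized": prioritized}))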

For the endgame strategy, eligible panelists who had not yet responded in 2022 were randomly assigned to either a treatment or control group. Upon reaching the 12th face-to-face or telephone contact attempt, those in the treatment group received a letter promising an additional $100 incentive upon completing the interview.
The findings demonstrate that both strategies were effective. Case prioritization significantly increased the response rate (RR) among prioritized cases (37% vs. 29%), particularly among those with lower education, lower employment, and more limitations with daily living activities. The endgame offer significantly increased RR in the treatment group (31% vs. 23%), especially among younger cohorts and higher-educated respondents. Additionally, treated complete respondents required fewer contact attempts (6.9 vs. 8.1).

These results emphasize the importance of case prioritization to ensure that effort is focused on cases most likely to improve survey estimates. At the same time, targeted incentive increases help to secure responses from hard-to-reach respondents in panel surveys, ultimately reducing fieldwork costs and enhancing representativeness.

Poster

March 27th, 4:30pm – 6:30pm
Assembly Hall


Benjamin Radmore

Undergraduate
Astronomy

Parametrizing New Dwarf Galaxy Candidates in Hubble Data

Abstract

Dwarf galaxies provide key insights into galaxy formation, evolution, and dark matter distributions. In this study, we highlight techniques used to characterize newly identified dwarf galaxy candidates in Hubble Space Telescope imaging. We begin by identifying red giant branch (RGB) point sources, making uniform color-magnitude cuts for each galaxy. To correct for observational biases, we perform artificial star injection and recovery tests, generating completeness curves to further restrict our selection. Using these point sources, we fit either Sérsic or exponential density profiles, using Markov Chain Monte Carlo (MCMC) sampling to determine structural parameters such as ellipticity, half-light radius, and Sérsic index. Our results include color-magnitude diagrams with RGB cuts and completeness limits, as well as corner plots showing the posterior distributions from our MCMC fits. By applying these techniques to our new dwarf galaxy candidates, we clarify their structural properties and contribute to ongoing efforts in understanding low-mass galaxy populations.
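
As a compact stand-in for the structural fit, the sketch below recovers the scale length of an exponential surface-density profile from unbinned stellar radii with emcee; the study's fits additionally handle ellipticity, the Sérsic alternative, and completeness corrections, none of which are modeled here, and all numerical values are illustrative.

import numpy as np
import emcee

rng = np.random.default_rng(3)
H_TRUE, R_MAX = 0.5, 3.0                      # scale length and outer radius (illustrative)

def sample_radii(n, h, r_max):
    # Rejection-sample radii from f(R) proportional to R * exp(-R/h) on [0, r_max].
    out = []
    while len(out) < n:
        r = rng.uniform(0, r_max, n)
        keep = rng.uniform(0, 1, n) < (r * np.exp(-r / h)) / (h * np.exp(-1.0))
        out.extend(r[keep])
    return np.array(out[:n])

radii = sample_radii(500, H_TRUE, R_MAX)

def log_prob(theta, r, r_max):
    (h,) = theta
    if not 0.01 < h < 10.0:                   # flat prior on the scale length
        return -np.inf
    x = r_max / h
    norm = h**2 * (1.0 - np.exp(-x) * (1.0 + x))        # integral of R exp(-R/h) on [0, r_max]
    return np.sum(np.log(r) - r / h) - len(r) * np.log(norm)

nwalkers, ndim = 16, 1
p0 = np.abs(rng.normal(1.0, 0.1, (nwalkers, ndim)))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=(radii, R_MAX))
sampler.run_mcmc(p0, 2000, progress=False)
h_samples = sampler.get_chain(discard=500, flat=True)[:, 0]
print("posterior scale length (16/50/84th pct):", np.percentile(h_samples, [16, 50, 84]))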


Vincent Louis Claes

Undergraduate
Astronomy

Correlation Between X-Shape Parameters and Edge-On parameters of Boxy/Peanut Barred Galaxies

Abstract

Barred galaxies can form boxy/peanut or X-shapes (BP/X) due to the bar bending out of the disk midplane. We parametrize the BP/X structure observed in edge-on barred galaxies by the radius from the peanut peak to the galaxy center (Rpea), the width of the peak (σpea), and the amplitude from the galaxy plane to the peak (apea), together with the bar amplitude, i.e., the length of the bar (Abar), and the bar pattern speed, i.e., the angular speed of the bar (Ω). We use a set of N-body models and show that there is a statistically significant correlation between Rpea and bar amplitude and between Rpea and bar pattern speed, indicating a link between X-shape and bar parameters.


Shuge Ouyang, Haytham Tang

Undergraduate
Statistics

Undergraduate
Computer Science Engineering

Low-Rank Expectile Representations of a Data Matrix, with Application to Diurnal Heart Rates

Abstract

Low-rank matrix factorization is a powerful tool for understanding the structure of 2-way data, and is usually accomplished by minimizing a sum of squares criterion. Expectile analysis generalizes squared-error loss by introducing asymmetry, allowing tail behavior to be elicited. Here we present a framework for low-rank expectile analysis of a data matrix that incorporates both additive and multiplicative effects, utilizing expectile loss, and accommodating arbitrary patterns of missing data. The representation can be fit with gradient descent. Simulation studies demonstrate the accuracy of the structure recovery. Using diurnal heart rate data indexed by person-days versus minutes within a day, we find divergent behavior for lower versus upper expectiles, with the lower expectiles being much more stable within subjects across days, while the upper expectiles are much more variable, even within subjects.
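
A minimal sketch of the core idea is below: a rank-k factorization fit by gradient descent under an asymmetric (expectile) squared-error loss with missing entries. The additive and multiplicative effects and other refinements of the full framework are omitted, and the data here are random stand-ins.

import torch

torch.manual_seed(0)
n, m, k, tau = 200, 100, 3, 0.9                    # tau = expectile level

Y = torch.randn(n, m)                              # stand-in data matrix
mask = torch.rand(n, m) < 0.8                      # observed-entry indicator

U = torch.randn(n, k, requires_grad=True)
V = torch.randn(m, k, requires_grad=True)
opt = torch.optim.Adam([U, V], lr=0.05)

def expectile_loss(resid, tau):
    # Asymmetric squared error: weight tau above the fit, 1 - tau below.
    w = tau * (resid >= 0).float() + (1.0 - tau) * (resid < 0).float()
    return (w * resid**2)[mask].mean()

for step in range(500):
    opt.zero_grad()
    loss = expectile_loss(Y - U @ V.T, tau)
    loss.backward()
    opt.step()
print("final expectile loss:", float(loss))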


Xingran Chen

PhD Student
Biostatistics

Generating Synthetic Electronic Health Record (EHR) Data: A Methodological Scoping Review with Benchmarking on Phenotype Data and Open-source Software

Abstract

Objectives: To conduct a scoping review of existing approaches for synthetic Electronic Health Records (EHR) data generation, to benchmark major methods, and to provide an open-source software and offer recommendations for practitioners.

Materials and Methods: We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, Medical Information Mart for Intensive Care III and IV (MIMIC-III/IV). Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost.

Results: 42 studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, Generative Adversarial Network (GAN)-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity.

Discussion: Method choice is governed by the relative importance of the evaluation metrics in downstream use cases. We provide a decision tree to guide the choice among the benchmarked methods.

Conclusion: GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing fidelity of the synthetic data while controlling privacy exposure, and comprehensive benchmarking of longitudinal or conditional generation methods.


Sarah Medley

PhD Student
Biostatistics

Design and Analysis of SMARTs with Treatment Preference, with Application to the STAR*D Trial

Abstract

Effective care for chronic conditions with high rates of non-response or relapse requires personalized and adaptive treatment guidelines known as dynamic treatment regimens (DTRs). Sequential, multiple assignment, randomized trials (SMARTs) are the gold standard for estimating DTRs, but SMARTs, like any trial, may struggle with recruitment and retention due to patient treatment preferences. A partially randomized, patient preference SMART (PRPP-SMART) design overcomes these issues by assigning participants with a preference to their preferred treatment and randomizing treatment indifferent participants at each stage of the SMART. We have previously shown that weighted and replicated regression models (WRRMs) combining data from all participants, whether randomized or assigned to treatment, estimate DTRs with binary outcomes with minimal bias in a PRPP-SMART. Here, we evaluate WRRMs to estimate PRPP-SMART DTRs with continuous outcomes and find that the performance of our method is robust to different preference rates and outcome distributions. We illustrate our method using data from the STAR*D trial (NCT00021528) which considered treatment preferences but did not formally compare DTRs. DTR estimates in the STAR*D example from our methodology agree with previous exploratory results and suggest a small benefit of expressing a treatment preference. The PRPP-SMART design and methods would have overcome many shortcomings of STAR*D.


Yao Song

PhD Student
Biostatistics

Q-learning Regression with Clustered SMART Data: Examining Moderators in the Construction of Clustered Adaptive Interventions

Abstract

A clustered adaptive intervention (cAI) is a pre-specified sequence of decision rules that guides practitioners on how best — and based on which measures — to tailor cluster-level intervention to improve outcomes at the level of individuals within the clusters. A clustered sequential multiple assignment randomized trial (cSMART) is a type of trial used to inform the empirical development of a cAI. The most common type of secondary aim in a cSMART focuses on assessing causal effect moderation by candidate tailoring variables. This manuscript develops a Q-learning regression method for this purpose. This includes a method for calculating confidence intervals for the parameters indexing the causal effect moderation function. This enables analysts to make inferences concerning the utility of candidate tailoring variables in a cAI that maximizes a mean end-of-study outcome. The method uses an M-out-of-N bootstrap approach to calculate confidence intervals with near nominal coverage rates under conditions of nonregularity, a well-known challenge in Q-learning regression. A first simulation experiment shows that confidence intervals achieve (near) nominal coverage rates under varying types of non-regularity, and investigates the impact of varying sizes for the total study sample and true intra-cluster correlation coefficient on coverage rates. A second experiment investigates the impact of a modified estimator that uses a working variance model for individuals within clusters on the size of the confidence interval (efficiency) and coverage rates. Methods are illustrated using data from ADEPT to inform the construction of a clinic-level cAI to improve the uptake of evidence-based practices in the treatment of patients with mood disorders.
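
The m-out-of-n bootstrap device itself is easy to illustrate in isolation. The generic sketch below resamples m of n simulated clusters, rescales the resampled deviations, and forms a percentile-type interval for a simple cluster-level mean; this only stands in for the Q-learning estimands and variance models in the paper.

import numpy as np

rng = np.random.default_rng(4)
n_clusters, cluster_size = 40, 25
cluster_effects = rng.normal(0, 1, n_clusters)
data = [ce + rng.normal(0, 1, cluster_size) for ce in cluster_effects]   # simulated clusters

def estimator(clusters):
    return np.mean([np.mean(c) for c in clusters])    # stand-in estimand

def m_out_of_n_ci(data, m, n_boot=2000, alpha=0.05):
    n = len(data)
    theta_hat = estimator(data)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=m)              # resample m of n clusters
        boots.append(estimator([data[i] for i in idx]))
    # Approximate the law of sqrt(n)*(theta_hat - theta) by sqrt(m)*(boot - theta_hat).
    dev = np.sqrt(m) * (np.asarray(boots) - theta_hat)
    q_lo = np.percentile(dev, 100 * alpha / 2)
    q_hi = np.percentile(dev, 100 * (1 - alpha / 2))
    return theta_hat - q_hi / np.sqrt(n), theta_hat - q_lo / np.sqrt(n)

print(m_out_of_n_ci(data, m=int(n_clusters ** 0.8)))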


Benjamin Osafo Agyare

PhD Student
Statistics

Expectile Regression via Pseudo-Observations: A Flexible Framework for Distributional Learning in Longitudinal Data

Abstract

In longitudinal studies, responses are often measured at irregular time points, leading to challenges in analyzing their temporal evolution. Standard regression techniques require fully observed responses, limiting their applicability when studying incomplete or irregularly observed data. To address this, we develop a novel statistical framework that extends the use of pseudo-observations, a technique gaining traction in survival analysis to handle incomplete/partially observed data. Pseudo-observations approximate the contribution of each individual to an overall statistical estimate by leveraging influence function-based resampling, allowing for the estimation of quantities that would otherwise be missing. Unlike traditional approaches that focus solely on the mean response, we generalize pseudo-observations using expectile regression, enabling the study of the full conditional distribution of a response variable. This generalization is particularly important for analyzing heterogeneous effects, as it allows us to model not only central tendencies but also the upper and lower tails of the distribution.
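
A bare-bones illustration of the pseudo-observation device follows: jackknife pseudo-values for a marginal summary, fed into an expectile (asymmetric least squares) regression. The influence-function construction the authors use for irregularly observed longitudinal responses is considerably more general; the data and target quantity here are made up.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 300
x = rng.normal(0, 1, n)
y = 25 + 2.0 * x + rng.gamma(2.0, 1.0, n)        # skewed outcome, loosely BMI-like

theta_full = y.mean()                             # marginal quantity of interest
loo = (y.sum() - y) / (n - 1)                     # leave-one-out estimates
pseudo = n * theta_full - (n - 1) * loo           # pseudo-observations
# Sanity check: for the sample mean, pseudo-values reduce exactly to the raw observations.

def als_loss(beta, tau):
    resid = pseudo - (beta[0] + beta[1] * x)
    w = np.where(resid >= 0, tau, 1.0 - tau)      # asymmetric squared-error weights
    return np.sum(w * resid**2)

for tau in (0.1, 0.5, 0.9):
    fit = minimize(als_loss, x0=np.array([pseudo.mean(), 0.0]), args=(tau,))
    print(f"tau={tau}: intercept={fit.x[0]:.2f}, slope={fit.x[1]:.2f}")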

Our methodology is broadly applicable to irregularly observed longitudinal outcomes. As an application, we estimate pseudo-values for BMI at specific target ages, which then serve as synthetic response variables in regression models examining the heterogeneous effects of demographic, behavioral, and socioeconomic factors on BMI across the distribution. This approach overcomes limitations in existing methods that rely on mean-based imputation or restrictive parametric assumptions. By applying our framework to real-world longitudinal health data, we illustrate its ability to uncover nuanced relationships in BMI trajectories. More broadly, our method advances statistical techniques for handling incomplete longitudinal data and offers a flexible tool for studying heterogeneous effects across various disciplines.


Samuel Tan

Master
Biostatistics, Statistics

Evaluating the Zero-Shot Predictive Ability of Large Language Models for Continuous Glucose Monitoring Data

Abstract

With the growing adoption of Continuous Glucose Monitoring (CGM) devices in clinical settings, accurate blood glucose forecasting has become pivotal for optimal diabetes management. Although traditional statistical models and time-series models have been used to predict glucose levels from CGM data, most require extensive training on large CGM datasets and do not account for important patient demographics—such as sex, BMI, insulin therapy type, and diabetes type. The lack of an easy-to-use, out-of-the-box model that seamlessly integrates demographic information has hindered broader clinical implementation and the timely prediction of glycemic risks.
In this work, we address these challenges by repurposing a prompt-based, zero-shot Large Language Model (LLM) framework for CGM data forecasting. Rather than training a specialized regression model, we convert CGM data (with blood glucose level readings at regular intervals) into text-based prompts and augment them with relevant patient demographics. We then query out-of-the-box LLMs, without additional fine-tuning, to predict future blood glucose levels based solely on these text-formatted prompts. We evaluate model performance against conventional approaches (e.g., linear regression, XGBoost) and time-series models (e.g., ARIMA), using mean absolute error (MAE) and root mean squared error (RMSE) as our primary metrics. We anticipate that by reducing the need for extensive training and simplifying the inclusion of patient characteristics, this LLM-based approach could significantly lower barriers to clinical implementation and advance the personalization of diabetes management.
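
The prompt-construction step might look roughly like the sketch below; the template wording, demographic fields, and the downstream LLM call are illustrative choices rather than details taken from the study.

from dataclasses import dataclass

@dataclass
class Patient:
    sex: str
    bmi: float
    diabetes_type: str
    insulin_therapy: str

def build_prompt(glucose_mgdl, interval_min, patient, horizon_steps):
    # Serialize a CGM window plus demographics into a zero-shot forecasting prompt.
    readings = ", ".join(f"{g:.0f}" for g in glucose_mgdl)
    return (
        f"Patient: {patient.sex}, BMI {patient.bmi:.1f}, "
        f"{patient.diabetes_type} diabetes, {patient.insulin_therapy} therapy.\n"
        f"Blood glucose (mg/dL) every {interval_min} minutes: {readings}.\n"
        f"Predict the next {horizon_steps} readings as a comma-separated list."
    )

prompt = build_prompt([112, 118, 131, 140, 138, 129], interval_min=5,
                      patient=Patient("female", 27.4, "type 2", "basal insulin"),
                      horizon_steps=3)
print(prompt)   # this string would be sent to an off-the-shelf LLM of choice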


Jianhan Zhang

Undergraduate
Mathematics, Statistics

Counterfactually Fair Reinforcement Learning via Sequential Data Preprocessing

Abstract

When applied in healthcare, reinforcement learning (RL) seeks to dynamically match the right interventions to subjects to maximize population benefit. However, the learned policy may disproportionately allocate efficacious actions to one subpopulation, creating or exacerbating disparities in other socioeconomically-disadvantaged subgroups. These biases tend to occur in multi-stage decision making and can be self-perpetuating, which if unaccounted for could cause serious unintended consequences that limit access to care or treatment benefit. Counterfactual fairness (CF) offers a promising statistical tool grounded in causal inference to formulate and study fairness. We propose a general framework for fair sequential decision making. We theoretically characterize the optimal CF policy and prove its stationarity, which greatly simplifies the search for optimal CF policies by leveraging existing RL algorithms. The theory also motivates a sequential data preprocessing algorithm to achieve CF decision making under an additive noise assumption. We prove and then validate our policy learning approach in controlling unfairness and attaining optimal value through simulations. Analysis of a digital health dataset designed to reduce opioid misuse shows that our proposal greatly enhances fair access to counseling.


Yilin Chen

Master
Statistics

Can Multivariate Adaptive Shrinkage (Mashr) Enhance Gene Discovery? A Comparative Analysis with Traditional Meta-Analysis Methods

Abstract

High-dimensional gene expression data presents unique challenges in statistical analysis, particularly in managing noise and identifying subtle biological signals. Multivariate Adaptive Shrinkage in R (Mashr) potentially offers an innovative framework for integrating and analyzing data across multiple conditions.
This study evaluates the utility of applying Mashr to gene expression effect sizes prior to running a meta-analysis. For our test case, we used eleven depression-related RNA-Seq or microarray datasets with varying small sample sizes. The effect sizes were calculated for the depression-related differential expression for each gene from each study, and Mashr was applied to compute posterior means and standard deviations for the effect sizes for each gene, borrowing information across studies. Meta-analysis was then performed using a random effects model on either the Mashr output or the traditional effect sizes.
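
For reference, the "traditional" arm of such a comparison is a standard random-effects meta-analysis; a DerSimonian-Laird combination of one gene's per-study effect sizes is sketched below with made-up numbers (the Mashr shrinkage step, an R package, is not reproduced here).

import numpy as np

effects = np.array([0.30, 0.10, 0.45, -0.05, 0.25])   # per-study effect sizes (illustrative)
ses = np.array([0.15, 0.20, 0.25, 0.18, 0.22])        # per-study standard errors

w_fixed = 1.0 / ses**2
theta_fixed = np.sum(w_fixed * effects) / np.sum(w_fixed)
q = np.sum(w_fixed * (effects - theta_fixed) ** 2)     # Cochran's Q
df = len(effects) - 1
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - df) / c)                          # DL between-study variance estimate

w_re = 1.0 / (ses**2 + tau2)                           # random-effects weights
theta_re = np.sum(w_re * effects) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
print(f"pooled effect {theta_re:.3f} (SE {se_re:.3f}), tau^2 {tau2:.3f}")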
Unexpectedly, out of 13,336 genes included in the meta-analysis, Mashr did not identify any significant results whereas a traditional meta-analysis method identified 14 differentially expressed genes (false discovery rate (FDR)<0.10). However, Mashr’s estimated effect sizes showed a strong correlation (R=0.866) with the estimated effect sizes in the traditional meta-analysis results, suggesting that the two methods identified similar differential expression patterns. Ranked results from the top 200 genes identified by each method showed approximately 25% overlap (53 genes). For the top 100 ranked differentially expressed genes, Mashr also produced a narrower range of estimated effect sizes, primarily between -1 and 1, compared to meta-analysis (approximately -1.5 to 1).
While Mashr theoretically offers unique advantages, such as managing noise and providing nuanced estimates, this study suggests that its sensitivity relative to traditional meta-analysis remains uncertain. Further exploration is required to determine the reliability of signals detected by either method in the absence of ground truth data.


Soham Das

PhD Student
Statistics

Extended Bayesian Estimation for In-flight Calibration of Space-based Instruments

Abstract

Instrument calibration is essential for space-based instruments to ensure accurate measurements of physical quantities, such as photon flux from astronomical sources. This process involves in-flight adjustments to address discrepancies arising from instrument-specific variations. To enhance calibration reliability, we propose a Bayesian hierarchical spectral model that leverages domain-specific priors and integrates information across different spectra, instruments, and sources. By analyzing measurements of the same set of objects from multiple instruments, the model estimates posterior uncertainties for calibration parameters and supports statistically principled inference about structural parameters. Using a log-normal approximation for photon counts, our approach provides a principled alternative to empirical methods, improving the calibration of new equipment.


Ashlan Simpson

PhD Student
Statistics

Quantifying Drought Effects on Tree Growth Using Bayesian Stochastic Antecedent Modeling at the Boreal-Temperate Ecotone

Abstract

Drought exerts a profound influence on tree growth, yet its effects vary across species, ecosystems, and temporal scales. Understanding these effects is particularly critical at biome transition zones, such as the boreal-temperate ecotone, where climate change may exacerbate drought impacts. Bayesian stochastic antecedent modeling (BSAM) provides a powerful framework for disentangling these influences by probabilistically estimating how past climate conditions shape tree height growth over time. This study applies BSAM to multiple tree species representative of both boreal and temperate forest biomes, leveraging experimental data from B4WarmED, a warming experiment in northern Minnesota. By incorporating Bayesian inference, BSAM accounts for uncertainty in drought timing, duration, and severity, while identifying species-specific growth responses to antecedent climate conditions.
The primary objective is to attribute past variability in tree height growth to drought by assessing how different species respond to historical moisture deficits. This approach enables the quantification of drought sensitivity across species, shedding light on the role of antecedent conditions beyond the current growing season. By integrating experimental data with potential observational datasets, this study will explore whether trees exhibit delayed responses to drought and how interspecies differences influence growth resilience or susceptibility.
Results will provide insights into how trees at the boreal-temperate ecotone respond to past and ongoing climate variability, enhancing our understanding of forest dynamics under increasing climate stress. These findings will contribute to ecological forecasting efforts, offering a probabilistic framework for predicting future growth trajectories under changing climate conditions. Ultimately, this work will inform adaptive forest management strategies, supporting ecosystem resilience in an era of heightened drought frequency and severity.


Jun Chen

Undergraduate
Statistics

Analysis of heteroscedastic outcomes in NHANES using scalable sufficient dimension reduction

Abstract

Sufficient Dimension Reduction (SDR) is a powerful nonparametric statistical technique that aims to reveal the relationship between responses and high-dimensional predictors by identifying a lower-dimensional structure that retains all the information about the relationship. However, existing SDR methods, including OPCG (Outer Product of Canonical Gradient), face computational challenges in handling large-scale heteroscedastic datasets. In this work, we propose and employ an extended and optimized OPCG approach (S-OPCG) that scales through support-point-based quantization. Our approach not only improves scalability but also ensures robustness in capturing sufficient dimension-reduction structures. We apply S-OPCG to the National Health and Nutrition Examination Survey (NHANES) dataset to explore the relationships between blood pressure, blood chemistry, and mental health, generating biomedical insights from the perspective of SDR.


Mason Ferlic

PhD Student
Statistics

Optimizing Event-Triggered Adaptive Interventions in Mobile Health with Sequentially Randomized Trials

Abstract

Technological advances in mobile and digital health, such as wearable sensors and momentary self-reporting, have now made it possible to monitor treatment response in near real-time. This has led to significant scientific interest in developing technology-assisted adaptive interventions. An adaptive intervention is a protocolized sequence of decision rules used to guide an intervention across multiple stages of treatments contingent on the evolving status of the individual. We introduce a new class of adaptive interventions called event-triggered adaptive interventions, which leverage time-varying tailoring variables to determine when, if, and what treatment is needed at each stage. In such mobile monitoring environments, event-triggered adaptive interventions are more agile and address temporal treatment response heterogeneity to further improve individual outcomes. Sequential multiple-assignment randomized trial (SMART) designs can be used to develop optimized event-triggered adaptive interventions. We propose a two-stage regression algorithm based on the structural nested mean model to analyze data from a SMART with continuous, longitudinal outcomes and time-varying tailoring variables. This approach targets stage-level causal effects and allows the scientist to examine time-varying treatment moderators while avoiding causal collider bias. We illustrate our methodology on data from a SMART to develop an event-triggered adaptive intervention for digitally monitored weight loss.


Benjamin Tward

Undergraduate
Statistics

Understanding simulated ecological responses to climate change through functional average treatment effect

Abstract

Current greenhouse gas emissions models project that by the 22nd century, global temperatures will be at least 5 degrees Fahrenheit warmer than the 1901-1960 average. Trees capture carbon dioxide, a significant greenhouse gas driving climate change. However, increasing temperatures and shifting climate conditions threaten forest ecosystems by altering species composition, growth rates, and overall structure. These changes will impact the ability of forests to sequester carbon effectively. Accurately quantifying these ecological shifts is crucial for improving climate models, yet existing statistical approaches struggle to capture the complexity of functional ecological responses over time. To address this, we apply the Functional Average Treatment Effect (FATE), a method for making causal inferences on treatments with functional outcomes. FATE offers a principled approach to estimating treatment effects when outcomes evolve over time, making it particularly well suited for studying climate-driven changes in tree growth. To assess the applicability of FATE, we first validate the method using synthetic data informed by B4WARMED, a University of Minnesota study on how temperature treatments affect the growth and survival of trees at the boreal-temperate ecotone. This approach allows us to establish baseline performance of FATE before applying it to noisy, real-world datasets. Since tree growth follows an S-shaped trajectory—starting slowly, accelerating, and eventually leveling off—sigmoid functions provide a biologically realistic model for evaluating FATE’s performance. We systematically adjust key sigmoid parameters to assess how well FATE captures treatment effects across tree growth patterns. This approach will refine our understanding of FATE’s adaptability across varying sigmoid transformations.
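
A toy version of the synthetic-data step is sketched below: sigmoid growth curves for control and warmed trees, with the pointwise mean difference serving as a naive functional treatment-effect curve. This stands in for, and does not implement, the FATE estimator, and all growth parameters are invented.

import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0, 10, 101)                       # years since planting

def sigmoid_growth(n, max_height, rate, midpoint, noise_sd):
    # S-shaped mean trajectory plus independent measurement noise per tree.
    heights = max_height / (1 + np.exp(-rate * (t - midpoint)))
    return heights + rng.normal(0, noise_sd, (n, t.size))

control = sigmoid_growth(60, max_height=6.0, rate=0.9, midpoint=5.0, noise_sd=0.3)
warmed = sigmoid_growth(60, max_height=5.2, rate=1.1, midpoint=4.5, noise_sd=0.3)

effect_curve = warmed.mean(axis=0) - control.mean(axis=0)   # naive pointwise effect estimate
print("largest absolute effect at year", t[np.argmax(np.abs(effect_curve))])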


Jiuqian Shang

PhD Student
Statistics

Statistical Inferences and Uncertainty Quantification for Noisy Low-Tubal-Rank Tensor Completion

Abstract

The low-tubal-rank tensor model has been used for real-world multidimensional data to capture signals in the frequency domain. Algorithms have been developed to estimate low-rank third-order tensors from partial and corrupted entries. However, uncertainty quantification and statistical inference for these estimates remain largely unclear.

Our work addresses this gap. We introduce a flexible framework for making inferences about general linear forms of a large tensor whenever an entry-wise consistent estimator is available. Under mild regularity conditions, we construct asymptotically normal estimators of these linear forms through double-sample debiasing and low-rank projection. These estimators allow us to construct confidence intervals and perform hypothesis testing. Simulation studies support our theoretical results, and we apply the method to the total electron content (TEC) reconstruction problem.


Jiatong Liang

PhD Student
Statistics

Likelihood-based inference of migration surfaces

Abstract

In this work, we derive a method for visualizing spatial population structure using inverse instantaneous coalescent rate (IICR) curves. Unlike traditional approaches, such as EEMS, which model genetic variation as a function of migration rates and approximate its expectation using resistance distance, our method introduces a fundamentally different perspective by focusing on the coalescent process. The IICR curve quantifies the rate at which lineages coalesce as a function of time, providing a framework for inferring population structure. Our approach is based on a stepping-stone model and we model the relationship between pairs of samples as independent Markov processes with an extended joint state space that accounts for coalescence. By utilizing efficient procedures to compute the matrix exponential, we derive the distribution of coalescent times and expected IICR curves with high computational efficiency. This enables us to infer migration surfaces and visualize population structure.
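
To illustrate the computation at the heart of this approach, the sketch below evaluates the expected IICR curve for a toy two-deme symmetric model from the matrix exponential of the joint-state rate matrix, using the standard definition of the IICR for a pair of lineages, IICR(t) = P(T > t) / f(t), where T is the pairwise coalescence time. The deme layout and migration rate are illustrative only; the method in the abstract generalizes this to full stepping-stone lattices and uses it for likelihood-based inference.

import numpy as np
from scipy.linalg import expm

m = 0.5     # per-lineage migration rate (illustrative)
# States for a pair of lineages: 0 = same deme, 1 = different demes, 2 = coalesced (absorbing).
Q = np.array([
    [-(1.0 + 2 * m), 2 * m, 1.0],
    [2 * m, -2 * m, 0.0],
    [0.0, 0.0, 0.0],
])
start = 0                                   # both lineages sampled from the same deme

def iicr(t):
    P = expm(Q * t)
    cdf = P[start, 2]                       # P(T <= t)
    pdf = (P @ Q)[start, 2]                 # d/dt P(T <= t), from the forward equation
    return (1.0 - cdf) / pdf

for t in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"t={t}: IICR={iicr(t):.3f}")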


Daniel Zou

PhD Student
Statistics

Distribution Matching for Transfer Learning under Conditional Label Shift

Abstract

Faced with a setting where the training (source) data has a different distribution than the target data, we wish to adapt a classifier trained on the source data to better perform on the target data. We propose a generalized version of the label shift assumption, conditional label shift, that bridges the spectrum between label shift and arbitrary distribution shift. Specifically, we assume that there is a lower rank matrix A such that the marginal p(Ax,y) changes but the conditional p(x|Ax,y) does not. We consider a distribution matching approach, in which we learn data importance weights to minimize the KL divergence between the weighted source distribution and the target distribution. We demonstrate the efficacy of our method in learning the lower rank matrix A and improving classification performance on the target distribution.
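
As a rough stand-in (explicitly not the authors' estimator), the classic classifier-based density-ratio trick below produces source importance weights w(x) approximately equal to p_target(x)/p_source(x). The paper instead minimizes a KL distribution-matching objective and also learns the lower rank matrix A, neither of which is implemented here.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
source = rng.normal(0.0, 1.0, (2000, 5))
target = rng.normal(0.5, 1.0, (1000, 5))          # shifted target distribution (simulated)

X = np.vstack([source, target])
d = np.concatenate([np.zeros(len(source)), np.ones(len(target))])  # domain labels

clf = LogisticRegression(max_iter=1000).fit(X, d)
p = clf.predict_proba(source)[:, 1]
weights = (p / (1 - p)) * (len(source) / len(target))   # density-ratio estimate on source points
weights /= weights.mean()                                # normalize for reweighting
print("weight summary (5/50/95th pct):", np.percentile(weights, [5, 50, 95]))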


Zhiwei Xu

PhD Student
Statistics

Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

Abstract

“Grokking” is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.
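
A schematic of the embedding-transfer initialization in PyTorch is given below; the fixed random up-projection used when the embedding widths differ, and all sizes, are illustrative assumptions rather than the paper's exact recipe.

import torch
import torch.nn as nn

vocab, d_small, d_large = 512, 32, 128
torch.manual_seed(0)

weak_embedding = nn.Embedding(vocab, d_small)     # assume this came from the trained weaker model
target_embedding = nn.Embedding(vocab, d_large)   # embedding of the stronger target model

with torch.no_grad():
    proj = torch.randn(d_small, d_large) / d_small**0.5   # illustrative up-projection
    target_embedding.weight.copy_(weak_embedding.weight @ proj)

# The target model is then trained as usual, starting from this transferred embedding.
tokens = torch.randint(0, vocab, (4, 16))
print(target_embedding(tokens).shape)             # (4, 16, 128)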


Daniele Bracale

PhD Student
Statistics

Microfoundation inference for strategic prediction

Abstract

Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed performative prediction. Generally, this influence stems from strategic actions taken by stakeholders with a vested interest in predictive models.
A key challenge that hinders the widespread adoption of performative prediction in machine learning is that practitioners are generally unaware of the social impacts of their predictions. To address this gap, we propose a methodology for learning the distribution map that encapsulates the long-term impacts of predictive models on the population. Specifically, we model agents’ responses as a cost-adjusted utility maximization problem and propose estimates for said cost.
Our approach leverages optimal transport to align pre-model exposure (ex ante) and post-model exposure (ex post) distributions. We provide a rate of convergence for this proposed estimate and assess its quality through empirical demonstrations on a credit-scoring dataset.


Sunrit Chakraborty

PhD Student
Statistics

FLIPHAT: Joint Differential Privacy for High Dimensional Sparse Linear Bandits

Abstract

High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, Forgetful Iterative Private Hard Thresholding (FLIPHAT). Along with the doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of the Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret in terms of privacy parameters ε, δ, context dimension d, and time horizon T up to a linear factor in model sparsity and logarithmic factor in d. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of parallel interest.
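
A generic noisy iterative hard thresholding step for sparse linear regression is sketched below (gradient step, Gaussian noise injection as the privacy mechanism, then keep the s largest coordinates); the noise calibration, episode doubling, and forgetting that define FLIPHAT are not included, and the step size and noise scale are illustrative.

import numpy as np

rng = np.random.default_rng(9)
n, d, s = 500, 200, 5
beta_true = np.zeros(d)
beta_true[:s] = rng.normal(0, 1, s)
X = rng.normal(0, 1, (n, d))
y = X @ beta_true + rng.normal(0, 0.5, n)

def noisy_iht(X, y, s, eta=0.1, noise_sd=0.02, iters=100):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ beta - y) / len(y)                  # least-squares gradient
        beta = beta - eta * grad + rng.normal(0, noise_sd, X.shape[1])   # noisy step
        keep = np.argsort(np.abs(beta))[-s:]                  # hard threshold to top-s
        sparse = np.zeros_like(beta)
        sparse[keep] = beta[keep]
        beta = sparse
    return beta

beta_hat = noisy_iht(X, y, s)
print("recovered support:", sorted(np.nonzero(beta_hat)[0]))
print("estimation error:", np.linalg.norm(beta_hat - beta_true))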


Soo Min Kwon

PhD Student
Electrical and Computer Engineering

Dynamic Subspace Estimation from Undersampled Data using Grassmannian Geodesics

Abstract

In this work, we consider recovering a sequence of low-rank matrices from undersampled measurements, where the underlying subspace varies across samples over time. Existing works involve concatenating all of the samples from each time point to recover the underlying matrix under the assumption that the data are generated from a single, static subspace. However, this assumption may be sub-optimal for applications in which the subspaces vary over time. To address this issue, we propose a Riemannian block majorize-minimization algorithm that constrains the time-varying subspaces as a geodesic along the Grassmann manifold. Our proposed method can faithfully estimate the subspaces at each time point, even when the number of samples at each time point is less than the rank of the subspace. Theoretically, we show that our algorithm enjoys a monotonically decreasing objective function while converging to an ε-stationary point within Õ(ε^-2) iterations. We demonstrate the effectiveness of our algorithm on synthetic data, dynamic fMRI data, and video data, where the samples at each time point are either compressed or partially missing.


Anabela Dill Gomes

Master
MEND – Metabolism, Endocrinology and Diabetes Division of MM

Exploring MRI quantification techniques for phenotypes of adiposity: data from a clinical trial of metreleptin to treat Madelung’s disease

Abstract

Precise body fat measurements are essential for assessing novel phenotypes and monitoring lipodystrophy syndromes, but there is a lack of adequate tools for evaluating regional fat distribution. In this study, we adapt a fat quantification method previously developed by our radiology team to quantify adipose tissue in the neck and thighs to assist in diagnosing abnormal fat distribution disorders.
Data from a previously reported clinical trial involving 4 subjects with Madelung’s Disease were collected before and after 24 weeks of leptin replacement therapy. MRI and DEXA scans were performed at baseline and week 24, with region of interest (ROI) masks applied to three slices in the neck and mid-thigh regions. We quantified the percentage of fat in these areas and compared them across methods. Statistical analyses, including t-tests, Pearson correlation, and Bland-Altman analysis, were performed to assess measurement differences and agreement between MRI and DEXA.
MRI quantification showed an average reduction of 17% fat in the neck (baseline = 45%±0.1, week 24 = 37%±0.2) and 36% in the thigh (baseline = 26%±0.2, week 24 = 16.6%±0.2). DEXA showed a 23% reduction in the neck (baseline = 37.8%±0.05, week 24 = 29%±0.1) and a 26.5% reduction in the thigh (baseline = 11.2%±0.06, week 24 = 8.3%±0.03). The fat percentages detected by MRI were significantly different compared to DEXA in both the neck (p=0.03) and thigh (p=0.05). Both methods showed good correlation in the neck (r = 0.79, p=0.02) and thigh (r = 0.92, p=0.001). However, Bland-Altman analysis revealed good agreement for neck quantification (bias = 1.25), but not for thigh quantification (bias = 1.85).
Despite differences in fat quantification, MRI and DEXA were highly correlated. MRI with ROI mapping provides precise regional fat quantification, which can be valuable for fat distribution studies. Further validation in larger samples is needed.


Karly Miller

Master
Nutritional Sciences

Food Groups and Micronutrients Associated with Chemotherapy-Induced Peripheral Neuropathy in Survivors of Cancer Post-Neurotoxic Treatment

Abstract

Associations Between Dietary Intake and Chemotherapy-Induced Peripheral Neuropathy Among Survivors of Cancer

Purpose
This cross-sectional study compared food group and micronutrient intake between survivors of cancer with and without chemotherapy-induced peripheral neuropathy (CIPN) using linear, logistic, and ordered logistic regression models.

Methods
Participants completed the PRO-CTCAE Numbness and Tingling severity item and Vioscreen Research Graphical Food Frequency Questionnaire to determine CIPN status and dietary intake, respectively. Separate linear, logistic, and ordered logistic regression models were used to calculate the mean food group and micronutrient intakes and the association between CIPN status and severity adjusting for average daily caloric intake, age, race, gender, BMI, and food security status. As this is a secondary analysis of prior research that was not powered to detect these associations, we considered p≤0.10 as an indicator of relevance for future study.

Results
A total of 136 participants (56% with CIPN, age 54.45 years ± 12.06, 93% female, and 93% white) completed the surveys and questionnaires. Participants with CIPN reported greater intake of refined grains (p=0.01) and thiamin (p=0.09) along with lower intakes of eggs (p=0.04), legumes (p=0.05), and selenium (p=0.02) per 1,000 kcals daily. The odds of experiencing worse CIPN increased with each additional intake of refined grains (OR=2.05, 95% CI: 1.22, 3.46), and decreased with additional intake of tomatoes (OR=0.10, 95% CI: 0.01, 1.14), fish (OR=0.2, 95% CI: 0.06, 0.75), eggs (OR=0.06, 95% CI: 0.01, 0.34), and selenium (OR=0.96, 95% CI: 0.92, 1.00).

Conclusion
There are meaningful dietary intake differences between participants with and without CIPN. Further research is needed to establish a dietary intervention using the findings in a larger population of survivors of cancer.


Carly Mistick, Nathan Allan

Undergraduates
Physics and Astronomy


Confirming Simulation Expectations of Galaxy Cluster Velocity Dispersion Bias with Ensemble Velocity Likelihood Modeling

Abstract

A deeper understanding of the masses of galaxy clusters, which consist of both baryonic and dark matter, will lead to better constraints on the cosmological parameters. Because cluster masses are not directly measurable, we can determine the masses of galaxy clusters by estimating their internal dark matter velocity dispersions, which are inferred from velocity dispersions of satellite galaxies within each cluster. However, because satellite galaxies are biased tracers of dark matter density and velocity fields, a velocity dispersion bias is introduced. Recent cosmological simulations indicate a ‘brighter is cooler’ scaling relation: more massive satellite galaxies should have slightly smaller velocity dispersion bias than less massive satellite galaxies. We confirm this scaling relation at the percent level by utilizing empirical data from large surveys within an ensemble velocity likelihood model.


Yao Lu

Master
Survey and Data Science

Reexamining Survey Methodology: Evaluating Complex Social Surveys with LLMs in Predicting 2024 Election Outcomes

Abstract

Over the past eight years of U.S. elections, nearly all traditional large-scale election surveys based on telephone and text methods have failed to make accurate predictions. This study aims to leverage advanced generative AI technologies to assist social surveys and predict election outcomes through AI simulations. ChatGPT, as the most advanced large language model currently available, serves as an ideal tool for election data simulation, given its training data cutoff in 2019, which excludes the past four years of election events. Specifically, abortion rights—a key issue of contention between the two major parties with complex historical, political, and cultural dimensions—provides a rich dataset for analysis.
This study focuses on the impact of abortion rights on election outcomes. Methodologically, we employ clustering, classification, and similarity evaluation techniques (e.g., KL divergence and regression modeling) to simulate voter preferences across various demographic characteristics, including race, geography, economic status, and cultural and religious backgrounds. The AI-generated results are then compared with real-world data to evaluate the accuracy of AI in predicting political and other complex social issues, as well as the challenges it faces.
This research represents an innovative approach in survey methodology. It not only applies AI technology as a substitute for traditional survey methods but also addresses the biases and limitations inherent in generative AI-generated data. Using abortion rights as a case study, we discuss how to identify key and effective predictive variables when applying big data to complex social problem predictions. We argue that AI will become a pivotal tool in the field of complex social surveys in the future.
