Student Presentations – MSSISS 2024

Student Presentations

Oral Presentation 1: Applications

March 28th 9:15 AM – 12:00 PM @ East Conference Room

Aaron Kaye

Economics, Business Economics & Public Policy (Joint)

The Personalization Paradox: Welfare Effects of Personalized Recommendations in Two-Sided Digital Markets

Abstract

In many online markets, platforms engage in platform design by choosing product recommendation systems and selectively emphasizing certain product characteristics. I analyze the welfare effects of personalized recommendations in the context of the online market for hotel rooms using clickstream data from Expedia Group. This paper highlights a tradeoff between match quality and price competition. Personalized recommendations can improve consumer welfare through the “long-tail effect,” where consumers find products that better match their tastes. However, sellers, facing demand from better-matched consumers, may be incentivized to increase prices. To understand the welfare effects of personalized recommendations, I develop a structural model of consumer demand, product recommendation systems, and hotel pricing behavior. The structural model accounts for the fact that prices impact demand directly through consumers’ disutility of price and indirectly through positioning by the recommendation system. I find that ignoring seller price adjustments would cause considerable differences in the estimated impact of personalization. Without price adjustments, personalization would increase consumer surplus by 2.3% of total booking revenue (∼$0.9 billion). However, once sellers update prices, personalization would lead to a welfare loss, with consumer surplus decreasing by 5% of booking revenue (∼$2 billion).

Abigail Kappelman

Epidemiology

Establishing survey methodology to capture long term population level Patient Reported Outcomes (PROs) in a surgical population

Abstract

Background: Understanding long-term patient-reported outcomes (PROs) following surgery requires an efficacious survey methodology. We utilized data from a surgical registry to establish a survey frame to gather long-term PROs, providing a scientific route to assess generalizability, even within the context of lower overall response rates in a surgical population. Methods: Leveraging a statewide hernia surgery registry, we conducted a one-year post-operative survey using three PRO measures (PROMs): the Ventral Hernia Recurrence Inventory, PROMIS Pain Intensity 3a, and HerQLes. The registry collects basic contact information, socioeconomic characteristics, and clinical variables potentially associated with PROs. We employed a responsive survey design approach across four cohorts, varying invitation and reminder methods and incentive offers. Outcomes included: contact rate and response rate, calculated using American Association for Public Opinion Research (AAPOR) formulas; item non-response (%); and the impact of the number of reminders and incentive offer on response rates, respondent characteristics, and item non-response (%). Each cohort was exposed to a unique set of design features to investigate associations between design features and response rates and to determine survey cost effectiveness. Further, differences in characteristics of respondents and non-respondents were investigated using registry data, and adjustment methods, grounded in rich auxiliary data, were developed and evaluated. Results: Of 7,062 representative patients who received hernia surgery between January 2020 and March 2022, 6,068 were eligible for survey participation. Contact was achieved with 5,645 (contact rate 93.02%) and 1,816 responded to the survey across all four cohorts (overall response rate 29.93%). Response rates by cohort were 42.34%, 32.48%, 25.19%, and 25.89%, with overall low item non-response (%). Response rates increased with the number of reminders, but with diminishing returns over time; offering a postpaid incentive over no incentive did not significantly improve overall response rates; and item non-response was not associated with incentive offer. We identified targeted phone call reminders as a cost-effective strategy. Respondents were comparable to the survey population after weighting for available registry covariates. Conclusion: We illustrate a strategy to maximize response rate with known cost components and to evaluate representativeness of long-term PROs using a sample-based registry, targeted multi-mode contact methods, and weighting adjustment methods.

Alicia Dominguez

Biostatistics

Investigating the Impact of Winner’s Curse on Polygenic Risk Scores

Abstract

Polygenic risk scores (PRS) are an increasingly used tool to predict genetic risk for many complex traits. PRS quantify the cumulative effect of risk alleles at several genetic markers by using estimates from genome-wide association studies (GWAS) as weights. GWAS are a powerful approach to identify genetic variants associated with traits of interest; however, this study design has limitations that can bias or limit the utility of downstream analyses. One form of bias GWAS encounter is Winner’s Curse, a phenomenon where the estimated effect sizes of significant variants tend to be larger in magnitude than their true values. Using overestimated effect sizes can limit PRS utility by inflating the weights of non-causal variants, but this has not been studied in detail. In this project, I evaluate the impact of Winner’s Curse on PRS and assess whether adjusting for it improves PRS performance. For this analysis, we use simulated GWAS summary statistics and genotype data for a million markers in linkage disequilibrium (LD) under varying sample size, heritability, and polygenicity parameters. We obtain three sets of PRS calculated with varying numbers of markers using the clumping and p-value thresholding method. We compare the performance of PRS calculated with the original summary statistics to PRS calculated with summary statistics adjusted for Winner’s Curse. We assess Empirical Bayes and FDR Inverse Quantile Transformation as methods to correct for Winner’s Curse. We found that when more markers are included in the original PRS, the variance of PRS estimates dramatically increases. However, the variance of the adjusted PRS is much better controlled than that of the original PRS when more than 100 markers are used. Additionally, we find that the adjusted PRS becomes increasingly more correlated with the true genetic predictor than the original PRS as the number of markers used in the calculations increases. For PRS calculated with more than 100 markers, the Empirical Bayes method performs best at preserving the correlation between the true genetic predictor and PRS estimates. Some improvement in performance can be attributed to shrinkage of the effect size estimates of non-causal variants that are added to the PRS. In this analysis, we demonstrate that methods that account for Winner’s Curse improve PRS performance under several simulation scenarios. We observe large improvements in variance and in correlation with the true genetic predictor, especially when more than 100 markers are included in the construction of the PRS.
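As a toy illustration of the shrinkage idea above, the sketch below builds an unadjusted PRS and an Empirical Bayes-adjusted PRS from simulated summary statistics, using a simple normal-normal shrinkage rule. All quantities (beta_hat, se, genotypes) are fabricated assumptions; this is not the presenter's simulation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GWAS summary statistics: estimated effects and standard errors
n_markers, n_people = 500, 200
beta_hat = rng.normal(0.0, 0.05, n_markers)              # estimated per-allele effects
se = np.full(n_markers, 0.02)                            # standard errors
genotypes = rng.binomial(2, 0.3, (n_people, n_markers))  # allele counts (0/1/2)

# Empirical Bayes shrinkage under a normal-normal model:
# beta_hat_j ~ N(beta_j, se_j^2), beta_j ~ N(0, tau^2);
# tau^2 is estimated by method of moments.
tau2 = max(np.mean(beta_hat**2 - se**2), 0.0)
beta_eb = beta_hat * tau2 / (tau2 + se**2)               # posterior mean shrinks toward 0

# Polygenic risk scores: weighted sums of allele counts
prs_raw = genotypes @ beta_hat
prs_adj = genotypes @ beta_eb

print("variance of unadjusted PRS:", prs_raw.var())
print("variance of EB-adjusted PRS:", prs_adj.var())
```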

Basil Isaac

Economics

Opioid Shortages and Overdose Deaths

Abstract

Since 2015, the United States has faced an average of 130 new drug shortages every year. Drug shortages have cascading effects on patients’ health, causing delays in treatment, the use of inferior alternatives, and an increased risk of medication errors. In addition to healthcare practitioners, policymakers have also expressed concern about shortages. However, most existing research on the harms of drug shortages consists of case studies or single-center reports. This paper is one of the first to examine the effects of a drug shortage using state-level variation in the incidence of a shortage. Specifically, it quantifies the effect of a 16-month shortage of oxymorphone, an opioid analgesic, in 2012 on patient errors leading to death. Oxymorphone is listed in Schedule II of controlled substances by the Drug Enforcement Administration (DEA). An unanticipated shortage of oxymorphone necessitates an abrupt switch to other opioids, which can cause medication errors. Furthermore, patients using oxymorphone may switch to illicit use of other opioids, such as heroin, which carry a higher risk of death. Although oxymorphone accounted for less than 1% of opioid fills in the early 2010s, this paper demonstrates that its shortage led to a significant increase in drug-overdose-related deaths. In this paper, I create a geographically varying measure of the incidence of the oxymorphone shortage using administrative data on legal shipments of the drug. There is considerable variation in the incidence of the shortage: some states experienced no decline in their supply of oxymorphone, while others experienced a 50% reduction relative to the period before the shortage. To quantify the effect of the shortage on errors leading to death, I use the Underlying Cause of Death mortality files published by the National Vital Statistics System. Specifically, I use deaths related to an overdose of opioids, hallucinogens, or other drugs. I construct a shift-share instrument to address the potential endogeneity between the volume of oxymorphone shipments and drug-overdose deaths, exploiting features of this specific shortage episode to create the instrumental variable. In 2012, there were two primary manufacturers of the drug, and the shortage was precipitated by manufacturing issues in the factory belonging to one of them. The shift-share instrument is a state’s projected supply of oxymorphone during the shortage period, holding fixed the share of each manufacturer’s total shipments that went to the state. I demonstrate that pharmaceutical supply chains exhibit stickiness, which yields strong first-stage effects for the instrumental variable. Estimates from the regression using the instrumental variable along with state and time fixed effects demonstrate that an unanticipated reduction in the supply of oxymorphone leads to a statistically significant increase in the number of drug-overdose-related deaths. Back-of-the-envelope calculations suggest that the oxymorphone shortage led to a 4% increase in deaths related to drug overdoses. This is one of the few papers examining the effect of a drug shortage at the national level, and the first to use a shift-share instrumental variable to address the endogeneity concern that arises from using geographical variation in the incidence of a shortage. In addition to serving as a template for research into the effects of other shortages, the results here also caution against the DEA’s use of restrictive production quotas, which have recently precipitated other opioid shortages.
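For intuition, the sketch below constructs a shift-share instrument of the kind described above on made-up shipment data: pre-shortage state shares of each manufacturer's shipments are held fixed and interacted with national shipments during the shortage. All numbers, states, and column names are hypothetical and for illustration only.

```python
import pandas as pd

# Hypothetical pre-shortage shipments (manufacturer x state), arbitrary units
pre = pd.DataFrame(
    {"state": ["MI", "MI", "OH", "OH", "TX", "TX"],
     "manufacturer": ["A", "B", "A", "B", "A", "B"],
     "shipments": [100.0, 50.0, 80.0, 120.0, 60.0, 40.0]}
)

# Hypothetical national shipments by manufacturer during the shortage period;
# manufacturer A's factory problems cut its output sharply.
shortage_national = {"A": 120.0, "B": 200.0}

# Shares: fraction of each manufacturer's pre-shortage shipments going to each state
pre["mfr_total"] = pre.groupby("manufacturer")["shipments"].transform("sum")
pre["share"] = pre["shipments"] / pre["mfr_total"]

# Shift-share instrument: projected state supply holding pre-period shares fixed
pre["projected"] = pre["share"] * pre["manufacturer"].map(shortage_national)
instrument = pre.groupby("state")["projected"].sum()
print(instrument)
```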

Cameron Pratt

Astronomy

Separating Weak Astrophysical Signals using Multi-Channel Convolutional Neural Networks

Abstract

A major task throughout many disciplines of science is to measure a signal coming from a source of interest that is mixed with other components. Many component separation techniques have been developed to isolate the desired signals; however, these often rely on strong assumptions that may not always be valid. Recently, scientists have turned to machine learning algorithms in hopes of achieving more reliable results. I will present a convolutional neural network that can separate signals given multi-channel images. Moreover, I apply this technique to observations of the Universe in the microwave bands with the intent of detecting weak signals coming from clusters of galaxies. Similar methods can be and have been applied to other areas of research, such as the separation of tumors from healthy tissue in biomedical imaging, and may be useful in many different contexts.

Elizabeth Trinh

Management and Organizations, Ross

The Busy Bee Effect: How and Why Self-imposed Busyness Affects Work-related Outcomes

Abstract

While conventional wisdom from wellness experts and work-life balance advocates, as well as prior research, often casts busyness in a negative light, many people engage in busyness—often by their own choice. To explore this puzzle, we execute a multi-study investigation. In Study 1, we conducted a qualitative study of 39 entrepreneurs to understand their experiences of busyness. Through our inductive analysis, we identify the phenomenon of ‘self-imposed busyness,’ which we define as the self-initiated filling of one’s time with work-related tasks or activities. Building on this, we develop a conceptual model to investigate the dual-edged effects of self-imposed busyness. We then test it across two studies of creative independent workers and management students. Our findings reveal that, on one hand, self-imposed busyness can increase anxiety, which in turn decreases well-being. On the other hand, self-imposed busyness can boost cognitive work engagement, which leads to improvements in well-being. Our theory and findings challenge the prevailing view that busyness is entirely detrimental, providing valuable insights into how and why self-imposed busyness can yield both positive and negative outcomes.

Jingyang Rui

Political Science

How Crisis Responsibility Attribution Shapes Authoritarian Co-optation Strategy: Evidence from China’s COVID Testing Resource Allocation

Abstract

Authoritarian governments often employ the strategy of preferential resource distribution to gain favour with certain civilian groups during external crises like natural disasters or economic downturns. However, how do they adjust their co-optation strategy if the crisis is perceived to be caused by their own actions? This study updates existing co-optation theories by proposing that when crisis responsibility becomes attributable to the government, it will maintain the distribution of genuine benefits to the civilian group with the highest rent-extraction value, while shifting from offering performative benefits to offering genuine benefits to the “ideologically favoured” group in the regime’s ideological narrative in order to boost its legitimacy. Our theory is empirically supported by an analysis of the preferential allocation of COVID-19 testing resources across different economic classes in N city, a Chinese metropolis and economic powerhouse with high income disparity, before and after the sudden outbreak of popular protests against the zero-COVID policies in October 2022. Using a novel dataset of the queuing status at more than 7,000 COVID testing sites in N city and a spatial border analysis approach across N city’s 4,698 neighbourhoods, we compare testing site density and government responsiveness to site crowdedness across testing sites in rich, middle-class, and poor neighbourhoods. We find that prior to the protest, in which government policies were viewed as a greater threat than the pandemic itself, the government offered genuine benefits to the rich, performative benefits to the poor, and the lowest level of resources to the middle class. After the protest broke out, the government provided genuine benefits to both the rich and the poor, without raising benefits to the middle class. These results are supplemented by a text analysis of N city government documents on COVID-19, showing the government’s inclination to publicize their resource preferences to the poor both before and after the outbreak of the protest.

Micaela Rodriguez

Psychology

Lonely or Just Alone? Beliefs Shape the Experience of Being Alone

Abstract

This talk suggests that current efforts to combat loneliness may inadvertently exacerbate it by negatively influencing people’s beliefs about being alone. I present evidence from novel and diverse methods that (i) the media portrays being alone as harmful, (ii) such portrayals negatively impact people’s beliefs, and (iii) such beliefs predict increases in loneliness over time. I use a combination of archival, experimental, and experience sampling data.

Minseo Kim

Electrical Engineering and Computer Science

Diffusion Model for Undersampled MRI Reconstruction

Abstract

Undersampled MRI is a critical technique to speed up MRI scans by capturing only a portion of the k-space data needed for image reconstruction. This approach enables shorter scan times while still generating diagnostically useful images. By reducing patient discomfort and motion-related artifacts caused by lengthy scans, undersampled MRI holds great potential to improve the overall patient experience. However, undersampling can result in a loss of image detail and introduce unwanted distortions, making accurate reconstruction quite challenging. Recently, machine learning methods, especially diffusion models, have gained significant attention in the field of MRI reconstruction and have shown promising outcomes across various imaging tasks. We apply the score-based diffusion model with diffusion posterior sampling to better solve this medical imaging inverse problem. We demonstrate the effectiveness of the method on a fastMRI dataset with over 3,000 images. This talk will mainly focus on the following key aspects: a brief introduction to diffusion probabilistic models (the forward (i.e., training) and reverse (i.e., sampling) processes), score modeling for training images, and the application of diffusion models to undersampled MRI images.

Peijun Wu

Biostatistics

Statistical identification of cell type-specific spatially variable genes in spatial transcriptomics

Abstract

Spatially resolved transcriptomics, enabled by a diverse range of technologies, facilitates in-depth exploration of transcriptomic landscapes, extending from individual cellular domains to broader tissue contexts. However, the interpretation of abundant gene expression data derived from techniques quantifying averaged expression per spot is frequently complicated by the heterogeneity in cellular compositions, leading to significant computational and statistical challenges. The spatial heterogeneity of gene expression within specific cell types, influenced by functionality, microenvironments, and intercellular communication, further adds to this complexity. Evident in distinct brain regions, these spatial variations in gene expression play critical roles in cellular differentiation, tissue organization, and disease progression, and in identifying potential new therapeutic targets, underscoring the importance of better analytical methods to interpret these spatially resolved transcriptomics data. To tackle these limitations, we introduce Celina (CELl type-specific spatIal patterN Analysis in spatial transcriptomics), a statistical method developed to identify genes exhibiting cell type-specific spatial expression patterns. By employing a spatially varying coefficient model, Celina examines one gene at a time and accurately models each gene’s spatial expression pattern in relation to the distribution of cell types across tissue locations. Not only does Celina maintain calibrated type I error control, but it also shows a significant increase in detection power across a spectrum of technical platforms. In applications to seven spatial transcriptomics datasets, including a mouse cerebellum Slide-seq dataset, Celina identified 5 Purkinje-specific spatial genes and 12 granular-specific spatial genes, thereby revealing spatial heterogeneity and diverse functional lobules among these cell types in the mouse cerebellum. Thus, Celina offers a significant advance in the reliable interpretation of spatial transcriptomic data, contributing an innovative dimension to our understanding of cellular heterogeneity.

Tim White

Statistics

Sequential Monte Carlo for detecting and deblending objects in astronomical images

Abstract

Many of the objects imaged by the forthcoming generation of astronomical surveys will overlap visually. These objects are known as blends. Distinguishing and characterizing blended light sources is a challenging task, as there is inherent ambiguity in the type, position, and properties of each source. We propose SMC-Deblender, a novel approach to probabilistic astronomical cataloging based on sequential Monte Carlo (SMC). Given an image, SMC-Deblender evaluates catalogs with various source counts by partitioning the SMC particles into blocks. With this technique, we demonstrate that SMC can be a viable alternative to existing deblending methods based on Markov chain Monte Carlo and variational inference. In experiments with ambiguous synthetic images of crowded starfields, SMC-Deblender accurately detects and deblends sources, a task which proves infeasible for Source Extractor, a widely used non-probabilistic cataloging program.

Oral Presentation 2: Methods/Theory

March 28th 9:15 AM – 12:00 PM @ West Conference Room

Chih-Yu Chang

Statistics

Generalized Least Square-Based Aggregation for Regression Problems

Abstract

Bagging is a noteworthy technique for improving the performance of prediction models in ensemble learning. The literature underscores the importance of effectively managing the correlation between the aggregated prediction models for the success of bagging. For instance, random forests, a widely used prediction model, tackle this issue by achieving decorrelation through the random selection of features when aggregating multiple single-tree models. This study presents an innovative approach to boost predictive performance within the context of bagging, by addressing the correlation via the concept of generalized least squares. Both theoretical analysis and numerical experiments provide evidence in favor of the proposed method, positioning it as a promising avenue for advancing bagging techniques.
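As a rough illustration of the idea (not the presenter's estimator), the sketch below aggregates bagged regression trees with generalized least squares-style weights: held-out residuals estimate the between-learner error covariance, and the resulting weights down-weight correlated, high-variance learners relative to simple averaging. The dataset, model choices, and covariance estimate are all assumptions made for the example; in practice one would estimate the covariance on data not used for evaluation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

# Bagged base learners: trees fit on bootstrap resamples of the training set
B = 20
trees = []
for b in range(B):
    idx = rng.integers(0, len(X_tr), len(X_tr))
    trees.append(DecisionTreeRegressor(max_depth=5, random_state=b).fit(X_tr[idx], y_tr[idx]))

# Residuals of each learner on a held-out set estimate the error covariance
P_val = np.column_stack([t.predict(X_val) for t in trees])   # n_val x B
R = P_val - y_val[:, None]
Sigma = np.cov(R, rowvar=False) + 1e-6 * np.eye(B)           # regularized covariance

# GLS-style weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
ones = np.ones(B)
w = np.linalg.solve(Sigma, ones)
w /= ones @ w

y_avg = P_val.mean(axis=1)    # plain bagging average
y_gls = P_val @ w             # GLS-weighted aggregation
print("MSE (average):", np.mean((y_avg - y_val) ** 2))
print("MSE (GLS weights):", np.mean((y_gls - y_val) ** 2))
```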

Dat Do

Statistics

Dendrogram of latent mixing measures: Learning hierarchy and model selection for finite mixture models

Abstract

We present a new way to summarize and perform model selection for mixture models via the hierarchical clustering tree (dendrogram) of the over-fitted latent mixing measure. Our proposed method bridges agglomerative hierarchical clustering with a mixture model-based approach. The dendrogram’s construction is derived from the theory of convergence of the mixing measures, so we can (1) consistently select the true number of components and (2) recover a good convergence rate for parameter estimation from the tree. Theoretically, it explicates the choice of the optimal number of clusters in hierarchical clustering. Methodologically, the dendrogram reveals more information on the hierarchy of subpopulations compared to traditional ways of summarizing mixture models. Several simulation studies are carried out to support our theory. We also illustrate the methodology via single-cell RNA sequence data.

Gabriel Durham

Statistics

Comparing Multilevel Adaptive Interventions in Clustered SMARTs using Longitudinal Outcomes: With Application to Health Policy

Abstract

Researchers and policymakers can conceptualize many health policies as adaptive interventions, which incorporate adjustments of recommended action(s) at each decision point based on previous outcomes or actions. Health policy interventions often involve intervening at a system-level (e.g., a primary care clinic) with the intent to modify behavior of individuals within the system (e.g., doctors in a clinic). Policy scientists (e.g., implementation scientists) can use clustered, sequential, multiple assignment, randomized trials (SMART) to compare such “multilevel adaptive interventions” on a nested, end-of-study outcome. However, existing methods are not suitable when the primary outcome in a clustered SMART is nested and longitudinal; e.g., repeated outcome measures nested within each clinician and clinicians nested within sequentially-randomized clinics. In this manuscript, we propose a three-level marginal mean modeling and estimation approach for comparing multilevel adaptive interventions in a clustered SMART. This methodology accommodates both the cross-temporal within-unit correlation in the longitudinal outcome and the inter-unit correlation within each cluster. We illustrate our methods using data from two clustered, health-policy SMARTs: the first aims to improve guideline concordant opioid prescribing in non-cancer primary care clinics in Wisconsin and Michigan; and the second aims to improve the adoption of evidence-based mental health treatments in high schools across Michigan.

Hrithik Ravi

Electrical and Computer Engineering

Beyond Cross-Entropy Loss: Multiclass Implicit Regularization for Exponentially-Tailed PERM Losses

Abstract

The study of implicit regularization effects has so far shed light on the generalization properties of different optimization algorithms for different loss functions. However, there is a notable gap in the existing literature: most of the focus is on the binary classification setting, with very limited work on multiclass classification problems. In our work, focusing on linearly separable datasets, we bridge this gap and prove implicit regularization results for a broader class of multiclass loss functions.

Jesse Wheeler

Statistics

Arima2: An R Package for Likelihood Based Inference for ARIMA Modeling

Abstract

Autoregressive moving average (ARMA) models are frequently used to analyze time series data. Because these models are so widely used in scientific studies, any improvement in parameter estimation can be considered a significant advancement in computational statistics. Despite the popularity of these models, algorithms for fitting ARMA models have weaknesses that are not well known. We provide a summary of parameter estimation via maximum likelihood and discuss common pitfalls that may lead to sub-optimal parameter estimates. We propose a random restart algorithm for parameter estimation that frequently yields higher likelihoods than traditional maximum likelihood estimation procedures. The random restart algorithm is implemented in an R package called “arima2”. Through a series of simulation studies, we demonstrate the efficacy of our proposed algorithm.
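The sketch below illustrates the random-restart idea in Python with statsmodels rather than the arima2 R package itself: the same ARIMA model is refit from several randomly perturbed starting values and the fit with the highest log-likelihood is kept. It is a minimal sketch of the concept, not the package's algorithm; the simulated series and perturbation scale are arbitrary choices.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

np.random.seed(2)
# Simulated ARMA(2,1) series (lag-polynomial sign convention of statsmodels)
y = arma_generate_sample(ar=[1, -0.6, 0.2], ma=[1, 0.4], nsample=500)

model = ARIMA(y, order=(2, 0, 1))
best = model.fit()                      # fit from the default starting values

# Random restarts: perturb the starting values and keep the best log-likelihood
rng = np.random.default_rng(0)
for _ in range(10):
    start = best.params + rng.normal(0.0, 0.2, size=len(best.params))
    try:
        candidate = model.fit(start_params=start)
    except Exception:
        continue                        # skip starting values that fail
    if candidate.llf > best.llf:
        best = candidate

print("best log-likelihood:", best.llf)
print("parameter estimates:", best.params)
```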

Kaiwen Hou

Columbia Business School

Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference

Abstract

This paper introduces a novel approach to causal inference, leveraging the power of continuous normalizing flows (CNFs) within the parametric submodel framework. Traditional Targeted Maximum Likelihood Estimation (TMLE), while effective, depends heavily on accurately defined propensity models and perturbation directions, a requirement often challenging to fulfill. Our method integrates CNFs, known for their ability to model complex distributions via differential equations, to enhance the geometric sensitivity of these parametric submodels. This integration is particularly significant in TMLE, allowing for a more nuanced approach that minimizes the Cramér-Rao bound, shifting from an a priori distribution $p_0$ to an empirically-informed distribution $p_1$. The paper extends the typical application of CNFs by incorporating Wasserstein gradient flows into Fokker-Planck equations. In particular, the versatility of CNFs is harnessed to impose geometric structures based on prior objectives, transcending regularization and enhancing model robustness, especially in the context of optimal transport theory. Applied to optimal causal inference, our approach focuses on the disparity between sample and population distributions, a common source of bias in parameter estimation. Utilizing the concepts of optimal transport and Wasserstein gradient flows, we aim to develop methodologies for causal inference that are minimally variant in finite-sample settings and compare favorably against asymptotically-optimal traditional methods like TMLE and AIPW. The proposed framework, through the lens of Wasserstein gradient flows, minimizes the variance of efficient influence functions under distribution $p_t$. Preliminary experiments demonstrate the efficacy of our approach, indicating lower mean-squared errors compared to naive flows, and underscoring the potential of geometry-aware normalizing Wasserstein flows in enhancing statistical modeling and inference.

Subha Maity

Statistics

An Investigation of Representation and Allocation Harms in Contrastive Learning

Abstract

The effect of underrepresentation on the performance of minority groups is known to be a serious problem in supervised learning settings; however, it has been underexplored so far in the context of self-supervised learning (SSL). In this paper, we demonstrate that contrastive learning (CL), a popular variant of SSL, tends to collapse representations of minority groups with certain majority groups. We refer to this phenomenon as representation harm and demonstrate it on image and text datasets using the corresponding popular CL methods. Furthermore, our causal mediation analysis of allocation harm on a downstream classification task reveals that representation harm is partly responsible for it, thus emphasizing the importance of studying and mitigating representation harm. Finally, we provide a theoretical explanation for representation harm using a stochastic block model that leads to a representational neural collapse in a contrastive learning setting.

Rachel Newton

Electrical and Computer Engineering

Optimality of POD for Data-Driven LQR With Low-Rank Structures

Abstract

The optimal state-feedback gain for the Linear Quadratic Regulator (LQR) problem is computationally costly to compute for high-order systems. Reduced-order models (ROMs) can be used to compute feedback gains with reduced computational cost. However, the performance of this common practice is not fully understood. This letter studies this practice in the context of data-driven LQR problems. We show that, for a class of LQR problems with low-rank structures, the controllers designed via their ROM, based on the Proper Orthogonal Decomposition (POD), are indeed optimal. Experimental results not only validate our theory but also demonstrate that even with moderate perturbations on the low-rank structure, the incurred suboptimality is mild.

Robert Malinas

Electrical and Computer Engineering

Community Detection in High-Dimensional Graph Ensembles

Abstract

Detecting communities in high-dimensional graphs can be achieved by applying random matrix theory where the adjacency matrix of the graph is modeled by a Stochastic Block Model (SBM). However, the SBM makes an unrealistic assumption that the edge probabilities are homogeneous within communities, i.e., the edges occur with the same probabilities. The Degree-Corrected SBM is a generalization of the SBM that allows these edge probabilities to be different, but existing results from random matrix theory are not directly applicable to this heterogeneous model. In this paper, we derive a transformation of the adjacency matrix that eliminates this heterogeneity and preserves the relevant eigenstructure for community detection. We propose a test based on the extreme eigenvalues of this transformed matrix and (1) provide a method for controlling the significance level, (2) formulate a conjecture that the test achieves power one for all positive significance levels in the limit as the number of nodes approaches infinity, and (3) provide empirical evidence and theory supporting these claims.

Shihao Wu

Statistics

A General Latent Embedding Approach for Modeling High-dimensional Hyperlinks

Abstract

Hyperlinks encompass polyadic interactions among entities beyond dyadic relations. Despite the growing research interest in hyperlink modeling, most existing methodologies have significant limitations, including a heavy reliance on uniform restrictions of hyperlink orders and the inability to account for repeated observations of identical hyperlinks. We introduce a novel and general latent embedding approach that tackles these challenges through the integration of latent embeddings, vertex degree heterogeneity parameters, and an order-adjusting parameter. Theoretically, we investigate identification conditions for the latent embeddings and associated parameters and establish convergence rates of their estimators along with asymptotic normality. Computationally, we employ a universal singular value thresholding initialization and a projected gradient ascent algorithm for parameter estimation. A comprehensive simulation study is performed to demonstrate the effectiveness of the algorithms and validate the theoretical findings. Moreover, an application involving a co-citation hypergraph network is used to further illustrate the advantages of the proposed method.

Oral Presentation 3: Methods/Theory

March 28th 1:00 PM – 3:00 PM @ Amphitheatre

Mengqi Lin

Statistics

Controlling the False Discovery Proportion in Observational Studies with Hidden Bias

Abstract

We propose an approach to exploratory data analysis in matched observational studies. We consider the setting where a single intervention is thought to potentially impact multiple outcome variables, and the researcher would like to investigate which of these causal hypotheses come to bear while accounting not only for the possibility of false discoveries, but also the possibility that the study is plagued by unmeasured confounding. For any candidate set of rejected hypotheses, our method provides sensitivity intervals for the false discovery proportion (FDP), the proportion of rejected hypotheses that are actually true. For a set containing $L$ outcomes, the method describes how much unmeasured confounding would need to exist for us to believe that the proportion of true hypotheses is $0/L$, $1/L$, $\ldots$, all the way to $L/L$. Moreover, the resulting confidence statements are valid simultaneously over all possible choices for the rejected set, allowing the researcher to look in an ad hoc manner for promising subsets of outcomes that maintain a large estimated fraction of correct discoveries even if a large degree of unmeasured confounding is present. The approach is particularly well suited to sensitivity analysis, as conclusions that some fraction of outcomes were affected by the treatment exhibit larger robustness to unmeasured confounding than the conclusion that any particular outcome was affected. In principle, the method requires solving a series of quadratically constrained integer programs. That said, we show not only that a solution can be obtained in reasonable run time, but also that one can avoid running the integer program altogether with high probability in large samples. We illustrate the practical utility of the method through simulation studies and a data example.

Unique Subedi

Statistics

Online Infinite-Dimensional Regression: Learning Linear Operators

Abstract

We consider the problem of learning linear operators under squared loss between two infinite-dimensional Hilbert spaces in the online setting. We show that the class of linear operators with uniformly bounded p-Schatten norm is online learnable for any p ∈ [1, ∞). On the other hand, we prove an impossibility result by showing that the class of uniformly bounded linear operators with respect to the operator norm is not online learnable. Moreover, we show a separation between sequential uniform convergence and online learnability by identifying a class of bounded linear operators that is online learnable but for which uniform convergence does not hold. Finally, we prove that the impossibility result and the separation between uniform convergence and learnability also hold in the batch setting.

Vincenzo Loffredo

Statistics

Nonparametric Velocity Field Modeling

Abstract

In this paper we propose a practical approach to analyzing longitudinal data based on Gaussian Processes. We introduce the concept of modeling the velocity and the velocity field of a time series. This approach allows for a broader application of Gaussian Processes in longitudinal settings, without increasing the computational complexity of the problem and without losing interpretability of the results. We show that this new class of models performs comparably to existing ones in standard settings, and provides improvements in inference under misspecification, such as mixtures and non-stationarity. Our motivation for this model comes from the LongROAD study, which analyzes the effects of aging on the driving abilities of the participants.

Yao Song

Biostatistics

Multi-Objective Tree-based Reinforcement Learning for Estimating Tolerant Dynamic Treatment Regimes

Abstract

A dynamic treatment regime (DTR) is a sequence of treatment decision rules, one per stage of intervention, that dictates individualized treatments based on evolving treatment and covariate history. It provides a means for operationalizing a clinical decision support system and fits well into a broader paradigm of personalized medicine. However, many real-world problems involve multiple objectives, and decision rules may differ for different objectives when trade-offs are present. Furthermore, there may be more than one feasible decision that leads to an empirically sufficient optimization result. In this study, we propose the concept of a tolerant regime, which gives a set of individualized feasible decision(s) at each stage under a pre-specified tolerance rate. We present a multi-objective tree-based reinforcement learning (MOT-RL) method to directly estimate the tolerant DTR (tDTR) that optimizes multiple objectives in a multi-stage multi-treatment setting. At each stage, MOT-RL constructs an unsupervised decision tree by first modeling the mean of counterfactual outcomes for each objective via semiparametric regression models and then maximizing a purity measure constructed by scalarizing the augmented inverse probability weighted estimators (AIPWE) of all objectives. The proposed method is implemented in a backward inductive manner through multiple decision stages, and it delivers the optimal DTR as well as the tDTR depending on the decision-maker’s preferences. MOT-RL is robust, efficient, easy to interpret, and flexible to different problem settings. With the proposed method, we identify a two-stage chemotherapy regime that simultaneously maximizes the relief of disease burden and prolongs the survival of prostate cancer patients.

Yuliang Xu

Biostatistics

Bayesian Image regression with Soft-thresholded Conditional Autoregressive prior

Abstract

For regression problems with a brain imaging component, Bayesian models are one of the most popular choices due to their flexibility and uncertainty quantification features. However, they can be computationally challenging for high-dimensional problems, and the correlation structure of the imaging component is usually pre-specified and may not reflect the underlying true structure adequately. To overcome these challenges in computation and correlation accuracy, we develop a general and scalable variational inference method for regression models with large-scale imaging data. We first propose a soft-thresholded conditional autoregressive (ST-CAR) prior for the sparse-mean model, where the correlation structure of the imaging component can be learned through the ST-CAR prior. Next, we apply the ST-CAR prior to scalar-on-image and image-on-scalar regression models as two examples, and develop coordinate ascent variational inference (CAVI) and stochastic subsampling variational inference (SSVI) algorithms for these two models. We perform simulations to show that the ST-CAR prior outperforms existing methods in terms of selecting active regions with complex correlation patterns, and to demonstrate that CAVI and SSVI have superior computational performance over existing methods. We apply the proposed method to the ABCD study as a real data example.

Zeyu Sun

Electrical and Computer Engineering

Minimum-Risk Recalibration of Classifiers

Abstract

Recalibrating probabilistic classifiers is vital for enhancing the reliability and accuracy of predictive models. Despite the development of numerous recalibration algorithms, there is still a lack of a comprehensive theory that integrates calibration and sharpness (which is essential for maintaining predictive power). In this paper, we introduce the concept of minimum-risk recalibration within the framework of mean-squared-error (MSE) decomposition, offering a principled approach for evaluating and recalibrating probabilistic classifiers. Using this framework, we analyze the uniform-mass binning (UMB) recalibration method and establish a finite-sample risk upper bound of order $\tilde{O}(B/n + 1/B^2)$, where $B$ is the number of bins and $n$ is the sample size. By balancing calibration and sharpness, we further determine that the optimal number of bins for UMB scales with $n^{1/3}$, resulting in a risk bound of approximately $O(n^{-2/3})$. Additionally, we tackle the challenge of label shift by proposing a two-stage approach that adjusts the recalibration function using limited labeled data from the target domain. Our results show that transferring a calibrated classifier requires significantly fewer target samples compared to recalibrating from scratch. We validate our theoretical findings through numerical simulations, which confirm the tightness of the proposed bounds, the optimal number of bins, and the effectiveness of label shift adaptation.
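A minimal sketch of uniform-mass binning with the $B \propto n^{1/3}$ scaling described above, on simulated scores; the data-generating process and the specific recalibration map are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical calibration set: predicted scores and binary labels from a
# miscalibrated classifier (scores perturbed away from the true probabilities)
n = 2000
p_true = rng.uniform(0.05, 0.95, n)
labels = rng.binomial(1, p_true)
scores = np.clip(p_true + rng.normal(0, 0.15, n), 0.01, 0.99)

# Uniform-mass binning: B ~ n^(1/3) bins with (roughly) equal counts
B = int(round(n ** (1 / 3)))
edges = np.quantile(scores, np.linspace(0, 1, B + 1))
edges[0], edges[-1] = 0.0, 1.0

bin_ids = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, B - 1)
bin_means = np.array([labels[bin_ids == b].mean() for b in range(B)])

def recalibrate(new_scores):
    """Map raw scores to the empirical positive rate of their bin."""
    ids = np.clip(np.searchsorted(edges, new_scores, side="right") - 1, 0, B - 1)
    return bin_means[ids]

print("number of bins:", B)
print("recalibrated scores:", recalibrate(np.array([0.1, 0.5, 0.9])))
```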

Oral Presentation 4: Applications

March 29th 9:00 AM – 12:00 PM @ East Conference Room

Cheoljoon Jeong

Industrial and Operations Engineering

Calibration of Building Energy Computer Models via Bias-Corrected Iteratively Reweighted Least Squares Method

Abstract

As the building sector contributes approximately three-quarters of the U.S. electricity load, analyzing buildings’ energy consumption patterns and establishing effective operational strategies for them become of great importance. To achieve those goals, a physics-based building energy model (BEM), which can simulate a building’s energy demand under various weather conditions and operational scenarios, has been developed. To obtain accurate simulation outputs, it is necessary to calibrate some parameters required for the BEM’s pre-configuration. The BEM calibration is usually accomplished by matching the simulated energy use with the measured one. However, even with the best efforts to calibrate the BEM, a systematic discrepancy between the two quantities is often observed, preventing precise estimation of the energy demand. Such discrepancy is referred to as bias in this study. We present a new calibration approach that models the discrepancy to correct the relationship between the simulated and measured energy use. We show that our bias correction can improve predictive performance. Additionally, we observe heterogeneous variance in the electricity loads, especially in the afternoon hours, which often reduces prediction accuracy and increases uncertainty. To address this issue, we incorporate heterogeneous weights into the least squares loss function. To implement the bias-correction procedure with the weighted least squares formulation, we propose a newly devised iteratively reweighted least squares algorithm. The effectiveness of the proposed calibration methodology is evaluated with a real-world dataset collected from a residential building in Texas.
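To illustrate the reweighting idea in isolation (not the presenters' bias-corrected algorithm), the sketch below runs a generic iteratively reweighted least squares loop on simulated data whose variance grows with the covariate: each iteration estimates a crude variance function from squared residuals and refits with inverse-variance weights. The data and the linear variance model are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical hourly data: response variance grows with the covariate,
# mimicking larger load variability in afternoon hours
n = 1000
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 3.0 * x + rng.normal(0, 0.2 + 1.5 * x, n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary least squares start
for _ in range(10):
    resid = y - X @ beta
    # Estimate a simple variance function by regressing squared residuals
    # on the covariate, then reweight by the inverse estimated variance
    var_fit = np.linalg.lstsq(X, resid ** 2, rcond=None)[0]
    w = 1.0 / np.clip(X @ var_fit, 1e-3, None)
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]

print("IRLS coefficients:", beta)
```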

Hannah Van Wyk

Epidemiology

Inferring undetected transmission dynamics prior to a dengue outbreak using a hidden Markov model

Abstract

An infectious disease outbreak investigation typically consists of retrospectively determining information regarding the outbreak such as the timing of the primary case (the first case of the outbreak, whether detected or not) and transmission dynamics that occurred prior to the outbreak. However, information on the primary case is often hard to obtain, especially in scenarios where the disease has a high asymptomatic ratio. In these cases, the outbreak investigation is conducted based on knowledge of the index case, or the first detected case. Around half of dengue infections are asymptomatic, making it unlikely that the primary case is detected. Therefore, individuals with asymptomatic infections can begin a chain of transmission that goes undetected until the outbreak is intractable. We use a 2019 dengue outbreak that occurred in a rural town in Northern Ecuador as a case study to investigate potential undetected transmission dynamics prior to the outbreak. The outbreak was preceded by 4 candidate index cases occurring between 3 months and 2 weeks prior to the outbreak, which began in mid-May. Using a hidden Markov model, we estimate the most likely date of the primary case. We found that the most likely date was highly dependent on the assumed case reporting fraction and that lower reporting fractions had a wider range of possible primary case dates. For higher reporting fractions, the most likely primary case occurred closer to the outbreak (May 6th for the 40% reporting fraction; 95% confidence interval April 12th through May 12th) and for lower reporting fractions, they occurred earlier (April 23rd for the 5% reporting fraction; 95% confidence interval: January 3rd through May 3rd). Of the 4 candidate index cases, the May 2nd case was the most likely. Our modeling approach can be used to retrospectively determine undetected transmission dynamics in other infectious disease outbreaks.

Lillian Rountree

Biostatistics

Is Public Trust a Predictor of COVID outcomes?

Abstract

Background: There is evidence that the level of trust individuals have in government, science and society impacts their compliance with life-saving public health measures like vaccination, particularly during disease outbreaks such as the COVID-19 pandemic. The World Values Survey is an international research program that, since 1981, has collected data on countries’ cultural values, making it a crucial resource for understanding societal levels of trust. However, current analyses connecting the social and cultural data provided by the World Values Survey to COVID-19 outcomes remain simplistic, particularly regarding trust and confidence. Methods: From the data collected in 64 countries on 32 questions from Wave 7 of the World Values Survey, we use dimension reduction and clustering methods to construct composite trust and confidence scores for use in a regression model predicting COVID-19 outcomes, such as vaccination rate and excess deaths per million. We then cluster these countries to identify groups of countries with similar levels (high, low, middling) of trust. In this ecological regression, we also include potential confounders for COVID-19 outcomes, including GDP and other measures of development. We also investigate the longitudinal aspect of the data, studying the effect of past levels of trust and how these levels vary over time. Results: Unadjusted results suggest that higher levels of trust—particularly trust in government, the WHO, and the police—are associated with lower excess deaths per million during the years of the pandemic and higher total levels of COVID-19 vaccination. Further analysis, particularly with the inclusion of confounders, will offer greater insight into the specific aspects of trust that most impact a range of COVID-19 outcomes. Significance: Quantifying the impact of institutional and societal trust on disease outcomes is important for improving public health interventions and mitigating the negative effects of outbreaks. The ecological nature of the analysis, while revealing important patterns, suggests the need for more individually resolved data that can be used in future modeling of transmission dynamics of infectious disease.
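The sketch below mirrors the analysis pipeline described above on synthetic country-level data: standardize the survey items, take the first principal component as a composite trust score, cluster countries by that score, and run an ecological regression with a confounder. The data and variable names are fabricated for illustration and are not World Values Survey data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)

# Hypothetical country-level data: 64 countries x 32 trust/confidence items
n_countries, n_items = 64, 32
items = rng.normal(0, 1, (n_countries, n_items))
gdp = rng.normal(0, 1, n_countries)                        # confounder
outcome = -0.5 * items.mean(axis=1) + 0.3 * gdp + rng.normal(0, 1, n_countries)

# Composite trust score: first principal component of the standardized items
Z = StandardScaler().fit_transform(items)
trust_score = PCA(n_components=1).fit_transform(Z).ravel()

# Group countries by trust level
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(trust_score.reshape(-1, 1))

# Ecological regression of the outcome on the composite score plus a confounder
X = np.column_stack([trust_score, gdp])
reg = LinearRegression().fit(X, outcome)
print("coefficients (trust score, gdp):", reg.coef_)
print("cluster sizes:", np.bincount(groups))
```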

Mingyan Yu

Biostatistics

Joint modeling of longitudinal data and survival outcome via threshold regression to study the association between individual-level biomarker variabilities and survival outcomes

Abstract

Longitudinal biomarker data and health outcomes are regularly collected in numerous epidemiological studies for studying how biomarker trajectories predict health outcomes, which informs possible effective health interventions. Many existing methods that connect longitudinal trajectories with health outcomes focus mainly on mean profiles, treating variabilities as nuisance parameters. However, these variabilities may carry substantial information related to the health outcomes. In this project, we develop a Bayesian joint modeling approach to study the association between the mean trajectories, along with variabilities, in a longitudinal biomarker and survival times. To model the longitudinal biomarker data, we adopt the linear mixed effects model and allow individuals to have their own variabilities. Following that, we model the survival times by incorporating random effects and variabilities from the longitudinal piece as predictors through threshold regression. Threshold regression, also known as the “first-hitting-time model”, is a more general stochastic process approach for modeling time-to-event outcomes that allows for non-proportional hazards. We demonstrate the behavior of the proposed joint model through simulations. We apply the proposed joint model to data from the Study of Women’s Health Across the Nation (SWAN) and reveal that higher mean values and larger variabilities of follicle-stimulating hormone (FSH) are associated with an earlier age at the final menstrual period.

Mukai Wang

Biostatistics

Analysis of Microbiome Differential Abundance by Pooling Tobit Models

Abstract

Motivation: Microbiome differential abundance analysis (DAA) is commonly used to identify microbiome species associated with different disease conditions. Many statistical and computational methods tailored for microbiome metagenomics data have been proposed for DAA. However, controlling FDR while maintaining high statistical power remains challenging. The compositionality and sparseness of metagenomics data are two main challenges for DAA. The identification of a reliable normalization factor and an accurate interpretation of zeros are the key to a robust DAA method. Results: We offer two new perspectives to solving the two challenges. First, we demonstrate a procedure to find a subset of reference microbiome taxa that are not differentially abundant (DA). The procedure is justified based on mathematical relationships between relative abundance and absolute abundance under the assumption that fewer than half of all the taxa are DA. We can find DA taxa based on the count ratio between individual taxa and the sum of reference taxa. Second, we consider the zero counts as left censored and introduce the tobit model for log count ratios between a single taxon and the sum of multiple taxa. We combine these two ideas to propose analysis of microbiome differential abundance by pooling tobit models (ADAPT). Through simulation studies and real data analysis, we show that our method has more consistent control of false discovery rates than competitors while displaying competitive statistical power. Availability and Implementation: The R package ADAPT can be installed from Github at https://github.com/mkbwang/ADAPT. The source codes for simulation studies and real data analysis are available at https://github.com/mkbwang/ADAPT example.
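Since the key building block above is a tobit model for left-censored log count ratios, the sketch below fits a standalone left-censored tobit regression by maximum likelihood on simulated data. It is a generic Python illustration, not the ADAPT R package; the censoring limit, covariate, and true parameters are hypothetical.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(6)

# Hypothetical log count ratios, left-censored at a detection limit c
n = 500
x = rng.normal(0, 1, n)
latent = -1.0 + 0.8 * x + rng.normal(0, 1.0, n)
c = -1.5
y = np.maximum(latent, c)
censored = latent <= c

def neg_loglik(theta):
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)
    mu = b0 + b1 * x
    # Uncensored observations contribute a normal density; censored ones
    # contribute the probability of falling at or below the limit
    ll_obs = stats.norm.logpdf(y[~censored], mu[~censored], sigma)
    ll_cen = stats.norm.logcdf((c - mu[censored]) / sigma)
    return -(ll_obs.sum() + ll_cen.sum())

res = optimize.minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print("estimates (intercept, slope, sigma):", res.x[0], res.x[1], np.exp(res.x[2]))
```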

Prayag Chatha

Statistics

Neural Posterior Estimation for Simulation-based Inference in Epidemic Modeling

Abstract

Stochastic epidemic models with individual-based transmission are useful for capturing heterogeneous mixing and infection risks across a population. While simulating an epidemic from these models is straightforward, they tend to exhibit complex dynamics that give rise to intractable likelihoods, making exact Bayesian inference of key transmission parameters difficult. Approximate Bayesian Computation (ABC) leverages simulated copies of data to construct a nonparametric estimate of the true posterior, but ABC can be prohibitively slow for high-dimensional posteriors. In contrast, Neural Posterior Estimation (NPE) involves training a neural network on simulated data to efficiently learn a Gaussian approximation to the posterior. Our work is the first application of NPE to epidemiology, and we hypothesize that neural networks can automatically summarize high-dimensional epidemic data. We compare the NPE with ABC as inference procedures using popular compartmental disease models such as the SIR model. We conclude with a real-world case study of inferring spatially variegated infection risks in a pneumonia outbreak at a long-term acute care hospital.
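For context on the simulation-based inference being compared, here is a minimal rejection-ABC sketch with a discrete-time (chain-binomial) stochastic SIR simulator and hand-picked summary statistics. The simulator, summaries, priors, and acceptance rule are all illustrative assumptions; the abstract's NPE approach and individual-based transmission models are not shown.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_sir(beta, gamma, n=500, i0=5, t_max=60):
    """Discrete-time chain-binomial SIR; returns daily new infections."""
    s, i = n - i0, i0
    new_cases = []
    for _ in range(t_max):
        p_inf = 1.0 - np.exp(-beta * i / n)
        inf = rng.binomial(s, p_inf)
        rec = rng.binomial(i, 1.0 - np.exp(-gamma))
        s, i = s - inf, i + inf - rec
        new_cases.append(inf)
    return np.array(new_cases)

# "Observed" epidemic generated from known parameters
obs = simulate_sir(beta=0.35, gamma=0.15)

def summary(x):
    return np.array([x.sum(), x.argmax(), x.max()])   # size, peak time, peak height

# Rejection ABC: simulate from the prior and keep the draws whose summaries
# are closest to the observed summaries
draws, dists = [], []
for _ in range(5000):
    beta, gamma = rng.uniform(0.05, 1.0), rng.uniform(0.05, 0.5)
    draws.append((beta, gamma))
    dists.append(np.linalg.norm(summary(simulate_sir(beta, gamma)) - summary(obs)))

draws, dists = np.array(draws), np.array(dists)
keep = dists <= np.quantile(dists, 0.02)               # keep the closest 2% of draws
print("approximate posterior mean (beta, gamma):", draws[keep].mean(axis=0))
```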

Timothy Raxworthy

Michigan Program in Survey Methodology

Comparison of School Dropout Proportions for Evaluating Environmental Education Programs in Madagascar

Abstract

This study’s focus is to estimate the probability of a primary school student dropping out across selected schools in the Ifanadiana region of Madagascar. Using data collected in August 2023, we fit a marginal logistic model using generalized estimating equations (GEE) to model our binary outcome of a student dropping out as a function of the number of school years repeated, the environmental education (EE) program for each student, and the student’s age and gender. Our study presents the results of our fixed effect coefficients and the predicted probabilities across all students included in our data. Overall, we found evidence to support that certain EE programs reduce the log odds of a student dropping out for those individuals included in our data. We explain our method for building our model, discuss our interpretation of the results, and make recommendations based on our findings.
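Below is a minimal sketch of a marginal logistic GEE of the kind described, fit with statsmodels on synthetic student-level data clustered by school. The variable names (dropout, years_repeated, ee_program, age, female, school) and the data-generating process are assumptions for illustration, not the study's data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)

# Hypothetical student-level data clustered within schools
n_schools, n_per = 20, 30
df = pd.DataFrame({
    "school": np.repeat(np.arange(n_schools), n_per),
    "years_repeated": rng.poisson(0.5, n_schools * n_per),
    "ee_program": rng.integers(0, 2, n_schools * n_per),   # 1 = exposed to an EE program
    "age": rng.integers(6, 15, n_schools * n_per),
    "female": rng.integers(0, 2, n_schools * n_per),
})
school_effect = rng.normal(0, 0.5, n_schools)[df["school"]]
logit = (-2.0 + 0.6 * df["years_repeated"] - 0.8 * df["ee_program"]
         + 0.1 * (df["age"] - 10) + school_effect)
df["dropout"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Marginal logistic model with an exchangeable working correlation within schools
model = smf.gee(
    "dropout ~ years_repeated + ee_program + age + female",
    groups="school", data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
```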

Zheng Li

Biostatistics

VINTAGE: A unified framework integrating gene expression mapping studies with genome-wide association studies for detecting and deciphering gene-trait associations

Abstract

Integrative analysis of genome-wide association studies (GWASs) and gene expression mapping studies has the potential to better elucidate the molecular mechanisms underlying disease etiology. Here, we present VINTAGE, an optimal and unified statistical framework for such integrative analysis that aims to identify genes associated with a trait of interest. VINTAGE unifies the widely applied SKAT and TWAS methods into the same analytic framework, bridged by the local genetic correlation, and includes both methods as special cases; it explicitly models and quantifies the amount of information contributed by the gene expression mapping study, achieves robust power performance across a range of local genetic correlation values between gene expression and the trait, enables testing of the role of gene expression in mediating the gene-trait association, and is computationally fast. We illustrate the benefits of VINTAGE through comprehensive simulations and applications to eighteen complex traits from UK Biobank. In the real data applications, we leveraged eQTL summary statistics from eQTLGen and GWAS summary statistics from UK Biobank. VINTAGE improves the power for detecting gene-trait associations by an average of 8% compared to existing approaches, improves the power for testing a mediation effect of gene expression on trait by an average of 231%, and quantifies the amount of genetic effect on trait that is mediated through gene expression.

Oral Presentation 5: Methods/Theory

March 29th 9:00 AM – 12:00 PM @ West Conference Room

Andrej Leban

Statistics

Approaching an unknown communication system by latent space exploration and causal inference

Abstract

This paper proposes a methodology for discovering meaningful properties in data by exploring the latent space of unsupervised deep generative models. We combine manipulation of individual latent variables to extreme values with methods inspired by causal inference into an approach we call causal disentanglement with extreme values (CDEV) and show that this method yields insights for model interpretability. With this, we can test what properties of unknown data the model encodes as meaningful, using it to glean insight into the communication system of sperm whales (Physeter macrocephalus), one of the most intriguing and understudied animal communication systems. The network architecture used has been shown to learn meaningful representations of speech; here, it is used as a learning mechanism to decipher the properties of another vocal communication system, one for which we have no ground truth. The proposed methodology suggests that sperm whales encode information using the number of clicks in a sequence, the regularity of their timing, and audio properties such as the spectral mean and the acoustic regularity of the sequences. Some of these findings are consistent with existing hypotheses, while others are proposed for the first time. We also argue that our models uncover rules that govern the structure of units in the communication system and apply them while generating innovative data not shown during training. This paper suggests that interpreting the outputs of deep neural networks with causal inference methodology can be a viable strategy for approaching data about which little is known, and presents another case of how deep learning can limit the hypothesis space. Finally, the proposed approach can be extended to other architectures and datasets.

Daniele Bracale

Statistics

Distribution Shift Estimation in Strategic Classification

Abstract

In many prediction problems, the predictive model affects the distribution of the (prediction) target. This phenomenon is known as performativity and is often caused by the strategic behavior of agents in the problem environment. For example, spammers will change the content of their messages to evade spam filters. One of the main barriers to the broader adoption and study of performative prediction in machine learning practice is that practitioners are generally unaware of how their predictions affect the population. To overcome this barrier, we develop methods for learning the distribution map that encodes the long-term impacts of predictive models on the population.

Hu Sun

Statistics

Conformalized Tensor Completion with Riemannian Optimization

Abstract

Tensor completion is a technique that estimates the values of tensor entries where data is missing; applications are commonly seen, for instance, in video in-painting, network analysis, and recommender systems. Despite encouraging progress on tensor completion methodology, uncertainty quantification for tensor completion estimators is lacking in the literature. In this paper, we attempt to fill this gap by utilizing the framework of conformal prediction with weighted exchangeability, under which the problem of quantifying the uncertainty of the tensor completion estimator is converted to estimating the missing probability of each tensor entry. Under the assumption that the missing probability tensor has low tensor-train rank, we propose to estimate the probability tensor from a single tensor instance with fast Riemannian optimization. Theoretical guarantees for the convergence as well as the coverage of the resulting conformal intervals are provided. We validate the efficacy of the proposed framework with both numerical experiments and an application to a global total electron content (TEC) imputation problem.
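For reference, the weighted split-conformal step that such an approach builds on can be sketched as follows. This is a minimal illustration only: the function and argument names are hypothetical, and the observation-probability estimates would come from the low tensor-train rank Riemannian estimation described above, which is not reproduced here.

import numpy as np

def weighted_conformal_interval(cal_scores, cal_prob_obs, test_prob_obs,
                                point_prediction, alpha=0.1):
    # Weighted split-conformal interval for one missing entry (a sketch).
    # cal_scores: nonconformity scores |y - yhat| on held-out observed entries.
    # cal_prob_obs / test_prob_obs: estimated probabilities of being observed.
    # Under weighted exchangeability, calibration entries are reweighted by the
    # likelihood ratio between the missing-entry and observed-entry laws.
    w_cal = (1.0 - cal_prob_obs) / cal_prob_obs
    w_test = (1.0 - test_prob_obs) / test_prob_obs
    p_cal = w_cal / (w_cal.sum() + w_test)            # normalized weights
    order = np.argsort(cal_scores)
    cum = np.cumsum(p_cal[order])
    idx = np.searchsorted(cum, 1.0 - alpha)           # weighted (1 - alpha) quantile;
    q = np.inf if idx >= len(cal_scores) else cal_scores[order][idx]  # unreached mass sits at +inf
    return point_prediction - q, point_prediction + q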

Jiacong Du

Biostatistics

Doubly robust causal inference in high dimension: combining non-probability samples with designed surveys

Abstract

Causal inference on the average treatment effect (ATE) using non-probability samples, such as electronic health records (EHR), presents challenges due to the possibility of sample selection bias and high-dimensional covariates. This introduces the need to consider a selection model in addition to the treatment and outcome models that are typical ingredients of a causal inference problem. We propose a doubly robust (DR) ATE estimator that integrates internal data from a large non-probability sample with an external probability sample from designed surveys, considering possibly high-dimensional confounders and variables that influence selection. We introduce a novel penalized estimating equation for nuisance parameters by minimizing the squared asymptotic bias of the DR estimator. Our approach allows us to make inferences on the ATE in high-dimensional settings by ignoring the variability in estimating nuisance parameters, which is not guaranteed in conventional likelihood approaches due to non-differentiable L1-type penalties. We provide a consistent variance estimator for the DR estimator. Simulation studies demonstrate the double robustness of our DR estimator under misspecification of either the outcome model or the selection and treatment models, as well as the validity of statistical inference under penalized estimation. We apply our method to integrate EHR data from the Michigan Genomics Initiative with an external probability sample.

Kaiwen Hou

Columbia Business School

Constrained Learning for Causal Inference and Semiparametric Statistics

Abstract

A fundamental problem in causal inference is the accurate estimation of the average treatment effect (ATE). Existing methods such as Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE) are asymptotically optimal. Although these methods are asymptotically equivalent, they exhibit significant differences in finite-sample performance, numerical stability, and complexity, which raises questions about their relative practical utility. In response, we develop the Constrained Learner (C-Learner), which is a new asymptotically optimal method for estimating the ATE. C-Learner is flexible and conceptually very simple: it directly encodes the condition for asymptotic optimality of the estimator as a constraint for learning outcome models, which are then used in a plug-in estimator for the ATE. C-Learner can thus leverage tools and advances from constrained optimization to learn these outcome models. In practice, we find that C-Learner performs comparably to or better than other asymptotically optimal methods. These attributes collectively position C-Learner as a compelling new tool for researchers and practitioners of causal inference.

Kevin Christian Wibisono

Statistics

Estimation of Non-Randomized Heterogeneous Treatment Effects in the Presence of Unobserved Confounding Variables

Abstract

Non-randomized treatment effect models, in which the treatment assignment depends on some covariates being above or below some threshold, are widely used in fields like econometrics, political science and epidemiology. Treatment effect estimation in such models is generally done using a local approach which only considers observations from a small neighborhood of the threshold. In numerous situations, however, researchers are equally (or more) interested in observations further away from the threshold. Moreover, most methods rely on the assumption that the treatment effect for each observation is the same, which is often unrealistic. In this paper, we present a new method for estimating non-randomized heterogeneous treatment effects which takes into account all observations regardless of their distance from the threshold. We show that our method is capable of estimating the average treatment effect on the treated (ATT) at a parametric rate. We then apply our method to simulated and real data sets and compare our results with those from existing approaches. We conclude this paper with possible extensions of our method.

Meng Hsuan “Rex” Hsieh

Business Economics, Ross

Revisiting the analysis of matched-pair and stratified experiments in the presence of attrition

Abstract

In this paper, we revisit some common recommendations regarding the analysis of matched-pair and stratified experimental designs in the presence of attrition. Our main objective is to clarify a number of well-known claims about the practice of dropping pairs with an attrited unit when analyzing matched-pair designs. Contradictory advice appears in the literature about whether dropping pairs is beneficial or harmful, and stratifying into larger groups has been recommended as a resolution to the issue. To address these claims, we derive the estimands obtained from the difference-in-means estimator in a matched-pair design both when the observations from pairs with an attrited unit are retained and when they are dropped. We find limited evidence to support the claim that dropping pairs helps recover the average treatment effect, but we find that it may help recover a convex-weighted average of conditional average treatment effects. We report similar findings for stratified designs when studying the estimands obtained from a regression of outcomes on treatment with and without strata fixed effects.

Seamus Somerstep

Statistics

Algorithmic fairness in performative policy learning: overcoming conflicting fairness definitions

Abstract

In many prediction problems, the predictive model affects the distribution of the prediction target. This phenomenon is known as performativity, and it is often caused by the behavior of individuals with vested interests in the outcome of the predictive model. Although performativity is generally problematic because it manifests as distribution shifts, we develop algorithmic fairness practices that leverage performativity to achieve stronger group fairness guarantees in social classification problems (compared to what is achievable in non-performative settings). In particular, we leverage the policymaker’s ability to steer the population to remedy inequities in the long term. A crucial benefit of this approach is that it is possible to resolve the incompatibilities between conflicting group fairness definitions.

Shushu Zhang

Statistics

Expected Shortfall Regression via Optimization

Abstract

The expected shortfall (ES) is defined as the average over the tail above (or below) a certain quantile of a response distribution, and it provides a comprehensive summary of the tail distribution. ES regression captures the heterogeneous covariate-response relationship and describes covariate effects on the tails of the response distribution, which is of particular interest in various applications. Based on the critical observation that the superquantile regression of Rockafellar, Royset and Miranda (2014) is not the solution to the ES regression, we propose and validate a novel optimization-based approach to linear ES regression, named the i-Rock approach, which does not require specifying the conditional quantile models. We provide a prototype implementation of the i-Rock approach with initial ES estimators based on binning techniques, and show the consistency and asymptotic normality of the resulting i-Rock estimator. The i-Rock approach achieves heterogeneity-adaptive weights automatically and therefore often offers efficiency gains over other existing linear ES regression approaches in the literature.
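For readers less familiar with the target functional, the display below states the standard definition of the (lower-tail) conditional expected shortfall and the linear model it is typically paired with; the notation ($\tau$, $Q_\tau$, $\beta_\tau$) is mine and only fixes ideas, it does not restate the talk's exact formulation.

\[
Q_\tau(Y \mid X = x) = \inf\{\, y : F_{Y \mid X = x}(y) \ge \tau \,\}, \qquad
\mathrm{ES}_\tau(Y \mid X = x) = \mathbb{E}\bigl[\, Y \mid Y \le Q_\tau(Y \mid X = x),\, X = x \,\bigr],
\]

with the upper-tail version obtained by conditioning on $Y \ge Q_\tau(Y \mid X = x)$ instead. Linear ES regression then posits $\mathrm{ES}_\tau(Y \mid X = x) = x^\top \beta_\tau$ and seeks to estimate $\beta_\tau$.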

Thomas Coons

Mechanical Engineering

Adaptive Covariance Estimation for Multi-fidelity Monte Carlo

Abstract

Multi-fidelity variance-reduction techniques (e.g., multi-fidelity Monte Carlo [1], approximate control variates [2,3], and multilevel BLUEs [4]) have seen considerable attention in recent years, in many cases providing orders-of-magnitude computational savings in estimating statistics of a high-fidelity model. These methods require the covariance matrix across model fidelities, which is usually estimated via pilot sampling or reinforcement-learning techniques [5] in conjunction with the sample covariance formula. Depending on the model ensemble available, this covariance estimation can be costly or inaccurate, leading to suboptimal estimators. Furthermore, most multi-fidelity estimators are not designed with an outer design optimization loop in mind, where covariance information and thus estimator properties may vary substantially from design to design. In this work, we leverage uncertainty information in a parameterization of the covariance matrix to adaptively guide pilot sampling as the outer optimization loop converges. In doing so, the overall multi-fidelity optimization process can converge more efficiently. We demonstrate this through applications to multi-fidelity optimal experimental design.

References:

[1] B. Peherstorfer, K. Willcox, and M. Gunzburger, “Optimal Model Management for Multifidelity Monte Carlo Estimation,” SIAM J. Sci. Comput., vol. 38, no. 5, pp. A3163–A3194, Jan. 2016, doi: 10.1137/15M1046472.

[2] G. F. Bomarito, P. E. Leser, J. E. Warner, and W. P. Leser, “On the optimization of approximate control variates with parametrically defined estimators,” Journal of Computational Physics, vol. 451, p. 110882, Feb. 2022, doi: 10.1016/j.jcp.2021.110882.

[3] A. A. Gorodetsky, G. Geraci, M. S. Eldred, and J. D. Jakeman, “A generalized approximate control variate framework for multifidelity uncertainty quantification,” Journal of Computational Physics, vol. 408, p. 109257, 2020, doi: 10.1016/j.jcp.2020.109257.

[4] D. Schaden and E. Ullmann, “On Multilevel Best Linear Unbiased Estimators,” SIAM/ASA J. Uncertainty Quantification, vol. 8, no. 2, pp. 601–635, Jan. 2020, doi: 10.1137/19M1263534.

[5] Y. Xu, V. Keshavarzzadeh, R. M. Kirby, and A. Narayan, “A Bandit-Learning Approach to Multifidelity Approximation,” SIAM J. Sci. Comput., vol. 44, no. 1, pp. A150–A175, Feb. 2022, doi: 10.1137/21M1408312.

Victor Verma

Statistics

Optimal Extreme Event Prediction in Heavy-Tailed Time Series

Abstract

A problem that arises in many areas is predicting whether a time series will exceed a high threshold. One example is solar flare forecasting, which can be done by predicting when a quantity called the X-ray flux will surpass a threshold. We define a predictor to be optimal if it maximizes the precision, the probability of the event of interest occurring given that an alarm has been raised. We prove that in the general case, the optimal predictor is a ratio of two conditional densities. For several time series models, such as MA($\infty$) and AR($d$) models, we obtain a simple, closed-form expression for the optimal predictor. This leads to new methodology for optimal prediction of extreme events in heavy-tailed time series. We establish the asymptotic optimality of the resulting predictors as the training set size goes to infinity using results on uniform laws of large numbers for empirical processes of ergodic time series. Under the assumption of regularly varying tails, we also obtain theoretical expressions for the asymptotic precisions of the optimal predictors as the extreme-event threshold rises. The performance of the optimal predictors and their approximations is demonstrated with simulation studies and the methodology is applied to solar flare forecasting based on the time series of X-ray fluxes obtained from the GOES satellites.
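One way to see the density-ratio structure of such an optimal predictor, sketched here under an assumed stationarity condition and with precision compared at a fixed alarm rate (notation mine, not necessarily the talk's exact formulation): writing $X$ for the observed past and $\{Y > u\}$ for the extreme event, Bayes' rule gives

\[
\mathbb{P}(Y > u \mid X = x) \;=\; \mathbb{P}(Y > u)\, \frac{f_{X \mid Y > u}(x)}{f_X(x)},
\]

so ranking pasts by this density ratio is equivalent to ranking them by exceedance probability, and a Neyman-Pearson-type argument shows that raising alarms on the top-ranked pasts maximizes precision among predictors with the same alarm rate.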

Oral Presentation 6: Methods/Theory

March 29th 2:00 PM – 4:00 PM @ Amphitheatre

Vinod Raman

Statistics

Revisiting the Learnability of Apple Tasting

Abstract

In online binary classification under apple tasting feedback, the learner only observes the true label if it predicts “1”. We revisit this classical partial-feedback setting, first studied by \cite{helmbold2000apple}, and study online learnability from a combinatorial perspective. We show that the Littlestone dimension continues to provide a tight quantitative characterization of apple tasting in the agnostic setting, closing an open question posed by \cite{helmbold2000apple}. In addition, we give a new combinatorial parameter, called the Effective width, that tightly quantifies the minimax expected number of mistakes in the realizable setting. As a corollary, we use the Effective width to establish a trichotomy of the minimax expected number of mistakes in the realizable setting: the expected number of mistakes of any learner under apple tasting feedback can be Θ(1), Θ(√T), or Θ(T). This is in contrast to the full-information realizable setting, where only Θ(1) and Θ(T) are possible.

Wenshan Yu

Survey and Data Science

Using Principal Stratification to Detect Mode Effects in a Longitudinal Setting

Abstract

Longitudinal studies serve the purpose of measuring changes over time; however, the validity of such estimates can be threatened when the modes of data collection vary across periods, as different modes can result in different levels of measurement error. This study provides a general framework to accommodate different mixed-mode designs and thus has the potential to support mode comparisons across studies or waves. Borrowing from the causal inference literature, we treat the mode of data collection as the treatment. We employ a potential outcome framework to multiply impute the potential response status of cases if assigned to another mode, along with the associated potential outcomes. After imputation, we construct principal strata based on the observed and the predicted response status of each case to adjust for whether a participant is able to respond via a certain mode when making inference about mode effects. Next, we estimate mode effects within each principal stratum. We then combine these estimates across both the principal strata and the imputed datasets for inference. This analytical strategy is applied to the Health and Retirement Study 2016 and 2018 core surveys.

Yuan Zhong

Biostatistics

Deep kernel learning based Gaussian processes for Bayesian image regression analysis

Abstract

Regression models are widely used in neuroimaging studies to learn complex associations between clinical variables and image data. Gaussian processes (GPs) are among the most popular Bayesian nonparametric methods and are widely used as priors for the unknown functions in those models. However, many existing GP methods require pre-specifying the functional form of the kernel, which often limits flexibility in model fitting and creates computational bottlenecks for large-scale datasets. To address these challenges, we develop a scalable Bayesian kernel learning framework for GP priors in various image regression models. Our approach leverages deep neural networks (DNNs) to perform low-rank approximations of GP kernel functions via spectral decomposition. With Bayesian kernel learning techniques, we achieve improved accuracy in parameter estimation and variable selection in image regression models. We establish large prior support and posterior consistency of the kernel estimation. Through extensive simulations, we demonstrate that our model outperforms other competitive methods. We illustrate the proposed method by analyzing multiple neuroimaging datasets from different medical studies.

Yumeng Wang

Statistics

Post-Selection Inference for Smoothed Quantile Regression

Abstract

Quantile regression is a powerful technique for estimating the conditional quantiles of a response variable, which provides robust estimates for heavy-tailed responses or outliers without assuming a specific parametric distribution. However, modeling conditional quantiles in high-dimensional data and conducting post-selection inference for the quantile effects of selected covariates pose computational and efficiency challenges. The computational challenges arise from the non-differentiable quantile loss function, while the efficiency challenges arise from the data discarded during model selection. To address these challenges, we have developed a new approach for post-selection inference after modeling conditional quantiles with smoothed quantile regression. Our approach is fast to compute, and it no longer discards data during model selection, circumventing the disadvantages of a non-differentiable loss. In addition, we study the asymptotic properties of our pivot and demonstrate the effectiveness of our method in practical data analysis.

Zhiwei Xu

Statistics

Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data

Abstract

Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting — achieving a perfect fit to training data with near-random performance on test data — before transitioning (“grokking”) to near-optimal generalization later in training. In this work, we show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100% training accuracy, perfectly fitting the noisy labels in the training data, but achieves near-random test accuracy. At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a “grokking” phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a non-generalizable linear classifier after one step and gradually learns generalizable features in later steps.

Zijin Zhang

Ross

Costly Quantity-vs-Quality Sampling in Newsvendor

Abstract

While data might be readily available in some applications, there are many cases where collecting data is very expensive and time consuming. Previous research has mainly focused on improving decision-making with data, leaving a gap in understanding the optimal quantity and quality of data needed. Our paper studies the data-driven variant of one of the most widely used operations models, the newsvendor model, in which the retailer sells products with uncertain demand. Effective ordering decisions can only be made when the retailer collects insightful data to learn about the demand distribution. However, data are not always accurate, and collecting them is costly, with the cost depending on the sample size (quantity) and sample noise (quality). We aim to understand how the quantity and quality of data impact the retailer’s newsvendor profit and to develop data collection policies that enhance profit margins. In this paper, we provide a novel denoised-SAA approach that minimizes the in-sample newsvendor loss given a noisy dataset, and prove several bounds on the loss in terms of data quantity and quality. Based on our in-depth theoretical analysis, we develop a series of single and adaptive data sampling policies with analytical performance guarantees. In an extensive set of computational experiments, we show that these policies perform well in realistic settings.

Poster Session

March 28th 4:30 PM – 6:30 PM @ Assembly Hall

01

Benjamin Osafo Agyare

Statistics

Flexible Kernel-Based Expectile Regression with Inference for Dependent Data: An Application to Heritability Studies.

Abstract

Conventional regression methods, often based on least squares, predominantly focus on learning the conditional mean function, neglecting the heterogeneity observed in much contemporary data. This heterogeneity, stemming from mean/variance relationships or covariate-dependent error distributions, however, presents opportunities for investigating conditional relationships in the tails of the distribution, especially in domains where extreme outcomes are of interest. For instance, in studying the growth and development of children in relation to nutrition, the distribution of, say, height may differ for extremely malnourished children in a way that is not entirely captured through study of the conditional mean. We propose a framework utilizing expectiles and kernel regression within Reproducing Kernel Hilbert Space to assess the conditional distribution of an outcome of interest based on covariates in a way that accommodates non-linearity, non-additivity, non-independence, and heterogeneous covariate effects. We present a computationally efficient algorithm that computes the entire solution path, spanning from the least to the most regularized models, at specified expectiles using the over-relaxed Alternating Direction Method of Multipliers Algorithm. Furthermore, we demonstrate that rigorous inference is achievable through cross-fitting and (in some settings) tools from robust inference. Illustrating its performance, we conduct extensive numerical studies and apply this approach to a heritability analysis, investigating the relationship between growth curves of anthropometry measures (e.g. height) of children and those of their parents.
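For orientation, the building block here is presumably the asymmetric squared (expectile) loss introduced by Newey and Powell (1987), combined with an RKHS penalty; the display below is only a schematic of such a penalized criterion in my own notation ($\tau$, $\mathcal{H}_K$, $\lambda$), and the paper's actual estimator additionally handles dependence and heterogeneous covariate effects as described above.

\[
\rho_\tau(u) \;=\; \bigl|\tau - \mathbf{1}\{u < 0\}\bigr|\, u^2, \qquad
\hat f_\tau \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^n \rho_\tau\bigl(y_i - f(x_i)\bigr) \;+\; \lambda\, \|f\|_{\mathcal{H}_K}^2,
\]

where the solution path is traced over $\lambda$ at each expectile level $\tau$.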

02

Carlos Dario Cristiano Botia

SRC-Survey Methodology

Extension of Fay-Herriot models to the estimation of the Quarterly Labor Market Survey (PNAD) in Brazil.

Abstract

This document investigates the use of small-area estimation models to enhance the accuracy of unemployment rate predictions in Brazil from 2012 to 2019. By comparing direct estimations with those derived from the Fay-Herriot (FH) and space-time Fay-Herriot (FHET) models, the study reveals a decline in unemployment rates, especially in the Southeast region of Brazil. Furthermore, it was found that the FHET model provides reduced variability and smoother Coefficients of Variation (CVs), indicating higher precision in unemployment rate estimations. The seminar will conclude with a discussion of potential future research into latent Markov models as a tool for further improving small-area estimations. These findings have significant implications for policymakers and researchers focused on regional economic dynamics analysis.

03

Easton Huch

Statistics

Data Integration Methods for the Analysis of Micro-randomized Trials

Abstract

Existing statistical methods for the analysis of micro-randomized trials (MRTs) are designed to estimate causal excursion effects using data from a single MRT. In practice, however, researchers can often find related MRTs employing similar adaptive interventions, which begs the question: Can we leverage the additional data from these trials to improve statistical efficiency? This project provides an affirmative answer to this question. We develop four related statistical methods that allow researchers to pool data across MRTs, including asymptotic standard errors derived via stacked estimating equations. We also show how to combine our methods via a generalization of precision weighting that allows for correlation between estimates; we show that this method possesses an asymptotic optimality property among linear unbiased meta-estimators. We demonstrate the statistical gains from our methods in simulation and in a case study involving two closely related MRTs in the area of smoking cessation.

04

Eduardo Ochoa Rivera

Statistics

Optimal Thresholding Linear Bandit

Abstract

This project delves into the Thresholding Bandit Problem (TBP) within the context of fixed confidence in stochastic linear bandits. The objective is for the learner to identify the set of arms whose mean rewards are above a given threshold, achieving a predetermined level of precision and confidence, all while minimizing the sample size. Our extension of the framework to the linear bandit case involves establishing a lower bound for sample complexity. Furthermore, we introduce an algorithm and prove that it is asymptotically optimal almost surely and in expectation.

05

Gabriel Patron

Statistics

Structured Background Estimation for Astronomical Images with Deep Generative Models

Abstract

Measuring the flux of astronomical objects in images is a crucial endeavor. A traditional image model treats every pixel as a sum of components, including a foreground term (e.g., stars and galaxies) and a background term; foreground flux estimation is therefore susceptible to background mismodeling. Backgrounds can, however, be complicated: spatially variable, or structured, for example due to the presence of dust filaments and clouds around objects of interest. In this work, we use deep generative models, which excel at image generation and infilling, to approximate the posterior distribution of the background. In particular, we explore conditional variational autoencoders as a model of the background. Our approach, unlike an existing method called Local Pixelwise Infilling (LPI), is global, meaning that it does not require local parameter estimation for every object. It also improves reconstruction performance on both synthetic and real data relative to LPI.

06

Jessica Aldous

Biostatistics

Reconstructing individual patient data from extracted overall and cause-specific survival curves for secondary analysis of published competing risks data in prostate cancer

Abstract

Background: Meta-analysis of competing risks data from published studies can be of interest for studying different time-to-event outcomes. For instance, randomized controlled trials that explore treatments like androgen deprivation therapy often fail to report their impact on other-cause mortality in prostate cancer, despite other-cause mortality being a critical component in informing treatment decisions. Secondary analyses of these trials often rely on inconsistently reported summary statistics and require assumptions about the correlation structure between outcomes. Meta-analysis of individual patient data (IPD) is considered the gold-standard approach, but IPD are often not available. Methods: We present an algorithm to reconstruct IPD for multiple competing events from published Kaplan-Meier and cumulative incidence curves. First, survival and cause-specific mortality curves are matched at pre-specified times. Then, iterative estimation of the number of events on the defined intervals is performed by multiplying the number at risk by a function of the survival probabilities. Iterations continue until the estimated number at risk agrees with the risk table. Results: In a simulation study, we explore our algorithm’s accuracy in reproducing summary statistics and survival curves from the original data. We demonstrate the utility of our algorithm by performing a meta-analysis investigating the impact of androgen deprivation therapy duration on other-cause mortality in prostate cancer patients. Conclusion: Flexible tools like our algorithm extend the utility of published studies and further their contribution to medical research.

07

Jun Chen

Statistics/Math

Enhancing Computational Efficiency in Dimension Reduction for Data under Populations with Expectation/Variance Relationship: Improved GLM-based Outer Product Canonical Gradient (OPCG)

Abstract

Reducing the dimension of the covariate (X) in a manner that preserves the regression relationship with the response (Y) plays a crucial role in the analysis of high-dimensional data. Methods for recovering the multi-index mean relationship when the response is categorical are not widely studied. Here we integrate the Generalized Linear Model (GLM) framework with Sufficient Dimension Reduction (SDR) and build on the Outer Product Canonical Gradient (OPCG) algorithm to achieve dimension reduction and recover the multi-index structure in X. Because the existing OPCG algorithm involves local smoothing estimation at every data point, it is not computationally efficient for large-scale data sets. To reduce this computational burden while retaining statistical accuracy, we estimate the OPCG using “support points” that optimally quantize a multivariate distribution using the notion of energy distance. We show that our approach greatly reduces the computational cost of fitting while incurring only a small loss in estimation accuracy. We illustrate the approach with a study of educational attainment in the Dogon of Mali, where the improved OPCG algorithm not only recovers the main structure of the original data but also finds the central subspace more efficiently.

08

Katherine Ahn

Statistics

Scalable Kernel Inverse Regression for Dimension Reduction Regression

Abstract

Sufficient Dimension Reduction (SDR) is a class of methods that reduces the dimension of high-dimensional covariates (X) while preserving the conditional distribution of the response (Y) given the covariates. A key step of our method is non-parametric estimation of the reduced eigenvector subspace of the moment statistic M = Cov(E[X|Y]). Here we consider computationally efficient kernel-based estimates of M that can accommodate multivariate, longitudinal, and partially observed data. We study the computational and statistical performance of coarse-grained algorithms for estimation of M inspired by optimal design and quadrature theory. Due to the nested moment structure of M, heuristics for bandwidth selection borrowed from other types of nonparametric regression are not always applicable in this setting. Simulation studies suggest that the tuning-free approach works well for dimension reduction. We illustrate the approach through a study of educational attainment in a longitudinal cohort of young people in Mali, West Africa.
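As a computational baseline for the moment statistic, here is the classical slicing estimator of M = Cov(E[X|Y]) for a scalar response (sliced inverse regression); the function name and defaults are illustrative, and the kernel-weighted, coarse-grained estimator described above would replace the hard slices with kernel weights and quadrature-style support points.

import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    # Classical sliced-inverse-regression estimate of M = Cov(E[X | Y]),
    # shown as a slicing-based baseline for a scalar response.
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(cov))   # whitening factor
    Z = (X - mu) @ L                             # standardized covariates
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)     # slice on the ranks of y
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)                  # slice mean of standardized X
        M += (len(idx) / n) * np.outer(m, m)     # weighted covariance of slice means
    evals, evecs = np.linalg.eigh(M)             # eigen-decomposition of M
    B = L @ evecs[:, ::-1][:, :n_dirs]           # top directions, back on the X scale
    return B, evals[::-1]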

09

Kevin Christian

Statistics

On the Role of Unstructured Training Data in Transformers’ In-Context Learning Capabilities

Abstract

Transformers have exhibited impressive in-context learning (ICL) capabilities: they can generate predictions for new query inputs based on sequences of inputs and outputs (i.e., prompts) without parameter updates. Efforts to provide theoretical explanations for the emergence of these abilities have primarily focused on the structured data setting, where input-output pairings in the training data are known. This scenario can enable simplified transformers (e.g., ones comprising a single attention layer without the softmax activation) to achieve notable ICL performance. However, transformers are primarily trained on unstructured data that rarely include such input-output pairings. To better understand how ICL emerges, we propose to study transformers that are trained on unstructured data, namely data that lack prior knowledge of input-output pairings. This new setting elucidates the pivotal role of softmax attention in the robust ICL abilities of transformers, particularly those with a single attention layer. We posit that the significance of the softmax activation partially stems from the equivalence of softmax-based attention models with mixtures of experts, facilitating the implicit inference of input-output pairings in the test prompts. Additionally, a probing analysis reveals where these pairings are learned within the model. While subsequent layers predictably encode more information about these pairings, we find that even the first attention layer contains a significant amount of pairing information.

10

Lingxuan Kong

Biostatistics

Empirical Bayesian modeling framework for semi-competing risks data with application to evaluate health outcomes of kidney transplant

Abstract

End-stage renal disease (ESRD) had increasing prevalence from 2000 to 2019, putting thousands of patients on costly dialysis each month. As a final treatment for patients on long-term dialysis, kidney transplantation performs significantly and substantially better than dialysis in quality of life, and the benefits of transplantation increase over time. However, due to the limited supply of kidney donors and the complex region-level donor allocation system, patients need to stay on dialysis until a donor is available. The uncertainty of kidney donor availability and rough predictions of possible health outcomes make patients hesitate over whether to receive a kidney transplant, which leads to 20% of valuable donor kidneys being wasted and to excess deaths on dialysis. To adequately model the complex associations between health outcomes, treatment availability, and patient characteristics, we propose a new shared-frailty multi-state modeling framework that incorporates multiple levels of random effects and time-varying effects to improve prediction accuracy. Compared to current multi-state models, our model relaxes the Markov assumption when estimating transition probabilities and estimates possible correlations among transition processes. An empirical Bayesian estimation algorithm is proposed to achieve better estimation efficiency and consistency of risk factors’ effects under multiple scenarios. The framework also accounts for regional differences in kidney donor availability and transplant quality, as well as time-varying effects of risk factors. The high prediction accuracy of our framework allows it to provide better guidance to transplant centers and ESRD patients.

11

Matt Raymond

Electrical and Computer Engineering

Joint Optimization of Piecewise Linear Ensembles

Abstract

Despite recent advances in neural networks for structured data (e.g. natural language and images), ensemble methods remain the state-of-the-art for many unstructured or tabular datasets. However, ensemble methods are frequently optimized using greedy or uniform weighting schemes. Furthermore, popular weak learners such as decision trees greedily learn boundaries and partition weights. Unfortunately, such greedy optimization schemes may result in suboptimal solutions, especially for objective functions that include non-trivial regularization terms. In this paper, we propose JOPLEn, a convex framework for the Joint Optimization of Piecewise Linear Ensembles. Given an ensemble of partitions, JOPLEn jointly fits a linear model to each cell in a partition. Furthermore, JOPLEn is easily extended to the multitask setting, allowing all tasks to be jointly optimized. Using proximal gradients, JOPLEn can utilize arbitrary convex penalties, including sparsity-promoting penalties such as the ℓ1-norm and ℓ∞,1 group norm. We investigate the performance of JOPLEn on single-task regression, single-task feature selection, and multitask feature selection. Besides improving regression performance, JOPLEn can easily extend linear multitask feature selection approaches such as Dirty LASSO to the nonlinear setting. We anticipate that JOPLEn will provide a principled method for improving performance and sparsity for many existing ensemble methods, especially those with complex regularization constraints.

12

Soham Bakshi

Statistics

Selective Inference for Time-Varying Effect Moderation

Abstract

The scientific community is increasingly focused on developing data analysis techniques to enhance mobile health interventions. A crucial aspect of this endeavor involves assessing the time-varying causal effect moderators. Effect modification, a scenario where the impact of treatment on outcomes varies based on other covariates, plays a significant role in decision-making processes. When there are tens or hundreds of covariates, it becomes necessary to use the observed data to select a simpler model for effect modification and then make valid statistical inference. To achieve this, the Lasso method is employed for selecting a lower complexity model for effect modification. Compared to a full model consisting of all the covariates, the selected model is much more interpretable. To ensure valid post-selection inference of our models, we take the conditional approach and construct an asymptotically valid pivot that is uniformly distributed when conditioned on the selection event.

13

Tiffany Parise

Electrical and Computer Engineering

Fairness via Robust Machine Learning

Abstract

Machine learning models are increasingly deployed to aid decisions with significant societal impact. Defining and assessing the degree of fairness of these models, therefore, is both important and urgent. One thread of research in Machine Learning (ML) aims to quantify the fairness of ML models using probabilistic metrics. To ascertain the fairness of a given model, many popular fairness metrics measure the difference in predictive power of that model across different subgroups of a population – typically, where one subgroup has historically been marginalized. A separate thread of research aims to construct robust ML models. Intuitively, robustness may be understood as the ability of a model to perform well even in the presence of noisy data. Typically, robust models are trained by intentionally introducing perturbations in the data. Our work aims to connect these two threads of research. We hypothesize that models trained to be robust are naturally more fair than those trained using standard empirical risk minimization. To what extent are fairness and robustness related? Do some notions of fairness and robustness have a stronger correlation than others? We investigate these questions empirically by setting up experiments to measure the relationship between these concepts. To study trade-offs between robustness, fairness, and nominal accuracy, we use a probabilistically robust learning framework (Robey et al., 2022) to train classifiers with varying levels of robustness on real-world datasets. We then use widely-used statistical metrics (Barocas et al., 2019) to evaluate the fairness of these models. Preliminary results indicate that probabilistically robust learning reduces nominal accuracy but increases fairness with respect to the evaluated metrics. The significance of such a trade-off would be the conceptualization of fairness in terms of robustness and the ability to increase model fairness without explicitly optimizing for fairness.

14

Yichao Chen

Statistics

Modeling Hypergraphs Using Non-symmetric Determinantal Point Processes

Abstract

Conventional statistical network modeling typically focuses on interactions between pairs of individuals. However, in many real-world applications, interactions often involve multiple entities. To bridge this gap, we propose a latent space model for hypergraphs based on a non-symmetric determinantal point process (DPP). Unlike existing hypergraph models that are driven solely by either similarity or diversity among nodes, our adjusted non-symmetric DPP structure allows for both repulsive and attractive interactions between nodes, as well as accounting for the popularity of each node. This approach significantly enhances the model’s flexibility. Our model also accommodates various types of hypergraphs without limitations on the cardinality and multiplicity of hyperedges. For parameter estimation, we employ the Adam optimizer in conjunction with maximum likelihood estimation. We establish the consistency and asymptotic normality of these maximum likelihood estimators; the proof is non-trivial due to the unique configuration of the parameter space. Simulation studies support the effectiveness of our method. Moreover, we apply our model to two real-world datasets, demonstrating its practical utility and its ability to provide insightful embeddings.
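For context, the standard L-ensemble form of a DPP over a ground set of nodes is presumably the building block here (notation mine; the specific non-symmetric parameterization is particular to the paper): for a kernel matrix $L$ indexed by the nodes, the probability of observing a hyperedge $S$ is

\[
\mathbb{P}(S) \;=\; \frac{\det(L_S)}{\det(L + I)},
\]

where $L_S$ is the principal submatrix of $L$ indexed by $S$. Allowing $L$ to be non-symmetric (subject to nonnegative principal minors so that the probabilities remain valid) is what admits both repulsive and attractive interactions between nodes.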

15

Yidan Xu

Statistics

Wasserstein Sensitivity Analysis

Abstract

We propose a new sensitivity analysis framework for partial identification of a wide range of causal estimands, including the conditional average treatment effect (CATE) and the average treatment effect (ATE). Rather than imposing pointwise or distributional assumptions on the unobserved confounders, the method constrains the distance between the counterfactual conditional distributions and the observed ones in the p-Wasserstein space. We show that the optimal sensitivity interval corresponds to the unique solution of a Wasserstein Distributionally Robust Optimization (WDRO) problem. The dual form of the problem admits two nested convex optimizations, which can be solved efficiently with only empirical measures. We establish consistency of the estimated bounds and the rate of convergence. Lastly, we demonstrate our method on simulated and real data examples, in comparison with existing sensitivity analysis models.

16

Yilun Zhu

Electrical Engineering and Computer Science

Mixture Proportion Estimation Beyond Irreducibility

Abstract

The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and the mixture. Previous work on MPE adopts the \emph{irreducibility} assumption, which ensures identifiability of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further present a resampling-based meta-algorithm that takes any existing MPE algorithm designed to work under irreducibility and adapts it to work under our more general condition. Our approach empirically exhibits improved estimation performance relative to baseline methods and to a recently proposed regrouping-based algorithm.
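For readers new to the problem, the standard MPE setup can be written as follows (notation mine): the observed mixture is

\[
F \;=\; \kappa\, H \;+\; (1 - \kappa)\, G, \qquad \kappa \in [0, 1],
\]

where samples are available from $F$ and from the known component $H$, while $G$ and $\kappa$ are unknown, and the goal is to estimate $\kappa$. The classical irreducibility assumption requires that $G$ itself cannot be decomposed as $\gamma H + (1-\gamma) G'$ for any $\gamma > 0$, which identifies $\kappa$ as the largest weight consistent with the data; the condition proposed in this work relaxes that requirement.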

17

Yue Yu

Statistics

On parameter estimation with Sinkhorn Divergence

Abstract

Without regularization, the generalization error of optimal transport (OT) objects, such as the Wasserstein distance, suffers from the curse of dimensionality: the rate of convergence slows down rapidly with the dimension d, which hinders their utility. In this work, we consider parametric estimation by minimizing the Sinkhorn divergence. We first prove that our estimator achieves √n-consistency and asymptotic normality, enabling the construction of confidence intervals via bootstrap methods. Additionally, we extend our analysis to a two-sample counterpart estimator, proving its √(nm/(n+m))-consistency and asymptotic normality. A key practical advantage of our methodology is its compatibility with existing Generative Adversarial Network (GAN)-based methods, requiring only minimal modifications for implementation in spatio-temporal training scenarios.
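To make the estimation target concrete, here is a minimal numpy sketch of a debiased Sinkhorn divergence between two empirical measures with uniform weights; the function names, squared-Euclidean cost, and fixed iteration count are illustrative choices, and conventions differ on whether the entropic term is included in the cost (this sketch reports the transport cost of the entropic plan).

import numpy as np

def sinkhorn_cost(x, y, eps=0.1, n_iter=300):
    # Entropic OT between uniform empirical measures on point clouds x, y (arrays of shape (n, d), (m, d)).
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost matrix
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                              # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                      # entropic transport plan
    return np.sum(P * C)

def sinkhorn_divergence(x, y, eps=0.1):
    # Debiased divergence S_eps(x, y) = OT_eps(x, y) - (OT_eps(x, x) + OT_eps(y, y)) / 2.
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))

# A parametric fit would then choose theta to minimize sinkhorn_divergence(model_samples(theta), data).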

18

Yuxuan Ke

Statistics

Spectral Solar Irradiance Missing Data Imputation with Temporally-Smoothed Matrix Factorization-Based Method

Abstract

The solar spectral irradiance (SSI), which is the solar energy received at the top of the Earth’s atmosphere at a given wavelength, is an important quantity in geophysical research. However, a significant amount of missing data can occur in the measurement process. A typical example is the missingness caused by instrument downtime, when there are no observed data at all. With this specific missing pattern, commonly used imputation methods such as linear interpolation cannot precisely predict the variability, especially when the missing band is wide. Prevalent matrix completion methods such as low-rank pursuit cannot effectively recover these missing bands either, because they ignore temporal smoothness and the solar irradiance’s 11-year cycle driven by the Sun’s periodic magnetic activity. In this project, we propose a matrix factorization-based imputation algorithm called SoftImpute with Projected Auto-regressive regularization (SIPA) that can effectively recover downtime missingness. SIPA consists of two parts: matrix low-rank pursuit and temporal smoothness preservation through an AR-like penalty. We design an efficient alternating algorithm to estimate the AR coefficients and solve for the factorized imputed matrices. A projection in the AR penalty term prevents the downtime missingness from disturbing non-downtime entries. In extensive numerical studies, we show that SIPA effectively imputes downtime missingness and outperforms competing methods.
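For context, the low-rank-pursuit half of such an algorithm is essentially SoftImpute (Mazumder, Hastie, and Tibshirani, 2010), sketched below with illustrative defaults; the AR-type temporal-smoothness penalty and its projection, which are the contribution described above, are not reproduced here.

import numpy as np

def soft_impute(X, observed, lam=1.0, n_iter=100):
    # Plain SoftImpute: alternate a soft-thresholded SVD with re-imposing the
    # observed entries. X may hold arbitrary values where observed is False.
    Z = np.where(observed, X, 0.0)                  # start with zeros at missing entries
    low_rank = Z
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                # soft-threshold the singular values
        low_rank = (U * s) @ Vt                     # nuclear-norm-penalized fit
        Z = np.where(observed, X, low_rank)         # keep data, impute the rest
    return low_rank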

19

Ziming Zhou

Electrical Engineering and Computer Science

Unveiling the Neural Tapestry: Introducing Principal Component Regression to Model Sensitivity, Underfitting, and Overfitting Diagnosis in Handwritten Digit Classification

Abstract

Statistical methods have been widely introduced to unveil the neural tapestry of the neural network prediction process. To eliminate potential multicollinearity and overfitting in existing neural network interpretation methods, this work introduces principal component regression (PCR) to decipher the intricate correlations between neural network performance, feature characteristics, and model structure. Through the application of PCR for feature extraction, our analysis underscores the significant impact of both low- and high-dimensional Principal Component Analysis (PCA) features on neural network performance. Notably, these features play distinct roles, with low-dimensional features primarily influencing the global shape of predictions and high-dimensional features affecting local noise. Furthermore, our investigation unveils a correlation between model underfitting and a more pronounced prediction probability distribution, as indicated by a lower estimated parameter of the symmetric beta distribution. On the other hand, the study identifies a link between model overfitting and the number of significant PCA features within the high-dimensional range. This association suggests a heightened sensitivity of the model to local details and noise, signifying a potential source of overfitting. In summary, our work introduces PCR as a tool to enhance the interpretability of neural network predictions, shedding light on new opportunities to diagnose model sensitivity, underfitting, and overfitting in 2D image predictions.
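For readers unfamiliar with the core tool, principal component regression simply regresses the response on the leading PCA scores of the inputs; a minimal scikit-learn sketch follows, where the component count is an arbitrary placeholder rather than the value used in this work.

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def fit_pcr(X, y, n_components=20):
    # Principal component regression: PCA feature extraction followed by
    # ordinary least squares on the retained component scores.
    model = make_pipeline(PCA(n_components=n_components), LinearRegression())
    model.fit(X, y)
    return model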

20

Ziyu Zhou

Statistics

Forecast Phenology in Near-term with Mechanistic and Data-driven Models

Abstract

Predicting phenology, which is the timing of important biological events, is crucial during climate change. Changes in phenological events have significant impacts on ecosystem functioning and human activities, including carbon sequestration, tourism, and agriculture. Diverse models have been used to predict phenology, ranging from process-based to data-driven, each with their pros and cons. While process-based models are informed by and contribute to ecological knowledge, data-driven models can achieve high predictive accuracy at a cost of interpretability. In this study, we compare these two types of models to understand their capabilities in predicting phenology under climate change. We focused on phenology of temperate deciduous forests in the Appalachian and Cumberland Plateau regions from 2000 to 2021. Across 100 randomly selected sites, we extracted a land surface phenology metric, start of the season (SOS), from MODIS Aqua and Terra Vegetation Indices 16-Day L3 Global 500m datasets. We retrieved daily climatic predictors, such as temperature, precipitation, and day length, from the Daymet dataset. We trained three process-based models (Thermal Time (AA), Alternating Time (AT), and Parallel (PA)) and a linear regression model to predict SOS with climatic variables. We evaluated the two types of models from three aspects: 1) root mean square error (RMSE) to evaluate short-term predictive accuracy both in- and out-of-sample, 2) uncertainty of model parameters and their correlation to evaluate parameter stability and identifiability, 3) predictive distribution with simulated climatic variables to evaluate the ability to generate realistic predictions. We showed that linear regression, a simple data-driven model, demonstrated higher out-of-sample accuracy in short-term predictions. Process-based models, although comparable in predictive accuracy, might suffer from parameter identifiability issues. When projecting into the future, linear regression is more likely to generate ecologically unrealistic predictions. Our findings underscore the complexity of phenology modeling and the need for integrative approaches to accurately predict phenology under climate change.
