Speed & Poster Sessions


Speed Session

March 10th, 1:00pm – 2:30pm
Vandenberg

Moderated by
Simon Fontaine
PhD Student, Statistics


Madeline Abbott

Second-year PhD Student
Biostatistics

Modeling cigarette use with mobile health data from a study on smoking cessation

Abstract

Ecological momentary assessment (EMA), which consists of frequently delivered surveys sent to subjects’ smartphones, allows for the collection of data in real time and in subjects’ natural environments. As a result, data collected using EMA can be particularly useful in understanding rapid temporal variation in subjects’ states and environments, as well as how these states and environments relate to outcomes of interest. Focusing on a study of smoking cessation, we use data collected via EMA from current smokers who recently quit to model changes in their emotional state over time and to assess possible associations between their state and risk of cigarette use. We apply a factor model to summarize 23 different emotions as two key psychological concepts: positive affect and negative affect. We then use positive and negative affect, along with a measure of urge to smoke, to model lapses in smoking cessation over time. Based on currently available data, we find that positive and negative affect summarize the emotions in our data well and that time since quitting is strongly associated with risk of cigarette use.

Coauthors: Walter Dempsey, Inbal Nahum-Shani and Jeremy Taylor


Neophytos Charalambides

Fifth-year PhD Student
EECS

Resilient and Secure Distributed Matrix Inversion

Abstract

A cumbersome operation in statistics, signal processing, numerical analysis, linear algebra, optimization, and machine learning is inverting large full-rank matrices. We propose a coded computing approach for recovering matrix inverse approximations. We present an approximate matrix inversion algorithm that does not require a matrix factorization but instead uses a black-box least squares optimization solver as a subroutine to estimate the inverse of real full-rank matrices. We then present a distributed framework in which our algorithm can be implemented, and show how we can leverage sparsest-balanced MDS generator matrices to devise inverse coded computing schemes. We focus on balanced Reed-Solomon codes, which are optimal in terms of computational load and communication from the workers to the master server. We also discuss how our algorithms can be used to compute the pseudoinverse of a full-rank matrix, and how the communication can be secured from eavesdroppers. Our approach can also be utilized for exact matrix product recovery.

Coauthors: Alfred Hero and Mert Pilanci


Alicia Dominguez

Fourth-year PhD Student
Biostatistics

Bias accumulates in polygenic risk scores constructed with larger sets of markers in multiple complex traits

Abstract

Polygenic risk scores (PRS) are increasingly used to predict genetic risk for many complex traits. Many PRS methods, like p-value thresholding (P&T), rely on effect size estimates from genome-wide association studies (GWAS) of predominantly European (EUR) samples, which can potentially bias risk estimates for non-EUR populations, especially individuals of African (AFR) ancestry. How this bias aggregates over PRS with larger sets of markers is not well understood.

Our sample (n=5,848) consists of genotyped individuals from the University of Michigan (UM) Prechter Bipolar Study and Michigan Genome Initiative. To estimate ancestry for these individuals, we performed principal component analysis on their genomic data and data (n=2,504) from the 1000 Genomes Project (1KGP). We used publicly available GWAS summary statistics and the P&T method to infer PRS for the UM and 1KGP samples for four complex traits: bipolar disorder (BD), height, schizophrenia (SCZ), and type II diabetes. For 1KGP participants, we quantify the transferability of EUR-biased genetic studies to other populations by comparing PRS across populations for multiple sets of markers. For the UM sample, we evaluate the relationship between the number of markers and predictive accuracy for PRS of BD and SCZ and use regression models to evaluate factors associated with PRS.

For traits like BD, PRS calculated with more markers were more predictive of affection status in our UM sample. For 1KGP, we see directional inconsistencies across populations for several PRS, most evident in PRS constructed with more markers. Furthermore, PRS calculated with more markers were significantly associated (p<0.05) with several ancestry principal components in both samples.

There is a tension between constructing more complex, informative PRS and their susceptibility to population structure. The result is more informative PRS for individuals of EUR ancestry but biased PRS for individuals of other ancestries.
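At its core, the P&T scoring step described above sums GWAS effect sizes over markers that pass a p-value threshold, weighted by each individual’s allele dosage. The following toy sketch (with made-up numbers, not the study’s actual pipeline) illustrates how the marker set, and hence the score, grows as the threshold is relaxed:

```python
# Toy sketch of p-value thresholding (P&T) polygenic risk scoring.
# All numbers are made up for illustration; a real analysis uses GWAS
# summary statistics and per-individual allele dosages (0, 1, or 2).

def prs_pt(genotypes, betas, pvalues, threshold):
    """Sum beta * dosage over markers whose GWAS p-value passes the threshold."""
    kept = [j for j, p in enumerate(pvalues) if p < threshold]
    return [sum(betas[j] * g[j] for j in kept) for g in genotypes]

# Two individuals, four markers.
genotypes = [[0, 1, 2, 1],
             [2, 0, 1, 0]]
betas = [0.10, -0.05, 0.20, 0.01]
pvalues = [1e-8, 0.30, 1e-6, 0.04]

# A stricter threshold keeps fewer markers in the score.
print(prs_pt(genotypes, betas, pvalues, 5e-8))   # only marker 0 passes
print(prs_pt(genotypes, betas, pvalues, 0.05))   # markers 0, 2, and 3 pass
```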

Coauthors: Sebastian Zoellner and Yuhua Zhang


Abigail Loe

First-year Master’s Student
Biostatistics

Just Statistics: In the Dark, Statistical Analysis, and the Failure of the Justice System

Abstract

On July 16th, 1996, someone walked into Tardy Furniture in Winona, Mississippi, and killed four employees. During an investigation that stressed and shocked the quiet community, District Attorney Doug Evans eventually accused, arrested, and tried Curtis Flowers, a 26-year-old Black man. Curtis was first tried in 1997, found guilty, and sentenced to death in Mississippi’s Fifth Circuit Court District. He won an appeal based on prosecutorial misconduct, and since then Doug Evans has retried Curtis five times for the same crime. During Curtis Flowers’ many trials, his defense team alleged racial bias in the seating of the jury. They were frequently unable to prove it in the trial court, but often won on appeal. Statistics, however, can show that something more than chance has dictated the actions of a prosecutor or the seating of a jury. Curtis Flowers’ case shows that data has the potential to reveal the bias inherent in U.S. society, but a consistent disregard for statistical methods by the U.S. legal system has sidelined an avenue for rigorous proof. In this presentation, I examine the role that statistics can play in the law, and the ways in which the law has sidelined mathematical methods.


Anandkumar Patel

Second-year Master’s Student
Statistics & D3 Center

Standardized Effect Sizes for the Comparison of the Embedded, Clustered Adaptive Interventions in Clustered SMARTs

Abstract

In many fields, such as in medicine and education, it is often necessary to make decisions about how best to intervene sequentially at the cluster level (e.g., at the level of a clinic or classroom), in a way that adapts and re-adapts the intervention over time, depending on the evolving needs of the cluster (including the cluster’s response to prior intervention). Clustered Adaptive Interventions (CAIs) provide clinicians, educators, and other policymakers a guide to making such sequential, clustered intervention decisions. Often, however, there are open scientific questions preventing scientists from recommending a particular CAI. Clustered, sequential multiple assignment randomized trials (clustered SMARTs) are one type of experimental design that can be used to answer such questions and develop highly effective CAIs. In SMARTs, randomization occurs at multiple stages corresponding to critical intervention decision points. Each randomization allows researchers to investigate and learn how best to adapt (and potentially re-adapt) the intervention strategy at the cluster level while measuring the outcome at the individual level. Typically, clustered SMARTs have a number of embedded CAIs by design, and a common primary aim in a SMART is the comparison of these embedded CAIs. This manuscript contributes to the statistical literature on clustered SMARTs by (i) defining effect sizes for the comparison of embedded CAIs in clustered SMARTs, (ii) deriving methods for estimating these effect sizes, and (iii) constructing confidence intervals for the estimated effect sizes. The methods are illustrated using data from a study that seeks to understand how best to implement Cognitive Behavioral Therapy in high schools in Michigan.
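For orientation, a naive standardized mean difference between two arms of cluster-level mean outcomes can be computed as below. This toy (with invented numbers) ignores the replicated randomizations and weighting that the proposed clustered-SMART effect sizes handle properly; it only shows the basic quantity being standardized:

```python
from math import sqrt

# Illustrative standardized mean difference between two groups of
# cluster-level mean outcomes (pooled-SD denominator). Invented data;
# not the manuscript's estimator, which accounts for the SMART design.

def standardized_effect(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

cai_1 = [3.1, 2.8, 3.5, 3.0]   # cluster means under one embedded CAI
cai_2 = [2.4, 2.6, 2.2, 2.9]   # cluster means under another
print(round(standardized_effect(cai_1, cai_2), 3))
```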

Coauthor: Daniel Almirall


Donald Scott

Junior
Statistics

Discussion and Implementation of the Intra-Cluster Correlation in the Design and Analysis of Clustered SMARTs

Abstract

In many fields, such as in medicine and education, it is often necessary to make decisions about how best to intervene sequentially at the cluster level (e.g., at the level of a clinic or classroom), in a way that adapts and re-adapts the intervention over time, depending on the evolving needs of the cluster (including the cluster’s response to prior intervention). Clustered Adaptive Interventions (CAIs) provide a guide to making such sequential, clustered intervention decisions. Clustered, sequential multiple assignment randomized trials (clustered SMARTs) are one type of experimental design that can be used to answer questions regarding the development of highly effective CAIs. In SMARTs, randomization occurs at multiple stages corresponding to critical intervention decision points. Each randomization allows researchers to investigate and learn how best to adapt (and potentially re-adapt) the intervention strategy at the cluster level, while measuring the outcome at the individual level. The intra-cluster correlation (ICC) plays a crucial role in the design and analysis of clustered SMARTs: it quantifies how much of the outcome variance in the study is attributable to the grouping of individuals into clusters. This manuscript contributes to the statistical literature on clustered SMARTs by (i) describing a method for calculating the ICC from existing clustered SMART data and (ii) showing how an ICC calculated from existing data can inform the design of a new clustered SMART.
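A classical way to estimate the ICC is the one-way ANOVA estimator: compare between-cluster and within-cluster mean squares. The sketch below (balanced clusters, invented outcomes) illustrates that calculation; the manuscript’s methods for clustered SMART data are more involved:

```python
# Toy ANOVA-based ICC estimate: the share of outcome variance attributable
# to cluster membership. Balanced clusters only, invented numbers; real
# clustered-SMART data call for the estimators the abstract describes.

def icc_anova(clusters):
    m = len(clusters[0])                      # common cluster size
    k = len(clusters)                         # number of clusters
    grand = sum(sum(c) for c in clusters) / (k * m)
    means = [sum(c) / m for c in clusters]
    msb = m * sum((mu - grand) ** 2 for mu in means) / (k - 1)
    msw = sum((x - mu) ** 2 for c, mu in zip(clusters, means) for x in c) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

clusters = [[4.0, 4.2, 3.9], [2.1, 2.0, 2.3], [3.0, 3.1, 2.8]]
print(round(icc_anova(clusters), 3))  # near 1: clusters differ far more than members within them
```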

Coauthor: Daniel Almirall


Fatema Shafie Khorassani

Third-year PhD Student
Biostatistics

Data Fusion for Time-to-Event Outcomes

Abstract

Despite significant reductions in cancer mortality over the past three decades, racial disparities in cancer-specific mortality persist. Studying factors associated with these observed disparities requires data on many variables, including demographics, healthcare access, socioeconomic status, and comorbidities. Existing national cancer surveillance databases each collect part of the information needed for studying racial disparities in cancer. Integrating data from multiple sources allows us to study associations between race and cancer-specific mortality over time, adjusted for important confounders. Existing data integration methods do not consider time-to-event outcomes and hence are not applicable to studying cancer-specific mortality. Data integration methods for time-to-event outcomes can have many applications, including improving risk predictions, adjusting for dependent censoring, finding new associations, adjusting for unmeasured confounding, and improving the efficiency of analyses.

We propose a doubly robust regression method for data fusion with a time-to-event outcome. Data fusion is a particularly challenging problem in data integration, in which no subject has complete data on all the covariates and the outcome. Some existing missing data methods have been extended to the setting of data fusion; however, they do not account for time-to-event outcomes. We present a method for regressing a time-to-event outcome on a set of covariates from two integrated datasets that include some overlapping variables. We will present a class of doubly robust estimators that are unbiased if either the data source model or the model of the unobserved covariates is specified correctly. Through simulation studies we will assess the bias and coverage of our estimators under correctly specified and misspecified models, and we will apply the method to integrate cancer-specific mortality information from the Surveillance, Epidemiology, and End Results (SEER) Program with confounders collected in the National Cancer Database (NCDB) that are not available in SEER.

Coauthors: Jeremy M.G. Taylor and Xu Shi


Jiahao Shi

First-year PhD Student
IOE

Accelerating Stochastic Sequential Quadratic Programming for Equality Constrained Stochastic Optimization using Predictive Variance Reduction

Abstract

We propose a variance reduction method for solving equality constrained stochastic optimization problems. Specifically, we develop a method based on the sequential quadratic programming paradigm that utilizes variance reduction in the gradient approximation via the stochastic variance reduced gradient (SVRG) technique. We prove exact convergence in expectation to first-order stationary points with non-diminishing stepsize sequences. Finally, we demonstrate the practical performance of our proposed algorithm on standard constrained machine learning problems.
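The SVRG building block itself is simple: each stochastic gradient is corrected using the corresponding gradient at a periodically refreshed snapshot plus the full gradient at that snapshot. The sketch below applies it to an unconstrained toy least-squares problem; the paper embeds this estimator inside an SQP scheme for constrained problems, which this sketch does not attempt:

```python
import random

# Toy SVRG on the 1-D problem f(w) = mean_i (w - y_i)^2 / 2, whose
# minimizer is mean(y). Shows only the variance-reduced gradient
# estimator, not the paper's constrained SQP method.

def svrg(ys, w0, step, epochs, inner):
    random.seed(0)
    w, n = w0, len(ys)
    for _ in range(epochs):
        snapshot = w
        full_grad = sum(snapshot - y for y in ys) / n   # gradient of f at the snapshot
        for _ in range(inner):
            i = random.randrange(n)
            # stochastic gradient at w, corrected by the snapshot information
            g = (w - ys[i]) - (snapshot - ys[i]) + full_grad
            w -= step * g
    return w

ys = [1.0, 2.0, 3.0, 6.0]
w = svrg(ys, w0=0.0, step=0.1, epochs=30, inner=20)
print(round(w, 3))   # should land near the minimizer, mean(ys) = 3.0
```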

Coauthors: Albert S. Berahas, Zihong Yi and Baoyu Zhou


Xianlin Sun

Second-year Master’s Student
Statistics

Machine Learning Forecast and Statistical Exploration of Equatorial Ionization Anomaly Based on Total Electron Content

Abstract

The ionospheric total electron content (TEC), derived from multi-frequency Global Navigation Satellite System (GNSS) receivers, has become one of the most widely used datasets in ionospheric research. Recent advances in completing TEC maps and forecasting TEC with modern machine learning (ML) algorithms have significantly improved its usability. A prominent feature of TEC data is the equatorial ionization anomaly (EIA): markedly elevated TEC values that persist around the magnetic equator, forming two bands, one on each side of the equator. As we enter the multi-GNSS era, combining traditional space science with cutting-edge statistical learning opens a new frontier for specifying and forecasting the EIA. In this project, we aim to characterize EIA phenomena by automatically identifying their location and statistically describing their properties. We adopt a Gaussian Mixture Model (GMM) with a flexible number of peaks to specify the EIA, and a series of state-of-the-art ML algorithms will be used to forecast local, regional, and global EIA behavior. We automatically identify the EIA peaks and evaluate peak TEC values, prominences, and other key features such as peak-to-equator distances and hemispheric asymmetry. Building on these EIA properties, we can further explore the evolution of EIA peaks and the frequency, duration, intensity, and periodicity of EIA bifurcation by fitting ML models to the constructed EIA database together with data on space weather conditions, such as solar wind and FISM solar radiation measurements.
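To make the GMM peak-finding idea concrete, the sketch below runs a minimal EM algorithm for a two-component 1-D Gaussian mixture on synthetic “latitudes” clustered around two crests. The project’s model allows a flexible number of peaks and works on real TEC profiles; this is only the core EM mechanic:

```python
from math import exp, pi, sqrt

# Minimal EM for a two-component 1-D Gaussian mixture, standing in for the
# flexible-peak GMM fitted to latitudinal TEC profiles. The "observations"
# are synthetic latitudes, not real TEC data.

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_gmm2(xs, mus, variances, weights, iters=50):
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point
        r0 = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, mus[0], variances[0])
            p1 = weights[1] * normal_pdf(x, mus[1], variances[1])
            r0.append(p0 / (p0 + p1))
        # M-step: update weights, means, and variances
        n0 = sum(r0)
        n1 = len(xs) - n0
        weights = [n0 / len(xs), n1 / len(xs)]
        mus = [sum(r * x for r, x in zip(r0, xs)) / n0,
               sum((1 - r) * x for r, x in zip(r0, xs)) / n1]
        variances = [sum(r * (x - mus[0]) ** 2 for r, x in zip(r0, xs)) / n0,
                     sum((1 - r) * (x - mus[1]) ** 2 for r, x in zip(r0, xs)) / n1]
    return mus, variances, weights

# Two synthetic "EIA crests" near -15 and +15 degrees magnetic latitude.
xs = [-17.0, -15.5, -14.2, -15.9, -13.8, 14.1, 15.0, 16.2, 14.8, 15.6]
mus, variances, weights = em_gmm2(xs, mus=[-5.0, 5.0],
                                  variances=[25.0, 25.0], weights=[0.5, 0.5])
print([round(m, 1) for m in sorted(mus)])  # recovered crest locations
```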

Coauthors: Shasha Zou, Yang Chen, Hu Sun and Jiaen Ren


Sahita Manda

Junior
Psychology

Experience of stigma and its relationship to identification with the neurodiversity model for Indian parents of children with autism spectrum disorder

Abstract

It is widely recognized that individuals with autism spectrum disorder (ASD) and their families continue to face extensive stigma and that much of the current research on ASD is deficit-focused. Diversity and inclusion perspectives are emerging, but there is less of a focus on how stigma affects the adoption of these approaches. In collaboration with the University of Michigan Department of Psychology and the national Indian organization Action For Autism, our research aims to understand the experience of stigma and its relationship to identification with the neurodiversity model for Indian parents of children with ASD. The study was carried out by administering online surveys through the Qualtrics platform to Indian parents residing in India (N=56). This study explores the extent to which Asian value adherence, child functioning, and perceived ASD stigma contribute to parental alignment with the neurodiversity model. It also investigates the ways in which alignment with the model affects parental stress, isolation from family and friends, parenting goals, identification of the child’s strengths, and positive perceptions about raising a child with ASD. Preliminary findings demonstrate statistically significant correlations between a child’s ASD behaviors, perceived ASD stigma, parental stress, and isolation from family and friends. A more complex mediation model of the effects of neurodiversity alignment on these variables will be presented and will have implications for the adoption of strength-based practices and the reduction of stigma associated with ASD within different cultural contexts.

Coauthors: Elizabeth Buvinger, Shichi Dhar and Harika Veldanda


Ziping Xu

Fourth-year PhD Student
Statistics

On the Statistical Benefits of Curriculum Learning

Abstract

Curriculum learning (CL) is a commonly used machine learning training strategy. However, we still lack a clear theoretical understanding of CL’s benefits. In this paper, we study the benefits of CL in the multitask linear regression problem under both structured and unstructured settings. For both settings, we derive the minimax rates for CL with the oracle that provides the optimal curriculum and without the oracle, where the agent has to adaptively learn a good curriculum. Our results reveal that adaptive learning can be fundamentally harder than the oracle learning in the unstructured setting, but it merely introduces a small extra term in the structured setting. To connect theory with practice, we provide justification for a popular empirical method that selects tasks with highest local prediction gain by comparing its guarantees with the minimax rates mentioned above.

Coauthor: Ambuj Tewari


Guanghao Zhang

First-year PhD Student
Biostatistics

A bipartite graph model for medical code mapping between healthcare systems

Abstract

It is notorious that electronic health records (EHR) data do not talk to each other. Due to financial incentives and differential care practice, the same clinical concept can often be described by alternative medical codes in different healthcare systems, leading to idiosyncratic “dialects” of EHRs across systems. Variability in medical coding has been observed for decades and can degrade model performance when models are applied to a new healthcare system. To facilitate data integration and improve model transportability, we adopt principles in how humans talk to each other and develop a data-driven method that automatically maps medical codes between two systems. We formulate a bipartite graph model that, unlike existing language translation methods, is naturally symmetric, accommodates all patterns such as one-to-one and one-to-many mappings, and does not require prior knowledge on code mapping or grouping. We demonstrate the validity of our proposed medical code mapping method through a simulation study and an application study of mapping ICD codes between two healthcare systems.
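For intuition about why linked records make mapping possible, the toy below scores candidate code pairs by how often they co-occur for the same (synthetic) patients and keeps the best match per code. This is only a heuristic illustration; the authors’ bipartite graph model is symmetric and handles one-to-many mappings in a principled way:

```python
# Illustrative co-occurrence heuristic for matching codes across two EHR
# "dialects". Synthetic linked records; NOT the authors' bipartite graph
# model, just a naive baseline that motivates it.

def candidate_map(records_a, records_b):
    """records_a[i], records_b[i]: code sets for the same patient in systems A, B."""
    counts, seen_a, seen_b = {}, {}, {}
    for ca, cb in zip(records_a, records_b):
        for a in ca:
            seen_a[a] = seen_a.get(a, 0) + 1
            for b in cb:
                counts[(a, b)] = counts.get((a, b), 0) + 1
        for b in cb:
            seen_b[b] = seen_b.get(b, 0) + 1
    # Jaccard-style score; keep the best-scoring match in B for each A code.
    best = {}
    for (a, b), c in counts.items():
        score = c / (seen_a[a] + seen_b[b] - c)
        if score > best.get(a, (0, None))[0]:
            best[a] = (score, b)
    return {a: b for a, (s, b) in best.items()}

records_a = [{"A1"}, {"A1", "A2"}, {"A2"}, {"A1"}]
records_b = [{"B9"}, {"B9", "B4"}, {"B4"}, {"B9"}]
print(candidate_map(records_a, records_b))  # A1 pairs with B9, A2 with B4
```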

Coauthors: Xiaoou Li, Tianxi Cai and Xu Shi


Ruixuan Zhang


Civil and Environmental Engineering

Is the Car Following Behaviour of Human Drivers Affected when Following Autonomous Vehicles?

Abstract

In this work, we use naturalistic driving data from the NGSIM and Lyft Level 5 prediction datasets to evaluate the potential effects of autonomous vehicles (AVs) on human drivers’ car-following behavior. We use time headway time series as a proxy to capture the car-following behavior of human drivers. A nested fixed model is developed to find possible changes when human drivers are following different types of vehicles (human-driven vehicles or AVs). The factors included in this model are the platoon structure (Legacy-Following-Legacy and Legacy-Following-Autonomous-Vehicle), road type (freeway and urban), time period (morning and afternoon), and lane (right, middle, and left). Results indicate a statistically significant difference between the car-following behavior of drivers when they follow a human-driven vehicle compared to an AV. This change manifests as a reduction in the mean and variance of time headways when human drivers follow an AV. These findings can bridge the gap between anticipated and real-world impacts of AVs on traffic streams as well as roadway safety and capacity.

Coauthors: Sara Masoud and Neda Masoud


Yongwen Zhuang

Fourth-year PhD Student
Biostatistics

A matrix completion approach for potential disease risk prediction

Abstract

Identification of at-risk populations is important for early-stage disease prevention. While a growing number of large-scale GWAS support the risk prediction of various diseases through polygenic risk scores (PRS), the rapidly increasing number of PheWAS studies provides further insight into the prediction of rare diseases using multiple PRS across the phenome spectrum. However, two major challenges remain in utilizing phenome-wide information for risk prediction. First, solutions remain unclear for the “unsupervised” scenario, where little or no phenotypic information is available for the model calibration step of existing methods. Second, existing cross-phenotype risk prediction methods are often trained in a disease-by-disease fashion, leading to an increased computational burden when a large number of diseases are of interest. We propose a computationally efficient matrix completion approach to identify potential at-risk individuals for diseases with a small amount of case information by combining prior knowledge about individual similarity (constructed using genetic relatedness and health-related features) and disease similarity available in various external data sources. Through simulations and analysis of biobank data, we show that the proposed method outperforms benchmark methods in terms of prediction accuracy and AUC.

Coauthors: Bhramar Mukherjee and Seunggeun Lee

Poster Session

March 10th, 2:30pm – 4:00pm
Hussey


Prayag Chatha

Second-year PhD Student
Statistics

Early Detection of Alcoholic Liver Disease in the Optum Claims Dataset with Transformers

Abstract

Alcoholic liver disease (ALD) is a leading cause of liver-related death worldwide. Unfortunately, ALD is often diagnosed too late for effective intervention. The Optum Claims dataset contains billing codes for the employer-sponsored insurance claims of millions of individuals—a vast amount of observational data about a general population, including patients with ALD. As a patient interacts with the medical system over time, they generate a detailed sequence of ICD diagnostic codes, lab results, and drug prescriptions in Optum Claims. A transformer is a deep learning architecture that excels at modeling long-range dependencies in sequential data through self-attention. Unlike a recurrent neural network, a transformer admits parallel computation for efficient training. We developed a transformer-based model called “tf-md” that differentiates early-stage and late-stage ALD based on Optum Claims data. tf-md achieved a validation AUROC of 0.689, whereas a fully-connected “bag of words” neural network model had a best AUROC of 0.674. The latter model has access only to the frequency of codes, not their sequence position, suggesting that the ordering of a patient’s insurance claims contains information that helps to detect ALD early.

Coauthors: Jessica Mellinger and Jeffrey Regier


Irena Chen

Fourth-year PhD Student
Biostatistics

Modeling Individual Variability to Predict Health Outcomes: A Joint Hierarchical Bayesian Approach

Abstract

Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones such as estradiol (E2) and follicle-stimulating hormone (FSH) may predict changes in women’s health during the midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability may also provide critical information about disease risks and health outcomes. In this paper, we develop a joint model that estimates subject-level means and variances of longitudinal predictors to predict a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in variances or perform two-stage estimation where estimated marker variances are treated as observed. Analyses of women’s health data reveal that larger variability of E2 and higher mean levels of E2 and FSH are associated with higher levels of fat mass change across the menopausal transition.

Coauthors: Zhenke Wu, Siobán D. Harlow, Carrie A. Karvonen-Gutierrez, Michelle M. Hood and Michael R. Elliott


Seokhyun Chung

Fourth-year PhD Student
IOE

Federated Condition Monitoring Signal Prediction with Improved Generalization

Abstract

Revolutionary advances in Internet of Things technologies have paved the way for a significant increase in computational resources at edge devices that collect condition monitoring (CM) data. This poses a significant opportunity for federated analytics, which exploits edge computing resources to distribute model learning, reduce communication traffic, and circumvent the need to share raw data. In this paper we study CM signal prediction where operating units that have data storage and computational capabilities jointly learn models without sharing their collected CM signals. Specifically, we first propose a framework for CM signal prediction and introduce a federated approach that aims to improve generalization by encouraging flat solutions through distributed computations. Then, a personalization approach is proposed to adapt the learned model to new clients without losing old knowledge. We examine our proposed framework on CM signals from aircraft turbofan engines under three realistic federated CM scenarios. Experimental results highlight the advantageous features of the proposed approach in improving generalization while decentralizing model inference.

Coauthor: Raed Al Kontar


Dylan Glover

Second-year Master’s Student
Statistics

Forecasting Geomagnetically Induced Currents at the Ottawa Magnetometer Station using ACE Variables

Abstract

Geomagnetic storms occur when high-speed solar wind induces fluctuations in the Earth’s geomagnetic field. We consider a forecasting model of these fluctuations as a proxy for geomagnetically induced currents (GICs) produced during storms, which can damage ground-based infrastructure and result in regionalized power and utility blackouts. The target quantity in this study was the maximum of the horizontal component of the geomagnetic field dB/dt over a 20-minute interval at the Ottawa SuperMAG magnetometer station. We chose to study storm times only so that the model learns storm behavior, rather than the characteristics of idle time periods. The 1-min resolution interplanetary magnetic field and plasma data were gathered by NASA’s Advanced Composition Explorer (ACE), and then smoothed, to forecast the target with one hour lead time. The random forest model was chosen to enable post-hoc interpretability using SHAP values, which explain the contribution of individual features to each test set observation’s point estimate prediction. Preliminary results indicate reasonable root mean square error of approximately 15 nT (nanotesla) on the test set, so this project will focus on the model as a starting point for quantifying the performance of various modeling choices under various data processing and splitting regimes. In future work, this model will also be compared to models trained on data originating from other stations at various latitudes, to determine how location may influence the features driving held-out test set performance.
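The forecast target itself is easy to construct from 1-minute field samples: difference consecutive samples to get dB/dt and take the maximum over each 20-minute window. The sketch below uses synthetic values, not SuperMAG data:

```python
# Sketch of the forecast target: from 1-minute horizontal-field samples,
# take minute-to-minute differences dB/dt and report the maximum over each
# 20-minute window. Values are synthetic, not SuperMAG measurements.

def window_max_dbdt(b_horizontal, window=20):
    """b_horizontal: 1-min samples of the horizontal field (nT)."""
    dbdt = [abs(b_horizontal[i + 1] - b_horizontal[i])
            for i in range(len(b_horizontal) - 1)]
    return [max(dbdt[i:i + window]) for i in range(0, len(dbdt), window)]

# 41 synthetic samples -> 40 derivatives -> two 20-minute windows;
# a brief 5 nT bump early on, then a quiet interval.
b = [100.0 + (5.0 if 18 <= i <= 19 else 0.0) for i in range(41)]
print(window_max_dbdt(b))  # active window, then quiet window
```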

Coauthor: Daniel Iong


Yiling Huang

Senior
Mathematics & Statistics

Balance Assessment of Matched Data with Multiple Treatment Levels

Abstract

Identifying and estimating causal effects of treatments is of significant research interest. In doing so, similar data are oftentimes matched into one stratum, and subsequent inferences of causality are carried out based on these strata. In particular, when the data are from observational studies, properly matching observations by their treatment assignment probabilities is especially important for removing potential selection bias induced by selecting observations that receive specific treatments in a non-randomized fashion. Therefore, it is an important task to evaluate whether matching was done properly, that is, whether the covariates are equally distributed in different treatment groups given the matching information. Traditional methods of matching evaluation involve visually investigating summary statistics, such as the standardized mean difference, covariate by covariate, but lack uncertainty quantification of the conclusion and are less convenient than an omnibus test that checks matching validity for all covariates at once. We propose a hypothesis test that expresses treatment assignment probabilities through an adjacent-category logistic regression model and provides an omnibus test of matching for all covariates by testing the global null β = 0 in the language of regression models. In this thesis, we adopt a χ2 approximation of the asymptotic distribution of the test statistic, inspired by the Rao score test. An application of the test indicates that the matching results produced by a matching algorithm can be further improved.
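The traditional per-covariate diagnostic that the proposed omnibus test improves upon, the standardized mean difference (SMD), can be sketched as below. The data are invented for illustration:

```python
from math import sqrt

# The per-covariate balance diagnostic contrasted with the omnibus test:
# the standardized mean difference (SMD) of a covariate between two
# treatment groups, using the average of the group variances. Invented data.

def smd(x_treat, x_ctrl):
    mt, mc = sum(x_treat) / len(x_treat), sum(x_ctrl) / len(x_ctrl)
    vt = sum((x - mt) ** 2 for x in x_treat) / (len(x_treat) - 1)
    vc = sum((x - mc) ** 2 for x in x_ctrl) / (len(x_ctrl) - 1)
    return (mt - mc) / sqrt((vt + vc) / 2)

age_treat, age_ctrl = [52.0, 61.0, 47.0, 55.0], [50.0, 58.0, 49.0, 54.0]
print(round(smd(age_treat, age_ctrl), 3))  # |SMD| < 0.1 is a common balance rule of thumb
```

Checking one covariate at a time like this gives no overall uncertainty quantification, which is the gap the regression-based omnibus test fills.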

Coauthor: Mark Fredrickson


Roman Kouznetsov

Third-year PhD Student
Statistics

deepST: A Graph Convolutional Autoencoder for Spatial Transcriptomics

Abstract

Spatial transcriptomics (ST) measures gene expression for individual cells and pairs these measurements with the positions of cells within a tissue sample. This opens the door for statistical methods to explore how neighboring cells interact. The statistical structure of these interactions can be investigated by posing prediction problems. For example, we can see which subsets of genes in neighboring cells are most predictive of gene expression in target cells. We can infer conditional independence structures by comparing prediction accuracy obtained from different subsets. Existing methods pursuing this vision use fixed-dimensional summaries of the attributes of neighboring cells, ignoring the number of neighbors and the interactions among them. We here propose deepST, a denoising graph convolutional autoencoder that accounts for these subtleties. For a large MERFISH hypothalamus dataset, deepST imputes missing expression levels for response genes more accurately than other state-of-the-art methods including gradient boosting, attaining an 8.7% reduction in absolute error. We also find that gradient boosting itself outperforms existing methods in this domain such as “Mixture of Experts for Spatial Signaling genes Identification”, attaining a 7.2% reduction in absolute error.
This error reduction is critical because we are using differences in predictive accuracy to uncover biological structure, and these biological differences can be on the order of 1% or less.

Coauthors: Jackson Loper and Jeffrey Regier


Subha Maity

Fourth-year PhD Student
Statistics

Does enforcing fairness mitigate biases caused by subpopulation shift?

Abstract

Many instances of algorithmic bias are caused by subpopulation shifts. For example, ML models often perform worse on demographic groups that are underrepresented in the training data. In this paper, we study whether enforcing algorithmic fairness during training improves the performance of the trained model in the target domain. On one hand, we conceive scenarios in which enforcing fairness does not improve performance in the target domain. In fact, it may even harm performance. On the other hand, we derive necessary and sufficient conditions under which enforcing algorithmic fairness leads to the Bayes model in the target domain. We also illustrate the practical implications of our theoretical results in simulations and on real data.

Coauthors: Debarghya Mukherjee, Mikhail Yurochkin and Yuekai Sun


Robert Malinas

Fourth-year PhD Student
EECS

Detecting Changes in the Covariance Structure of a High-Dimensional Random Process

Abstract

Stationarity is a property often assumed of random samples in a variety of statistical techniques. Our purpose is to determine whether an independent random sample is stationary up to second order, i.e., whether the covariance matrix of the observations is homogeneous throughout the sample. Given a sample of size T of an N-dimensional random process, where T is on the order of N, we consider the presence of a single change point 2 ≤ r ≤ T such that the observations indexed {1, …, r} are i.i.d. and have a different covariance matrix than the observations indexed {r + 1, …, T}, also assumed to be i.i.d. Our procedure is to first determine whether r = T and, if not, estimate r. Using random matrix theory and free probability theory, we develop a statistic S(t) such that S(t) converges to 0 in probability, in the proportional growth limit of random matrix theory with respect to N and T, for all 2 ≤ t ≤ T if r = T. Furthermore, if r < T, we show that S(r) is asymptotically greater than or equal to S(t) with high probability for every 2 ≤ t ≤ T. This yields a procedure by which we can detect and estimate the change point by thresholding. Due to the universality of the random matrix theory used, these results hold under mild regularity conditions; in particular, we assume only the existence of second moments. Finally, we discuss convergence rates in probability and the statistical performance of the associated test.

Coauthors: Benjamin D. Robinson and Alfred O. Hero III


Stephen Salerno

Fourth-year PhD Student
Biostatistics

A New Deep Learning Approach for Predicting Survival Processes in the Presence of Semi-Competing Risks

Abstract

Many survival processes involve a non-terminal (e.g., disease progression) and a terminal (e.g., death) event, which form a semi-competing risk relationship, i.e., the occurrence of the non-terminal event is subject to the terminal event. Deep learning has emerged as a powerful tool for survival prediction; however, limited work has been done to predict multi-state or competing risk outcomes, let alone semi-competing outcomes. We propose a new deep learning framework for predicting semi-competing risk outcomes based on the illness-death model, a compartment-type model for transitions between states, which allows us to estimate patient-specific transition hazards, including the sojourn time between events, and patient frailty. As deep learning can recover non-linear risk scores, we test our method by predicting simulated risk surfaces of varying complexity. We apply our method to the Boston Lung Cancer Study, where we study the impact of clinical and genetic predictors on disease progression and mortality, and the Michigan Medicine Precision Health initiative, where we quantify risks for COVID-19 hospitalization and mortality.
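A minimal simulation of the illness-death model with constant (exponential) transition hazards may help fix ideas; the hazard values are invented, and this sketch is not the proposed deep learning method:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented constant hazards: healthy->ill, healthy->death, ill->death
lam01, lam02, lam12 = 0.10, 0.05, 0.20

def simulate_patient():
    """One path through the illness-death model (arbitrary time unit)."""
    t_ill = rng.exponential(1 / lam01)       # latent time to illness
    t_death0 = rng.exponential(1 / lam02)    # latent time to death while healthy
    if t_death0 < t_ill:
        # Death precedes illness: the non-terminal event never occurs,
        # which is exactly the semi-competing structure.
        return None, t_death0
    t_death = t_ill + rng.exponential(1 / lam12)  # post-illness sojourn to death
    return t_ill, t_death

paths = [simulate_patient() for _ in range(20000)]
p_ill_first = np.mean([t_ill is not None for t_ill, _ in paths])
print(p_ill_first)   # about lam01 / (lam01 + lam02) = 2/3
```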

Coauthor: Yi Li


Zeyu Sun

Third-year PhD Student
EECS

Predicting Solar Flares Using CNN and LSTM on Two Solar Cycles of Active Region Data

Abstract

We consider the flare prediction problem of distinguishing flare-imminent active regions, which produce an M- or X-class flare within the next 24 hours, from quiet active regions, which produce no flare within ±24 hours. Using line-of-sight magnetograms and parameters of active regions in two data products covering Solar Cycles 23 and 24, we train and evaluate two deep learning algorithms—CNN and LSTM—and their stacking ensembles. The decisions of the CNN are explained using visual attribution methods. We have three main findings.
(1) LSTM trained on data from two solar cycles achieves significantly higher True Skill Scores (TSS) than that trained on data from a single solar cycle with a confidence level of at least 0.95.
(2) On data from Solar Cycle 23, a stacking ensemble that combines predictions from LSTM and CNN using the TSS criterion achieves significantly higher TSS than the “select-best” strategy with a confidence level of at least 0.95.
(3) A visual attribution method called Integrated Gradients is able to attribute the CNN’s predictions of flares to the emerging magnetic flux in the active region. It also reveals a limitation of CNN as a flare prediction method using line-of-sight magnetograms: it treats the polarity artifact of line-of-sight magnetograms as positive evidence of flares.
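For reference, the True Skill Score used as the selection criterion above is the hit rate minus the false alarm rate; a small implementation sketch:

```python
import numpy as np

def true_skill_score(y_true, y_pred):
    """TSS = hit rate - false alarm rate, ranging over [-1, 1]."""
    y_true = np.asarray(y_true, bool); y_pred = np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred);  fn = np.sum(y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred); tn = np.sum(~y_true & ~y_pred)
    return tp / (tp + fn) - fp / (fp + tn)

# Perfect predictions score 1; an always-positive predictor scores 0.
print(true_skill_score([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(true_skill_score([1, 1, 0, 0], [1, 1, 1, 1]))  # 0.0
```

Unlike raw accuracy, TSS is insensitive to the class imbalance between flaring and quiet regions, which is why it is a common skill measure in flare forecasting.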

Coauthors: Monica Bobra, Yu Wang, Hu Sun, Yang Chen and Alfred Hero


Leyao Zhang

First-year PhD Student
Biostatistics

Adaptive learning of relevant questions from a questionnaire via best subset algorithms

Abstract

The questionnaire is one of the oldest and most widely used instruments for measuring variables relevant to traits of interest that cannot easily be measured by physical devices. This paper is motivated by a cohort study of elderly asthma patients in which we aim to examine associations between clinical outcomes and quality of life (QoL). In many practical studies, including our asthma clinical study, the scope of a questionnaire (e.g., QoL) is ill-suited to a new study population that differs from the original population used for questionnaire development or validation. As a result, items in a questionnaire may or may not be relevant to the new study population. In our analysis, we consider a supervised learning method to identify a subset of questions whose summary score is maximally associated with a specific clinical outcome under investigation. The resulting set of selected items gives an optimal summary metric of the questionnaire, which improves both statistical power and clinical interpretation. Our item extraction procedure is built upon the best subset algorithm implemented via mixed integer programming, which enjoys both a theoretical guarantee of selection consistency and the flexibility to handle non-responses. This best subset algorithm is first evaluated in extensive simulation studies with comparisons to existing methods, and then applied to derive tailored QoL scores adaptive to two clinical outcomes, a lung function measure (FEV1) and the Asthma Control Test (ACT), respectively, among elderly people with persistent asthma.
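For intuition, best subset item extraction can be mimicked on toy data by exhaustive search over item subsets (a brute-force stand-in for the mixed integer programming solver; the items, subset size, and data-generating model below are invented):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, p, k = 500, 8, 3      # patients, questionnaire items, subset size

items = rng.normal(size=(n, p))
# Invented ground truth: outcome driven by items 0, 2 and 5 only.
outcome = items[:, [0, 2, 5]].sum(axis=1) + 0.5 * rng.normal(size=n)

def score(subset):
    """|correlation| between the subset's summary (sum) score and outcome."""
    s = items[:, list(subset)].sum(axis=1)
    return abs(np.corrcoef(s, outcome)[0, 1])

best = max(combinations(range(p), k), key=score)
print(best)   # with this seed, should recover the relevant items (0, 2, 5)
```

Exhaustive search scales as C(p, k) and cannot handle non-responses, which is where the MIP formulation earns its keep.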

Coauthors: Wen Wang, Mengtong Hu, Alan P. Baptist and Peter X.K. Song