Student Presentations
15-min Oral Presentation I
March 9th 9:30AM-12:00PM @ Amphitheatre
Wenshan Yu
PhD Student
Survey and Data Science
Are interviewer variances equal across modes in mixed-mode studies?
Abstract
As mixed-mode designs become increasingly popular, their effects on data quality have attracted much scholarly attention. Most studies have focused on the bias properties of mixed-mode designs; few have investigated whether mixed-mode designs have heterogeneous variance structures across modes. While many factors can contribute to the interviewer variance component, this study investigates whether interviewer variances are equal across modes in mixed-mode studies. We use data collected under two designs to answer the research question. In the first design, where interviewers are responsible for either the face-to-face (FTF) or the telephone (TEL) mode, we examine whether there are mode differences in interviewer variance for 1) sensitive political questions, 2) international attitudes, and 3) item missingness indicators, using the Arab Barometer Wave 6 Jordan data with a randomized mixed-mode design. In the second design, where interviewers are responsible for both modes, we draw on the Health and Retirement Study (HRS) 2016 core survey data to examine the question on three topics: 1) the CESD depression scale, 2) interviewer observations, and 3) the physical activity scale. To account for the lack of interpenetrated designs in both data sources, we include respondent-level demographic variables in our models. Given the limited statistical power of this study, we find significant differences in interviewer variances for one of twelve items in the Arab Barometer study and for three of seventeen items in the HRS. Overall, we find that interviewer variances are larger in FTF than in TEL for sensitive items, whereas for interviewer observations and non-sensitive items the pattern is reversed.
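One plausible multilevel specification for the comparison described above (our notation and simplification, not necessarily the authors' exact model) uses mode-specific interviewer random effects for respondent i assigned to interviewer j in mode m_{ij}:
\[
y_{ij} = x_{ij}^\top \beta + u_j^{\mathrm{FTF}}\,\mathbb{1}\{m_{ij}=\mathrm{FTF}\} + u_j^{\mathrm{TEL}}\,\mathbb{1}\{m_{ij}=\mathrm{TEL}\} + \varepsilon_{ij},
\qquad u_j^{\mathrm{FTF}} \sim N(0, \sigma^2_{\mathrm{FTF}}),\quad u_j^{\mathrm{TEL}} \sim N(0, \sigma^2_{\mathrm{TEL}}),
\]
so that the research question amounts to testing $H_0:\ \sigma^2_{\mathrm{FTF}} = \sigma^2_{\mathrm{TEL}}$, with the respondent-level covariates $x_{ij}$ standing in for the missing interpenetration.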
Hu Sun
PhD Student
Statistics
Tensor Gaussian Process with Contraction for Tensor Regression
Abstract
Tensor data is a prevalent data format in fields such as astronomy and biology. The structured information and high dimensionality of tensor data make it an intriguing but challenging topic for statisticians and practitioners. The low-rank scalar-on-tensor regression model, in particular, has received widespread attention and has been reformulated as a tensor Gaussian process (Tensor-GP) model with a multi-linear kernel. In this paper, we extend the Tensor-GP model by integrating a dimensionality reduction technique called tensor contraction with the Tensor-GP for the tensor regression task. We first estimate a latent, reduced-size tensor for each data tensor and then apply the multi-linear Tensor-GP to the latent tensor data for prediction. We introduce anisotropic total-variation regularization when conducting the tensor contraction to obtain a sparse and smooth latent tensor, and propose an alternating proximal gradient descent algorithm for estimation. We validate our approach via an extensive simulation study and real-data experiments on solar flare forecasting.
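One plausible schematic of the two-stage construction described above (our notation and simplifications, not necessarily the authors' exact formulation): each data tensor $\mathcal{X}_i$ is contracted to a smaller latent tensor via mode products with dimension-reducing factor matrices $A_d$, and the response is modeled by a GP on the latent tensor,
\[
\mathcal{Z}_i = \mathcal{X}_i \times_1 A_1^\top \times_2 A_2^\top \times_3 A_3^\top,
\qquad y_i = f(\mathcal{Z}_i) + \varepsilon_i, \quad f \sim \mathrm{GP}(0, k),
\]
where $k$ is a multi-linear kernel on the latent tensors and the $A_d$ are estimated with an anisotropic total-variation penalty of the form $\sum_d \lambda_d\,\mathrm{TV}(A_d)$ to encourage sparsity and smoothness.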
Yilun Zhu
PhD Student
EECS
Mixture Proportion Estimation Beyond Irreducibility
Abstract
The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and the mixture. Previous work on MPE adopts the \emph{irreducibility} assumption, which ensures identifiability of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further devise a resampling-based algorithm that extends any existing MPE method. This algorithm yields a consistent estimate of the mixture proportion under our more general sufficient condition and empirically exhibits improved estimation performance relative to baseline methods.
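For readers unfamiliar with MPE, the basic setup can be sketched as follows (standard notation, not specific to this paper):
\[
F = \kappa^{*} H + (1-\kappa^{*})\, G,
\]
where $F$ is the mixture distribution, $H$ is the known component, $G$ is the unobserved background distribution, and $\kappa^{*}$ is the mixture proportion to be estimated from samples of $F$ and $H$. Irreducibility requires that $G$ cannot itself be written as a nontrivial mixture containing $H$, which pins down $\kappa^{*}$ as the maximal such weight; the contribution above is a weaker sufficient condition under which $\kappa^{*}$ remains identifiable.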
Jing Ouyang
PhD Student
Statistics
Statistical Inference for Noisy Incomplete Binary Matrix
Abstract
We consider statistical inference for a noisy, incomplete binary (or 1-bit) matrix. Despite the importance of uncertainty quantification to matrix completion, most of the categorical matrix completion literature focuses on point estimation and prediction. This paper moves one step further toward statistical inference for binary matrix completion. Under a popular nonlinear factor analysis model, we obtain a point estimator and derive its asymptotic normality. Moreover, our analysis adopts a flexible missing-entry design that does not require the random sampling scheme assumed by most existing asymptotic results for matrix completion. Under reasonable conditions, the proposed estimator is statistically efficient and optimal in the sense that the Cramér-Rao lower bound is achieved asymptotically for the model parameters. Two applications are considered: (1) linking two forms of an educational test and (2) linking roll call voting records from multiple years in the United States Senate. The first application enables comparisons between examinees who took different test forms, and the second allows us to compare the liberal-conservativeness of senators who did not serve in the Senate at the same time.
Zongyu Li
PhD Student
EECS
Poisson Phase Retrieval in Very Low-count Regimes
Abstract
This paper proposes novel phase retrieval algorithms for maximum likelihood (ML) estimation from measurements following independent Poisson distributions in very low-count regimes, e.g., 0.25 photon per pixel. Specifically, we propose a modified Wirtinger flow (WF) algorithm using a step size based on the observed Fisher information. This approach eliminates all parameter tuning except the number of iterations. We also propose a novel curvature for majorize-minimize (MM) algorithms with a quadratic majorizer. We show theoretically that our proposed curvature is sharper than the curvature derived from the supremum of the second derivative of the Poisson ML cost function. We compare the proposed algorithms (WF, MM) with existing optimization methods, including WF using other step-size schemes, quasi-Newton methods and alternating direction method of multipliers (ADMM) algorithms, under a variety of experimental settings. Simulation experiments with a random Gaussian matrix, a canonical discrete Fourier transform (DFT) matrix, a masked DFT matrix and an empirical transmission matrix demonstrate the following. 1) As expected, algorithms based on the Poisson ML model consistently produce higher quality reconstructions than algorithms derived from Gaussian noise ML models when applied to low-count data. 2) For unregularized cases, our proposed WF algorithm with Fisher information for step size converges faster than other WF methods, e.g., WF with empirical step size, backtracking line search, and optimal step size for the Gaussian noise model; it also converges faster than the quasi-Newton method. 3) In regularized cases, our proposed WF algorithm converges faster than WF with backtracking line search, quasi-Newton, MM and ADMM.
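As background for the cost function being minimized (a standard formulation, not a restatement of the paper's exact algorithm), the Poisson negative log-likelihood for measurements $y_i$ with system vectors $a_i$ and known background counts $b_i$ is, up to constants,
\[
f(x) = \sum_{i=1}^{M} \Bigl[\bigl(|a_i^{H} x|^{2} + b_i\bigr) - y_i \log\bigl(|a_i^{H} x|^{2} + b_i\bigr)\Bigr],
\]
and a Wirtinger flow iteration takes the form $x_{k+1} = x_k - \mu_k \nabla f(x_k)$; the contribution described above includes choosing the step size $\mu_k$ from the observed Fisher information rather than by manual tuning or line search.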
Shihao Wu
PhD Student
Statistics
L0 Constrained Approaches in Learning High-dimensional Sparse Structures: Statistical Optimality and Optimization Techniques
Abstract
Sparse structures are ubiquitous in high-dimensional statistical models. To learn sparse structures from data, non-L0 penalized approaches have been widely used and studied over the past two decades. L0 constrained approaches, however, have been understudied due to their computational intractability, and they have recently regained attention thanks to algorithmic advances in the optimization community and hardware improvements. In this talk, we compare L0 constrained approaches with non-L0 penalized approaches in terms of feature selection in high-dimensional sparse linear regression. Specifically, we focus on false discoveries in the early stage of the solution path, which tracks how features enter and leave the model for a selection approach. Su et al. (2017) showed that LASSO, as a non-L0 penalized approach, suffers from false discoveries in the early stage. We show that best subset selection, as an L0 constrained approach, results in fewer or even zero false discoveries throughout the early stage of the path. We also identify the optimal condition under which best subset selection achieves exactly zero false discoveries, which we refer to as sure early selection. Moreover, we show that to achieve sure early selection, one does not need to obtain an exact solution to best subset selection; a solution within a tolerable optimization error suffices. Extensive numerical experiments also demonstrate the advantages of L0 constrained approaches over non-L0 penalized approaches along the solution path.
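For concreteness, the two families of approaches being contrasted can be written in their standard forms:
\[
\hat{\beta}_{\mathrm{BSS}} \in \arg\min_{\beta} \ \tfrac{1}{2n}\|y - X\beta\|_2^2 \ \ \text{s.t.}\ \ \|\beta\|_0 \le k,
\qquad
\hat{\beta}_{\mathrm{LASSO}} \in \arg\min_{\beta} \ \tfrac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,
\]
where the L0 "norm" counts nonzero coefficients; varying $k$ (or $\lambda$) traces out the solution path whose early stage is studied above.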
Shushu Zhang
PhD Student
Statistics
Estimation and Inference for High-dimensional Expected Shortfall Regression
Abstract
The expected shortfall (also known as the superquantile), defined as the average over the tail below (or above) a certain quantile of a probability distribution, has been recognized as a coherent risk measure for characterizing the tail of a distribution in many applications such as risk analysis. Expected shortfall regression provides a powerful tool for learning the relationship between a response variable and a set of covariates while exploring the heterogeneous effects of the covariates. We are particularly interested in health disparity research, in which the lower/upper tail of the conditional distribution of a health-related outcome, given high-dimensional covariates, is of importance. To this end, we propose penalized expected shortfall regression with the lasso penalty to encourage the resulting estimator to be sparse. We establish explicit non-asymptotic bounds on estimation errors in increasing-dimension settings. To perform statistical inference on a covariate of interest, we propose a debiased estimator and establish its asymptotic normality for valid inference. We illustrate the finite-sample performance of the proposed methods through numerical studies and a data application on health disparity.
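As a reminder of the target quantity (standard definition; the penalized form shown is a sketch following the description above, not the authors' exact estimator), the lower-tail expected shortfall at level $\tau$ is
\[
\mathrm{ES}_\tau(Y \mid X=x) = \frac{1}{\tau}\int_0^{\tau} Q_u(Y \mid X=x)\, du
= E\bigl[\,Y \mid Y \le Q_\tau(Y\mid X=x),\, X=x\,\bigr],
\]
and a lasso-penalized expected shortfall regression takes the generic form $\hat{\beta} \in \arg\min_\beta \widehat{R}_\tau(\beta) + \lambda\|\beta\|_1$ for a suitable expected-shortfall loss $\widehat{R}_\tau$.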
5-min Speed Oral Presentation
March 9th 2:00PM-3:45PM @ Amphitheatre
Jiazhi Yang
Master’s Student
Survey and Data Science
Weighting Adjustments for Person-Day Nonresponse: An Application to the National Household Food Acquisition and Purchase Survey
Abstract
Multi-day diary surveys, such as the U.S. National Household Food Acquisition and Purchase Survey (FoodAPS), request that participants provide data on a daily basis. These surveys are therefore subject to various nonsampling errors, especially daily nonresponse. Standard post-survey nonresponse adjustment methods include weighting and imputation; both require auxiliary information available for the entire eligible sample to reduce nonresponse bias in estimates. Previous research using FoodAPS has focused on imputation. In this study, we explore a new weighting methodology that constructs person-day level weights based on an individual's nonresponse pattern, using FoodAPS data as a case study. Notably, the nonresponse-adjusted household weights in the FoodAPS public-use dataset are not sufficient for addressing day-level nonresponse in multi-day diary surveys. We first analyze the relationship between the day-level response pattern and auxiliary variables related to key outcomes such as food acquisition events and expenditures, and then use several analytic approaches, such as logistic regression and classification trees, to predict the response propensity for each person on each day. Finally, we use the model with the highest prediction accuracy to construct individual-level weights for each day and compare the daily estimates before and after weighting. Our results indicate that the logistic model tends to outperform the classification tree approach, yielding the highest prediction accuracy. Regarding the estimates, we find that estimates applying the adjusted person-level weights have smaller standard errors than the unweighted estimates, and the daily weights also introduce shifts in the daily estimates.
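A minimal sketch of the person-day propensity-weighting idea described above, assuming a hypothetical person-day data frame with a response indicator and illustrative auxiliary variables (the column names are invented for illustration; this is not the FoodAPS analysis code):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical person-day records: one row per sampled person per diary day.
day_df = pd.DataFrame({
    "responded":   [1, 0, 1, 1, 0, 1],           # 1 = provided diary data that day
    "day_of_week": [1, 2, 3, 5, 6, 7],
    "hh_size":     [2, 2, 4, 1, 3, 3],
    "snap":        [0, 0, 1, 1, 0, 1],            # illustrative auxiliary variable
    "base_weight": [1200.0, 1200.0, 800.0, 950.0, 700.0, 700.0],
})

X = day_df[["day_of_week", "hh_size", "snap"]]
y = day_df["responded"]

# Model the day-level response propensity (a classification tree could be swapped in).
propensity = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Inverse-propensity adjustment of the base weight for responding person-days.
day_df["adj_weight"] = day_df["base_weight"] / propensity
print(day_df.loc[day_df["responded"] == 1, ["base_weight", "adj_weight"]])
```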
Rupam Bhattacharyya
PhD Student
Biostatistics
BaySyn: Bayesian Evidence Synthesis for Multi-system Multiomic Integration
Abstract
The discovery of cancer drivers and drug targets is often limited to individual biological systems, e.g., cancer models or patients. While multiomic patient databases have sparse drug response data, model system databases provide smaller lineage-specific sample sizes, resulting in reduced power to detect functional drivers and their associations with drug sensitivity. Hence, integrating evidence across model systems can more efficiently deconvolve cancer cellular mechanisms and learn therapeutic associations. To this end, we propose BaySyn, a hierarchical Bayesian evidence synthesis framework for multi-system multiomic integration. BaySyn detects functional driver genes based on their associations with upstream regulators and uses this evidence to calibrate Bayesian variable selection models in the outcome layer. We apply BaySyn to multiomic datasets from CCLE and TCGA across pan-gynecological cancers. Our mechanistic models implicate several functional genes, such as PTPN6 and ERBB2, in the KEGG adherens junction gene set. Further, at similar Type I error control, our outcome model makes more discoveries in drug response models than uncalibrated models, such as BCL11A (breast) and FGFRL1 (ovary).
Kevin Smith
PhD Student
Industrial and Operations Engineering
Leveraging Observational Data to Estimate Adherence-Improving Treatment Effects for Stone Formers
Abstract
Randomized controlled trials (RCTs) are the gold-standard protocol for evaluating medication effectiveness on a primary outcome. Once the medication effect is established, it can be prescribed to patients, but many patients, such as those prescribed preventative medication for kidney stone formation, exhibit insufficient medication filling patterns (which we define as adherence) and thus will not realize the medication's RCT-cited outcome benefit. Insufficient adherence to preventative medications has been observed in a variety of medical domains and may justify a demand for adherence-improving interventions. Before adopting an adherence-improving intervention for clinical practice, an RCT may be proposed to evaluate its effect. Besides their prohibitively high costs, RCTs also suffer from a monitoring bias that may bias upward the estimated causal effect of an adherence-improving intervention and thus may not be appropriate in this setting. Instead, we leverage observational data, together with weighted outcome regression, subclassification, and an augmented inverse probability weighting method, to estimate the average treatment effect of an intervention that may be used to improve medication adherence in stone formers. We also discuss future opportunities for maximizing the utility of observational data using powerful statistical estimators.
Zeyu Sun
PhD Student
EECS
Event Rate Based Recalibration of Solar Flare Prediction
Abstract
Solar flare forecasting can be prone to prediction bias, producing poorly calibrated and unreliable predictions. A major cause of prediction bias is varying event rates between the training and testing phases, which arise either from data collection (e.g., collecting the training set and the test set from different phases of a solar cycle) or from data processing (e.g., resampling the training data to tackle class imbalance). Much of the research on machine learning methods for solar flare forecasting has not addressed the prediction bias caused by shifting event rates. In this paper, we propose a simple yet effective calibration technique for the case when the event rate in the test set is known, and an Expectation-Maximization algorithm for the case when it is unknown. We evaluate our calibration technique on various machine learning methods (logistic regression, quadratic discriminant analysis, decision trees, gradient boosting trees, and multilayer perceptrons) trained and evaluated on Space-Weather HMI Active Region Patches (SHARPs) from 2010 to 2022, roughly covering Solar Cycle 24. The experimental results show that, under event rate shift, the proposed calibration generally improves reliability, discrimination, and various skill scores as measured by the Brier skill score (BSS) and Heidke skill score (HSS). While the proposed calibration decreases the true skill statistic (TSS), which is insensitive to the event rate on the target domain, it preserves the peak TSS obtained by varying the probability threshold. We provide a decision-theoretic explanation for the degradation of TSS. Our method is quite general and can be applied to a wide variety of probabilistic flare forecast methods to improve their calibration and discrimination.
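When the test-set event rate is known, a simple closed-form prior-shift correction exists (the standard adjustment for a shifted class prior; shown here as a sketch, not necessarily the paper's exact formula):

```python
import numpy as np

def recalibrate(p, pi_train, pi_test):
    """Prior-shift correction of predicted flare probabilities.

    Re-weights the positive-class probability by the ratio of target to
    training event rates and renormalizes (standard class-prior adjustment).
    """
    p = np.asarray(p, dtype=float)
    num = p * (pi_test / pi_train)
    den = num + (1.0 - p) * ((1.0 - pi_test) / (1.0 - pi_train))
    return num / den

# Example: scores trained on a resampled set with 50% positives,
# recalibrated to a hypothetical 3% flare event rate in the test period.
scores = np.array([0.2, 0.5, 0.8])
print(recalibrate(scores, pi_train=0.5, pi_test=0.03))
```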
Sehong Oh
Master’s Student
EECS
Anomaly detection via Pattern dictionary and Atypicality
Abstract
Anomaly detection is becoming increasingly important in diverse research areas, for example to detect fraud or reduce risk. However, anomaly detection is a difficult task due to the lack of anomalous data. For this reason, we propose a data-compression-based anomaly detection method for unlabeled time series and sequence data. The method constructs two features, typicality and atypicality, to distinguish anomalies using clustering methods. The typicality of a test sequence is calculated by how well the sequence is compressed by a pattern dictionary built from the frequencies of all patterns in a training sequence. Typicality itself can be used as an anomaly score to detect anomalous data with a certain threshold. To improve the performance of the pattern dictionary method, we use atypicality to identify sequences that are better compressed on their own than by the pattern dictionary. The typicality and atypicality of each sub-sequence in the test sequence are then calculated, and anomalous sub-sequences are identified by clustering them. Thus, we improve the pattern dictionary method by incorporating typicality and atypicality without thresholds. Namely, our proposed method can cover more anomalous cases than when only typicality or atypicality is considered.
Kiran Kumar
PhD Student
Biostatistics
Meta Imputing Low Coverage Ancient Genomes
Abstract
Background: Low coverage imputation is an essential tool for downstream analysis in ancient DNA (aDNA). However, appropriate reference panels are difficult to obtain. Only small numbers of high-quality ancient sequences are available and imputing aDNA from modern day reference panels leads to reference bias, especially for older non-European samples.
Methods: We evaluate meta-imputation, an imputation strategy that combines estimates from multiple reference panels using dynamically estimated weights, as a tool to optimize the imputation of aDNA samples. We meta-impute by combining imputation results from a smaller, more targeted ancient panel with a larger, primarily European-ancestry reference panel (Haplotype Reference Consortium).
Results: We expect that our low coverage meta-imputed samples provide higher allelic R^2 and higher genomic concordance, thus increasing the accuracy of imputation in ancient samples.
Significance: Increasing the accuracy of imputation in ancient samples allows for better performance in downstream analyses such as PCA or estimating Runs of Heterozygosity (ROH), enlarging our understanding of our shared ancient past.
Mengqi Lin
PhD Student
Statistics
Identifiability of Cognitive Diagnostic Models with polytomous responses
Abstract
Cognitive Diagnostic Models (CDMs) are a powerful tool that allows researchers and practitioners to learn fine-grained diagnostic information about respondents' latent attributes. Within this framework, increasing attention is being paid to polytomous response data as polytomous tests become more and more popular. As in many other latent variable models, identifiability is crucial for consistent estimation of the model parameters and valid statistical inference. However, existing identifiability results are mostly focused on binary responses, and identifiability for models with polytomous responses has scarcely been considered. This paper fills this gap and provides a sufficient and necessary condition for the identifiability of the basic and popular DINA model with polytomous responses.
Yumeng Wang
PhD Student
Statistics
ReBoot: Distributed statistical learning via refitting Bootstrap samples
Abstract
With the data explosion in the digital era, it is common for modern data to be distributed across multiple or even a large number of sites. However, there are two salient challenges in analyzing decentralized data: (a) communication of large-scale data between sites is expensive and inefficient; (b) data are not allowed to be shared for privacy or legal reasons. To address these challenges, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. Theoretically, we analyze the statistical rate of ReBoot for generalized linear models (GLMs) and noisy phase retrieval, which represent convex and non-convex problems, respectively. ReBoot achieves the full-sample statistical rate in both cases whenever the subsample size is not too small. In particular, we show that the systematic bias of ReBoot, the error that is independent of the number of subsamples, is O(n^{-2}) in GLMs, where n is the subsample size. A simulation study illustrates the statistical advantage of ReBoot over competing methods, including averaging and CSL (Communication-efficient Surrogate Likelihood). In addition, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks for image classification, which exhibits substantial superiority over FedAvg within early rounds of communication.
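A minimal sketch of one plausible reading of the refit-on-bootstrap idea for a logistic GLM: each site fits a local model and ships only its coefficients; the center simulates synthetic samples from each local fit, pools them, and refits. This is an illustration under our own simplifying assumptions (e.g., synthetic covariates regenerated from a standard normal), not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_sites, n_per_site, n_boot = 5, 4, 500, 2000
beta_true = rng.normal(size=d)

def simulate(n, beta):
    X = rng.normal(size=(n, d))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    return X, y

# Step 1: each site fits a local logistic GLM and communicates only its coefficients
# (a large C makes the fit essentially unpenalized maximum likelihood).
local_coefs = []
for _ in range(n_sites):
    X, y = simulate(n_per_site, beta_true)
    local_coefs.append(LogisticRegression(C=1e6, max_iter=1000).fit(X, y).coef_.ravel())

# Step 2: the center draws synthetic (bootstrap) samples from each fitted local model
# and pools them.
Xb, yb = [], []
for coef in local_coefs:
    Xs, ys = simulate(n_boot, coef)
    Xb.append(Xs)
    yb.append(ys)
X_pool, y_pool = np.vstack(Xb), np.concatenate(yb)

# Step 3: refit on the pooled synthetic sample to obtain the one-shot distributed estimate.
beta_reboot = LogisticRegression(C=1e6, max_iter=1000).fit(X_pool, y_pool).coef_.ravel()
print(np.round(beta_true, 2))
print(np.round(beta_reboot, 2))
```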
Yifan Hu
Master’s Student
Statistics
Establishing An Optimal Individualized Treatment Rule for Pediatric Anxiety with Longitudinal Modeling for Evaluation
Abstract
An Individualized Treatment Rule (ITR) is a special case of a dynamic treatment regimen that inputs information about a patient and recommends a treatment based on this information. This research contributes to the literature by estimating an optimal ITR, one that maximizes a pre-specified outcome, the Pediatric Anxiety Rating Scale (PARS), to guide which of two common treatments to provide for children/adolescents with separation anxiety disorder (SAD), generalized anxiety disorder (GAD), or social phobia (SOP): sertraline medication (SRT) or cognitive behavior therapy (CBT). We use data from the Child and Adolescent Anxiety Multimodal Study (CAMS), a completed federally funded, multi-site, randomized placebo-controlled trial in which 488 children with anxiety disorders were randomized to CBT, SRT, their combination (COMB), or pill placebo (PBO). There are four steps to the analysis: (1) split the data into training (70%) and evaluation (30%) sets and transform and scale the response PARS. In the training data set: (2) screen the baseline covariates, consisting of patients' demographic information and historical clinical records, according to their contribution to PARS using a specified variable screening algorithm; (3) establish an interpretable and parsimonious ITR based on the screened covariates to guide clinicians in deciding personalized treatment plans for patients with pediatric anxiety disorders. In the evaluation data set: (4) evaluate the effectiveness of the ITR versus the traditional treatments (SRT only, CBT only, and COMB) for pediatric anxiety disorder using causal effect estimation based on comparisons of the longitudinal trajectories of clinical outcomes from linear mixed-effects models. Our initial results are promising for two reasons: (1) the final ITR is simple, relying on only the two most significant covariates, which should be feasible and easy to understand for clinicians in a real-world trial; (2) the longitudinal evaluation shows that the ITR is non-inferior to the best-performing treatment in the trial, COMB.
Xinyu Liang
Master’s Student
Statistics
Social Network Analysis of Securities Analysts' Academic Networks and Their Impact on Analyst Performance: Star Analysts in the Banking Industry as an Example
Abstract
It is known that alumni and co-working relationships between analysts and company executives improve analysts' performance. This paper, however, focuses on networks among analysts, taking star securities analysts in the banking industry as an example to explore the impact of academic cooperation networks on the performance of securities analysts. Using the Wilcoxon rank test, I verify that analysts use academic networks to interact, promoting the exchange of information and enhancing their research ability; therefore, the existence of academic cooperation networks among analysts positively affects their performance. At the same time, I find that an overly large academic cooperation network can have a negative effect on analysts' performance, since analysts may engage in excessive information exchange that undermines their independent judgment. Through Monte Carlo simulation, I show that the formation of academic cooperation networks is random to some extent, that is, it fits a small-world model. Using Cramér's V and Rajski's method to examine the drivers of network formation, I find that gender is one factor in the formation of analysts' academic networks (male analysts are more likely to form a network) and that analysts' working relationships also contribute to the formation of their academic networks.
Robert Malinas
PhD Student
EECS
An Improvement on the Hotelling T^2 Test Using the Ledoit-Wolf Nonlinear Shrinkage Estimator
Abstract
Hotelling's T^2 test is a classical approach for comparing the means of two multivariate normal samples that share a population covariance matrix. Hotelling's test is not ideal for high-dimensional samples because the eigenvalues of the estimated sample covariance matrix are inconsistent estimators of their population counterparts. We replace the sample covariance matrix with the nonlinear shrinkage estimator of Ledoit and Wolf (2020). We observe empirically for sub-Gaussian data that the resulting algorithm dominates past methods (Bai and Saranadasa 1996, Chen and Qin 2010, and Li et al. 2020) for a family of population covariance matrices that includes matrices with high or low condition number and many or few nontrivial (i.e., spiked) eigenvalues. Finally, we present performance estimates for the test.
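For reference, the classical two-sample statistic being modified has the standard form
\[
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^\top \widehat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_2),
\]
where $\widehat{\Sigma}$ is ordinarily the pooled sample covariance matrix; the proposal above replaces $\widehat{\Sigma}$ with the Ledoit-Wolf nonlinear shrinkage estimator so that the inverse is better behaved when the dimension is comparable to the sample sizes.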
Huda Bashir
Master’s Student
Epidemiology & Public Policy
Racial residential segregation and adolescent birth rates in Brazil: a cross-sectional study in 152 cities, 2014-2016
Abstract
Background: In Brazil, the Adolescent Birth Rate (ABR) is 55 live births per 1,000 adolescent women 15-19 years old, higher than the global ABR. Few studies have examined manifestations of structural racism such as racial residential segregation (RRS) in relation to ABR.
Methods: Using pooled data on ABR (2014-2016), we examined the association between RRS and ABR in 152 Brazilian cities using generalized estimating equations. RRS was measured for each city using the isolation index for black/brown Brazilians and was included in models as a categorical variable (low: ≤0.3; medium: 0.3–0.6; high: >0.6). ABR was defined as the number of live births per 1,000 adolescent women aged 15-19. We fit a series of regression models, subsequently adjusting for city-level characteristics that may confound or partially explain the association between RRS and ABR.
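For context, the isolation index referenced above commonly takes the standard form (shown for a city partitioned into areal units; notation ours):
\[
{}_x P_x^{*} \;=\; \sum_{i=1}^{I} \frac{b_i}{B}\cdot\frac{b_i}{t_i},
\]
where $b_i$ is the black/brown population of areal unit $i$, $B=\sum_i b_i$ is the city's total black/brown population, and $t_i$ is the total population of unit $i$; values near 1 indicate that group members live mainly among other group members.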
Findings: The ABR in our study sample was 54 live births per 1,000 adolescent women. We observed 27 cities with low RRS, 75 cities with medium RRS, and 50 with high RRS. After adjustment for city-level characteristics, medium RRS was associated with a 15% higher ABR (risk ratio (RR): 1.15 [1.13, 1.16]) and high RRS with a 10% higher ABR (RR: 1.10 [1.00, 1.20]) compared to cities with low RRS.
Interpretation: RRS is significantly associated with ABR independent of other city-level characteristics. These findings have implications for future policies and programs designed to reduce racial inequities in ABR in Brazilian cities.
Stephanie Morales
PhD Student
Survey and Data Science
Assessing Cross-Cultural Comparability of Self-Rated Health and Its Conceptualization through Web Probing
Abstract
Self-rated health (SRH) is a widely used question across different fields, as it is simple to administer yet has been shown to predict mortality. SRH asks respondents to rate their overall health, typically using Likert-type response scales (i.e., excellent, very good, good, fair, poor). Although SRH is commonly used, few studies have examined its conceptualization from the respondents' point of view, and even fewer have examined differences in its conceptualization across diverse populations. This study aims to assess the comparability of SRH across different cultural groups by investigating the factors that respondents consider when responding to the SRH question. We included an open-ended probe asking what respondents thought about when responding to SRH in web surveys conducted in five countries: Great Britain, Germany, the U.S., Spain, and Mexico. In the U.S., we targeted six racial/ethnic and linguistic groups: English-dominant Koreans, Korean-dominant Koreans, English-dominant Latinos, Spanish-dominant Latinos, non-Latino Black Americans, and non-Latino White Americans. Our coding scheme was first developed for English responses and adapted to fit additional countries and languages. Among the four English-speaking coders, two were also fluent in Spanish, one in German, and one in Korean; coders translated non-English responses into English before coding, and all responses were coded. One novelty of our study is allowing multiple attribute codes (e.g., health behaviors, illness) per respondent, along with a tone (e.g., in the direction of positive or negative health, or neutral) for each attribute, which allows us 1) to assess respondents' thinking processes holistically and 2) to examine whether and how respondents mix attributes. Our study compares the number of reported attributes and tone across cultural groups and integrates SRH responses in the analysis. This study aims to provide a deeper understanding of SRH by revealing the cognitive processes among diverse populations and is expected to shed light on its cross-cultural comparability.
Declan McNamara
PhD Student
Statistics
Likelihood-Free Inference for Deblending Galaxy Spectra
Abstract
Many galaxies overlap visually with other galaxies from the vantage point of Earth. As new astronomical surveys peer deeper into the universe, the proportion of overlapping galaxies, called blends, that are detected will increase dramatically. Undetected blends can lead to errors in the estimates of cosmological parameters, which are central to the field of cosmology. To detect blends, we propose a generative model based on a state-of-the-art simulator of galaxy spectra and adopt a recently developed likelihood-free approach to performing approximate posterior inference. Our experiments demonstrate the potential of our method to detect blends in high-resolution spectral data from the Dark Energy Spectroscopic Instrument (DESI).
Poster Session
March 9th 3:45PM-5:30PM @ East/West Conference Room
1
Wenchu Pan
Master’s Student
Biostatistics
Small Sample Adjustments of Variance Estimators in Clustered Dynamic Treatment Regimen
Abstract
Dynamic interventions are becoming more and more popular for improving the outcomes of multi-stage or longitudinal treatments. A Dynamic Treatment Regime (DTR) is a sequence of pre-specified decision rules based on the outcomes of treatment and the covariates of a subject or group. The sequential multiple-assignment randomized trial (SMART) is a useful tool for studying cluster-level dynamic treatment regimens, in which randomization occurs at the cluster level and outcomes are observed at the individual level. In this paper, we review several GEE variance estimators adjusted for small-sample inference and adapt them to SMARTs. The main challenges lie in the inverse probability weighting and the counterfactual nature of the GEE estimating equations. Through multiple simulations, we evaluate the performance of different working assumptions and estimators across parameter settings, and find that some simple adjustments, such as the degrees-of-freedom adjustment and estimated weights, can greatly improve the small-sample performance of sandwich estimators.
2
Cody Cousineau
PhD Student
Nutritional Sciences
Cross-sectional association between blood cholesterol and calcium levels in genetically diverse strains of mice
Abstract
Genetically diverse outbred mice allow for the study of genetic variation in the context of high dietary and environmental control. Using a machine learning approach, we investigated clinical and morphometric factors that associate with serum cholesterol levels in 844 genetically unique mice of both sexes, fed either a control chow or a high-fat, high-sucrose diet. We find the expected elevations of cholesterol in male mice, in mice with elevated serum triglycerides, and in mice fed the high-fat, high-sucrose diet. The third strongest predictor was serum calcium, which correlated with serum cholesterol across both diets and sexes (r=0.39-0.48). This is in line with several human cohort studies that show associations between calcium and cholesterol and identify calcium as an independent predictor of cardiovascular events.
3
Xinyu Zhang
PhD Student
Survey and Data Science
Dynamic Time-to-Event Models for Future Call Attempts Required Until Interview or Refusal
Abstract
The rising cost of survey data collection is an ongoing concern. Cost predictions can be used to make more informed decisions about allocating resources efficiently during data collection. However, telephone surveys typically do not provide a direct measure of case-level costs. As an alternative, we propose using the number of call attempts as a proxy cost indicator. To improve cost predictions, we dynamically adjust predictive models for the number of future call attempts required until interview or refusal during the nonresponse follow-up. This updating is achieved by fitting models on the training-set cases that are still unresolved at the cutoff point, which allows us to incorporate additional paradata collected on each case at later call attempts. We use data from the Health and Retirement Study to evaluate the ability of alternative models to predict the number of future call attempts required until interview or refusal. These models include a baseline model with only time-invariant covariates (discrete-time hazard regression), accelerated failure time regression, survival trees, and Bayesian additive regression trees within the framework of accelerated failure time models.
4
Savannah Sturla
PhD Student
Environmental Health Sciences
Urinary paraben and phenol concentrations associated with inflammation markers among pregnant women in Puerto Rico
Abstract
Exposure to phenols and parabens may contribute to increased maternal inflammation and adverse birth outcomes, but these effects are not well studied in humans. This study aimed to investigate relationships between concentrations of 8 phenols and 4 parabens and 6 inflammatory biomarkers (C-reactive protein, matrix metalloproteinases (MMP) 1, MMP2, MMP9, intercellular adhesion molecule-1 (ICAM-1), and vascular cell adhesion molecule-1) repeatedly measured across pregnancy in the Puerto Rican PROTECT birth cohort. Exposures were measured using tandem mass spectrometry in spot urine samples. Serum inflammation biomarkers were measured using customized Luminex assays. Linear mixed models and multivariate regression models were used, adjusting for covariates of interest. Effect modification by fetal sex and study visit was also tested. Results are expressed as the percent change in outcome per interquartile-range increase in exposure. In preliminary analyses, significant negative associations were found, for example, between triclosan and MMP2 (-6.18%, CI: -10.34, -1.82), benzophenone-3 and ICAM-1 (-4.21%, CI: -7.18, -1.15), and bisphenol-A and MMP9 (-5.12%, CI: -9.49, -0.55). Fetal sex and study visit significantly modified several associations. We are additionally exploring these associations using mixture methods, through the summation of the different phenol and paraben metabolites and adaptive elastic net. Thus far, our results suggest that phenols and parabens may disrupt inflammatory processes pertaining to uterine remodeling and endothelial function, with important implications for pregnancy outcomes. The negative relationships may indicate that these exposures contribute to the downregulation of regulatory immune cells or to inflammatory imbalances through downstream mechanisms. More research is needed to further understand these immune responses in an effort to improve reproductive and developmental outcomes.
5
Youqi Yang
Master’s Student
Biostatistics
What can we learn from observational data riddled with selection bias? A case study using the COVID-19 Trends and Impact Survey on COVID-19 vaccine uptake and hesitancy in India and the US during 2021
Abstract
Recent years have witnessed rapid growth in the application of online surveys in epidemiological studies and policy-making processes. However, given their non-probabilistic designs, the validity of any statistical inference is challenged by potential selection bias, including coverage bias and non-response bias. In the context of COVID-19 vaccination, the existence of official benchmark data gives researchers the opportunity to quantify the estimation error and the resulting effective sample size. In this study, we first compared estimates of the Indian adult COVID-19 vaccination rate between May and September 2021 from a large non-probabilistic survey, the COVID-19 Trends and Impact Survey (CTIS; average weekly sample size = 25,000), and a small probabilistic survey, the Center for Voting Options and Trends in Election Research survey (CVoter; average weekly sample size = 2,700), against benchmark data from the COVID Vaccine Intelligence Network (CoWIN). We found that CTIS overestimated the overall vaccine uptake and had a surprisingly smaller effective sample size than CVoter. In the second part, we investigated whether the non-probabilistic survey, CTIS, could provide more accurate estimates for the following quantities than for the overall vaccination rate: (1) successive differences and relative successive differences, (2) gender differences, (3) times of abrupt change, and (4) model-assisted estimates of vaccine hesitancy using vaccine uptake. We found that although CTIS overestimated the overall vaccination rate, it estimated these four sets of parameters better. Our study confirms the disparity between benchmark data and non-probabilistic surveys found in previous studies. We also show that a non-probabilistic survey can provide more accurate estimates in certain settings despite its biased selection mechanism.
6
Karen (Kitty) Oppliger
Master’s Student
Nutritional Sciences
Differential disability risk among gender and ethnic groups by substitution of animal- with plant-protein
Abstract
Background: Disability in middle age has increased in prevalence in the US in recent decades, related to chronic illness and non-communicable disease. High intake of animal protein has been associated with many cardiometabolic and cognitive disorders that may contribute to disability risk.
Objective: This study evaluated the differential effect of plant protein intake on disability outcomes among different demographic subgroups.
Methods: Data from 1,983 adults aged 55-64 were acquired from the longitudinal Health and Retirement Study (2012-2018) and used to conduct hazard ratio analyses. Intakes were adjusted using the nutrient density approach to evaluate the substitution of animal protein with plant protein and were stratified among subgroups.
Conclusions: Protective effects of plant protein intake were observed but were not significant in most groups. Inconsistent significant decreases in disability outcomes occurred among women and Hispanic populations. Substitution of animal protein with plant protein had no significant effect on disability outcomes.
7
Irena Chen
PhD Student
Biostatistics
Individual variances as a predictor of health outcomes: investigating the associations between hormone variabilities and bone trajectories in the midlife
Abstract
Women are at increased risk of bone loss around the menopausal transition. The longitudinal relationships between hormones such as estradiol (E2) and follicle-stimulating hormone (FSH) and bone health outcomes are complex for women as they transition through midlife. Furthermore, the effect of individual hormone variabilities on predicting longitudinal bone health outcomes has not yet been well explored. We introduce a joint model that characterizes both mean individual hormone trajectories and the individual residual variances in the first submodel and then uses these estimates to predict bone health trajectories in the second submodel. We found that higher E2 variability was associated with faster decreases in bone mineral density (BMD) around the final menstrual period. Higher E2 variability was also associated with smaller increases in bone area as women transition through menopause. Higher FSH variability was associated with larger declines in BMD around menopause, but this association was moderated over time after the menopausal transition. Higher FSH variability was also associated with smaller increases in bone area post-menopause.
8
Qikai Hu
Master’s Student
Statistics
Simulation Study for Predicting Solar Flares with Machine Learning
Abstract
We systematically test the gap between operational prediction and research-based prediction of whether an active region (AR) will produce a flare of class Γ in the next 24 hr. We consider Γ to be the ≥M (strong flare) and ≥A (any flare) classes. The input features are time sequences of 20 magnetic parameters from the Space-Weather HMI Active Region Patches (SHARPs). The data are generated under a newly constructed simulation model, using the bootstrap and adjustable assumptions, built by analyzing ARs from June 2012 to June 2022 and their associated flares identified in the Geostationary Operational Environmental Satellite (GOES) X-ray flare catalogs. The skill scores from the test show statistically significant variation when different split methods, loss functions, and prediction methods from recently published papers are applied to simulated datasets with different positive-to-negative ratios and feature distributions.
9
Mallika Ajmani
Master’s Student
Epidemiology
Kidney Function is Associated with Cognitive Status in the United States Health and Retirement Study
Abstract
Background and Objectives: The kidneys and the brain are both susceptible to vascular damage due to similar anatomic and hemodynamic features. Several studies show that renal function is associated with brain health; however, most are limited by small sample sizes and lack representation of under-researched populations in the United States. In a large and diverse representative sample of older adults in the United States, we tested the association between the glomerular filtration rate (a marker of kidney function) and cognitive status.
Methods: This study used a cross-sectional 2016 sample of the United States Health and Retirement Study. Our analytical sample included 9,126 participants with complete information on important covariates of interest. Cognitive status was categorized as cognitively normal, cognitive impairment non-dementia (CIND), and dementia. The estimated glomerular filtration rate (eGFR) was computed using the CKD-EPI creatinine equation in two different ways: i) the standard implementation, and ii) ignoring the race multiplier. We used ANOVA to test associations between eGFR and cognition.
Results: Our sample consisted of 73.4% White and 17.9% Black participants with an average age of 69 years. The average eGFR for Black participants using the standard approach was 81.7 ml/min, compared to 70.5 ml/min when ignoring the race multiplier. The prevalence of dementia in our sample was 3.8%, and that of cognitive impairment non-dementia was 17%. Lower average eGFR values were observed among participants with dementia (66.54 ml/min) and CIND (69.33 ml/min) compared to those with normal cognition (76.19 ml/min) (p-value < 0.001).
Conclusions: Individuals with low eGFR had higher odds of experiencing worse cognition. Standard eGFR calculations may underestimate kidney impairment among Black participants. Next steps will be to test the association between GFR and cognitive status using multivariable analysis.
10
James Edwards
Undergraduate Student
Data Science
Financial and Information Aggregation Properties of Gaussian Prediction Markets
Abstract
Prediction markets offer an alternative to polls and surveys for the elicitation and combination of private beliefs about uncertain events. The advantages of prediction markets include time-continuous aggregation and score-based incentives for truthful belief revelation. Traditional prediction markets aggregate point estimates of forecast variables. However, exponential family prediction markets (Abernethy et al., 2014) provide a framework for eliciting and combining entire belief distributions of forecast variables. We study a member of this family, Gaussian markets, which combine the private Gaussian belief distributions of traders about the future realized value of some real random variable. Specifically, we implement a multi-agent simulation environment with a central Gaussian market maker and a population of Bayesian traders. Our trader population is heterogeneous, separated on two variables: informativeness, or how much information a trader privately possesses about the random variable, and budget. We provide novel methods for modeling both attributes: we model the trading decision as the solution to a constrained optimization problem, and informativeness as the degree of variance in the traders’ information sources. Within our market ecosystem, we analyze the impact of trader budget and informativeness, as well as the arrival order of traders, on the market’s convergence. We also study financial properties of the market such as trader compensation and market maker loss.
11
Qinmengge Li
PhD Student
Biostatistics
Bregman Divergence-Based Data Integration with Application to Polygenic Risk Score (PRS) Heterogeneity Adjustment
Abstract
Polygenic risk scores (PRS), developed as the sum of single-nucleotide polymorphisms (SNPs) weighted by the risk allele effect sizes estimated in published genome-wide association studies, have recently received much attention for genetic risk prediction. While successful for the Caucasian population, PRS based on minority population cohorts suffer from limited event rates, small sample sizes, high dimensionality, and low signal-to-noise ratios, exacerbating already severe health disparities. Due to population heterogeneity, direct trans-ethnic prediction that applies the Caucasian model to a minority population also has limited performance. As a result, it is desirable to design a data integration procedure that measures the differences between populations and optimally balances the information from them to improve the prediction stability for minority populations. A unique challenge here is that, due to data privacy, individual genotype data are not accessible for either the Caucasian population or the minority population. Therefore, new data integration methods based only on encrypted summary statistics are needed. To address these challenges, we propose a BRegman divergence-based Integrational Genetic Hazard Trans-ethnic (BRIGHT) estimation procedure to transfer the information learned from PRS across ancestries. The proposed method requires only published summary statistics and can be applied to improve the performance of PRS for ethnic minority groups, accounting for challenges including potential model misspecification, data heterogeneity, and data-sharing constraints. We provide the asymptotic consistency and weak oracle property of the proposed method. Simulations show the prediction and variable selection advantages of the proposed method when applied to heterogeneous datasets. A real data analysis constructing psoriasis PRS for a South Asian population also confirms the improved model performance.
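For readers unfamiliar with the divergence in the method's name, the Bregman divergence generated by a differentiable convex function $\phi$ is (standard definition)
\[
D_{\phi}(u, v) = \phi(u) - \phi(v) - \langle \nabla \phi(v),\, u - v\rangle,
\]
which recovers, e.g., the squared Euclidean distance for $\phi(u)=\|u\|_2^2$ and the Kullback-Leibler divergence for the negative entropy. As described above, a divergence of this type is used to quantify the between-population difference when balancing summary-level information across ancestries.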
12
Neophytos Charalambides
PhD Student
EECS
Approximate Matrix Multiplication and Laplacian Sparsifiers
Abstract
A ubiquitous operation in data science and scientific computing is matrix multiplication. However, it presents a major computational bottleneck when the matrix dimension is high, as can occur for large data sizes or feature dimensions. A common approach to approximating the product is to subsample row and column vectors from the two matrices and sum the rank-1 outer products of the sampled pairs. We propose a sampling distribution based on the leverage scores of the two matrices. We give a characterization of our approximation in terms of the Euclidean norm, analogous to that of an $\ell_2$-subspace embedding. We then show connections between our algorithm, $CR$-multiplication, and Laplacian spectral sparsifiers, which also have numerous applications in data science, and show how approximate matrix multiplication can be used to devise sparsifiers. We also review some applications where these approaches may be useful.
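A minimal NumPy sketch of sampled rank-1 ("CR") approximate multiplication with a leverage-score-based sampling distribution. The particular way the two matrices' leverage scores are mixed here is one plausible choice for illustration, not necessarily the paper's distribution; any strictly positive sampling probabilities yield an unbiased estimator after importance re-weighting.

```python
import numpy as np

def leverage_scores(M):
    """Row leverage scores of M via a thin SVD."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return np.sum(U**2, axis=1)

def cr_multiply(A, B, c, rng):
    """Sampled approximation of A @ B using c rescaled rank-1 outer products."""
    n = A.shape[1]
    # Sampling distribution from the leverage scores of A's columns and B's rows.
    lev = leverage_scores(A.T) + leverage_scores(B)
    p = lev / lev.sum()
    idx = rng.choice(n, size=c, p=p)
    approx = np.zeros((A.shape[0], B.shape[1]))
    for i in idx:
        approx += np.outer(A[:, i], B[i, :]) / (c * p[i])  # importance re-weighting
    return approx

rng = np.random.default_rng(0)
A, B = rng.normal(size=(50, 400)), rng.normal(size=(400, 30))
err = np.linalg.norm(cr_multiply(A, B, c=150, rng=rng) - A @ B) / np.linalg.norm(A @ B)
print(f"relative Frobenius error: {err:.3f}")
```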
13
Rui Nie
Undergraduate Student
Statistics
Exploring Machine Olfaction
Abstract
The sense of smell provides a basis for decisions in various aspects of life, including hygiene and safety assessment. Uncovering the relationship between odor percepts and molecular structures can advance our knowledge of our surrounding environments and of the neural structure of the human olfactory system. However, the mismatch between the structural similarity of molecules and their odor similarity complicates the olfaction problem. This study primarily exploited two psychophysical datasets in which trained and novice human subjects, respectively, reported subjective ratings for a selection of odor descriptors. To overcome the difficulties posed by the irregular structure of molecules, we compared the performance of classical machine learning regression methods on pre-computed physicochemical descriptors with graph neural networks (GNNs) on molecular representations. We found that a GNN trained from scratch on atoms and edge connectivity outperformed the best-performing random forest regression method in predicting human odor percepts for each of the datasets, and it also demonstrated better generalizability in transfer learning across the two datasets on correlated odor descriptors. The ability of GNNs to learn complex, general odor percepts from subjects across domain knowledge levels offers a new perspective on more efficient data utilization in the field of olfaction.
14
Simon Nguyen
Master’s Student
Statistics
Optimal full matching under a new constraint on the sharing of controls: Application in pediatric critical care
Abstract
Health policy researchers are often interested in the causal effect of a medical treatment in situations where randomization is not possible. Full matching on the propensity score (Gu & Rosenbaum, 1993) aims to emulate random assignment by placing observations with similar estimated propensity scores into sets with either one treated unit and one or more control units or one control unit and multiple treated units. Sets of the second type, with treated units forced to share a comparison unit, can be unhelpful from the perspective of statistical efficiency. The sharing of controls is often needed to achieve experiment-like arrangements, but optimal full matching is known to exaggerate the number of many-one matches that are necessary, generating lopsided matched sets and smaller effective sample sizes (Hansen, 2004). In this paper, we introduce an enhancement of the Hansen and Klopfer (2006) optimal full matching algorithm that counteracts this exaggeration by permitting treated units to share a control while limiting the number that may do so. The result is a more balanced matching structure that prioritizes 1:1 pairs over lopsided, many-to-one configurations of matched sets. This enhanced optimal full matching is then illustrated in a pilot study on the effects of Extracorporeal Membrane Oxygenation (ECMO) for the treatment of pediatric acute respiratory distress syndrome. Within this data-scarce pilot study, existing methods for limiting the sharing of controls have already resulted in an increased effective sample size. The present enhancement of Hansen and Klopfer's optimal full matching algorithm provides an additional benefit to both effective sample size and covariate balance.
15
Jialu Zhou
Master’s Student
Biostatistics
Application of Statistical Methodology in the High-Frequency Data Volatility Research
Abstract
As an important measure of the financial market, volatility is a central topic in high-frequency data research. Because volatility is not directly observable, good estimation of it is always of great significance. Under the assumption of zero measurement error, we construct a newly weighted realized volatility that controls the effect of microstructure noise in the estimation. Specifically, we first choose the best measurement frequency, which turns out to be 60 minutes based on a bias-variance balance. Second, the distribution of the hourly volatility data is fitted and inverse probabilities are calculated based on the Kullback-Leibler divergence. The updated realized volatility shows better predictive power in the traditional heterogeneous autoregressive (HAR) volatility model. A stratified HAR model, which combines GMM and model averaging, achieves a large improvement of about 50% in AIC based on the estimator we construct. This work shows the strong complementarity of parametric and non-parametric methods in high-frequency data research.
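For reference, the baseline heterogeneous autoregressive (HAR) model mentioned above is usually written in the standard Corsi-type form
\[
RV_{t+1} = \beta_0 + \beta_d\, RV_t + \beta_w\, \frac{1}{5}\sum_{i=0}^{4} RV_{t-i} + \beta_m\, \frac{1}{22}\sum_{i=0}^{21} RV_{t-i} + \varepsilon_{t+1},
\]
i.e., next-day realized volatility is regressed on daily, weekly, and monthly averages of past realized volatility; the work above feeds its weighted realized-volatility estimator into this type of model.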
16
Ziping Xu
PhD Student
Statistics
Adaptive Sampling for Discovery
Abstract
In this paper, we study a sequential decision-making problem called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal of maximizing the sum of responses. This problem has wide applications to real-world discovery problems, for example, drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma: the algorithm needs to choose points that yield information to improve model estimates, but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph, and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
17
Bach Viet Do
PhD Student
Statistics
Modeling Solar Flares’ Heterogeneity With Mixture Models
Abstract
The physics of solar flares on the surface of the Sun is highly complex and not yet fully understood. However, observations show that solar eruptions are associated with the intense kilogauss fields of active regions (ARs), where free energies are stored with field-aligned electric currents. With the advent of high-quality data sources such as the Geostationary Operational Environmental Satellites (GOES) and the Solar Dynamics Observatory (SDO)/Helioseismic and Magnetic Imager (HMI), recent work on solar flare forecasting has been focusing on data-driven methods. In particular, black-box machine learning and deep learning models are increasingly being adopted, in which underlying data structures are not modeled explicitly. If the active regions indeed follow the same laws of physics, there should be similar patterns shared among them, reflected by the observations. Yet, these black-box models currently used in the space weather literature do not explicitly characterize the heterogeneous nature of the solar flare data, within and between active regions. In this paper, we propose two finite mixture models designed to capture the heterogeneous patterns of active regions and their associated solar flare events. With extensive numerical studies, we demonstrate the usefulness of our proposed method for both resolving the sample imbalance issue and modeling the heterogeneity for solar flare events, which are strong and rare.
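For readers unfamiliar with the modeling device, a finite mixture assumes each observation (here, a flare-related feature vector for an active region) is drawn from one of K latent sub-populations (standard form):
\[
f(x) = \sum_{k=1}^{K} \pi_k\, f_k(x; \theta_k), \qquad \pi_k \ge 0,\ \ \sum_{k=1}^{K}\pi_k = 1,
\]
with the component densities $f_k$ and weights $\pi_k$ capturing the within- and between-region heterogeneity described above.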
18
Longrong Pan
Master’s Student
Survey and Data Science
Global Health Interactive Visualization
Abstract
This study explores the relationships between medical resources and the mortality of certain diseases by creating an interactive Shiny app to visualize global health data collected from the World Bank. The app supports spatio-temporal analysis by mapping values to colors over a world map, ranking the values of different regions with bar plots, and showing the trends in the relationships between medical resources and disease mortality with interactive bubble charts. By allowing users to customize the display for their own purposes, this study offers a practical visualization tool that shows multiple variables in an integrated way and thus presents more information in fewer dimensions.
19
Felipe Maia Polo
PhD Student
Statistics
Conditional independence testing under model misspecification
Abstract
Testing for conditional independence (CI) is a crucial and challenging aspect of contemporary statistics and machine learning. These tests are widely utilized in areas such as causal inference, algorithmic fairness, feature selection, and transfer learning. Many modern methods for conditional independence testing rely on powerful supervised learning methods to learn regression functions as an intermediate step. Although these methods are guaranteed to control Type I error when the supervised learning methods accurately estimate the regression function or Bayes predictor, their behavior when the supervised learning method fails due to model misspecification is not well understood. We study the performance of conditional independence tests based on supervised learning under model misspecification, proposing new approximations and upper bounds for the testing errors that depend explicitly on the misspecification errors. Finally, we introduce the Rao-Blackwellized Predictor Test (RBPT), a novel regression-based CI test that is robust against model misspecification: compared with the considered benchmarks, the RBPT can control Type I error under weaker assumptions while maintaining non-trivial power.
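For concreteness, a classical regression-based CI test, the partial-correlation test obtained by regressing X and Y on Z and correlating the residuals, is sketched below in Python; it is valid under linear-Gaussian assumptions and is the kind of regression-dependent procedure whose behavior under misspecification the abstract studies. It is not the proposed RBPT; the function name and toy data are illustrative.

```python
import numpy as np
from scipy import stats

def residual_corr_ci_test(x, y, z):
    """Test X independent of Y given Z under a linear-Gaussian working model:
    regress X and Y on Z, then t-test the correlation of the residuals."""
    Z1 = np.column_stack([np.ones(len(z)), z])
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    r, _ = stats.pearsonr(rx, ry)
    dof = len(x) - Z1.shape[1] - 1           # n - 2 - (number of conditioning variables)
    t = r * np.sqrt(dof / (1 - r**2))
    p = 2 * stats.t.sf(abs(t), dof)
    return r, p

rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + rng.standard_normal(500)
y = z + rng.standard_normal(500)             # X and Y are independent given Z
print(residual_corr_ci_test(x, y, z))
```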
20
Alexander Kagan
PhD Student
Statistics
Influence Maximization under Generalized Linear Threshold Models
Abstract
Influence Maximization (IM) is the problem of determining a fixed number of nodes that would maximize the spread of information through a network if they were to receive it, with applications in marketing, public health, etc. IM requires an information spread model together with model-specific edge parameters governing the transmission probability between nodes. In practice, these edge weights can be estimated from multiple observed information diffusion paths, e.g., retweets. First, we generalize the well-known Linear Threshold Model, which assumes each node has a uniformly distributed activation threshold, to allow for arbitrary threshold distributions. For this general model, we then introduce a likelihood-based approach to estimating the edge weights from diffusion paths, and prove that the IM problem can be solved with a natural greedy optimization algorithm without loss of the standard optimality guarantee. Extensive experiments on synthetic and real-world networks demonstrate that a good choice of threshold distribution combined with our algorithm for estimating edge weights significantly improves the quality of IM solutions.
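A minimal Python sketch of the simulate-and-greedily-select pipeline for threshold models is given below; the threshold distribution is pluggable, so swapping the uniform sampler for another distribution gives a generalized threshold model. The function names and the toy random graph are illustrative, and the likelihood-based estimation of edge weights from diffusion paths is not shown.

```python
import numpy as np

def simulate_lt_spread(W, seeds, threshold_sampler, rng):
    """One cascade of a (generalized) linear threshold model. W[u, v] is the influence
    weight of u on v; node v activates once the total weight of its active in-neighbors
    reaches its randomly drawn threshold."""
    n = W.shape[0]
    theta = threshold_sampler(n, rng)
    active = np.zeros(n, dtype=bool)
    active[list(seeds)] = True
    changed = True
    while changed:
        incoming = active.astype(float) @ W            # weight from active in-neighbors
        newly = (~active) & (incoming >= theta)
        changed = bool(newly.any())
        active |= newly
    return active.sum()

def greedy_im(W, k, threshold_sampler, n_sims=200, seed=0):
    """Greedy seed selection: add the node with the largest Monte Carlo estimate of
    marginal expected spread, k times."""
    rng = np.random.default_rng(seed)
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for v in range(W.shape[0]):
            if v in chosen:
                continue
            val = np.mean([simulate_lt_spread(W, chosen + [v], threshold_sampler, rng)
                           for _ in range(n_sims)])
            if val > best_val:
                best, best_val = v, val
        chosen.append(best)
    return chosen

# Toy usage: random sparse graph, uniform thresholds (the classical LT model).
rng = np.random.default_rng(1)
n = 30
W = (rng.random((n, n)) < 0.1) * rng.random((n, n)) * 0.3
np.fill_diagonal(W, 0.0)
W = W / np.maximum(W.sum(axis=0), 1.0)                 # keep each node's in-weight sum <= 1
uniform_thresholds = lambda n, rng: rng.random(n)
print(greedy_im(W, k=3, threshold_sampler=uniform_thresholds))
```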
15-min Oral Presentation II
March 10th 9:30AM-12:00PM @ Amphitheatre
Jeong Hin Chin
Undergraduate Student
Statistics
Using Statistical Methods to Predict Team Outcomes
Abstract
Team-based learning (TBL) was first introduced in the literature in 1982 as a solution to problems that arose from large class settings [1], [2]. Although TBL was first implemented in business schools, team-based pedagogy can now be found across engineering, medical, and social sciences programs all around the world. Even though TBL provides students and instructors with many benefits, not all students benefit equally from this learning method due to various reasons such as free-riders, work allocation, and communication issues [3], [4]. Thus, in order to ensure students are able to enjoy the benefits of TBL, teamwork assessment and support tools such as CATME or Tandem can be used to monitor the students’ performances and notice any changes within the team [4]–[6]. This study will utilize data collected by [team support tool removed for confidential review], a teamwork assessment and support tool capable of providing formative feedback to teams and team members [5]. In order to measure the changes within the teams and check on students’ progress, [team support tool removed for confidential review] collects students’ information through surveys. In this study, the authors are interested in three surveys that the tool collects, namely the “beginning of term” survey (BoT), the “end of term” survey (EoT), and the weekly team check surveys (TC). BoT and EoT include survey questions that require students to rate their relevant experiences and self-efficacy for project-related tasks in the class, as well as their preferences for approaching teamwork, such as procrastination, academic orientation, and extraversion. On the other hand, the weekly TC requires students to rate the team overall on five items (“working well,” “sharing of work,” “sharing of ideas,” “team confidence,” and “logistics/challenges”). [Tool name removed for review] was first implemented in 2019 and has collected responses from more than 5000 students. In this paper, roughly 3000 responses collected from first-year engineering students from 2019 to 2021 will be studied. The authors intend to use information from BoT and the weekly TC to predict, both weekly and at the end of the semester, whether the teams are working well. The predicted results will be compared to the students’ actual weekly and end-of-semester “working well” responses to determine the accuracy of the model. Since the team support tool will continue to be used in future semesters, the authors will take a Bayesian approach in building the models to ensure that past information about a parameter can be used to form a prior distribution for future analysis [7], [8]. The authors will also use cognitive diagnostic models (CDMs) to understand the relationship between the variables collected in the weekly team checks and student responses to the initial survey. CDMs are psychometric models that provide information about a person’s proficiency in solving particular sets of items [9]. The authors recognise that the survey questions in the TCs and BoT do not have correct answers and that answering them does not require any particular proficiency. Nonetheless, CDMs can still be used to capture the relationship between how the students perceive their team experience (questions in the weekly team checks) and how the students perceive their own personality and preferences (questions in the initial survey). Other studies have used CDMs to learn more about team formation and relationships [10] and about relationships between questions in surveys [11].
The Bayesian model will include cluster information obtained through the unsupervised learning method described in [5], together with the additional information obtained from the aforementioned CDM method. By combining these two pieces of information, we hope to predict how a team’s dynamics change and why. This information will allow faculty members who use the team support tool to better understand their students and to provide the feedback and guidance needed for a better team experience and greater success in the course.
Margaret Banker
PhD Student
Biostatistics
Regularized Simultaneous Estimation of Changepoint and Functional Parameter in Functional Accelerometer Data Analysis
Abstract
Accelerometry data enables scientists to extract personal digital features useful in precision health decision making. Existing analytic methods often begin by discretizing Physical Activity (PA) counts into activity categories via fixed cutoffs; however, these cutoffs are validated under restricted settings and cannot be generalized across studies. Here, we develop a data-driven approach to overcome this bottleneck in the analysis of PA data, in which we holistically summarize an individual’s PA profile using Occupation-Time Curves that describe the percentage of time spent at or above a continuum of activity levels. The resulting functional curve is informative for capturing time-course individual variability in PA. We investigate a functional analytic approach under L0 regularization, which handles the highly correlated micro-activity windows that serve as predictors in a scalar-on-function regression model. We develop a new one-step method that simultaneously conducts fusion via change-point detection and parameter estimation through a new L0 constraint formulation, which we evaluate via simulation experiments and a data analysis assessing the influence of PA on biological aging.
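The occupation-time summary itself is simple to compute: for a grid of activity levels, record the fraction of epochs at or above each level. A small Python sketch on toy minute-level counts follows; the grid, toy data, and function name are illustrative, and the L0-regularized scalar-on-function fit is not shown.

```python
import numpy as np

def occupation_time_curve(pa_counts, grid):
    """Fraction of wear-time epochs at or above each activity level in `grid`,
    a cutoff-free functional summary of a subject's activity profile."""
    pa = np.asarray(pa_counts, dtype=float)
    return np.array([(pa >= c).mean() for c in grid])

rng = np.random.default_rng(0)
pa = rng.gamma(shape=1.5, scale=200.0, size=1440)   # one day of minute-level counts (toy)
grid = np.linspace(0, 2000, 50)                     # continuum of activity levels
otc = occupation_time_curve(pa, grid)               # functional predictor for regression
print(otc[:5])
```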
Easton Huch
PhD Student
Statistics
Bayesian Randomization Inference: A Distribution-free Approach to Bayesian Causal Inference
Abstract
Randomization inference is a family of frequentist statistical methods that allow researchers to measure and test causal relationships without making any distributional assumptions about the outcomes; in fact, the potential outcomes are treated as known—not random—quantities. The key benefit of this approach is that it can be applied to virtually any causal analysis in which the assignment mechanism is known, regardless of the distribution of the outcome variable. The random treatment assignment itself justifies the statistical inference. In this talk, I develop a Bayesian framework for randomization inference with continuous outcomes that enjoys similar benefits to those listed above. As is typical of Bayesian methods, the approach allows for seamless uncertainty quantification of functions of parameters by integrating over the full parameter space. I illustrate the approach with examples and discuss possible extensions.
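For readers less familiar with randomization inference, the classical frequentist version that this Bayesian framework builds on can be sketched in a few lines of Python: re-draw the known assignment mechanism many times and compare the observed difference in means against the resulting reference distribution. The function name and toy data are illustrative; the Bayesian machinery itself is not shown.

```python
import numpy as np

def randomization_test(y, z, n_perm=10000, seed=0):
    """Fisher-style randomization test of no treatment effect under complete
    randomization: re-randomize assignments and recompute the difference in means."""
    rng = np.random.default_rng(seed)
    y, z = np.asarray(y, float), np.asarray(z, int)
    obs = y[z == 1].mean() - y[z == 0].mean()
    null = np.empty(n_perm)
    for b in range(n_perm):
        zb = rng.permutation(z)                       # known assignment mechanism
        null[b] = y[zb == 1].mean() - y[zb == 0].mean()
    return obs, np.mean(np.abs(null) >= abs(obs))     # two-sided p-value

rng = np.random.default_rng(1)
z = rng.permutation(np.repeat([0, 1], 50))
y = 0.5 * z + rng.standard_normal(100)
print(randomization_test(y, z))
```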
Daniele Bracale
PhD Student
Statistics
Semi-Parametric Non-Smoothing Optimal Dynamic Pricing
Abstract
In this paper, we study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise. Products are sold one at a time, and only a binary response indicating the success or failure of a sale is observed. Our model setting is similar to that of Javanmard et al., except that we expand the demand curve to a semi-parametric model and must dynamically learn both its parametric and non-parametric components, as in Fan et al. Our setting still differs from Fan et al., since we use no kernel smoothing but rather non-parametric estimates (MLE, LS) that avoid choosing a bandwidth. We propose a dynamic statistical learning and decision-making policy that combines semi-parametric estimation from a generalized linear model with an unknown link and online decision-making to minimize regret (maximize revenue). Under mild conditions, we show that for a market noise c.d.f. F with a second-order derivative, our policy achieves a regret upper bound of $\tilde{O}(T^{17/25})$, where T is the time horizon, which improves on Fan et al.
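To fix ideas about the feedback structure (posted price, binary sale indicator), here is a deliberately simplified explore-then-commit baseline in Python that assumes a fully parametric logistic-noise demand model; it is not the semi-parametric policy of the abstract, and the function name, price grid, and toy parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def explore_then_commit_pricing(X, theta_star, noise_rng, T_explore, price_grid, seed=0):
    """Post random prices to explore, fit a logistic sale model P(sale | x, p),
    then post the price maximizing estimated expected revenue p * P(sale | x, p)."""
    rng = np.random.default_rng(seed)
    hist_X, hist_y, revenue, clf = [], [], 0.0, None
    for t in range(X.shape[0]):
        x = X[t]
        if t < T_explore:
            p = rng.choice(price_grid)                              # exploration
        else:
            feats = np.column_stack([np.tile(x, (len(price_grid), 1)), price_grid])
            exp_rev = price_grid * clf.predict_proba(feats)[:, 1]
            p = price_grid[int(np.argmax(exp_rev))]                 # exploitation
        value = x @ theta_star + noise_rng.logistic()               # latent valuation
        y = int(value >= p)                                         # binary sale feedback
        revenue += p * y
        hist_X.append(np.concatenate([x, [p]]))
        hist_y.append(y)
        if t == T_explore - 1:
            clf = LogisticRegression().fit(np.array(hist_X), np.array(hist_y))
    return revenue

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 3))
theta_star = np.array([1.0, 0.5, -0.5])
print(explore_then_commit_pricing(X, theta_star, rng, T_explore=400,
                                  price_grid=np.linspace(0.1, 5.0, 50)))
```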
Jeffrey Okamoto
PhD Student
Biostatistics
Probabilistic integration of transcriptome-wide association studies and colocalization analysis identifies key molecular pathways of complex traits
Abstract
Integrative genetic association methods have shown great promise in post-GWAS (genome-wide association study) analyses, in which one of the most challenging tasks is identifying putative causal genes (PCGs) and uncovering molecular mechanisms of complex traits. Recent studies suggest that prevailing computational approaches, including transcriptome-wide association studies (TWASs) and colocalization analysis, are individually imperfect, but their joint usage can yield robust and powerful inference results. We present INTACT, an empirical Bayesian framework to integrate probabilistic evidence from these distinct types of analyses and identify PCGs. Capitalizing on the fact that TWAS and colocalization analysis have low inferential reproducibility for implicating PCGs, we show that INTACT has a mathematical connection to Dempster-Shafer theory, especially Dempster’s rule of combination. This procedure is flexible and can work with a wide range of existing integrative analysis approaches. It has the unique ability to quantify the uncertainty of implicated genes, enabling rigorous control of false-positive discoveries. Taking advantage of this highly desirable feature, we further propose an efficient algorithm, INTACT-GSE, for gene set enrichment analysis based on the integrated probabilistic evidence. We examine the proposed computational methods and illustrate their improved performance over the existing approaches through simulation studies. We apply the proposed methods to analyze the multi-tissue eQTL data from the GTEx project and eight large-scale complex- and molecular-trait GWAS datasets from multiple consortia and the UK Biobank. Overall, we find that the proposed methods markedly improve the existing PCG implication methods and are particularly advantageous in evaluating and identifying key gene sets and biological pathways underlying complex traits.
Charlotte Mann
PhD Student
Statistics
Combining observational and experimental data for causal inference considering data privacy
Abstract
Combining observational and experimental data for causal inference can improve treatment effect estimation. However, many observational data sets cannot be released due to data privacy considerations, so one researcher may not have access to both experimental and observational data. Nonetheless, a small amount of risk of disclosing sensitive information might be tolerable to organizations that house confidential data. In these cases, organizations can employ data privacy techniques, which decrease disclosure risk, potentially at the expense of data utility. In this paper, we explore disclosure-limiting transformations of observational data, which can be combined with experimental data to estimate the sample and population average treatment effects. We consider leveraging observational data to improve the generalizability of treatment effect estimates when a randomized controlled trial (RCT) is not representative of the population of interest, and to increase the precision of treatment effect estimates. Through simulation studies, we illustrate the trade-off between privacy and utility when employing different disclosure-limiting transformations. We find that leveraging transformed observational data in treatment effect estimation can still improve estimation over only using data from an RCT.
Jieru Shi
PhD Student
Biostatistics
Debiased machine learning of causal excursion effects to assess time-varying moderation
Abstract
Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions in multiple domains of health sciences. Sequentially randomized experiments called micro-randomized trials (MRTs) have grown in popularity as a means to empirically evaluate the effectiveness of these mHealth intervention components. MRTs have motivated a new class of causal estimands, termed “causal excursion effects”, which allows health scientists to assess how intervention effectiveness changes over time or is moderated by individual characteristics, context, or responses in the past. However, current data analysis methods require pre-specified features of the observed high-dimensional history to construct a working model of an important nuisance parameter. Machine learning (ML) algorithms are ideal for automatic feature construction, but their naive application to causal excursion estimation can lead to bias under model misspecification and therefore incorrect conclusions about the effectiveness of interventions. In this paper, the estimation of causal excursion effects is revisited from a meta-learner’s perspective, where ML and statistical methods such as supervised learning and regression have been explored. Asymptotic properties of the novel estimands are presented and a theoretical comparison accompanied by extensive simulation experiments demonstrates relative efficiency gains, supporting our recommendation for a doubly-robust alternative to the existing methods. Practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first year medical residents in the United States.
Lap Sum Chan
PhD Student
Biostatistics
Identification and Inference for High-dimensional Pleiotropic Variants in GWAS
Abstract
In a standard analysis, pleiotropic variants are identified by running separate genome-wide association studies (GWAS) and combining results across traits, but such a two-stage statistical approach may lead to spurious results. We propose a new statistical approach, the Debiased-regularized Factor Analysis Regression Model (DrFARM), a joint regression model for simultaneous analysis of high-dimensional genetic variants and multilevel dependencies, to identify pleiotropic variants in multi-trait GWAS. This joint modeling strategy controls the overall error to permit universal false discovery rate (FDR) control. DrFARM uses the strengths of the debiasing technique and the Cauchy combination test, both theoretically justified, to establish valid post-selection inference on pleiotropic variants. Through extensive simulations, we show that DrFARM appropriately controls the overall FDR. Applying DrFARM to data on 1,031 metabolites measured on 6,135 men from the Metabolic Syndrome in Men (METSIM) study, we identify 288 new metabolite associations at loci that did not reach statistical significance in prior METSIM metabolite GWAS. In addition, we identify new pleiotropic loci for 16 metabolite pairs.
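The Cauchy combination test used as a building block above has a simple closed form: with weights summing to one, the statistic is a weighted sum of tan((0.5 - p_i)π) and is compared to a standard Cauchy reference. A short Python sketch of this generic combination step (not the full DrFARM pipeline) follows; the toy p-values are illustrative.

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Cauchy combination of possibly dependent p-values: the statistic is a
    weighted sum of tan((0.5 - p) * pi); under the null it is approximately
    standard Cauchy, giving the combined p-value in closed form."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights, float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(t) / np.pi

print(cauchy_combination([0.01, 0.20, 0.65]))
```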
15-min Oral Presentation III
March 10th 1:00PM-3:30PM @ Amphitheatre
Di Wang
PhD Student
Biostatistics
Incorporating External Risk Information from Published Prediction Models with the Cox Model Accounting for Population Heterogeneity
Abstract
Polygenic hazard score (PHS) models designed for European ancestry provide ample information regarding survival risk discrimination. Incorporating such information can be useful to improve the performance of risk discrimination in an internal small-sized study of a minority cohort. However, given that external European models and internal individual-level data come from different populations, ignoring heterogeneity among information sources may introduce substantial bias. In this paper, we develop a Kullback-Leibler-based Cox model (CoxKL) to integrate internal individual-level time-to-event data with external risk scores derived from published prediction models, accounting for population heterogeneity. Partial-likelihood-based KL information is utilized to measure the discrepancy between the external risk information and the internal data. Simulation studies show that the integration model by the proposed CoxKL method achieves improved estimation efficiency and prediction accuracy. We apply the proposed method to develop a trans-ancestry PHS model for prostate cancer by integrating a previously published PHS model based on European ancestry with an internal genetic dataset of African American ancestry males.
Mason Ferlic
PhD Student
Statistics
Optimizing Event-triggered Adaptive Interventions in Mobile Health with Sequentially Randomized Trials
Abstract
In mobile and digital health, advances in collecting sensor data and engaging users in self-reporting have enabled real-time monitoring of an individual’s response to treatment. This has led to significant scientific interest in developing technology-assisted dynamic treatment regimes incorporating digital tailoring variables that determine when, if, and what treatment is needed. In such mobile monitoring environments, event-triggered adaptive interventions, in which a patient transitions to the next stage of therapy when pre-specified event criteria are triggered, enable more agile treatment timing to meet the individual’s needs. Sequential, multiple-assignment randomized trial (SMART) designs can be used to develop optimized event-triggered adaptive interventions. We introduce a new estimation approach for analyzing data from SMARTs that addresses four statistical challenges: (i) the need to condition on the event, which is impacted by past treatment assignment, (ii) while avoiding causal collider bias in the comparison of adaptive interventions starting with different treatments, (iii) the need for dimension-reducing models for the distribution of the event given the past, and (iv) for the relationship between the event and the research outcome, all while avoiding negative impacts of model misspecification bias on the target causal effects. We illustrate the method on data from a SMART to develop an event-triggered adaptive intervention for weight loss.
Saghar Adler
PhD Student
EECS
Learning a Discrete Set of Optimal Allocation Rules in a Queueing System with Unknown Service Rate
Abstract
To highlight difficulties in learning-based optimal control in nonlinear stochastic dynamic systems, we study admission control for a classical Erlang-B blocking system with unknown service rate. At every job arrival, a dispatcher decides to assign the job to an available server or to block it. Every served job yields a fixed reward for the dispatcher, but it also results in a cost per unit time of service. Our goal is to design a dispatching policy that maximizes the long-term average reward for the dispatcher based on observing the arrival times and the state of the system at each arrival. Critically, the dispatcher observes neither the service times nor the departure times, so that reinforcement learning based approaches do not apply. Hence, we develop our learning-based dispatch scheme as a parametric learning problem à la self-tuning adaptive control. In our problem, certainty equivalent control switches between an always-admit policy (always explore) and a never-admit policy (immediately terminate learning), which is distinct from the adaptive control literature. Therefore, our learning scheme judiciously uses the always-admit policy so that learning does not stall. We prove that for all service rates, the proposed policy asymptotically learns to take the optimal action, and we also present finite-time regret guarantees. The extreme contrast in the certainty equivalent optimal control policies leads to difficulties in learning that show up in our regret bounds for different parameter regimes. We explore this aspect in our simulations, along with follow-up sampling-related questions for our continuous-time system.
Madeline Abbott
PhD Student
Biostatistics
A latent variable approach to jointly modeling emotions and cigarette use in a mobile health study of smoking cessation
Abstract
Ecological momentary assessment (EMA), which consists of frequently delivered surveys sent to individuals’ smartphones, allows for the collection of data in real time and in natural environments. As a result, data collected using EMA can be particularly useful in understanding the temporal dynamics of individuals’ states and how these states relate to outcomes of interest. Motivated by data from a smoking cessation study, we propose a statistical method for analyzing longitudinal EMA data to determine what psychological states represent risk for smoking and to understand the dynamics of these states. Our method consists of a longitudinal submodel—a dynamic factor model—that models changes in time-varying latent psychological states and a cumulative risk submodel—a Poisson regression model—that connects the latent states with risk for smoking. In data motivating this work, both the underlying psychological states (the predictors) and cigarette smoking (the outcome) are partially unobserved. We account for these partially latent predictors and outcomes in our proposed model and estimation method in which we take a two-stage approach to estimate associations between the psychological states and smoking risk. We include weights in the cumulative risk submodel to reduce the bias in our estimates of association. To illustrate our method, we apply it to a subset of data from the smoking cessation study. Although our work is motivated by a mobile health study of smoking cessation, methods presented here are applicable to mobile health data collected in a variety of other contexts.
Jaeshin Park
PhD Student
Industrial and Operations Engineering
Stratified sampling for reliability analysis using stochastic simulation with multi-dimensional input
Abstract
Stratified sampling has been used to reduce estimation variance when analyzing system reliability in many applications. It divides the input space into disjoint subsets, called strata, and draws samples from each stratum. By partitioning the space properly and allocating more of the computational budget to important strata, it can accurately estimate system reliability with a limited computational budget. In the literature, how to allocate the computational budget given the stratification structure has been extensively studied; however, how to effectively partition the input domain (i.e., how to design the strata) has not been fully investigated. Stratification design becomes more important as the input dimension increases, due to the curse of dimensionality. This study analytically derives the optimal stratification structure that minimizes the estimation variance. Further, by incorporating ideas from decision trees into the optimal stratification, we devise a robust algorithm for high-dimensional input problems. Numerical experiments and a wind power case study demonstrate the benefits of the proposed method.
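The basic stratified estimator and its variance are easy to write down: weight each stratum's sample mean by the stratum probability and sum. The Python sketch below uses a toy one-dimensional example with two fixed strata and a hand-chosen allocation to illustrate the variance-reduction idea; the optimal stratification design and budget allocation studied in the abstract are not shown, and the names and toy failure model are illustrative.

```python
import numpy as np

def stratified_estimate(samplers, weights, alloc, seed=0):
    """Stratified Monte Carlo estimate of a failure probability and its variance.
    samplers[k](n, rng) draws n failure indicators from stratum k, weights[k] is
    the stratum probability, and alloc[k] is the per-stratum sampling budget."""
    rng = np.random.default_rng(seed)
    est, var = 0.0, 0.0
    for k in range(len(samplers)):
        y = samplers[k](alloc[k], rng)
        est += weights[k] * y.mean()
        var += weights[k] ** 2 * y.var(ddof=1) / alloc[k]
    return est, var

# Toy: input U ~ Uniform(0, 1); the system "fails" when U > 0.95.
# Strata are [0, 0.9) and [0.9, 1]; putting most of the budget in the rare,
# failure-prone stratum sharply reduces the estimator's variance.
w = [0.9, 0.1]
s0 = lambda n, rng: (rng.uniform(0.0, 0.9, n) > 0.95).astype(float)
s1 = lambda n, rng: (rng.uniform(0.9, 1.0, n) > 0.95).astype(float)
print(stratified_estimate([s0, s1], w, alloc=[200, 1800]))   # true value is 0.05
```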
Mengqi Lin
PhD Student
Statistics
Controlling the false discovery rate under dependency with the adaptively weighted BH procedure
Abstract
We introduce a generic adaptively weighted, covariate-assisted multiple testing method for finite-sample false discovery rate (FDR) control with dependent test statistics where the dependence structure is known. Our method employs conditional calibration to address the dependency between test statistics, and we use the conditional statistics to learn adaptive weights while maintaining FDR control. We derive optimal weights under a conditional two-group model, and provide an algorithm to approximate them. Together with the conditional calibration, our adaptively weighted procedure controls the FDR while improving the power when the covariates are useful. For fixed weights, our procedure dominates the traditional weighted BH procedures under positive dependence and the general weighted BY procedure under known generic dependence.
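As background, the fixed-weight BH procedure that serves as a comparison point works as follows: divide each p-value by its weight (weights averaging one) and run the usual Benjamini-Hochberg step-up rule on the weighted p-values. A short Python sketch is given below; the adaptive, conditionally calibrated procedure of the abstract is not shown, and the toy data and weights are illustrative.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Fixed-weight BH: with weights averaging one, apply the BH step-up rule to the
    weighted p-values p_i / w_i and reject the corresponding hypotheses."""
    p = np.asarray(pvals, float)
    w = np.asarray(weights, float)
    m = len(p)
    q = p / w                                     # weighted p-values
    order = np.argsort(q)
    below = q[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])
w = np.concatenate([np.full(90, 0.5), np.full(10, 5.5)])   # covariate-informed, mean 1
print(weighted_bh(p, w).sum())
```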
Hanna Venera
Master’s Student
Biostatistics
Data Analytic Approach for Hybrid SMART-MRT Designs: The SMART Weight Loss Case Study
Abstract
Sequential Multiple Assignment Randomized Trials (SMARTs) and Micro-Randomized Trials (MRTs) are existing designs used to assess sequential components at relatively slow timescales (such as weeks or months) and at relatively fast timescales (such as days or hours), respectively. The hybrid SMART-MRT design is a new experimental approach that integrates a SMART design with an MRT design to enable researchers to answer scientific questions about the construction of psychological interventions in which components are delivered and adapted on different timescales. We explain how data from a hybrid SMART-MRT design can be analyzed to answer a variety of scientific questions about the development of multi-component psychological interventions. We use this approach to analyze data from a completed hybrid SMART-MRT to inform the development of a weight loss intervention. We also discuss how the data analytic approach can be modified to accommodate the unique structure of the weight loss SMART-MRT, including micro-randomizations that are restricted to individuals who show early signs of non-response.
Shota Takeishi
Visiting PhD Student
Statistics
A Shrinkage Likelihood Ratio Test for High-dimensional Subgroup Analysis with a Logistic-Normal Mixture Model
Abstract
In clinical trials, there may be a subgroup of patients with certain personal attributes who benefit from the treatment more than the rest of the population. Furthermore, such attributes can be high-dimensional if, for example, biomarkers or genome data are collected for each patient. With this practical application in mind, this study concerns testing for the existence of a subgroup with an enhanced treatment effect, where subgroup membership is potentially characterized by high-dimensional covariates. The existing literature on testing for the existence of such a subgroup has two drawbacks. First, the asymptotic null distributions of the test statistics proposed in the literature often have intractable forms. Notably, they are not easy to simulate, and hence the data analyst has to resort to computationally demanding methods, such as the bootstrap, to calculate the critical value. Second, most of the methods in the literature assume that the dimension of the personal attributes characterizing subgroup membership is fixed, so they are not applicable to high-dimensional data. To fix these problems, this research proposes a novel likelihood ratio-based test with a logistic-normal mixture model for testing the existence of the subgroup. The proposed test simplifies the asymptotic null distribution: we show that, under the null hypothesis, the test statistic weakly converges to a half chi-square distribution, which is easy to simulate. Furthermore, this convergence result holds even in a high-dimensional regime where the dimension of the personal attributes characterizing the subgroup classification exceeds the sample size. Beyond the theory, we present simulation results assessing the finite-sample performance of the proposed method.
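When the limiting “half chi-square” law takes the common boundary form, a 50:50 mixture of a point mass at zero and a chi-square variable with one degree of freedom (an assumption made here purely for illustration), its critical values can be simulated in a few lines of Python:

```python
import numpy as np

# Simulate the 50:50 mixture of a point mass at 0 and a chi-square(1) variable,
# a common "half chi-square" boundary null (assumed form, for illustration only).
rng = np.random.default_rng(0)
draws = np.where(rng.random(1_000_000) < 0.5, 0.0, rng.chisquare(1, 1_000_000))
print(np.quantile(draws, 0.95))   # about 2.71, versus 3.84 for a plain chi-square(1)
```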