Student Presentations
15-min Oral Presentation I
March 9th 9:30AM-12:00PM @ Amphitheatre
Wenshan Yu
PhD Student
Survey and Data Science
Are interviewer variances equal across modes in mixed-mode studies?
Abstract
As mixed-mode designs become increasingly popular, their effects on data quality have attracted much scholarly attention. Most studies have focused on the bias properties of mixed-mode designs; few have investigated whether mixed-mode designs have heterogeneous variance structures across modes. While many factors can contribute to the interviewer variance component, this study investigates whether interviewer variances are equal across modes in mixed-mode studies. We use data collected under two designs to answer the research question. In the first design, where interviewers are responsible for either the face-to-face (FTF) or the telephone (TEL) mode, we examine whether there are mode differences in interviewer variance for 1) sensitive political questions, 2) international attitudes, and 3) item missingness indicators, using the Arab Barometer Wave 6 Jordan data with a randomized mixed-mode design. In the second design, where interviewers are responsible for both modes, we draw on the Health and Retirement Study (HRS) 2016 core survey data to examine the question on three topics: 1) the CESD depression scale, 2) interviewer observations, and 3) the physical activity scale. To account for the lack of interpenetrated designs in both data sources, we include respondent-level demographic variables in our models. Given the limited statistical power of this study, we find significant differences in interviewer variances for one of twelve items in the Arab Barometer study and for three of seventeen items in the HRS. Overall, we find that interviewer variances are larger in FTF than in TEL for sensitive items, whereas for interviewer observations and non-sensitive items the pattern is reversed.
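One plausible multilevel specification for the comparison described above (our notation and simplification, not necessarily the authors' exact model) uses mode-specific interviewer random effects for respondent i assigned to interviewer j in mode m_{ij}:
\[
y_{ij} = x_{ij}^\top \beta + u_j^{\mathrm{FTF}}\,\mathbb{1}\{m_{ij}=\mathrm{FTF}\} + u_j^{\mathrm{TEL}}\,\mathbb{1}\{m_{ij}=\mathrm{TEL}\} + \varepsilon_{ij},
\qquad u_j^{\mathrm{FTF}} \sim N(0, \sigma^2_{\mathrm{FTF}}),\quad u_j^{\mathrm{TEL}} \sim N(0, \sigma^2_{\mathrm{TEL}}),
\]
so that the research question amounts to testing $H_0:\ \sigma^2_{\mathrm{FTF}} = \sigma^2_{\mathrm{TEL}}$, with the respondent-level covariates $x_{ij}$ standing in for the missing interpenetration.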
Hu Sun
PhD Student
Statistics
Tensor Gaussian Process with Contraction for Tensor Regression
Abstract
Tensor data is a prevalent data format in fields such as astronomy and biology. The structured information and high dimensionality of tensor data make it an intriguing but challenging topic for statisticians and practitioners. The low-rank scalar-on-tensor regression model, in particular, has received widespread attention and has been reformulated as a tensor Gaussian process (Tensor-GP) model with a multi-linear kernel. In this paper, we extend the Tensor-GP model by integrating a dimensionality reduction technique called tensor contraction with the Tensor-GP for the tensor regression task. We first estimate a latent, reduced-size tensor for each data tensor and then apply the multi-linear Tensor-GP to the latent tensor data for prediction. We introduce anisotropic total-variation regularization when conducting the tensor contraction to obtain a sparse and smooth latent tensor, and propose an alternating proximal gradient descent algorithm for estimation. We validate our approach via an extensive simulation study and real-data experiments on solar flare forecasting.
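One plausible schematic of the two-stage construction described above (our notation and simplifications, not necessarily the authors' exact formulation): each data tensor $\mathcal{X}_i$ is contracted to a smaller latent tensor via mode products with dimension-reducing factor matrices $A_d$, and the response is modeled by a GP on the latent tensor,
\[
\mathcal{Z}_i = \mathcal{X}_i \times_1 A_1^\top \times_2 A_2^\top \times_3 A_3^\top,
\qquad y_i = f(\mathcal{Z}_i) + \varepsilon_i, \quad f \sim \mathrm{GP}(0, k),
\]
where $k$ is a multi-linear kernel on the latent tensors and the $A_d$ are estimated with an anisotropic total-variation penalty of the form $\sum_d \lambda_d\,\mathrm{TV}(A_d)$ to encourage sparsity and smoothness.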
Yilun Zhu
PhD Student
EECS
Mixture Proportion Estimation Beyond Irreducibility
Abstract
The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and the mixture. Previous work on MPE adopts the \emph{irreducibility} assumption, which ensures identifiability of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further devise a resampling-based algorithm that extends any existing MPE method. This algorithm yields a consistent estimate of the mixture proportion under our more general sufficient condition and empirically exhibits improved estimation performance relative to baseline methods.
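For readers unfamiliar with MPE, the basic setup can be sketched as follows (standard notation, not specific to this paper):
\[
F = \kappa^{*} H + (1-\kappa^{*})\, G,
\]
where $F$ is the mixture distribution, $H$ is the known component, $G$ is the unobserved background distribution, and $\kappa^{*}$ is the mixture proportion to be estimated from samples of $F$ and $H$. Irreducibility requires that $G$ cannot itself be written as a nontrivial mixture containing $H$, which pins down $\kappa^{*}$ as the maximal such weight; the contribution above is a weaker sufficient condition under which $\kappa^{*}$ remains identifiable.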
Jing Ouyang
PhD Student
Statistics
Statistical Inference for Noisy Incomplete Binary Matrix
Abstract
We consider statistical inference for a noisy, incomplete binary (or 1-bit) matrix. Despite the importance of uncertainty quantification to matrix completion, most of the categorical matrix completion literature focuses on point estimation and prediction. This paper moves one step further toward statistical inference for binary matrix completion. Under a popular nonlinear factor analysis model, we obtain a point estimator and derive its asymptotic normality. Moreover, our analysis adopts a flexible missing-entry design that does not require the random sampling scheme assumed by most existing asymptotic results for matrix completion. Under reasonable conditions, the proposed estimator is statistically efficient and optimal in the sense that the Cramér-Rao lower bound is achieved asymptotically for the model parameters. Two applications are considered: (1) linking two forms of an educational test and (2) linking roll call voting records from multiple years in the United States Senate. The first application enables comparisons between examinees who took different test forms, and the second allows us to compare the liberal-conservativeness of senators who did not serve in the Senate at the same time.
Zongyu Li
PhD Student
EECS
Poisson Phase Retrieval in Very Low-count Regimes
Abstract
This paper proposes novel phase retrieval algorithms for maximum likelihood (ML) estimation from measurements following independent Poisson distributions in very low-count regimes, e.g., 0.25 photon per pixel. Specifically, we propose a modified Wirtinger flow (WF) algorithm using a step size based on the observed Fisher information. This approach eliminates all parameter tuning except the number of iterations. We also propose a novel curvature for majorize-minimize (MM) algorithms with a quadratic majorizer. We show theoretically that our proposed curvature is sharper than the curvature derived from the supremum of the second derivative of the Poisson ML cost function. We compare the proposed algorithms (WF, MM) with existing optimization methods, including WF using other step-size schemes, quasi-Newton methods and alternating direction method of multipliers (ADMM) algorithms, under a variety of experimental settings. Simulation experiments with a random Gaussian matrix, a canonical discrete Fourier transform (DFT) matrix, a masked DFT matrix and an empirical transmission matrix demonstrate the following. 1) As expected, algorithms based on the Poisson ML model consistently produce higher quality reconstructions than algorithms derived from Gaussian noise ML models when applied to low-count data. 2) For unregularized cases, our proposed WF algorithm with Fisher information for step size converges faster than other WF methods, e.g., WF with empirical step size, backtracking line search, and optimal step size for the Gaussian noise model; it also converges faster than the quasi-Newton method. 3) In regularized cases, our proposed WF algorithm converges faster than WF with backtracking line search, quasi-Newton, MM and ADMM.
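As background for the cost function being minimized (a standard formulation, not a restatement of the paper's exact algorithm), the Poisson negative log-likelihood for measurements $y_i$ with system vectors $a_i$ and known background counts $b_i$ is, up to constants,
\[
f(x) = \sum_{i=1}^{M} \Bigl[\bigl(|a_i^{H} x|^{2} + b_i\bigr) - y_i \log\bigl(|a_i^{H} x|^{2} + b_i\bigr)\Bigr],
\]
and a Wirtinger flow iteration takes the form $x_{k+1} = x_k - \mu_k \nabla f(x_k)$; the contribution described above includes choosing the step size $\mu_k$ from the observed Fisher information rather than by manual tuning or line search.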
Shihao Wu
PhD Student
Statistics
L0 Constrained Approaches in Learning High-dimensional Sparse Structures: Statistical Optimality and Optimization Techniques
Abstract
Sparse structures are ubiquitous in high-dimensional statistical models. To learn sparse structures from data, non-L0 penalized approaches have been widely used and studied over the past two decades. L0 constrained approaches, however, have been understudied due to their computational intractability, and they have recently regained attention thanks to algorithmic advances in the optimization community and hardware improvements. In this talk, we compare L0 constrained approaches with non-L0 penalized approaches in terms of feature selection in high-dimensional sparse linear regression. Specifically, we focus on false discoveries in the early stage of the solution path, which tracks how features enter and leave the model for a selection approach. Su et al. (2017) showed that LASSO, as a non-L0 penalized approach, suffers from false discoveries in the early stage. We show that best subset selection, as an L0 constrained approach, results in fewer or even zero false discoveries throughout the early stage of the path. We also identify the optimal condition under which best subset selection achieves exactly zero false discoveries, which we refer to as sure early selection. Moreover, we show that to achieve sure early selection, one does not need to obtain an exact solution to best subset selection; a solution within a tolerable optimization error suffices. Extensive numerical experiments also demonstrate the advantages of L0 constrained approaches over non-L0 penalized approaches along the solution path.
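For concreteness, the two families of approaches being contrasted can be written in their standard forms:
\[
\hat{\beta}_{\mathrm{BSS}} \in \arg\min_{\beta} \ \tfrac{1}{2n}\|y - X\beta\|_2^2 \ \ \text{s.t.}\ \ \|\beta\|_0 \le k,
\qquad
\hat{\beta}_{\mathrm{LASSO}} \in \arg\min_{\beta} \ \tfrac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,
\]
where the L0 "norm" counts nonzero coefficients; varying $k$ (or $\lambda$) traces out the solution path whose early stage is studied above.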
Shushu Zhang
PhD Student
Statistics
Estimation and Inference for High-dimensional Expected Shortfall Regression
Abstract
The expected shortfall (also known as the superquantile), defined as the average over the tail below (or above) a certain quantile of a probability distribution, has been recognized as a coherent risk measure for characterizing the tail of a distribution in many applications such as risk analysis. Expected shortfall regression provides a powerful tool for learning the relationship between a response variable and a set of covariates while exploring the heterogeneous effects of the covariates. We are particularly interested in health disparity research, in which the lower/upper tail of the conditional distribution of a health-related outcome, given high-dimensional covariates, is of importance. To this end, we propose penalized expected shortfall regression with the lasso penalty to encourage the resulting estimator to be sparse. We establish explicit non-asymptotic bounds on estimation errors in increasing-dimension settings. To perform statistical inference on a covariate of interest, we propose a debiased estimator and establish its asymptotic normality for valid inference. We illustrate the finite-sample performance of the proposed methods through numerical studies and a data application on health disparity.
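As a reminder of the target quantity (standard definition; the penalized form shown is a sketch following the description above, not the authors' exact estimator), the lower-tail expected shortfall at level $\tau$ is
\[
\mathrm{ES}_\tau(Y \mid X=x) = \frac{1}{\tau}\int_0^{\tau} Q_u(Y \mid X=x)\, du
= E\bigl[\,Y \mid Y \le Q_\tau(Y\mid X=x),\, X=x\,\bigr],
\]
and a lasso-penalized expected shortfall regression takes the generic form $\hat{\beta} \in \arg\min_\beta \widehat{R}_\tau(\beta) + \lambda\|\beta\|_1$ for a suitable expected-shortfall loss $\widehat{R}_\tau$.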
5-min Speed Oral Presentation
March 9th 2:00PM-3:45PM @ Amphitheatre
Jiazhi Yang
Master’s Student
Survey and Data Science
Weighting Adjustments for Person-Day Nonresponse: An Application to the National Household Food Acquisition and Purchase Survey
Abstract
Multi-day diary surveys, such as the U.S. National Household Food Acquisition and Purchase Survey (FoodAPS), request that participants provide data on a daily basis. These surveys are therefore subject to various nonsampling errors, especially daily nonresponse. Standard post-survey nonresponse adjustment methods include weighting and imputation; both require auxiliary information available for the entire eligible sample to reduce nonresponse bias in estimates. Previous research using FoodAPS has focused on imputation. In this study, we explore a new weighting methodology that constructs person-day level weights based on an individual's nonresponse pattern, using FoodAPS data as a case study. Notably, the nonresponse-adjusted household weights in the FoodAPS public-use dataset are not sufficient for addressing day-level nonresponse in multi-day diary surveys. We first analyze the relationship between the day-level response pattern and auxiliary variables related to key outcomes such as food acquisition events and expenditures, and then use several analytic approaches, such as logistic regression and classification trees, to predict the response propensity for each person on each day. Finally, we use the model with the highest prediction accuracy to construct individual-level weights for each day and compare the daily estimates before and after weighting. Our results indicate that the logistic model tends to outperform the classification tree approach, yielding the highest prediction accuracy. Regarding the estimates, we find that estimates applying the adjusted person-level weights have smaller standard errors than the unweighted estimates, and the daily weights also introduce shifts in the daily estimates.
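A minimal sketch of the person-day propensity-weighting idea described above, assuming a hypothetical person-day data frame with a response indicator and illustrative auxiliary variables (the column names are invented for illustration; this is not the FoodAPS analysis code):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical person-day records: one row per sampled person per diary day.
day_df = pd.DataFrame({
    "responded":   [1, 0, 1, 1, 0, 1],           # 1 = provided diary data that day
    "day_of_week": [1, 2, 3, 5, 6, 7],
    "hh_size":     [2, 2, 4, 1, 3, 3],
    "snap":        [0, 0, 1, 1, 0, 1],            # illustrative auxiliary variable
    "base_weight": [1200.0, 1200.0, 800.0, 950.0, 700.0, 700.0],
})

X = day_df[["day_of_week", "hh_size", "snap"]]
y = day_df["responded"]

# Model the day-level response propensity (a classification tree could be swapped in).
propensity = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Inverse-propensity adjustment of the base weight for responding person-days.
day_df["adj_weight"] = day_df["base_weight"] / propensity
print(day_df.loc[day_df["responded"] == 1, ["base_weight", "adj_weight"]])
```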
Rupam Bhattacharyya
PhD Student
Biostatistics
BaySyn: Bayesian Evidence Synthesis for Multi-system Multiomic Integration
Abstract
The discovery of cancer drivers and drug targets is often limited to individual biological systems, e.g., cancer models or patients. While multiomic patient databases have sparse drug response data, model system databases provide smaller lineage-specific sample sizes, resulting in reduced power to detect functional drivers and their associations with drug sensitivity. Hence, integrating evidence across model systems can more efficiently deconvolve cancer cellular mechanisms and learn therapeutic associations. To this end, we propose BaySyn, a hierarchical Bayesian evidence synthesis framework for multi-system multiomic integration. BaySyn detects functional driver genes based on their associations with upstream regulators and uses this evidence to calibrate Bayesian variable selection models in the outcome layer. We apply BaySyn to multiomic datasets from CCLE and TCGA across pan-gynecological cancers. Our mechanistic models implicate several functional genes, such as PTPN6 and ERBB2, in the KEGG adherens junction gene set. Further, at similar Type I error control, our outcome model makes more discoveries in drug response models than uncalibrated models, such as BCL11A (breast) and FGFRL1 (ovary).
Kevin Smith
PhD Student
Industrial and Operations Engineering
Leveraging Observational Data to Estimate Adherence-Improving Treatment Effects for Stone Formers
Abstract
Randomized controlled trials (RCTs) are the gold-standard protocol for evaluating medication effectiveness on a primary outcome. Once the medication effect is established, it can be prescribed to patients, but many patients, such as those prescribed preventative medication for kidney stone formation, exhibit insufficient medication filling patterns (which we define as adherence) and thus will not realize the medication's RCT-cited outcome benefit. Insufficient adherence to preventative medications has been observed in a variety of medical domains and may justify a demand for adherence-improving interventions. Before adopting an adherence-improving intervention for clinical practice, an RCT may be proposed to evaluate its effect. Besides their prohibitively high costs, RCTs also suffer from a monitoring bias that may bias upward the estimated causal effect of an adherence-improving intervention and thus may not be appropriate in this setting. Instead, we leverage observational data, together with weighted outcome regression, subclassification, and an augmented inverse probability weighting method, to estimate the average treatment effect of an intervention that may be used to improve medication adherence in stone formers. We also discuss future opportunities for maximizing the utility of observational data using powerful statistical estimators.
Zeyu Sun
PhD Student
EECS
Event Rate Based Recalibration of Solar Flare Prediction
Abstract
Solar flare forecasting can be prone to prediction bias, producing poorly calibrated and unreliable predictions. A major cause of prediction bias is varying event rates between the training and testing phases, which arise either from data collection (e.g., collecting the training set and the test set from different phases of a solar cycle) or from data processing (e.g., resampling the training data to tackle class imbalance). Much of the research on machine learning methods for solar flare forecasting has not addressed the prediction bias caused by shifting event rates. In this paper, we propose a simple yet effective calibration technique for the case when the event rate in the test set is known, and an Expectation-Maximization algorithm for the case when it is unknown. We evaluate our calibration technique on various machine learning methods (logistic regression, quadratic discriminant analysis, decision trees, gradient boosting trees, and multilayer perceptrons) trained and evaluated on Space-Weather HMI Active Region Patches (SHARPs) from 2010 to 2022, roughly covering Solar Cycle 24. The experimental results show that, under event rate shift, the proposed calibration generally improves reliability, discrimination, and various skill scores as measured by the Brier skill score (BSS) and Heidke skill score (HSS). While the proposed calibration decreases the true skill statistic (TSS), which is insensitive to the event rate on the target domain, it preserves the peak TSS obtained by varying the probability threshold. We provide a decision-theoretic explanation for the degradation of TSS. Our method is quite general and can be applied to a wide variety of probabilistic flare forecast methods to improve their calibration and discrimination.
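When the test-set event rate is known, a simple closed-form prior-shift correction exists (the standard adjustment for a shifted class prior; shown here as a sketch, not necessarily the paper's exact formula):

```python
import numpy as np

def recalibrate(p, pi_train, pi_test):
    """Prior-shift correction of predicted flare probabilities.

    Re-weights the positive-class probability by the ratio of target to
    training event rates and renormalizes (standard class-prior adjustment).
    """
    p = np.asarray(p, dtype=float)
    num = p * (pi_test / pi_train)
    den = num + (1.0 - p) * ((1.0 - pi_test) / (1.0 - pi_train))
    return num / den

# Example: scores trained on a resampled set with 50% positives,
# recalibrated to a hypothetical 3% flare event rate in the test period.
scores = np.array([0.2, 0.5, 0.8])
print(recalibrate(scores, pi_train=0.5, pi_test=0.03))
```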
Sehong Oh
Master’s Student
EECS
Anomaly detection via Pattern dictionary and Atypicality
Abstract
Anomaly detection is becoming increasingly important in diverse research areas, for example to detect fraud or reduce risk. However, anomaly detection is a difficult task due to the lack of anomalous data. For this reason, we propose a data-compression-based anomaly detection method for unlabeled time series and sequence data. The method constructs two features, typicality and atypicality, to distinguish anomalies using clustering methods. The typicality of a test sequence is calculated by how well the sequence is compressed by a pattern dictionary built from the frequencies of all patterns in a training sequence. Typicality itself can be used as an anomaly score to detect anomalous data with a certain threshold. To improve the performance of the pattern dictionary method, we use atypicality to identify sequences that are better compressed on their own than by the pattern dictionary. The typicality and atypicality of each sub-sequence in the test sequence are then calculated, and anomalous sub-sequences are identified by clustering them. Thus, we improve the pattern dictionary method by incorporating typicality and atypicality without thresholds. Namely, our proposed method can cover more anomalous cases than when only typicality or atypicality is considered.
Kiran Kumar
PhD Student
Biostatistics
Meta Imputing Low Coverage Ancient Genomes
Abstract
Background: Low coverage imputation is an essential tool for downstream analysis in ancient DNA (aDNA). However, appropriate reference panels are difficult to obtain. Only small numbers of high-quality ancient sequences are available and imputing aDNA from modern day reference panels leads to reference bias, especially for older non-European samples.
Methods: We evaluate meta-imputation, an imputation strategy that combines estimates from multiple reference panels using dynamically estimated weights, as a tool to optimize the imputation of aDNA samples. We meta-impute by combining imputation results from a smaller, more targeted ancient panel with a larger, primarily European-ancestry reference panel (Haplotype Reference Consortium).
Results: We expect that our low coverage meta-imputed samples provide higher allelic R^2 and higher genomic concordance, thus increasing the accuracy of imputation in ancient samples.
Significance: Increasing the accuracy of imputation in ancient samples allows for better performance in downstream analyses such as PCA or estimating Runs of Heterozygosity (ROH), enlarging our understanding of our shared ancient past.
Mengqi Lin
PhD Student
Statistics
Identifiability of Cognitive Diagnostic Models with polytomous responses
Abstract
Cognitive Diagnostic Models (CDMs) are a powerful tool that allows researchers and practitioners to learn fine-grained diagnostic information about respondents' latent attributes. Within this framework, increasing attention is being paid to polytomous response data as polytomous tests become more and more popular. As in many other latent variable models, identifiability is crucial for consistent estimation of the model parameters and valid statistical inference. However, existing identifiability results are mostly focused on binary responses, and identifiability for models with polytomous responses has scarcely been considered. This paper fills this gap and provides a sufficient and necessary condition for the identifiability of the basic and popular DINA model with polytomous responses.
Yumeng Wang
PhD Student
Statistics
ReBoot: Distributed statistical learning via refitting Bootstrap samples
Abstract
With the data explosion in the digital era, it is common for modern data to be distributed across multiple or even a large number of sites. However, there are two salient challenges in analyzing decentralized data: (a) communication of large-scale data between sites is expensive and inefficient; (b) data are not allowed to be shared for privacy or legal reasons. To address these challenges, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. Theoretically, we analyze the statistical rate of ReBoot for generalized linear models (GLMs) and noisy phase retrieval, which represent convex and non-convex problems, respectively. ReBoot achieves the full-sample statistical rate in both cases whenever the subsample size is not too small. In particular, we show that the systematic bias of ReBoot, the error that is independent of the number of subsamples, is O(n^{-2}) in GLMs, where n is the subsample size. A simulation study illustrates the statistical advantage of ReBoot over competing methods, including averaging and CSL (Communication-efficient Surrogate Likelihood). In addition, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks for image classification, which exhibits substantial superiority over FedAvg within early rounds of communication.
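A minimal sketch of one plausible reading of the refit-on-bootstrap idea for a logistic GLM: each site fits a local model and ships only its coefficients; the center simulates synthetic samples from each local fit, pools them, and refits. This is an illustration under our own simplifying assumptions (e.g., synthetic covariates regenerated from a standard normal), not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_sites, n_per_site, n_boot = 5, 4, 500, 2000
beta_true = rng.normal(size=d)

def simulate(n, beta):
    X = rng.normal(size=(n, d))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    return X, y

# Step 1: each site fits a local logistic GLM and communicates only its coefficients
# (a large C makes the fit essentially unpenalized maximum likelihood).
local_coefs = []
for _ in range(n_sites):
    X, y = simulate(n_per_site, beta_true)
    local_coefs.append(LogisticRegression(C=1e6, max_iter=1000).fit(X, y).coef_.ravel())

# Step 2: the center draws synthetic (bootstrap) samples from each fitted local model
# and pools them.
Xb, yb = [], []
for coef in local_coefs:
    Xs, ys = simulate(n_boot, coef)
    Xb.append(Xs)
    yb.append(ys)
X_pool, y_pool = np.vstack(Xb), np.concatenate(yb)

# Step 3: refit on the pooled synthetic sample to obtain the one-shot distributed estimate.
beta_reboot = LogisticRegression(C=1e6, max_iter=1000).fit(X_pool, y_pool).coef_.ravel()
print(np.round(beta_true, 2))
print(np.round(beta_reboot, 2))
```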
Yifan Hu
Master’s Student
Statistics
Establishing An Optimal Individualized Treatment Rule for Pediatric Anxiety with Longitudinal Modeling for Evaluation
Abstract
An Individualized Treatment Rule (ITR) is a special case of a dynamic treatment regimen that inputs information about a patient and recommends a treatment based on this information. This research contributes to the literature by estimating an optimal ITR, one that maximizes a pre-specified outcome, the Pediatric Anxiety Rating Scale (PARS), to guide which of two common treatments to provide for children/adolescents with separation anxiety disorder (SAD), generalized anxiety disorder (GAD), or social phobia (SOP): sertraline medication (SRT) or cognitive behavior therapy (CBT). We use data from the Child and Adolescent Anxiety Multimodal Study (CAMS), a completed federally funded, multi-site, randomized placebo-controlled trial in which 488 children with anxiety disorders were randomized to CBT, SRT, their combination (COMB), or pill placebo (PBO). There are four steps to the analysis: (1) split the data into training (70%) and evaluation (30%) sets and transform and scale the response PARS. In the training data set: (2) screen the baseline covariates, consisting of patients' demographic information and historical clinical records, according to their contribution to PARS using a specified variable screening algorithm; (3) establish an interpretable and parsimonious ITR based on the screened covariates to guide clinicians in deciding personalized treatment plans for patients with pediatric anxiety disorders. In the evaluation data set: (4) evaluate the effectiveness of the ITR versus the traditional treatments (SRT only, CBT only, and COMB) for pediatric anxiety disorder using causal effect estimation based on comparisons of the longitudinal trajectories of clinical outcomes from linear mixed-effects models. Our initial results are promising for two reasons: (1) the final ITR is simple, relying on only the two most significant covariates, which should be feasible and easy to understand for clinicians in a real-world trial; (2) the longitudinal evaluation shows that the ITR is non-inferior to the best-performing treatment in the trial, COMB.
Xinyu Liang
Master’s Student
Statistics
Social Network Analysis of Securities Analysts' Academic Networks and Their Impact on Analyst Performance: Star Analysts in the Banking Industry as an Example
Abstract
It is known that alumni and co-working relationships between analysts and company executives improve analysts' performance. This paper, however, focuses on networks among analysts, taking star securities analysts in the banking industry as an example to explore the impact of academic cooperation networks on the performance of securities analysts. Using the Wilcoxon rank test, I verify that analysts use academic networks to interact, promoting the exchange of information and enhancing their research ability; therefore, the existence of academic cooperation networks among analysts positively affects their performance. At the same time, I find that an overly large academic cooperation network can have a negative effect on analysts' performance, since analysts may engage in excessive information exchange that undermines their independent judgment. Through Monte Carlo simulation, I show that the formation of academic cooperation networks is random to some extent, that is, it fits a small-world model. Using Cramér's V and Rajski's method to examine the drivers of network formation, I find that gender is one factor in the formation of analysts' academic networks (male analysts are more likely to form a network) and that analysts' working relationships also contribute to the formation of their academic networks.
Robert Malinas
PhD Student
EECS
An Improvement on the Hotelling T^2 Test Using the Ledoit-Wolf Nonlinear Shrinkage Estimator
Abstract
Hotelling's T^2 test is a classical approach for comparing the means of two multivariate normal samples that share a population covariance matrix. Hotelling's test is not ideal for high-dimensional samples because the eigenvalues of the estimated sample covariance matrix are inconsistent estimators of their population counterparts. We replace the sample covariance matrix with the nonlinear shrinkage estimator of Ledoit and Wolf (2020). We observe empirically for sub-Gaussian data that the resulting algorithm dominates past methods (Bai and Saranadasa 1996, Chen and Qin 2010, and Li et al. 2020) for a family of population covariance matrices that includes matrices with high or low condition number and many or few nontrivial (i.e., spiked) eigenvalues. Finally, we present performance estimates for the test.
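For reference, the classical two-sample statistic being modified has the standard form
\[
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^\top \widehat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_2),
\]
where $\widehat{\Sigma}$ is ordinarily the pooled sample covariance matrix; the proposal above replaces $\widehat{\Sigma}$ with the Ledoit-Wolf nonlinear shrinkage estimator so that the inverse is better behaved when the dimension is comparable to the sample sizes.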
Huda Bashir
Master’s Student
Epidemiology & Public Policy
Racial residential segregation and adolescent birth rates in Brazil: a cross-sectional study in 152 cities, 2014-2016
Abstract
Background: In Brazil, the Adolescent Birth Rate (ABR) is 55 live births per 1,000 adolescent women 15-19 years old, higher than the global ABR. Few studies have examined manifestations of structural racism such as racial residential segregation (RRS) in relation to ABR.
Methods: Using pooled data on ABR (2014-2016), we examined the association between RRS and ABR in 152 Brazilian cities using generalized estimating equations. RRS was measured for each city using the isolation index for black/brown Brazilians and was included in models as a categorical variable (low: ≤0.3; medium: 0.3–0.6; high: >0.6). ABR was defined as the number of live births per 1,000 adolescent women aged 15-19. We fit a series of regression models, subsequently adjusting for city-level characteristics that may confound or partially explain the association between RRS and ABR.
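For context, the isolation index referenced above commonly takes the standard form (shown for a city partitioned into areal units; notation ours):
\[
{}_x P_x^{*} \;=\; \sum_{i=1}^{I} \frac{b_i}{B}\cdot\frac{b_i}{t_i},
\]
where $b_i$ is the black/brown population of areal unit $i$, $B=\sum_i b_i$ is the city's total black/brown population, and $t_i$ is the total population of unit $i$; values near 1 indicate that group members live mainly among other group members.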
Findings: The ABR in our study sample was 54 live births per 1,000 adolescent women. We observed 27 cities with low RRS, 75 cities with medium RRS, and 50 with high RRS. After adjustment for city-level characteristics, medium RRS was associated with a 15% higher ABR (risk ratio (RR): 1.15 [1.13, 1.16]) and high RRS with a 10% higher ABR (RR: 1.10 [1.00, 1.20]) compared to cities with low RRS.
Interpretation: RRS is significantly associated with ABR independent of other city-level characteristics. These findings have implications for future policies and programs designed to reduce racial inequities in ABR in Brazilian cities.
Stephanie Morales
PhD Student
Survey and Data Science
Assessing Cross-Cultural Comparability of Self-Rated Health and Its Conceptualization through Web Probing
Abstract
Self-rated health (SRH) is a widely used question across different fields, as it is simple to administer yet has been shown to predict mortality. SRH asks respondents to rate their overall health, typically using Likert-type response scales (i.e., excellent, very good, good, fair, poor). Although SRH is commonly used, few studies have examined its conceptualization from the respondents' point of view, and even fewer have examined differences in its conceptualization across diverse populations. This study aims to assess the comparability of SRH across different cultural groups by investigating the factors that respondents consider when responding to the SRH question. We included an open-ended probe asking what respondents thought about when responding to SRH in web surveys conducted in five countries: Great Britain, Germany, the U.S., Spain, and Mexico. In the U.S., we targeted six racial/ethnic and linguistic groups: English-dominant Koreans, Korean-dominant Koreans, English-dominant Latinos, Spanish-dominant Latinos, non-Latino Black Americans, and non-Latino White Americans. Our coding scheme was first developed for English responses and adapted to fit additional countries and languages. Among the four English-speaking coders, two were also fluent in Spanish, one in German, and one in Korean; coders translated non-English responses into English before coding, and all responses were coded. One novelty of our study is allowing multiple attribute codes (e.g., health behaviors, illness) per respondent, along with a tone (e.g., in the direction of positive or negative health, or neutral) for each attribute, which allows us 1) to assess respondents' thinking processes holistically and 2) to examine whether and how respondents mix attributes. Our study compares the number of reported attributes and tone across cultural groups and integrates SRH responses in the analysis. This study aims to provide a deeper understanding of SRH by revealing the cognitive processes among diverse populations and is expected to shed light on its cross-cultural comparability.
Declan McNamara
PhD Student
Statistics
Likelihood-Free Inference for Deblending Galaxy Spectra
Abstract
Many galaxies overlap visually with other galaxies from the vantage point of Earth. As new astronomical surveys peer deeper into the universe, the proportion of overlapping galaxies, called blends, that are detected will increase dramatically. Undetected blends can lead to errors in the estimates of cosmological parameters, which are central to the field of cosmology. To detect blends, we propose a generative model based on a state-of-the-art simulator of galaxy spectra and adopt a recently developed likelihood-free approach to performing approximate posterior inference. Our experiments demonstrate the potential of our method to detect blends in high-resolution spectral data from the Dark Energy Spectroscopic Instrument (DESI).
Poster Session
March 9th 3:45PM-5:30PM @ East/West Conference Room
1
Wenchu Pan
Master’s Student
Biostatistics
Small Sample Adjustments of Variance Estimators in Clustered Dynamic Treatment Regimen
Abstract
Dynamic interventions are becoming more and more popular for improving the outcomes of multi-stage or longitudinal treatments. A Dynamic Treatment Regime (DTR) is a sequence of pre-specified decision rules based on the outcomes of treatment and the covariates of a subject or group. The sequential multiple-assignment randomized trial (SMART) is a useful tool for studying cluster-level dynamic treatment regimens, in which randomization occurs at the cluster level and outcomes are observed at the individual level. In this paper, we review several GEE variance estimators adjusted for small-sample inference and adapt them to SMARTs. The main challenges lie in the inverse probability weighting and the counterfactual nature of the GEE estimating equations. Through multiple simulations, we evaluate the performance of different working assumptions and estimators across parameter settings, and find that some simple adjustments, such as the degrees-of-freedom adjustment and estimated weights, can greatly improve the small-sample performance of sandwich estimators.
2
Cody Cousineau
PhD Student
Nutritional Sciences
Cross-sectional association between blood cholesterol and calcium levels in genetically diverse strains of mice
Abstract
Genetically diverse outbred mice allow for the study of genetic variation in the context of high dietary and environmental control. Using a machine learning approach, we investigated clinical and morphometric factors that associate with serum cholesterol levels in 844 genetically unique mice of both sexes, fed either a control chow or a high-fat, high-sucrose diet. We find the expected elevations of cholesterol in male mice, in mice with elevated serum triglycerides, and in mice fed the high-fat, high-sucrose diet. The third strongest predictor was serum calcium, which correlated with serum cholesterol across both diets and sexes (r=0.39-0.48). This is in line with several human cohort studies that show associations between calcium and cholesterol and identify calcium as an independent predictor of cardiovascular events.
3
Xinyu Zhang
PhD Student
Survey and Data Science
Dynamic Time-to-Event Models for Future Call Attempts Required Until Interview or Refusal
Abstract
The rising cost of survey data collection is an ongoing concern. Cost predictions can be used to make more informed decisions about allocating resources efficiently during data collection. However, telephone surveys typically do not provide a direct measure of case-level costs. As an alternative, we propose using the number of call attempts as a proxy cost indicator. To improve cost predictions, we dynamically adjust predictive models for the number of future call attempts required until interview or refusal during the nonresponse follow-up. This updating is achieved by fitting models on the training-set cases that are still unresolved at the cutoff point, which allows us to incorporate additional paradata collected on each case at later call attempts. We use data from the Health and Retirement Study to evaluate the ability of alternative models to predict the number of future call attempts required until interview or refusal. These models include a baseline model with only time-invariant covariates (discrete-time hazard regression), accelerated failure time regression, survival trees, and Bayesian additive regression trees within the framework of accelerated failure time models.
4
Savannah Sturla
PhD Student
Environmental Health Sciences
Urinary paraben and phenol concentrations associated with inflammation markers among pregnant women in Puerto Rico
Abstract
Exposure to phenols and parabens may contribute to increased maternal inflammation and adverse birth outcomes, but these effects are not well studied in humans. This study aimed to investigate relationships between concentrations of 8 phenols and 4 parabens and 6 inflammatory biomarkers (C-reactive protein, matrix metalloproteinases (MMP) 1, MMP2, MMP9, intercellular adhesion molecule-1 (ICAM-1), and vascular cell adhesion molecule-1) repeatedly measured across pregnancy in the Puerto Rican PROTECT birth cohort. Exposures were measured using tandem mass spectrometry in spot urine samples. Serum inflammation biomarkers were measured using customized Luminex assays. Linear mixed models and multivariate regression models were used, adjusting for covariates of interest. Effect modification by fetal sex and study visit was also tested. Results are expressed as the percent change in outcome per interquartile-range increase in exposure. In preliminary analyses, significant negative associations were found, for example, between triclosan and MMP2 (-6.18%, CI: -10.34, -1.82), benzophenone-3 and ICAM-1 (-4.21%, CI: -7.18, -1.15), and bisphenol-A and MMP9 (-5.12%, CI: -9.49, -0.55). Fetal sex and study visit significantly modified several associations. We are additionally exploring these associations using mixture methods, through the summation of the different phenol and paraben metabolites and adaptive elastic net. Thus far, our results suggest that phenols and parabens may disrupt inflammatory processes pertaining to uterine remodeling and endothelial function, with important implications for pregnancy outcomes. The negative relationships may indicate that these exposures contribute to the downregulation of regulatory immune cells or to inflammatory imbalances through downstream mechanisms. More research is needed to further understand these immune responses in an effort to improve reproductive and developmental outcomes.
5
Youqi Yang
Master’s Student
Biostatistics
What can we learn from observational data riddled with selection bias? A case study using the COVID-19 Trends and Impact Survey on COVID-19 vaccine uptake and hesitancy in India and the US during 2021
Abstract
Recent years have witnessed rapid growth in the application of online surveys in epidemiological studies and policy-making processes. However, given their non-probabilistic designs, the validity of any statistical inference is challenged by potential selection bias, including coverage bias and non-response bias. In the context of COVID-19 vaccination, the existence of official benchmark data gives researchers the opportunity to quantify the estimation error and the resulting effective sample size. In this study, we first compared estimates of the Indian adult COVID-19 vaccination rate between May and September 2021 from a large non-probabilistic survey, the COVID-19 Trends and Impact Survey (CTIS; average weekly sample size = 25,000), and a small probabilistic survey, the Center for Voting Options and Trends in Election Research survey (CVoter; average weekly sample size = 2,700), against benchmark data from the COVID Vaccine Intelligence Network (CoWIN). We found that CTIS overestimated the overall vaccine uptake and had a surprisingly smaller effective sample size than CVoter. In the second part, we investigated whether the non-probabilistic survey, CTIS, could provide more accurate estimates for the following quantities than for the overall vaccination rate: (1) successive differences and relative successive differences, (2) gender differences, (3) times of abrupt change, and (4) model-assisted estimates of vaccine hesitancy using vaccine uptake. We found that although CTIS overestimated the overall vaccination rate, it estimated these four sets of parameters better. Our study confirms the disparity between benchmark data and non-probabilistic surveys found in previous studies. We also show that a non-probabilistic survey can provide more accurate estimates in certain settings despite its biased selection mechanism.
6
Karen (Kitty) Oppliger
Master’s Student
Nutritional Sciences
Differential disability risk among gender and ethnic groups by substitution of animal- with plant-protein
Abstract
Background: Disability in middle age has increased in prevalence in the US in recent decades, related to chronic illness and non-communicable disease. High intake of animal protein has been associated with many cardiometabolic and cognitive disorders that may contribute to disability risk.
Objective: This study evaluated the differential effect of plant protein intake on disability outcomes among different demographic subgroups.
Methods: Data from 1,983 adults aged 55-64 were acquired from the longitudinal Health and Retirement Study (2012-2018) and used to conduct hazard ratio analyses. Intakes were adjusted using the nutrient density approach to evaluate the substitution of animal protein with plant protein and were stratified among subgroups.
Conclusions: Protective effects of plant protein intake were observed but were not significant in most groups. Inconsistent significant decreases in disability outcomes occurred among women and Hispanic populations. Substitution of animal protein with plant protein had no significant effect on disability outcomes.
7
Irena Chen
PhD Student
Biostatistics
Individual variances as a predictor of health outcomes: investigating the associations between hormone variabilities and bone trajectories in the midlife
Abstract
Women are at increased risk of bone loss around the menopausal transition. The longitudinal relationships between hormones such as estradiol (E2) and follicle-stimulating hormone (FSH) and bone health outcomes are complex for women as they transition through midlife. Furthermore, the effect of individual hormone variabilities on predicting longitudinal bone health outcomes has not yet been well explored. We introduce a joint model that characterizes both mean individual hormone trajectories and the individual residual variances in the first submodel and then uses these estimates to predict bone health trajectories in the second submodel. We found that higher E2 variability was associated with faster decreases in bone mineral density (BMD) around the final menstrual period. Higher E2 variability was also associated with smaller increases in bone area as women transition through menopause. Higher FSH variability was associated with larger declines in BMD around menopause, but this association was moderated over time after the menopausal transition. Higher FSH variability was also associated with smaller increases in bone area post-menopause.
8
Qikai Hu
Master’s Student
Statistics
Simulation Study for Predicting Solar Flares with Machine Learning
Abstract
We systematically test the gap between operational prediction and research-based prediction of whether an active region (AR) will produce a flare of class Γ in the next 24 hr. We consider Γ to be the ≥M (strong flare) and ≥A (any flare) classes. The input features are time sequences of 20 magnetic parameters from the Space-Weather HMI Active Region Patches (SHARPs). The data are generated under a newly constructed simulation model, using the bootstrap and adjustable assumptions, built by analyzing ARs from June 2012 to June 2022 and their associated flares identified in the Geostationary Operational Environmental Satellite (GOES) X-ray flare catalogs. The skill scores from the test show statistically significant variation when different split methods, loss functions, and prediction methods from recently published papers are applied to simulated datasets with different positive-to-negative ratios and feature distributions.
9
Mallika Ajmani
Master’s Student
Epidemiology
Kidney Function is Associated with Cognitive Status in the United States Health and Retirement Study
Abstract
Background and Objectives: The kidneys and the brain are both susceptible to vascular damage due to similar anatomic and hemodynamic features. Several studies show that renal function is associated with brain health; however, most are limited by small sample sizes and lack representation of under-researched populations in the United States. In a large and diverse representative sample of older adults in the United States, we tested the association between the glomerular filtration rate (a marker of kidney function) and cognitive status.
Methods: This study used a cross-sectional 2016 sample of the United States Health and Retirement Study. Our analytical sample included 9,126 participants with complete information on important covariates of interest. Cognitive status was categorized as cognitively normal, cognitive impairment non-dementia (CIND), and dementia. The estimated glomerular filtration rate (eGFR) was computed using the CKD-EPI creatinine equation in two different ways: i) the standard implementation, and ii) ignoring the race multiplier. We used ANOVA to test associations between eGFR and cognition.
Results: Our sample consisted of 73.4% White and 17.9% Black participants with an average age of 69 years. The average eGFR for Black participants using the standard approach was 81.7 ml/min, compared to 70.5 ml/min when ignoring the race multiplier. The prevalence of dementia in our sample was 3.8%, and that of cognitive impairment non-dementia was 17%. Lower average eGFR values were observed among participants with dementia (66.54 ml/min) and CIND (69.33 ml/min) compared to those with normal cognition (76.19 ml/min) (p-value < 0.001).
Conclusions: Individuals with low eGFR had higher odds of experiencing worse cognition. Standard eGFR calculations may underestimate kidney impairment among Black participants. Next steps will be to test the association between GFR and cognitive status using multivariable analysis.
10
James Edwards
Undergraduate Student
Data Science
Financial and Information Aggregation Properties of Gaussian Prediction Markets
Abstract
Prediction markets offer an alternative to polls and surveys for the elicitation and combination of private beliefs about uncertain events. The advantages of prediction markets include time-continuous aggregation and score-based incentives for truthful belief revelation. Traditional prediction markets aggregate point estimates of forecast variables. However, exponential family prediction markets (Abernethy et al., 2014) provide a framework for eliciting and combining entire belief distributions of forecast variables. We study a member of this family, Gaussian markets, which combine the private Gaussian belief distributions of traders about the future realized value of some real random variable. Specifically, we implement a multi-agent simulation environment with a central Gaussian market maker and a population of Bayesian traders. Our trader population is heterogeneous, separated on two variables: informativeness, or how much information a trader privately possesses about the random variable, and budget. We provide novel methods for modeling both attributes: we model the trading decision as the solution to a constrained optimization problem, and informativeness as the degree of variance in the traders’ information sources. Within our market ecosystem, we analyze the impact of trader budget and informativeness, as well as the arrival order of traders, on the market’s convergence. We also study financial properties of the market such as trader compensation and market maker loss.
11
Qinmengge Li
PhD Student
Biostatistics
Bregman Divergence-Based Data Integration with Application to Polygenic Risk Score (PRS) Heterogeneity Adjustment
Abstract
Polygenic risk scores (PRS), developed as the sum of single-nucleotide polymorphisms (SNPs) weighted by the risk allele effect sizes estimated in published genome-wide association studies, have recently received much attention for genetic risk prediction. While successful for the Caucasian population, PRS based on minority population cohorts suffer from limited event rates, small sample sizes, high dimensionality, and low signal-to-noise ratios, exacerbating already severe health disparities. Due to population heterogeneity, direct trans-ethnic prediction that applies the Caucasian model to a minority population also has limited performance. As a result, it is desirable to design a data integration procedure that measures the differences between populations and optimally balances the information from them to improve the prediction stability for minority populations. A unique challenge here is that, due to data privacy, individual genotype data are not accessible for either the Caucasian population or the minority population. Therefore, new data integration methods based only on encrypted summary statistics are needed. To address these challenges, we propose a BRegman divergence-based Integrational Genetic Hazard Trans-ethnic (BRIGHT) estimation procedure to transfer the information learned from PRS across ancestries. The proposed method requires only published summary statistics and can be applied to improve the performance of PRS for ethnic minority groups, accounting for challenges including potential model misspecification, data heterogeneity, and data-sharing constraints. We provide the asymptotic consistency and weak oracle property of the proposed method. Simulations show the prediction and variable selection advantages of the proposed method when applied to heterogeneous datasets. A real data analysis constructing psoriasis PRS for a South Asian population also confirms the improved model performance.
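For readers unfamiliar with the divergence in the method's name, the Bregman divergence generated by a differentiable convex function $\phi$ is (standard definition)
\[
D_{\phi}(u, v) = \phi(u) - \phi(v) - \langle \nabla \phi(v),\, u - v\rangle,
\]
which recovers, e.g., the squared Euclidean distance for $\phi(u)=\|u\|_2^2$ and the Kullback-Leibler divergence for the negative entropy. As described above, a divergence of this type is used to quantify the between-population difference when balancing summary-level information across ancestries.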
12
Neophytos Charalambides
PhD Student
EECS
Approximate Matrix Multiplication and Laplacian Sparsifiers
Abstract
A ubiquitous operation in data science and scientific computing is matrix multiplication. However, it presents a major computational bottleneck when the matrix dimension is high, as can occur for large data sizes or feature dimensions. A common approach to approximating the product is to subsample row and column vectors from the two matrices and sum the rank-1 outer products of the sampled pairs. We propose a sampling distribution based on the leverage scores of the two matrices. We give a characterization of our approximation in terms of the Euclidean norm, analogous to that of an $\ell_2$-subspace embedding. We then show connections between our algorithm, $CR$-multiplication, and Laplacian spectral sparsifiers, which also have numerous applications in data science, and show how approximate matrix multiplication can be used to devise sparsifiers. We also review some applications where these approaches may be useful.
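A minimal NumPy sketch of sampled rank-1 ("CR") approximate multiplication with a leverage-score-based sampling distribution. The particular way the two matrices' leverage scores are mixed here is one plausible choice for illustration, not necessarily the paper's distribution; any strictly positive sampling probabilities yield an unbiased estimator after importance re-weighting.

```python
import numpy as np

def leverage_scores(M):
    """Row leverage scores of M via a thin SVD."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return np.sum(U**2, axis=1)

def cr_multiply(A, B, c, rng):
    """Sampled approximation of A @ B using c rescaled rank-1 outer products."""
    n = A.shape[1]
    # Sampling distribution from the leverage scores of A's columns and B's rows.
    lev = leverage_scores(A.T) + leverage_scores(B)
    p = lev / lev.sum()
    idx = rng.choice(n, size=c, p=p)
    approx = np.zeros((A.shape[0], B.shape[1]))
    for i in idx:
        approx += np.outer(A[:, i], B[i, :]) / (c * p[i])  # importance re-weighting
    return approx

rng = np.random.default_rng(0)
A, B = rng.normal(size=(50, 400)), rng.normal(size=(400, 30))
err = np.linalg.norm(cr_multiply(A, B, c=150, rng=rng) - A @ B) / np.linalg.norm(A @ B)
print(f"relative Frobenius error: {err:.3f}")
```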
13
Rui Nie
Undergraduate Student
Statistics
Exploring Machine Olfaction
Abstract
The sense of smell provides a basis for decisions in various aspects of life, including hygiene and safety assessment. Uncovering the relationship between odor percepts and molecular structures can advance our knowledge of our surrounding environments and of the neural structure of the human olfactory system. However, the mismatch between the structural similarity of molecules and their odor similarity complicates the olfaction problem. This study primarily exploited two psychophysical datasets in which trained and novice human subjects, respectively, reported subjective ratings for a selection of odor descriptors. To overcome the difficulties posed by the irregular structure of molecules, we compared the performance of classical machine learning regression methods on pre-computed physicochemical descriptors with graph neural networks (GNNs) on molecular representations. We found that a GNN trained from scratch on atoms and edge connectivity outperformed the best-performing random forest regression method in predicting human odor percepts for each of the datasets, and it also demonstrated better generalizability in transfer learning across the two datasets on correlated odor descriptors. The ability of GNNs to learn complex, general odor percepts from subjects across domain knowledge levels offers a new perspective on more efficient data utilization in the field of olfaction.
14
Simon Nguyen
Master’s Student
Statistics
Optimal full matching under a new constraint on the sharing of controls: Application in pediatric critical care
Abstract
Health policy researchers are often interested in the causal effect of a medical treatment in situations where randomization is not possible. Full matching on the propensity score (Gu & Rosenbaum, 1993) aims to emulate random assignment by placing observations with similar estimated propensity scores into sets with either one treated unit and one or more control units or one control unit and multiple treated units. Sets of the second type, with treated units forced to share a comparison unit, can be unhelpful from the perspective of statistical efficiency. The sharing of controls is often needed to achieve experiment-like arrangements, but optimal full matching is known to exaggerate the number of many-one matches that are necessary, generating lopsided matched sets and smaller effective sample sizes (Hansen, 2004). In this paper, we introduce an enhancement of the Hansen and Klopfer (2006) optimal full matching algorithm that counteracts this exaggeration by permitting treated units to share a control while limiting the number that may do so. The result is a more balanced matching structure that prioritizes 1:1 pairs over lopsided, many-to-one configurations of matched sets. This enhanced optimal full matching is then illustrated in a pilot study on the effects of Extracorporeal Membrane Oxygenation (ECMO) for the treatment of pediatric acute respiratory distress syndrome. Within this data-scarce pilot study, existing methods for limiting the sharing of controls have already resulted in an increased effective sample size. The present enhancement of Hansen and Klopfer's optimal full matching algorithm provides an additional benefit to both effective sample size and covariate balance.
15
Jialu Zhou
Master’s Student
Biostatistics
Application of Statistical Methodology in the High-Frequency Data Volatility Research
Abstract
As an important measure of the financial market, volatility is a central topic in high-frequency data research. Because volatility is not directly observable, good estimation of it is always of great significance. Under the assumption of zero measurement error, we construct a newly weighted realized volatility that controls the effect of microstructure noise in the estimation. Specifically, we first choose the best measurement frequency, which turns out to be 60 minutes based on a bias-variance balance. Second, the distribution of the hourly volatility data is fitted and inverse probabilities are calculated based on the Kullback-Leibler divergence. The updated realized volatility shows better predictive power in the traditional heterogeneous autoregressive (HAR) volatility model. A stratified HAR model, which combines GMM and model averaging, achieves a large improvement of about 50% in AIC based on the estimator we construct. This work shows the strong complementarity of parametric and non-parametric methods in high-frequency data research.
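For reference, the baseline heterogeneous autoregressive (HAR) model mentioned above is usually written in the standard Corsi-type form
\[
RV_{t+1} = \beta_0 + \beta_d\, RV_t + \beta_w\, \frac{1}{5}\sum_{i=0}^{4} RV_{t-i} + \beta_m\, \frac{1}{22}\sum_{i=0}^{21} RV_{t-i} + \varepsilon_{t+1},
\]
i.e., next-day realized volatility is regressed on daily, weekly, and monthly averages of past realized volatility; the work above feeds its weighted realized-volatility estimator into this type of model.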
16
Ziping Xu
PhD Student
Statistics
Adaptive Sampling for Discovery
Abstract
In this paper, we study a sequential decision-making problem called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal of maximizing the sum of responses. This problem has wide applications to real-world discovery problems, for example, drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma: the algorithm needs to choose points that yield information to improve model estimates, but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph, and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
17
Bach Viet Do
PhD Student
Statistics
Modeling Solar Flares’ Heterogeneity With Mixture Models
Abstract
The physics of solar flares on the surface of the Sun is highly complex and not yet fully understood. However, observations show that solar eruptions are associated with the intense kilogauss fields of active regions (ARs), where free energies are stored with field-aligned electric currents. With the advent of high-quality data sources such as the Geostationary Operational Environmental Satellites (GOES) and the Solar Dynamics Observatory (SDO)/Helioseismic and Magnetic Imager (HMI), recent work on solar flare forecasting has been focusing on data-driven methods. In particular, black-box machine learning and deep learning models are increasingly being adopted, in which underlying data structures are not modeled explicitly. If the active regions indeed follow the same laws of physics, there should be similar patterns shared among them, reflected by the observations. Yet, these black-box models currently used in the space weather literature do not explicitly characterize the heterogeneous nature of the solar flare data, within and between active regions. In this paper, we propose two finite mixture models designed to capture the heterogeneous patterns of active regions and their associated solar flare events. With extensive numerical studies, we demonstrate the usefulness of our proposed method for both resolving the sample imbalance issue and modeling the heterogeneity for solar flare events, which are strong and rare.
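For readers unfamiliar with the modeling device, a finite mixture assumes each observation (here, a flare-related feature vector for an active region) is drawn from one of K latent sub-populations (standard form):
\[
f(x) = \sum_{k=1}^{K} \pi_k\, f_k(x; \theta_k), \qquad \pi_k \ge 0,\ \ \sum_{k=1}^{K}\pi_k = 1,
\]
with the component densities $f_k$ and weights $\pi_k$ capturing the within- and between-region heterogeneity described above.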
18
Longrong Pan
Master’s Student
Survey and Data Science
Global Health Interactive Visualization
Abstract
This study explores the relationships between medical resources and the mortality of certain diseases by creating an interactive Shiny app to visualize global health data collected from the World Bank. The app supports spatio-temporal analysis by mapping values to colors over a world map, ranking the values of different regions with bar plots, and showing the trends in the relationships between medical resources and disease mortality with interactive bubble charts. By allowing users to customize the display for their own purposes, this study offers a practical visualization tool that shows multiple variables in an integrated way and thus presents more information in fewer dimensions.
19
Felipe Maia Polo
PhD Student
Statistics
Conditional independence testing under model misspecification
Abstract
Testing for conditional independence (CI) is a crucial and challenging aspect of contemporary statistics and machine learning. These tests are widely utilized in areas such as causal inference, algorithmic fairness, feature selection, and transfer learning. Many modern methods for conditional independence testing rely on powerful supervised learning methods to learn regression functions as an intermediate step. Although these methods are guaranteed to control Type I error when the supervised learning methods accurately estimate the regression function or Bayes predictor, their behavior when the supervised learning method fails due to model misspecification is not well understood. We study the performance of conditional independence tests based on supervised learning under model misspecification, proposing new approximations and upper bounds for the testing errors that depend explicitly on the misspecification errors. Finally, we introduce the Rao-Blackwellized Predictor Test (RBPT), a novel regression-based CI test that is robust against model misspecification: compared with the considered benchmarks, the RBPT can control Type I error under weaker assumptions while maintaining non-trivial power.
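For concreteness, a classical regression-based CI test, the partial-correlation test obtained by regressing X and Y on Z and correlating the residuals, is sketched below in Python; it is valid under linear-Gaussian assumptions and is the kind of regression-dependent procedure whose behavior under misspecification the abstract studies. It is not the proposed RBPT; the function name and toy data are illustrative.

```python
import numpy as np
from scipy import stats

def residual_corr_ci_test(x, y, z):
    """Test X independent of Y given Z under a linear-Gaussian working model:
    regress X and Y on Z, then t-test the correlation of the residuals."""
    Z1 = np.column_stack([np.ones(len(z)), z])
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    r, _ = stats.pearsonr(rx, ry)
    dof = len(x) - Z1.shape[1] - 1           # n - 2 - (number of conditioning variables)
    t = r * np.sqrt(dof / (1 - r**2))
    p = 2 * stats.t.sf(abs(t), dof)
    return r, p

rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + rng.standard_normal(500)
y = z + rng.standard_normal(500)             # X and Y are independent given Z
print(residual_corr_ci_test(x, y, z))
```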
20
Alexander Kagan
PhD Student
Statistics
Influence Maximization under Generalized Linear Threshold Models
Abstract
Influence Maximization (IM) is the problem of determining a fixed number of nodes that would maximize the spread of information through a network if they were to receive it, with applications in marketing, public health, etc. IM requires an information spread model together with model-specific edge parameters governing the transmission probability between nodes. In practice, these edge weights can be estimated from multiple observed information diffusion paths, e.g., retweets. First, we generalize the well-known Linear Threshold Model, which assumes each node has a uniformly distributed activation threshold, to allow for arbitrary threshold distributions. For this general model, we then introduce a likelihood-based approach to estimating the edge weights from diffusion paths, and prove that the IM problem can be solved with a natural greedy optimization algorithm without loss of the standard optimality guarantee. Extensive experiments on synthetic and real-world networks demonstrate that a good choice of threshold distribution combined with our algorithm for estimating edge weights significantly improves the quality of IM solutions.
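A minimal Python sketch of the simulate-and-greedily-select pipeline for threshold models is given below; the threshold distribution is pluggable, so swapping the uniform sampler for another distribution gives a generalized threshold model. The function names and the toy random graph are illustrative, and the likelihood-based estimation of edge weights from diffusion paths is not shown.

```python
import numpy as np

def simulate_lt_spread(W, seeds, threshold_sampler, rng):
    """One cascade of a (generalized) linear threshold model. W[u, v] is the influence
    weight of u on v; node v activates once the total weight of its active in-neighbors
    reaches its randomly drawn threshold."""
    n = W.shape[0]
    theta = threshold_sampler(n, rng)
    active = np.zeros(n, dtype=bool)
    active[list(seeds)] = True
    changed = True
    while changed:
        incoming = active.astype(float) @ W            # weight from active in-neighbors
        newly = (~active) & (incoming >= theta)
        changed = bool(newly.any())
        active |= newly
    return active.sum()

def greedy_im(W, k, threshold_sampler, n_sims=200, seed=0):
    """Greedy seed selection: add the node with the largest Monte Carlo estimate of
    marginal expected spread, k times."""
    rng = np.random.default_rng(seed)
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for v in range(W.shape[0]):
            if v in chosen:
                continue
            val = np.mean([simulate_lt_spread(W, chosen + [v], threshold_sampler, rng)
                           for _ in range(n_sims)])
            if val > best_val:
                best, best_val = v, val
        chosen.append(best)
    return chosen

# Toy usage: random sparse graph, uniform thresholds (the classical LT model).
rng = np.random.default_rng(1)
n = 30
W = (rng.random((n, n)) < 0.1) * rng.random((n, n)) * 0.3
np.fill_diagonal(W, 0.0)
W = W / np.maximum(W.sum(axis=0), 1.0)                 # keep each node's in-weight sum <= 1
uniform_thresholds = lambda n, rng: rng.random(n)
print(greedy_im(W, k=3, threshold_sampler=uniform_thresholds))
```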
15-min Oral Presentation II
March 10th 9:30AM-12:00PM @ Amphitheatre
Jeong Hin Chin
Undergraduate Student
Statistics
Using Statistical Methods to Predict Team Outcomes
Abstract
Team-based learning (TBL) was first introduced in the literature in 1982 as a solution to problems that arose from large class settings [1], [2]. Although TBL was first implemented in business schools, team-based pedagogy can now be found across engineering, medical, and social sciences programs all around the world. Even though TBL provides students and instructors with many benefits, not all students benefit equally from this learning method due to various reasons such as free-riders, work allocation, and communication issues [3], [4]. Thus, in order to ensure students are able to enjoy the benefits of TBL, teamwork assessment and support tools such as CATME or Tandem can be used to monitor the students’ performances and notice any changes within the team [4]–[6]. This study will utilize data collected by [team support tool removed for confidential review], a teamwork assessment and support tool capable of providing formative feedback to teams and team members [5]. In order to measure the changes within the teams and check on students’ progress, [team support tool removed for confidential review] collects students’ information through surveys. In this study, the authors are interested in three surveys that the tool collects, namely the “beginning of term” survey (BoT), the “end of term” survey (EoT), and the weekly team check surveys (TC). BoT and EoT include survey questions that require students to rate their relevant experiences and self-efficacy for project-related tasks in the class, as well as their preferences for approaching teamwork, such as procrastination, academic orientation, and extraversion. On the other hand, the weekly TC requires students to rate the team overall on five items (“working well,” “sharing of work,” “sharing of ideas,” “team confidence,” and “logistics/challenges”). [Tool name removed for review] was first implemented in 2019 and has collected responses from more than 5000 students. In this paper, roughly 3000 responses collected from first-year engineering students from 2019 to 2021 will be studied. The authors intend to use information from BoT and the weekly TC to predict, both weekly and at the end of the semester, whether the teams are working well. The predicted results will be compared to the students’ actual weekly and end-of-semester “working well” responses to determine the accuracy of the model. Since the team support tool will continue to be used in future semesters, the authors will take a Bayesian approach in building the models to ensure that past information about a parameter can be used to form a prior distribution for future analysis [7], [8]. The authors will also use cognitive diagnostic models (CDMs) to understand the relationship between the variables collected in the weekly team checks and student responses to the initial survey. CDMs are psychometric models that provide information about a person’s proficiency in solving particular sets of items [9]. The authors recognise that the survey questions in the TCs and BoT do not have correct answers and that answering them does not require any particular proficiency. Nonetheless, CDMs can still be used to capture the relationship between how the students perceive their team experience (questions in the weekly team checks) and how the students perceive their own personality and preferences (questions in the initial survey). Other studies have used CDMs to learn more about team formation and relationships [10] and about relationships between questions in surveys [11].
The Bayesian model will include cluster information obtained through the unsupervised learning method described in [5], together with the additional information obtained from the aforementioned CDM method. By combining these two pieces of information, we hope to predict how a team’s dynamics change and why. This information will allow faculty members who use the team support tool to better understand their students and to provide the feedback and guidance needed for a better team experience and greater success in the course.
Margaret Banker
PhD Student
Biostatistics
Regularized Simultaneous Estimation of Changepoint and Functional Parameter in Functional Accelerometer Data Analysis
Abstract
Accelerometry data enables scientists to extract personal digital features useful in precision health decision making. Existing analytic methods often begin by discretizing Physical Activity (PA) counts into activity categories via fixed cutoffs; however, these cutoffs are validated under restricted settings and cannot be generalized across studies. Here, we develop a data-driven approach to overcome this bottleneck in the analysis of PA data, in which we holistically summarize an individual’s PA profile using Occupation-Time Curves that describe the percentage of time spent at or above a continuum of activity levels. The resulting functional curve is informative for capturing time-course individual variability in PA. We investigate a functional analytic approach under L0 regularization, which handles the highly correlated micro-activity windows that serve as predictors in a scalar-on-function regression model. We develop a new one-step method that simultaneously conducts fusion via change-point detection and parameter estimation through a new L0 constraint formulation, which we evaluate via simulation experiments and a data analysis assessing the influence of PA on biological aging.
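The occupation-time summary itself is simple to compute: for a grid of activity levels, record the fraction of epochs at or above each level. A small Python sketch on toy minute-level counts follows; the grid, toy data, and function name are illustrative, and the L0-regularized scalar-on-function fit is not shown.

```python
import numpy as np

def occupation_time_curve(pa_counts, grid):
    """Fraction of wear-time epochs at or above each activity level in `grid`,
    a cutoff-free functional summary of a subject's activity profile."""
    pa = np.asarray(pa_counts, dtype=float)
    return np.array([(pa >= c).mean() for c in grid])

rng = np.random.default_rng(0)
pa = rng.gamma(shape=1.5, scale=200.0, size=1440)   # one day of minute-level counts (toy)
grid = np.linspace(0, 2000, 50)                     # continuum of activity levels
otc = occupation_time_curve(pa, grid)               # functional predictor for regression
print(otc[:5])
```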
Easton Huch
PhD Student
Statistics
Bayesian Randomization Inference: A Distribution-free Approach to Bayesian Causal Inference
Abstract
Randomization inference is a family of frequentist statistical methods that allow researchers to measure and test causal relationships without making any distributional assumptions about the outcomes; in fact, the potential outcomes are treated as known—not random—quantities. The key benefit of this approach is that it can be applied to virtually any causal analysis in which the assignment mechanism is known, regardless of the distribution of the outcome variable. The random treatment assignment itself justifies the statistical inference. In this talk, I develop a Bayesian framework for randomization inference with continuous outcomes that enjoys similar benefits to those listed above. As is typical of Bayesian methods, the approach allows for seamless uncertainty quantification of functions of parameters by integrating over the full parameter space. I illustrate the approach with examples and discuss possible extensions.
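For readers less familiar with randomization inference, the classical frequentist version that this Bayesian framework builds on can be sketched in a few lines of Python: re-draw the known assignment mechanism many times and compare the observed difference in means against the resulting reference distribution. The function name and toy data are illustrative; the Bayesian machinery itself is not shown.

```python
import numpy as np

def randomization_test(y, z, n_perm=10000, seed=0):
    """Fisher-style randomization test of no treatment effect under complete
    randomization: re-randomize assignments and recompute the difference in means."""
    rng = np.random.default_rng(seed)
    y, z = np.asarray(y, float), np.asarray(z, int)
    obs = y[z == 1].mean() - y[z == 0].mean()
    null = np.empty(n_perm)
    for b in range(n_perm):
        zb = rng.permutation(z)                       # known assignment mechanism
        null[b] = y[zb == 1].mean() - y[zb == 0].mean()
    return obs, np.mean(np.abs(null) >= abs(obs))     # two-sided p-value

rng = np.random.default_rng(1)
z = rng.permutation(np.repeat([0, 1], 50))
y = 0.5 * z + rng.standard_normal(100)
print(randomization_test(y, z))
```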
Daniele Bracale
PhD Student
Statistics
Semi-Parametric Non-Smoothing Optimal Dynamic Pricing
Abstract
In this paper, we study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise. Products are sold one at a time, and only a binary response indicating the success or failure of a sale is observed. Our model setting is similar to that of Javanmard et al., except that we expand the demand curve to a semi-parametric model and must dynamically learn both its parametric and non-parametric components, as in Fan et al. Our setting still differs from Fan et al., since we use no kernel smoothing but rather non-parametric estimates (MLE, LS) that avoid choosing a bandwidth. We propose a dynamic statistical learning and decision-making policy that combines semi-parametric estimation from a generalized linear model with an unknown link and online decision-making to minimize regret (maximize revenue). Under mild conditions, we show that for a market noise c.d.f. F with a second-order derivative, our policy achieves a regret upper bound of $\tilde{O}(T^{17/25})$, where T is the time horizon, which improves on Fan et al.
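To fix ideas about the feedback structure (posted price, binary sale indicator), here is a deliberately simplified explore-then-commit baseline in Python that assumes a fully parametric logistic-noise demand model; it is not the semi-parametric policy of the abstract, and the function name, price grid, and toy parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def explore_then_commit_pricing(X, theta_star, noise_rng, T_explore, price_grid, seed=0):
    """Post random prices to explore, fit a logistic sale model P(sale | x, p),
    then post the price maximizing estimated expected revenue p * P(sale | x, p)."""
    rng = np.random.default_rng(seed)
    hist_X, hist_y, revenue, clf = [], [], 0.0, None
    for t in range(X.shape[0]):
        x = X[t]
        if t < T_explore:
            p = rng.choice(price_grid)                              # exploration
        else:
            feats = np.column_stack([np.tile(x, (len(price_grid), 1)), price_grid])
            exp_rev = price_grid * clf.predict_proba(feats)[:, 1]
            p = price_grid[int(np.argmax(exp_rev))]                 # exploitation
        value = x @ theta_star + noise_rng.logistic()               # latent valuation
        y = int(value >= p)                                         # binary sale feedback
        revenue += p * y
        hist_X.append(np.concatenate([x, [p]]))
        hist_y.append(y)
        if t == T_explore - 1:
            clf = LogisticRegression().fit(np.array(hist_X), np.array(hist_y))
    return revenue

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 3))
theta_star = np.array([1.0, 0.5, -0.5])
print(explore_then_commit_pricing(X, theta_star, rng, T_explore=400,
                                  price_grid=np.linspace(0.1, 5.0, 50)))
```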
Jeffrey Okamoto
PhD Student
Biostatistics
Probabilistic integration of transcriptome-wide association studies and colocalization analysis identifies key molecular pathways of complex traits
Abstract
Integrative genetic association methods have shown great promise in post-GWAS (genome-wide association study) analyses, in which one of the most challenging tasks is identifying putative causal genes (PCGs) and uncovering molecular mechanisms of complex traits. Recent studies suggest that prevailing computational approaches, including transcriptome-wide association studies (TWASs) and colocalization analysis, are individually imperfect, but their joint usage can yield robust and powerful inference results. We present INTACT, an empirical Bayesian framework to integrate probabilistic evidence from these distinct types of analyses and identify PCGs. Capitalizing on the fact that TWAS and colocalization analysis have low inferential reproducibility for implicating PCGs, we show that INTACT has a mathematical connection to Dempster-Shafer theory, especially Dempster’s rule of combination. This procedure is flexible and can work with a wide range of existing integrative analysis approaches. It has the unique ability to quantify the uncertainty of implicated genes, enabling rigorous control of false-positive discoveries. Taking advantage of this highly desirable feature, we further propose an efficient algorithm, INTACT-GSE, for gene set enrichment analysis based on the integrated probabilistic evidence. We examine the proposed computational methods and illustrate their improved performance over the existing approaches through simulation studies. We apply the proposed methods to analyze the multi-tissue eQTL data from the GTEx project and eight large-scale complex- and molecular-trait GWAS datasets from multiple consortia and the UK Biobank. Overall, we find that the proposed methods markedly improve the existing PCG implication methods and are particularly advantageous in evaluating and identifying key gene sets and biological pathways underlying complex traits.
Charlotte Mann
PhD Student
Statistics
Combining observational and experimental data for causal inference considering data privacy
Abstract
Combining observational and experimental data for causal inference can improve treatment effect estimation. However, many observational data sets cannot be released due to data privacy considerations, so one researcher may not have access to both experimental and observational data. Nonetheless, a small amount of risk of disclosing sensitive information might be tolerable to organizations that house confidential data. In these cases, organizations can employ data privacy techniques, which decrease disclosure risk, potentially at the expense of data utility. In this paper, we explore disclosure-limiting transformations of observational data, which can be combined with experimental data to estimate the sample and population average treatment effects. We consider leveraging observational data to improve the generalizability of treatment effect estimates when a randomized controlled trial (RCT) is not representative of the population of interest, and to increase the precision of treatment effect estimates. Through simulation studies, we illustrate the trade-off between privacy and utility when employing different disclosure-limiting transformations. We find that leveraging transformed observational data in treatment effect estimation can still improve estimation over only using data from an RCT.
Jieru Shi
PhD Student
Biostatistics
Debiased machine learning of causal excursion effects to assess time-varying moderation
Abstract
Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions in multiple domains of health sciences. Sequentially randomized experiments called micro-randomized trials (MRTs) have grown in popularity as a means to empirically evaluate the effectiveness of these mHealth intervention components. MRTs have motivated a new class of causal estimands, termed “causal excursion effects”, which allows health scientists to assess how intervention effectiveness changes over time or is moderated by individual characteristics, context, or responses in the past. However, current data analysis methods require pre-specified features of the observed high-dimensional history to construct a working model of an important nuisance parameter. Machine learning (ML) algorithms are ideal for automatic feature construction, but their naive application to causal excursion estimation can lead to bias under model misspecification and therefore incorrect conclusions about the effectiveness of interventions. In this paper, the estimation of causal excursion effects is revisited from a meta-learner’s perspective, where ML and statistical methods such as supervised learning and regression have been explored. Asymptotic properties of the novel estimands are presented and a theoretical comparison accompanied by extensive simulation experiments demonstrates relative efficiency gains, supporting our recommendation for a doubly-robust alternative to the existing methods. Practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first year medical residents in the United States.
Lap Sum Chan
PhD Student
Biostatistics
Identification and Inference for High-dimensional Pleiotropic Variants in GWAS
Abstract
In a standard analysis, pleiotropic variants are identified by running separate genome-wide association studies (GWAS) and combining results across traits, but such a two-stage statistical approach may lead to spurious results. We propose a new statistical approach, the Debiased-regularized Factor Analysis Regression Model (DrFARM), a joint regression model for simultaneous analysis of high-dimensional genetic variants and multilevel dependencies, to identify pleiotropic variants in multi-trait GWAS. This joint modeling strategy controls the overall error to permit universal false discovery rate (FDR) control. DrFARM uses the strengths of the debiasing technique and the Cauchy combination test, both theoretically justified, to establish valid post-selection inference on pleiotropic variants. Through extensive simulations, we show that DrFARM appropriately controls the overall FDR. Applying DrFARM to data on 1,031 metabolites measured on 6,135 men from the Metabolic Syndrome in Men (METSIM) study, we identify 288 new metabolite associations at loci that did not reach statistical significance in prior METSIM metabolite GWAS. In addition, we identify new pleiotropic loci for 16 metabolite pairs.
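The Cauchy combination test used as a building block above has a simple closed form: with weights summing to one, the statistic is a weighted sum of tan((0.5 - p_i)π) and is compared to a standard Cauchy reference. A short Python sketch of this generic combination step (not the full DrFARM pipeline) follows; the toy p-values are illustrative.

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Cauchy combination of possibly dependent p-values: the statistic is a
    weighted sum of tan((0.5 - p) * pi); under the null it is approximately
    standard Cauchy, giving the combined p-value in closed form."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights, float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(t) / np.pi

print(cauchy_combination([0.01, 0.20, 0.65]))
```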
15-min Oral Presentation III
March 10th 1:00PM-3:30PM @ Amphitheatre
Di Wang
PhD Student
Biostatistics
Incorporating External Risk Information from Published Prediction Models with the Cox Model Accounting for Population Heterogeneity
Abstract
Polygenic hazard score (PHS) models designed for European ancestry provide ample information regarding survival risk discrimination. Incorporating such information can be useful to improve the performance of risk discrimination in an internal small-sized study of a minority cohort. However, given that external European models and internal individual-level data come from different populations, ignoring heterogeneity among information sources may introduce substantial bias. In this paper, we develop a Kullback-Leibler-based Cox model (CoxKL) to integrate internal individual-level time-to-event data with external risk scores derived from published prediction models, accounting for population heterogeneity. Partial-likelihood-based KL information is utilized to measure the discrepancy between the external risk information and the internal data. Simulation studies show that the integration model by the proposed CoxKL method achieves improved estimation efficiency and prediction accuracy. We apply the proposed method to develop a trans-ancestry PHS model for prostate cancer by integrating a previously published PHS model based on European ancestry with an internal genetic dataset of African American ancestry males.
Mason Ferlic
PhD Student
Statistics
Optimizing Event-triggered Adaptive Interventions in Mobile Health with Sequentially Randomized Trials
Abstract
In mobile and digital health, advances in collecting sensor data and engaging users in self-reporting have enabled real-time monitoring of an individual’s response to treatment. This has led to significant scientific interest in developing technology-assisted dynamic treatment regimes incorporating digital tailoring variables that determine when, if, and what treatment is needed. In such mobile monitoring environments, event-triggered adaptive interventions, in which a patient transitions to the next stage of therapy when pre-specified event criteria are triggered, enable more agile treatment timing to meet the individual’s needs. Sequential, multiple-assignment randomized trial (SMART) designs can be used to develop optimized event-triggered adaptive interventions. We introduce a new estimation approach for analyzing data from SMARTs that addresses four statistical challenges: (i) the need to condition on the event, which is impacted by past treatment assignment, (ii) while avoiding causal collider bias in the comparison of adaptive interventions starting with different treatments, (iii) the need for dimension-reducing models for the distribution of the event given the past, and (iv) for the relationship between the event and the research outcome, all while avoiding negative impacts of model misspecification bias on the target causal effects. We illustrate the method on data from a SMART to develop an event-triggered adaptive intervention for weight loss.
Saghar Adler
PhD Student
EECS
Learning a Discrete Set of Optimal Allocation Rules in a Queueing System with Unknown Service Rate
Abstract
To highlight difficulties in learning-based optimal control in nonlinear stochastic dynamic systems, we study admission control for a classical Erlang-B blocking system with unknown service rate. At every job arrival, a dispatcher decides to assign the job to an available server or to block it. Every served job yields a fixed reward for the dispatcher, but it also results in a cost per unit time of service. Our goal is to design a dispatching policy that maximizes the long-term average reward for the dispatcher based on observing the arrival times and the state of the system at each arrival. Critically, the dispatcher observes neither the service times nor the departure times, so that reinforcement learning based approaches do not apply. Hence, we develop our learning-based dispatch scheme as a parametric learning problem à la self-tuning adaptive control. In our problem, certainty equivalent control switches between an always-admit policy (always explore) and a never-admit policy (immediately terminate learning), which is distinct from the adaptive control literature. Therefore, our learning scheme judiciously uses the always-admit policy so that learning does not stall. We prove that for all service rates, the proposed policy asymptotically learns to take the optimal action, and we also present finite-time regret guarantees. The extreme contrast in the certainty equivalent optimal control policies leads to difficulties in learning that show up in our regret bounds for different parameter regimes. We explore this aspect in our simulations, along with follow-up sampling-related questions for our continuous-time system.
Madeline Abbott
PhD Student
Biostatistics
A latent variable approach to jointly modeling emotions and cigarette use in a mobile health study of smoking cessation
Abstract
Ecological momentary assessment (EMA), which consists of frequently delivered surveys sent to individuals’ smartphones, allows for the collection of data in real time and in natural environments. As a result, data collected using EMA can be particularly useful in understanding the temporal dynamics of individuals’ states and how these states relate to outcomes of interest. Motivated by data from a smoking cessation study, we propose a statistical method for analyzing longitudinal EMA data to determine what psychological states represent risk for smoking and to understand the dynamics of these states. Our method consists of a longitudinal submodel—a dynamic factor model—that models changes in time-varying latent psychological states and a cumulative risk submodel—a Poisson regression model—that connects the latent states with risk for smoking. In data motivating this work, both the underlying psychological states (the predictors) and cigarette smoking (the outcome) are partially unobserved. We account for these partially latent predictors and outcomes in our proposed model and estimation method in which we take a two-stage approach to estimate associations between the psychological states and smoking risk. We include weights in the cumulative risk submodel to reduce the bias in our estimates of association. To illustrate our method, we apply it to a subset of data from the smoking cessation study. Although our work is motivated by a mobile health study of smoking cessation, methods presented here are applicable to mobile health data collected in a variety of other contexts.
Jaeshin Park
PhD Student
Industrial and Operations Engineering
Stratified sampling for reliability analysis using stochastic simulation with multi-dimensional input
Abstract
Stratified sampling has been used to reduce estimation variance when analyzing system reliability in many applications. It divides the input space into disjoint subsets, called strata, and draws samples from each stratum. By partitioning the space properly and allocating more of the computational budget to important strata, it can accurately estimate system reliability with a limited computational budget. In the literature, how to allocate the computational budget given the stratification structure has been extensively studied; however, how to effectively partition the input domain (i.e., how to design the strata) has not been fully investigated. Stratification design becomes more important as the input dimension increases, due to the curse of dimensionality. This study analytically derives the optimal stratification structure that minimizes the estimation variance. Further, by incorporating ideas from decision trees into the optimal stratification, we devise a robust algorithm for high-dimensional input problems. Numerical experiments and a wind power case study demonstrate the benefits of the proposed method.
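The basic stratified estimator and its variance are easy to write down: weight each stratum's sample mean by the stratum probability and sum. The Python sketch below uses a toy one-dimensional example with two fixed strata and a hand-chosen allocation to illustrate the variance-reduction idea; the optimal stratification design and budget allocation studied in the abstract are not shown, and the names and toy failure model are illustrative.

```python
import numpy as np

def stratified_estimate(samplers, weights, alloc, seed=0):
    """Stratified Monte Carlo estimate of a failure probability and its variance.
    samplers[k](n, rng) draws n failure indicators from stratum k, weights[k] is
    the stratum probability, and alloc[k] is the per-stratum sampling budget."""
    rng = np.random.default_rng(seed)
    est, var = 0.0, 0.0
    for k in range(len(samplers)):
        y = samplers[k](alloc[k], rng)
        est += weights[k] * y.mean()
        var += weights[k] ** 2 * y.var(ddof=1) / alloc[k]
    return est, var

# Toy: input U ~ Uniform(0, 1); the system "fails" when U > 0.95.
# Strata are [0, 0.9) and [0.9, 1]; putting most of the budget in the rare,
# failure-prone stratum sharply reduces the estimator's variance.
w = [0.9, 0.1]
s0 = lambda n, rng: (rng.uniform(0.0, 0.9, n) > 0.95).astype(float)
s1 = lambda n, rng: (rng.uniform(0.9, 1.0, n) > 0.95).astype(float)
print(stratified_estimate([s0, s1], w, alloc=[200, 1800]))   # true value is 0.05
```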
Mengqi Lin
PhD Student
Statistics
Controlling the false discovery rate under dependency with the adaptively weighted BH procedure
Abstract
We introduce a generic adaptively weighted, covariate-assisted multiple testing method for finite-sample false discovery rate (FDR) control with dependent test statistics where the dependence structure is known. Our method employs conditional calibration to address the dependency between test statistics, and we use the conditional statistics to learn adaptive weights while maintaining FDR control. We derive optimal weights under a conditional two-group model, and provide an algorithm to approximate them. Together with the conditional calibration, our adaptively weighted procedure controls the FDR while improving the power when the covariates are useful. For fixed weights, our procedure dominates the traditional weighted BH procedures under positive dependence and the general weighted BY procedure under known generic dependence.
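As background, the fixed-weight BH procedure that serves as a comparison point works as follows: divide each p-value by its weight (weights averaging one) and run the usual Benjamini-Hochberg step-up rule on the weighted p-values. A short Python sketch is given below; the adaptive, conditionally calibrated procedure of the abstract is not shown, and the toy data and weights are illustrative.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Fixed-weight BH: with weights averaging one, apply the BH step-up rule to the
    weighted p-values p_i / w_i and reject the corresponding hypotheses."""
    p = np.asarray(pvals, float)
    w = np.asarray(weights, float)
    m = len(p)
    q = p / w                                     # weighted p-values
    order = np.argsort(q)
    below = q[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])
w = np.concatenate([np.full(90, 0.5), np.full(10, 5.5)])   # covariate-informed, mean 1
print(weighted_bh(p, w).sum())
```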
Hanna Venera
Master’s Student
Biostatistics
Data Analytic Approach for Hybrid SMART-MRT Designs: The SMART Weight Loss Case Study
Abstract
Sequential Multiple Assignment Randomized Trials (SMARTs) and Micro-Randomized Trials (MRTs) are existing designs used to assess sequential components at relatively slow timescales (such as weeks or months) and at relatively fast timescales (such as days or hours), respectively. The hybrid SMART-MRT design is a new experimental approach that integrates a SMART design with an MRT design to enable researchers to answer scientific questions about the construction of psychological interventions in which components are delivered and adapted on different timescales. We explain how data from a hybrid SMART-MRT design can be analyzed to answer a variety of scientific questions about the development of multi-component psychological interventions. We use this approach to analyze data from a completed hybrid SMART-MRT to inform the development of a weight loss intervention. We also discuss how the data analytic approach can be modified to accommodate the unique structure of the weight loss SMART-MRT, including micro-randomizations that are restricted to individuals who show early signs of non-response.
Shota Takeishi
Visiting PhD Student
Statistics
A Shrinkage Likelihood Ratio Test for High-dimensional Subgroup Analysis with a Logistic-Normal Mixture Model
Abstract
In clinical trials, there may be a subgroup of patients with certain personal attributes who benefit from the treatment more than the rest of the population. Furthermore, such attributes can be high-dimensional if, for example, biomarkers or genome data are collected for each patient. With this practical application in mind, this study concerns testing for the existence of a subgroup with an enhanced treatment effect, where subgroup membership is potentially characterized by high-dimensional covariates. The existing literature on testing for the existence of such a subgroup has two drawbacks. First, the asymptotic null distributions of the test statistics proposed in the literature often have intractable forms. Notably, they are not easy to simulate, and hence the data analyst has to resort to computationally demanding methods, such as the bootstrap, to calculate the critical value. Second, most of the methods in the literature assume that the dimension of the personal attributes characterizing subgroup membership is fixed, so they are not applicable to high-dimensional data. To fix these problems, this research proposes a novel likelihood ratio-based test with a logistic-normal mixture model for testing the existence of the subgroup. The proposed test simplifies the asymptotic null distribution: we show that, under the null hypothesis, the test statistic weakly converges to a half chi-square distribution, which is easy to simulate. Furthermore, this convergence result holds even in a high-dimensional regime where the dimension of the personal attributes characterizing the subgroup classification exceeds the sample size. Beyond the theory, we present simulation results assessing the finite-sample performance of the proposed method.
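When the limiting “half chi-square” law takes the common boundary form, a 50:50 mixture of a point mass at zero and a chi-square variable with one degree of freedom (an assumption made here purely for illustration), its critical values can be simulated in a few lines of Python:

```python
import numpy as np

# Simulate the 50:50 mixture of a point mass at 0 and a chi-square(1) variable,
# a common "half chi-square" boundary null (assumed form, for illustration only).
rng = np.random.default_rng(0)
draws = np.where(rng.random(1_000_000) < 0.5, 0.0, rng.chisquare(1, 1_000_000))
print(np.quantile(draws, 0.95))   # about 2.71, versus 3.84 for a plain chi-square(1)
```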