2022 Presentations

SESSION I

SESSION II

SESSION III

SESSION IV

Order of presentation may vary, alphabetical order shown.

Session I

March 10th, 10:00am – 12:00pm
Vandenberg

Moderated by
Curtiss Engstrom
PhD Student, Program in Survey and Data Science

Timothy Baker

Fifth-year PhD Student
EECS

Leveraging Correlation to Improve Accuracy in Stochastic Computing

Abstract

In stochastic computing, streams of random bits are used to perform low-cost computation. For example, two random bitstreams can be multiplied using a single AND gate, whereas conventional digital multipliers require hundreds of logic gates. Efficient multiplication has made stochastic computing a promising design paradigm for low-cost hardware implementations of digital filters, image processing algorithms and neural networks. However, due to their inherent randomness, stochastic circuits yield approximate computation results and have a fundamental accuracy-latency trade-off. Sometimes this trade-off is poor and a stochastic circuit will require high latency to reach practical accuracy thresholds. Surprisingly, our recent work shows how correlation can be leveraged to drastically improve the accuracy of some important stochastic circuit designs and ultimately lower their latency. We introduce two techniques, full correlation and precise sampling, which improve the accuracy of multiplexer-based random bitstream adders by 4x to 16x while reducing the circuit area by about 35%. This accuracy improvement translates into a significantly lower required latency, as demonstrated by a digital filtering case study.
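As an illustration of the AND-gate multiplication mentioned in the abstract, here is a minimal software sketch. It assumes the common unipolar encoding, in which a value p in [0, 1] is represented by a bitstream whose bits are 1 with probability p, and it assumes the two input streams are independent. The function names, stream length and seed are hypothetical choices for this example and are not taken from the presented work.

```python
import random

def to_bitstream(p, n, rng):
    """Encode p in [0, 1] as n random bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_bitstream(bits):
    """Decode a bitstream back to a value: the fraction of bits that are 1."""
    return sum(bits) / len(bits)

def sc_multiply(x_bits, y_bits):
    """Multiply two independent bitstreams with a bitwise AND (one gate per bit)."""
    return [a & b for a, b in zip(x_bits, y_bits)]

rng = random.Random(0)      # fixed seed so the run is repeatable
n = 4096                    # stream length: longer streams cost more latency
x_bits = to_bitstream(0.5, n, rng)
y_bits = to_bitstream(0.25, n, rng)

# The decoded result approximates 0.5 * 0.25 = 0.125; it is not exact.
product_estimate = from_bitstream(sc_multiply(x_bits, y_bits))
print(product_estimate)
```

The result is only approximate because each stream is a random sample, and the estimate tightens as the stream length n grows. That is the accuracy-latency trade-off the abstract describes.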

Coauthor: John Hayes

Dan Kessler

Fifth-year PhD Student
Statistics

Inference for Canonical Directions in Canonical Correlation Analysis

Abstract

Canonical Correlation Analysis (CCA) is a method for analyzing pairs of random vectors; it learns a sequence of paired linear transformations such that the resultant canonical variates are maximally correlated within pairs while uncorrelated across pairs. The parameters estimated by CCA include both the “canonical correlations” as well as the “canonical directions” which characterize the transformations. CCA has seen a resurgence of popularity with applications including brain imaging and genomics where the goal is often to identify relationships between high-dimensional -omics data with more moderately sized behavioral or phenotypic measurements. Inference in CCA applications is typically limited to testing whether the canonical correlations are nonzero. Inference for the canonical directions has received relatively little attention in the statistical literature and in practice the directions are interpreted descriptively. We discuss several approaches for conducting inference on canonical directions obtained by CCA. We conduct thorough simulation studies to assess inferential validity in various settings and apply the methods to a brain imaging data set.

Coauthor: Elizaveta Levina

Lulu Shang

Fourth-year PhD Student
Biostatistics

Spatially Aware Dimension Reduction for Spatial Transcriptomics

Abstract

Spatial transcriptomics is a collection of genomic technologies that have enabled transcriptomic profiling on tissues with spatial localization information. Analyzing spatial transcriptomic data is computationally challenging, as the data collected from various spatial transcriptomic technologies are often noisy and display substantial spatial correlation across tissue locations. Here, we develop a spatially-aware dimension reduction method, SpatialPCA, that can extract a low dimensional representation of the spatial transcriptomics data with enriched biological signal and preserved spatial correlation structure, thus unlocking many existing computational tools previously developed in single-cell RNAseq studies for tailored and novel analysis of spatial transcriptomics. We illustrate the benefits of SpatialPCA for spatial domain detection and explore its utility for trajectory inference on the tissue and for high-resolution spatial map construction. In the real data applications, SpatialPCA identifies key molecular and immunological signatures in a newly detected tumor surrounding microenvironment, including a tertiary lymphoid structure that shapes the gradual transcriptomic transition during tumorigenesis and metastasis. In addition, SpatialPCA detects the past neuronal developmental history that underlies the current transcriptomic landscape across tissue locations in the cortex.

Coauthor: Xiang Zhou

Yutong Wang

Sixth-year PhD Student
EECS

VC dimension of partially quantized neural networks in the overparametrized regime

Abstract

Vapnik-Chervonenkis (VC) theory has so far been unable to explain the small generalization error of overparametrized neural networks. Indeed, existing applications of VC theory to large networks obtain upper bounds on VC dimension that are proportional to the number of weights, and for a large class of networks, these upper bounds are known to be tight. In this work, we focus on a class of partially quantized networks that we refer to as hyperplane arrangement neural networks (HANNs). Using a sample compression analysis, we show that HANNs can have VC dimension significantly smaller than the number of weights, while being highly expressive. In particular, empirical risk minimization over HANNs in the overparametrized regime achieves the minimax rate for classification with Lipschitz posterior class probability. We further demonstrate the expressivity of HANNs empirically. On a panel of 121 UCI datasets, overparametrized HANNs are able to match the performance of state-of-the-art full-precision models.

Coauthor: Clayton Scott

Yuqi Zhai

Fourth-year PhD Student
Biostatistics

Data Integration with Oracle Use of External Information from Heterogeneous Populations

Abstract

It is common to have access to summary information from external studies. Such information can be useful in model fitting for an internal study of interest and can improve parameter estimation efficiency when incorporated. However, external studies may target populations different from the internal study, in which case an incorporation of the corresponding information may introduce estimation bias. We develop a penalized constrained maximum likelihood (PCML) method that simultaneously achieves (i) selecting the external studies whose target populations match the internal study’s such that their information is useful for internal model fitting, and (ii) automatically incorporating the corresponding information into internal estimation. The PCML estimator has the same efficiency as an oracle estimator that knows which external information is useful and fully incorporates that information alone. A detailed theoretical investigation is carried out to establish asymptotic properties of the PCML estimator, including estimation consistency, parametric rate of convergence, external information selection consistency, asymptotic normality, and oracle efficiency. An algorithm for numerical implementation is provided, together with a data-adaptive procedure for tuning parameter selection. Numerical performance is investigated through simulation studies and an application to a prostate cancer study is also conducted.

Coauthor: Peisong Han

Session II

March 11th, 8:30am – 10:00am
Vandenberg

Moderated by
Alexander Ritchie
PhD Student, EECS

Trong Dat Do

Third-year PhD Student
Statistics

Functional Optimal Transport: map estimation and domain adaptation for functional data

Abstract

We introduce a formulation of the optimal transport problem for distributions on function spaces, where the stochastic map between functional domains can be partially represented in terms of an (infinite-dimensional) Hilbert-Schmidt operator mapping a Hilbert space of functions to another. For numerous machine learning tasks, data can be naturally viewed as samples drawn from spaces of functions, such as curves and surfaces, in high dimensions. Optimal transport for functional data analysis provides a useful framework of treatment for such domains. To this end, we develop an efficient algorithm for finding the stochastic transport map between functional domains and provide theoretical guarantees on the existence, uniqueness, and consistency of our estimate for the Hilbert-Schmidt operator. We validate our method on synthetic datasets and examine the geometric properties of the transport map. Experiments on real-world datasets of robot arm trajectories further demonstrate the effectiveness of our method on applications in domain adaptation.

Coauthors: Jiacheng Zhu, Aritra Guha, XuanLong Nguyen, Ding Zhao and Mengdi Xu

Jinming Li

Third-year PhD Student
Statistics

Network Latent Space Model with Hyperbolic Geometry

Abstract

Network data are prevalent in various scientific and engineering fields, including sociology, economics, and neuroscience. While latent space models are widely used in analyzing network data, the geometric effect of latent space remains an important but unsolved problem. In this work, we propose a hyperbolic network latent space model with a learnable curvature parameter, which allows the proposed model to fit network data with the most suitable latent space. We theoretically justify that learning the optimal curvature is essential to minimize the embedding error for all hyperbolic embedding methods beyond network latent space models. We also establish consistency rates for maximum-likelihood estimators and develop an estimation approach with manifold gradient optimization, both of which are technically challenging due to the non-linearity and non-convexity of the hyperbolic distance metric. We further illustrate the superiority of the proposed model and the geometric effect of latent space with extensive simulation studies followed by a Facebook friendship network application.

Coauthors: Gongjun Xu and Ji Zhu

Jieru Shi

Second-year PhD Student
Biostatistics

Assessing Time-Varying Causal Effect Moderation in the Presence of Cluster-Level Treatment Effect Heterogeneity

Abstract

The micro-randomized trial (MRT) is a sequential randomized experimental design to empirically evaluate the effectiveness of mobile health (mHealth) intervention components that may be delivered at hundreds or thousands of decision points. MRTs have motivated a new class of causal estimands, termed “causal excursion effects”, for which semiparametric inference can be conducted via a weighted, centered least-squares criterion (Boruvka et al., 2018). Existing methods assume between-subject independence and non-interference. Deviations from these assumptions often occur. In this paper, causal excursion effects are revisited under potential cluster-level treatment effect heterogeneity and interference, where the treatment effect of interest may depend on cluster-level moderators. The utility of the proposed methods is shown by analyzing data from a multi-institution cohort of first-year medical residents in the United States.

Coauthors: Zhenke Wu and Walter Dempsey

Natasha Stewart

Third-year PhD Student
Statistics

Post-Selection Inference for Multitask Regression with Shared Sparsity

Abstract

With the growing complexity of modern data, it is increasingly common to select a data model only after performing an exploratory analysis. The field of selective inference has arisen to provide valid inference following the selection of a model through such data-adaptive procedures. The contribution of this work is to develop post-selection inference tools for multitask learning problems. Multitask learning is used to model a set of related response variables from the same set of features, improving predictive performance relative to methods that handle each response variable separately. Ignoring the shared structure for the sake of obtaining valid inference would come at a significant cost in terms of power, and thus new methods are needed. Motivated by applications in neuroimaging, we consider problems where several response variables must each be modeled using some sparse subset of the shared features. This setup can arise, for instance, when a series of related phenotypes are modeled as a function of brain imaging data.
We propose a two-stage protocol for joint model selection and inference. In stage one, we adapt a penalty approximation to jointly identify the relevant covariates for each task, proceeding to fit a series of linear models using the selected features. In stage two, a new conditional approach is proposed for inference about the selected models, utilizing a refinement of the selection event. An approximate system of estimating equations for maximum likelihood inference is developed that can be solved via a single convex optimization problem. This enables us to efficiently form confidence intervals with roughly the desired coverage probability through MLE-based inference. We test our two-stage procedure on simulated data, demonstrating that our methods yield tighter confidence intervals than alternatives such as data splitting. Finally, we consider an application in neuroscience involving high-dimensional fMRI data and several related cognitive tasks.

Coauthors: Snigdha Panigrahi and Elizaveta Levina

Jung Yeon Won

Fifth-year PhD Student
Biostatistics

Integrating food environment exposures from multiple longitudinal databases

Abstract

The majority of built environment health studies rely on secondary sources to enumerate local food environments and conduct analyses to contextualize population health and health behaviors within a neighborhood’s retail environments. Such secondary commercial databases often provide longitudinal point-referenced data, which enables longitudinal studies that characterize health outcomes in relation to the dynamic food environment. However, there are concerns about measurement error when quantifying environmental influences with longitudinal secondary commercial sources due to the incompleteness of listings. To alleviate the ascertainment error problem, combining multiple databases can be a promising strategy in particular for time-varying exposures as field validation is not feasible for historical exposure measures. Given the quality scores of each database, we propose a method that incorporates source quality to integrate conflicting time-varying exposures that are from different data sources. To model the latent time-varying count exposure, we extend the Poisson INAR(1) model and take a Bayesian nonparametric approach to flexibly discover clusters of location-specific time series of exposures. By resolving the discordance between different databases, our method obtains an unbiased health effect of unobservable series of true exposures.

Coauthors: Michael R. Elliott and Brisa N. Sanchez

Session III

March 11th, 10:30am – 12:00pm
Vandenberg

Moderated by
Lap Sum Chan
PhD Student, Biostatistics

Rupam Bhattacharyya

Fourth-year PhD Student
Biostatistics

fiBAG: Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data

Abstract

Large-scale multi-omics datasets offer complementary, partly independent, high-resolution views of the human genome. Modeling and inference using such data poses challenges like high-dimensionality and structured dependencies but offers potential for understanding the complex biological processes characterizing a disease. We propose fiBAG, an integrative hierarchical Bayesian framework for modeling the fundamental biological relationships underlying such cross-platform molecular features. Using Gaussian processes, fiBAG identifies mechanistic evidence for covariates from corresponding upstream information. Such evidence, mapped to prior inclusion probabilities, informs a calibrated Bayesian variable selection (cBVS) model identifying genes/proteins associated with the outcome. Simulation studies illustrate that cBVS has higher power to detect disease-related markers than non-integrative approaches. A pan-cancer analysis of 14 TCGA cancer datasets is performed to identify markers associated with cancer stemness and patient survival. Our findings include both known associations like the role of RPS6KA1/p90RSK in gynecological cancers and interesting novelties like EGFR in gastrointestinal cancers.

Coauthors: Nicholas Henderson and Veerabhadran Baladandayuthapani

Kyle Gilman

Fifth-year PhD Student
EECS

Streaming Probabilistic PCA for Missing Data with Heteroscedastic Noise

Abstract

Streaming principal component analysis (PCA) has been an integral tool in large-scale machine learning for rapidly estimating low-dimensional subspaces of very high dimensional and high arrival-rate data with missing entries and corrupting noise. However, modern trends increasingly combine data from a variety of sources, meaning they may exhibit heterogeneous quality across samples. Since standard streaming PCA algorithms do not account for non-uniform noise, their subspace estimates quickly degrade. On the other hand, recently proposed heteroscedastic probabilistic PCA (HPPCA) is limited in its practicality since it does not handle missing entries and streaming data, nor can it adapt to non-stationary behavior in time series data. In this work, we propose the Streaming HeteroscedASTic Algorithm for PCA (SHASTA-PCA) to bridge this divide. Our method uses a stochastic alternating expectation maximization approach to jointly learn the low-rank latent factors and unknown noise variances from streaming data with missing entries and heteroscedastic noise, all while maintaining a low memory and computational footprint. Numerical experiments validate the superior performance of our method compared to state-of-the-art streaming PCA algorithms.

Coauthors: David Hong, Laura Balzano and Jeffrey Fessler

Mao Li

First-year PhD Student
Program in Survey and Data Science

Using network analysis and clustering to evaluate the effectiveness of the US Census Bureau’s social media campaign about self-completing the 2020 census

Abstract

From the start of data collection for the 2020 US Census, official and celebrity users tweeted about the importance of everyone being counted in the census and urged followers to complete the questionnaire. At the same time, skeptical social media posts about the census became increasingly common. This study aims to identify and investigate the influence of Twitter user communities on self-completion rate, according to Census Bureau data, for the 2020 Census. Using a network analysis method, Community Detection, and a clustering algorithm, Latent Dirichlet Allocation, three prototypical users were identified: “official” (i.e., government agency), “promoting-census,” and “census skeptics” users. The census skeptics group was motivated by events and speeches about which an influential person had tweeted and became the largest community over the study period. The promoting-census community was less motivated by specific events and was consistently more active than the census skeptics community. The official user community was the smallest of the three, but their messages seemed to have been amplified by promoting-census celebrities and politicians. We found that the daily size of the promoting-census users group – but not the other two – predicted the Census 2020 Internet self-completion rate within 3 days after a tweet was posted, suggesting that the census social media campaign was successful, apparently due to the help of promoting-census celebrities, who encouraged people to fill out the census and amplified official user tweets. This finding demonstrates that a social media campaign can positively affect public behavior about an essential national project like the decennial census.

Robert Lunde

Post-doctoral Fellow
Statistics

Conformal Prediction for Network Regression

Abstract

An important problem in network analysis is predicting a node attribute using nodal covariates and summary statistics computed from the network, such as graph embeddings or local subgraph counts. While standard regression methods may be used for prediction, statistical inference is complicated by the fact that the nodal summary statistics often exhibit a nonstandard dependence structure. When the underlying network is generated by a graphon, we show that conformal prediction methods are finite-sample valid under a very mild condition on the network summary statistics. We also prove that a form of asymptotic conditional validity is achievable using standard nonparametric regression methods.

Coauthors: Elizaveta Levina and Ji Zhu

Jing Ouyang

Third-year PhD Student
Statistics

High-dimensional inference on Generalized Linear Models with Unmeasured Confounders

Abstract

In the high-dimensional setting, inference problems on the relationship between the response and the covariates are extensively studied for their wide applications in medicine, economics, and many other fields. In many applications, the covariates are associated with unmeasured confounders; for example, in studying the genetic effect on a certain disease, gene expressions are confounded by unmeasured environmental factors. In this case, standard methods may fail due to the existence of the unmeasured confounders. Recent studies address this problem in the context of linear models, whereas the problem in the generalized linear framework is less investigated. In this paper, we consider a generalized linear framework and propose a debiasing approach to address this high-dimensional problem, while adjusting for the effect of unmeasured confounders. We establish the asymptotic distribution for the debiased estimator. A simulation study and an application of our method on a genetic data set are performed to demonstrate the validity of this approach.

Coauthors: Kean Ming Tan and Gongjun Xu

Session IV

March 11th, 1:00pm – 2:30pm
Vandenberg

Moderated by
Cheoljoon Jeong
PhD Student, IOE

Derek Hansen

Fourth-year PhD Student
Statistics

Scalable Bayesian Inference for Detecting and Deblending Stars and Galaxies in Crowded Fields

Abstract

In images from astronomical surveys, astronomical objects such as stars and galaxies often overlap. Deblending is the task of identifying and characterizing the individual light sources that make up such images. We propose the Bayesian Light Source Separator (BLISS), which enables the detection, characterization, and reconstruction of individual stars and galaxies. BLISS posits a fully generative model of an astronomical image and its associated catalog, which consists of locations, brightness, classification (star or galaxy), and the galaxy shape.
First, to learn a distribution of galaxy shapes, we train a Variational Autoencoder (VAE) on simulated images of single galaxies. The VAE works by associating each galaxy with a low-dimensional latent representation from which all relevant information about its shape can be reconstructed. Then, to efficiently sample from the posterior distribution of the catalog given the image, we use amortized Variational Inference (VI) via a flexible neural network encoder. Our encoder consists of three stages. First, we sample the number of objects in the image and their locations conditional on the image. Then, we calculate the probability each object is a galaxy or star and sample the label. Finally, for each labeled galaxy, we sample the associated latent representation. Using the VAE, these latent representations can be reconstructed into individual galaxies, enabling downstream astronomical tasks that rely on the deblended morphology. Unlike traditional VI, the encoder is trained by alternating between the forward Kullback-Leibler (KL) divergence using simulated images and the reverse KL divergence using real images. Using the Sloan Digital Sky Survey (SDSS) dataset, we demonstrate that BLISS can find, classify, and reconstruct stars and galaxies identified in previous surveys with both high recall and precision.

Coauthors: Ismael Mendoza, Runjing Liu and Jeffrey Regier

Yifan Hu

First-year Master’s Student
Statistics

Estimating An Optimal Individualized Treatment Rule for Guiding the Initial Treatment Decision on Child/Adolescent Anxiety Disorder

Abstract

Designing an Individualized Treatment Rule (ITR) to guide clinicians on deciding personalized treatment plans for patients is an important research goal in treating child/adolescent anxiety disorder. An ITR is a special case of a dynamic treatment regimen when there is a single decision rule. In this research, an ITR is said to be optimal if it maximizes the expectation of a pre-specified clinical outcome when used to assign treatment to a population of interest. Our goal is to establish and to evaluate an optimal ITR, which guides the decision among sertraline (SRT), cognitive behavior therapy (CBT), and their combination (COMB) as the initial treatment for children or adolescents with anxiety, with reference to the Child/Adolescent Anxiety Multimodal Study (CAMS). CAMS is a completed federally-funded, multi-site, randomized placebo-controlled trial that examined the relative efficacy of cognitive-behavior therapy (CBT), sertraline (SRT), and their combination (COMB) against pill placebo (PBO) for the treatment of separation anxiety disorder (SAD), generalized anxiety disorder (GAD) and social phobia (SOP) in children and adolescents. Based on the Pediatric Anxiety Rating Scale (PARS) outcome from 412 participants randomly assigned to CBT (139), SRT (133), or COMB (140), we propose a four-step technique to estimate an optimal ITR using the CAMS data that leads to the minimal symptoms, on average. The four steps are: (1) Split the data for training (70%) and evaluation (30%). In the training data set: (2) Prune the baseline covariates according to their contribution level to model with a specified variable selection algorithm for subset analysis; (3) Use a novel method called “Decision List” to create two prospective interpretable and parsimonious ITRs. (4) Use the test data to evaluate the two candidate ITRs versus providing SRT only, CBT only, or COMB to participants with PARS results.

Coauthors: Tuo Wang, Scott N. Compton and Daniel Almirall

Peter MacDonald

Fourth-year PhD Student
Statistics

Continuous-time latent process network models

Abstract

Network data are often collected through the observation of a complex system over time, leading to time-stamped network snapshots. Methods in statistical network analysis are traditionally designed for a single network, but applying these methods to a time-aggregated network can miss important temporal structure in the data. In this work, we provide an approach to estimating the expected network in continuous time. We parameterize the network expectation through time-varying positions, such that the activity of each node is governed by a low-dimensional latent process. To tractably estimate these processes, we assume their components come from a fixed, finite-dimensional function basis. We provide a gradient descent estimation approach, establish theoretical results for its convergence, compare our method to competitors, and apply it to a real dynamic network of international political interactions.

Coauthors: Elizaveta Levina and Ji Zhu

Ai Rene Ong

Fifth-year PhD Student
Program in Survey and Data Science

Respondent Driven Sampling Design Considerations

Abstract

Respondent Driven Sampling (RDS) has been used as a method to sample hard-to-sample populations, leveraging the social networks of the initial respondents, typically selected through convenience sampling, to reach more people from the target population. Respondents are asked to invite their eligible peers to participate in the study, and this process continues until the sample size is reached. Although there have been some general recommendations for RDS best practices (e.g., conducting formative studies, a small number of seed respondents), efforts to study the contributions of these design decisions on the productivity of RDS peer recruitment have been hindered by incomplete reporting of RDS methodology in the literature. This study presents an exploratory analysis of the associations of various RDS design decisions on peer recruitment productivity. The data used is from a survey of researchers who have published an article using RDS or have grants funded for research using RDS from 2009 to 2020. These researchers were sampled from a database that represents a census of RDS researchers. A hundred and twenty-one researchers completed the survey which asked about the design of their RDS data collection. Preliminary results indicate that fielding an RDS survey on the web is associated with better productivity, and this effect is moderated by the type of population the RDS study is targeting. Giving more than one form of instructions for peer recruitment appeared to help with peer recruitment productivity. However, conducting formative research prior to data collection was not associated with peer recruitment productivity.

Coauthors: Sunghee Lee and Michael Elliott

Lam Tran

Fourth-year PhD Student
Biostatistics

A fast algorithm for fitting the constrained lasso

Abstract

Background: The constrained lasso is a flexible framework that augments the lasso by allowing for imposition of additional structure on regression coefficients. Despite the constrained lasso’s broad utility in compositional microbiome analysis and gene pair discovery, among many other applications, current methods for fitting the constrained lasso do not computationally scale and are limited to linear and logistic models with only simple constraint structures. No existing methods deal with survival data, limiting the range of potential clinical applications.

Methods: We proposed a novel approach for fitting the constrained lasso by leveraging candidate covariate subsets of increasing size from the unconstrained lasso in an efficient alternating direction method of multipliers algorithm. We found that using this approach can accelerate the convergence of the constrained lasso. We tested the ability of our method to quickly fit the constrained lasso with simulated and real-world data types under a variety of constraint structures.

Results: Our proposed algorithm led to substantial speedups in solving the regularization path of the constrained lasso for simulated data, even in complex cases where not all predictors are penalized and constrained equally. The utility and speed of our method were maintained when we considered two real-world data examples: a compositional microbiome dataset for binary periodontal disease status and a microarray dataset for multiple myeloma survival, neither of which could be solved when the constrained lasso is naively fit on the full set of predictors.

Significance: Our proposed algorithm dramatically reduces the time required to fit the constrained lasso even in real-data settings with complex constraint structures. Our computationally inexpensive approach increases the range of potential applications for the flexible and robust constrained lasso, being able to quickly perform variable selection with multiple response types and constraints.

Coauthors: Hui Jiang and Gen Li