
MSSISS 2015 Abstracts


Oral Presentation Session I   |   9:00 AM – 10:15 AM

Mohamad Kazem Shirani Faradonbeh
Statistics, PhD Student (Advisors: Ambuj Tewari, George Michailidis)

Optimality of Fast Algorithms for Random Networks with Applications to Structural Controllability

Network control refers to a large and diverse set of problems, including the controllability of linear time-invariant dynamical systems with inputs and outputs that evolve over time. The network control problem in this setting is to select the appropriate input to steer the network into a desired output state. Examples of the output state include the throughput of a communications network, transcription factor concentration in a gene regulatory network, customer purchases in a marketing context subject to social influences, and the amount of flux flowing through a biochemical network. We focus on control of linear dynamical systems under the notion of structural controllability, which is intimately connected to finding maximum matchings. Hence, the objective becomes studying scalable and fast algorithms for this task. We first show the convergence of matching algorithms for different random networks and then analyze a popular, fast, and practical heuristic due to Karp and Sipser. We prove the optimality of both the Karp-Sipser algorithm and a simplification of it, as well as the size of the maximum matching, for an extensive class of random networks.
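The Karp-Sipser heuristic mentioned above is simple enough to sketch. The toy version below is an illustration only (not the authors' implementation): it repeatedly matches a degree-1 (pendant) vertex when one exists, a move that never sacrifices optimality, and otherwise matches a random edge.

```python
import random

def karp_sipser(adj):
    """Greedy Karp-Sipser matching heuristic (simplified sketch).

    adj: dict mapping each vertex to a set of neighbours (undirected).
    Returns a list of matched edges.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # private copy
    matching = []

    def remove(v):
        # Delete v and all references to it from the graph.
        for u in adj.pop(v, set()):
            adj[u].discard(v)

    while any(adj.values()):
        # Prefer a pendant (degree-1) vertex: matching its only edge
        # never excludes some maximum matching.
        pendant = next((v for v, n in adj.items() if len(n) == 1), None)
        if pendant is not None:
            u, v = pendant, next(iter(adj[pendant]))
        else:
            u = next(v for v, n in adj.items() if n)
            v = random.choice(sorted(adj[u]))
        matching.append((u, v))
        remove(u)
        remove(v)
    return matching

# A path on 4 vertices has maximum matching size 2.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(len(karp_sipser(path)))  # 2
```

On sparse random graphs this pendant-first rule is what makes the heuristic near-optimal: each pendant step is provably safe, and random steps are only taken when no safe step exists.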

Pranav Yajnik
Biostatistics, PhD Student (Advisor: Michael Boehnke)

Characterizing Power of Two-Stage Residual Outcome Regression in Genome-Wide Association Studies with Quantitative Traits

Multiple linear regression (MLR) is an effective and widely used technique for performing inference about association between genetic variants and quantitative traits while controlling for bias due to confounding variables. In genome-wide association studies (GWAS), MLR analysis is performed separately for each of a large number of variants. Typically, the same covariates/confounders are included in each regression model. In addition, effect sizes of the covariates are not of direct interest. Consequently, some analysts choose to regress out the covariates from the trait (stage I) and use the residuals from the first stage to perform simple regression analysis with each genetic variant (stage II). This procedure (called two-stage residual outcome regression, or 2SROR) is computationally efficient and simplifies data management. However, the procedure is biased unless the genetic variant is orthogonal to each of the covariates, and the resulting hypothesis test for association is typically less powerful than the hypothesis test obtained from MLR. Previous work describes an approximate relationship between the 2SROR and MLR test statistics and studies power when Type I error is controlled at the 5% level. We derive the exact relationship between the two test statistics. We use this relationship to compare the power of the two methods under various parameter settings commonly observed in GWAS. In particular, we show that the two-stage method may perform substantially worse than MLR under the extremely stringent Type I error rate levels used in GWAS.
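To make the two procedures concrete, here is a small simulated comparison (illustrative only; the variable names and effect sizes are invented). When the genetic variant is correlated with a covariate, the stage II coefficient from 2SROR is attenuated relative to the MLR coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Illustrative data: genotype-like G correlated with a confounder Z.
Z = rng.normal(size=n)
G = rng.binomial(2, 0.3, size=n) + 0.8 * Z
Y = 0.2 * G + 0.5 * Z + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients, intercept prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

# MLR: fit G and Z jointly; the G coefficient estimates the true 0.2.
beta_mlr = ols(np.column_stack([G, Z]), Y)[1]

# 2SROR: stage I regresses Y on Z alone; stage II regresses the
# stage I residuals on G.  Since G is not orthogonal to Z, part of
# the genetic signal is absorbed in stage I, attenuating stage II.
b0, b1 = ols(Z, Y)
beta_2s = ols(G, Y - b0 - b1 * Z)[1]

print(round(beta_mlr, 3), round(beta_2s, 3))
```

With these (invented) settings the stage II coefficient is biased toward zero, which is the power loss the abstract quantifies exactly.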

Pin-Yu Chen
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Alfred O. Hero)

Universal Phase Transitions of Spectral Algorithms for Community Detection

Spectral algorithms are widely used methods for data clustering, particularly in the context of community detection in social network data. In this presentation we investigate the community detectability of two spectral algorithms, the spectral clustering method and the modularity method, under a general network model. We prove the existence of abrupt phase transitions with respect to the network parameters, where the network transitions from almost perfect detectability to low detectability at some critical threshold. These phase transition results provide fundamental performance limits of community detection, and they are universal in the sense that we allow the communities to have different sizes. We also use the results to establish an empirical estimator from data that is capable of evaluating the reliability of spectral algorithms for community detection. The phase transition results are validated via simulated networks and real-world datasets.

Oral Presentation Session II   |   11:15 AM – 12:30 PM

Robert Vandermeulen
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Clayton Scott)

On The Identifiability of Mixture Models from Grouped Samples

Finite mixture models are statistical models which appear in many problems in statistics and machine learning. In such models it is assumed that data are drawn from random probability measures, called mixture components, which are themselves drawn from a probability measure P over probability measures. When estimating mixture models, it is common to make assumptions on the mixture components, such as parametric assumptions. In this work, we make no assumption on the mixture components, and instead assume that observations from the mixture model are grouped, such that observations in the same group are known to be drawn from the same component. We show that any mixture of m probability measures can be uniquely identified provided there are 2m − 1 observations per group. Moreover we show that, for any m, there exists a mixture of m probability measures that cannot be uniquely identified when groups have 2m − 2 observations. Our results hold for any sample space with more than one element.

Yebin Tao
Biostatistics, PhD Candidate (Advisor: Lu Wang)

Optimal Dynamic Treatment Regimes for Treatment Initiation with Continuous Random Decision Points

Identifying optimal dynamic treatment regimes (DTRs) allows patients to receive the best personalized treatment prescriptions given their specific characteristics. For type 2 diabetic patients, finding the optimal time to initiate insulin therapy given their own clinical outcomes is of great significance for disease control. We consider estimating optimal DTRs for treatment initiation using observational data, where biomarkers of disease severity are monitored continuously during follow-up and a decision of whether or not to initiate a specific treatment is made each time the biomarkers are measured. Instead of considering multiple fixed decision stages as in most DTR literature, our study undertakes the task of dealing with continuous random decision points for treatment modification based on patients’ up-to-date clinical records. Under each DTR, we employ a flexible survival model with splines for time-varying covariates to estimate the probability of adherence to the regime for patients given their own biomarker history. With the estimated probability, we construct an inverse probability weighted estimator for the counterfactual mean utility (i.e., prespecified criteria) to assess the DTR. We conduct simulations to demonstrate the performance of our method and further illustrate the application process with the example of type 2 diabetic patients enrolled to initiate insulin therapy.
(Co-Authors: Lu Wang, Haoda Fu)
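As an illustration of the inverse probability weighting step, the toy sketch below (not the authors' survival-model-based estimator; the adherence probabilities are taken as known and all names and numbers are invented) reweights adherent subjects to estimate a counterfactual mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100000

# Toy data: adherence A to a regime depends on a covariate X through
# a known probability, and the outcome Y also depends on X.
X = rng.normal(size=n)
p_adhere = 1 / (1 + np.exp(-X))        # P(A = 1 | X), taken as known
A = rng.binomial(1, p_adhere)
Y = 2.0 + X + rng.normal(size=n)       # true mean outcome is 2.0

# Inverse probability weighted (Hajek) estimator of the mean outcome
# under the regime, computed from adherent subjects only.
w = A / p_adhere
ipw_mean = np.sum(w * Y) / np.sum(w)

naive_mean = Y[A == 1].mean()          # biased: adherence depends on X
print(round(ipw_mean, 2), round(naive_mean, 2))
```

The weighting removes the bias that the naive adherent-only mean suffers because adherence is confounded with X; in the abstract, the analogous probabilities come from a flexible survival model for continuous-time adherence.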

Naveen Naidu Narisetty
Statistics, PhD Candidate (Advisor: Vijay Nair)

Extremal Notion of Depth and Central Regions for Functional Data

Functional data are an increasingly common form of data, where each data point is a function observed over a continuous domain. Examples of functional data include ECG curves of patients observed over time, spectrometry curves recorded for a range of wavelengths, daily temperature curves, etc. “Data depth” is a concept that provides an ordering of the data in terms of how close an observation is to the center of the data cloud. This concept has been widely used for multivariate data in both exploratory data analysis and robust inference. However, depth notions for functional data have received much less attention, partly due to their complexity. In this work, we propose a new notion of depth for functional data called Extremal Depth (ED), discuss its properties, and compare its performance with existing concepts. The proposed notion is based on a measure of extreme “outlyingness”, similar to that of projection depth in the multivariate case. ED has many desirable properties as a measure of depth and is well suited for obtaining central regions of functional data. For constructing central regions, ED satisfies two important properties that are not shared by other notions: (a) the central region achieves the nominal (desired) simultaneous coverage probability; and (b) the width of the simultaneous region is proportional to that of the pointwise central regions. The empirical performance of the method is examined, and its usefulness is demonstrated for constructing functional boxplots and outlier detection.

Oral Presentation Session III   |   1:45 PM – 3:00 PM

Huitian Lei
Statistics, PhD Candidate (Advisors: Ambuj Tewari, Susan Murphy)

An Actor-Critic Contextual Bandit Algorithm for Personalized Interventions using Mobile Devices

Increasing technological sophistication and the widespread use of smartphones and wearable devices provide opportunities for innovative individualized health interventions. An Adaptive Intervention (AI) personalizes the type, mode, and dose of intervention based on users’ ongoing performance and changing needs. A Just-In-Time Adaptive Intervention (JITAI) employs the real-time data collection and communication capabilities of modern mobile devices to adapt and deliver interventions in real time. Despite the increasing popularity of JITAIs among clinical and behavioral scientists, the lack of methodological guidance for constructing high-quality, data-based JITAIs remains a hurdle to advancing JITAI research. In this article, we make a first attempt to bridge this methodological gap by formulating the task of tailoring interventions in real time as a contextual bandit problem. Interpretability concerns in the domain of mobile health lead us to formulate the problem differently from existing formulations intended for web applications such as ad or news article placement. Under the assumption of a linear reward function, we choose the reward function (the “critic”) parameterization separately from a lower-dimensional parameterization of stochastic policies (the “actor”). We provide an online actor-critic algorithm that guides the construction and refinement of a JITAI. Asymptotic properties of the actor-critic algorithm, including consistency, rate of convergence, and asymptotic confidence intervals for the reward and JITAI parameters, are developed and verified by numerical experiments. To the best of our knowledge, our work is the first application of the actor-critic architecture to contextual bandit problems.

Kuang Tsung (Jack) Chen
Survey Methodology, PhD Candidate (Advisor: Michael Elliott)

Population Inference from Web Surveys with LASSO Calibration

The costs of traditional face-to-face, telephone, and mail data collection methods continue to rise, while the costs of internet-based surveys steadily fall. Practitioners who perform market research, election forecasts, and opinion polls are among the increasingly many people who have turned to cheaper and faster web-based platforms for large sample sizes. However, web-based samples typically lack a well-defined probability selection framework, resulting in highly skewed samples and hindering inference to the population. Traditional post-survey adjustments that correct for sample imbalance tend to work well for probabilistic samples but perform poorly for web-based samples due to severe under-coverage and complex participation mechanisms. This paper introduces LASSO regression in model-assisted calibration for post-survey adjustments to improve the efficiency of estimating population totals based on web volunteer samples. We compare LASSO calibration with the widely used generalized regression estimator (GREG) given different sample and benchmark sizes, as well as different choices of assisting models. The results demonstrate the potential of LASSO calibration to make accurate population inferences based on web surveys.
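A minimal sketch of model-assisted estimation with an L1-penalized working model is shown below. This is illustrative only: the lasso is fit by plain proximal gradient descent, a simple difference estimator stands in for a full calibration-weighting procedure, and the population, participation mechanism, and penalty are all invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population with known covariates, and a volunteer web
# sample skewed toward large values of the first covariate.
N = 10000
X = rng.normal(size=(N, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=N)
p = 1 / (1 + np.exp(3 - 1.5 * X[:, 0]))     # skewed participation
s = rng.random(N) < p
Xs, ys = X[s], y[s]

def lasso_ista(X, y, lam, steps=3000):
    """L1-penalized least squares via proximal gradient descent (ISTA);
    the intercept is left unpenalized."""
    X1 = np.column_stack([np.ones(len(y)), X])
    L = np.linalg.norm(X1, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X1.shape[1])
    for _ in range(steps):
        b = b - X1.T @ (X1 @ b - y) / L     # gradient step
        b[1:] = np.sign(b[1:]) * np.maximum(np.abs(b[1:]) - lam / L, 0.0)
    return b

b = lasso_ista(Xs, ys, lam=5.0)

# Model-assisted (difference) estimator of the population total of y:
# model predictions summed over the full population, plus the sample
# residual total expanded by N / n.
pred = np.column_stack([np.ones(N), X]) @ b
resid = ys - np.column_stack([np.ones(len(ys)), Xs]) @ b
t_lasso = pred.sum() + (N / len(ys)) * resid.sum()

t_naive = (N / len(ys)) * ys.sum()          # ignores the skew entirely
print(round(y.sum()), round(t_lasso), round(t_naive))
```

Because the working model absorbs the covariate imbalance, the model-assisted estimate lands far closer to the true total than the naive expansion estimate; the L1 penalty is what lets this scale to many candidate benchmark variables.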

Zihuai He
Biostatistics, PhD Candidate (Advisors: Min Zhang, Bhramar Mukherjee)

Set-based Tests for Genetic Association in Longitudinal Studies

Genetic association studies with longitudinal markers of chronic diseases (e.g., blood pressure, body mass index) provide a valuable opportunity to explore how genetic variants affect traits over time by utilizing the full trajectory of longitudinal outcomes. Since these traits are likely influenced by the joint effect of multiple variants in a gene, a joint analysis of these variants considering linkage disequilibrium (LD) may help to explain additional phenotypic variation. In this article, we propose a longitudinal genetic random field model (LGRF) to test the association between a phenotype measured repeatedly during the course of an observational study and a set of genetic variants. Generalized score type tests are developed, which we show are robust to misspecification of the within-subject correlation, a feature that is desirable for longitudinal analysis. In addition, a joint test incorporating gene-time interaction is further proposed. Computational advances are made for scalable implementation of the proposed methods in large-scale genome-wide association studies (GWAS). The proposed methods are evaluated through extensive simulation studies and illustrated using data from the Multi-Ethnic Study of Atherosclerosis (MESA). Our simulation results indicate substantial gain in power using LGRF when compared with two commonly used existing alternatives: (i) single marker tests using the longitudinal outcome and (ii) existing gene-based tests using the average value of repeated measurements as the outcome.
(Co-Authors: Min Zhang, Seunggeun Lee, Jennifer A. Smith, Xiuqing Guo, Walter Palmas, Sharon L.R. Kardia, Ana V. Diez Roux, Bhramar Mukherjee)

Poster Session I   |   10:15 AM – 11:15 AM

Shrijita Bhattacharya
Statistics, PhD Student (Advisors: Stilian Stoev, George Michailidis)

Statistics on Data Streams with Applications on Mining High-Impact Computer Network Events

The security and management of traffic in modern fast computer networks involve the rapid analysis of large volumes of data. Such data, referred to as data streams, cannot be analyzed with conventional statistical tools that often require storing and post-processing the entire data set. Data streams can only be accessed sequentially under relatively stringent computing time and space constraints. This poses novel challenges to the types of statistics that can be used in this setting. In this work, we start by addressing the canonical problem of detecting and estimating the number of high-activity entities, also called heavy hitters, in a data stream. The problem is motivated by the detection of anomalies such as the onset of denial-of-service attacks in computer networks. To handle the paucity of memory relative to the size of the data stream, we use data structures derived from pseudo-random hash functions. We develop statistics that can be efficiently computed on fast data streams and allow one to track the number of heavy users in real time. We provide estimates of the accuracy of these estimators and identify their optimal regime with respect to tuning parameters. In the future, we plan to extend these techniques to the online analysis of multiple data streams.
(Co-Authors: George Michailidis, Stilian Stoev, Michael Kallitsis)
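The flavor of a hash-based heavy-hitter data structure can be illustrated with a count-min sketch, a standard structure of this kind (shown as background, not necessarily the statistics developed in this work). Its estimates never undercount, so thresholding them yields a superset of the true heavy users:

```python
import random
from collections import Counter

class CountMinSketch:
    """Fixed-memory frequency sketch built on pseudo-random hashes."""

    def __init__(self, width=256, depth=4, seed=0):
        rnd = random.Random(seed)
        self.width = width
        self.salts = [rnd.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item):
        # One counter per row is incremented; collisions only inflate counts.
        for row, salt in zip(self.rows, self.salts):
            row[hash((salt, item)) % self.width] += 1

    def estimate(self, item):
        # The minimum over rows is the least-inflated (still >= true) count.
        return min(row[hash((salt, item)) % self.width]
                   for row, salt in zip(self.rows, self.salts))

# A stream with two heavy sources buried in light background traffic.
rnd = random.Random(42)
stream = ["10.0.0.1"] * 500 + ["10.0.0.2"] * 400
stream += [f"host-{rnd.randrange(5000)}" for _ in range(2000)]
rnd.shuffle(stream)

cms = CountMinSketch()
for item in stream:
    cms.add(item)

candidates = {x for x in set(stream) if cms.estimate(x) >= 300}
true_counts = Counter(stream)
heavy = sorted(x for x in candidates if true_counts[x] >= 300)
print(heavy)  # ['10.0.0.1', '10.0.0.2']
```

The memory used is fixed at width × depth counters regardless of stream length, which is the trade-off that motivates studying the statistical accuracy of such estimators.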

Jedidiah Carlson
Biostatistics, Master’s Student (Advisor: Sebastian Zöllner)

Identifying Regional Variation and Context Dependence of Human Germline Mutation using Rare Variants

Mutation is the ultimate source of genetic variation and one of the driving forces of evolution. In both germline and somatic tissues, mutation rates vary along the genome and are affected by local features such as GC content and chromatin structure. Characterizing regional variation of mutation patterns is important for understanding genome evolution and for identifying variants causing genetic diseases. However, many aspects of the interplay between genomic features and mutation patterns are poorly understood. Despite their central importance, the mutation rate and molecular spectrum are difficult to measure in an unbiased, genome-wide fashion. Estimates based on common variants (polymorphisms) and substitutions are confounded by natural selection, population demographic history, and biased gene conversion (BGC). Methods relying on quantifying the population incidence rate of single-gene diseases or finding de novo variants by trio sequencing do not provide sufficient data genome-wide to assess more than the most basic parameters. We overcome these limitations by using a collection of > 30 million singleton variants observed in our whole-genome sequencing study of bipolar disorder (n = 4,000 unrelated subjects). Compared to polymorphisms or substitutions, these extremely rare variants (ERVs) arose very recently and are much less affected by the confounding effects of selection, BGC, etc. Compared to trio sequencing studies, the high density of ERVs (> 1,000 ERVs per 100kb) provides substantially more power to detect subtle effects of genomic and epigenomic context. With this approach, we assess subtle regional differences in the mutation process across the human genome at a 1-10kb scale. We explore the impact of genomic features, such as GC content, functional annotation, and replication timing, on such regional variations.
Moreover, we evaluate the impact of local sequence context on mutation rates, thus possibly providing insight into the underlying biological processes creating mutant alleles. By comparing ERVs and common variants across the genome we are able to assess the effect of evolutionary processes. For example, we observe a significant enrichment for AT>GC transitions among common variants but not ERVs, indicating biased gene conversion favoring the derived allele has a strong impact on the human genome. These results will provide a framework for developing a comprehensive atlas of mutation in the human genome and ultimately improve our knowledge of the core process of genetic variation.
(Co-Authors: Jun Li, Sebastian Zöllner)

Michael Hornstein
Statistics, PhD Candidate (Advisors: Kerby Shedden, Shuheng Zhou)

A Matrix-Variate Approach to Large-Scale Inference with Dependent Observations

Large-scale inference involves testing many hypotheses simultaneously, for example testing for group-wise differences in mean expression levels of thousands of genes. Genomics researchers have claimed that correlations among individuals may be present in such data, due to batch effects or latent variables, violating the traditional independent samples framework. Such correlations change the distribution of test statistics, leading to incorrect assessments of differential expression. In the setting of two-group hypothesis testing with correlated rows and columns, Allen and Tibshirani proposed a matrix-variate model in which the covariances have Kronecker product structure. The Kronecker product model allows the correlation among subjects to be estimated without prior knowledge of its structure. Under this model, we propose a likelihood-based method with increased power and accurate calibration. We assess the performance of the approach using simulations and compare the results to the sphering approach of Allen and Tibshirani. We apply our method to data from two genomic studies, one with only a few correlated samples, and one with heavier dependencies due to batch effects.

Can Le
Statistics, PhD Candidate (Advisors: Elizaveta Levina, Roman Vershynin)

Sparse Random Graphs: Regularization and Concentration of the Laplacian

We study random graphs with possibly different edge probabilities in the challenging sparse regime of bounded expected degrees. Unlike in the dense case, neither the graph adjacency matrix nor its Laplacian concentrates around its expectation, due to the highly irregular distribution of node degrees. It has been empirically observed that simply adding a constant of order 1/n to each entry of the adjacency matrix substantially improves the behavior of the Laplacian. Here we prove that this regularization indeed forces the Laplacian to concentrate even in sparse graphs. As an immediate consequence in network analysis, we establish the validity of one of the simplest and fastest approaches to community detection, regularized spectral clustering, under the stochastic block model. Our proof of the concentration of the regularized Laplacian is based on Grothendieck’s inequality and factorization, combined with paving arguments.
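The regularization step is easy to demonstrate. The sketch below is illustrative (the constant is set to the average degree, one common choice): it adds tau/n to every entry of the adjacency matrix of a sparse two-block stochastic block model, then clusters by the sign of the second eigenvector of the resulting normalized Laplacian.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-block stochastic block model in the sparse regime:
# expected degrees stay bounded as n grows.
n = 1000
z = np.repeat([0, 1], n // 2)                   # true community labels
P = np.where(z[:, None] == z[None, :], 8.0 / n, 1.0 / n)
U = np.triu(rng.random((n, n)) < P, 1).astype(float)
A = U + U.T                                     # symmetric, no self-loops

def regularized_laplacian(A, tau):
    """Normalized Laplacian of the regularized adjacency A + (tau/n) 11^T."""
    m = A.shape[0]
    Ar = A + tau / m                            # add tau/n to every entry
    dinv = 1.0 / np.sqrt(Ar.sum(axis=1))
    return np.eye(m) - dinv[:, None] * Ar * dinv[None, :]

L = regularized_laplacian(A, tau=A.sum() / n)   # tau = average degree
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)           # sign of the 2nd eigenvector

# Agreement with the truth, up to label swapping.
accuracy = max(np.mean(labels == z), np.mean(labels != z))
print(round(accuracy, 2))
```

Without the tau/n term, the spectrum at bounded degrees is dominated by high-degree and isolated vertices and the sign pattern is essentially uninformative; with it, the second eigenvector recovers the blocks well above chance.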

Zhuqing Liu
Biostatistics, PhD Candidate (Advisor: Timothy D. Johnson)

Pre-Surgical fMRI Data Analysis Using a Spatially Adaptive Conditionally Autoregressive Model

Spatial smoothing is an essential step in the analysis of functional magnetic resonance imaging (fMRI) data. One standard smoothing method is to convolve the image data with a three-dimensional Gaussian kernel that applies a fixed amount of smoothing to the entire image. In pre-surgical brain image analysis, where spatial accuracy is paramount, this method is not reasonable, however, as it can blur the boundaries between activated and deactivated regions of the brain. Moreover, while strict false positive control is desired in a standard fMRI analysis, for pre-surgical planning false negatives are of greater concern. To this end, we propose a novel spatially adaptive conditionally autoregressive model with smoothing variances that are proportional to error variances, allowing the degree of smoothing to vary across the brain, and present a new loss function that allows for the asymmetric treatment of false positives and false negatives. We compare our proposed model with two existing spatially adaptive smoothing models. Simulation studies show that our model outperforms these other models; as a real application, we apply the proposed model to the pre-surgical fMRI data of a patient to assess peri- and intra-tumoral brain activity.
(Co-Authors: Veronica J. Berrocal, Andreas J. Bartsch, Timothy D. Johnson)

Kevin Moon
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Alfred O. Hero)

Multivariate f-Divergence Estimation with Confidence

The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of the f-divergence between two distributions from a finite number of samples. This estimator has an MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error.

Rebecca Rothwell
Biostatistics, PhD Student (Advisor: Sebastian Zoellner)

Estimating the Bottleneck of MtDNA Transmission in Humans

Previous studies implicate mutations in mitochondrial DNA (mtDNA) as the cause of major health problems, including colorectal cancer susceptibility, tissue aging, and postlingual deafness. However, the process and magnitude of mtDNA transmission remain unknown. Current experimental results primarily support two models of mtDNA transmission: the simple bottleneck, in which all mtDNA molecules in the cell behave as independent genetic entities, and the nucleoid model, in which mtDNA molecules form polyploid genetic units containing 5-10 identical mtDNA molecules, called nucleoids, which then replicate together as a unit. In this study, we analyzed short-read sequences of the mitochondrial DNA of mothers and children from 189 trios from the Genome of the Netherlands and the Biobanking and Biomolecular Research Infrastructure of the Netherlands. The allele frequencies in the mitochondrial DNA differed considerably between generations, indicating strong genetic drift due to a reduced number of molecules. Using the minor allele counts and read coverage, we developed a method to estimate the size and nature of the bottleneck based on a maximum likelihood equation and model comparisons. We estimate the size of the bottleneck for a model of individual mtDNA transmission and for a model of nucleoid transmission. Comparing the likelihoods of these estimates using the Akaike Information Criterion suggests the nucleoid transmission model is the better fit to these data. We estimate that 18 nucleoids are transmitted from mother to child each generation. This method can be applied to larger data sets and different tissue samples to better understand the mechanism of mitochondrial DNA inheritance in humans.
(Co-Authors: Mingkun Li, Mark Stoneking, Sebastian Zoellner)

Krithika Suresh
Biostatistics, PhD Student (Advisors: Jeremy Taylor, Alex Tsodikov)

Evaluation of the PCPT Risk Calculator Using a Simulation Model

Data collected on the placebo group of the Prostate Cancer Prevention Trial (PCPT) were used in a logistic regression to develop a predictive model for the risk of prostate cancer. Using this model, the Prostate Cancer Prevention Trial Risk Calculator (PCPTRC) was created and published online to allow individuals to input their demographic, prostate-specific antigen (PSA), and digital rectal exam (DRE) information and obtain their probability of having prostate cancer were a biopsy performed at present. A caveat of this model is that although longitudinal PSA and DRE information was collected during the trial, the logistic methods used only the PSA and DRE results closest to the final biopsy. A computer model was used to simulate prostate cancer progression, PSA growth, and clinical diagnosis for a cohort of simulated individuals to match the rates and distribution of PSA and cancer diagnosis in PCPT. A screening schedule matching that used by PCPT was then applied to this cohort. By simulating the natural history of a comparable group of individuals, the aim is to evaluate the effectiveness of the published calculator and whether alternative models that take into account the longitudinal data patterns of PSA and tumor growth are necessary.

Ye Yang
Biostatistics, PhD Candidate (Advisor: Roderick Little)

A Comparison of Doubly Robust Estimators of the Mean with Missing Data

We consider data with a continuous outcome that is missing at random and a fully observed set of covariates. We compare by simulation a variety of doubly robust (DR) estimators for estimating the mean of the outcome. An estimator is DR if it is consistent when either the regression model for the mean function or the propensity to respond is correctly specified. Performance of the different methods is compared in terms of root mean squared error of the estimates and the width and coverage of confidence intervals or posterior credibility intervals in repeated samples. Overall, the DR methods tended to yield better inferences than methods based on an incorrectly specified model when either the propensity or the mean model is correctly specified, but were less successful for small sample sizes, where the asymptotic DR property is less consequential. Two methods tended to outperform the other DR methods: penalized spline of propensity prediction [R.J.A. Little and H. An. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica. 2004; 14: 949–968] and the robust method proposed in [W. Cao, A.A. Tsiatis, and M. Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009; 96: 723–734].
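For readers unfamiliar with the DR property, the classical augmented IPW estimator of the mean can be sketched as follows (a toy simulation with invented parameters, using the true propensity for simplicity):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Outcome missing at random: response depends on a fully observed X.
X = rng.normal(size=n)
Y = 1.0 + X + rng.normal(size=n)       # true mean of Y is 1.0
p = 1 / (1 + np.exp(-X))               # response propensity P(R = 1 | X)
R = rng.binomial(1, p)                 # R = 1 if Y is observed

def ols_predict(X, y):
    """Fit Y ~ X by least squares on the respondents; return a predictor."""
    b = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]),
                        y, rcond=None)[0]
    return lambda x: b[0] + b[1] * x

m = ols_predict(X[R == 1], Y[R == 1])  # outcome regression model

# AIPW / doubly robust estimator of E[Y]: regression predictions plus an
# inverse-propensity-weighted correction from the observed residuals.
# It is consistent if either m(.) or p is correctly specified.
mu_dr = np.mean(m(X) + R * (Y - m(X)) / p)

mu_cc = Y[R == 1].mean()               # complete-case mean (biased)
print(round(mu_dr, 2), round(mu_cc, 2))
```

Here both working models happen to be correct, so the DR estimate sits near the true mean of 1.0 while the complete-case mean is pulled upward; the simulations in the abstract probe exactly what happens when one of the two models is wrong.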

Yuan Zhang
Statistics, PhD Student (Advisors: Elizaveta Levina, Ji Zhu)

Nonparametric Network Denoising

In this work we address the problem of estimating the probability matrix of an exchangeable network. Previous methods are either computationally infeasible or based on strong assumptions. We propose a computable local smoothing method under much weaker conditions. The method is consistent with a competitive minimax error rate, provided the network is generated from a Hölder-class graphon or a stochastic blockmodel. Numerical studies show the high accuracy and wide applicability of our method compared to benchmark methods.

Poster Session II   |   12:30 PM – 1:45 PM

Wenting Cheng
Biostatistics, PhD Candidate (Advisor: Jeremy M.G. Taylor)

Regression Model Estimation and Prediction Incorporating Coefficients Information

We consider a situation where there is a rich amount of historical data available for the coefficients and their standard errors in a regression model of E(Y|X) from large studies, and we would like to utilize this summary information for improving inference in an expanded model of interest, say, E(Y|X, B). The additional variables B could be thought of as a set of new biomarkers, measured on a modest number of subjects in a new dataset. We formulate the problem in an inferential framework where the historical information is translated in terms of non-linear constraints on the parameter space. We propose several frequentist and Bayes solutions to this problem. In particular, we show that the transformation approach proposed in Gunn and Dunson (2005) is a simple and effective computational method to conduct Bayesian inference in this constrained parameter situation. Our simulation results comparing the methods indicate that historical information on E(Y|X) can indeed boost the efficiency of estimation and enhance predictive power in the regression model of interest E(Y|X, B).
(Co-Authors: Jeremy M.G. Taylor, Bhramar Mukherjee)

Jen Durow
Survey Methodology, Master’s Student (Advisor: James Wagner)

Interviewer-Respondent Interactions in Conversational and Standardized Interviewing: Results from a National Face-to-Face Survey in Germany

In recent years, researchers have given greater attention to the interviewing techniques used in survey data collection. The field is presently dominated by one main technique, often referred to as standardized interviewing, whereby an interviewer reads a question exactly as authored and reacts to respondent confusion by employing neutral probes, without giving the respondent any further information. Conversely, in the “conversational” interviewing technique, interviewers are given greater liberty to provide clarification in response to respondent confusion. Proponents of standardized interviewing argue that all respondents should be exposed to the same stimuli. By contrast, conversational interviewing may elicit higher data quality by making sure that each respondent understands the questions in the way they were intended. In this study, we explore respondents’ reactions to both interviewing techniques in a national face-to-face survey conducted in Germany, where interviewers were randomly assigned to use one of the two techniques for their interviews. The survey items span a range of question difficulty, from simple and well-defined to ambiguous and cognitively taxing. We find that respondents show more evidence of confusion when they recognize the interviewer’s ability to respond. We also explore the interesting case in which a difficult question is met with ease by respondents assigned to a standardized interviewer, yet with confusion by those in the conversational group. We address another concern in conversational interviewing, namely that interviewers help too much and provide definitions without any evidence of respondent confusion. We conclude with suggestions for practice and future research in this area.
(Co-Authors: Felicitas Mittereder, Brady West, Frauke Kreuter, Fred Conrad)

Kristjan Greenewald
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Alfred O. Hero)

Robust Kronecker Product PCA for Spatio-Temporal Covariance Estimation

Kronecker PCA involves the use of a space vs. time Kronecker product decomposition to estimate spatio-temporal covariances. In this work the addition of a sparse correction factor is considered, which corresponds to a model of the covariance as a sum of Kronecker products plus a sparse matrix. This sparse correction extends the diagonally-corrected Kronecker PCA of [Greenewald et al 2013, 2014] to allow for sparse unstructured “outliers” anywhere in the covariance matrix, e.g. arising from variables or correlations that do not fit the Kronecker model well, or from sources such as sensor noise or sensor failure. This paper introduces a robust PCA-based algorithm to estimate the covariance under this model, extending the nuclear norm penalized LS Kronecker PCA approaches of [Tsiligkaridis et al 2013, Greenewald et al 2014]. An extension to Toeplitz temporal factors is also provided, producing a parameter reduction for temporally stationary measurement modeling. High dimensional MSE performance bounds are given for these extensions. Finally, the proposed extension of KronPCA is evaluated and compared on both simulated and real data from yeast cell cycle experiments. This establishes the practical utility of the sparse correction in biological and other applications.
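As a rough illustration of the covariance model described here (not the authors' estimation algorithm), one can form a sum of Kronecker products plus a sparse symmetric correction; the dimensions, separation rank, and outlier entry below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
p_s, p_t, r = 4, 3, 2   # spatial dim, temporal dim, separation rank (illustrative)

def random_psd(d):
    """Random symmetric positive semi-definite factor."""
    a = rng.standard_normal((d, d))
    return a @ a.T

# Covariance modeled as a sum of Kronecker products of temporal and spatial factors ...
sigma = sum(np.kron(random_psd(p_t), random_psd(p_s)) for _ in range(r))

# ... plus a sparse symmetric correction holding unstructured "outlier" correlations.
gamma = np.zeros((p_s * p_t, p_s * p_t))
gamma[1, 7] = gamma[7, 1] = 0.5   # a single outlier entry, placed arbitrarily

sigma_corrected = sigma + gamma
print(sigma_corrected.shape)   # (12, 12)
```

The estimation problem in the abstract is the reverse direction: recovering the Kronecker factors and the sparse part from a sample covariance.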

James Henderson
Statistics, PhD Candidate (Advisor: George Michailidis)

Order-Mediated Importance Sampling with Applications to Active Learning for Network Reconstruction

Networks are important for understanding complex systems, and reconstructing unknown networks from data is an active research area. Intervention data are often used to learn directed networks, formally represented as DAGs. In this presentation I will present a framework for active learning by sequentially choosing interventions. I then turn to techniques for addressing the computational challenge of estimating the proposed improvement function, taken to be the entropy of certain posterior marginals. Working with the posterior distribution is difficult in network models due to the combinatorial complexity and discrete nature of DAG-space. It is known that MCMC methods for sampling DAGs mix slowly, but order sampling has been shown to improve mixing. Order sampling constructs a Markov chain over linear orderings, from which DAGs can then be sampled. Methods for correcting the bias this introduces have also been studied. We build on this work by showing how to construct an order-mediated importance sample using the hierarchy among linear orders, partial orders, and DAGs so that the bias correction can be directly estimated from the sampled orders.

Nhat Ho
Statistics, PhD Candidate (Advisor: Long Nguyen)

Optimal Rates of Parameter Estimation for Gaussian and Other Weakly Identifiable Finite Mixture Models

This talk addresses the convergence behavior of parameter estimates for weakly identifiable families of distributions and the effects of fitting models with extra mixing components. The general theory of strong identifiability is an important ingredient in obtaining optimal rates of convergence for parameter estimation in finite mixture models (Chen [1995], Nguyen [2013], Nhat [2015]). This theory, however, is not applicable to several important model classes, including location-covariance multivariate Gaussian mixtures, shape-scale Gamma mixtures, and location-scale-shape skew-normal mixtures. The main part of this talk is devoted to demonstrating that for these “weakly identifiable” classes, the underlying algebraic structure of the density family plays a fundamental role in determining convergence rates of the model parameters, which display a very rich spectrum of behaviors. For instance, the optimal rate of parameter estimation n^{−γ}, where γ < 1/2, of location and covariance parameters for the over-fitted Gaussian mixture is precisely determined by the solvability order of a system of polynomial equations; these rates deteriorate rapidly as more extra components are added to the model. To the best of our knowledge, this is the first result that answers the long-standing open question, posed by machine learning researchers and statisticians, about the highly non-trivial slow convergence of parameters under over-fitted Gaussian mixtures in practice. Finally, the established rates for a variety of settings of weakly identifiable classes are illustrated by careful simulation studies.

Brandon Oselio
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Alfred Hero)

Pareto Frontiers of Multi-layer Social Networks

Social media provides a rich source of networked data. This data can be represented by a set of nodes and a set of relations (edges). It is often possible to obtain or infer multiple types of relations from the same set of nodes, such as observed friend connections, inferred links via semantic comparison, or relations based on geographic proximity. These edge sets can be represented by a statistical multi-layer network. We introduce a novel method to extract information from such networks, where each type of link forms its own layer. Using the concept of Pareto optimality, community detection in this multi-layer setting is formulated as a multiple criterion optimization problem. We propose an algorithm for finding an approximate Pareto frontier containing a family of solutions. The power of this approach is demonstrated on Twitter datasets, where multi-layer networks can be formed via semantic, temporal, and relational information.
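The core selection step in any such approach, keeping only non-dominated candidate solutions, can be sketched in a few lines. This is a generic illustration, not the authors' community-detection algorithm; the two-layer scores below are made up:

```python
def pareto_frontier(points):
    """Return the non-dominated points, minimizing every coordinate.

    A point p is dominated if some other point q is at least as good
    in every criterion (and is not p itself).
    """
    frontier = []
    for p in points:
        dominated = any(
            all(q[k] <= p[k] for k in range(len(p))) and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier

# Each candidate partition scored once per network layer, e.g.
# (cost on friend layer, cost on semantic layer); lower is better.
scores = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
print(pareto_frontier(scores))   # [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
```

The last candidate, (2.5, 2.5), is dropped because (2.0, 2.0) is better on both layers; the frontier returns a family of trade-off solutions rather than a single "best" partition.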
(Co-Author: Alex Kulesza, Alfred Hero)

Nicholas J. Seewald
Biostatistics, Master’s Student (Advisor: Kelley M. Kidwell)

A SMART Web-Based Sample Size Calculator

Sample size is often a primary concern among clinicians seeking to run any trial. Simple-to-use sample size calculators do not yet exist for the design of sequential multiple assignment randomized trials (SMART) in which the primary aim is a comparison of two of the embedded dynamic treatment regimes (DTRs). We present a new, easy-to-use, online tool for computing sample size and power for two-stage SMART studies in which the primary aim is to compare two embedded DTRs with binary or continuous outcomes. The online tool was developed with Shiny, an open-source framework from RStudio for building web applications in R. It will enable clinicians to size any of the four most commonly used SMART design schemes, and has options for users to provide inputs in multiple ways. Users enter specific details of their trial, including the probability of response to first-stage treatment, probabilities of success for each DTR for binary outcomes or effect size for continuous outcomes, and may customize type-I error and power. Ultimately, we believe that our comprehensive, user-friendly application is capable of both powering trials and empowering clinicians to consider SMART designs more often in practice.
(Co-authors: Daniel Almirall, Kelley M. Kidwell)

Maxwell Spadafore
Undergraduate (Advisors: Zeeshan Syed, Ben Hansen)

Under the Influence: Automated Detection of Benzodiazepine Dosage in ICU Patients via Morphological Analysis of ECG

Current in-vehicle systems for driving under the influence (DUI) detection are expensive and limited in the range of substances they can detect. As an alternative, we demonstrate the feasibility of a fully automated system that leverages noisy, low resolution, and low sampling rate electrocardiogram (ECG) from a single lead (Lead II) to detect the presence of a class of mind-altering drugs known as benzodiazepines. Starting with features commonly examined manually by cardiologists searching for evidence of benzodiazepine poisoning, we extended the previously manual annotation and extraction of these features to a fully automated process. We then tested the predictive power of these features using nine subjects from the MIMIC II clinical database in a matched design, where each subject’s control consisted of his/her ECG before any benzodiazepine had been administered. Features were found to be indicative of a binary (yes/no) relationship between dose and ECG morphology, but our simplified, preliminary linear model for dose interactions was unable to find evidence of a predictable continuous relationship. Fitting the binary relationship to a support vector machine classifier with a radial basis function kernel, we were able to detect the influence of benzodiazepines with a sensitivity of 89% and a specificity of 95% — the latter surpasses that of an evidential breathalyzer commonly used by police patrols.

Vincent Tan
Biostatistics, Master’s Student (Advisors: Michael R. Elliott, Carol A.C. Flannagan)

Development of a Realtime Prediction Model of Driver Behavior at Intersections Using Kinematic Time Series Data

As autonomous vehicles enter the fleet, there will be a long period when autonomous vehicles will interact with human drivers. One of the challenges for autonomous vehicles is that human drivers do not communicate their plans. However, the kinematic behavior of a human-driven vehicle may be a good predictor of driver intent within a short time frame. We analyzed the kinematic time-series data (e.g., speed, lane position) for a set of drivers making left turns at intersections to predict whether or not the driver would stop before executing the turn. We used Principal Components Analysis to generate independent dimensions along which the kinematics vary. The dimensions remained relatively consistent throughout the maneuver, allowing us to compute independent scores on these dimensions for different time windows throughout the approach to the intersection. The PCA scores were then used as predictors of stopping. The performance of the prediction model was evaluated at different distances from the intersection and using different time windows.

Lu Tang
Biostatistics, PhD Student (Advisor: Peter X.K. Song)

Regularized Lasso Approach for Parameter Fusion in Data Integration

Combining data sets collected from multiple similar studies is routinely undertaken in practice to achieve a larger sample size and higher power. A major challenge arising from such data integration lies in heterogeneity among studies in terms of underlying population, study coordination, or experimental protocols. Ignoring such heterogeneity in data analysis may result in biased estimation and misleading inference. Traditional remedial techniques include (1) adding interactions with a study indicator or (2) treating study as a random effect. However, the former falls short of maximal power when the number of studies is large, and neither method can reveal the grouping pattern of homogeneous parameters. In this paper, we propose a regularized fusion method that identifies and merges homogeneous parameter groups for generalized linear models in data integration without resorting to hypothesis testing. Utilizing the fused lasso for parameter fusion, we establish a computationally easy procedure to deal with potentially high-dimensional parameters, for which existing statistical software is readily applicable. The use of an estimated parameter ordering in the fused lasso improves computing speed with no loss of statistical power. We conduct extensive simulation studies to demonstrate the performance of the new method in comparison to conventional methods. Two real data examples are provided for illustration.
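A minimal sketch of the fused-lasso penalty that drives the parameter fusion; the function, coefficient values, and tuning parameters below are illustrative, not the paper's implementation:

```python
def fused_lasso_penalty(beta, lam1, lam2):
    """lam1 * sum_j |beta_j|  +  lam2 * sum_j |beta_{j+1} - beta_j|.

    With the coefficients ordered (the estimated parameter ordering
    mentioned in the abstract), the second term pulls similar
    study-specific parameters toward a common, merged value.
    """
    sparsity = sum(abs(b) for b in beta)
    fusion = sum(abs(beta[j + 1] - beta[j]) for j in range(len(beta) - 1))
    return lam1 * sparsity + lam2 * fusion

# Homogeneous study effects incur no fusion cost ...
print(fused_lasso_penalty([0.5, 0.5, 0.5], lam1=0.0, lam2=1.0))    # 0.0
# ... while heterogeneous effects are penalized, favoring merged groups.
print(fused_lasso_penalty([0.25, 0.5, 0.75], lam1=0.0, lam2=1.0))  # 0.5
```

Minimizing a study-stratified log-likelihood plus this penalty yields estimates in which some adjacent coefficients coincide exactly, which is the grouping pattern the method reports.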

Tianshuang Wu
Statistics, PhD Candidate (Advisor: Susan Murphy)

Identifying a Set that Contains the Best Dynamic Treatment Regimes

A dynamic treatment regime (DTR) is a treatment design that seeks to accommodate patient heterogeneity in response to treatment. DTRs can be operationalized by a sequence of decision rules that map patient information to treatment options at specific decision points. The Sequential Multiple Assignment Randomized Trial (SMART) is a trial design that was developed specifically for the purpose of obtaining data that informs the construction of good (i.e., efficacious) decision rules. One of the scientific questions motivating a SMART concerns the comparison of multiple DTRs that are embedded in the design. Typical approaches for identifying the best DTRs involve all possible comparisons between the DTRs embedded in a SMART, at the cost of greatly reduced power as the number of embedded DTRs increases. Here, we propose a method that enables investigators to use SMART study data more efficiently to identify the set that contains the most efficacious embedded DTRs. Our method ensures that the true best embedded DTRs are included in this set with at least a given probability. Data from the Extending Treatment Effectiveness of Naltrexone SMART study are analyzed to illustrate its application.
(Co-Authors: Ashkan Ertefaie, Kevin Lynch, Inbal Nahum-Shani)

Chia Chye Yee
Statistics, PhD Candidate (Advisor: Yves Atchade)

On The Sparse Bayesian Learning of Linear Models

This work is a re-examination of the sparse Bayesian learning (SBL) of linear regression models of Tipping (2001) in a high-dimensional setting. We propose a hard-thresholded version of the SBL estimator that achieves, for orthogonal design matrices, the non-asymptotic estimation error rate of σ√(s log(p)/n), where n is the sample size, p the number of regressors, σ the regression model standard deviation, and s the number of non-zero regression coefficients. We also establish that with high probability the estimator identifies the non-zero regression coefficients. In our simulations we found that sparse Bayesian learning regression performs better than the lasso (Tibshirani 1996) when the signal to be recovered is strong.
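As a hedged sketch of the hard-thresholding idea (the paper's exact rule may differ), a generic noise-scaled threshold such as σ√(2 log(p)/n) can be applied entrywise to a dense preliminary estimate to produce a sparse one:

```python
import math

def hard_threshold(beta, sigma, n, p):
    """Zero out coefficients whose magnitude falls below a noise-scaled level.

    The level sigma * sqrt(2 * log(p) / n) is a common generic choice for
    illustration; it is not necessarily the threshold used in the paper.
    """
    tau = sigma * math.sqrt(2.0 * math.log(p) / n)
    return [b if abs(b) > tau else 0.0 for b in beta]

# A dense preliminary estimate with two strong signals and small noise terms.
est = [0.01, -0.02, 1.4, 0.0, -2.3]
print(hard_threshold(est, sigma=1.0, n=500, p=5))   # [0.0, 0.0, 1.4, 0.0, -2.3]
```

Here the threshold is about 0.08, so the two large coefficients survive and the near-zero ones are set exactly to zero, which is what enables support recovery with high probability.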

Robert Yuen
Statistics, PhD Candidate (Advisor: Stilian Stoev)

Universal Bounds on Extreme Value-at-Risk under Fixed Extremal Coefficients

When estimating extreme value-at-risk for the sum of dependent losses, it is imperative to determine the nature of dependencies in the tails of said losses. Characterizing the tail dependence of regularly varying losses involves working with the spectral measure, an infinite dimensional parameter that is difficult to infer and in many cases intractable. Conversely, various summary statistics of tail dependence such as extremal coefficients are manageable in the sense that they are finite dimensional and efficient estimates are obtainable. While extremal coefficients alone are not sufficient to characterize tail dependence, it was not previously known how they constrain the theoretical range of value-at-risk. The answer involves optimization over an infinite dimensional space of measures. In this work, we establish the solution and determine exact bounds on the asymptotic value-at-risk for the sum of regularly varying dependent losses when given full or partial knowledge of the extremal coefficients. We show that in practice, the theoretical range of value-at-risk can be reduced significantly when relatively few, low dimensional extremal coefficients are given.
(Co-Authors: Stilian Stoev, Dan Cooley)

Poster Session III   |    3:00 PM – 4:00 PM

Andrew Brouwer
PhD Candidate (Advisors: Rafael Meza, Marisa Eisenberg)

HPV as the Etiological Agent of Oral Cancer

The human papillomavirus (HPV) infects multiple sites in the human epithelium (genitals, oral cavity, anal canal) and is responsible for over 90% of anogenital cancer and an increasing percentage of cancer of the oral cavity, primarily in the oropharynx. We leverage age-period-cohort (APC) epidemiological models combined with multistage clonal expansion (MSCE) models (stochastic models of cancer biology) to consider temporal trends and demographic differences in incidence of oral squamous cell carcinomas in the Surveillance, Epidemiology, and End Results (SEER) cancer registry for three groups of subsites: presumed HPV-related, presumed HPV-unrelated, and oral tongue. This method allows us to distinguish between period and birth cohort temporal effects as well as make inferences about the underlying cancer biology.

Yu-Hui Chen
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Alfred O. Hero)

Parameter Estimation in Spherical Symmetry Groups

This paper considers statistical estimation problems where the probability distribution of the observed random variable is invariant with respect to actions of a finite topological group. It is shown that any such distribution must satisfy a restricted finite mixture representation. When specialized to the case of distributions over the sphere that are invariant to the actions of a finite spherical symmetry group G, a group-invariant extension of the von Mises-Fisher (VMF) distribution is obtained. The G-invariant VMF is parameterized by location and scale parameters that specify the distribution’s mean orientation and its concentration about the mean, respectively. Using the restricted finite mixture representation these parameters can be estimated using an Expectation Maximization (EM) maximum likelihood (ML) estimation algorithm. This is illustrated for the problem of mean crystal orientation estimation under the spherically symmetric group associated with the crystal form, e.g., cubic, octahedral, or hexahedral. Simulations and experiments establish the advantages of the extended VMF EM-ML estimator for data acquired by Electron Backscatter Diffraction (EBSD) microscopy of a polycrystalline nickel alloy sample.
(Co-Authors: Dennis Wei, Gregory Newstadt, Marc Degraef, Jeffrey Simmons, Alfred Hero)

Yanzhen Deng
Statistics, PhD Student (Advisor: Susan Murphy)

Machine Learning Methods for Constructing Real-Time Treatment Policies in Mobile Health

Mobile devices are increasingly used to collect symptoms and other information as well as to provide interventions in real-time. These interventions are often provided via treatment policies. The policies specify how patient information should be used to determine when, where and which intervention to provide. Here we present generalizations of “Actor–Critic” learning methods from the field of Reinforcement Learning for use, with existing data sets, in constructing treatment policies. We provide a first evaluation of the actor–critic method via simulation and illustrate its use with data from a smartphone study aimed at reducing heavy drinking and smoking.
(Co-Authors: S.A. Murphy, E.B. Laber, H.R. Maei, R.S. Sutton, K. Witkiewitz)

Hossein Keshavarz
Statistics, PhD Candidate (Advisors: Xuanlong Nguyen, Clayton Scott)

Optimal Detection of Abrupt Changes in Gaussian Processes: Fixed and Increasing Domain Analysis

We study minimax optimal detection of a shift in mean (SIM) in a univariate Gaussian process. The asymptotic analyses of existing algorithms for SIM detection such as CUSUM mainly adopt an unrealistic assumption about the underlying process (e.g., independent samples). Besides, the majority of former studies on SIM detection in time series with dependent samples are restricted to the increasing domain asymptotic framework, in which the smallest distance between sampling points is bounded away from zero. Motivated by abrupt change detection in locally stationary processes and change zone detection in spatial processes, we analyze SIM detection for Gaussian processes in the fixed domain regime, in which samples get denser in a bounded domain. To our knowledge, this is the first work on SIM detection in the fixed domain regime. We show that despite the optimality of CUSUM in the increasing domain, it exhibits poor performance in the fixed domain. We also propose a minimax optimal algorithm using an exact or approximated generalized likelihood ratio test. Our results demonstrate a strong connection between the detection rate and the smoothness of the covariance function of the underlying process in the fixed domain.

Peng Liao
Statistics, PhD Student (Advisor: Susan A. Murphy)

Micro-Randomized Trials in mHealth

The use and development of mobile interventions is experiencing rapid growth. In “just-in-time” mobile interventions, treatments are provided via a mobile device and are intended to help an individual make healthy decisions “in the moment,” and thus have a proximal, near-future impact. Currently the development of mobile interventions is proceeding at a much faster pace than that of the associated data science methods. A first step toward developing data-based methods is to provide an experimental design for testing the proximal effects of these just-in-time treatments. In this poster, we propose a “micro-randomized” trial design for this purpose. In a micro-randomized trial, treatment components are sequentially randomized throughout the conduct of the study, with the result that each participant may be randomized at the 100s or 1000s of occasions at which a treatment component might be provided. Further, we develop a test statistic for assessing the proximal effect of a treatment component as well as an associated sample size calculator. We conduct simulation evaluations of the sample size calculator in various settings. Rules of thumb that might be used in designing a micro-randomized trial are discussed at the end. This work is motivated by our collaboration on the HeartSteps mobile application designed to increase physical activity.
(Co-Authors: Ambuj Tewari, Predrag Klasnja, Susan A. Murphy)

Dao Nguyen
Statistics, PhD Candidate (Advisor: Edward Ionides)

Particle Iterated Smoothing

Second-order particle Markov chain Monte Carlo is an attractive class of parameter inference methods for state-space models: by exploiting the score and observed information matrix estimated from the particle filter, it can improve estimation by (i) shortening the burn-in period, (ii) accelerating mixing of the Markov chain at stationarity, and (iii) simplifying tuning. Unfortunately, current approaches rely on the ability to sample from the derivatives and Hessians of the transition and observation densities, which is often unrealistic. We therefore propose a simpler approach, namely particle iterated smoothing. We derive a theoretical analysis of the asymptotic properties of our approach, and we also show better empirical results compared to standard methods.

Karen Nielsen
Statistics, PhD Student (Advisor: Rich Gonzalez)

Comparing Modeling Approaches to EEG Data for Event-Related Potentials

There are many ways to think about and process EEG data. Researchers may be interested in peaks during 1000-millisecond windows (for assessing response to controlled stimuli in an experimental setting), or general waveforms over longer windows (such as for identifying sleep stages). Filtering is considered a necessity, primarily to reduce noise, but a wide variety of filters are available with only heuristic (not theoretical) recommendations for use. Here, we focus on Event-Related Potentials, which generally involve waveforms with only one or a few oscillations. The traditional approach to signal averaging for these waveforms involves averaging values at each time point to produce an estimate of the true waveform. However, current techniques for dealing with longitudinal data with such forms may allow a more natural fitting process. Since EEG readings consist of highly-correlated multi-channel readings, an ideal modeling approach should incorporate random effects in an informed way. Proposed approaches to EEG data and ERPs are compared in terms of landmark estimation and alignment with scientific goals.

Martha Rozsi
Survey Methodology, Master’s Student

Creating a Flexible and Scalable PSU Sample for NHTSA’s Redesign of the National Automotive Sampling System

The redesign of the National Automotive Sampling System (NASS) required a flexible and scalable PSU sample that can respond to future changes in budget levels and precision needs. It was assumed that the future sample for NASS could have between 16 and 101 primary sampling units (PSUs) for the Crash Report Sampling System module, and between 16 and 96 PSUs for the Crash Investigation Sampling System module. This paper describes an approach that allows the number of PSU strata, and thus the depth of stratification, to change in response to changes in budget and thus total PSU sample sizes. Conditional probabilities are calculated that allow the PSUs in the largest PSU sample to be subsampled as needed to meet future budgetary levels.
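The conditional-probability device can be illustrated with made-up inclusion probabilities (not actual NASS design figures): retaining a PSU from the largest sample with probability equal to the ratio of target to original inclusion probability reproduces the smaller design's probabilities overall.

```python
# Hypothetical inclusion probabilities for one PSU (illustrative only):
pi_large = 0.40   # P(selected into the largest PSU sample)
pi_small = 0.10   # target P(selected) under a reduced budget

# Retain the PSU with this conditional probability, given it was selected ...
p_keep = pi_small / pi_large

# ... so the unconditional inclusion probability matches the target design.
overall = pi_large * p_keep
print(p_keep, overall)   # 0.25 0.1
```

Because subsampling only ever removes PSUs from the largest sample, the nested samples stay consistent as the budget, and hence the target sample size, shrinks.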

Jingchunzi Shi
Biostatistics, PhD Candidate (Advisor: Seunggeun Lee)

Novel Statistical Model for GWAS Meta-analysis and its Application to Trans-ethnic Meta-analysis

Trans-ethnic genome-wide association study (GWAS) meta-analysis has proven to be a practical and profitable approach for identifying loci which contribute to the missing heritability of complex traits. However, the expected heterogeneity of genetic effects cannot be easily accommodated by existing approaches. In response, we propose a novel trans-ethnic meta-analysis methodology that flexibly models the expected heterogeneity of genetic effects across diverse populations. Specifically, we consider a modified random effects model in which genetic effect coefficients are random variables whose correlation structure across ancestry groups reflects the expected heterogeneity (or homogeneity) among ancestry groups. To test for associations, we derive a data-adaptive variance component test with adaptive selection of the correlation structure to increase power. Simulations demonstrate that our proposed method substantially outperforms traditional meta-analysis methods. Furthermore, our proposed method provides scalable computing time for genome-wide data. For real data analysis, we re-analyzed the published type 2 diabetes GWAS meta-analyses from Consortium et al. (2014) and successfully identified one additional SNP which clearly exhibits heterogeneity of genetic effects among different ancestry groups but could not be detected by traditional meta-analysis methods.

Ji Sun
Statistics, Undergraduate (Advisor: S.A. Murphy)

Sample Size Calculator for Sizing Micro-Randomized Trials in Mobile Health

The use and development of mobile interventions is experiencing rapid growth. In “just-in-time” mobile interventions, treatments are provided via a mobile device and are intended to help an individual make healthy decisions “in the moment,” and thus have a proximal, near-future impact. Micro-randomized trials can be used to provide data to develop these just-in-time mobile interventions. In these trials treatments are sequentially randomized throughout the conduct of the trial, with the result that each participant may be randomized at the 100s or 1000s of occasions at which a treatment might be provided. This work provides a web-based sample size calculator that can be used to determine the sample size needed for micro-randomized trials. The effect we want to detect is the main proximal effect of an “in-the-moment” treatment provided via a mobile device in the setting of a micro-randomized trial. The calculator requires specification of a standardized time-varying effect (quadratic in our case) as well as a vector of time-varying expected availability. This work is motivated by the HeartSteps mobile application designed to increase physical activity and builds on the sample size formula derived in the paper “Micro-Randomized Trials in mHealth” (P. Liao, A. Tewari, P. Klasnja & S.A. Murphy).
(Co-Authors: Peng Liao, Nick Seewald, S.A. Murphy)

Fan Wu
Biostatistics, PhD Candidate (Advisor: Yi Li)

A Pairwise-Likelihood Augmented Estimator for the Cox Model under Left-Truncation

Survival data collected from prevalent cohorts are subject to left-truncation. The conventional conditional approach using the Cox model disregards the information in the marginal likelihood of the truncation time and thus can be inefficient. On the other hand, length-biased sampling (LBS) methods that incorporate the marginal information rely on a stationarity assumption and can yield biased estimation when it is violated. In this paper, we propose a semiparametric estimation method that augments the Cox partial likelihood with a pairwise likelihood, by which we eliminate the unspecified truncation distribution in the marginal likelihood yet retain the information about regression coefficients and the baseline hazard. Exploiting self-consistency of the estimator, we give a fast algorithm to solve for the regression coefficients and the cumulative hazard simultaneously. The proposed estimator is shown to be consistent and asymptotically normal with a sandwich-type consistent variance estimator. Simulation studies show a substantial efficiency gain in both the regression coefficients and the cumulative hazard over Cox model estimators, and the gain is comparable to that of LBS methods when the stationarity assumption holds. For illustration, we apply the proposed method to the RRI-CKD data.
(Co-Authors: Sehee Kim, Jing Qin, Yi Li)

Tianpei Xie
Electrical Engineering/Computer Science, PhD Candidate (Advisor: Alfred O. Hero)

Semi-Supervised Multi-Sensor Classification via Consensus-Based Multi-View Maximum Entropy Discrimination

In this paper, we consider multi-sensor classification when there is a large number of unlabeled samples. The problem is formulated under the multi-view learning framework, and a Consensus-based Multi-View Maximum Entropy Discrimination (CMV-MED) algorithm is proposed. By iteratively maximizing the stochastic agreement between multiple classifiers on the unlabeled dataset, the algorithm simultaneously learns multiple high-accuracy classifiers. We demonstrate that our proposed method can yield improved performance over previous multi-view learning approaches by comparing performance on three real multi-sensor data sets.
(Co-Authors: Nasser M. Nasrabadi, Alfred O. Hero)

Gregory Zajac
Biostatistics, Master’s Student (Advisor: Goncalo Abecasis)

Detecting Sample Misidentifications in GWAS Using Multiple Marker-Phenotype Associations

Genome-wide association studies (GWAS) attempt to establish statistical evidence for association between a disease or trait and alleles at genetic loci. Errors in handling the genetic samples or in data entry can apply the wrong phenotypic information to a set of genotypes and reduce the power to detect these associations. Checks of genetic sex will only find half of these errors on average, and relatedness checks are uninformative in studies that use unrelated individuals, so there is a need for a statistical method to find these errors. In this presentation, I derive a method to find these errors using associations between a set of phenotypes and thousands of independent genetic markers, and show some preliminary results.

Tianyu Zhan
Biostatistics, Master’s Student (Advisor: Hui Jiang)

Unit-free and Robust Detection of Differential Expression from RNA-Seq Data

Ultra high-throughput sequencing of transcriptomes (RNA-Seq) has recently become one of the most widely used methods for quantifying gene expression levels due to its decreasing cost, high accuracy and wide dynamic range for detection. However, the nature of RNA-Seq makes it nearly impossible to provide absolute measurements of transcript concentrations. Several units or data summarization methods for transcript quantification have been proposed to account for differences in transcript lengths and sequencing depths across genes and samples. However, none of these methods can reliably detect differential expression directly without further proper normalization. We propose a statistical model for joint detection of differential expression and data normalization. Our method is independent of the unit in which gene expression levels are summarized. We also introduce an efficient algorithm for model fitting. Due to the L0-penalized likelihood used by our model, it is able to reliably normalize the data and detect differential expression in some cases when more than half of the genes are differentially expressed in an asymmetric manner. The robustness of our proposed approach is demonstrated with simulations.
