Poster abstracts

Jesus Daniel Arroyo,  University of Michigan

High-dimensional graph classification with applications to brain connectomics

We study the problem of supervised classification of labeled graphs. Although statistical analysis of a single network has received a lot of attention in recent years, with a focus on social networks, analysis of a sample of networks presents its own challenges, which require a different set of analytic tools. The main motivation for our work is the study of brain networks, which are constructed from imaging data to represent functional connectivity between regions of the brain. Previous work has shown the potential of such networks as diagnostic biomarkers for brain disorders. Existing approaches to graph classification tend to either treat the edges as a long multivariate vector, ignoring the network structure, or focus on the graph topology while ignoring the edge weights. Our goal here is to incorporate both the individual edge information and the overall graph structure in a computationally efficient way. We are also interested in obtaining a parsimonious and interpretable representation of differences in brain connectivity patterns between classes, in addition to achieving high classification accuracy. We achieve this by introducing penalties that encourage structured sparsity, and implement them via efficient convex optimization algorithms. The method shows good performance both on simulated networks and on real data from fMRI studies of schizophrenic patients.

 

 

Shrijita Bhattacharya,  University of Michigan

AMON: An Open Source Architecture for Online Monitoring, Statistical Analysis and Forensics of Multi-gigabit Streams

The Internet, as a global system of interconnected networks, provides an extensive array of information resources and services. Key requirements include good quality-of-service and protection of the infrastructure from nefarious activity (e.g. cyber-attacks such as distributed denial of service (DDoS), password or port scans, etc.). The real-time monitoring and rapid, online, statistical analysis of network traffic is essential for understanding its structure and for identifying and preventing cyber-attacks on the underlying infrastructure and users. Yet, the vast multi-gigabit traffic volume poses formidable challenges, which require a specialized approach that integrates: (i) advanced network engineering packet capture techniques; (ii) algorithms and data structures tailored for fast data streams; and (iii) novel statistical methods that can be implemented “online” with a single pass over the data. This motivated us to develop an open source architecture, AMON (All-packet MONitor), for online monitoring and sequential analysis of multi-gigabit data streams under relatively stringent time and space constraints. AMON examines all packets passing through an interface, partitions traffic into sub-streams by using rapid hashing, and computes certain real-time statistical summaries. The resulting data structures provide views of the intensity and connectivity structure of network traffic at the time-scale of routing. AMON integrates several statistical modules, which can help automatically detect statistically significant heavy hitters (outliers) in traffic volume, relative volume and connectivity. The statistical methods address the heavy-tailed nature of traffic flows by using extreme value theory and properties of the order statistics to provide a data-adaptive methodology, which is robust to changing traffic conditions. We demonstrate our framework in the context of detection/identification of heavy hitters as well as the visualization and statistical detection, at the time of onset, of high-connectivity events such as DDoS. This allows operators to quickly visualize and detect attacks and limits time-consuming offline post-mortem analysis. AMON deploys the high-performance software packet monitor PF-RING for processing 10Gbps+ live Internet traffic at Merit Network. The AMON framework does not require specialized hardware, is readily deployable, and is extensible, allowing the addition of further statistical and/or filtering modules for real-time forensics.
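
As a rough illustration of the hash-partitioning step described above, the sketch below (not the actual AMON/PF-RING code) splits flows into a fixed number of sub-streams and flags buckets with unusually large volume; the packet representation, bucket count, and the simple median-based rule are hypothetical stand-ins for AMON's data structures and extreme-value-based detectors.

    import hashlib
    from collections import Counter

    NUM_BUCKETS = 64  # hypothetical number of sub-streams

    def bucket_of(src_ip, dst_ip):
        """Map a flow (source, destination) to one of NUM_BUCKETS sub-streams."""
        key = "{}->{}".format(src_ip, dst_ip).encode()
        return int(hashlib.md5(key).hexdigest(), 16) % NUM_BUCKETS

    def summarize(packets):
        """Accumulate per-bucket byte counts for one monitoring interval.
        'packets' is an iterable of (src_ip, dst_ip, nbytes) tuples."""
        volume = Counter()
        for src_ip, dst_ip, nbytes in packets:
            volume[bucket_of(src_ip, dst_ip)] += nbytes
        return volume

    def heavy_hitters(volume, factor=10):
        """Flag buckets whose volume is extreme relative to the median bucket;
        a crude stand-in for the extreme-value-based detection modules."""
        med = sorted(volume.values())[len(volume) // 2]
        return [b for b, v in volume.items() if v > factor * max(med, 1)]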

 

Carles Breto,  University of Michigan

Time series analysis for ecological systems

We have developed statistical techniques to enable time series analysis via mechanistic models for disease systems. These mathematical models have long played a role in understanding the ecology underlying fluctuations in population abundance. We require that such models capture not just the qualitative features of the dynamics but also explain quantitatively all available observations on the system, for which it is critical that stochastic variation in the system be represented adequately. Inference for such general classes of nonlinear stochastic dynamic models with both process and observation variability has been made possible by recent advances in statistical methodology.

 

Philipp Burckhardt,  Carnegie Mellon University

Powered by the Web: Harnessing web technologies for large-scale data analysis

This is joint work with Athan Reines, PhD, data scientist at Vurb.

Background:  Today’s web applications have nothing in common with the websites we used twenty years ago. Ever-growing parts of business life happen online, allowing cloud-based providers to capture unparalleled amounts of data. Integrating, aggregating, and analyzing this data is now critical to the success of many businesses. In order to minimize the lag from insight to action, enterprises perform statistical analyses as on-line operations on data streams. This approach helps bridge the gap from ex-post analyses to direct integration into the business process, allowing for continuous on-line experiments and optimization.

One particular technology which is driving massive change in how enterprises address web-scale problems is JavaScript. For a long time, JavaScript was maligned as a primitive tool for simple interactions and animations for websites. Today, JavaScript is everywhere, powering complex web and mobile applications and becoming a crucial force in the emerging Internet of Things. Node.js, a high performance web server framework written in JavaScript, is used by many enterprises to help scale business operations. NPM (http://www.npmjs.com/), a Node.js package manager, has eclipsed all other language package managers both in terms of sheer size, as well as rate of growth. JavaScript packages exist for incredibly diverse applications such as robotics, cryptography, infrastructure management, and natural language processing.

In our work, we present an approach for peer-to-peer distributed computing using browser-based technologies. In previous approaches, projects such as Foldit and SETI@home required peers to download and install applications. These applications required server infrastructure to coordinate peer jobs, data transfer, and data collection. We demonstrate that such complex orchestration is no longer needed. Browser technologies now provide facilities for direct peer-to-peer communication without server mediation.

Previous approaches were also limited in that the approaches were designed for single use cases and did not allow arbitrary computation. We further demonstrate that arbitrary computation is now possible using an extensive numerical analysis package we have written in JavaScript and openly released to the community. The package is comparable in terms of functionality to the more traditional languages used for analysis, R and Python, and in many cases more performant.

Lastly, we outline an approach which combines browser-based peer-to-peer communication, web torrents, and our numerical analysis package to perform arbitrary distributed computation at web scale. In recent years, enterprises have invested significantly in developing and managing distributed frameworks for data storage and processing, e.g., Hadoop, Storm, Spark, and more. These frameworks require massive investments in physical infrastructure and entire development teams to manage and maintain. While extremely powerful, we believe these technologies are fundamentally limited both in terms of computation scale and accessibility, due to significant upfront costs and investment. We believe our work hints at a future where peer-to-peer distributed computation both democratizes access to supercomputing and promises a whole new class of applications previously impossible.

 

 

Sougata Chaudhuri,  University of Michigan

Online Learning to Rank with Feedback at the Top

We consider online learning to rank for information retrieval, where a ranking system updates itself sequentially using explicit but highly restricted feedback from users. We formalize our problem as an online game where a learner interacts with an oblivious adversary over T rounds. In each round, the adversary generates a list of m documents pertaining to a query, and the learner ranks the documents. The adversary then generates a relevance vector for the document list and the learner updates its ranker according to the feedback received. We consider the challenging setting where the feedback received is the relevance levels of only the top k documents in the ranked list, where k << m. However, the performance of the learner is judged, cumulatively over T rounds, based on the full relevance vectors, using any of the popular learning to rank surrogate loss functions. We develop efficient algorithms for pointwise, pairwise and listwise ranking surrogates and derive formal regret bounds. We apply our algorithms on benchmark ranking datasets, demonstrating the ability to efficiently learn a ranking function, online, from highly restricted feedback.
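
A toy version of the protocol described above, assuming a linear scorer and a simple pointwise squared-loss update; the feature dimension, document generation, and step size are hypothetical, and the actual algorithms and surrogates in this work differ.

    import numpy as np

    def online_rank_topk(T, d=5, m=20, k=3, eta=0.1, seed=0):
        """Toy protocol: a linear scorer ranks m documents each round and updates
        from a pointwise squared-loss gradient using only the top-k relevances."""
        rng = np.random.default_rng(seed)
        w = np.zeros(d)
        hidden = rng.normal(size=d)                 # stands in for the adversary
        for _ in range(T):
            X = rng.normal(size=(m, d))             # document feature vectors
            relevance = (X @ hidden > 0).astype(float)
            order = np.argsort(-(X @ w))            # learner's ranking
            top = order[:k]                         # feedback only at the top
            grad = X[top].T @ (X[top] @ w - relevance[top]) / k
            w -= eta * grad                         # pointwise surrogate update
        return w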

 

 

Pin-Yu Chen, University of Michigan

Embracing Data Science with Graph Mining: Action Recommendations for Cyber Security, Clustering, and Beyond

Many modern datasets are represented as graphs or stored in graph structures, ranging from social, cyber and physical network data, biological and chemical reactions to network flows and transportation systems, among others. However, the graph structure of such data limits the utility of traditional data analysis techniques developed for multivariate data samples, both in theory and practice. This work aims to bridge this gap by proposing novel graph models and investigating their utility and limitations for graph data analysis. Moreover, we develop several efficient and effective graph mining methods involving graph data mining, action recommendations and interactive decision making, particularly for the following fields: (a) cyber intrusion detection; (b) implementation of action recommendations against real-time cyber attacks using cloud services; (c) data-driven network resilience enhancement; (d) automated model order selection for graph clustering; (e) interactive graph data analysis. On the theory side, we establish fundamental principles that govern the feasibility of graph models and develop efficient graph mining algorithms with performance guarantees. On the practice side, we demonstrate the success of the proposed data-driven graph mining methods in the aforementioned fields.

 

Hao Cheng,  Iowa State University

Parallel Computing to Speed up Whole-Genome Bayesian Regression Analyses

Bayesian multiple regression methods, which are widely used for whole-genome analyses, are computationally intensive. However, the computing time for these methods is linear in the amount of data. Thus, parallel computing, taking advantage of multiple cores on computers, can be used to speed up computing time. In a widely used Markov chain Monte Carlo (MCMC) strategy, a single-locus mixed model equation is solved to sample the effect for each locus. Here, the most time-consuming calculation is a dot product of a vector of length n, the number of observations. This dot product is split up and done in parallel on multiple cores; results from each core are combined to get the dot product. This same strategy is also applied to other calculations of order n. When n was a million, with 120 cores a speedup of about 58 times was achieved. When n is larger than p, the number of markers, sampling marker effects is more efficient by setting up the full mixed model equation for all loci. Once the full mixed model equation is formed, the dot product required to sample a marker effect is of order p rather than n. However, setting up the full mixed model equation involves dot products of order n. Thus, these dot products for setting up the full mixed model equations are done in parallel. When n was a million, with 2400 cores a speedup of about 1279 times was achieved. In addition, a strategy using Independent Metropolis-Hastings (IMH) sampling to parallelize MCMC sampling for whole-genome analyses is presented. We also propose a strategy to construct the proposal distribution in IMH.
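
A minimal sketch of the split-and-combine idea behind the parallelized dot product (in Python with multiprocessing); the core counts and speedups reported above come from the actual large-scale implementation, not from this toy.

    import numpy as np
    from multiprocessing import Pool

    def _partial_dot(chunk):
        # Dot product of one (x_chunk, y_chunk) pair, computed on a single core.
        x_chunk, y_chunk = chunk
        return float(np.dot(x_chunk, y_chunk))

    def parallel_dot(x, y, n_workers=4):
        """Split a length-n dot product into n_workers chunks, compute the partial
        dot products in parallel, and sum the results."""
        chunks = list(zip(np.array_split(x, n_workers), np.array_split(y, n_workers)))
        with Pool(n_workers) as pool:
            return sum(pool.map(_partial_dot, chunks))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = rng.normal(size=1_000_000)
        y = rng.normal(size=1_000_000)
        print(parallel_dot(x, y), float(np.dot(x, y)))  # the two values agree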


Ruihua Cheng,  New Jersey Institute of Technology

Learning-Based Method with Valence Shifters for Sentiment Analysis

Automatic sentiment classification is becoming a popular and effective way to help online users or companies process and make sense of customer reviews. In this paper, we propose an extension of learning-based methods for classification that achieves better classification accuracy. The method combines two recent developments. First, valence shifters and individual opinion words are combined as bigrams to use in an ordinal margin classifier. Second, relational information between unigrams, expressed in the form of a graph, is used to constrain the parameters of the classifier. By combining these two components, our method is able to extract more of the unstructured information present in the data than previous methods, such as random forests, and hence gains the potential for better performance. Indeed, our results show higher classification accuracy on real data with ground truth as well as on simulated data.

 

Youngjun Choe, University of Michigan

Extension and Information Criterion for Importance Sampling

Importance sampling (IS) is one of the most popular variance reduction techniques in the simulation literature. IS reduces the variance of an estimator by focusing the sampling effort on important simulation inputs that are likely to cause the event of interest. However, the applicability of IS has been limited to deterministic simulation models, where the same input always generates the same output. Our recent work (Technometrics, 57(3):351-361, 2015) extends IS to stochastic simulation models, which instead produce random output for the same input. We also establish the central limit theorem for the new estimator under mild assumptions and construct an asymptotically valid confidence interval. Next, we tackle the model selection problem inherent in IS. The implementation of IS involves selecting the IS distribution, which guides the sampling effort and determines the extent of variance reduction. Although the optimal IS distribution is known theoretically, it is not implementable in practice, requiring some approximation such as the cross-entropy (CE) method. This method finds the IS distribution by minimizing an estimate of the CE between the candidate distribution and the optimal distribution. By correcting the bias in the existing CE estimator, we derive a model selection criterion analogous to the Akaike information criterion. We apply the proposed methods to the reliability evaluation of a wind turbine using aeroelastic simulators developed by the U.S. National Renewable Energy Laboratory.
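
For context, a textbook IS estimator in the deterministic, rare-event setting (a standard normal tail probability, with a shifted normal as the IS distribution); this is a generic illustration of the variance-reduction and likelihood-ratio idea, not the stochastic-simulation extension or the CE-based criterion developed in this work.

    import numpy as np
    from scipy import stats

    def is_tail_probability(threshold=4.0, n=10_000, shift=4.0, seed=0):
        """Estimate P(X > threshold) for X ~ N(0, 1) by sampling from N(shift, 1)
        (the IS distribution) and reweighting by the likelihood ratio."""
        rng = np.random.default_rng(seed)
        x = rng.normal(loc=shift, size=n)                     # draws from the IS density
        w = stats.norm.pdf(x) / stats.norm.pdf(x, loc=shift)  # likelihood ratios
        contrib = (x > threshold) * w
        estimate = contrib.mean()
        std_error = contrib.std(ddof=1) / np.sqrt(n)          # CLT-based standard error
        return estimate, std_error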

 

Walter Dempsey, University of Michigan

Atypical scaling behavior persists in real world interaction networks

Scale-free power law structure describes complex networks derived from a wide range of real world processes. The extensive literature focuses almost exclusively on networks with power law exponent strictly larger than 2, which can be explained by constant vertex growth and preferential attachment. The complementary scale-free behavior in the range between 1 and 2 has been mostly neglected as atypical because there is no known generating mechanism to explain how networks with this property form. However, empirical observations reveal that scaling in this range is an inherent feature of real world networks obtained from repeated interactions within a population, as in social, communication, and collaboration networks. A generative model explains the observed phenomenon through the realistic dynamics of constant edge growth and a positive feedback mechanism. Our investigation, therefore, yields a novel empirical observation grounded in a strong theoretical basis for its occurrence.

 

Alexander Giessing,  University of Michigan

Model Selection in Misspecified Quantile Regression

Statistical models are at best approximations to reality; in practice, we have to presume that all models are misspecified. Since quantile regression (QR) belongs to the family of robust statistical models and requires only mild assumptions on the error distribution, it seems to be a natural candidate to be analyzed under misspecification. Yet, with few exceptions, research on QR has focused on correct specification of the model only. We analyze linear QR models under misspecification and present two new results: First, we prove an estimation consistency result for high-dimensional misspecified QR problems. Second, we derive a generalized Akaike information criterion (AIC) for model selection in smoothed QR problems. For the first time, we propose an AIC-type model selection criterion for QR that has a quantile-dependent penalty term which penalizes both model complexity and model misspecification. Along with our methodological developments for misspecified QR models, we also make several technical and theoretical contributions: we prove a new implicit function theorem for Hadamard differentiable functionals in general Banach spaces, derive a Bahadur-type representation for misspecified quantile regression models, and propose a second-order asymptotic representation for an approximate solution to the QR problem.

 

Jun Guo, University of Michigan

Stochastic optimization for high-dimensional Generalized Linear Mixed Models

We propose and study stochastic nonconvex optimization methods for fitting high-dimensional generalized linear mixed effect models (GLMMs). A GLMM can be viewed as an extension of generalized linear models for clustered data, obtained by adding random effects to the linear predictors. High-dimensional GLMMs have wide potential applications in many areas, including genome-wide association studies (GWAS), but few efficient algorithms have been developed to fit GLMMs in high dimensions, especially for these potential applications. The fitting is computationally challenging because the log-likelihood function involves an intractable high-dimensional integral and is typically non-concave. We propose a stochastic block coordinate descent algorithm and an ADMM algorithm incorporating Markov chain Monte Carlo techniques to optimize the $\ell_1$ penalized nonconcave log-likelihood function and estimate the model parameters. We compare our proposed algorithms to the GLMMLasso algorithm of Schelldorfer et al. and the recent stochastic proximal gradient algorithm of Atchade et al. in both statistical accuracy and computational efficiency, and present an interesting application to the analysis of the FIFA World Cup 2014.

 

 

Fei He,  University of Michigan

An Uncertainty Quantification Framework for Multiple Parameters: Case Study on Evaluating the Sensitivities of AGCM-Simulated Tropical Cyclones to Initial Conditions

This work explores the utility of proven quantitative sensitivity analysis techniques for the analysis of dynamical systems in atmospheric general circulation models (AGCMs). Specifically, it uses an Extended Multivariate Adaptive Regression Splines (EMARS) algorithm to examine the response of cloud radiative forcing, cloud content, and precipitation rate in idealized AGCM-simulated tropical cyclones (TCs) to changes in TC initial conditions. The use of a multivariate sensitivity analysis algorithm allows the simultaneous examination of multiple controlling factors and their interactions. Use of a well-established idealized TC test case framework (1) allows for a thorough evaluation of the sensitivity analysis algorithm, and (2) extends the test case methodology to a quantitative and multivariate examination of dynamical system sensitivity. Consistent with previous coarse- and fine-resolution idealized and case study analyses of TCs, the EMARS algorithm yields increases in TC intensity for higher sea surface temperature (SST) and larger atmospheric lapse rates. However, multivariate perturbation of initial conditions and environment allows a more nuanced exploration of the sensitivities in the system. It is found that: (1) atmospheric lapse rate and SST have strong but different impacts on TC characteristics. Larger SST is beneficial for all TC characteristics, such as TC intensity, shortwave cloud radiative forcing (SWCF), and precipitation rate. In contrast, a larger atmospheric lapse rate is beneficial for TC intensity, longwave cloud radiative forcing (LWCF), and cloud ice water path (IWP), but harmful for SWCF, cloud liquid water path (LWP) and precipitation; (2) most TC characteristics show very weak sensitivity to the size and strength of the initial disturbance, except SWCF and IWP; (3) TC intensity shows a linear relationship with LWCF, but no dependence on SWCF. More relationships are quantified and their implications are discussed.

Joint work with Derek J. Posselt, Naveen N. Narisetty, Colin M. Zarzycki, and Vijayan N. Nair

 

Nhat Ho, University of Michigan

Intrinsic difficulties for the inference of mixing measures in finite mixtures of univariate skew normal distributions

The skew normal distribution was originally proposed by Azzalini in a series of influential papers [Azzalini, 1986, Azzalini and Valle, 1996, Azzalini and Capitanio, 1999] and has become increasingly popular in recent years due to its flexibility in modeling data. This density class generalizes the Gaussian distributions by having an extra shape parameter, which controls the skewness of the density. However, the skew normal distribution suffers from a singular Fisher information matrix when the skewness parameter becomes 0 [Chiogna, 2005, Ley and Paindaveine, 2010, Hallin and Ley, 2012, 2014]. This property leads to the slow convergence rate n^{-1/6} of parameter estimation in the vicinity of symmetry. Additionally, Hallin and Ley [2014] demonstrated that higher-order singularities cannot lead to a convergence rate worse than n^{-1/8} for the general class of skew-symmetric distributions. In this paper, we show that under finite mixtures of univariate skew normal distributions [Lin et al., 2007], parameter estimation can suffer from singularities of any order, such as quadruple singularities (n^{-1/10} convergence rate), quintuple singularities (n^{-1/12} convergence rate), or even “infinite” singularities (rates not of the form n^{-1/s} for any s \geq 2). These singularities arise not only in the vicinity of symmetry but also on a “cousin” set. This leads to an extremely broad spectrum of behavior, some of which is shared with the Gamma family and some with the Gaussian family [Ho and Nguyen, 2015b].

 

Elizabeth Hou, University of Michigan

A Penalized Ensemble Kalman Filter for Atmospheric Data Assimilation

Data assimilation is a scheme in which observations are incorporated into a modelled process in order to estimate the hidden states of the system. The most popular model for data assimilation is the ensemble Kalman filter, of which there are many variations. However, in the high-dimensional case the ensemble Kalman filter suffers from sampling errors, and due to the recursive nature of Kalman filters, these errors propagate through time, leading the model to degenerate. We propose a new variation of the ensemble Kalman filter, which uses regularization to learn a sparse precision matrix in order to provide a better estimate of the sample covariance matrix. Our method does not require the domain knowledge necessary in popular localization methods, nor does it require additional variance inflation. We apply our model to simulations of atmospheric systems and compare it to current state-of-the-art models.
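
As a rough illustration of replacing the raw ensemble sample covariance with a regularized, sparse-precision estimate, the sketch below uses scikit-learn's graphical lasso as a generic stand-in; the authors' penalized estimator and filter update may differ, and the observation operator H, noise covariance R, and penalty alpha are hypothetical inputs.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    def regularized_kalman_gain(ensemble, H, R, alpha=0.05):
        """Estimate the forecast covariance from a small ensemble via a sparse
        precision matrix (graphical lasso), then form the usual Kalman gain
        K = P H' (H P H' + R)^{-1}."""
        gl = GraphicalLasso(alpha=alpha).fit(ensemble)  # rows = ensemble members
        P = gl.covariance_                              # regularized covariance estimate
        S = H @ P @ H.T + R
        return P @ H.T @ np.linalg.inv(S)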

 

Yirui Hu, Rutgers University

Detecting Network Anomalies: A Novel Bayesian Framework Solution

This paper investigates several machine learning techniques to detect network anomalies. Various anomalies present different behaviors in wireless networks, and not all anomalies are known to the network. Unsupervised algorithms are therefore desirable to automatically characterize the nature of traffic behavior and to detect anomalies as departures from normal behavior. Essentially all anomaly detection techniques learn a model of the normal patterns in a training data set, and then determine the anomaly score of a given test data point based on its deviation from the learned patterns. Multi-cluster-based analyses are valuable because they can provide insight into human behaviors and learn similar patterns in temporal traffic data. This paper leverages co-occurrence data that combine traffic data with the entities generating them, and then compares a Gaussian probabilistic latent semantic analysis (GPLSA) model to a Gaussian mixture model (GMM) on temporal network data. A novel quantitative “Donut” algorithm for anomaly detection based on the model log-likelihood is also proposed in this paper.
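
The sketch below illustrates the generic log-likelihood-based scoring step shared by such models, using scikit-learn's Gaussian mixture as a stand-in; the GPLSA model and the “Donut” thresholding rule of the paper are not reproduced here, and the feature construction and number of components are hypothetical.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_anomaly_scores(train, test, n_components=5, seed=0):
        """Fit a Gaussian mixture to normal traffic and score test points by their
        negative log-likelihood under the learned model (larger = more anomalous)."""
        gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(train)
        return -gmm.score_samples(test)

    # Test points whose score exceeds a high quantile of the training scores can
    # then be flagged as candidate anomalies.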

 

Can Le, University of Michigan

Estimating the number of communities in networks by spectral methods

Community detection is a fundamental problem in network analysis with many methods available to estimate communities. Most of these methods assume that the number of communities is known, which is often not the case in practice. We propose a simple and very fast method for estimating the number of communities based on the spectral properties of certain graph operators, such as the non-backtracking matrix and the Bethe Hessian matrix. We show that the method performs well under several models and a wide range of parameters, and is guaranteed to be consistent under several asymptotic regimes. We compare the new method to several existing methods for estimating the number of communities and show that it is both more accurate and more computationally efficient.
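
A minimal sketch of one spectral estimator of this type, counting the negative eigenvalues of a Bethe Hessian built from the adjacency matrix; the specific choice of the scale parameter r and the operators analyzed in this work may differ, so this is only a generic illustration.

    import numpy as np

    def estimate_num_communities(A):
        """Estimate K as the number of negative eigenvalues of the Bethe Hessian
        H(r) = (r^2 - 1) I - r A + D built from the adjacency matrix A."""
        degrees = A.sum(axis=1)
        r = np.sqrt(degrees.mean())          # one common choice of the scale parameter
        H = (r**2 - 1) * np.eye(A.shape[0]) - r * A + np.diag(degrees)
        return int((np.linalg.eigvalsh(H) < 0).sum())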

 

Huitian Lei, University of Michigan

An Actor-Critic Contextual Bandit Algorithm for Personalized Interventions using Mobile Devices

Increasing technological sophistication and widespread use of smartphones and wearable devices provide opportunities for innovative health interventions. In particular, their ability to collect a large amount of person-level information and their accessibility almost anywhere, anytime make mobile devices a great platform for delivering highly personalized interventions. An Adaptive Intervention (AI) personalizes the type, mode and dose of intervention based on users’ ongoing performance and changing needs. A Just-In-Time Adaptive Intervention (JITAI) employs the real-time data collection and communication capabilities that modern mobile devices provide to adapt and deliver interventions in real time. Despite the increasing popularity of JITAIs among clinical and behavioral scientists, the lack of methodological guidance for constructing data-based, high-quality JITAIs remains a hurdle in advancing JITAI research. In this article, we make a first attempt to bridge this methodological gap by formulating the task of tailoring interventions in real time as a contextual bandit problem. Interpretability concerns in the domain of mobile health lead us to formulate the problem differently from existing formulations intended for web applications such as ad or news article placement. Under the assumption of a linear reward function, we choose the reward function (the “critic”) parameterization separately from a lower-dimensional parameterization of stochastic policies (the “actor”). We provide an online actor-critic algorithm that guides the construction and refinement of a JITAI. Asymptotic properties of the actor-critic algorithm, including consistency, rate of convergence and asymptotic confidence intervals for the reward and JITAI parameters, are developed and verified by numerical experiments. To the best of our knowledge, our work is the first application of the actor-critic architecture to contextual bandit problems.

 

Tianxi Li, University of Michigan

Network model selection by edge cross-validation

Networks are one of the most important ways to describe relationships, and various statistical models have been proposed for networks over the past two decades. However, model selection remains difficult due to the lack of general strategies. The cross-validation popular in classical settings is not directly applicable to network problems due to the notorious difficulty of network sampling. We propose an edge cross-validation (ECV) strategy for network model selection and parameter tuning. We provide an error bound for the cross-validation estimate and prove the consistency of the procedure when selecting the number of communities under the stochastic block model. Empirically, we show that our proposal gives better performance than the recently proposed network cross-validation (NCV) in all settings where NCV is applicable, and we also demonstrate that ECV is more general by showing its effectiveness in tuning regularized spectral clustering.
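
A single-split sketch of the edge-holdout idea: hide a random set of node pairs, reconstruct the adjacency matrix with a rank-K truncated SVD, and score each candidate K on the held-out pairs. The actual ECV procedure uses a different matrix-completion step and repeated splits; the crude imputation and split fraction below are simplifying assumptions.

    import numpy as np

    def ecv_choose_rank(A, ranks, holdout=0.1, seed=0):
        """Hold out a random symmetric set of node pairs, impute them crudely, fit a
        rank-K truncated SVD, and pick the K with the smallest held-out squared error."""
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        mask = np.triu(rng.random((n, n)) < holdout, 1)
        mask = mask | mask.T                              # symmetric held-out set
        A_train = np.where(mask, A[~mask].mean(), A)      # fill held-out entries
        U, s, Vt = np.linalg.svd(A_train)
        errors = {}
        for k in ranks:
            A_hat = (U[:, :k] * s[:k]) @ Vt[:k]
            errors[k] = np.mean((A[mask] - A_hat[mask]) ** 2)
        return min(errors, key=errors.get), errors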

 

Peng Liao, University of Michigan

A Batch Off-Policy Actor-Critic Algorithm for Constructing an Optimal Policy

Ideally, mobile devices can be used to provide treatment/support whenever needed and adapted to the context of the user. Treatment policies (a.k.a. just-in-time adaptive interventions) are composed of decision rules that use context (the user’s behaviors, location, current time, social activity, stress, urges to smoke, etc.) to select a treatment that is delivered to the user via a mobile device. Possible treatments include behavioral, cognitive, social, motivational and self-monitoring messages. We present a data analysis method for constructing the decision rules underlying a just-in-time adaptive intervention. This data analysis method, called a batch off-policy actor-critic algorithm, uses data on multiple individuals to construct a just-in-time adaptive intervention with the aim of optimizing long-term average outcomes, such as minimizing average stress and urges to smoke.

 

Ashwini Maurya, Michigan State University

A Well-Conditioned and Sparse Estimate of Covariance and Inverse Covariance Matrix Using Joint Penalty (JPEN)

We develop a method for estimating a well-conditioned and sparse covariance and inverse covariance matrix from a sample of vectors drawn from a sub-Gaussian distribution in a high-dimensional setting. The proposed estimator is obtained by minimizing a squared loss function plus a joint penalty consisting of an L1 norm penalty and a sum-of-squared-deviations penalty on the eigenvalues. The joint penalty plays two important roles: (i) the L1 penalty on each entry of the covariance matrix reduces the effective number of parameters, so the estimate is sparse, and (ii) the sum-of-squared-deviations penalty on the eigenvalues controls the over-dispersion in the eigenvalues of the sample covariance matrix. In contrast to some existing methods for covariance and inverse covariance matrix estimation, where often the interest is only in estimating a sparse matrix, the proposed method is flexible in estimating a covariance matrix that is both sparse and well-conditioned. We establish the theoretical consistency of the proposed estimators in both the Frobenius and operator norms. The proposed algorithm is very fast and efficient. We compare the performance of the proposed estimator to some existing methods on simulated data as well as on colon tumor data.
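
One plausible way to write the objective described above (the exact formulation in the work, e.g. which entries the L1 penalty covers and how the eigenvalue penalty is centered, may differ) is $\hat{\Sigma} = \arg\min_{\Sigma} \|\Sigma - S\|_F^2 + \lambda \sum_{i \neq j} |\sigma_{ij}| + \gamma \sum_{k=1}^{p} (\lambda_k(\Sigma) - \bar{\lambda})^2$, where $S$ is the sample covariance matrix, $\sigma_{ij}$ are the entries of $\Sigma$, $\lambda_k(\Sigma)$ are its eigenvalues with mean $\bar{\lambda}$, and $(\lambda, \gamma)$ are the two tuning parameters of the joint penalty.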

 

Kevin Moon, University of Michigan

Meta learning of bounds on the Bayes classifier error

Meta learning uses information from base learners (e.g. classifiers or estimators) as well as information about the learning problem to improve upon the performance of a single base learner. For example, the Bayes error rate of a given feature space, if known, can be used to aid in choosing a classifier, as well as in feature selection and model selection for the base classifiers and the meta classifier. Recent work in the field of f-divergence functional estimation has led to the development of simple and rapidly converging estimators that can be used to estimate various bounds on the Bayes error. We estimate multiple bounds on the Bayes error using an estimator that applies meta learning to slowly converging plug-in estimators to obtain the parametric convergence rate. We compare the estimated bounds empirically on simulated data and then estimate the tighter bounds on features extracted from an image patch analysis of sunspot continuum and magnetogram images.

 

Siddhartha Nandy, Michigan State University

Low rank kriging for large spatial data sets

Providing the best linear unbiased predictor (BLUP) is always a challenge for non-repetitive, irregularly spaced spatial data. The estimation process, as well as spatial prediction, involves inverting an $n\times n$ covariance matrix, whose computation typically requires $O(n^3)$ operations. Studies have shown that often the complete covariance matrix ($\Xi$) can be additively decomposed into two matrices: one from the measurement error modeled as white noise ($V$) and the other from the observed process, which can be nonstationary ($\Sigma$). If nonstationarity is needed, $\Sigma$ is often assumed to be low rank. The fixed rank kriging (FRK) method developed in Cressie and Johannesson (2008) uses the benefit of a smaller rank to reduce the computation time to $O(nr^2)$, where $r$ is the rank of $\Sigma$. In this work, we consider a Cholesky decomposition for FRK and use a group-wise penalized likelihood where each row of the lower triangular matrix is penalized. More precisely, we present a two-step approach using a group LASSO type shrinkage estimation technique for estimating the rank of the covariance matrix and finally the matrix itself. We implement a block coordinate descent algorithm to obtain the local optimizer of our likelihood problem. We investigate our findings in a simulation study and finally apply the method to rainfall data from Colorado, US.

 

Rodrigue Ngueyep Tzoumpe, IBM T.J. Watson Research

High-Dimensional Multivariate Additive Regression for Uncovering Contributing Factors to Health Care Expenditure

Many studies in health services research rely on regression models for multivariate responses with a large number of covariates or predictors. In this paper, we introduce novel methodology to estimate and perform model selection for high-dimensional non-parametric multivariate regression problems, with application to many healthcare studies. We particularly focus on multi-response or multi-task regression models. Because of the complexity of the dependence between the predictors and the multiple responses, we exploit model selection approaches that consider various levels of grouping between and within responses. The novelty of the method lies in its ability to account simultaneously for between- and within-group sparsity in the presence of nonlinear effects. We also propose a new set of algorithms that can identify inactive and active predictors that are common to all responses or to a subset of responses. Our modeling approach is applied to uncover factors that impact healthcare expenditure for children insured through the Medicaid benefits program. We compare our findings on the association between healthcare expenditure and a large number of well-cited factors on data for two neighboring states, Georgia and North Carolina, which have similar demographics but different Medicaid systems. We also validate our methods with a benchmark cancer dataset and simulated data examples.

 

Dao Nguyen, University of Michigan

Second-order Iterated Smoothing

Simulation-based inference is an attractive class of methods but suffers from several major drawbacks, including computational expense and slow convergence rates. This paper introduces a new simulation-based algorithm, second-order iterated smoothing, to address these limitations. A key strength of this approach is an efficient estimate of the observed information matrix, resulting in theoretical soundness and practical feasibility. Another key point is that we generalize the random walk noise, making the exploration of the likelihood surface more computationally efficient. On a toy example and on a real data set, we show that our proposed approach, given the same computational resources, outperforms the other approaches in terms of empirical performance.

 

Karen Nielsen, University of Michigan

Regression Spline Mixed Models for Analyzing EEG Data and Event-Related Potentials

Analysis of EEG data tends to be a nuanced, subjective process. For example, filtering is common, primarily to reduce noise, but a wide variety of filters are available with only heuristic (not theoretical) recommendations for use. This work focuses on Event-Related Potentials (ERPs), which generally involve waveforms with only one or a few oscillations. Since EEG data consist of highly correlated multi-channel readings, an ideal modeling approach should make use of this structure. Here, we will show how Regression Spline Mixed Models (RSMMs) can combine the features of splines with a hierarchical framework to explore EEG data at any of the many levels that are collected and are of interest to researchers.
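
As one plausible instance of such a model, the sketch below fits a B-spline ERP waveform with a subject-level random effect using statsmodels; the column names, spline degrees of freedom, and grouping structure are hypothetical, and the actual RSMM specification in this work may include additional levels such as channel and trial.

    import statsmodels.formula.api as smf

    # df is assumed to be a long-format DataFrame with hypothetical columns:
    # 'amplitude' (EEG amplitude), 'time' (time point within the epoch),
    # 'channel' (electrode) and 'subject' (participant id).
    def fit_erp_spline_mixed_model(df):
        """Fit a regression spline mixed model: a B-spline curve over time for the
        ERP waveform, adjusted for channel, with a random intercept per subject."""
        model = smf.mixedlm("amplitude ~ bs(time, df=8) + C(channel)",
                            data=df, groups=df["subject"])
        return model.fit()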

 

Chenhui Shao,  University of Michigan

Spatial and Temporal Modeling for Surface Variation Monitoring

New 3D sensing technologies have made it possible to generate large volumes of surface data from many manufacturing processes, such as ultrasonic metal welding and high precision machining. Successive measurement of the surfaces over time forms complex and spatiotemporally correlated data. The availability of such data provides new opportunities for manufacturing process monitoring and control. However, high-resolution 3D sensing systems are costly and time-consuming, and systematic measurement strategies are desirable to improve the measurement efficiency. Moreover, systematic modeling approaches are lacking to incorporate multi-source information with spatial measurements.

In this research, we first develop a novel dynamic sampling design algorithm to characterize time-variant surfaces in manufacturing, which fall into the category of spatiotemporal processes. A state-space model is adopted to describe the spatiotemporal data, and a Kalman filter is used to predictively determine the measurement locations based on a function of the estimation error variance. The determination of the measurement locations is formulated as a binary integer programming problem, and a genetic algorithm (GA) is applied to search for the optimal design. Additionally, test statistics based on the efficient score vector are utilized to monitor and update the temporal transition parameter.

Furthermore, an improved surface modeling approach is developed to cost-effectively model and monitor spatial variations by integrating an engineering model with multi-task Gaussian process (GP) learning. The proposed model decomposes the surface variation into a global trend, which is induced by process physics, and a local variation part modeled by a zero-mean GP. The local variation parts from multiple similar-but-not-identical processes are jointly estimated using multi-task GP learning. An iterative multi-task Gaussian process learning algorithm is developed to learn the model parameters. Compared with conventional multi-task learning, the proposed method has advantages in incorporating engineering knowledge and in differentiating the process parameters that can be learned jointly from those that can only be learned from the process of interest.

 

Jin Wang, University of Illinois at Urbana-Champaign

A Scalable Algorithm for Bayesian Variable Selection (SAB)

In this paper, we propose a new computational framework for Bayesian variable selection. The key idea is to seek an approximation of the posterior distribution via an optimization procedure. The classical EM algorithm that gives the MAP estimate and the variational Bayes (VB) algorithm are special cases of this framework. A main feature of our algorithm (SAB) is its scalability. It is very fast in handling high-dimensional data with large p and small n. We show that SAB achieves asymptotic consistency, despite being an approximation procedure. Numerically, we evaluate its performance in finite sample settings when the number of variables varies from small to large.


Tianshuang Wu, University of Michigan

Identifying a Set that Contains the Best Dynamic Treatment Regimes

A dynamic treatment regime (DTR) is a treatment design that seeks to accommodate patient heterogeneity in response to treatment. DTRs can be operationalized by a sequence of decision rules that map patient information to treatment options at specific decision points. The Sequential Multiple Assignment Randomized Trial (SMART) is a trial design that was developed specifically for the purpose of obtaining data that informs the construction of good (i.e., efficacious) decision rules. One of the scientific questions motivating a SMART concerns the comparison of the multiple DTRs that are embedded in the design. Typical approaches for identifying the best DTRs involve all possible comparisons between the DTRs embedded in a SMART, at the cost of greatly reduced power as the number of embedded DTRs increases. Here, we propose a method that enables investigators to use SMART study data more efficiently to identify the set that contains the most efficacious embedded DTRs. Our method ensures that the true best embedded DTRs are included in this set with at least a given probability. The Extending Treatment Effectiveness of Naltrexone SMART study data are analyzed to illustrate the method's application.

 

Yun-Jhong Wu, University of Michigan

Learning network dynamics via regularized tensor decomposition

Networks in the real world often evolve over time, and dependency among nodes in networks is usually observed only at certain specific time points. Here we consider network data with time-stamped edges, which can be represented by tensors. We propose an approach to learn network dynamics via canonical tensor decomposition with a smoothness regularization in the time dimension. This model can be fitted by a tensor power iteration algorithm, which is efficient particularly for sparse networks. We apply the method to several datasets to demonstrate its usefulness in characterizing network dynamics.

 

Mikhail Yurochkin, University of Michigan

Geometric Topic Modeling

Topic Modelling is a class of exploratory methods applied to text, image, audio or video data. The goal is to find latent topics, which are distributions over the vocabulary, to summarize and understand huge collections of data.  We focus our attention on one of the most popular topic models, called Latent Dirichlet Allocation (Blei et al., 2003). We consider the geometry of topics and documents imposed by the data generating process of LDA and propose a new algorithm to estimate topics based on it. Moreover, our method allows for easier and often better estimation of document topic proportions for both training and testing data. We demonstrate via simulations that our algorithm is faster, while being either better or comparable to commonly used existing algorithms. Theoretical reasoning for the method is provided.

 

Yaser Zerehsaz, University of Michigan

Surface Defect Detection Using Robust Generalized Singular Value Decomposition

The advent of advanced imaging technologies creates a unique opportunity for garnering a substantial amount of information about a process. Using image processing techniques, such as edge detection and decomposition methods, practitioners are able to extract important features from the images in order to form a basis for process monitoring and defect detection. In practice, the images usually contain a certain amount of noise covering the signal in the image. The noise components in these images sometimes exhibit spatial or spatiotemporal correlation that, if neglected, degrades the performance of existing methodologies. Furthermore, under some environmental conditions, such as high temperature, the images taken from the process might be abnormal, with poor visibility or an excessive amount of noise. The existence of such outliers affects the performance of current techniques in extracting the critical features used for process monitoring. In this paper, a robust generalized singular value decomposition (RGSVD) is proposed to extract features from the images when the noise components are correlated and some of the images are contaminated. The extracted features can then be used for monitoring a process in order to detect defects in the images. The performance of the proposed RGSVD method is evaluated using a simulation study. In addition, the RGSVD method is applied to a real dataset from a rolling-bar process to detect defects on the rolling bars.

 

Yaohua Zhang, University of Connecticut

Statistical Assessment of Pedestrian Safety

Pedestrian safety is a topic of serious concern in the United States, as it is in many countries. On average, a pedestrian is killed every two hours and one is injured every seven minutes in traffic crashes in the US (NHTSA, 2015). Providing a safe environment for pedestrians continues to remain a major concern for traffic professionals. When traffic engineers talk about the quality of roads related to safety, they discuss the possibility of using the pedestrian-vehicle interaction, or conflict, as a proxy for a crash. It is important to study whether knowledge of the incidence, frequency or severity of conflicts at a particular location (intersection or road segment) gives us information about the safety characteristics of that location with respect to crash incidence, frequency or severity. To study this, we investigate whether there is significant association between conflicts and crashes, and whether this association varies with location and with time period. One traditional way to answer this question might be via regression models in which we assess whether or not the number and/or severity of conflicts is a useful predictor of the number of crashes. We investigate an alternate semi-parametric approach for quantifying the statistical association between crash and conflict counts, and for rank-ordering the locations with data on conflicts and crashes by decreasing magnitude of the association. For instance, due to certain road characteristics at some locations, the relationship between conflicts and crashes may be strong, while at other locations with different road characteristics, the relationship may be weak. Identifying the subgroup of locations within which a strong association exists can help us obtain a better assessment of the quality of safety on our roads. This approach is also relevant to more general data mining settings, where there is often a need to identify relationships between any pair of variables when the relationship may hold only over an unspecified sub-population that is represented by a sub-sample of the available data.

 

Yuan Zhang, University of Michigan

Nonparametric graphon estimation using neighborhood smoothing

The problem of estimating network probability matrices under general network structures, usually named \emph{graphon estimation}, has drawn increasing attention recently. Most existing methods use step-function approximations to approach the general structure, but these usually require combinatorial optimization and are thus computationally infeasible. In this paper, we propose a novel neighborhood smoothing estimator that is computationally efficient and has a competitive mean-squared error rate. Our method, applied to the problem of predicting missing edges in networks, shows outstanding performance compared to a wide spectrum of popular benchmarks.
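
A simplified sketch of the neighborhood-smoothing idea: measure how similar two nodes' connectivity patterns are through the squared adjacency matrix, average each node's adjacency row over its most similar peers, and symmetrize. The dissimilarity, quantile level, and handling of self-terms below are simplifications, not the exact estimator.

    import numpy as np

    def neighborhood_smoothing(A):
        """Estimate the edge probability matrix by averaging each node's adjacency
        row over a neighborhood of nodes with similar connectivity, then symmetrizing."""
        n = A.shape[0]
        h = np.sqrt(np.log(n) / n)                  # neighborhood quantile level
        A2 = A @ A / n
        # Dissimilarity between nodes i and j, screened through all third nodes k
        D = np.abs(A2[:, None, :] - A2[None, :, :]).max(axis=2)
        P_hat = np.zeros((n, n))
        for i in range(n):
            cutoff = np.quantile(np.delete(D[i], i), h)
            neighbors = np.where(D[i] <= cutoff)[0]
            P_hat[i] = A[neighbors].mean(axis=0)    # smooth row i over its neighborhood
        return (P_hat + P_hat.T) / 2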

 

Yiwei Zhang, University of Michigan

Spectral Regularization Algorithms for Learning Corrupted Low-Rank Matrices

Robust principal component analysis has attracted much attention in recent years. It is assumed that the data matrix is the superposition of a low-rank component and a sparse component, in which case a few data points contain corrupted elements at arbitrary positions. Recent work has also considered another situation in which entire data points are completely corrupted, leading to a row-sparse component. To recover the low-rank matrix and the element-wise or row-sparse matrix, convex-optimization-based algorithms are generally applied. However, there is a lack of systematic algorithms for the estimation, especially in the scenario with perturbations. We propose several spectral regularization algorithms for denoising corrupted low-rank matrices. Compared with existing algorithms, the proposed algorithms are easy to implement and have lower computational complexity. Convergence properties of the proposed algorithms have also been developed under certain conditions. Numerical results support the applicability of the proposed algorithms in practice.
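
For reference, a generic alternating scheme of the kind this line of work builds on, combining singular value thresholding (the spectral shrinkage step) with entrywise soft-thresholding to split a matrix into low-rank and sparse parts; the thresholds, number of iterations, and the overall loop are illustrative and are not the proposed algorithms themselves.

    import numpy as np

    def svt(M, tau):
        """Singular value thresholding: shrink the singular values of M by tau."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    def soft_threshold(M, tau):
        """Entrywise soft-thresholding (proximal operator of the l1 norm)."""
        return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

    def low_rank_plus_sparse(X, lam=None, n_iter=100):
        """Alternate the two shrinkage steps to split X into a low-rank part L and
        a sparse part S (a generic proximal scheme for illustration only)."""
        lam = lam if lam is not None else 1.0 / np.sqrt(max(X.shape))
        L = np.zeros_like(X, dtype=float)
        S = np.zeros_like(X, dtype=float)
        for _ in range(n_iter):
            L = svt(X - S, 1.0)
            S = soft_threshold(X - L, lam)
        return L, S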

 

Ruofei Zhao, University of Michigan

Exploratory Tools for Analyzing and Characterizing Driving Behavior

There is considerable interest in characterizing similarities and differences in the behavior of drivers, with a primary motivation being safety considerations. Although naturalistic driving studies collect continuous, high-resolution measurements of driver speed and range (distance between adjacent vehicles), methods to analyze such data are underdeveloped. This project examines several exploratory tools for characterizing similarities (and differences) among drivers based on data on speed and range to a leading vehicle. We propose a novel method based on the earth mover’s distance (EMD), also called Mallows’ distance or the Wasserstein metric, to characterize the level of similarity between drivers in the study. The performance of the EMD is studied and compared to several other metrics. Our findings show that the EMD has a number of good properties. For instance, the resulting distance structure is robust to initial parameter settings and can be better approximated by a two-dimensional Euclidean distance. A few examples are used to illustrate its properties and advantages.
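
As a concrete illustration, the sketch below compares two drivers via one-dimensional Wasserstein (earth mover's) distances on their speed and range samples and builds a pairwise distance matrix; combining the two marginals additively is an assumption made here for simplicity, not necessarily how the EMD is applied in this project.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def driver_distance(speed_a, speed_b, range_a, range_b):
        """Dissimilarity between two drivers: 1-D earth mover's (Wasserstein)
        distances between their speed and range samples, added together."""
        return (wasserstein_distance(speed_a, speed_b)
                + wasserstein_distance(range_a, range_b))

    def pairwise_driver_distances(drivers):
        # 'drivers' is a list of (speed_samples, range_samples) pairs, one per driver.
        n = len(drivers)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = driver_distance(drivers[i][0], drivers[j][0],
                                                    drivers[i][1], drivers[j][1])
        return D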

 

Lang Zhou, University of New Mexico

Neyman Smooth-type Goodness of Fit Tests in Complex Surveys

Data generated from cluster sampling follow a weighted multinomial distribution. In our study, we extend several Neyman smooth-type tests (Eubank, 1997) to complex survey data by incorporating consistent estimators under the cluster sampling design, which is accomplished by a data-driven order selection method. Simulation results show that the proposed methods potentially improve statistical power while controlling the type I error very well compared to commonly used test procedures, especially for cases with slowly varying probabilities. The study of the asymptotic distributions of the test statistics is also in progress.

 

Feiyun Zhu, University of Michigan

10,000+ Times Accelerated Robust Subset Selection

Subset selection from massive data with noised information is increasingly popular for various applications. This problem is still highly challenging as current methods are generally slow in speed and sensitive to outliers. To address the above two issues, we propose an accelerated robust subset selection (ARSS) method. Specifically in the subset selection area, this is the first attempt to employ the $\ell_{p}(0<p\leq1)-norm$ based measure for the representation loss, preventing large errors from dominating our objective. As a result, the robustness against outlier elements is greatly enhanced. Actually, data size is generally much larger than feature length, i.e. $N\gg L$. Based on this observation, we propose a speedup solver (via ALM and equivalent derivations) to highly reduce the computational cost, theoretically from $O(N^{4})$ to $O(N{}^{2}L)$. Extensive experiments on ten benchmark datasets verify that our method not only outperforms state of the art methods, but also runs 10,000+ times faster than the most related method.