Invited talk abstracts

Friday October 2

9:00-9:30am Karen Kafadar, University of Virginia  
Contributions to Industrial Statistics and their Impact on Medical Screening

Vijay Nair made numerous contributions to Industrial Statistics through his publications, his work at Bell Laboratories, his collaborations with industrial engineers, and his service as Editor of and author for Technometrics. I will mention some of the highlights in Technometrics during his Editorship (1989-1991) and discuss one of his articles (Nair and Wang 1989) that considered the impact of length-biased sampling (LBS) in oil well discovery. LBS also arises in the evaluation of medical screening tests, so I will explain the connection and illustrate its importance for evaluating common screening procedures such as mammography (female breast cancer) and PSA testing (male prostate cancer).

The work on medical screening is joint with Dr. Philip C. Prorok, National Cancer Institute.
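
For orientation, length-biased sampling has a standard textbook formulation: if an item is sampled with probability proportional to its length (or duration), the observed density is a size-weighted version of the population density. The display below gives this generic formulation only, not the specific model of Nair and Wang (1989).

```latex
% Length-biased sampling: generic textbook formulation (illustrative only,
% not the specific model of Nair and Wang 1989).
% X has population density f(x) with mean mu; an item is observed with
% probability proportional to its length x, so the observed density is
\[
  f^{*}(x) \;=\; \frac{x\, f(x)}{\mu}, \qquad \mu = \int x f(x)\, dx .
\]
% In screening, cancers with long preclinical sojourn times are similarly
% over-represented among screen-detected cases, which biases naive
% comparisons of screen-detected and symptom-detected disease.
```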

9:30-10:00am William Cleveland, Purdue University 
Divide & Recombine with Tessera: High Performance Computing for Data Analysis

The widely used term “big data” carries with it a notion of computational performance for the analysis of big datasets. But for data analysis, computational performance depends heavily not just on size but also on the computational complexity of the analytic routines used in the analysis. Data small in size can be a big challenge, too. Furthermore, the hardware power available to the data analyst is an important factor. High performance computing for data analysis can be provided across wide ranges of dataset size, computational complexity, and hardware power by the divide & recombine (D&R) statistical approach and by Tessera, the D&R software implementation that makes programming D&R easy.
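
For illustration only, the D&R idea can be sketched in a few lines: divide the data into subsets, apply an analytic method independently to each subset, and recombine the per-subset results, for example by averaging fitted coefficients. Tessera itself is an R-based software stack; the Python below is a generic stand-in, not its API.

```python
# Minimal divide & recombine (D&R) sketch; illustrative only.
# Tessera is an R software stack, so nothing here is its actual API.
import numpy as np

def divide(X, y, n_subsets, rng):
    """Split rows into roughly equal random subsets (the 'divide' step)."""
    idx = np.array_split(rng.permutation(len(y)), n_subsets)
    return [(X[i], y[i]) for i in idx]

def fit_ols(X, y):
    """Analytic method applied independently to each subset."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def recombine(coef_list):
    """Recombine per-subset results, here by simple averaging."""
    return np.mean(coef_list, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=10_000)
subsets = divide(X, y, n_subsets=8, rng=rng)
beta_dr = recombine([fit_ols(Xs, ys) for Xs, ys in subsets])
print(beta_dr)   # close to [1.0, -2.0, 0.5]
```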

10:00-10:30am  Trevor Hastie, Stanford University
GAM selection via convex optimization

While smoothing and additive models were all the rage in the 80s and 90s, convex optimization is one of the present-day tools of choice: the lasso and its relatives induce sparsity in models. In this talk we describe a family of penalties that induces the right kind of sparsity in generalized additive models: from zero, to linear, to nonlinear.

Joint work with Alexandra Chouldechova.
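
One plausible way to write down such a penalty (a generic illustration; it is not claimed to be the exact penalty of Chouldechova and Hastie) splits each additive component into a linear term and a nonlinear basis expansion and penalizes the two parts separately, so that each component can be shrunk to zero, to purely linear, or left nonlinear.

```latex
% A generic sparsity-inducing GAM penalty (assumed form, for illustration).
% Each component: f_j(x_j) = alpha_j x_j + sum_k beta_{jk} b_{jk}(x_j).
\[
  \min_{\alpha,\,\beta}\;
  \ell\!\Big(y,\; \beta_0 + \sum_{j=1}^{p} f_j(x_j)\Big)
  \;+\; \lambda \sum_{j=1}^{p}
  \Big( \gamma\,|\alpha_j| + (1-\gamma)\,\|\beta_j\|_2 \Big)
\]
% alpha_j = 0 and beta_j = 0 : variable j is excluded (zero);
% alpha_j != 0 and beta_j = 0 : a purely linear term;
% beta_j != 0                 : a nonlinear (smooth) term.
```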

11:00-11:30am  Jeff Wu, Georgia Tech
CME Analysis: a New Method for Unraveling Aliased Effects in Two-Level Fractional Factorial Experiments

Ever since the founding work by Finney, it has been widely known and accepted that aliased effects in two-level regular designs cannot be “de-aliased” without adding more runs. A surprising result by Wu in his 2011 Fisher Lecture showed that aliased effects can sometimes be “de-aliased” using a new framework based on the concept of conditional main effects (CMEs). This idea is further developed here into a methodology that can be readily used. Some key properties are derived that govern the relationships among CMEs and between them and related effects. As a consequence, rules for data analysis are developed, and based on these rules a new CME-based methodology is proposed. Three real examples are used to illustrate the methodology. CME analysis can offer a substantial increase in the R-squared value with fewer effects in the chosen models. Moreover, the selected CME effects are often more interpretable. (Joint work with Heng Su.)
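
To fix ideas, the conditional main effect of a factor A given that factor B is held at its high level is the usual main effect of A computed within that slice of the design; the identity below (standard two-level-design notation) is what lets a CME substitute for an aliased pair of effects.

```latex
% Conditional main effects (CMEs) in a two-level design.
% \bar{y}(A+ | B+) is the average response over runs with A high and B high.
\[
  \mathrm{ME}(A) = \bar{y}(A+) - \bar{y}(A-), \qquad
  \mathrm{CME}(A \mid B+) = \bar{y}(A+ \mid B+) - \bar{y}(A- \mid B+),
\]
\[
  \mathrm{CME}(A \mid B+) = \mathrm{ME}(A) + \mathrm{INT}(A,B),
\]
% where INT(A,B) is the usual two-factor interaction effect.
```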

11:30-12:00pm  William Meeker, Iowa State University 
Estimating a Parametric Component Lifetime Distribution from a Collection of Superimposed Renewal Processes

Maintenance data can be used to make inferences about the lifetime distribution of system components. Typically a fleet contains multiple systems. Within each system there is a set of nominally identical replaceable components of particular interest (e.g., two automobile headlights, eight DIMM modules in a computing server, sixteen cylinders in a locomotive engine). For each component replacement event, there is system-level information that a component was replaced, but not information on which particular component was replaced. Thus the observed data is a collection of superpositions of renewal processes (SRP), one for each system in the fleet. This paper proposes a procedure for estimating the component lifetime distribution using the aggregated event data from a fleet of systems. We show how to compute the likelihood function for the collection of SRPs and provide suggestions for efficient computations. We compare performance of this incomplete-data ML estimator with the complete-data ML estimator and study the performance of confidence interval methods for estimating quantiles of the lifetime distribution of the component.
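
To make the data structure concrete, here is a small simulation sketch (an illustration with assumed Weibull component lifetimes, not the paper's estimation code): each system carries several sockets, each socket generates its own renewal process, and the recorded data keep only the pooled system-level replacement times.

```python
# Simulating a superposition of renewal processes (SRP); illustrative only,
# with assumed Weibull component lifetimes. Not the paper's ML code.
import numpy as np

def simulate_srp(n_sockets, shape, scale, t_end, rng):
    """One system: n_sockets independent renewal processes, observed only as
    the pooled (superimposed) sequence of replacement times up to t_end."""
    events = []
    for _ in range(n_sockets):
        t = 0.0
        while True:
            t += scale * rng.weibull(shape)   # next component lifetime
            if t > t_end:
                break
            events.append(t)                  # socket identity is NOT recorded
    return np.sort(events)

rng = np.random.default_rng(1)
fleet = [simulate_srp(n_sockets=8, shape=2.0, scale=5.0, t_end=20.0, rng=rng)
         for _ in range(30)]                  # a fleet of 30 systems
print(fleet[0])                               # pooled replacement times, system 1
```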

12:00-12:30pm  Nozer Singpurwalla, City University of Hong Kong
The Indicative and Irrealis Moods in Bayesian Inference

Coherence is the declared hallmark of Bayesian inference. By coherence is here meant a strict adherence to the calculus of probability. But to achieve coherence a Bayesian is required to undergo mood swings from the indicative to the irrealis, and back to the indicative. This is because all of probability is in the irrealis (or subjunctive) mood, whereas with data at hand, the Bayesian must operate in the indicative mood. A consequence is that to strive for coherence, the Bayesian leans on a notion that is external to probability, namely the likelihood, and having done so invokes a behavioristic principle called “Bayesian Conditionalization”. The purpose of this talk is to raise awareness of these matters, both of which are implicit in what we know as “turning the Bayesian crank.”

2:00- 2:25pm  Earl Lawrence, Los Alamos National Labs
An In Situ Approach to Partitioning a Complex Simulation

As computer simulations continue to grow in size and complexity, they provide a particularly challenging example of big data. Many application areas are moving toward exascale (i.e. 1,000,000,000,000,000,000 floating-point operations per second). Analyzing these simulations is difficult because their output may exceed both the storage capacity and the bandwidth required for transfer to storage. One approach is to embed some level of analysis in the simulation while the simulation is running, often called in situ analysis. In this talk, I’ll describe a simple method that uses piecewise linear regression for identifying the time steps in a simulation where the behavior changes. We can save just these time steps and the fitted model to greatly reduce storage and I/O requirements, while still capturing the broad behavior of the simulation. We illustrate the method using a massively parallel radiation-hydrodynamics simulation performed by Korycansky et al. (2009) in support of NASA’s 2009 Lunar Crater Observation and Sensing Satellite mission (LCROSS).
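
A minimal sketch of the underlying idea (a simplification for illustration, not the implementation used at LANL): as the simulation produces a scalar summary at each time step, fit a line to the current segment and start a new segment whenever the fit error exceeds a tolerance; only the segment breakpoints and fitted coefficients then need to be stored.

```python
# Greedy piecewise-linear segmentation of a scalar simulation summary;
# a simplified sketch of the in situ idea, not the LANL implementation.
import numpy as np

def segment(times, values, tol):
    """Return a list of (start_index, end_index, slope, intercept)."""
    segments, start, end = [], 0, 2
    while end <= len(times):
        t, v = times[start:end], values[start:end]
        slope, intercept = np.polyfit(t, v, 1)
        if np.max(np.abs(v - (slope * t + intercept))) > tol:
            # close the segment just before the point that broke the fit
            # (a two-point fit is exact, so closed segments have >= 2 points)
            s, i = np.polyfit(times[start:end - 1], values[start:end - 1], 1)
            segments.append((start, end - 1, s, i))
            start, end = end - 1, end + 1
        else:
            end += 1
    if len(times) - start >= 2:
        s, i = np.polyfit(times[start:], values[start:], 1)
    else:
        s, i = 0.0, values[start]
    segments.append((start, len(times), s, i))
    return segments

# Toy usage: a signal whose slope changes at t = 5
t = np.linspace(0.0, 10.0, 200)
y = np.where(t < 5, t, 5.0 - 0.5 * (t - 5)) + 0.01 * np.random.randn(200)
for seg in segment(t, y, tol=0.1):
    print(seg)                                # expect a breakpoint near t = 5
```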

2:25-2:50pm Roshan Vengazhiyil, Georgia Tech
Uncertainty Quantification and Robust Parameter Design in Machining Simulations

Uncertainty quantification and robust parameter design are two topics to which Professor Vijay Nair has made important contributions. In this talk, I will use an industry-sponsored project on machining simulations to illustrate some advancements in these two topics. Key ideas in the talk include an efficient experimentation strategy using maximum projection designs and an in situ emulator methodology for handling functional responses in computer simulations.

2:50-3:15pm Judy Jin, University of Michigan
Separation of Source Signals by Integrating ICA and SCA Methods

The wide deployment of distributed sensing and computing systems has provided unprecedented opportunities for understanding and improving the operation of complex systems. At the same time, it raises research challenges for data analysis and diagnostic inference when the available sensor measurements are mixture responses of multiple embedded operations. In this talk, I will discuss how to separate unmeasurable embedded individual source signals from the mixed sensor measurements in order to enhance system diagnosis. In this research, a new method is proposed that integrates Independent Component Analysis (ICA) and Sparse Component Analysis (SCA) for source signal separation. Going beyond existing ICA methods, the proposed method can estimate not only independent source signals but also dependent source signals, provided they have dominant sparse components in the time domain or in a linear transform domain. The identified source signals then allow direct monitoring of individual sources with explicit diagnostic information. A case study in a multiple-die forging process demonstrates the effectiveness of the proposed method.
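
As a point of reference only, the ICA starting point can be sketched with scikit-learn's FastICA (assuming that package is available); the talk's contribution goes beyond this baseline by also recovering dependent sources that have dominant sparse components.

```python
# Baseline blind source separation with FastICA (scikit-learn); the proposed
# ICA + SCA method extends this to dependent sources with sparse components.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * np.pi * 1.0 * t)                  # embedded source 1
s2 = np.sign(np.sin(2 * np.pi * 0.3 * t))         # embedded source 2
S = np.column_stack([s1, s2]) + 0.02 * rng.normal(size=(2000, 2))

A = np.array([[1.0, 0.5],                         # unknown mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                       # mixture sensor measurements

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                      # estimated source signals
print(S_hat.shape)                                # (2000, 2)
```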

3:15-3:40pm Casey Diekman, New Jersey Institute of Technology
Discovering Functional Neuronal Connectivity from Serial Patterns in Spike Train Data

Repeating patterns of precisely timed activity across a group of neurons (called frequent episodes) are indicative of networks in the underlying neural tissue. In this talk we present statistical methods to determine functional connectivity among neurons based on nonoverlapping occurrences of episodes. We study the distribution of episode counts and develop a two-phase strategy for identifying functional connections. For the first phase, we develop statistical procedures that are used to screen all two-node episodes and identify possible functional connections (edges). For the second phase, we develop additional statistical procedures to prune the two-node episodes and remove false edges that can be attributed to chains or fan-out structures. The restriction to nonoverlapping occurrences makes the counting of all two-node episodes in phase 1 computationally efficient. The second (pruning) phase is critical since phase 1 can yield a large number of false connections. The scalability of the two-phase approach is examined through simulation. The method is then used to reconstruct the graph structure of observed neuronal networks, first from simulated data and then from recordings of cultured cortical neurons.
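
For concreteness, the phase-1 counting step can be sketched as follows (a simplified version with an assumed data format: sorted spike-time arrays and a window length w): scan neuron A's spikes and count nonoverlapping occurrences of "A fires, then B fires within w".

```python
# Counting nonoverlapping occurrences of the two-node episode
# "A fires, then B fires within window w"; a simplified phase-1 sketch.
import numpy as np

def count_two_node_episodes(spikes_a, spikes_b, w):
    """spikes_a, spikes_b: sorted spike-time arrays; w: window length.
    Nonoverlapping: each spike is used in at most one counted occurrence."""
    count, j = 0, 0
    for ta in spikes_a:
        while j < len(spikes_b) and spikes_b[j] <= ta:
            j += 1                 # skip B spikes at or before this A spike
        if j < len(spikes_b) and spikes_b[j] - ta <= w:
            count += 1
            j += 1                 # consume the B spike: no overlapping use
    return count

a = np.array([1.0, 5.0, 9.0, 14.0])
b = np.array([1.5, 5.2, 12.0, 14.3])
print(count_two_node_episodes(a, b, w=1.0))   # -> 3
```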

4:10-4:40pm Eric Laber, North Carolina State University
Online estimation of optimal treatment allocation strategies

Emerging infectious diseases are responsible for a number of environmental, public health, and humanitarian crises across the world. Technological advances have made it possible to collect, curate, and access large amounts of data on the progression of an infectious disease. We derive a framework for using these data, in real time, to inform disease management. This work is motivated by the spread of white-nose syndrome, an emerging infectious disease that is decimating hibernating bat populations in the U.S. and Canada. The economic impacts of this disease are estimated to be several billion dollars per year, and the ecological impacts, including the potential extinction of the Gray and Indiana bat species, are immeasurable. Potential treatments for white-nose syndrome are still under development. When these treatments become field-ready it will be critically important to apply them when and where they will have the biggest impact on the spread of the disease.

We formalize a treatment allocation strategy as a sequence of functions, one per treatment period, that map up-to-date information on the spread of an infectious disease to a subset of locations for treatment. An optimal allocation strategy optimizes some cumulative outcome, e.g., the number of uninfected locations, the geographic footprint of the disease, or the cost of the epidemic. Estimation of an optimal allocation strategy for an emerging infectious disease is challenging because spatial proximity induces interference among locations, the number of possible allocations is exponential in the number of locations, and because disease dynamics and intervention effectiveness are unknown at outbreak. We derive a Bayesian online estimator of the optimal allocation strategy that combines simulation-optimization with Thompson sampling. The proposed estimator performs favorably in simulation experiments and is illustrated using data on the spread of white-nose syndrome.
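
A greatly simplified sketch of the Thompson-sampling component (a toy Beta-Bernoulli version for illustration; the actual estimator couples posterior sampling with simulation-optimization over spatial disease dynamics): at each treatment period, draw one posterior sample of each location's treatment-success probability, treat the locations that look best under that draw, then update the posterior with the observed outcomes.

```python
# Toy Thompson sampling for allocating a limited treatment budget across
# locations; a simplified illustration, not the estimator from the talk.
import numpy as np

rng = np.random.default_rng(0)
n_loc, budget, n_periods = 20, 5, 50
true_p = rng.uniform(0.1, 0.9, size=n_loc)   # unknown P(treatment succeeds)
alpha = np.ones(n_loc)                       # Beta(alpha, beta) posterior
beta = np.ones(n_loc)

for period in range(n_periods):
    theta = rng.beta(alpha, beta)            # one posterior draw per location
    treat = np.argsort(theta)[-budget:]      # allocate budget to best draws
    outcomes = rng.random(budget) < true_p[treat]
    alpha[treat] += outcomes                 # posterior update (successes)
    beta[treat] += ~outcomes                 # posterior update (failures)

print(np.round(alpha / (alpha + beta), 2))   # posterior mean success probs
```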

4:40-5:10pm  Nalini Ravishanker, University of Connecticut
Clustering Sets of Nonlinear and Nonstationary Time Series

Accurate clustering of a number of time series can be a challenging problem for data arising in financial markets, biomedical studies, environmental sciences, etc. This is especially true when the time series exhibit nonstationarity and nonlinearity. Frequency-domain clustering based on second-order spectra has been widely discussed in the literature for linear, stationary time series such as AR, MA, or ARMA processes. Bispectrum-based clustering is effective for stationary, nonlinear time series such as bilinear, EXPAR, SETAR, or GARCH processes. For linear, nonstationary time series, a consistent estimate of the time-varying spectrum has been obtained via the smooth localized complex exponential (SLEX) approach, which enables clustering. In this work, we discuss clustering for time series that exhibit both nonlinearity and nonstationarity. We extend the SLEX approach to estimate time-varying bispectra and construct quasi-distances between them, which enable use of a hierarchical clustering scheme. The performance of the proposed approach is illustrated via a simulation study and an application from the financial sector. This is joint research with Jane Harvill, Baylor University, and Priya Kohli, Connecticut College.
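
The final step is standard once pairwise quasi-distances between the estimated time-varying bispectra are available; a minimal sketch of hierarchical clustering from a precomputed distance matrix, using SciPy, is shown below (the matrix D here is random and merely stands in for the bispectral quasi-distances).

```python
# Hierarchical clustering from a precomputed quasi-distance matrix D; here D
# is random and stands in for distances between time-varying bispectra.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_series = 10
D = rng.random((n_series, n_series))
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)                      # symmetric, zero diagonal

Z = linkage(squareform(D), method="average")  # condensed form -> dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                                 # cluster label for each series
```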

5:10-5:40pm Kwok Leung Tsui, City University of Hong Kong
Evolution of Big Data Analytics

Owing to advances in computational power and in data storage and collection technologies, the field of data modelling and its applications has evolved rapidly over the last two decades, under a succession of buzzwords such as knowledge discovery in databases (KDD), data mining (DM), business analytics, and big data analytics. There are tremendous opportunities in interdisciplinary research and education in data science, system informatics, and big data analytics, as well as in complex systems optimization and management across industries such as finance, healthcare, transportation, and energy. In this talk we will present our views on, and experience with, the evolution of big data analytics, its challenges and opportunities, and its applications in various industries.

Saturday October 3

9:00-9:30am  Jianjun Shi, Georgia Tech
Statistical Methods Driven by Engineering Models for System Performance Improvement

The rapid advances in cyber-infrastructure, ranging from sensor technology and communication networks to high-powered computing, have resulted in temporally and spatially dense, data-rich environments. With massive data readily available, there is a pressing need to develop advanced methodologies and associated tools that will enable and assist (i) the handling of the rich data streams communicated by contemporary complex engineering systems, (ii) the extraction of pertinent knowledge about the environmental and operational dynamics driving these systems, and (iii) the exploitation of the acquired knowledge for enhanced design, analysis, and control of these systems.

Addressing this need is considered very challenging because of a collection of factors, which include the inherent complexity of the physical system itself and its associated hardware, the uncertainty associated with the system’s operation and its environment, the heterogeneity and the high dimensionality of the data communicated by the system, and the increasing expectations and requirements posed by real-time decision-making. It is also recognized that these significant research challenges, combined with the extensive breadth of the target application domains, will require multidisciplinary research and educational efforts.

This presentation will discuss some research challenges, advancements, and opportunities in statistical methods driven by engineering models for system performance improvement. Specific examples will be provided on research activities related to the integration of statistics, engineering knowledge, and control theory in various applications. Real case studies will illustrate the key steps of systems research and problem solving, including (1) identification of the real need and potential in problem formulation; (2) acquisition of a system perspective on the research; (3) development of new methodologies through interdisciplinary methods; and (4) implementation in practice for significant economic and social impact. The presentation will emphasize not only the research achievements themselves but also how they were attained.

9:30-10:00am  Derek Bingham, Simon Fraser University
Prediction Using Outputs From Multi-fidelity Simulators

Computer simulators are often used to augment field observations and increase our understanding of physical processes. Oftentimes, the simulator can be run at different degrees of fidelity, with correspondingly different computational burdens. In this talk, methodology for combining field observations and deterministic simulator output at differing levels of fidelity to build a predictive model for the real process is presented. The approach is illustrated through simple examples, as well as an application in predictive science that Vijay was involved in at the University of Michigan.
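
A common way to write such a model for two fidelity levels, in the spirit of the Kennedy-O'Hagan autoregressive formulation (shown here only as background; the talk's methodology may differ in its details), is:

```latex
% A common two-level multi-fidelity model (background only).
\[
  f_{\mathrm{high}}(x) = \rho\, f_{\mathrm{low}}(x) + \delta_{\mathrm{mf}}(x),
  \qquad
  y_{\mathrm{field}}(x) = f_{\mathrm{high}}(x) + \delta(x) + \varepsilon,
\]
% with Gaussian process priors on the low-fidelity simulator f_low, the
% between-fidelity discrepancy delta_mf, and the simulator-to-reality
% discrepancy delta; prediction of the real process combines runs at both
% fidelities with the field data through the joint posterior.
```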

10:00-10:30am  David Higdon, Virginia Tech    
Connecting Model-Based Predictions to Reality

In the presence of relevant physical observations, one can usually calibrate a computer model and even estimate systematic discrepancies of the model from reality. Estimating and quantifying the uncertainty in this model discrepancy can lead to reliable predictions, so long as the prediction “is similar to” the available physical observations. Exactly how to define “similar” has proven difficult in many applications. Clearly it depends on how well the computational model captures the relevant physics in the system, as well as how portable the model discrepancy is in going from the available physical data to the prediction. This talk will discuss these concepts using computational models ranging from simple to very complex.
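
The setting is usually written as the standard calibration-with-discrepancy decomposition (given here for orientation): observed reality equals the computer model at the best calibration setting, plus a systematic discrepancy, plus observation error.

```latex
% The standard calibration-with-discrepancy decomposition (background only).
\[
  y(x) \;=\; \eta(x, \theta) \;+\; \delta(x) \;+\; \varepsilon ,
\]
% y(x): physical observation at input condition x;
% eta(x, theta): computer model run at calibration parameters theta;
% delta(x): systematic model discrepancy, typically given a GP prior;
% epsilon: observation error.
% Predicting at a new x* is reliable only insofar as the data inform
% delta(x*), which is what "similar to" amounts to above.
```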

11:00-11:30am  Peter Bickel, UC Berkeley  
Identifying Erdos-Renyi nodes in block models 

It can be argued that a block B of vertices i in a block model satisfying P[edge between i and j | i in B and j in C] = r*p, whatever j and C may be, corresponds to pure noise: B is a set of E-R (Erdos-Renyi) vertices according to the classical definition. We study a number of methods for detecting such blocks and apply them to a number of data sets.

11:30-12:00pm  Kjell Doksum, University of Wisconsin, Madison   
Perspectives on small and large data

Some of the statistical methods developed in the last 50 years will be discussed, and ways of extending methods designed for the p less than n case to large data will be suggested.

12:00-12:30pm Jerry Lawless, University of Waterloo  
Big Data and Scientific Inference

Very large administrative and observational databases are increasingly available for research. In this talk I will consider some of the possibilities and challenges in using such data for learning and scientific inference in areas such as medicine, public health, and product performance. Some related methodological issues will be discussed, including measurement error, adjustment for missing data, the use of auxiliary information, consistency checks, and multiphase studies, with emphasis on the combined use of longitudinal cohort studies and administrative data.

2:00-2:25pm  Xiao Wang, Purdue University
Optimal Estimation for the Functional Cox Model

Functional covariates are common in many medical, biodemographic, and neuroimaging studies. The aim of this paper is to study functional Cox models with right-censored data in the presence of both functional and scalar covariates. We study the asymptotic properties of the maximum partial likelihood estimator and establish the asymptotic normality and efficiency of the estimator of the finite-dimensional parameter. Under the framework of reproducing kernel Hilbert spaces, the estimator of the coefficient function for a functional covariate achieves the minimax optimal rate of convergence under a weighted L2 risk. This optimal rate is determined jointly by the censoring scheme, the reproducing kernel, and the covariance kernel of the functional covariates. Implementation of the estimation approach and selection of the smoothing parameter are discussed in detail. The finite-sample performance is illustrated by simulated examples and a real application.
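
The model under study can be written in the usual functional proportional-hazards form (the notation below is generic, for orientation):

```latex
% Functional Cox model with scalar covariates Z and a functional covariate X.
\[
  \lambda\big(t \mid Z, X\big)
  \;=\; \lambda_0(t)\,
  \exp\!\Big( \theta^{\top} Z + \int_{0}^{1} X(s)\,\beta(s)\, ds \Big),
\]
% theta: the finite-dimensional parameter (asymptotically normal, efficient);
% beta(.): the coefficient function, estimated in a reproducing kernel
% Hilbert space and attaining the minimax rate under a weighted L2 risk.
```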

2:25-2:50pm  Adam Rothman, University of Minnesota
Indirect multivariate response linear regression

We propose a new class of estimators of the multivariate response linear regression coefficient matrix that exploits the assumption that the response and predictors have a joint multivariate Normal distribution. This allows us to indirectly estimate the regression coefficient matrix through shrinkage estimation of the parameters of the inverse regression, or the conditional distribution of the predictors given the responses. We establish a convergence rate bound for estimators in our class and we study two examples. The first example estimator exploits an assumption that the inverse regression’s coefficient matrix is sparse. The second example estimator exploits an assumption that the inverse regression’s coefficient matrix is rank deficient. These estimators do not require the popular assumption that the forward regression coefficient matrix is sparse or has small Frobenius norm. Using simulation studies, we show that our example estimators outperform relevant competitors for some data generating models.
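
The identity that makes indirect estimation possible under joint normality can be stated in standard notation (the estimators in the talk plug shrinkage estimates into the right-hand side): the forward coefficient matrix is a function of the inverse regression's coefficient matrix and error covariance.

```latex
% Indirect estimation under joint normality of (X, Y).
% Forward regression:  E[Y|X] has coefficient matrix B = Sigma_xx^{-1} Sigma_xy.
% Inverse regression:  E[X|Y] has coefficient matrix
%   Gamma = Sigma_xy Sigma_yy^{-1}  and error covariance
%   Delta = Sigma_xx - Sigma_xy Sigma_yy^{-1} Sigma_yx.
\[
  B \;=\; \big( \Gamma\, \Sigma_{yy}\, \Gamma^{\top} + \Delta \big)^{-1}
          \Gamma\, \Sigma_{yy},
\]
% so shrinkage estimates of (Gamma, Delta, Sigma_yy), e.g. a sparse or
% rank-deficient Gamma, yield an indirect estimator of B.
```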

2:50-3:15pm  Bowei Xi, Purdue University
Adversarial Data Mining

Many real-world applications face malicious adversaries who actively transform the objects under their control to avoid detection. Data mining techniques are highly useful tools for cyber defense, since they play an important role in distinguishing the legitimate from the destructive. Unfortunately, traditional data mining techniques are insufficient to handle such adversarial problems directly. The adversaries adapt to the data miner's reactions, and data mining algorithms constructed from a training dataset degrade quickly. In this talk we discuss the theory, techniques, and applications of our proposed adversarial data mining framework. We model adversarial data mining applications as a Stackelberg game, with an emphasis on the sequential actions of the adversary and the data miner, allowing both parties to maximize their own utilities.

This is joint work with Dr. Murat Kantarcioglu.
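
A generic way to write the game (schematic only; the talk's formulation specifies the players' utilities for concrete data mining tasks): the data miner, as leader, commits to a model, and the adversary, as follower, best-responds by transforming the objects under its control.

```latex
% Schematic Stackelberg formulation of adversarial data mining.
% Leader (data miner) chooses model parameters w; follower (adversary)
% observes w and transforms its objects via T.
\[
  \max_{w}\; U_{\mathrm{miner}}\big(w,\, T^{*}(w)\big)
  \quad\text{subject to}\quad
  T^{*}(w) \in \arg\max_{T}\; U_{\mathrm{adv}}\big(w,\, T\big).
\]
% The sequential structure (leader moves first, follower best-responds) is
% what distinguishes this from a simultaneous-move game.
```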

3:15-3:40pm  Aijun Zhang, Hong Kong Baptist University 
Big Data Analytics in Online Education

Online educational systems generate large amounts of real-time streaming data, especially since 2012, “the Year of the MOOC.” Recent innovations in big data research can be adopted to develop learning analytics for online education systems. In this talk we will discuss several real examples of online learning analytics based on statistical methods, machine learning, and distributed computing.