My research focuses on the analysis of high-dimensional data, particularly those that arise from the fields of genetics and genomics. As the data generated by experiments in these fields continue to grow in size and complexity, there is an ever greater need for sound statistical procedures that yield scientific insight from large amounts of information. Methodologically, my primary area of research is penalized regression. Specific areas I am working in are listed below.

Inference for penalized regression estimators

Penalized regression is an attractive methodology for dealing with high-dimensional data where classical likelihood approaches to modeling break down. However, its widespread adoption has been hindered by a lack of inferential tools. In particular, penalized regression is very useful for variable selection, but how confident should one be about those selections? How many of those selections would likely have occurred by chance alone? The papers below represent my ongoing work to estimate false discovery rates for penalized regression models.

Nonconvex penalties

Although the lasso has many attractive properties, it also introduces significant bias toward 0 for large regression coefficients. The MCP and SCAD penalties have been proposed as alternatives designed to diminish this bias, and have been shown to have attractive theoretical and empirical properties. However, the penalty functions for SCAD and MCP are nonconvex, which introduces numerical challenges in fitting these models, as well as additional practical considerations in tuning parameter selection. The first paper develops algorithms for fitting nonconvex models in high dimensions and proposes local convexity as a diagnostic measure. The second extends these concepts to elastic net-type estimators and further explores the issue of tuning parameter selection. The third paper discusses methods for accelerating convergence in very high-dimensional problems.
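
As a rough illustration of the bias described above, the Python sketch below compares the standard closed-form solutions for a single standardized predictor under the lasso, MCP, and SCAD penalties. It is a minimal sketch under stated assumptions, not the algorithms developed in the papers: the function names are hypothetical, z stands for the unpenalized (least-squares) estimate, lam for the regularization parameter, and gamma for the extra tuning parameter of MCP (assumed greater than 1) and SCAD (assumed greater than 2).

```python
def soft_threshold(z, lam):
    """Lasso solution for one standardized predictor: shrink z toward 0 by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma=3.0):
    """MCP solution for one standardized predictor (assumes gamma > 1)."""
    if gamma <= 1:
        raise ValueError("MCP requires gamma > 1")
    if abs(z) <= gamma * lam:
        # Rescaled soft-thresholding: the shrinkage is relaxed as |z| grows.
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    # Beyond gamma * lam the estimate is left unpenalized.
    return z


def scad_threshold(z, lam, gamma=3.7):
    """SCAD solution for one standardized predictor (assumes gamma > 2)."""
    if gamma <= 2:
        raise ValueError("SCAD requires gamma > 2")
    if abs(z) <= 2.0 * lam:
        # Same as the lasso for small |z|.
        return soft_threshold(z, lam)
    if abs(z) <= gamma * lam:
        # Transition region between lasso-like shrinkage and no shrinkage.
        return soft_threshold(z, gamma * lam / (gamma - 1.0)) / (1.0 - 1.0 / (gamma - 1.0))
    return z


if __name__ == "__main__":
    lam = 1.0
    for z in (0.5, 1.5, 3.0, 5.0):
        print(z, soft_threshold(z, lam), mcp_threshold(z, lam), scad_threshold(z, lam))
```

For a large input such as z = 5 with lam = 1, the lasso rule still subtracts lam from the estimate, whereas the MCP and SCAD rules return z unchanged; all three set small inputs to exactly 0.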

Grouped (hierarchical) variable selection

In regression modeling, explanatory variables can often be thought of as grouped. Taking this grouping information into account in the modeling process should improve both the interpretability and the accuracy of the model. These gains are likely to be particularly important in high-dimensional settings where sparsity and variable selection play important roles in estimation accuracy. The first paper provides a review of this subject, while the second extends the ideas of nonconvex penalization to grouped variable selection and proposes efficient algorithms to fit these models. The third paper extends the idea of group selection to the problem of overlapping groups.
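
To make the idea of selecting variables as a group concrete, the Python sketch below shows the standard group-wise soft-thresholding step associated with the group lasso. It is only an illustration under simplifying assumptions (the predictors within the group are orthonormalized, and no group-size weight is applied to lam); it is not the nonconvex or overlapping-group methods of the papers, and the function name is hypothetical.

```python
import math


def group_soft_threshold(z, lam):
    """Shrink the whole coefficient vector z of one group toward 0.

    z   : list of unpenalized estimates for the variables in the group
    lam : regularization parameter (assumed >= 0)
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        # The entire group is dropped from the model at once.
        return [0.0] * len(z)
    # Otherwise every coefficient in the group is scaled by the same factor.
    scale = 1.0 - lam / norm
    return [scale * v for v in z]


if __name__ == "__main__":
    print(group_soft_threshold([3.0, 4.0], 1.0))  # group kept, shrunk
    print(group_soft_threshold([0.3, 0.4], 1.0))  # group removed
```

Because the decision depends only on the Euclidean norm of the group, the variables in a group enter or leave the model together, which is the sense in which the estimate is sparse at the group level.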

Bi-level variable selection

Most of the methods developed for grouped variable selection produce estimates that are sparse at the group level and not at the level of individual variables. This is not always appropriate for the data. In many applications (e.g., genetic association studies), the goal is to identify important individual markers, but to increase the power of the search by incorporating grouping information. The first paper below introduces this topic; the second identifies some shortcomings of the method proposed in the first and proposes a new method with many advantages over the first.

Copy-number association studies

The vast majority of effort spent on understanding the genetic basis of human variation and disease has focused on single-nucleotide polymorphisms (SNPs). Methods for carrying out genetic association studies involving copy-number variants (CNVs), on the other hand, still suffer from many shortcomings and are in need of further development. The first paper is a brief editorial summarizing these shortcomings. The second investigates the issues of smoothing and testing in two-step CNV association tests. The third paper follows up on and extends the ideas of the second.

Visualization of regression models

The importance of visualizing data is widely recognized. Visualizing models, and the estimates and predictions derived from them, is just as important, yet tools for easily carrying out these visualizations are less well developed. The paper below describes our development of software to provide tools for visualizing a wide class of regression models fit in R.

Genetics and genomics

I have been particularly motivated by genetic association studies and gene expression studies. These studies are high-dimensional, yet have structures imposed upon them by the underlying biology. Consequently, penalization, shrinkage, hierarchical modeling, visualization, and empirical Bayes methods are particularly useful tools here. Below are examples of the research I have been involved with in this area; several contain methodological innovation, although the focus of each article is generally on the scientific results rather than the methodology.