My research focuses on the analysis of high-dimensional data, particularly those that arise from the fields of genetics and genomics. As the data generated by experiments in these fields continues to grow in size and complexity, there is an ever greater need for sound statistical procedures that yield scientific insight from large amounts of information. Methodologically, my primary area of research is in penalized regression. Specific areas I am working in are listed below.
Inference for penalized regression estimators
Penalized regression is an attractive methodology for dealing with high-dimensional data where classical likelihood approaches to modeling break down. However, its widespread adoption has been hindered by a lack of inferential tools. In particular, penalized regression is very useful for variable selection, but how confident should one be about those selections? How many of those selections would likely have occurred by chance alone? The papers below represent my ongoing work to estimate false discovery rates for penalized regression models.
- Breheny P (2019). Marginal false discovery rates for penalized regression models. Biostatistics, 20: 299-314. [link] [pdf] [R package] [Reproduce]
- Miller RE and Breheny P (2019). Marginal false discovery rate control for likelihood-based penalized regression models. Biometrical Journal, 61: 889-901. [link] [pdf] [R package] [Reproduce]
- Miller R and Breheny P (In submission). Feature-specific inference for penalized regression using local false discovery rates. [pdf] [R package]
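To make the "by chance alone" question concrete, the sketch below fits the lasso to pure-noise data, where any selected feature is by definition a false discovery. It uses scikit-learn's Lasso in Python purely for illustration (the methods in the papers above are implemented in the linked R packages, and this sketch does not compute a marginal false discovery rate):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)  # pure noise: no feature truly affects y

# With a moderate penalty, the lasso still selects some features --
# here, every one of them is a false discovery.
fit = Lasso(alpha=0.1).fit(X, y)
n_selected = int(np.count_nonzero(fit.coef_))
print(f"features selected from pure noise: {n_selected}")
```

The papers above formalize this intuition, estimating how many of a model's selections would be expected under a null in which features carry no signal.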
Nonconvex penalties
Although the lasso has many attractive properties, it also introduces substantial bias toward 0 for large regression coefficients. The MCP and SCAD penalties have been proposed as alternatives designed to diminish this bias, and have been shown to have attractive theoretical and empirical properties. However, the SCAD and MCP penalty functions are nonconvex, which introduces numerical challenges in fitting these models, as well as additional practical considerations in tuning parameter selection. The first paper below develops algorithms for fitting nonconvex models in high dimensions and proposes local convexity as a diagnostic measure. The second extends these concepts to elastic net-type estimators and further explores tuning parameter selection. The third discusses methods for accelerating convergence in very high-dimensional problems.
- Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232–253. [link] [pdf] [R package]
- Huang J, Breheny P, Lee S, Ma S and Zhang C (2016). The Mnet method for variable selection. Statistica Sinica, 26: 903-923. [link] [pdf] [R package]
- Lee S and Breheny P (2015). Strong Rules for Nonconvex Penalties and Their Implications for Efficient Algorithms in High-Dimensional Regression. Journal of Computational and Graphical Statistics, 24: 1074-1091. [link] [pdf] [R package]
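The bias-reduction idea behind MCP can be seen in the one-dimensional (orthonormal-design) solutions. The following sketch, written in Python for illustration only (the actual implementations are in the R packages linked above), compares lasso soft thresholding with MCP "firm" thresholding, which agrees with the lasso near zero but leaves large coefficients unshrunk:

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution in the univariate orthonormal case:
    shrinks every estimate toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mcp_threshold(z, lam, gamma=3.0):
    """MCP solution ('firm thresholding'): soft-thresholds, then rescales,
    and returns the unpenalized estimate z once |z| exceeds gamma*lam."""
    z = np.asarray(z, dtype=float)
    inner = soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return np.where(np.abs(z) > gamma * lam, z, inner)

z = np.array([0.5, 1.5, 4.0])
print(soft_threshold(z, 1.0))  # [0.   0.5  3.  ] -- large effect shrunk by lam
print(mcp_threshold(z, 1.0))   # [0.   0.75 4.  ] -- large effect left unbiased
```

The nonconvexity is visible here: the MCP map is not a contraction, which is precisely what complicates optimization in the multivariate case.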
Grouped (hierarchical) variable selection
In regression modeling, explanatory variables can often be thought of as grouped. Taking this grouping information into account in the modeling process should improve both the interpretability and the accuracy of the model. These gains are likely to be particularly important in high-dimensional settings where sparsity and variable selection play important roles in estimation accuracy. The first paper provides a review of this subject, while the second extends the ideas of nonconvex penalization to grouped variable selection and proposes efficient algorithms to fit these models. The third paper extends the idea of group selection to the problem of overlapping groups.
- Huang J, Breheny P and Ma S (2012). A Selective Review of Group Selection in High-Dimensional Models. Statistical Science, 27: 481-499. [link] [pdf]
- Breheny P and Huang J (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25: 173-187. [link] [pdf] [R package]
- Zeng Y and Breheny P (2016). Overlapping group logistic regression with applications to genetic pathway selection. Cancer Informatics, 15: 179–187. [link] [pdf] [R package]
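At the core of group descent algorithms is a multivariate analogue of soft thresholding applied to each group's coefficient vector as a unit: the whole group is shrunk toward zero and dropped entirely when its joint signal is too weak. A minimal Python sketch of this group-level operator (illustrative only; the cited R packages contain the actual implementations):

```python
import numpy as np

def group_soft_threshold(z, lam):
    """Proximal operator of the group lasso penalty lam * ||b||_2:
    shrinks the group's entire coefficient vector toward zero, and sets
    it exactly to zero when the vector's norm falls below lam."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

# A group survives shrinkage only if its joint signal is strong enough.
print(group_soft_threshold(np.array([3.0, 4.0]), 1.0))  # [2.4 3.2]
print(group_soft_threshold(np.array([0.3, 0.4]), 1.0))  # [0. 0.]
```

Note the all-or-nothing behavior: either every coefficient in the group is nonzero or none is, which is exactly the group-level sparsity discussed above.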
Bi-level variable selection
Most methods developed for grouped variable selection produce estimates that are sparse at the group level but not at the level of individual variables. This is not always appropriate for the data. In many applications (e.g., genetic association studies), the goal is to identify important individual markers while increasing the power of the search by incorporating grouping information. The first paper below introduces this topic; the second identifies some shortcomings of the method proposed there and proposes a new method with several advantages over it.
- Breheny P and Huang J (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2: 369–380. [link] [pdf] [R package]
- Breheny P (2015). The group exponential lasso for bi-level variable selection. Biometrics, 71: 731–740. [link] [pdf] [R package]
Copy-number association studies
The vast majority of effort spent on understanding the genetic basis of human variation and disease has focused on single-nucleotide polymorphisms (SNPs). Methods for carrying out genetic association studies involving copy-number variants (CNVs), on the other hand, still suffer from many shortcomings and are in need of further development. The first paper is a brief editorial summarizing these shortcomings. The second investigates the issues of smoothing and testing in two-step CNV association tests. The third paper follows up on and extends the ideas of the second.
- Breheny P, Li Y and Charnigo R (2012). Statistical Challenges and Opportunities in Copy Number Variant Association Studies. Journal of Biometrics & Biostatistics, 3: e118. [link] [pdf]
- Breheny P, Chalise P, Batzler A, Wang L and Fridley BL (2012). Genetic Association Studies of Copy-Number Variation: Should Assignment of Copy Number States Precede Testing? PLoS ONE, 7: e34262. [link] [pdf]
- Li Y and Breheny P (2013). Kernel-Based Aggregation of Marker-Level Genetic Association Tests Involving Copy-Number Variation. Microarrays, 2: 265–283. [link] [pdf]
Visualization of regression models
The importance of visualizing data is widely recognized. Visualizing models, along with the estimates and predictions derived from them, is just as important, yet tools for easily carrying out these visualizations are less well developed. The paper below describes our development of software that provides tools for visualizing a wide class of regression models fit in R.
- Breheny P and Burchett W (2017). Visualization of regression models using visreg. The R Journal, 9: 56–71. [link] [pdf] [R package] [Homepage]
Genetics and genomics
I have been particularly motivated by genetic association studies and gene expression studies. These studies are high-dimensional, yet have structure imposed on them by the underlying biology. Consequently, penalization, shrinkage, hierarchical modeling, visualization, and empirical Bayes methods are particularly useful tools here. Below are examples of the research I have been involved with in this area; several contain methodological innovations, although the focus of these articles is generally on the scientific results rather than the methodology.
- Yi H, Breheny P, Imam N, Liu Y and Hoeschele I (2015). Penalized multimarker vs. single-marker regression methods for genome-wide association studies of quantitative traits. Genetics, 199: 205-222. [link] [pdf]
- McClintock TS, Adipietro K, Titlow WB, Breheny P, Walz A, Mombaerts P and Matsunami H (2014). In Vivo identification of eugenol-responsive and muscone-responsive mouse odorant receptors. The Journal of Neuroscience, 34: 15669–15678. [link] [pdf]
- Nickell MD, Breheny P, Stromberg AJ and McClintock TS (2012). Genomics of mature and immature olfactory sensory neurons. Journal of Comparative Neurology, 520: 2608–2629. [link] [pdf]