My research focuses on the analysis of high-dimensional data, particularly those that arise from the fields of genetics and genomics. As the data generated by experiments in these fields continues to grow in size and complexity, there is an ever greater need for sound statistical procedures that yield scientific insight from large amounts of information. Methodologically, my primary area of research is in penalized regression. Specific areas I am working in are listed below.
Inference for penalized regression estimators
Penalized regression is an attractive methodology for dealing with high-dimensional data where classical likelihood approaches to modeling break down. However, its widespread adoption has been hindered by a lack of inferential tools. In particular, penalized regression is very useful for variable selection, but how confident should one be about those selections? How many of those selections would likely have occurred by chance alone? The papers below represent my ongoing work to estimate false discovery rates for penalized regression models.
- Breheny P (2019). Marginal false discovery rates for penalized regression models. Biostatistics, 20: 299-314. [link] [pdf] [R package] [Reproduce]
- Miller RE and Breheny P (2019). Marginal false discovery rate control for likelihood-based penalized regression models. Biometrical Journal, 61: 889-901. [link] [pdf] [R package] [Reproduce]
- Miller R and Breheny P (In submission). Feature-specific inference for penalized regression using local false discovery rates. [pdf] [R package]
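To make the "by chance alone" question concrete, the sketch below fits the lasso to pure-noise data, where any selected feature is by definition a false discovery. It uses scikit-learn's Lasso in Python purely for illustration (the methods in the papers above are implemented in the linked R packages, and this sketch does not compute a marginal false discovery rate):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)  # pure noise: no feature truly affects y

# With a moderate penalty, the lasso still selects some features --
# here, every one of them is a false discovery.
fit = Lasso(alpha=0.1).fit(X, y)
n_selected = int(np.count_nonzero(fit.coef_))
print(f"features selected from pure noise: {n_selected}")
```

The papers above formalize this intuition, estimating how many of a model's selections would be expected under a null in which features carry no signal.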
Nonconvex penalties
Although the lasso has many attractive properties, it also introduces substantial bias toward 0 for large regression coefficients. The MCP and SCAD penalties have been proposed as alternatives designed to diminish this bias, and have been shown to have attractive theoretical and empirical properties. However, the SCAD and MCP penalty functions are nonconvex, which introduces numerical challenges in fitting these models, as well as additional practical considerations in tuning parameter selection. The first paper below develops algorithms for fitting nonconvex models in high dimensions and proposes local convexity as a diagnostic measure. The second extends these concepts to elastic net-type estimators and further explores tuning parameter selection. The third discusses methods for accelerating convergence in very high-dimensional problems.
- Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232–253. [link] [pdf] [R package]
- Huang J, Breheny P, Lee S, Ma S and Zhang C (2016). The Mnet method for variable selection. Statistica Sinica, 26: 903-923. [link] [pdf] [R package]
- Lee S and Breheny P (2015). Strong Rules for Nonconvex Penalties and Their Implications for Efficient Algorithms in High-Dimensional Regression. Journal of Computational and Graphical Statistics, 24: 1074-1091. [link] [pdf] [R package]
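The bias-reduction idea behind MCP can be seen in the one-dimensional (orthonormal-design) solutions. The following sketch, written in Python for illustration only (the actual implementations are in the R packages linked above), compares lasso soft thresholding with MCP "firm" thresholding, which agrees with the lasso near zero but leaves large coefficients unshrunk:

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution in the univariate orthonormal case:
    shrinks every estimate toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mcp_threshold(z, lam, gamma=3.0):
    """MCP solution ('firm thresholding'): soft-thresholds, then rescales,
    and returns the unpenalized estimate z once |z| exceeds gamma*lam."""
    z = np.asarray(z, dtype=float)
    inner = soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return np.where(np.abs(z) > gamma * lam, z, inner)

z = np.array([0.5, 1.5, 4.0])
print(soft_threshold(z, 1.0))  # [0.   0.5  3.  ] -- large effect shrunk by lam
print(mcp_threshold(z, 1.0))   # [0.   0.75 4.  ] -- large effect left unbiased
```

The nonconvexity is visible here: the MCP map is not a contraction, which is precisely what complicates optimization in the multivariate case.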
Grouped (hierarchical) variable selection
In regression modeling, explanatory variables can often be thought of as grouped. Taking this grouping information into account in the modeling process should improve both the interpretability and the accuracy of the model. These gains are likely to be particularly important in high-dimensional settings where sparsity and variable selection play important roles in estimation accuracy. The first paper provides a review of this subject, while the second extends the ideas of nonconvex penalization to grouped variable selection and proposes efficient algorithms to fit these models. The third paper extends the idea of group selection to the problem of overlapping groups.
- Huang J, Breheny P and Ma S (2012). A Selective Review of Group Selection in High-Dimensional Models. Statistical Science, 27: 481-499. [link] [pdf]
- Breheny P and Huang J (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25: 173-187. [link] [pdf] [R package]
- Zeng Y and Breheny P (2016). Overlapping group logistic regression with applications to genetic pathway selection. Cancer Informatics, 15: 179–187. [link] [pdf] [R package]
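At the core of group descent algorithms is a multivariate analogue of soft thresholding applied to each group's coefficient vector as a unit: the whole group is shrunk toward zero and dropped entirely when its joint signal is too weak. A minimal Python sketch of this group-level operator (illustrative only; the cited R packages contain the actual implementations):

```python
import numpy as np

def group_soft_threshold(z, lam):
    """Proximal operator of the group lasso penalty lam * ||b||_2:
    shrinks the group's entire coefficient vector toward zero, and sets
    it exactly to zero when the vector's norm falls below lam."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

# A group survives shrinkage only if its joint signal is strong enough.
print(group_soft_threshold(np.array([3.0, 4.0]), 1.0))  # [2.4 3.2]
print(group_soft_threshold(np.array([0.3, 0.4]), 1.0))  # [0. 0.]
```

Note the all-or-nothing behavior: either every coefficient in the group is nonzero or none is, which is exactly the group-level sparsity discussed above.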
Bi-level variable selection
Most methods developed for grouped variable selection produce estimates that are sparse at the group level but not at the level of individual variables. This is not always appropriate for the data. In many applications (e.g., genetic association studies), the goal is to identify important individual markers while increasing the power of the search by incorporating grouping information. The first paper below introduces this topic; the second identifies some shortcomings of the method proposed there and proposes a new method with several advantages over it.
- Breheny P and Huang J (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2: 369–380. [link] [pdf] [R package]
- Breheny P (2015). The group exponential lasso for bi-level variable selection. Biometrics, 71: 731–740. [link] [pdf] [R package]
Copy-number association studies
The vast majority of effort spent on understanding the genetic basis of human variation and disease has focused on single-nucleotide polymorphisms (SNPs). Methods for carrying out genetic association studies involving copy-number variants (CNVs), on the other hand, still suffer from many shortcomings and are in need of further development. The first paper is a brief editorial summarizing these shortcomings. The second investigates the issues of smoothing and testing in two-step CNV association tests. The third paper follows up on and extends the ideas of the second.
- Breheny P, Li Y and Charnigo R (2012). Statistical Challenges and Opportunities in Copy Number Variant Association Studies. Journal of Biometrics & Biostatistics, 3: e118. [link] [pdf]
- Breheny P, Chalise P, Batzler A, Wang L and Fridley BL (2012). Genetic Association Studies of Copy-Number Variation: Should Assignment of Copy Number States Precede Testing? PLoS ONE, 7: e34262. [link] [pdf]
- Li Y and Breheny P (2013). Kernel-Based Aggregation of Marker-Level Genetic Association Tests Involving Copy-Number Variation. Microarrays, 2: 265–283. [link] [pdf]
Visualization of regression models
The importance of visualizing data is widely recognized. Visualizing models, along with the estimates and predictions derived from them, is just as important, yet tools for easily carrying out these visualizations are less well developed. The paper below describes our development of software that provides tools for visualizing a wide class of regression models fit in R.
- Breheny P and Burchett W (2017). Visualization of regression models using visreg. The R Journal, 9: 56–71. [link] [pdf] [R package] [Homepage]
Genetics and genomics
I have been particularly motivated by genetic association studies and gene expression studies. These studies are high-dimensional, yet have structure imposed on them by the underlying biology. Consequently, penalization, shrinkage, hierarchical modeling, visualization, and empirical Bayes methods are particularly useful tools here. Below are examples of the research I have been involved with in this area; several contain methodological innovations, although the focus of these articles is generally on the scientific results rather than the methodology.
- Yi H, Breheny P, Imam N, Liu Y and Hoeschele I (2015). Penalized multimarker vs. single-marker regression methods for genome-wide association studies of quantitative traits. Genetics, 199: 205-222. [link] [pdf]
- McClintock TS, Adipietro K, Titlow WB, Breheny P, Walz A, Mombaerts P and Matsunami H (2014). In Vivo identification of eugenol-responsive and muscone-responsive mouse odorant receptors. The Journal of Neuroscience, 34: 15669–15678. [link] [pdf]
- Nickell MD, Breheny P, Stromberg AJ and McClintock TS (2012). Genomics of mature and immature olfactory sensory neurons. Journal of Comparative Neurology, 520: 2608–2629. [link] [pdf]