[95th Anniversary Lecture Series] Effect Size Heterogeneity Matters in High Dimensions

Posted: 2020-06-03

Guanghua Forum: Celebrities and Entrepreneurs Forum, Session 5759

(Online lecture)

Topic: Effect Size Heterogeneity Matters in High Dimensions

Speaker: Assistant Professor Weijie Su, University of Pennsylvania

Host: Professor Jinyuan Chang, School of Statistics

Time: Friday, June 5, 2020, 10:00-11:20

Streaming platform and meeting ID: Tencent Meeting, meeting ID: 872 707 851

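Organizers: Center of Statistical Research; Joint Laboratory of Data Science and Business Intelligence; School of Statistics; Office of Scientific Research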

Speaker bio:

Weijie Su is an Assistant Professor in the Wharton Statistics Department at the University of Pennsylvania, where he co-directs Penn Research in Machine Learning. Prior to joining Penn, he received his Ph.D. in Statistics from Stanford University in 2016 and his B.S. in Mathematics from Peking University in 2011. His research interests span high-dimensional statistics, mathematical optimization, privacy-preserving data analysis, multiple hypothesis testing, and deep learning theory. He is a recipient of the Stanford Theodore W. Anderson Dissertation Award in 2016, an NSF CAREER Award in 2019, and an Alfred P. Sloan Research Fellowship and a Facebook Faculty Research Award in 2020.

Abstract:

In high-dimensional linear regression, does increasing the true effect sizes always lead to better model selection when all other conditions, such as sparsity, are held fixed? In this paper, we answer this question in the negative for the Lasso in a certain regime of sparsity by introducing a new notion that we term effect size heterogeneity. Roughly speaking, a regression coefficient vector has high effect size heterogeneity if its nonzero entries have significantly different magnitudes, and low heterogeneity if they have similar magnitudes. From the perspective of this new measure, we prove that in the regime of linear sparsity, false and true positive rates achieve the optimal trade-off uniformly along the Lasso path when this measure is maximal, in the sense that all nonzero effect sizes have very different magnitudes, and the worst-case trade-off is achieved when it is minimal, in the sense that all nonzero effect sizes are about equal. Moreover, we demonstrate that the Lasso path produces an optimal ranking of explanatory variables in terms of the rank of the first false variable when the effect size heterogeneity is maximal, and vice versa. Taken together, the two findings suggest that effect size heterogeneity should serve as a complementary measure to the sparsity of regression coefficients in the analysis of high-dimensional regression problems. In the case of low effect size heterogeneity, variables with comparable effect sizes, no matter how large they are, would metaphorically compete with each other along the Lasso path, leading to the degradation of the Lasso in terms of variable selection. Our proofs use techniques from approximate message passing theory as well as a novel argument for estimating the rank of the first false variable.
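For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of how they might be computed for a single fitted path. It is not taken from the paper. The helper names `tpp_fdp` and `rank_of_first_false` are hypothetical, the definitions follow a common convention (true positive proportion and false discovery proportion) that the announcement itself does not spell out, and the toy numbers are made up for illustration.

```python
from typing import Sequence, Set, Tuple


def tpp_fdp(selected: Set[int], true_support: Set[int]) -> Tuple[float, float]:
    """Return (TPP, FDP) for one selected model.

    Assumed convention (not stated in the announcement):
      TPP = true discoveries / number of truly nonzero coefficients
      FDP = false discoveries / number of selected variables
    Denominators are floored at 1 to avoid division by zero.
    """
    true_hits = len(selected & true_support)
    false_hits = len(selected - true_support)
    tpp = true_hits / max(len(true_support), 1)
    fdp = false_hits / max(len(selected), 1)
    return tpp, fdp


def rank_of_first_false(entry_order: Sequence[int], true_support: Set[int]) -> int:
    """1-based position of the first variable outside the true support.

    Assumption: if no false variable ever enters, return len(entry_order) + 1.
    """
    for position, index in enumerate(entry_order, start=1):
        if index not in true_support:
            return position
    return len(entry_order) + 1


# Toy example with made-up numbers (not results from the paper).
true_support = {0, 1, 2, 3}
entry_order = [2, 0, 7, 1, 5, 3]  # hypothetical order of entry along a path

first_false = rank_of_first_false(entry_order, true_support)
tpp, fdp = tpp_fdp(set(entry_order[:first_false]), true_support)
print(first_false, tpp, fdp)
```

In this toy ordering, two true variables enter before the first false one, so the rank of the first false variable is 3. A larger rank means the path orders more true variables ahead of any false one.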
