In this paper, we propose a parametric kernel mode-based regression, built on the mode value, that yields robust and efficient estimators when the data contain outliers or follow heavy-tailed distributions. We show that the resulting estimators attain the highest possible asymptotic breakdown point of 0.5. We then apply this regression to massive datasets by combining it with distributed statistical learning, which greatly reduces the amount of primary memory required while incorporating heterogeneity into the estimation procedure. By approximating the local kernel objective function in a least squares form, we retain compact summary statistics on each worker machine and use them to reconstruct the estimate for the entire dataset with asymptotically minimal approximation error. With a Gaussian kernel, we introduce an iterative algorithm built on the expectation-maximization procedure, which substantially lessens the computational burden. We establish the asymptotic properties of the proposed mode-based estimators and prove that the estimator for massive datasets is statistically as efficient as the global mode-based estimator computed on the full dataset. We further carry out shrinkage estimation based on a local quadratic approximation and, using an adaptive LASSO approach, demonstrate that the resulting estimator has the oracle property. The finite-sample performance of the proposed method is illustrated through simulations and a real data analysis.
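To make the Gaussian-kernel iteration concrete, here is a minimal sketch of a generic modal EM update for a linear mode-based regression on a single machine. It is not the speaker's implementation: the function name `modal_linear_regression`, the bandwidth `h`, the OLS starting value, and the simulated data are all assumptions chosen for illustration. With a Gaussian kernel, the E-step computes normalized kernel weights from the current residuals, and the M-step reduces to a weighted least squares solve.

```python
import numpy as np

def modal_linear_regression(X, y, h, n_iter=200, tol=1e-8):
    """Hypothetical sketch: maximize sum_i phi_h(y_i - x_i' beta) by modal EM.

    X : (n, p) covariates without intercept; y : (n,) responses;
    h : Gaussian kernel bandwidth (assumed given, not selected here).
    """
    Xd = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]    # OLS starting value (assumption)
    for _ in range(n_iter):
        resid = y - Xd @ beta
        # E-step: Gaussian kernel weights, normalized to sum to one
        w = np.exp(-0.5 * (resid / h) ** 2)
        w /= w.sum()
        # M-step: with a Gaussian kernel this is weighted least squares
        XtW = Xd.T * w
        beta_new = np.linalg.solve(XtW @ Xd, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated example (assumed setup): true line y = 1 + 2x, 10% gross outliers.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)
y[:50] += 20.0                                      # contaminate 50 responses
beta_hat = modal_linear_regression(x.reshape(-1, 1), y, h=1.0)
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]
print("modal EM:", beta_hat)
print("OLS     :", beta_ols)
```

In this setup the contaminated observations receive kernel weights close to zero, so the modal fit tracks the bulk of the data while the OLS intercept is pulled upward. The distributed least squares approximation and the adaptive LASSO step described in the abstract are not covered by this sketch.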

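About the speaker:

Tao Wang is an Assistant Professor at the University of Victoria, Canada, and holds a PhD in Economics from the University of California, Riverside. Research area: econometrics, with work published in JoE, JRSSA, OBES, Statistica Sinica, and the Journal of Time Series Analysis, among other journals.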