LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics (* marks items worth particular attention)

1. [LG] Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime

A Nitanda, T Suzuki
[The University of Tokyo]

We analyze the convergence of the averaged stochastic gradient descent for over-parameterized two-layer neural networks for regression problems. It was recently found that, under the neural tangent kernel (NTK) regime, where the learning dynamics for overparameterized neural networks can be mostly characterized by that for the associated reproducing kernel Hilbert space (RKHS), an NTK plays an important role in revealing the global convergence of gradient-based methods. However, there is still room for a convergence rate analysis in the NTK regime. In this study, we show the global convergence of the averaged stochastic gradient descent and derive the optimal convergence rate by exploiting the complexities of the target function and the RKHS associated with the NTK. Moreover, we show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate through a smooth approximation of ReLU networks under certain conditions.
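To make the "averaged" part of averaged SGD concrete, here is a minimal sketch of iterate averaging on a plain linear least-squares problem. It is not the paper's setting (no two-layer network, no NTK analysis); the function names, step size, and synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def averaged_sgd(X, y, lr=0.05, epochs=5, seed=0):
    """Single-sample SGD on squared loss; returns (last_iterate, averaged_iterate)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2
            w = w - lr * grad
            t += 1
            w_avg += (w - w_avg) / t          # running mean of all iterates so far
    return w, w_avg

def make_regression(n=500, d=5, noise=0.1, seed=0):
    """Hypothetical synthetic data: y = X @ w_true + Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y, w_true

if __name__ == "__main__":
    X, y, w_true = make_regression()
    w_last, w_avg = averaged_sgd(X, y)
    print("last-iterate error:", np.linalg.norm(w_last - w_true))
    print("averaged error:    ", np.linalg.norm(w_avg - w_true))
```

The returned average is the uniform mean of all iterates, which smooths out the gradient noise carried by the last iterate alone.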


2. [LG] Rethinking Architecture Selection in Differentiable NAS

R Wang, M Cheng, X Chen, X Tang, C Hsieh
[UCLA & DiDi AI Labs]

Differentiable Neural Architecture Search is one of the most popular Neural Architecture Search (NAS) methods for its search efficiency and simplicity, accomplished by jointly optimizing the model weight and architecture parameters in a weight-sharing supernet via gradient-based algorithms. At the end of the search phase, the operations with the largest architecture parameters will be selected to form the final architecture, with the implicit assumption that the values of architecture parameters reflect the operation strength. While much has been discussed about the supernet’s optimization, the architecture selection process has received little attention. We provide empirical and theoretical analysis to show that the magnitude of architecture parameters does not necessarily indicate how much the operation contributes to the supernet’s performance. We propose an alternative perturbation-based architecture selection that directly measures each operation’s influence on the supernet. We re-evaluate several differentiable NAS methods with the proposed architecture selection and find that it is able to extract significantly improved architectures from the underlying supernets consistently. Furthermore, we find that several failure modes of DARTS can be greatly alleviated with the proposed selection method, indicating that much of the poor generalization observed in DARTS can be attributed to the failure of magnitude-based architecture selection rather than entirely the optimization of its supernet.
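The difference between magnitude-based and perturbation-based selection can be sketched on a toy single-edge "supernet". Everything here is a contrived assumption for illustration: the three candidate operations, the architecture parameters, and the choice to mask an operation without renormalizing the remaining weights. The sketch also omits any supernet fine-tuning between selection steps.

```python
import numpy as np

# Hypothetical candidate operations on one edge of a toy supernet.
OPS = {
    "skip": lambda x: x,
    "conv": lambda x: 4.0 * x,
    "zero": lambda x: np.zeros_like(x),
}

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def supernet_loss(alpha, x, y, removed=None):
    """Mean squared error of the softmax-weighted mixture, optionally masking one op."""
    weights = softmax(alpha)
    out = np.zeros_like(x)
    for w, (name, op) in zip(weights, OPS.items()):
        if name != removed:
            out += w * op(x)
    return float(np.mean((out - y) ** 2))

def select_by_magnitude(alpha):
    """Pick the operation with the largest architecture parameter."""
    return list(OPS)[int(np.argmax(alpha))]

def select_by_perturbation(alpha, x, y):
    """Pick the operation whose removal increases the supernet loss the most."""
    base = supernet_loss(alpha, x, y)
    increase = {name: supernet_loss(alpha, x, y, removed=name) - base for name in OPS}
    return max(increase, key=increase.get)

alpha = np.log(np.array([0.5, 0.3, 0.2]))  # softmax(alpha) is approximately [0.5, 0.3, 0.2]
x = np.linspace(-1.0, 1.0, 21)
y = 1.7 * x                                # the full mixture 0.5*x + 0.3*4x fits this target

print(select_by_magnitude(alpha))          # skip
print(select_by_perturbation(alpha, x, y)) # conv
```

In this toy case "skip" has the largest architecture parameter, yet removing "conv" hurts the mixture most, so the two selection rules disagree.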


3. [CL] Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation

B Zhang, A Bapna, R Sennrich, O Firat
[University of Edinburgh & Google Research & University of Zurich]

Using a mix of shared and language-specific (LS) parameters has shown promise in multilingual neural machine translation (MNMT), but the question of when and where LS capacity matters most is still under-studied. We offer such a study by proposing conditional language-specific routing (CLSR). CLSR employs hard binary gates conditioned on token representations to dynamically select LS or shared paths. By manipulating these gates, it can schedule LS capacity across sub-layers in MNMT subject to the guidance of translation signals and budget constraints. Moreover, CLSR can easily scale up to massively multilingual settings. Experiments with Transformer on OPUS-100 and WMT datasets show that: 1) MNMT is sensitive to both the amount and the position of LS modeling: distributing 10%-30% LS computation to the top and/or bottom encoder/decoder layers delivers the best performance; and 2) one-to-many translation benefits more from CLSR compared to many-to-one translation, particularly with unbalanced training data. Our study further verifies the trade-off between the shared capacity and LS capacity for multilingual translation. We corroborate our analysis by confirming the soundness of our findings as the foundation of our improved multilingual Transformers. Source code and models are available at https://github.com/googleinterns/cct-m4.
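A rough inference-time sketch of a hard binary gate that routes each token to either a language-specific or a shared projection is given below. The linear gate scorer, the weight names, and the budget measure are assumptions made for illustration and are not taken from the released code; training-time details are omitted.

```python
import numpy as np

def clsr_layer(h, w_gate, w_shared, w_lang):
    """Route each token through the LS or the shared projection via a hard 0/1 gate.

    h: (tokens, d) token representations
    w_gate: (d,) hypothetical linear gate scorer
    w_shared, w_lang: (d, d) shared and language-specific projections
    """
    scores = h @ w_gate                               # one scalar score per token
    gate = (scores >= 0).astype(h.dtype)[:, None]     # hard binary decision per token
    out = gate * (h @ w_lang) + (1.0 - gate) * (h @ w_shared)
    return out, gate[:, 0]

def budget_gap(gate, budget):
    """Distance between the fraction of tokens using the LS path and a target budget."""
    return abs(float(gate.mean()) - budget)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(6, 4))
    out, gate = clsr_layer(h, rng.normal(size=4),
                           rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
    print("gate decisions:", gate)
    print("gap to a 30% LS budget:", budget_gap(gate, 0.3))
```

A budget term of this kind is one way to read "budget constraints" in the abstract: it penalizes the layer when the share of tokens taking the LS path drifts from the target fraction.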


4. [LG] Learning with Feature-Dependent Label Noise: A Progressive Approach

Y Zhang, S Zheng, P Wu, M Goswami, C Chen
[Stony Brook University]

Label noise is frequently observed in real-world large-scale datasets. The noise is introduced due to a variety of reasons; it is heterogeneous and feature-dependent. Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms SOTA baselines and is robust to various noise types and levels.
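The "iteratively corrects labels and refines the model" loop can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for the neural network, the threshold schedule is made up, and the synthetic noise is uniform random flips rather than feature-dependent noise.

```python
import numpy as np

def class_probs(X, labels):
    """Toy 'model': softmax over negative squared distances to the two class centroids."""
    centroids = np.stack([X[labels == c].mean(axis=0) for c in (0, 1)])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def progressive_label_correction(X, noisy, theta=0.9, step=0.1, theta_min=0.3, rounds=20):
    """Refit the model, flip labels it contradicts with high confidence, then relax theta."""
    labels = noisy.copy()
    idx = np.arange(len(labels))
    for _ in range(rounds):
        p = class_probs(X, labels)                 # "refine the model" on current labels
        pred = p.argmax(axis=1)
        gap = p[idx, pred] - p[idx, labels]        # confidence margin over the given label
        flip = (pred != labels) & (gap >= theta)
        if flip.any():
            labels[flip] = pred[flip]              # correct the confidently wrong labels
        elif theta - step >= theta_min:
            theta -= step                          # nothing to fix: lower the threshold
        else:
            break
    return labels

def make_noisy_blobs(n=200, seed=0):
    """Hypothetical data: two Gaussian blobs with 20% of labels flipped at random."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([rng.normal([-3, 0], 1.0, size=(n, 2)),
                        rng.normal([3, 0], 1.0, size=(n, 2))])
    clean = np.repeat([0, 1], n)
    noisy = clean.copy()
    flipped = rng.choice(2 * n, size=(2 * n) // 5, replace=False)
    noisy[flipped] = 1 - noisy[flipped]
    return X, clean, noisy

if __name__ == "__main__":
    X, clean, noisy = make_noisy_blobs()
    corrected = progressive_label_correction(X, noisy)
    print("label accuracy before:", (noisy == clean).mean())
    print("label accuracy after: ", (corrected == clean).mean())
```

Starting with a high threshold means only the most confident disagreements are corrected first; the threshold is lowered only once a round produces no further corrections.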


5. [LG] Metadata Normalization

M Lu, Q Zhao, J Zhang, K M. Pohl, L Fei-Fei, J C Niebles, E Adeli
[Stanford University]


[RO] OmniHang: Learning to Hang Arbitrary Objects using Contact Point Correspondences and Neural Collision Estimation

Y You, L Shao, T Migimatsu, J Bohg
[University of California, Los Angeles & Stanford University]

[CV] High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation

L Chen, C Cao, F D l Torre, J Saragih, C Xu, Y Sheikh
[Facebook Reality Labs & University of Rochester]

[LG] Muesli: Combining Improvements in Policy Optimization

M Hessel, I Danihelka, F Viola, A Guez, S Schmitt, L Sifre, T Weber, D Silver, H v Hasselt

[LG] Differentiable Model Compression via Pseudo Quantization Noise

A Défossez, Y Adi, G Synnaeve
[Facebook AI Research]