LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

1. [LG] An Attention Free Transformer

S Zhai, W Talbott, N Srivastava, C Huang, H Goh, R Zhang, J Susskind
[Apple Inc]

We introduce Attention Free Transformer (AFT), an efficient variant of Transformers [1] that eliminates the need for dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query element-wise. This new operation has a memory complexity that is linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT achieves competitive performance on all the benchmarks while remaining highly efficient.
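The layer described above can be sketched in a few lines. This is one reading of the abstract, not the authors' implementation: the sigmoid gate on the query and the softmax-style normalization over positions are assumptions that the abstract does not spell out, and the names `aft_layer` and `w` are hypothetical. The loops are written for clarity, not speed.

```python
import math

def aft_layer(Q, K, V, w):
    """Sketch of an AFT-style layer (assumed form, see lead-in).

    Q, K, V: T rows of d floats each. w: T x T learned position biases,
    where w[t][s] is the bias between target position t and source position s.
    """
    T, d = len(Q), len(Q[0])
    Y = []
    for t in range(T):
        row = []
        for j in range(d):
            # Combine keys with position biases, separately for each feature j.
            logits = [K[s][j] + w[t][s] for s in range(T)]
            m = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(l - m) for l in logits]
            # Weighted average of the values over all source positions.
            context = sum(weights[s] * V[s][j] for s in range(T)) / sum(weights)
            # Element-wise gate from the query (assumed to be a sigmoid).
            gate = 1.0 / (1.0 + math.exp(-Q[t][j]))
            row.append(gate * context)
        Y.append(row)
    return Y

# With zero keys and zero biases every position gets equal weight, and a
# zero query gives a gate of 0.5, so each output is half the mean of V.
zeros = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Y = aft_layer(zeros, zeros, V, zeros)
```

No query-key dot product appears anywhere: the query only gates the result feature by feature, which is the point the abstract makes.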


2. [CL] Comparing Test Sets with Item Response Theory

C Vania, P M Htut, W Huang, D Mungra, R Y Pang, J Phang, H Liu, K Cho, S R Bowman
[Amazon & New York University & Allen Institute for AI]

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
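To make the Item Response Theory idea concrete, here is a minimal sketch of a standard logistic item response function. The exact model variant and fitting procedure used in the paper are not given in the abstract, so the three-parameter form, the function name `irt_prob`, and the example parameter values below are illustrative assumptions.

```python
import math

def irt_prob(ability, discrimination, difficulty, guessing=0.0):
    """Probability that a model with the given ability answers an item correctly.

    Standard logistic IRT form; `guessing` is the floor reached by chance.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# Hypothetical abilities for a weaker and a stronger model.
weak, strong = 0.0, 2.0

# An easy item: both models are near the ceiling, so it barely separates them.
easy_gap = irt_prob(strong, 1.0, -4.0) - irt_prob(weak, 1.0, -4.0)

# A hard, highly discriminative item: the gap between the models is large.
hard_gap = irt_prob(strong, 2.0, 1.0) - irt_prob(weak, 2.0, 1.0)
```

Under this sketch, a saturated dataset corresponds to items like the first one, where strong and weak models score almost the same, while a dataset that still detects progress has many items like the second.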


3. [CL] Attention Flows are Shapley Value Explanations

K Ethayarajh, D Jurafsky
[Stanford University]

Shapley Values, a solution to the credit assignment problem in cooperative game theory, are a popular type of explanation in machine learning, having been used to explain the importance of features, embeddings, and even neurons. In NLP, however, leave-one-out and attention-based explanations still predominate. Can we draw a connection between these different methods? We formally prove that, save for the degenerate case, attention weights and leave-one-out values cannot be Shapley Values. Attention flow is a post-processed variant of attention weights obtained by running the max-flow algorithm on the attention graph. Perhaps surprisingly, we prove that attention flows are indeed Shapley Values, at least at the layerwise level. Given the many desirable theoretical qualities of Shapley Values, which have driven their adoption among the ML community, we argue that NLP practitioners should, when possible, adopt attention flow explanations alongside more traditional ones.
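For readers unfamiliar with Shapley Values, the sketch below computes them exactly for a small cooperative game by averaging each player's marginal contribution over all orderings. It illustrates the general definition only, not the paper's attention-flow construction; the toy game `toy_value` and its payoffs are made up for illustration.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley Values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to the payoff of that coalition.
    Cost grows factorially, so this is only practical for a few players.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

def toy_value(coalition):
    """Hypothetical game: 'a' adds 10, 'b' adds 20, together they add 6 more."""
    payoff = 0.0
    if "a" in coalition:
        payoff += 10.0
    if "b" in coalition:
        payoff += 20.0
    if {"a", "b"} <= coalition:
        payoff += 6.0
    return payoff

phi = shapley_values(["a", "b", "c"], toy_value)
```

Here the synergy of 6 is split evenly between "a" and "b", the idle player "c" receives nothing, and the values sum to the payoff of the full coalition. These are the kinds of properties that make Shapley Values attractive as explanations.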


4. [CV] Single Image Depth Estimation using Wavelet Decomposition

M Ramamonjisoa, M Firman, J Watson, V Lepetit, D Turmukhambetov
[Univ Gustave Eiffel & Niantic]

We present a novel method for predicting accurate depths from monocular images with high efficiency. This efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients. In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on the coefficients. Instead, we supervise only the final depth image that is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results compared to the original model, while requiring less than half the multiply-adds in the decoder network.
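The sparsity argument can be illustrated with a single level of a 2D Haar transform. This is a sketch under assumptions: the abstract does not say which wavelet or normalization the authors use, and the function names and the toy depth map below are hypothetical. The point is only that a piecewise-flat depth map has zero detail coefficients everywhere except at depth edges, and that the inverse transform recovers the map exactly.

```python
def haar2d(img):
    """One level of a 2D Haar transform (averaging convention) on an
    image with even height and width. Returns (low, dx, dy, dxy)."""
    h, w = len(img) // 2, len(img[0]) // 2
    low = [[0.0] * w for _ in range(h)]
    dx = [[0.0] * w for _ in range(h)]
    dy = [[0.0] * w for _ in range(h)]
    dxy = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            low[i][j] = (a + b + c + d) / 4   # coarse depth
            dx[i][j] = (a - b + c - d) / 4    # change across columns
            dy[i][j] = (a + b - c - d) / 4    # change across rows
            dxy[i][j] = (a - b - c + d) / 4   # diagonal change
    return low, dx, dy, dxy

def inverse_haar2d(low, dx, dy, dxy):
    """Exact inverse of haar2d."""
    h, w = len(low), len(low[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            l, x, y, z = low[i][j], dx[i][j], dy[i][j], dxy[i][j]
            img[2 * i][2 * j] = l + x + y + z
            img[2 * i][2 * j + 1] = l - x + y - z
            img[2 * i + 1][2 * j] = l + x - y - z
            img[2 * i + 1][2 * j + 1] = l - x - y + z
    return img

# Toy depth map: a flat surface at depth 1 with a vertical edge to depth 5.
depth = [[1.0, 1.0, 1.0, 5.0] for _ in range(4)]
low, dx, dy, dxy = haar2d(depth)
details = [v for band in (dx, dy, dxy) for row in band for v in row]
nonzero_details = sum(1 for v in details if v != 0.0)
restored = inverse_haar2d(low, dx, dy, dxy)
```

Only the detail coefficients that straddle the depth edge are nonzero, so a decoder that predicts coefficients only where they matter can skip most of the computation at full resolution.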


5. [LG] A Differentiable Point Process with Its Application to Spiking Neural Networks

H Kajino
[IBM Research]
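A differentiable point process and its application to spiking neural networks. This paper addresses learning algorithms for probabilistic models of spiking neural networks (SNNs). Jimenez Rezende & Gerstner proposed a stochastic variational inference algorithm for training SNNs with hidden neurons. That algorithm updates the variational distribution with the score function gradient estimator, whose high variance often hinders learning. The paper proposes an alternative gradient estimator for SNNs based on the path-wise gradient estimator. The main technical difficulty is the lack of a general method for differentiating a realization of an arbitrary point process, which is required to derive the path-wise estimator. The author proposes a differentiable point process and uses it to derive a path-wise gradient estimator for SNNs, then studies the estimator's effectiveness through numerical simulation.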

This paper is concerned with a learning algorithm for a probabilistic model of spiking neural networks (SNNs). Jimenez Rezende & Gerstner (2014) proposed a stochastic variational inference algorithm to train SNNs with hidden neurons. The algorithm updates the variational distribution using the score function gradient estimator, whose high variance often impedes the whole learning algorithm. This paper presents an alternative gradient estimator for SNNs based on the path-wise gradient estimator. The main technical difficulty is the lack of a general method to differentiate a realization of an arbitrary point process, which is necessary to derive the path-wise gradient estimator. We develop a differentiable point process, which is the technical highlight of this paper, and apply it to derive the path-wise gradient estimator for SNNs. We investigate the effectiveness of our gradient estimator through numerical simulation.
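The variance gap between the two estimators is easy to see on a toy problem. The sketch below does not involve point processes or SNNs; it assumes a unit-variance Gaussian x ~ N(mu, 1) and the objective E[x^2], whose true gradient with respect to mu is 2 * mu. The function name `gradient_samples` and all parameter values are illustrative.

```python
import random
from statistics import mean, variance

def gradient_samples(mu, n, seed=0):
    """Per-sample estimates of d/dmu E[x^2] for x ~ N(mu, 1).

    Returns (score_function_samples, path_wise_samples).
    """
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = mu + eps  # reparameterized sample
        # Score function: f(x) * d/dmu log N(x; mu, 1) = x^2 * (x - mu)
        score.append(x * x * (x - mu))
        # Path-wise: df/dx * dx/dmu = 2x * 1
        pathwise.append(2.0 * x)
    return score, pathwise

score, pathwise = gradient_samples(mu=1.0, n=20000)
score_mean, pathwise_mean = mean(score), mean(pathwise)
score_var, pathwise_var = variance(score), variance(pathwise)
```

Both estimators are unbiased for the true gradient (2.0 here), but the score function samples are far noisier. The path-wise estimator needs the sample to be a differentiable function of the parameter, which is exactly what is hard to obtain for a point process and what the paper's differentiable point process is designed to provide.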



[LG] How Attentive are Graph Attention Networks?

S Brody, U Alon, E Yahav

[CV] Fourier Space Losses for Efficient Perceptual Image Super-Resolution

D Fuoli, L V Gool, R Timofte
[ETH Zurich]

[CL] Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling

C Wu, F Wu, T Qi, Y Huang
[Tsinghua University & Microsoft Research Asia]

[CL] multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning

S Saha, P Yadav, M Bansal
[UNC Chapel Hill]