LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

1、[LG] Relative Positional Encoding for Transformers with Linear Complexity

A Liutkus, O Cífka, S Wu, U Şimşekli, Y Yang, G Richard
[Inria & Telecom Paris & Research Center for IT Innovation]

Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
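To make the obstacle described in this abstract concrete, here is a minimal sketch of why linear-complexity attention never forms the N x N attention matrix that classical RPE would modify. It is not the paper's Stochastic Positional Encoding. The feature map `phi` (elu(x) + 1) and the function names are assumptions chosen for illustration, and the attention is non-causal.

```python
import math


def phi(vec):
    # Assumed positive feature map: elu(x) + 1 (x + 1 for x > 0, exp(x) otherwise).
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_quadratic(Q, K, V):
    """Kernelized attention that builds every row of the N x N matrix."""
    out = []
    for q in Q:
        fq = phi(q)
        # One explicit row of the attention matrix: this is where a
        # relative positional term would normally be injected.
        weights = [dot(fq, phi(k)) for k in K]
        z = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / z
                    for d in range(len(V[0]))])
    return out


def attention_linear(Q, K, V):
    """Same output, computed as phi(Q) (phi(K)^T V) without any N x N matrix."""
    m, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(m)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * m                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(m):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][d] for a in range(m)) / norm
                    for d in range(dv)])
    return out
```

Both functions return the same values up to floating-point error, but only the first one exposes per-pair weights that a lag-dependent term could be applied to. That is the gap the abstract says the paper bridges.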


2、[CL] BookSum: A Collection of Datasets for Long-form Narrative Summarization

W Kryściński, N Rajani, D Agarwal, C Xiong, D Radev
[Salesforce Research]

The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BOOKSUM, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset pose a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.


3、[CV] Finding an Unsupervised Image Segmenter in Each of Your Deep Generative Models

L Melas-Kyriazi, C Rupprecht, I Laina, A Vedaldi
[University of Oxford]

Recent research has shown that numerous human-interpretable directions exist in the latent space of GANs. In this paper, we develop an automatic procedure for finding directions that lead to foreground-background image separation, and we use these directions to train an image segmentation model without human supervision. Our method is generator-agnostic, producing strong segmentation results with a wide range of different GAN architectures. Furthermore, by leveraging GANs pretrained on large datasets such as ImageNet, we are able to segment images from a range of domains without further training or finetuning. Evaluating our method on image segmentation benchmarks, we compare favorably to prior work while using neither human supervision nor access to the training data. Broadly, our results demonstrate that automatically extracting foreground-background structure from pretrained deep generative models can serve as a remarkably effective substitute for human supervision.


4、[LG] Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics

V Jayaram, J Thickstun
[University of Washington]

This paper introduces an alternative approach to sampling from autoregressive models. Autoregressive models are typically sampled sequentially, according to the transition dynamics defined by the model. Instead, we propose a sampling procedure that initializes a sequence with white noise and follows a Markov chain defined by Langevin dynamics on the global log-likelihood of the sequence. This approach parallelizes the sampling process and generalizes to conditional sampling. Using an autoregressive model as a Bayesian prior, we can steer the output of a generative model using a conditional likelihood or constraints. We apply these techniques to autoregressive models in the visual and audio domains, with competitive results for audio source separation, super-resolution, and inpainting.
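The sampling idea in this abstract can be sketched on a toy model. The example below assumes a Gaussian AR(1) process with hypothetical parameters `a` and `sigma`, and runs plain unadjusted Langevin dynamics on the global log-likelihood of the whole sequence. It is an illustration only: the paper's actual models, noise schedule, and step sizes are not reproduced here.

```python
import math
import random


def log_likelihood(x, a=0.9, sigma=1.0):
    """Global log-likelihood (up to a constant) of a toy Gaussian AR(1) model."""
    s0_sq = sigma ** 2 / (1 - a ** 2)  # stationary variance used for x[0]
    ll = -x[0] ** 2 / (2 * s0_sq)
    for t in range(1, len(x)):
        ll -= (x[t] - a * x[t - 1]) ** 2 / (2 * sigma ** 2)
    return ll


def grad_log_likelihood(x, a=0.9, sigma=1.0):
    """Gradient of log_likelihood with respect to every position of x."""
    s0_sq = sigma ** 2 / (1 - a ** 2)
    grad = [0.0] * len(x)
    grad[0] = -x[0] / s0_sq
    for t in range(1, len(x)):
        r = (x[t] - a * x[t - 1]) / sigma ** 2  # scaled residual of step t
        grad[t] -= r          # term from x[t] given x[t-1]
        grad[t - 1] += a * r  # the same term also depends on x[t-1]
    return grad


def langevin_sample(n, steps=2000, eta=0.01, a=0.9, sigma=1.0, seed=0):
    """Start from white noise and update all positions at every step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_scale = math.sqrt(2 * eta)
    for _ in range(steps):
        g = grad_log_likelihood(x, a, sigma)
        # Every position moves using only the current state, so the
        # per-position updates are independent and could run in parallel.
        x = [xi + eta * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x
```

Calling `langevin_sample(32)` returns a length-32 sequence. Unlike ancestral sampling, no position has to wait for the ones before it to be finalized, which is the property the abstract refers to as parallelizing the sampling process.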


5、[CV] Finding a Needle in a Haystack: Tiny Flying Object Detection in 4K Videos using a Joint Detection-and-Tracking Approach

R Yoshihashi, R Kawakami, S You, T T Trinh, M Iida, T Naemura
[The University of Tokyo]

Detecting tiny objects in a high-resolution video is challenging because the visual information is scarce and unreliable. Specifically, the challenges include the very low resolution of the objects, MPEG artifacts due to compression, and a large search area with many hard negatives. Tracking is equally difficult because of the unreliable appearance and the unreliable motion estimation. Luckily, we found that combining these two challenging tasks yields mutual benefits. Following this idea, in this paper, we present a neural network model called the Recurrent Correlational Network, where detection and tracking are jointly performed over a multi-frame representation learned through a single, trainable, and end-to-end network. The framework exploits a convolutional long short-term memory network for learning informative appearance changes for detection, while the learned representation is shared in tracking for enhancing its performance. In experiments with datasets containing images of scenes with small flying objects, such as birds and unmanned aerial vehicles, the proposed method yielded consistent improvements in detection performance over deep single-frame detectors and existing motion-based detectors. Furthermore, our network performs as well as state-of-the-art generic object trackers when evaluated as a tracker on a bird image dataset.



[CV] Exemplar-Based Open-Set Panoptic Segmentation Network

J Hwang, S W Oh, J Lee, B Han
[Seoul National University & Adobe Research]

[AI] Coach-Player Multi-Agent Reinforcement Learning for Dynamic Team Composition

B Liu, Q Liu, P Stone, A Garg, Y Zhu, A Anandkumar
[University of Texas at Austin & University of Toronto & Nvidia]

[LG] Fast and Slow Learning of Recurrent Independent Mechanisms

K Madan, R N Ke, A Goyal, B B Schölkopf, Y Bengio
[University of Montreal & Polytechnique Montreal]

[CV] Image Cropping on Twitter: Fairness Metrics, their Limitations, and the Importance of Representation, Design, and Agency

K Yee, U Tantipongpipat, S Mishra