LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

1. [CV] RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

X Ding, X Zhang, J Han, G Ding
[Tsinghua University & MEGVII Technology & Aberystwyth University]

We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural networks with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition). The code and models are available at this https URL.
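
The abstract does not spell out how a convolution is merged into an FC layer, so the following is only a minimal sketch of why such a merge is possible, not the authors' implementation. It assumes a single channel, a 3x3 kernel, stride 1, and zero padding, and it ignores batch normalization, groups, and multiple channels. Because convolution is linear, its response to each one-hot image gives one column of an equivalent FC matrix, and a parallel FC branch plus a conv branch then collapse into one FC by adding the two matrices. All function names here are hypothetical.

```python
import random


def conv2d_same(img, kernel):
    """Stride-1, zero-padded cross-correlation of an HxW image with a KxK kernel (K odd)."""
    h, w, k = len(img), len(img[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[di][dj] * img[y][x]
    return out


def flatten(img):
    """Row-major flattening of a 2D list."""
    return [v for row in img for v in row]


def conv_to_fc(kernel, h, w):
    """Build the (h*w) x (h*w) matrix M with M @ flatten(x) == flatten(conv(x))."""
    n = h * w
    matrix = [[0.0] * n for _ in range(n)]
    for col in range(n):
        # Column `col` of M is the conv response to the col-th one-hot image.
        basis = [[0.0] * w for _ in range(h)]
        basis[col // w][col % w] = 1.0
        response = flatten(conv2d_same(basis, kernel))
        for row in range(n):
            matrix[row][col] = response[row]
    return matrix


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def max_merge_error(seed=0, h=4, w=5):
    """Compare FC(x) + conv(x) (training form) with a single merged FC (inference form)."""
    rng = random.Random(seed)
    n = h * w
    x = [[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)]
    fc = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    kernel = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

    # Training-time form: two parallel branches whose outputs are summed.
    two_branch = [a + b for a, b in
                  zip(matvec(fc, flatten(x)), flatten(conv2d_same(x, kernel)))]

    # Inference-time form: add the conv's equivalent matrix into the FC weights.
    conv_fc = conv_to_fc(kernel, h, w)
    merged = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fc, conv_fc)]
    one_branch = matvec(merged, flatten(x))
    return max(abs(a - b) for a, b in zip(two_branch, one_branch))
```

Calling `max_merge_error()` returns the largest absolute difference between the two forms, which should be at floating-point noise level. The merged layer needs no convolution at inference time, which is the point of the re-parameterization described above.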


2. [AI] Generative Art Using Neural Visual Grammars and Dual Encoders

C Fernando, S. M. A Eslami, J Alayrac, P Mirowski, D Banarse, S Osindero

Whilst there are perhaps only a few scientific methods, there seem to be almost as many artistic methods as there are artists. Artistic processes appear to inhabit the highest order of open-endedness. To begin to understand some of the processes of art making it is helpful to try to automate them even partially. In this paper, a novel algorithm for producing generative art is described which allows a user to input a text string, and which in a creative response to this string, outputs an image which interprets that string. It does so by evolving images using a hierarchical neural Lindenmayer system, and evaluating these images along the way using an image-text dual encoder trained on billions of images and their associated text from the internet. In doing so we have access to and control over an instance of an artistic process, allowing analysis of which aspects of the artistic process become the task of the algorithm, and which elements remain the responsibility of the artist.
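
For readers unfamiliar with Lindenmayer systems, the sketch below shows the classic, non-neural form: a set of rewriting rules applied to every symbol of a string in parallel. It is only background for the term. The paper's system is described as hierarchical and neural, and its candidate images are scored by a dual encoder, none of which is reproduced here. The `expand` function and the `ALGAE_RULES` name are illustrative choices; the rule set is Lindenmayer's well-known algae example.

```python
def expand(axiom, rules, steps):
    """Rewrite every symbol in parallel `steps` times; symbols without a rule are copied."""
    current = axiom
    for _ in range(steps):
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Lindenmayer's algae system: A -> AB, B -> A.
ALGAE_RULES = {"A": "AB", "B": "A"}

if __name__ == "__main__":
    for step in range(5):
        print(step, expand("A", ALGAE_RULES, step))
```

In a generative-art setting, the symbols would typically be interpreted as drawing commands, and an evolutionary loop would mutate the rules and keep the variants that score best under some fitness function.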


3. [CL] Semantic Journeys: Quantifying Change in Emoji Meaning from 2012-2018

A Robertson, F F Liza, D Nguyen, B McGillivray, S A. Hale
[University of Edinburgh & University of Essex & Utrecht University & University of Cambridge & University of Oxford]

The semantics of emoji has, to date, been considered from a static perspective. We offer the first longitudinal study of how emoji semantics changes over time, applying techniques from computational linguistics to six years of Twitter data. We identify five patterns in emoji semantic development and find evidence that the less abstract an emoji is, the more likely it is to undergo semantic change. In addition, we analyse select emoji in more detail, examining the effect of seasonality and world events on emoji semantics. To aid future work on emoji and semantics, we make our data publicly available along with a web-based interface that anyone can use to explore semantic change in emoji.


4. [CV] Self-Supervised Multi-Frame Monocular Scene Flow

J Hur, S Roth
[TU Darmstadt]
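Self-supervised multi-frame monocular scene flow. Estimating 3D scene flow from monocular image sequences is attracting growing attention because the capture setup is simple and economical. Because the problem is severely ill-posed, the accuracy of current methods is limited, especially for efficient real-time methods. This paper proposes a multi-frame monocular scene flow network based on self-supervised learning that improves accuracy over previous networks while retaining real-time efficiency. Building on an advanced two-frame baseline with a split-decoder design, it proposes (i) a multi-frame model that uses a three-frame input and convolutional LSTM connections, (ii) an occlusion-aware census loss to improve accuracy, and (iii) a gradient detaching strategy to improve training stability. On the KITTI dataset, it reaches state-of-the-art accuracy among self-supervised monocular scene flow methods.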

Estimating 3D scene flow from a sequence of monocular images has been gaining increased attention due to the simple, economical capture setup. Owing to the severe ill-posedness of the problem, the accuracy of current methods has been limited, especially that of efficient, real-time approaches. In this paper, we introduce a multi-frame monocular scene flow network based on self-supervised learning, improving the accuracy over previous networks while retaining real-time efficiency. Based on an advanced two-frame baseline with a split-decoder design, we propose (i) a multi-frame model using a triple frame input and convolutional LSTM connections, (ii) an occlusion-aware census loss for better accuracy, and (iii) a gradient detaching strategy to improve training stability. On the KITTI dataset, we observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
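
The abstract names an occlusion-aware census loss without defining it. As a rough illustration of the general idea, and not the paper's formulation, the sketch below uses the hard census transform: each pixel is encoded by whether each neighbor is brighter than it, two images are compared by the Hamming distance between codes, and pixels marked as occluded are left out of the average. Training code would normally use a soft, differentiable variant; the function names, the border clamping, and the `visible` mask layout are all assumptions made for this example.

```python
def census_transform(img, radius=1):
    """Encode each pixel as a tuple of bits: 1 where a neighbor is brighter than the center."""
    h, w = len(img), len(img[0])
    codes = []
    for i in range(h):
        row = []
        for j in range(w):
            bits = []
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    if di == 0 and dj == 0:
                        continue
                    # Clamp coordinates so border pixels reuse the nearest valid pixel.
                    y = min(max(i + di, 0), h - 1)
                    x = min(max(j + dj, 0), w - 1)
                    bits.append(1 if img[y][x] > img[i][j] else 0)
            row.append(tuple(bits))
        codes.append(row)
    return codes


def hamming(a, b):
    """Number of positions where two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def masked_census_loss(img_a, img_b, visible, radius=1):
    """Mean Hamming distance between census codes over pixels where visible[i][j] is True."""
    codes_a = census_transform(img_a, radius)
    codes_b = census_transform(img_b, radius)
    total, count = 0, 0
    for i, mask_row in enumerate(visible):
        for j, is_visible in enumerate(mask_row):
            if is_visible:
                total += hamming(codes_a[i][j], codes_b[i][j])
                count += 1
    return total / count if count else 0.0
```

Because the codes depend only on brightness ordering, adding a constant to every pixel leaves the loss at zero, which is one reason census-based losses are commonly preferred over raw intensity differences under changing illumination.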


5. [CV] Texture for Colors: Natural Representations of Colors Using Variable Bit-Depth Textures

S Baluja

Numerous methods have been proposed to transform color and grayscale images to their single bit-per-pixel binary counterparts. Commonly, the goal is to enhance specific attributes of the original image to make it more amenable for analysis. However, when the resulting binarized image is intended for human viewing, aesthetics must also be considered. Binarization techniques, such as half-toning, stippling, and hatching, have been widely used for modeling the original image’s intensity profile. We present an automated method to transform an image to a set of binary textures that represent not only the intensities, but also the colors of the original. The foundation of our method is information preservation: creating a set of textures that allows for the reconstruction of the original image’s colors solely from the binarized representation. We present techniques to ensure that the textures created are not visually distracting, preserve the intensity profile of the images, and are natural in that they map sets of colors that are perceptually similar to patterns that are similar. The approach uses deep neural networks and is entirely self-supervised; no examples of good vs. bad binarizations are required. The system yields aesthetically pleasing binary images when tested on a variety of image sources.



[AI] Foundations of Intelligence in Natural and Artificial Systems: A Workshop Report

T Millhouse, M Moses, M Mitchell

[CV] Visual Composite Set Detection Using Part-and-Sum Transformers

Q Dong, Z Tu, H Liao, Y Zhang, V Mahadevan, S Soatto
[Amazon Web Services]

[AS] Self-Supervised Learning from Automatically Separated Sound Scenes

E Fonseca, A Jansen, D P. W. Ellis, S Wisdom, M Tagliasacchi, J R. Hershey, M Plakal, S Hershey, R. C Moore, X Serra
[Universitat Pompeu Fabra & Google Research]

[CV] Real-time Deep Dynamic Characters

M Habermann, L Liu, W Xu, M Zollhoefer, G Pons-Moll, C Theobalt
[Max Planck Institute for Informatics & Facebook Reality Labs]