LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics (* marks items worth particular attention)

1、[CV] GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds

Z Hao, A Mallya, S Belongie, M Liu
[NVIDIA & Cornell University]
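GANcraft: unsupervised 3D neural rendering of Minecraft worlds. Imagine how nice it would be if every Minecraft player could become a 3D painter: build a simple 3D world from blocks representing different materials, and let an algorithm turn it into a fairly realistic 3D world with tall green trees, snow-capped mountains, and a blue sea. This paper proposes an unsupervised neural rendering framework for generating photorealistic images of large 3D block worlds (such as those built in Minecraft). It performs world-to-world translation, which is essentially a 3D extension of image-to-image translation. The method takes a semantic block world as input, where each block carries a semantic label such as dirt, grass, or water, represents the world as a continuous volumetric function, and trains a model to render view-consistent, photorealistic images for a user-controlled camera. Because block worlds have no paired ground-truth real images, the authors design a training technique based on pseudo-ground-truth images and an adversarial loss. The pseudo-ground-truth images, generated by a 2D image-to-image translation network, provide effective supervision. Unlike prior neural rendering work on view synthesis, the method does not need ground-truth images to estimate scene geometry and view-dependent appearance. Besides the camera trajectory, GANcraft also lets users control scene semantics and output style. Experiments against strong baselines show GANcraft's effectiveness on this novel task of photorealistic 3D block world synthesis.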

We present GANcraft, an unsupervised neural rendering framework for generating photorealistic images of large 3D block worlds such as those created in Minecraft. Our method takes a semantic block world as input, where each block is assigned a semantic label such as dirt, grass, or water. We represent the world as a continuous volumetric function and train our model to render view-consistent photorealistic images for a user-controlled camera. In the absence of paired ground truth real images for the block world, we devise a training technique based on pseudo-ground truth and adversarial training. This stands in contrast to prior work on neural rendering for view synthesis, which requires ground truth images to estimate scene geometry and view-dependent appearance. In addition to camera trajectory, GANcraft allows user control over both scene semantics and output style. Experimental results with comparison to strong baselines show the effectiveness of GANcraft on this novel task of photorealistic 3D block world synthesis. The project website is available at this https URL.
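To make "a continuous volumetric function rendered for a camera" more concrete, here is a minimal sketch of front-to-back alpha compositing along a single camera ray. It assumes the standard NeRF-style volume rendering rule, which the abstract does not spell out, so it should not be read as GANcraft's actual renderer. The function name `composite_ray` and the fixed per-sample colors are hypothetical; in a learned model, densities and colors would come from a network.

```python
import math


def composite_ray(densities, colors, step):
    """Alpha-composite samples along one ray, front to back.

    Assumption: standard volume rendering with uniform sample spacing.
    densities: non-negative volume density at each sample
    colors:    (r, g, b) tuple at each sample
    step:      distance between consecutive samples
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb in zip(densities, colors):
        # Opacity contributed by this sample.
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        # Light remaining for the samples behind this one.
        transmittance *= 1.0 - alpha
    return tuple(pixel), transmittance


# Hypothetical ray: empty air, then a semi-transparent "water" sample,
# then a dense "dirt" sample.
pixel, remaining = composite_ray(
    densities=[0.0, 0.5, 50.0],
    colors=[(0.0, 0.0, 0.0), (0.1, 0.3, 0.8), (0.4, 0.3, 0.2)],
    step=1.0,
)
```

Samples with zero density contribute nothing, and a sufficiently dense sample hides everything behind it, which is what keeps renderings consistent as the camera moves.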

[Figures 1-5, 爱可可 AI Frontier Picks (4.17)]

2、[IR] Deep Learning-based Online Alternative Product Recommendations at Scale

M Guo, N Yan, X Cui, S H Wu, U Ahsan, R West, K A Jadda
[The Home Depot]

Alternative recommender systems are critical for ecommerce companies. They guide customers to explore a massive product catalog and assist customers to find the right products among an overwhelming number of options. However, it is a non-trivial task to recommend alternative products that fit customer needs. In this paper, we use both textual product information (e.g. product titles and descriptions) and customer behavior data to recommend alternative products. Our results show that the coverage of alternative products is significantly improved in offline evaluations as well as recall and precision. The final A/B test shows that our algorithm increases the conversion rate by 12 percent in a statistically significant way. In order to better capture the semantic meaning of product information, we build a Siamese Network with Bidirectional LSTM to learn product embeddings. In order to learn a similarity space that better matches the preference of real customers, we use co-compared data from historical customer behavior as labels to train the network. In addition, we use NMSLIB to accelerate the computationally expensive kNN computation for millions of products so that the alternative recommendation is able to scale across the entire catalog of a major ecommerce site.
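The retrieval step described above can be sketched as a nearest-neighbor lookup over product embeddings. The sketch below uses exact brute-force cosine similarity in plain Python as a stand-in; the paper uses NMSLIB to speed up this search at catalog scale, and the embeddings would come from the trained Siamese BiLSTM. The product IDs, vectors, and function names here are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_alternatives(query_id, embeddings, k):
    """Return the k products most similar to query_id (exact search).

    Assumption: brute force replaces the approximate kNN index that
    would be needed for millions of products.
    """
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id  # a product is not its own alternative
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


# Hypothetical 3-dimensional embeddings for a tiny catalog.
catalog = {
    "hammer_a": [0.9, 0.1, 0.0],
    "hammer_b": [0.8, 0.2, 0.1],
    "drill_a": [0.1, 0.9, 0.2],
    "paint_a": [0.0, 0.1, 0.9],
}
alternatives = top_k_alternatives("hammer_a", catalog, k=2)
```

Exact search costs one comparison per catalog item per query, which is why an approximate index becomes attractive once the catalog reaches millions of products.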

3、[CV] Geometry-Free View Synthesis: Transformers and no 3D Priors

R Rombach, P Esser, B Ommer
[Heidelberg University]

Is a geometric model required to synthesize novel views from a single image? Being bound to local convolutions, CNNs need explicit 3D biases to model geometric transformations. In contrast, we demonstrate that a transformer-based model can synthesize entirely novel views without any hand-engineered 3D biases. This is achieved by (i) a global attention mechanism for implicitly learning long-range 3D correspondences between source and target views, and (ii) a probabilistic formulation necessary to capture the ambiguity inherent in predicting novel views from a single image, thereby overcoming the limitations of previous approaches that are restricted to relatively small viewpoint changes. We evaluate various ways to integrate 3D priors into a transformer architecture. However, our experiments show that no such geometric priors are required and that the transformer is capable of implicitly learning 3D relationships between images. Furthermore, this approach outperforms the state of the art in terms of visual quality while covering the full distribution of possible realizations. Code is available at this https URL.
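The "global attention mechanism" in point (i) refers to the generic transformer operation in which every query position can weight every key position, regardless of spatial distance. A minimal sketch of scaled dot-product attention in plain Python follows. It illustrates the generic operation only, not the paper's architecture, and the toy vectors are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors.

    Each query attends over all keys at once, so a target-view
    position could, in principle, draw on any source-view position.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example: one query, two keys, one-dimensional values.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0], [2.0]],
)
```

When both keys match the query equally, the weights are uniform and the output is the mean of the values. A convolution, by contrast, can only mix positions inside its fixed local window.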

4、[CL] Retrieval Augmentation Reduces Hallucination in Conversation

K Shuster, S Poff, M Chen, D Kiela, J Weston
[Facebook AI Research]

Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2020). In this work we explore the use of neural-retrieval-in-the-loop architectures - recently shown to be effective in open-domain QA (Lewis et al., 2020b; Izacard and Grave, 2020) - for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses. We study various types of architectures with multiple components - retrievers, rankers, and encoder-decoders - with the goal of maximizing knowledgeability while retaining conversational ability. We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots.
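The retrieval-in-the-loop idea can be illustrated with a toy pipeline: build a query from the multi-turn dialogue context, pick the best-matching passages, and prepend them to the generator input. The sketch below uses word overlap as a stand-in for a neural retriever and stops before the encoder-decoder step. All function names and the `[knowledge]` / `[turn]` markers are hypothetical and are not taken from the paper.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def retrieve(context_turns, documents, k):
    """Rank documents by word overlap with the whole dialogue context.

    Assumption: word overlap replaces the learned neural retriever.
    """
    query = tokenize(" ".join(context_turns))
    ranked = sorted(
        documents,
        key=lambda doc: len(query & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_generator_input(context_turns, passages):
    """Concatenate retrieved passages and dialogue turns into one string."""
    knowledge = " ".join(f"[knowledge] {p}" for p in passages)
    dialogue = " ".join(f"[turn] {t}" for t in context_turns)
    return f"{knowledge} {dialogue}"


documents = [
    "the eiffel tower is in paris",
    "python is a programming language",
    "paris is the capital of france",
]
context = ["tell me about paris", "where is the eiffel tower"]
passages = retrieve(context, documents, k=2)
generator_input = build_generator_input(context, passages)
```

Querying with the full context rather than only the last turn reflects the point made above that dialogue retrieval has to cope with multi-turn context. Grounding the generator input in retrieved text is what gives the model a chance to avoid making facts up.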

5、[CL] Generating Datasets with Pretrained Language Models

T Schick, H Schütze
[LMU Munich]

To obtain high-quality sentence embeddings from pretrained language models, they must either be augmented with additional pretraining objectives or finetuned on large amounts of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how large pretrained language models can be leveraged to obtain high-quality embeddings without requiring any labeled data, finetuning or modifications to their pretraining objective: We utilize their generative abilities to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models. Our fully unsupervised approach outperforms strong baselines on several English semantic textual similarity datasets.


[CV] Self-supervised Video Object Segmentation by Motion Grouping

C Yang, H Lamdouar, E Lu, A Zisserman, W Xie
[University of Oxford]

[CV] Image Super-Resolution via Iterative Refinement

C Saharia, J Ho, W Chan, T Salimans, D J. Fleet, M Norouzi
[Google Research]

[RO] Auto-Tuned Sim-to-Real Transfer

Y Du, O Watkins, T Darrell, P Abbeel, D Pathak
[UC Berkeley & CMU]

[LG] Self-Supervised Exploration via Latent Bayesian Surprise

P Mazzaglia, O Catal, T Verbelen, B Dhoedt
[Ghent University]