Abstract

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance.

In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores.

We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models.

Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS.

Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training.

We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and it achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories.

We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service.

Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.

1 Introduction

Deep learning and unsupervised feature learning have shown great promise in many practical applications.

State-of-the-art performance has been reported in several domains, ranging from speech recognition [1, 2], visual object recognition [3, 4], to text processing [5, 6].

It has also been observed that increasing the scale of deep learning, with respect to the number of training examples, the number of model parameters, or both, can drastically improve ultimate classification accuracy [3, 4, 7].

These results have led to a surge of interest in scaling up the training and inference algorithms used for these models [8] and in improving applicable optimization procedures [7, 9].

The use of GPUs [1, 2, 3, 8] is a significant advance in recent years that makes the training of modestly sized deep networks practical.

A known limitation of the GPU approach is that the training speed-up is small when the model does not fit in GPU memory (typically less than 6 gigabytes).

To use a GPU effectively, researchers often reduce the size of the data or parameters so that CPU-to-GPU transfers are not a significant bottleneck.

While data and parameter reduction work well for small problems (e.g. acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images).

In this paper, we describe an alternative approach: using large-scale clusters of machines to distribute training and inference in deep networks.

We have developed a software framework called DistBelief that enables model parallelism within a machine (via multithreading) and across machines (via message passing), with the details of parallelism, synchronization and communication managed by the framework.

In addition to supporting model parallelism, the DistBelief framework also supports data parallelism, where multiple replicas of a model are used to optimize a single objective.

Within this framework, we have designed and implemented two novel methods for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure which leverages adaptive learning rates and supports a large number of model replicas, and (ii) Sandblaster L-BFGS, a distributed implementation of L-BFGS that uses both data and model parallelism.

Both Downpour SGD and Sandblaster L-BFGS enjoy significant speed gains compared to more conventional implementations of SGD and L-BFGS.

Our experiments reveal several surprising results about large-scale nonconvex optimization.

Firstly, asynchronous SGD, rarely applied to nonconvex problems, works very well for training deep networks, particularly when combined with Adagrad [10] adaptive learning rates.

Secondly, we show that given sufficient resources, L-BFGS is competitive with or faster than many variants of SGD.

With regard to specific applications in deep learning, we report two main findings: that our distributed optimization approach can both greatly accelerate the training of modestly sized models, and that it can also train models that are larger than could be contemplated otherwise.

To illustrate the first point, we show that we can use a cluster of machines to train a modestly sized speech model to the same classification accuracy in less than 1/10th the time required on a GPU.

To illustrate the second point, we trained a large neural network of more than 1 billion parameters and used this network to drastically improve on state-of-the-art performance on the ImageNet dataset, one of the largest datasets in computer vision.

2 Previous work

In recent years commercial and academic machine learning data sets have grown at an unprecedented pace.

In response, a great many authors have explored scaling up machine learning algorithms through parallelization and distribution [11, 12, 13, 14, 15, 16, 17].

Much of this research has focused on linear, convex models, where distributed gradient computation is the natural first step.

Within this area, some groups have relaxed synchronization requirements, exploring delayed gradient updates for convex problems [12, 17].

In parallel, other groups working on problems with sparse gradients (problems where only a tiny fraction of the coordinates of the gradient vector are non-zero for any given training example) have explored lock-less asynchronous stochastic gradient descent on shared-memory architectures (i.e. single machines) [5, 18].

We are interested in an approach that captures the best of both worlds, allowing the use of a cluster of machines asynchronously computing gradients, but without requiring that the problem be either convex or sparse.

In the context of deep learning, most work has focused on training relatively small models on a single machine (e.g., Theano [19]).

Suggestions for scaling up deep learning include the use of a farm of GPUs to train a collection of many small models and subsequently averaging their predictions [20], or modifying standard deep networks to make them inherently more parallelizable [21].

Our focus is scaling deep learning techniques in the direction of training very large models, those with a few billion parameters, but without introducing restrictions on the form of the model.

In special cases where one layer dominates computation, some authors have considered distributing computation in that one layer and replicating computation in the remaining layers [5].

But in the general case where many layers of the model are computationally intensive, full model parallelism in a spirit similar to [22] is required.

To be successful, however, we believe that model parallelism must be combined with clever distributed optimization techniques that leverage data parallelism.

We considered a number of existing large-scale computational tools for application to our problem, MapReduce [23] and GraphLab [24] being notable examples.

We concluded that MapReduce, designed for parallel data processing, was ill-suited for the iterative computations inherent in deep network training; whereas GraphLab, designed for general (unstructured) graph computations, would not exploit computing efficiencies available in the structured graphs typically found in deep networks.

3 Model parallelism

To facilitate the training of very large deep networks, we have developed a software framework, DistBelief, that supports distributed computation in neural networks and layered graphical models.

The user defines the computation that takes place at each node in each layer of the model, and the messages that should be passed during the upward and downward phases of computation.

Figure 1: An example of model parallelism in DistBelief. A five layer deep neural network with local connectivity is shown here, partitioned across four machines (blue rectangles). Only those nodes with edges that cross partition boundaries (thick lines) will need to have their state transmitted between machines. Even in cases where a node has multiple edges crossing a partition boundary, its state is only sent to the machine on the other side of that boundary once. Within each partition, computation for individual nodes will be parallelized across all available CPU cores.

For large models, the user may partition the model across several machines (Figure 1), so that responsibility for the computation for different nodes is assigned to different machines.

The framework automatically parallelizes computation in each machine using all available cores, and manages communication, synchronization and data transfer between machines during both training and inference.
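The communication rule illustrated in Figure 1 can be sketched in a few lines. The function names and the graph representation here are ours, purely for illustration, not DistBelief's actual API:

```python
# Sketch of the Figure 1 rule: only edges whose endpoints live on different
# machines require node state to be sent over the network; everything else
# stays machine-local. `edges` and `machine_of` are hypothetical stand-ins
# for the framework's internal graph representation.
def cross_partition_edges(edges, machine_of):
    """edges: iterable of (src, dst) node pairs; machine_of: node -> machine id."""
    return [(s, d) for s, d in edges if machine_of[s] != machine_of[d]]

def local_edges(edges, machine_of):
    """Edges within a single machine that need no communication."""
    return [(s, d) for s, d in edges if machine_of[s] == machine_of[d]]
```

As the caption notes, a node with several edges crossing the same boundary would still send its state only once; a real implementation would deduplicate transfers by (node, target machine).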

The performance benefits of distributing a deep network across multiple machines depends on the connectivity structure and computational needs of the model.

Models with a large number of parameters or high computational demands typically benefit from access to more CPUs and memory, up to the point where communication costs dominate.

We have successfully run large models with up to 144 partitions in the DistBelief framework with significant speedups, while more modestly sized models show decent speedups for up to 8 or 16 partitions. (See Section 5, under the heading Model Parallelism Benchmarks, for experimental results.)

Obviously, models with local connectivity structures tend to be more amenable to extensive distribution than fully-connected structures, given their lower communication requirements.

The typical cause of less-than-ideal speedups is variance in processing times across the different machines, leading to many machines waiting for the single slowest machine to finish a given phase of computation.

Nonetheless, for our largest models, we can efficiently use 32 machines where each machine achieves an average CPU utilization of 16 cores, for a total of 512 CPU cores training a single large neural network.

When combined with the distributed optimization algorithms described in the next section, which utilize multiple replicas of the entire neural network, it is possible to use tens of thousands of CPU cores for training a single model, leading to significant reductions in overall training times.

4 Distributed optimization algorithms

Parallelizing computation within the DistBelief framework allows us to instantiate and run neural networks considerably larger than have been previously reported.

But in order to train such large models in a reasonable amount of time, we need to parallelize computation not only within a single instance of the model, but to distribute training across multiple model instances.

In this section we describe this second level of parallelism, where we employ a set of DistBelief model instances, or replicas, to simultaneously solve a single optimization problem.

We present a comparison of two large-scale distributed optimization procedures: Downpour SGD, an online method, and Sandblaster L-BFGS, a batch method.

Both methods leverage the concept of a centralized sharded parameter server, which model replicas use to share their parameters.

Both methods take advantage of the distributed computation DistBelief allows within each individual replica.

But most importantly, both methods are designed to tolerate variance in the processing speed of different model replicas, and even the wholesale failure of model replicas which may be taken offline or restarted at random.

In a sense, these two optimization algorithms implement an intelligent version of data parallelism.

Both approaches allow us to simultaneously process distinct training examples in each of the many model replicas, and periodically combine their results to optimize our objective function.

4.1 Downpour SGD

Stochastic gradient descent (SGD) is perhaps the most commonly used optimization procedure for training deep neural networks [25, 26, 3].

Unfortunately, the traditional formulation of SGD is inherently sequential, making it impractical to apply to very large data sets where the time required to move through the data in an entirely serial fashion is prohibitive.

To apply SGD to large data sets, we introduce Downpour SGD, a variant of asynchronous stochastic gradient descent that uses multiple replicas of a single DistBelief model.

The basic approach is as follows: We divide the training data into a number of subsets and run a copy of the model on each of these subsets.

The models communicate updates through a centralized parameter server, which keeps the current state of all parameters for the model, sharded across many machines (e.g., if we have 10 parameter server shards, each shard is responsible for storing and applying updates to 1/10th of the model parameters) (Figure 2).
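A rough sketch of this sharding scheme follows; the class and method names are ours, and the real parameter server is a distributed service rather than a single in-process object:

```python
import numpy as np

class ShardedParameterServer:
    """Toy model of a sharded parameter server: a parameter vector split
    evenly across shards, each shard storing and updating only its slice."""
    def __init__(self, dim, num_shards=10, lr=0.01):
        self.lr = lr
        # With 10 shards, each holds ~1/10th of the model parameters.
        self.shards = np.array_split(np.zeros(dim), num_shards)

    def fetch(self, shard_id):
        # A model replica fetches only the slices relevant to its partition.
        return self.shards[shard_id].copy()

    def apply_gradient(self, shard_id, grad_slice):
        # Each shard applies gradients independently (asynchronously in practice).
        self.shards[shard_id] -= self.lr * grad_slice
```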

This approach is asynchronous in two distinct aspects: the model replicas run independently of each other, and the parameter server shards also run independently of one another.

In the simplest implementation, before processing each mini-batch, a model replica asks the parameter server service for an updated copy of its model parameters.

Because DistBelief models are themselves partitioned across multiple machines, each machine needs to communicate with just the subset of parameter server shards that hold the model parameters relevant to its partition.

After receiving an updated copy of its parameters, the DistBelief model replica processes a mini-batch of data to compute a parameter gradient, and sends the gradient to the parameter server, which then applies the gradient to the current value of the model parameters.

It is possible to reduce the communication overhead of Downpour SGD by limiting each model replica to request updated parameters only every nfetch steps and send updated gradient values only every npush steps (where nfetch might not be equal to npush).

In fact, the process of fetching parameters, pushing gradients, and processing training data can be carried out in three only weakly synchronized threads (see the Appendix for pseudocode).

In the experiments reported below we fixed nfetch = npush = 1 for simplicity and ease of comparison to traditional SGD.

Downpour SGD is more robust to machines failures than standard (synchronous) SGD.

For synchronous SGD, if one machine fails, the entire training process is delayed; whereas for asynchronous SGD, if one machine in a model replica fails, the other model replicas continue processing their training data and updating the model parameters via the parameter servers.

On the other hand, the multiple forms of asynchronous processing in Downpour SGD introduce a great deal of additional stochasticity in the optimization procedure.

Most obviously, a model replica is almost certainly computing its gradients based on a set of parameters that are slightly out of date, in that some other model replica will likely have updated the parameters on the parameter server in the meantime.

But there are several other sources of stochasticity beyond this: Because the parameter server shards act independently, there is no guarantee that at any given moment the parameters on each shard of the parameter server have undergone the same number of updates, or that the updates were applied in the same order.

Moreover, because the model replicas are permitted to fetch parameters and push gradients in separate threads, there may be additional subtle inconsistencies in the timestamps of parameters.

There is little theoretical grounding for the safety of these operations for nonconvex problems, but in practice we found relaxing consistency requirements to be remarkably effective.

One technique that we have found to greatly increase the robustness of Downpour SGD is the use of the Adagrad [10] adaptive learning rate procedure.

Rather than using a single fixed learning rate on the parameter server (η in Figure 2), Adagrad uses a separate adaptive learning rate for each parameter.

Let η_{i,K} be the learning rate of the i-th parameter at iteration K and Δw_{i,K} its gradient; then we set η_{i,K} = γ / √(Σ_{j=1}^{K} Δw_{i,j}²).

Because these learning rates are computed only from the summed squared gradients of each parameter, Adagrad is easily implemented locally within each parameter server shard.
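Because the update needs only the shard's own running sum of squared gradients, a per-shard Adagrad can be sketched as follows (names and the γ, ε values are illustrative, not from the paper):

```python
import numpy as np

class AdagradShard:
    """Adagrad applied locally within one parameter server shard: the
    per-parameter rate eta_{i,K} = gamma / sqrt(sum of squared gradients)
    depends only on state held by this shard."""
    def __init__(self, dim, gamma=0.1, eps=1e-8):
        self.w = np.zeros(dim)
        self.sq_sum = np.zeros(dim)   # running sum of squared gradients
        self.gamma, self.eps = gamma, eps

    def apply_gradient(self, grad):
        self.sq_sum += grad ** 2
        eta = self.gamma / (np.sqrt(self.sq_sum) + self.eps)  # per-parameter rate
        self.w -= eta * grad
```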

The value of γ, the constant scaling factor for all learning rates, is generally larger (perhaps by an order of magnitude) than the best fixed learning rate used without Adagrad.

The use of Adagrad extends the maximum number of model replicas that can productively work simultaneously, and combined with a practice of “warmstarting” model training with only a single model replica before unleashing the other replicas, it has virtually eliminated stability concerns in training deep networks using Downpour SGD (see results in Section 5).

4.2 Sandblaster L-BFGS

Batch methods have been shown to work well in training small deep networks [7].

To apply these methods to large models and large datasets, we introduce the Sandblaster batch optimization framework and discuss an implementation of L-BFGS using this framework.

A key idea in Sandblaster is distributed parameter storage and manipulation.

The core of the optimization algorithm (e.g., L-BFGS) resides in a coordinator process (Figure 2), which does not have direct access to the model parameters.

Instead, the coordinator issues commands drawn from a small set of operations (e.g., dot product, scaling, coefficient-wise addition, multiplication) that can be performed by each parameter server shard independently, with the results being stored locally on the same shard.

Additional information, e.g., the history cache for L-BFGS, is also stored on the parameter server shard on which it was computed.

This allows running large models (billions of parameters) without incurring the overhead of sending all the parameters and gradients to a single central server. (See the Appendix for pseudocode.)

In typical parallelized implementations of L-BFGS, data is distributed to many machines and each machine is responsible for computing the gradient on a specific subset of data examples.

The gradients are sent back to a central server (or aggregated via a tree [16]).

Many such methods wait for the slowest machine, and therefore do not scale well to large shared clusters.

To account for this problem, we employ the following load balancing scheme: The coordinator assigns each of the N model replicas a small portion of work, much smaller than 1/Nth of the total size of a batch, and assigns replicas new portions whenever they are free.

With this approach, faster model replicas do more work than slower replicas.

To further manage slow model replicas at the end of a batch, the coordinator schedules multiple copies of the outstanding portions and uses the result from whichever model replica finishes first.

This scheme is similar to the use of “backup tasks” in the MapReduce framework [23].
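The portion-assignment and backup-task idea can be illustrated with a toy simulation. The tick-based scheduling and all names here are ours, purely illustrative of the load-balancing behavior, not Sandblaster's actual coordinator protocol:

```python
from collections import deque

# Toy simulation: the coordinator hands out portions much smaller than 1/N
# of the batch, free replicas immediately get more work, and once the queue
# drains it re-issues outstanding portions as backup copies.
def run_batch(portions, replicas, speed):
    """portions: list of work ids; speed[r]: portions replica r finishes
    per tick. Returns the number of portions each replica completed."""
    todo = deque(portions)
    done, completed = set(), {r: 0 for r in replicas}
    while len(done) < len(portions):
        for r in replicas:
            for _ in range(speed[r]):
                if not todo:
                    # Backup tasks: re-issue anything not yet finished.
                    todo.extend(p for p in portions if p not in done)
                if not todo:
                    break
                p = todo.popleft()
                if p not in done:      # first finisher of a portion wins
                    done.add(p)
                    completed[r] += 1
    return completed
```

In this sketch a replica that runs three times faster ends up completing roughly three times as many portions, so no one waits on the slowest machine for more than a small portion of work.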

Prefetching of data, along with supporting data affinity by assigning sequential portions of data to the same worker, makes data access a non-issue.

In contrast with Downpour SGD, which requires relatively high frequency, high bandwidth parameter synchronization with the parameter server, Sandblaster workers only fetch parameters at the beginning of each batch (when they have been updated by the coordinator), and only send the gradients every few completed portions (to protect against replica failures and restarts).

5 Experiments

We evaluated our optimization algorithms by applying them to training models for two different deep learning problems: object recognition in still images and acoustic processing for speech recognition.

The speech recognition task was to classify the central region (or frame) in a short snippet of audio as one of several thousand acoustic states.

We used a deep network with five layers: four hidden layers with sigmoidal activations and 2560 nodes each, and a softmax output layer with 8192 nodes.

The input representation was 11 consecutive overlapping 25 ms frames of speech, each represented by 40 log-energy values.

The network was fully-connected layer-to-layer, for a total of approximately 42 million model parameters.
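The "approximately 42 million" figure can be checked directly from the layer sizes above with a back-of-the-envelope count (including bias terms):

```python
# Parameter count for the speech model: 11 frames x 40 log-energy values
# = 440 inputs, four fully-connected hidden layers of 2560 sigmoid units,
# and an 8192-way softmax output layer.
inputs = 11 * 40                      # 440-dimensional input
layers = [inputs, 2560, 2560, 2560, 2560, 8192]

weights = sum(a * b for a, b in zip(layers, layers[1:]))
biases = sum(layers[1:])
total = weights + biases              # 41,777,152 ~= 42 million parameters
```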

We trained on a data set of 1.1 billion weakly labeled examples, and evaluated on a hold out test set.

See [27] for similar deep network configurations and training procedures.

For visual object recognition we trained a larger neural network with locally-connected receptive fields on the ImageNet data set of 16 million images, each of which we scaled to 100x100 pixels [28].

The network had three stages, each composed of filtering, pooling and local contrast normalization, where each node in the filtering layer was connected to a 10x10 patch in the layer below.

Our infrastructure allows many nodes to connect to the same input patch, and we ran experiments varying the number of identically connected nodes from 8 to 36.

The output layer consisted of 21 thousand one-vs-all logistic classifier nodes, one for each of the ImageNet object categories.

See [29] for similar deep network configurations and training procedures.

Model parallelism benchmarks: To explore the scaling behavior of DistBelief model parallelism (Section 3), we measured the mean time to process a single mini-batch for simple SGD training as a function of the number of partitions (machines) used in a single model instance. In Figure 3 we quantify the impact of parallelizing across N machines by reporting the average training speed-up: the ratio of the time taken using only a single machine to the time taken using N. Speedups for inference steps in these models are similar and are not shown here.

The moderately sized speech model runs fastest on 8 machines, computing 2.2× faster than using a single machine. (Models were configured to use no more than 20 cores per machine.)

Partitioning the model on more than 8 machines actually slows training, as network overhead starts to dominate in the fully-connected network structure and there is less work for each machine to perform with more partitions.

In contrast, the much larger, locally-connected image models can benefit from using many more machines per model replica.

The largest model, with 1.7 billion parameters, benefits the most, giving a speedup of more than 12× using 81 machines.

For these large models using more machines continues to increase speed, but with diminishing returns.