Abstract

Deep learning is a popular machine learning technique that has been applied to many real-world problems, ranging from computer vision to natural language processing. However, training a deep neural network is very time-consuming, especially on big data, and it has become difficult for a single machine to train a large model over a large dataset. A popular solution is to distribute and parallelize the training process across multiple machines using the parameter server framework. In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP), which improves the state-of-the-art Stale Synchronous Parallel (SSP) paradigm by dynamically determining the staleness threshold at run time. Conventionally, to run distributed training under SSP, the user must specify a particular staleness threshold as a hyper-parameter. However, users typically do not know how to set the threshold and often resort to finding a value through trial and error, which is time-consuming. Based on workers' recent processing times, our approach DSSP adaptively adjusts the threshold per iteration at run time to reduce the time faster workers spend waiting to synchronize the globally shared parameters (the weights of the model). This increases the frequency of parameter updates (i.e., the iteration throughput), which in turn speeds up the convergence rate. We compare DSSP with other paradigms, namely Bulk Synchronous Parallel (BSP), Asynchronous Parallel (ASP), and SSP, by training deep neural network (DNN) models over GPU clusters in both homogeneous and heterogeneous environments. The results show that in a heterogeneous environment, where the cluster consists of mixed GPU models, DSSP converges to a higher accuracy much earlier than SSP and BSP and performs similarly to ASP. In a homogeneous distributed cluster, DSSP is more stable than and slightly outperforms SSP and ASP, and converges much faster than BSP.
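To make the staleness idea concrete, the sketch below contrasts the fixed SSP check with an adaptive DSSP-style threshold. The function names, the clamp range, and the ratio-based adjustment rule are illustrative assumptions for exposition only, not the paper's exact algorithm.

```python
def ssp_can_proceed(worker_iter, min_iter, threshold):
    """SSP rule: a worker may advance only if it is at most
    `threshold` iterations ahead of the slowest worker; otherwise
    it blocks and waits for stragglers to catch up."""
    return worker_iter - min_iter <= threshold

def dssp_threshold(recent_times, lo=0, hi=15):
    """DSSP-style idea (hypothetical rule): derive the staleness
    bound from the spread of workers' recent per-iteration
    processing times, clamped to [lo, hi]. Roughly, allow the
    fast worker as many extra iterations as it completes while
    the slowest worker finishes one."""
    fastest, slowest = min(recent_times), max(recent_times)
    return max(lo, min(hi, round(slowest / fastest) - 1))
```

For example, if the fastest worker takes 0.5 s per iteration and the slowest takes 2.0 s, this rule would permit a staleness of 3, so the fast worker keeps updating the shared parameters instead of idling; in a homogeneous cluster the times are close and the threshold shrinks toward synchronous behavior.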