Asynchronous Methods for Deep Reinforcement Learning

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using visual input.



1. Introduction
Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, asynchronous reinforcement learning achieves better results on many games, in far less time than previous GPU-based algorithms and using far fewer resources than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks and learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, on discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.
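
The parallel actor-learner idea described in this section can be illustrated with a minimal sketch. It is not the implementation used in the paper: a small shared Q-table stands in for the deep network, the five-state chain environment, the function names (`step`, `actor_learner`, `train`), and all hyperparameters are hypothetical, and applying updates without locks is an assumption made for brevity. Each worker thread acts in its own copy of the environment and applies online one-step Q-learning updates directly to the shared parameters, with no replay memory.

```python
import random
import threading

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.3   # illustrative values, not from the paper


def step(state, action):
    """Toy chain environment: reward 1 only on reaching the last state."""
    if action == 1:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def actor_learner(shared_q, seed, n_steps):
    """One worker: acts in its own environment copy, updates the shared table."""
    rng = random.Random(seed)   # per-worker randomness, so workers see different states
    state = 0
    for _ in range(n_steps):
        q_left, q_right = shared_q[state]
        # Epsilon-greedy action selection with random tie-breaking.
        if rng.random() < EPSILON or q_left == q_right:
            action = rng.choice(ACTIONS)
        else:
            action = 0 if q_left > q_right else 1
        next_state, reward, done = step(state, action)
        target = reward if done else reward + GAMMA * max(shared_q[next_state])
        # Online update applied straight to the shared parameters (no replay, no lock).
        shared_q[state][action] += ALPHA * (target - shared_q[state][action])
        state = 0 if done else next_state


def train(n_workers=4, n_steps=5000):
    """Run several actor-learners concurrently against one shared Q-table."""
    shared_q = [[0.0, 0.0] for _ in range(N_STATES)]
    workers = [
        threading.Thread(target=actor_learner, args=(shared_q, seed, n_steps))
        for seed in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared_q
```

Calling `train()` returns the shared table, from which a greedy policy can be read off per state. In standard CPython the global interpreter lock serializes these threads, so the sketch shows the structure of the approach rather than any speedup.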

2. Related Work
The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.
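
To make concrete why evolutionary methods parallelize so easily, the following is a minimal sketch of distributing fitness evaluations over a pool of threads. It does not reproduce any of the cited systems: the quadratic `fitness` function, the truncation-selection scheme, and every parameter value are hypothetical, with only the figure of 8 workers echoing the example above. The point it shows is that each candidate's fitness is computed independently, so one generation's evaluations can be farmed out with a single `map` call.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(candidate):
    """Hypothetical fitness: higher is better, maximal (0.0) at the all-zeros vector."""
    return -sum(x * x for x in candidate)


def evolve(n_generations=30, pop_size=16, n_params=4, n_workers=8, seed=0):
    """Simple evolutionary loop; returns the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-1.0, 1.0) for _ in range(n_params)] for _ in range(pop_size)
    ]
    history = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_generations):
            # Evaluations are independent, so they are dispatched to the pool in parallel.
            scores = list(pool.map(fitness, population))
            history.append(max(scores))
            ranked = [
                cand
                for _, cand in sorted(
                    zip(scores, population), key=lambda pair: pair[0], reverse=True
                )
            ]
            parents = ranked[: pop_size // 4]
            # Elitism keeps the parents; the rest are mutated copies of random parents.
            children = [
                [x + rng.gauss(0.0, 0.1) for x in rng.choice(parents)]
                for _ in range(pop_size - len(parents))
            ]
            population = parents + children
    return history
```

Because the parents are carried over unchanged, the best fitness recorded in `evolve()` never decreases from one generation to the next. A thread pool is used here only for simplicity; for CPU-bound fitness functions in CPython, a process pool or separate machines would be needed to obtain a real speedup.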
