Reinforcement Learning Algorithms in a Multi-Agent Snake Environment by Christian Biseinere

Abstract: Machine learning has led to significant progress in previously unsolved problems such as speech recognition and self-driving cars. Reinforcement learning is the subset of machine learning that deals with computer agents making a sequence of autonomous actions to achieve a set goal or reward. One popular way to test and train reinforcement learning algorithms is through playing video games because they have a set reward and are easy to model in ways that computer agents can understand. One area of reinforcement learning that has shown promise is multi-agent environments where more than one autonomous agent interacts in a competitive or cooperative way. While static environments are easier to solve because they have set rules that the agent can learn, this can lead to over fitting of the agent to its environment and an inability to transfer learned knowledge to other problems and environments. Multi-agent environments provide variability that has been shown to lead to more complex emergent behaviors. In this paper, I use the state-of-the-art reinforcement learning algorithms Proximal Policy Optimization (PPO) and Trust-Region Policy Optimization (TRPO) in a multiplayer clone of the classic game snake. The environment the agents compete in uses the Gym framework, and the agents use TensorFlow, a Python API for implementing machine-learning algorithms. I examine the performance of these agents after several iterations of training measured by the rewards an agent gets from playing the game. I also compare and contrast PPO and TRPO for this problem and show that PPO agents train much faster than TRPO agents and show more promising results in this environment.


The concept of Artificial Intelligence, (AI) bestowing on computers the ability to think and reason as well as humans, has fascinated computer scientists and researchers for decades and led to many advancements in computing. As a part of the field of AI, machine learning is a discipline that uses data to train models that are then used to solve problems. This is in contrast to expert systems that use algorithms written by programmers to codify the rules that make machines intelligent. Using more robust learning algorithms and exponential increases in computing power and data storage capacity, researchers have been able to solve problems that were only solvable by humans such as natural language processing, image processing, and self-driving cars. A recent example includes DeepMind’s (a subsidiary of Alphabet, Google’s parent company) AlphaGo beating the top world champions in the board game Go (Metz, 2016). In this board game, two players compete to capture space by enveloping the other player’s pieces. While it seems simple, there are more possible moves than there are stars in the sky, far too many to calculate the outcomes ahead of time with conventional means. This victory was significant because of the complexity of the game and the number of possible moves compared to previously solved games like chess. When IBM’s Deep Blue beat the world chess champion in 1997, it succeeded because it was an expert machine that was able to calculate the results of every possible move and pick the best one. In Go, because of the large number of possible moves on the board, it wouldn’t be feasible to calculate the results of every possible move. To win this game, researchers at DeepMind had to use machine-learning techniques to teach an agent how to play the game. With machine-learning, data is collected and used to train an agent to accomplish a task autonomously by optimizing functions that represent the task or goal of the agent. In recent years, machine learning has been super charged by the convergence of exponentially faster computers, increasingly cheaper computing power available through the cloud, and the reemergence of an old idea put in to practical use for the first time with this new computing power, neural networks. While machine learning can be done without neural networks, they have led to much more robust computer agents that are capable of more advanced tasks than simpler machine learning methods. Neural networks are loosely based on the human brain and the mathematical foundations of the connections that make up the human brain. While neural networks are often a simplification of the brain that they are based on, machine learning algorithms that use neural networks are already capable of superhuman ability in certain narrow tasks. This narrow but superhuman knowledge means that an AI like AlphaGo can be the best Go player in the world but will not be able to pick up chess without losing its Go playing ability. This has led to significant advances in domains that previously only belonged to humans, such as, computer vision, speech recognition, and autonomous machines like self-driving cars.

This last category is the domain of reinforcement learning, a subfield of machine learning. As stated in “OpenAI Gym,” “Reinforcement learning is the branch of machine learning that is concerned with making sequences of decisions.” (Brockman et al. 2016). This is in contrast to previously mentioned problems such as classification, used in computer vision to decipher images, and regression, which is used to come to quantitative conclusions about data. “OpenAI Gym” marked the starting point of a new era of comparable and easily peer-reviewed reinforcement learning research by defining a useful interface for digital environments and laying out future directions for research. Among these future directions for research are multi-agent environments, which is the subject of this paper. In reinforcement learning the agent is the trained computer program that solves a given problem and the environment is the world that the agent interacts with and observes to solve the problem. By introducing the variability of multiple competing or cooperative agents to the playing field, we have the opportunity to observe emergent behaviors in the models trained on this environment because of the added complexity of multiple autonomous agents. These multi-agent environments also more closely resemble the real world, in which there are millions of autonomous actors working together and against each other to accomplish various goals. Since machine learning and reinforcement learning algorithms are already influencing the real world in their decisions on applications like facial recognition and self-driving cars, it is important that we find robust solutions to these problems that are able to stand up to the rigorous conditions of the real world.

We use games and simulations to test machine learning algorithms and approaches because they provide two critical features needed to train good reinforcement learning agents. The first of these is an easy way to observe the environment; because video games and simulations are already digital environments, we can easily translate them into data that reinforcement learning agents can understand by capturing the screen and turning the pixel values into arrays of data. The second feature is a quantifiable reward structure that we can use to give our agents a goal that they can use to develop more long-term strategies. Without a quantifiable reward it is hard to teach current reinforcement learning agents that rely on reward functions or policy functions that use reward functions.

This study examines the performance of two reinforcement learning algorithms on the “Slitherin” problem from “Requests for Research 2” from OpenAI (Sutskever et al. 2018). In this problem, researchers were challenged to implement a multiplayer clone of the classic game Snake and solve it with reinforcement learning agents. These agents were implemented using the stable baselines repository (Hill et al. 2018), a fork of the OpenAI baseline implementations of popular reinforcement learning algorithms (Dhariwal et al. 2017). This environment is a clone of the game Snake that allows multiple snakes to play at once and compete for food. While the game was originally single player, this multiplayer version will allow us to observe the learned behaviors of the interactions between snakes. The Environment used in this study is a fork of the Gym environment “Gym-Snake” (Grant, 2018) that was modified to work with the stable baselines repository among other modifications described in the environment section of this paper.

In “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments” (Lowe et al. 2017), the authors used the Gym environment for multi-agent reinforcement learning with a novel approach called multi-agent deep deterministic policy gradient (MADDPG), which is a variation of earlier work with deep deterministic policy gradients (DDPG). This modified algorithm was specifically written to solve multi-agent problems because the existing algorithms such as Q-learning, other policy gradients, and actor-critic methods had certain limitations that prevented them from solving this problem by themselves. The results of this paper are interesting because the MADDPG agent was able to vastly outperform the other algorithms in the four environments the team tested.

In “Emergent Complexity via Multi-Agent Competition,” the authors make the case for the importance of studying multi-agent environments by showing that agents trained in these environments begin to show emergent behaviors that are not normally seen in relatively simple environments (Bansal et al. 2017). The methodology of this experiment is also interesting because it utilizes two previously mentioned studies, “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments” and “Proximal Policy Optimization Algorithms.” In this paper, Bansal et al. describes a study of emergent behaviors in a multi-agent environment using the existing PPO algorithm and this work closely aligns with the research done in this study with PPO and TRPO.


The environment used in this experiment is a multiplayer clone of the game “Snake” commonly found on Nokia 3310 phones and reproduced on many other platforms. This problem was posed by OpenAI in their second request for research that was designed to get more people working on multi-agent problems (Sutskever et al. 2018). This environment, made by Grant (2018), was designed as an OpenAI Gym environment and allows for customization in the size of field, number of snakes, amount of food, etc. This environment has been modified to work with the stable baselines repository and to provide the correct observation space and action space to the algorithm in the multi-snake environment. The rewards have also been reconfigured to return the sum of the rewards for all of the snakes rather than an array so that it works with the algorithms from the stable baselines repository. The scoring system used to evaluate the trained agents is based on the rewards used to train the agent. Starting at 0, a snake receives -1 each time a snake dies and +1 for each fruit that gets picked up. The lowest score possible is -2 and this scoring system is used to try and teach the agent which actions to take by giving it a negative reward for failing and a positive reward for succeeding.

I configured the environment to use a 25 by 25 square grid, the default size, and feature 2 snakes and 5 fruit. To minimize complexity, this study used two snakes because the request for research recommended to start with 2 snakes before attempting to scale the game to more snakes (Sutskever et al. 2018). This study used 5 fruit at a time on the playing field because of the way PPO and TRPO learn. They both use random actions in the beginning to explore the environment and get feedback on their actions, so having 5 fruit on the field could make it more likely that a snake would accidentally pick up a fruit in the beginning and learn from the positive reward to pick up more fruit in the future. Each time a snake eats a fruit it grows by one square and another fruit appears on the playing field.


In reinforcement learning, there are two major branches of learning algorithms, Q-learning, where you train a function that directly selects the action to take given a certain state, and policy gradient methods, where you train a policy that predicts the best action to take at any given state. Policy gradient methods have been shown to be more versatile and robust, and have largely superseded Q-learning, but they are often more complicated to implement and apply to problems (Schulman et al. 2017). In this environment, this study compares the performance of two reinforcement learning algorithms: Trust Region Policy Optimization (Schulman et al. 2015) and Proximal Policy Optimization (Schulman et al. 2017) over multiple training sessions and with 3 agents for each algorithm. The OpenAI baselines are high quality implementations of common reinforcement learning algorithms, including DQN, PPO, AC3, and others, that are available for reinforcement learning research (Dhariwal et al. 2017). This study was done using a fork of this software called stable baselines (Hill et al. 2018) that has an easier to use API and high-quality implementations of TRPO and PPO.

Figure 2. PPO1 (gray) vs. TPRO1 (green) rewards over the course of their final training sessions by hours run time

“Proximal Policy Optimization Algorithms” introduces a novel upgrade to policy optimization algorithms that improves performance in the standard OpenAI environments over previous policy optimization algorithms (Schulman et al. 2017). Policy gradient algorithms use a proxy for the actions that the agent will take instead of directly training a neural network to find those values. This generally leads to more flexibility and emergent behaviors than Deep Q Networks (DQN) which directly train the network to determine Q values that represent every possible action in every possible scenario. PPO has many benefits added compared to standard policy gradients, as described in the abstract for “Proximal Policy Optimization Algorithms”:

We propose a new family of policy gradient methods for reinforcement learning, which al- ternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. (Schulman et al.).

I will be comparing PPO to Trust Region Policy Optimization (TRPO), an older policy gradient algorithm that does not use the minibatch updates (Schulman et al. 2015). Minibatch updates refers to using multiple smaller updates to the policy and allows implementations of PPO to be multi-threaded while TRPO is single-threaded. TRPO uses a “trust region” to make sure that changes to the policy do not change it drastically, but does not utilize minibatch updates. Due to the randomness associated with learning solely from experience, this study compares 3 agents per algorithm to more effectively compare the rate at which the agents learn and the average score of the fully trained agents.

This problem was challenging because reinforcement learning methods such as Deep Q-Learning (DQN) would be unable to cope with the variability of a multi-agent environment that would otherwise be perfectly suited to solving this game in a single-player setting. PPO and TRPO were chosen for this problem because they are well suited to play this kind of game where there is random chance and no fixed sequence of events to learn. Other algorithms like Deep Q-Networks do not handle changing environments well. In training these agents there are two ways to measure the time that these algorithms take to train. A time step refers to a single turn in the game and each time step there will be an observation by the agent and a single action for each snake on the playing field. The absolute time in minutes and hours it takes to run these training sessions can then be used to compare the training performance of these algorithms. The agents were trained for 3 separate time steps, 100,000, 1,000,000, and 10,000,000 time steps broken down into the initial 100,000, an additional 900,000, and finally an additional 9,000,000. This was done to compare the results of additional training on the same agents and to save historic models of each agent at each milestone for comparison. The PPO agents were trained for all three training sessions, while the TRPO agents were only trained for two sessions because the TRPO agents took longer to run two sessions than the PPO agents took for all three, and this study focuses on the relative training time of these agents. This results in a similar training time for all the agents but a very different amount of time steps and thus experience gained in their training. The default hyperparameters for the CnnPolicy from stable baselines were used to make these agents.

When training these agents, it became difficult to complete training sessions because of how long it took to run these training sessions on my computer. These algorithms are slow on conventional CPUs (Central Processing Unit) but can be greatly accelerated with compatible GPUs (Graphics Processing Unit). The type of math that GPUs are designed to be perform quickly, 3D geometry and vector math for graphics acceleration, happens to make these specialized parts really good at the math needed for training machine learning agents as well. To overcome these limitations, the long-term training for the study was done on an Azure Deep Learning Virtual Machine because it is specifically designed for machine learning research. It comes with a full development environment and TensorFlow preinstalled and configured to use GPU acceleration. This allowed shorter training sessions, as well as the ability to train more capable agents. The virtual machine ran on a NC6 instance, a GPU optimized virtual machine with a Nvidia Tesla K80 GPU, in the US East region and sent the resulting trained models and training logs back to my computer to observe them playing and interpret the data.

Training data was collected using the logging function of the stable baselines API and the data was visualized using TensorBoard. TensorBoard is a part of TensorFlow that allows you to visualize the training data of your machine learning agents. Training data was used to extracted relevant data for the results section and to use TensorBoard to create graphs of training data. Due to the randomness inherent in learning through random exploration, the training data has a lot of noise. To combat this noise, the default smoothing coefficient of 0.6 was used to reduce the noise and expose the trends that are discussed in the results section. TensorBoard was also used to produce the graphs that are in the results section.


The training data was collected using the TensorBoard logging function of stable baselines and the agents were evaluated by using the smoothing function in TensorBoard to eliminate noise and show trends in the data. The smoothing function in TensorBoard reduces the noise in the data set by applying a weight to the value (I used the default value of 0.6). The smoothed value more clearly shows the trends in the data over the course of many hours of training.

The first training session consisted of 100,000 time steps of training for all 6 agents, the 3 PPO and 3 TRPO agents, with each time step being a turn in the game (Table 1). The PPO agents were extremely fast in this stage, taking approximately 5 minutes due to multithreading, which allows multiple training sessions to happen concurrently and update the same policy. The TRPO agents took much longer to run with over an hour for each agent, but showed slightly better results at this stage of the training than PPO agents. This seems to indicate that TRPO agents are able to learn more with less actual experience measured by turns in the game, but when comparing the run time of the agents PPO is clearly much faster, which allows for additional training to close the gap. Overall, 100,000 time steps does not appear to be enough training for either of these algorithms to produce agents that can play the game well enough to get a positive score. This negative score comes from the fact that the agent receives a negative reward when a snake dies (-1) and a positive reward when it picks up a fruit (+1), which means that a model would have to pick up at least 2 fruit to break even in a single game.

Table 1. Results from the first training sessions

The second training sessions were for an additional 900,000 time steps to bring the total training of each agent to 1,000,000 time steps (Table 2). This step of the training is interesting because the PPO agents are starting to show some improvement with only just over an hour total of training while the TRPO agents took significantly longer to finish their training with only one TRPO agent getting a positive smoothed reward. Despite the differing smoothed rewards for the TRPO agents at this point in time, all three TRPO agents performed similarly in final testing which indicates that these smoothed rewards are due to excessive noise in the data. Since the TRPO agents took over 10 hours to finish this part of the training on an Azure NC6 instance, the last training session was only carried out by the PPO agents.

Table 2. Results from the second training sessions

For the final training sessions (Table 3), the PPO agents were trained for an additional 9,000,000 time steps to bring them to 10,000,000 total time steps. If we compare the time steps the agents complete by relative execution time, the PPO agents have been able to complete an order of magnitude more experience, 10,000,000 versus 1,000,000 time steps, compared to the TRPO agents. However, the amount of experience is not as important as what the agent is able to learn from that experience and whether that results in a trained model that can play the game well. According to the training data, the PPO agents do not disappoint here either, with the lowest scoring PPO agent outperforming the highest scoring TRPO agent by a sizable margin. The PPO agents also learned different strategies spontaneously through trial and error despite using the same learning algorithm and environment.

Table 3. Results from the third training sessions.

To collect more data on how the fully trained agents play the game, videos of the agents playing 10 games in a row were recorded and the scores from these sessions were averaged in Table 4. The TRPO agents did not seem to develop any recognizable strategy, and this is likely due to a lack of experience even with all of the training time these agents had. The fact that the TRPO agents did not develop any coherent strategy with more training time than the PPO agents shows the important advancements in performance that PPO introduced, especially because it is multithreaded. The PPO agents developed some unique strategies to maximize their average rewards, often employing strategies that people would use to turn around or avoid a wall in a unique way. It is also interesting that the snakes do not seem to go straight for food objects in ways that you would commonly see a person go after them, because the agents spontaneously learned strategies to stay alive that aren’t readily apparent to a human observer. PPO1 and PPO2 also seemed to develop a strategy of immediately killing one snake while maximizing the other snake. PPO3 was the most interesting to watch, however, because it is the only agent that performs equally well on both snakes and this seems to be why it has the highest average score.

Table 4: Average score over 10 games for fully trained agents with 10,000,000 time steps for PPO agents and 1,000,000 time steps for TRPO agents.


Implementing machine learning algorithms to solve multi-agent environments can still be challenging as existing tools and environments are mostly made for single-agent problems. Baseline implementations of algorithms can help because they provide a good foundation of high-quality implementations of state-of-the-art learning algorithms that can be applied to new problems. With PPO, the agents showed promising results with about 10 hours of training for each agent, while TRPO agents took longer to complete considerably less training but did show slightly better results for the number of time-steps they accomplished. While the training data seems to indicate that TRPO agents can marginally outperform PPO agents when comparing the number of completed time steps, running a multithreaded implementation of PPO with 64 threads allows the PPO agents to gain far more experience in the same amount of time and achieve far better results. This is important because the significant computing power used to train reinforcement learning agents can be expensive, and more efficient algorithms save time and money.

Out of the PPO agents, PPO1 and PPO2 seemed to focus on a single snake and obtain all of its rewards through one snake while PPO3 seems to use two snakes to achieve higher total rewards. While the snakes always started in the same places, the fruits randomly appeared and the agents had to learn to go after fruits themselves wherever they appeared instead of simply going to the same place every time. These behaviors were learned completely through trial and error, and they show that PPO agents were able to handle the variability in the environment by developing unique strategies and behaviors from scratch with no prior knowledge of the environment. This adaptability and flexibility could prove useful in real-world