- Links
- “Evaluating model-based planning and planner amortization for continuous control”, Byravan et al 2021
- “Scalable Online Planning via Reinforcement Learning Fine-Tuning”, Fickinger et al 2021
- “Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control”, Bertsekas 2021
- “Train on Small, Play the Large: Scaling Up Board Games with AlphaZero and GNN”, Ben-Assayag & El-Yaniv 2021
- “Neural Tree Expansion for Multi-Robot Planning in Non-Cooperative Environments”, Riviere et al 2021
- “Scaling Scaling Laws with Board Games”, Jones 2021
- “Transfer of Fully Convolutional Policy-Value Networks Between Games and Game Variants”, Soemers et al 2021
- “Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search”, Lan et al 2020
- “Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess”, Tomašev et al 2020
- “Learning Personalized Models of Human Behavior in Chess”, McIlroy-Young et al 2020
- “ReBeL: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games”, Brown et al 2020
- “Learning Compositional Neural Programs for Continuous Control”, Pierrot et al 2020
- “Monte-Carlo Tree Search as Regularized Policy Optimization”, Grill et al 2020
- “Tackling Morpion Solitaire with AlphaZero-likeRanked Reward Reinforcement Learning”, Wang et al 2020
- “Aligning Superhuman AI with Human Behavior: Chess as a Model System”, McIlroy-Young et al 2020
- “Neural Machine Translation with Monte-Carlo Tree Search”, Parker & Chen 2020
- “Approximate exploitability: Learning a best response in large games”, Timbers et al 2020
- “Accelerating and Improving AlphaZero Using Population Based Training”, Wu et al 2020
- “Self-Play Learning Without a Reward Metric”, Schmidt et al 2019
- “MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, Schrittwieser et al 2019
- “Multiplayer AlphaZero”, Petosa & Balch 2019
- “Global optimization of quantum dynamics with AlphaZero deep exploration”, Dalgaard et al 2019
- “Learning Compositional Neural Programs with Recursive Tree Search and Planning”, Pierrot et al 2019
- “Deep Policies for Width-Based Planning in Pixel Domains”, Junyent et al 2019
- “Policy Gradient Search: Online Planning and Expert Iteration without Search Trees”, Anthony et al 2019
- “α-Rank: Multi-Agent Evaluation by Evolution”, Omidshafiei et al 2019
- “Accelerating Self-Play Learning in Go”, Wu 2019
- “ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero”, Tian et al 2019
- “Bayesian Optimization in AlphaGo”, Chen et al 2018
- “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, Silver et al 2018
- “AlphaSeq: Sequence Discovery with Deep Reinforcement Learning”, Shao et al 2018
- “ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, Kitchen & Benedetti 2018
- “Has dynamic programming improved decision making?”, Rust 2018
- “Improving width-based planning with compact policies”, Junyent et al 2018
- “Surprising Negative Results for Generative Adversarial Tree Search”, Azizzadenesheli et al 2018
- “Dual Policy Iteration”, Sun et al 2018
- “Solving the Rubik’s Cube Without Human Knowledge”, McAleer et al 2018
- “Feedback-Based Tree Search for Reinforcement Learning”, Jiang et al 2018
- “A Tree Search Algorithm for Sequence Labeling”, Lao et al 2018
- “Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations”, Bertsekas 2018
- “Sim-to-Real Optimization of Complex Real World Mobile Network with Imperfect Information via Deep Reinforcement Learning from Self-play”, Tan et al 2018
- “M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search”, Shen et al 2018
- “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”, Silver et al 2017
- “Mastering the game of Go without human knowledge”, Silver et al 2017
- “Learning Generalized Reactive Policies using Deep Neural Networks”, Groshev et al 2017
- “Learning to Plan Chemical Syntheses”, Segler et al 2017
- “Thinking Fast and Slow with Deep Learning and Tree Search”, Anthony et al 2017
- “Deep Reinforcement Learning: An Overview”, Li 2017
- “Mastering the game of Go with deep neural networks and tree search”, Silver et al 2016
- “Giraffe: Using Deep Reinforcement Learning to Play Chess”, Lai 2015
- “Reinforcement Learning as Classification: Leveraging Modern Classifiers”, Lagoudakis & Parr 2003

- Miscellaneous

# Links

## “Evaluating model-based planning and planner amortization for continuous control”, Byravan et al 2021

“Evaluating model-based planning and planner amortization for continuous control”, (2021-10-07):

There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We find that well-tuned model-free agents are strong baselines even for high DoF control problems but MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency in hard multi-task/multi-goal settings. Finally, we show that it is possible to distil a model-based planner into a policy that amortizes the planning computation without any loss of performance. Videos of agents performing different tasks can be seen at https://sites.google.com/view/mbrl-amortization/home.

## “Scalable Online Planning via Reinforcement Learning Fine-Tuning”, Fickinger et al 2021

“Scalable Online Planning via Reinforcement Learning Fine-Tuning”, (2021-09-30):

Lookahead search has been a critical component of recent AI successes, such as in the games of chess, go, and poker. However, the search methods used in these games, and in many other settings, are tabular. Tabular search methods do not scale well with the size of the search space, and this problem is exacerbated by stochasticity and partial observability. In this work we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings. In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and show the generality of our algorithm by also showing that it outperforms tabular search in the Atari game Ms. Pacman.

## “Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control”, Bertsekas 2021

“Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control”, (2021-08-20):

In this paper we aim to provide analysis and insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. In particular, through a unifying abstract mathematical framework, we show that the principal AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.

## “Train on Small, Play the Large: Scaling Up Board Games with AlphaZero and GNN”, Ben-Assayag & El-Yaniv 2021

“Train on Small, Play the Large: Scaling Up Board Games with AlphaZero and GNN”, (2021-07-18):

Playing board games is considered a major challenge for both humans and AI researchers. Because some complicated board games are quite hard to learn, humans usually begin with playing on smaller boards and incrementally advance to master larger board strategies. Most neural network frameworks that are currently tasked with playing board games neither perform such incremental learning nor possess capabilities to automatically scale up.

In this work, we look at the board as a graph and combine a graph neural network architecture inside the AlphaZero framework, along with some other innovative improvements. Our

ScalableAlphaZerois capable of learning to play incrementally on small boards, and advancing to play on large ones. Our model can be trained quickly to play different challenging board games on multiple board sizes, without using any domain knowledge.We demonstrate the effectiveness of ScalableAlphaZero and show, for example, that by training it for only three days on small Othello boards, it can defeat the AlphaZero model on a large board, which was trained to play the large board for 30 days.

## “Neural Tree Expansion for Multi-Robot Planning in Non-Cooperative Environments”, Riviere et al 2021

“Neural Tree Expansion for Multi-Robot Planning in Non-Cooperative Environments”, (2021-04-20):

We present a self-improving, Neural Tree Expansion (NTE) method for multi-robot online planning in non-cooperative environments, where each robot attempts to maximize its cumulative reward while interacting with other self-interested robots. Our algorithm adapts the centralized, perfect information, discrete-action space method from AlphaZero to a decentralized, partial information, continuous action space setting for multi-robot applications. Our method has three interacting components: (1) a centralized, perfect-information “expert” Monte Carlo Tree Search (MCTS) with large computation resources that provides expert demonstrations, (2) a decentralized, partial-information “learner” MCTS with small computation resources that runs in real-time and provides self-play examples, and (3) policy & value neural networks that are trained with the expert demonstrations and bias both the expert and the learner tree growth. Our numerical experiments demonstrate Neural Tree Expansion’s computational advantage by finding better solutions than a MCTS with 20× more resources. The resulting policies are dynamically sophisticated, demonstrate coordination between robots, and play the Reach-Target-Avoid differential game significantly better than the state-of-the-art control-theoretic baseline for multi-robot, double-integrator systems. Our hardware experiments on an aerial swarm demonstrate the computational advantage of Neural Tree Expansion, enabling online planning at 20Hz with effective policies in complex scenarios.

## “Scaling Scaling Laws with Board Games”, Jones 2021

“Scaling Scaling Laws with Board Games”, (2021-04-07; ; backlinks):

The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well.

By conducting a sequence of experiments using AlphaZero and Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that the test-time and train-time compute available to an agent can be traded off while maintaining performance.

Slope: The slope of the incline is 500 Elo per order of magnitude increase in compute.A more memorable interpretation is that if you are in the linearly-increasing regime, then you will need about 2× as much compute as your opponent to beat them 2⁄3 of the time.

Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.

Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.

Random play: Finally, the distance between random play and perfect play increases by 500 Elo for each increment of board size.Unlike the other quantities mentioned previously, the distance between random and perfect play is a property of the game itself rather than of the agent.

…

Train-test trade-off: So far we have focused on the compute budget during training, but another pertinent budget is the compute spent during evaluation. All the results discussed previously have used a tree search of size 64 during evaluation, the same as used during training. But there is no reason that the train-time search and test-time search have to be the same size, and so by varying the size of the test-time compute budget we can see inFigure 8that larger tree searches at test time can substantially improve the performance of an agent.Knowing now that compute can be spent in 2 places, at train time and test time, the immediate question is: how do these 2 budgets trade off? This is illustrated in

Figure 9, which shows that the trade-off is linear in log-compute: for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated, down to a floor of a single-node tree search…the simple relationship between compute at train time and compute at test time was originally surprising to us. Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could easily substitute for the other. On reflection however, we believe the key distinction is that an optimization at test-time needs only optimise over one sample, while train-time compute meanwhile must optimise over the entire distribution of samples.…the way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins

^{3}. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2⁄3 [as you go from 1:1, half the total numbers drawn, to 2:1, or 2/(2+1)—as if each tree search were an independent lottery ticket]. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.Second, both the relation of performance to board size and the relation of performance to compute are smooth. Before embarking on this project, a key unknown was whether performance would show any ‘spikes’ with regards to compute or board size. A spike with regards to compute might indicate the model had achieved some key insight, while a spike with regards to board size might indicate a minimum complexity past which key insights are available for the model to discover. As is however, models’ performance changes smoothly and predictably with both increased compute and increased complexity.

## “Transfer of Fully Convolutional Policy-Value Networks Between Games and Game Variants”, Soemers et al 2021

“Transfer of Fully Convolutional Policy-Value Networks Between Games and Game Variants”, (2021-02-24):

In this paper, we use fully convolutional architectures in AlphaZero-like self-play training setups to facilitate transfer between variants of board games as well as distinct games. We explore how to transfer trained parameters of these architectures based on shared semantics of channels in the state and action representations of the Ludii general game system. We use Ludii’s large library of games and game variants for extensive transfer learning evaluations, in zero-shot transfer experiments as well as experiments with additional fine-tuning time.

## “Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search”, Lan et al 2020

“Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search”, (2020-12-14):

Monte Carlo tree search (MCTS) has achieved state-of-the-art results in many domains such as Go and Atari games when combining with deep neural networks (DNNs). When more simulations are executed, MCTS can achieve higher performance but also requires enormous amounts of CPU and GPU resources. However, not all states require a long searching time to identify the best action that the agent can find. For example, in 19×19 Go and NoGo, we found that for more than half of the states, the best action predicted by DNN remains unchanged even after searching 2 minutes. This implies that a significant amount of resources can be saved if we are able to stop the searching earlier when we are confident with the current searching result. In this paper, we propose to achieve this goal by predicting the uncertainty of the current searching status and use the result to decide whether we should stop searching. With our algorithm, called Dynamic Simulation MCTS (DS-MCTS), we can speed up a NoGo agent trained by AlphaZero 2.5× faster while maintaining a similar winning rate. Also, under the same average simulation count, our method can achieve a 61% winning rate against the original program.

## “Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess”, Tomašev et al 2020

“Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess”, (2020-09-09):

It is non-trivial to design engaging and balanced sets of game rules. Modern chess has evolved over centuries, but without a similar recourse to history, the consequences of rule changes to game dynamics are difficult to predict. AlphaZero provides an alternative in silico means of game balance assessment. It is a system that can learn near-optimal strategies for any rule set from scratch, without any human supervision, by continually learning from its own experience. In this study we use AlphaZero to creatively explore and design new chess variants. There is growing interest in chess variants like Fischer Random Chess, because of classical chess’s voluminous opening theory, the high percentage of draws in professional play, and the non-negligible number of games that end while both players are still in their home preparation. We compare nine other variants that involve atomic changes to the rules of chess. The changes allow for novel strategic and tactical patterns to emerge, while keeping the games close to the original. By learning near-optimal strategies for each variant with AlphaZero, we determine what games between strong human players might look like if these variants were adopted. Qualitatively, several variants are very dynamic. An analytic comparison show that pieces are valued differently between variants, and that some variants are more decisive than classical chess. Our findings demonstrate the rich possibilities that lie beyond the rules of modern chess.

## “Learning Personalized Models of Human Behavior in Chess”, McIlroy-Young et al 2020

“Learning Personalized Models of Human Behavior in Chess”, (2020-08-23):

Even when machine learning systems surpass human ability in a domain, there are many reasons why AI systems that capture human-like behavior would be desirable: humans may want to learn from them, they may need to collaborate with them, or they may expect them to serve as partners in an extended interaction. Motivated by this goal of human-like AI systems, the problem of predicting human actions—as opposed to predicting optimal actions—has become an increasingly useful task. We extend this line of work by developing highly accurate personalized models of human behavior in the context of chess. Chess is a rich domain for exploring these questions, since it combines a set of appealing features: AI systems have achieved superhuman performance but still interact closely with human chess players both as opponents and preparation tools, and there is an enormous amount of recorded data on individual players. Starting with an open-source version of AlphaZero trained on a population of human players, we demonstrate that we can significantly improve prediction of a particular player’s moves by applying a series of fine-tuning adjustments. Furthermore, we can accurately perform stylometry—predicting who made a given set of actions—indicating that our personalized models capture human decision-making at an individual level.

## “ReBeL: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games”, Brown et al 2020

“ReBeL: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games”, (2020-07-27):

The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by AlphaZero. However, prior algorithms of this form cannot cope with imperfect-information games.

This paper presents

ReBeL, a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero.Results in two different imperfect-information games show ReBeL converges to an approximate Nash equilibrium. We also show ReBeL achieves superhuman performance in heads-up no-limit Texas hold’em poker, while using far less domain knowledge than any prior poker AI.

## “Learning Compositional Neural Programs for Continuous Control”, Pierrot et al 2020

“Learning Compositional Neural Programs for Continuous Control”, (2020-07-27):

We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.

## “Monte-Carlo Tree Search as Regularized Policy Optimization”, Grill et al 2020

“Monte-Carlo Tree Search as Regularized Policy Optimization”, (2020-07-24; ):

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero’s search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

## “Tackling Morpion Solitaire with AlphaZero-likeRanked Reward Reinforcement Learning”, Wang et al 2020

“Tackling Morpion Solitaire with AlphaZero-likeRanked Reward Reinforcement Learning”, (2020-06-14):

Morpion Solitaire is a popular single player game, performed with paper and pencil. Due to its large state space (on the order of the game of Go) traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record, to the best of our knowledge, there has been no further progress reported, for about a decade.

In this paper we take the recent impressive performance of deep self-learning reinforcement learning approaches from AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is that the state space is sparse, there are few win/loss signals. Instead, we use an approach known as ranked reward to create a reinforcement learning self-play framework for Morpion Solitaire. This enables us to find medium-quality solutions with reasonable computational effort. Our record is a 67 steps solution, which is very close to the human best (68) without any other adaptation to the problem than using ranked reward. We list many further avenues for potential improvement.

## “Aligning Superhuman AI with Human Behavior: Chess as a Model System”, McIlroy-Young et al 2020

“Aligning Superhuman AI with Human Behavior: Chess as a Model System”, (2020-06-02):

As artificial intelligence becomes increasingly intelligent—in some cases, achieving superhuman performance—there is growing potential for humans to learn from and collaborate with algorithms. However, the ways in which AI systems approach problems are often different from the ways people do, and thus may be uninterpretable and hard to learn from. A crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that constitute human behavior, rather than simply matching aggregate human performance.

We pursue this goal in a model system with a long history in artificial intelligence: chess. The aggregate performance of a chess player unfolds as they make decisions over the course of a game. The hundreds of millions of games played online by players at every skill level form a rich source of data in which these decisions, and their exact context, are recorded in minute detail. Applying existing chess engines to this data, including an open-source implementation of AlphaZero, we find that they do not predict human moves well.

We develop and introduce Maia, a customized version of Alpha-Zero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines, and can achieve maximum accuracy when predicting decisions made by players at a specific skill level in a tuneable way. For a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines. Taken together, our results suggest that there is substantial promise in designing artificial intelligence systems with human collaboration in mind by first accurately modeling granular human decision-making.

## “Neural Machine Translation with Monte-Carlo Tree Search”, Parker & Chen 2020

“Neural Machine Translation with Monte-Carlo Tree Search”, (2020-04-27):

Recent algorithms in machine translation have included a value network to assist the policy network when deciding which word to output at each step of the translation. The addition of a value network helps the algorithm perform better on evaluation metrics like the BLEU score. After training the policy and value networks in a supervised setting, the policy and value networks can be jointly improved through common actor-critic methods. The main idea of our project is to instead leverage Monte-Carlo Tree Search (MCTS) to search for good output words with guidance from a combined policy and value network architecture in a similar fashion as AlphaZero. This network serves both as a local and a global look-ahead reference that uses the result of the search to improve itself. Experiments using the IWLST14 German to English translation dataset show that our method outperforms the actor-critic methods used in recent machine translation papers.

## “Approximate exploitability: Learning a best response in large games”, Timbers et al 2020

“Approximate exploitability: Learning a best response in large games”, (2020-04-20):

A standard metric used to measure the approximate optimality of policies in imperfect information games is exploitability, i.e. the performance of a policy against its worst-case opponent. However, exploitability is intractable to compute in large games as it requires a full traversal of the game tree to calculate a best response to the given policy. We introduce a new metric, approximate exploitability, that calculates an analogous metric using an approximate best response; the approximation is done by using search and reinforcement learning. This is a generalization of local best response, a domain specific evaluation metric used in poker. We provide empirical results for a specific instance of the method, demonstrating that our method converges to exploitability in the tabular and function approximation settings for small games. In large games, our method learns to exploit both strong and weak agents, learning to exploit an AlphaZero agent.

## “Accelerating and Improving AlphaZero Using Population Based Training”, Wu et al 2020

“Accelerating and Improving AlphaZero Using Population Based Training”, (2020-03-13):

AlphaZero has been very successful in many games. Unfortunately, it still consumes a huge amount of computing resources, the majority of which is spent in self-play. Hyperparameter tuning exacerbates the training cost since each hyperparameter configuration requires its own time to train one run, during which it will generate its own self-play records. As a result, multiple runs are usually needed for different hyperparameter configurations. This paper proposes using population based training (PBT) to help tune hyperparameters dynamically and improve strength during training time. Another significant advantage is that this method requires a single run only, while incurring a small additional time cost, since the time for generating self-play records remains unchanged though the time for optimization is increased following the AlphaZero training algorithm. In our experiments for 9×9 Go, the PBT method is able to achieve a higher win rate for 9×9 Go than the baselines, each with its own hyperparameter configuration and trained individually. For 19×19 Go, with PBT, we are able to obtain improvements in playing strength. Specifically, the PBT agent can obtain up to 74% win rate against ELF OpenGo, an open-source state-of-the-art AlphaZero program using a neural network of a comparable capacity. This is compared to a saturated non-PBT agent, which achieves a win rate of 47% against ELF OpenGo under the same circumstances.

## “Self-Play Learning Without a Reward Metric”, Schmidt et al 2019

“Self-Play Learning Without a Reward Metric”, (2019-12-16; ; backlinks):

The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.

## “MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, Schrittwieser et al 2019

“MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, (2019-11-19; ):

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown.

In this work we present the

MuZeroalgorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.When evaluated on 57 different Atari games—the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled—our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.

## “Multiplayer AlphaZero”, Petosa & Balch 2019

“Multiplayer AlphaZero”, (2019-10-29):

The AlphaZero algorithm has achieved superhuman performance in two-player, deterministic, zero-sum games where perfect information of the game state is available. This success has been demonstrated in Chess, Shogi, and Go where learning occurs solely through self-play. Many real-world applications (e.g., equity trading) require the consideration of a multiplayer environment. In this work, we suggest novel modifications of the AlphaZero algorithm to support multiplayer environments, and evaluate the approach in two simple 3-player games. Our experiments show that multiplayer AlphaZero learns successfully and consistently outperforms a competing approach: Monte Carlo tree search. These results suggest that our modified AlphaZero can learn effective strategies in multiplayer game scenarios. Our work supports the use of AlphaZero in multiplayer games and suggests future research for more complex environments.

## “Global optimization of quantum dynamics with AlphaZero deep exploration”, Dalgaard et al 2019

“Global optimization of quantum dynamics with AlphaZero deep exploration”, (2019-07-12):

While a large number of algorithms for optimizing quantum dynamics for different objectives have been developed, a common limitation is the reliance on good initial guesses, being either random or based on heuristics and intuitions. Here we implement a tabula rasa deep quantum exploration version of the Deepmind AlphaZero algorithm for systematically averting this limitation. AlphaZero employs a deep neural network in conjunction with deep lookahead in a guided tree search, which allows for predictive hidden variable approximation of the quantum parameter landscape. To emphasize transferability, we apply and benchmark the algorithm on three classes of control problems using only a single common set of algorithmic hyperparameters. AlphaZero achieves substantial improvements in both the quality and quantity of good solution clusters compared to earlier methods. It is able to spontaneously learn unexpected hidden structure and global symmetry in the solutions, going beyond even human heuristics.

## “Learning Compositional Neural Programs with Recursive Tree Search and Planning”, Pierrot et al 2019

“Learning Compositional Neural Programs with Recursive Tree Search and Planning”, (2019-05-30):

We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disk

## “Deep Policies for Width-Based Planning in Pixel Domains”, Junyent et al 2019

“Deep Policies for Width-Based Planning in Pixel Domains”, (2019-04-12):

Width-based planning has demonstrated great success in recent years due to its ability to scale independently of the size of the state space. For example, Bandres et al 2018 introduced a rollout version of the Iterated Width algorithm whose performance compares well with humans and learning methods in the pixel setting of the Atari games suite. In this setting, planning is done on-line using the “screen” states and selecting actions by looking ahead into the future. However, this algorithm is purely exploratory and does not leverage past reward information. Furthermore, it requires the state to be factored into features that need to be pre-defined for the particular task, e.g., the B-PROST pixel features. In this work, we extend width-based planning by incorporating an explicit policy in the action selection mechanism. Our method, called π-IW, interleaves width-based planning and policy learning using the state-actions visited by the planner. The policy estimate takes the form of a neural network and is in turn used to guide the planning step, thus reinforcing promising paths. Surprisingly, we observe that the representation learned by the neural network can be used as a feature space for the width-based planner without degrading its performance, thus removing the requirement of pre-defined features for the planner. We compare π-IW with previous width-based methods and with AlphaZero, a method that also interleaves planning and learning, in simple environments, and show that π-IW has superior performance. We also show that π-IW algorithm outperforms previous width-based methods in the pixel setting of Atari games suite.

## “Policy Gradient Search: Online Planning and Expert Iteration without Search Trees”, Anthony et al 2019

“Policy Gradient Search: Online Planning and Expert Iteration without Search Trees”, (2019-04-07):

Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems, however a disadvantage to MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree; this does not scale to games with very high branching factors. We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves comparable performance to MCTS, and an agent trained using Expert Iteration with PGS was able defeat MoHex 2.0, the strongest open-source Hex agent, in 9×9 Hex.

## “α-Rank: Multi-Agent Evaluation by Evolution”, Omidshafiei et al 2019

“α-Rank: Multi-Agent Evaluation by Evolution”, (2019-03-04):

We introduce

α-Rank, a principled evolutionary dynamics methodology for the evaluation and ranking of agents in large-scale multi-agent interactions, grounded in a novel dynamical game-theoretic solution concept called Markov-Conley chains (MCCs).The approach leverages continuous-time and discrete-time evolutionary dynamical systems applied to empirical games, and scales tractably in the number of agents, the type of interactions, and the type of empirical games (symmetric and asymmetric). Current models are fundamentally limited in one or more of these dimensions and are not guaranteed to converge to the desired game-theoretic solution concept (typically the Nash equilibrium).

α-Rank provides a ranking over the set of agents under evaluation and provides insights into their strengths, weaknesses, and long-term dynamics. This is a consequence of the links we establish to the MCC solution concept when the underlying evolutionary model’s ranking-intensity parameter, α, is chosen to be large, which exactly forms the basis of α-Rank. In contrast to the Nash equilibrium, which is a static concept based on fixed points, MCCs are a dynamical solution concept based on the Markov chain formalism, Conley’s Fundamental Theorem of Dynamical Systems, and the core ingredients of dynamical systems: fixed points, recurrent sets, periodic orbits, and limit cycles.

α-Rank runs in polynomial time with respect to the total number of pure strategy profiles, whereas computing a Nash equilibrium for a general-sum game is known to be intractable. We introduce proofs that not only provide a unifying perspective of existing continuous-time and discrete-time evolutionary evaluation models, but also reveal the formal underpinnings of the α-Rank methodology.

We empirically validate the method in several domains including AlphaGo, AlphaZero, MuJoCo Soccer, and Poker.

## “Accelerating Self-Play Learning in Go”, Wu 2019

“Accelerating Self-Play Learning in Go”, (2019-02-27):

By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50× reduction in computation over comparable methods. Like AlphaZero and replications such as ELF OpenGo and Leela Zero, our bot KataGo only learns from neural-net-guided Monte Carlo tree search self-play. But whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF’s final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero. Our work is a step towards making learning in state spaces as large as Go possible without large-scale computational resources.

## “ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero”, Tian et al 2019

“ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero”, (2019-02-12):

The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning’s capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available.

## “Bayesian Optimization in AlphaGo”, Chen et al 2018

“Bayesian Optimization in AlphaGo”, (2018-12-17):

During the development of AlphaGo, its many hyper-parameters were tuned with Bayesian optimization multiple times. This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent and this improved its win-rate from 50% to 66.5% in self-play games. This tuned version was deployed in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage. It is our hope that this brief case study will be of interest to Go fans, and also provide Bayesian optimization practitioners with some insights and inspiration.

## “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, Silver et al 2018

`2018-silver.pdf#deepmind`

: “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, (2018-12-07):

The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.

## “AlphaSeq: Sequence Discovery with Deep Reinforcement Learning”, Shao et al 2018

“AlphaSeq: Sequence Discovery with Deep Reinforcement Learning”, (2018-09-26):

Sequences play an important role in many applications and systems. Discovering sequences with desired properties has long been an interesting intellectual pursuit. This paper puts forth a new paradigm, AlphaSeq, to discover desired sequences algorithmically using deep reinforcement learning (DRL) techniques. AlphaSeq treats the sequence discovery problem as an episodic symbol-filling game, in which a player fills symbols in the vacant positions of a sequence set sequentially during an episode of the game. Each episode ends with a completely-filled sequence set, upon which a reward is given based on the desirability of the sequence set. AlphaSeq models the game as a Markov Decision Process (MDP), and adapts the DRL framework of AlphaGo to solve the MDP. Sequences discovered improve progressively as AlphaSeq, starting as a novice, learns to become an expert game player through many episodes of game playing. Compared with traditional sequence construction by mathematical tools, AlphaSeq is particularly suitable for problems with complex objectives intractable to mathematical analysis. We demonstrate the searching capabilities of AlphaSeq in two applications: 1) AlphaSeq successfully rediscovers a set of ideal complementary codes that can zero-force all potential interferences in multi-carrier CDMA systems. 2) AlphaSeq discovers new sequences that triple the signal-to-interference ratio—benchmarked against the well-known Legendre sequence—of a mismatched filter estimator in pulse compression radar systems.

## “ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, Kitchen & Benedetti 2018

“ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, (2018-08-30):

The current state of the art in playing many important perfect information games, including Chess and Go, combines planning and deep reinforcement learning with self-play. We extend this approach to imperfect information games and present ExIt-OOS, a novel approach to playing imperfect information games within the Expert Iteration framework and inspired by AlphaZero. We use Online Outcome Sampling, an online search algorithm for imperfect information games in place of MCTS. While training online, our neural strategy is used to improve the accuracy of playouts in OOS, allowing a learning and planning feedback loop for imperfect information games.

## “Has dynamic programming improved decision making?”, Rust 2018

“Has dynamic programming improved decision making?”, (2018-08-22; backlinks):

Dynamic programming (DP) is an extremely powerful tool for solving a wide class of sequential decision making problems under uncertainty. In principle, it enables us to compute

optimal decision rulesthat specify the best possible decision to take in any given situation. This article reviews developments in DP and contrasts its revolutionary impact on economics, operations research, engineering, and artificial intelligence, with the comparative paucity of real world applications where DP is actually used to improve decision making. I discuss the literature on numerical solution of DPs and its connection to the literature on reinforcement learning (RL) and artificial intelligence (AI).Despite amazing, highly publicized successes of these algorithms that result in superhuman levels of performance in board games such as chess or Go, I am not aware of comparably successful applications of DP for helping individuals and firms to solve real-world problems. I point to the fuzziness of many real world decision problems and the difficulty in mathematically formulating and modeling them as key obstacles to wider application of DP to improve decision making. Nevertheless, I provide several success stories where DP has demonstrably improved decision making and discuss a number of other examples where it seems likely that the application of DP could have substantial value.

I conclude that ‘applied DP’ offers substantial promise for economic policy making if economists can let go of the empirically untenable assumption of unbounded rationality and try to tackle the challenging decision problems faced every day by individuals and firms.

[

Keywords: actor-critic algorithms, Alpha Zero, approximate dynamic programming, artificial intelligence, behavioral economics, Bellman equation, bounded rationality, curse of dimensionality, computational complexity, decision rules, dynamic pricing, dynamic programming, employee compensation, Herbert Simon, fleet sizing, identification problem, individual and firm behavior life-cycle problem, locomotive allocation, machine learning, Markov decision processes, mental models, model-free learning, neural networks, neurodynamic programming, offline versus online training, optimal inventory management, optimal replacement, optimal search, principle of decomposition, Q-learning, revenue management, real-time dynamic programming, reinforcement learning, Richard Bellman, structural econometrics, supervised versus unsupervised learning]

## “Improving width-based planning with compact policies”, Junyent et al 2018

“Improving width-based planning with compact policies”, (2018-06-15):

Optimal action selection in decision problems characterized by sparse, delayed rewards is still an open challenge. For these problems, current deep reinforcement learning methods require enormous amounts of data to learn controllers that reach human-level performance. In this work, we propose a method that interleaves planning and learning to address this issue. The planning step hinges on the Iterated-Width (IW) planner, a state of the art planner that makes explicit use of the state representation to perform structured exploration. IW is able to scale up to problems independently of the size of the state space. From the state-actions visited by IW, the learning step estimates a compact policy, which in turn is used to guide the planning step. The type of exploration used by our method is radically different than the standard random exploration used in RL. We evaluate our method in simple problems where we show it to have superior performance than the state-of-the-art reinforcement learning algorithms A2C and Alpha Zero. Finally, we present preliminary results in a subset of the Atari games suite.

## “Surprising Negative Results for Generative Adversarial Tree Search”, Azizzadenesheli et al 2018

“Surprising Negative Results for Generative Adversarial Tree Search”, (2018-06-15; ):

While many recent advances in deep reinforcement learning (RL) rely on model-free methods, model-based approaches remain an alluring prospect for their potential to exploit unsupervised data to learn environment model. In this work, we provide an extensive study on the design of deep generative models for RL environments and propose a sample efficient and robust method to learn the model of Atari environments. We deploy this model and propose generative adversarial tree search (GATS) a deep RL algorithm that learns the environment model and implements Monte Carlo tree search (MCTS) on the learned model for planning. While MCTS on the learned model is computationally expensive, similar to AlphaGo, GATS follows depth limited MCTS. GATS employs deep Q network (DQN) and learns a Q-function to assign values to the leaves of the tree in MCTS. We theoretical analyze GATS vis-a-vis the bias-variance trade-off and show GATS is able to mitigate the worst-case error in the Q-estimate. While we were expecting GATS to enjoy a better sample complexity and faster converges to better policies, surprisingly, GATS fails to outperform DQN. We provide a study on which we show why depth limited MCTS fails to perform desirably.

## “Dual Policy Iteration”, Sun et al 2018

“Dual Policy Iteration”, (2018-05-28):

Recently, a novel class of Approximate Policy Iteration (API) algorithms have demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search), that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.

## “Solving the Rubik’s Cube Without Human Knowledge”, McAleer et al 2018

“Solving the Rubik’s Cube Without Human Knowledge”, (2018-05-18):

A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In these environments, a reward is always received at the end of the game, however, for many combinatorial optimization environments, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration: a novel reinforcement learning algorithm that is able to teach itself how to solve the Rubik’s Cube with no human assistance. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves—less than or equal to solvers that employ human domain knowledge.

## “Feedback-Based Tree Search for Reinforcement Learning”, Jiang et al 2018

“Feedback-Based Tree Search for Reinforcement Learning”, (2018-05-15):

Inspired by recent successes of Monte-Carlo tree search (MCTS) in a number of artificial intelligence (AI) application domains, we propose a model-based reinforcement learning (RL) technique that iteratively applies MCTS on batches of small, finite-horizon versions of the original infinite-horizon Markov decision process. The terminal condition of the finite-horizon problems, or the leaf-node evaluator of the decision tree generated by MCTS, is specified using a combination of an estimated value function and an estimated policy function. The recommendations generated by the MCTS procedure are then provided as feedback in order to refine, through classification and regression, the leaf-node evaluator for the next iteration. We provide the first sample complexity bounds for a tree search-based RL algorithm. In addition, we show that a deep neural network implementation of the technique can create a competitive AI agent for the popular multi-player online battle arena (MOBA) game King of Glory.

## “A Tree Search Algorithm for Sequence Labeling”, Lao et al 2018

“A Tree Search Algorithm for Sequence Labeling”, (2018-04-29):

In this paper we propose a novel reinforcement learning based model for sequence tagging, referred to as MM-Tag. Inspired by the success and methodology of the AlphaGo Zero, MM-Tag formalizes the problem of sequence tagging with a Monte Carlo tree search (MCTS) enhanced Markov decision process (MDP) model, in which the time steps correspond to the positions of words in a sentence from left to right, and each action corresponds to assign a tag to a word. Two long short-term memory networks (LSTM) are used to summarize the past tag assignments and words in the sentence. Based on the outputs of LSTMs, the policy for guiding the tag assignment and the value for predicting the whole tagging accuracy of the whole sentence are produced. The policy and value are then strengthened with MCTS, which takes the produced raw policy and value as inputs, simulates and evaluates the possible tag assignments at the subsequent positions, and outputs a better search policy for assigning tags. A reinforcement learning algorithm is proposed to train the model parameters. Our work is the first to apply the MCTS enhanced MDP model to the sequence tagging task. We show that MM-Tag can accurately predict the tags thanks to the exploratory decision making mechanism introduced by MCTS. Experimental results show based on a chunking benchmark showed that MM-Tag outperformed the state-of-the-art sequence tagging baselines including CRF and CRF with LSTM.

## “Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations”, Bertsekas 2018

“Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations”, (2018-04-12; ; backlinks):

In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller “aggregate” Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation, than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.

## “Sim-to-Real Optimization of Complex Real World Mobile Network with Imperfect Information via Deep Reinforcement Learning from Self-play”, Tan et al 2018

“Sim-to-Real Optimization of Complex Real World Mobile Network with Imperfect Information via Deep Reinforcement Learning from Self-play”, (2018-02-18):

Mobile network that millions of people use every day is one of the most complex systems in the world. Optimization of mobile network to meet exploding customer demand and reduce capital/operation expenditures poses great challenges. Despite recent progress, application of deep reinforcement learning (DRL) to complex real world problem still remains unsolved, given data scarcity, partial observability, risk and complex rules/dynamics in real world, as well as the huge reality gap between simulation and real world.

To bridge the reality gap, we introduce a Sim-to-Real framework to directly transfer learning from simulation to real world via graph convolutional neural network (CNN)—by abstracting partially observable mobile network into graph, then distilling domain-variant irregular graph into domain-invariant tensor in locally Euclidean space as input to CNN, domain randomization and multi-task learning. We use a novel self-play mechanism to encourage competition among DRL agents for best record on multiple tasks via simulated annealing, just like athletes compete for world record in decathlon. We also propose a decentralized multi-agent, competitive and cooperative DRL method to coordinate the actions of multi-cells to maximize global reward and minimize negative impact to neighbor cells.

Using 6 field trials on commercial mobile networks, we demonstrate for the first time that a DRL agent can successfully transfer learning from simulation to complex real world problem with imperfect information, complex rules/dynamics, huge state/action space, and multi-agent interactions, without any training in the real world.

## “M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search”, Shen et al 2018

“M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search”, (2018-02-12):

Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model. To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (i.e., history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards. From these trajectories, the network is improved in an off-policy manner using Q-learning, which modifies the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy-improvement step to learn the model. At test time, MCTS is combined with the neural policy to predict the target node. Experimental results on several graph-walking benchmarks show that M-Walk is able to learn better policies than other RL-based methods, which are mainly based on policy gradients. M-Walk also outperforms traditional KBC baselines.

## “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”, Silver et al 2017

“Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”, (2017-12-05; ; backlinks):

The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

## “Mastering the game of Go without human knowledge”, Silver et al 2017

`2017-silver.pdf#deepmind`

: “Mastering the game of Go without human knowledge”, (2017-10-19):

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

## “Learning Generalized Reactive Policies using Deep Neural Networks”, Groshev et al 2017

“Learning Generalized Reactive Policies using Deep Neural Networks”, (2017-08-24):

We present a new approach to learning for planning, where knowledge acquired while solving a given set of planning problems is used to plan faster in related, but new problem instances. We show that a deep neural network can be used to learn and represent a

generalized reactive policy(GRP) that maps a problem instance and a state to an action, and that the learned GRPs efficiently solve large classes of challenging problem instances. In contrast to prior efforts in this direction, our approach significantly reduces the dependence of learning on handcrafted domain knowledge or feature selection. Instead, the GRP is trained from scratch using a set of successful execution traces. We show that our approach can also be used to automatically learn a heuristic function that can be used in directed search algorithms. We evaluate our approach using an extensive suite of experiments on two challenging planning problem domains and show that our approach facilitates learning complex decision making policies and powerful heuristic functions with minimal human input. Videos of our results are available at goo.gl/Hpy4e3.

## “Learning to Plan Chemical Syntheses”, Segler et al 2017

“Learning to Plan Chemical Syntheses”, (2017-08-14):

From medicines to materials, small organic molecules are indispensable for human well-being. To plan their syntheses, chemists employ a problem solving technique called retrosynthesis. In retrosynthesis, target molecules are recursively transformed into increasingly simpler precursor compounds until a set of readily available starting materials is obtained. Computer-aided retrosynthesis would be a highly valuable tool, however, past approaches were slow and provided results of unsatisfactory quality. Here, we employ Monte Carlo Tree Search (MCTS) to efficiently discover retrosynthetic routes. MCTS was combined with an expansion policy network that guides the search, and an “in-scope” filter network to pre-select the most promising retrosynthetic steps. These deep neural networks were trained on 12 million reactions, which represents essentially all reactions ever published in organic chemistry. Our system solves almost twice as many molecules and is 30× faster in comparison to the traditional search method based on extracted rules and hand-coded heuristics. Finally after a 60 year history of computer-aided synthesis planning, chemists can no longer distinguish between routes generated by a computer system and real routes taken from the scientific literature. We anticipate that our method will accelerate drug and materials discovery by assisting chemists to plan better syntheses faster, and by enabling fully automated robot synthesis.

## “Thinking Fast and Slow with Deep Learning and Tree Search”, Anthony et al 2017

“Thinking Fast and Slow with Deep Learning and Tree Search”, (2017-05-23):

Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present

Expert Iteration(ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalize plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex 1.0, the most recent Olympiad Champion player to be publicly released.

## “Deep Reinforcement Learning: An Overview”, Li 2017

“Deep Reinforcement Learning: An Overview”, (2017-01-25):

We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions.

Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.

## “Mastering the game of Go with deep neural networks and tree search”, Silver et al 2016

`2016-silver.pdf#deepmind`

: “Mastering the game of Go with deep neural networks and tree search”, (2016-01-28):

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

[Anecdote: I hear from Groq that the original AlphaGo GPU implementation was

noton track to defeat Lee Sedol by about a month before, when they happened to gamble on implementing TPUv1 support. The additional compute led to drastic performance gains, and the TPU model could beat the GPU model in ~98 of 100 games, and the final model solidly defeated Lee Sedol. (Since TPUv1s reportedly only did inferencing/forward-mode, presumably they were not used for the initial imitation learning, or the policy gradients self-play, but for generating the ~30 million self-play games which the value network was trained on (doing regression/prediction of ‘board → P(win)’, requiring no state or activations from the self-play games, just generating an extremely large corpus which could be easily used by GPU training.]

## “Giraffe: Using Deep Reinforcement Learning to Play Chess”, Lai 2015

“Giraffe: Using Deep Reinforcement Learning to Play Chess”, (2015-09-04):

This report presents Giraffe, a chess engine that uses self-play to discover all its domain-specific knowledge, with minimal hand-crafted knowledge given by the programmer. Unlike previous attempts using machine learning only to perform parameter-tuning on hand-crafted evaluation functions, Giraffe’s learning system also performs automatic feature extraction and pattern recognition. The trained evaluation function performs comparably to the evaluation functions of state-of-the-art chess engines—all of which containing thousands of lines of carefully hand-crafted pattern recognizers, tuned over many years by both computer chess experts and human chess masters. Giraffe is the most successful attempt thus far at using end-to-end machine learning to play chess.

## “Reinforcement Learning as Classification: Leveraging Modern Classifiers”, Lagoudakis & Parr 2003

`2003-lagoudakis.pdf`

: “Reinforcement Learning as Classification: Leveraging Modern Classifiers”, Michail Lagoudakis, Ronald Parr

# Miscellaneous

`https://hackernoon.com/the-3-tricks-that-made-alphago-zero-work-f3d47b6686ef`

`https://en.chessbase.com/post/the-future-is-here-alphazero-learns-chess`

`https://en.chessbase.com/post/leela-chess-zero-alphazero-for-the-pc`

`https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/`

`https://cacm.acm.org/magazines/2021/9/255049-playing-with-and-against-computers/fulltext`

`https://blog.janestreet.com/deep-learning-the-hardest-go-problem-in-the-world/`

`https://ai.googleblog.com/2021/03/leveraging-machine-learning-for-game.html`

`http://cl-informatik.uibk.ac.at/cek/holstep/ckfccs-holstep-submitted.pdf`