Using monte carlo simulations to estimate the value of a state

An important aspect of an artificial intelligent agent’s decision making process is game state evaluation. The value of a state determines if it warrants being visited or not. In this post we will focus on one method of state evaluation: Monte Carlo simulations (MC). This is a random sampling technique that is used to approximate the probability distributions of some phenomena. In this case it calculates the probability of winning from the given game state. It does this by simulating the game from the given state N times with the agents playing randomly and keeps track of the win/loss ratio.

For example, given the following position in a game of Tic-Tac-Toe, it can be seen that O wins 4 times and X wins 1 time. So overall the probability of winning is four out of five (80%) for O and 1 out of 5 for X (20%).

This state is not a good state for X to be in. The expansion of the tree is exhaustive and fills out all the possible moves that each player can perform at their turn. As was stated previously, MC simulations are a sampling method. So each run of the game will only follow one path down the tree. The question is how does each player choose a move at their turn?

If the player knows what the best move is then choosing that will lead to the minimax value. However, in practice it is rarely, if ever the case that the best move is known, after all, that is what we are trying to teach the agent. So in the best case the players will have some idea of a strong move based on some heuristic function using domain knowledge. If the players always follow this behavior, the resulting path down the tree will always be the same since the players are deterministic. Since we want to determine the probability of winning we want the behavior at some points to be random so that we explore the tree. So the question then becomes what is the balance between strong moves and random moves.

This is the focus of a paper titled ‘Monte-Carlo Simulation Balancing’ by Silver and Tesauro. In the paper the main discussion is learning the optimal policy that results in a simulation outcome that matches the minimax value of the state. There are two components of a simulation policy: Strength and balance. Strength refers to the error of a single player’s moves and balance takes into account the error of both player’s moves. A strong policy makes few mistakes and a balanced policy can make mistakes as long as they cancel each other out. The goal of a simulation policy should be to be balanced and this is shown this in the paper in further detail. One caveat is to note that in the paper the authors estimate the minimax value by running a deep Monte Carlo Tree Search (MCTS) since in the limit MCTS will converge to the minimax value. However, this is not possible in larger search spaces and if you go to the trouble to tune the MCTS to work well, then it makes sense to use it to estimate the state value instead of using it to tune the monte carlo simulations. In practice, if the minimax value is not known, it is difficult to calculate whether the policy is balanced.

The game that is being used is a turn based battle game similar to those that occur in Civilization. There are 5 custom scenarios in which a minimax value is estimated based on domain knowledge. The scenarios are as follows: “Even” has 5 units per player and the units are placed in a vertical mirrored setup, “Major Tactical” has player A’s units surrounding two of player B’s units and the rest of player B’s units are placed far away. “Minor tactical” has two of player A’s units surrounding one of player B’s units and the rest of the units are placed in a vertical mirrored setup. “Major material” has player A with 5 units and player B has 2 units placed in a mirrored setup. “Minor material” has player A with 5 units and player B with 4 units placed in a mirrored setup.

Instead of using the algorithms provided in the paper, 5 different agents that are varying combinations of a strong baseline agent and a random agent were created. The goal of the experiment was to use the varying policies in each scenario to determine how close they would be to the expected minimax value. Note that if we use pure random policy then we estimate the probability of winning, which is not the minimax value.

The expectation was that the baseline agents would converge quickly, whereas the random agents would tend to take longer to converge given the large state and action space. There was also an expectation that the baseline agent would perform well in scenarios where it’s behavior was optimal. This could be the case for the Major tactical scenario where the correct behavior is to immediately attack which is the baseline agent’s default behavior. We did experiments that ran 1,000 to 100,000 games in order to check convergence.

We had two major results:

The random agents converged to similar values as the baseline agents and in a very small amount of games (1,000 games). There was very little change from 1,000 to 100,000 games played.
The random agents were in general more in line with our expected minimax values except in the case where the baseline agent had more knowledge about the optimal behavior.

In cases where the optimal policy is known, the simulation results in a value close to the expected minimax value. But this optimal policy only works in the Major Tactical scenario. In the Even scenario, there is an imbalance and the random policy has a value closer to the minimax value of 0.5. It is a similar case in the Minor Material scenario.

It will be the case that in most scenarios, the baseline policy will not know the optimal behavior and therefore may perform suboptimally where as we can expect the random policy to cancel out its mistakes and have a balanced outcome.

This experiment shows that without trial and error, it can be difficult to determine an ideal simulation policy for a specific game, and it also shows that ideally the policy should be more balanced than strong. For this game the random policy provides the most balance in most cases. There is more work to be done in using the algorithms provided in the paper to build a better simulation policy that will ideally lead to the minimax value and not the probability of winning.