Testing Artificial Players: Strength

How do you know if your artificial player, or bot, is any good? What does it even mean for an artificial player to be good?


At Decisive, we consider an intelligent artificial player (IAP) to be good on account of two factors: sheer strength and human-like behaviour. We are going to leave human-like behaviour for another blog post, not because it is less important than strength, but because it is very important and I would rather not make this post too long.


Strength is key, because it is possible to make a strong IAP play worse, but the opposite is not true. Let’s clarify from the beginning that when I talk about strength here, it is in the context of the IAP having access only to the same information as the human players. In other words, cheating is not an acceptable technique to make an IAP stronger. Not to go on a tangent, but this is one of the reasons why I call our product an IAP and not a bot. I think it is OK for a bot to cheat (its name is almost an antonym of intelligence).


So, let’s say you have worked on an IAP and want to know whether it is strong. How do you test it? We would like a concrete number that sits between two boundaries: a minimum and a maximum strength. So if I tell you that your IAP is a 5, where the minimum is 0 and the maximum is 10, you quickly get an idea of its strength.


As a general rule, the way to test an IAP is to make it play against another player and see how it does. We need the opponent to be of a known strength, and fast enough to play, hopefully, thousands of games. That rules out humans.


To avoid a chicken-and-egg situation, an easy place to start is at strength zero: a random player. If our IAP wins 50% of its games against a random player in a two-player game, then all we’ve built is a random IAP! Anything clearly above 50% and we know our player has some intelligence. This is actually an important early test, because we often try new techniques and theories for training IAPs, and an early and easy ‘proof of learning’ is a must (we fail often, and we like to fail fast).
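
One caveat on ‘anything above 50%’: with a finite number of games, 52% can easily be luck. A minimal sketch of how one might check this, assuming a two-player game with no draws and using a Wilson score interval (my choice of statistic for illustration, not necessarily what we run in production). The function names and the 95% level are assumptions.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Return the (low, high) Wilson score interval for a win rate.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half, centre + half

def beats_random(wins, games, z=1.96):
    """True only if the whole interval sits above 50%."""
    low, _ = wilson_interval(wins, games, z)
    return low > 0.5

# 6 wins out of 10 is not evidence of learning; 560 out of 1000 is.
print(beats_random(6, 10), beats_random(560, 1000))
```

The point is only that the proof of learning should come from the lower end of the interval, not from the raw percentage.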


Unfortunately, even a 100% win rate against a random player does not mean the IAP is strong. It does mean the IAP has learnt something, but it is far from playing well. Random is as bad a player as it gets, after all. So, we need a better opponent: Baseline.


Baseline is a hard-coded bot that is easy to write. It does not make clearly stupid moves like Random does, but it isn’t a good player either. Typically built with a behaviour tree or a similar ‘if-then’ organizing technique, the Baseline player will not learn anything new, but it will kill Random and won’t lose to a half-trained IAP.
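
To make the idea concrete, here is a toy sketch of a Random player and an ‘if-then’ Baseline. The game is a stand-in I picked for illustration (a pile of 15 stones, take 1 to 3 per turn, whoever takes the last stone wins), not one of our customers’ games, and all the names are hypothetical.

```python
import random

def random_player(stones, rng):
    """Strength zero: pick any legal move (take 1 to 3 stones)."""
    return rng.randint(1, min(3, stones))

def baseline_player(stones, rng):
    """Hand-written if-then rules; no search, no learning."""
    # Rule 1: if we can take the last stone, do it and win.
    if stones <= 3:
        return stones
    # Rule 2: avoid leaving 1 to 3 stones, which hands the opponent a win.
    safe = [take for take in (1, 2, 3) if stones - take > 3]
    if safe:
        return rng.choice(safe)
    # Rule 3: no safe move exists, so fall back to a random legal move.
    return rng.randint(1, 3)

def play_game(players, rng, stones=15):
    """Play one game and return the index (0 or 1) of the winner."""
    turn = 0
    while True:
        stones -= players[turn](stones, rng)
        if stones == 0:
            return turn
        turn = 1 - turn

def win_rate(player, opponent, games, seed=0):
    """Fraction of games won by `player`, alternating who moves first."""
    rng = random.Random(seed)
    wins = 0
    for g in range(games):
        if g % 2 == 0:
            wins += int(play_game((player, opponent), rng) == 0)
        else:
            wins += int(play_game((opponent, player), rng) == 1)
    return wins / games

print("Baseline vs Random:", win_rate(baseline_player, random_player, 2000))
print("Random vs Random:  ", win_rate(random_player, random_player, 2000))
```

Two rules are enough to beat Random comfortably in this toy game, yet the Baseline still plays poorly against anyone who thinks one move further ahead. That is exactly the role we want it to fill.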


In practice, our customers often have a readily available bot to use as a Baseline, because the game is already out there and has an AI to play against, or there is some testing bot, or even an earlier attempt at a good artificial opponent.


In any case, whether using existing code or writing from scratch, having a Baseline player is neither expensive nor time-consuming, and it is very useful: it is step two after getting close to a 100% win rate against Random.


So, let’s have a few thousand matches between IAP and Baseline. Some games have a lot of randomness and others don’t. This and other factors can make the same win percentage mean very different things, but in general we want to be well above a 50% win rate against Baseline to consider that step done. Once the IAP has been trained enough to beat Baseline well, we can consider that to be the first version of the IAP. We now want to start playing against humans and take a look at the actual behaviour, to see how good it is in every aspect and how much more training it might need.
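
Why ‘a few thousand’ matches? A rough sketch, assuming independent games with a win-or-lose outcome and the usual normal approximation, of how many games it takes to pin the win rate down to a given margin. The function name and the 95% level are my assumptions.

```python
import math

def games_needed(margin, z=1.96):
    """Games required so the measured win rate is within +/- margin.

    Uses the worst case p = 0.5, where the variance p * (1 - p) is largest.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return math.ceil(z * z * 0.25 / (margin * margin))

for margin in (0.05, 0.02, 0.01):
    print(f"+/-{margin:.0%}: about {games_needed(margin)} games")
```

Under these assumptions a few hundred games only gets you to within five points, while a couple of thousand gets you to within about two, which is why a human opponent is not an option for this step.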


Here is an example of some statistics from playing an IAP trained with machine learning against the then-current AI for the game Blood & Honor (we called their bot ‘Legacy’).



This kind of statistic is key to knowing where we are with the training: in this case, the IAP at strength level 10 finished first three times for every time it did not. A real champion, especially in such a hard strategy game with quite a bit of randomness in its rules.


Further iterations are done by playing the new candidate against the current champion. In our example, we’d have the IAP with further training play against IAP v1 and, if the win rate is acceptable, we label it IAP v2 and crown it as the new champion.
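
The champion-and-candidate loop can be sketched in a few lines. This is an illustration under assumptions, not our actual pipeline: the 55% threshold stands in for whatever ‘acceptable’ means for a given game, and `toy_match` is a hypothetical stand-in for running a real game between two agents.

```python
import random

def run_gate(candidate, champion, play_match, games=2000, threshold=0.55):
    """Return (new_champion, win_rate).

    `play_match(a, b)` must return True when `a` wins.
    The candidate is promoted only if its win rate reaches `threshold`.
    """
    wins = sum(1 for _ in range(games) if play_match(candidate, champion))
    rate = wins / games
    return (candidate if rate >= threshold else champion), rate

rng = random.Random(0)

def toy_match(a, b):
    # Stand-in for a real game: each "player" is just a skill number.
    return rng.random() < a["skill"] / (a["skill"] + b["skill"])

v1 = {"name": "IAP v1", "skill": 1.0}
candidate = {"name": "IAP v2 candidate", "skill": 1.5}

champion, rate = run_gate(candidate, v1, toy_match)
print(champion["name"], "is the champion after a win rate of", round(rate, 3))
```

Keeping the threshold above 50% is a deliberate choice in this sketch: it stops a candidate from being crowned on noise alone.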


Pretty simple: keep improving the IAP and test it against your previous version till you are happy with the strength. True, almost. There is a potential problem that might be more or less of an obstacle depending on the game: let’s say that Bot A beats Bot B, and Bot B beats Bot C. Does Bot A therefore beat Bot C? Probably, but you can’t be sure, so you need to test.


It actually happened recently that we developed a baseline agent and tested it against Random. It was good, with an 80% win rate, but we wanted to get closer to 100%, so we improved it and had a Baseline v2 that got to 98% wins against Random. And then we put Baseline v2 to play against Baseline v1 and… it won 15% of the time… 85% losses. It sounds illogical, but this is a real example, and if you saw the games played by these agents you would quickly realize why these numbers are the way they are. For this article, suffice it to say that strength testing often needs an ‘ensemble’ of opponents, usually including the original Baseline, some well-trained IAP and potentially others (Random is only useful for the first step).
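
Using the three numbers from that example, here is a small sketch of how an ensemble exposes the problem. It assumes no draws (so the reverse pairing is one minus the recorded rate), and the helper names are hypothetical.

```python
# Win rate of the first player against the second, from the example above.
win_rate = {
    ("Baseline v1", "Random"): 0.80,
    ("Baseline v2", "Random"): 0.98,
    ("Baseline v2", "Baseline v1"): 0.15,
}

def rate(a, b):
    """a's win rate against b; assumes no draws when only (b, a) was recorded."""
    if (a, b) in win_rate:
        return win_rate[(a, b)]
    return 1.0 - win_rate[(b, a)]

def inversions(players, reference):
    """Pairs where the better score against `reference` belongs to the head-to-head loser."""
    found = []
    for i, a in enumerate(players):
        for b in players[i + 1:]:
            better_vs_ref = a if rate(a, reference) > rate(b, reference) else b
            head_to_head_winner = a if rate(a, b) > 0.5 else b
            if better_vs_ref != head_to_head_winner:
                found.append((better_vs_ref, head_to_head_winner))
    return found

def ensemble_score(player, opponents):
    """Average win rate over a set of opponents."""
    return sum(rate(player, o) for o in opponents) / len(opponents)

print(inversions(["Baseline v1", "Baseline v2"], "Random"))
print(ensemble_score("Baseline v1", ["Random", "Baseline v2"]))
print(ensemble_score("Baseline v2", ["Random", "Baseline v1"]))
```

Judged only against Random, v2 looks like the better agent. Averaged over both opponents, v1 comes out ahead, which is the kind of surprise a single-opponent test will never show you.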


As with any software, testing is key for IAP training. But we can’t just call QA and have human testers tell us the actual strength of an IAP. We need a good process, and patience in gathering statistics, to know where we are and how much work still lies ahead.