QA, bots, humans, and the in-between

“I think the Orcs are too strong. The short guys throwing axes need to be nerfed, let’s try 10%, no, make it 15%.” This is a very common ‘decision-making process’ when it comes to testing a game, aspiring to achieve good game balance, but most importantly, trying to make it fun.


The issue is that this is super subjective, and the real test comes when players play the game… They will find any imbalance, and they will not be shy about letting the world know.


I bring up game balance because it is particularly hard to test: it can take significant resources and precious time. It is also an area where machine learning can help.


So, let’s first explore some areas where machine learning cannot help, at least with the type of tools and methods we use at Decisive AI. We train artificial players by exposing the agent (which is basically a formula or function, however complex) to lots of games. That experience steers the agent toward outcomes we reward, like scoring points or winning a game.


We can’t do this with a few games, not even a few thousand. We need hundreds of thousands, if not millions (we have even gotten to billions!), of games played. Computers are fast at some things; learning is not one of them. Another important consideration is that training needs to be efficient, and playing on the same devices humans use is not. Instead of interpreting the screen with complex convolutional neural networks and then driving a keyboard and mouse (or touchscreen and joystick), we simply embed the agent in the game code, where it can talk directly to the game engine.
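
To make the difference concrete, here is a minimal sketch, with entirely hypothetical names (this is not our actual platform): the agent is just a function from game state to action, wired directly to an engine-side API instead of reading pixels and driving input devices.

```python
import random

class TinyEngine:
    """A stand-in for a real game engine exposing structured state and actions."""
    def __init__(self, target=10):
        self.score, self.target = 0, target

    def state(self):
        # Structured state the embedded agent reads directly -- no screen parsing.
        return {"score": self.score, "target": self.target}

    def legal_actions(self):
        return ["attack", "defend"]

    def apply(self, action):
        if action == "attack":
            self.score += 1  # the reward-relevant outcome the agent learns from

    def done(self):
        return self.score >= self.target

def agent_policy(state, actions):
    # The 'agent' is just a function from state to action; here, a random one.
    return random.choice(actions)

def play_one_game(engine):
    """Run one game to completion and return how many moves it took."""
    moves = 0
    while not engine.done():
        engine.apply(agent_policy(engine.state(), engine.legal_actions()))
        moves += 1
    return moves
```

Because the loop never touches rendering or input hardware, thousands of games like this can run far faster than real time.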


Multiplayer games are usually designed with a ‘bot’ or ‘AI’ in mind, so it isn’t hard to integrate our learning platform to play the game. Because this all happens outside the UI, it is fast and efficient.


Since our training, and the resulting agent, will not interact with the UI, look at the graphics, or experience the music and sound of the game, none of these (very important) areas of a game are tested by our agents.


But if what keeps you awake at night is whether that new axe you are planning to add as a card for the Troll faction is too strong, or that new Underworld map for your strategy game, or, in general, some new element meant to keep a good game fresh and the players engaged, then our intelligent artificial players are a fabulous option.


The first highlight is thoroughness. A learning agent constantly tries new things (exploring the space), and those that work are later exploited. If your new axe is too strong, our agents will use it to win all the time. If the axe is useless, they will try it, then abandon it, because they want to win.
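
This explore/exploit behavior can be sketched with a toy epsilon-greedy loop. The cards and their win rates below are made up for illustration; a real agent learns over full games, not single pulls.

```python
import random

random.seed(0)

# Two hypothetical cards: 'axe' wins 70% of the time, 'club' only 40%.
# The agent should discover this on its own and converge on the axe.
win_rate = {"axe": 0.7, "club": 0.4}
value = {"axe": 0.0, "club": 0.0}   # running win-rate estimate per card
plays = {"axe": 0, "club": 0}
EPSILON = 0.1                        # 10% of the time, explore at random

for _ in range(5000):
    if random.random() < EPSILON:
        card = random.choice(list(win_rate))          # explore a new thing
    else:
        card = max(value, key=value.get)              # exploit the best so far
    won = random.random() < win_rate[card]
    plays[card] += 1
    value[card] += (won - value[card]) / plays[card]  # incremental mean update
```

After a few thousand plays, `plays["axe"]` dwarfs `plays["club"]`: the overpowered card is used constantly, which is exactly the signal a designer needs.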


The next is precision. As the agent plays games, we can configure data output for reporting. Do you want to know how many games are won with the axe? You get a percentage, out of a statistically meaningful number of games. Say, for example, that you nerf a weapon: we can quickly play 10,000 games before and after the change. If the faction with the weapon wins more or loses more, you know why.


This type of quick test needs no retraining or black magic, just a batch process and a check of the results in a database. It takes minutes to do what otherwise takes weeks of expensive human testing that is often not precise enough.


I often remember when I used to play Clash Royale: every so often, a patch would come along tweaking some cards. That is the result of people carefully analyzing data from hundreds of thousands of games played by human players. If they see too many wins with the Miner, the Miner gets nerfed. How much faster, more efficient, and more precise this process would be if, instead of relying on the players, they used artificial players that don’t mind playing lots of games to find overpowered or underpowered cards.

There are many more advantages. Catching crashes, for one: since the agent plays the game a lot, and in ways humans often don’t, it is going to find crashes faster. Game length gets in the mix as well. Games often have a specific target for how long an average game should last, which is especially important in mobile games. If adding some element makes games too long or too short, we will find out very quickly with an artificial player that can run 10,000 games in a short period of time and clearly report the average number of moves per game compared with the same test before the latest changes.
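
The game-length check follows the same pattern as the win-rate batch. This sketch uses made-up length distributions; real numbers would come from the agent's played games.

```python
import random

random.seed(2)

def average_length(mean_moves, n_games=10_000):
    """Average moves per game over n simulated games (toy noise model)."""
    return sum(random.gauss(mean_moves, 5) for _ in range(n_games)) / n_games

baseline = average_length(40)    # ~40 moves per game before the change (assumed)
candidate = average_length(52)   # the new element drags games out (assumed)
drift = candidate - baseline     # flag when the design target is exceeded
```

If a mobile game is designed around, say, 40-move sessions, a drift of a dozen moves per game shows up unambiguously after one batch.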


There are areas where an artificial player won’t help with QA: graphics, UI, sound… In others, it can make a massive difference, not only in the quality of the QA but also in how much testing can be done and at what cost. This is true for testing the game engine, the balance of the game, and the impact of changes to an existing game.