As mentioned in our popular blog post ‘Providing the right AI: Levels of Difficulty as a Service’, an essential aspect of a successful Intelligent Artificial Player (IAP) is that it is not always the best player.
Unlike Google’s DeepMind (and other remarkable companies of that sort), our primary objective isn’t to advance the field of AI. We are not using games as an environment in which to test the evolution of AI in search of the perfect problem solver. We need “imperfection”: unbeatable IAPs would not only be annoying for humans to play against, but worse, they would be boring! Sure, the few master players of the world might find an unbeatable opponent fascinating, but the rest of us (99.9% of the population) prefer to have fun and win sometimes.
In that same blog post, we covered the various techniques through which we achieve IAP diversity for different types of games. By now you probably know that we use Machine Learning techniques, Deep Reinforcement Learning, and others to create and train our IAPs. Without getting too technical, we explored three different approaches for generating variety in IAPs playing a game: Interrupted Development vs. Goal Alteration vs. Next Best. In this article, continuing from that post, we will cover a fundamental shift in our training methodology.
As traditional Reinforcement Learning proposes, in the beginning we were training the IAPs against each other in order to improve their performance. All IAPs contributed to the overall learning process. That approach delivered the expected learning rate, which is quite slow and resource-heavy, mainly because the process starts with completely random players. Yes, as ridiculous as that sounds, it’s true: the starting point for traditional Reinforcement Learning, where the machine learns by playing against itself, is a complete tabula rasa (or, blank slate).
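To make the tabula rasa idea concrete, here is a toy, purely illustrative sketch (not Decisive.AI’s actual code): one policy, shared by both players, starts with uniform-random action preferences and is shaped only by win/loss reward from self-play. The game is deliberately trivial, each side picks 0, 1 or 2, and the higher pick wins.

```python
import random

random.seed(0)

ACTIONS = [0, 1, 2]          # in this toy game, the higher pick wins

def sample(weights):
    # draw one action according to the current (unnormalised) preferences
    return random.choices(ACTIONS, weights=weights)[0]

def self_play_step(weights, lr=0.05):
    # both sides use the very same policy -- that is the self-play part
    a1, a2 = sample(weights), sample(weights)
    if a1 != a2:
        weights[max(a1, a2)] += lr   # reinforce the winning move

weights = [1.0, 1.0, 1.0]    # the "blank slate": completely random play
for _ in range(5000):
    self_play_step(weights)

# after many episodes, the preference mass concentrates on the
# dominant action of this toy game
best = max(ACTIONS, key=lambda a: weights[a])
```

Even in this tiny setting you can see why the early phase is wasteful: thousands of episodes are spent just climbing out of random behaviour.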
In the scientific field, where a methodical and reproducible approach is necessary, this is very understandable: start at zero knowledge or understanding. Random is zero. And because AlphaGo Zero, trained entirely from scratch, defeated the original AlphaGo, it is clear once more that this traditional approach makes sense and works to create a superior AI within limited parameters (like in a video game). Starting from zero knowledge has the advantage of not repeating the mistakes humans made when researching the game. There is no human bias. Please note that the key here is that this approach is scientific and, as we like to call it here at Decisive.AI, purist.
Our realization was right at the root of our purpose: we are here to create IAPs that are fun to play with and against, not IAPs that are perfect. We want an IAP that plays like a human. We consider this human bias a feature, not a bug.
So instead of starting the Reinforcement Learning process with a random player, as the traditional approach does, we used a slightly good player for the initial training. This slightly good player was hard-coded manually. It is important to note that it was by no means a great player, just a slightly good one, and hard-coding it consumed minimal resources.
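What might such a hand-coded “slightly good” seed player look like? Here is a hedged sketch using tic-tac-toe as a stand-in game (the article does not name the actual games or code, so every name here is hypothetical): take an immediate win, block an immediate loss, otherwise move at random. Clearly better than random, nowhere near perfect, and cheap to write.

```python
import random

# all eight winning lines on a 3x3 board, indexed 0..8
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winning_move(board, mark):
    # return the cell that completes a line for `mark`, or None
    for line in WINS:
        vals = [board[i] for i in line]
        if vals.count(mark) == 2 and vals.count(" ") == 1:
            return line[vals.index(" ")]
    return None

def slightly_good_move(board, mark, rng=random):
    opponent = "O" if mark == "X" else "X"
    move = winning_move(board, mark)           # 1. win now if possible
    if move is None:
        move = winning_move(board, opponent)   # 2. else block the opponent
    if move is None:                           # 3. else play randomly
        move = rng.choice([i for i, c in enumerate(board) if c == " "])
    return move
```

Two short rules plus a random fallback are enough to lift play well above random, which is exactly the bar a seed player needs to clear.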
Then we set this slightly good player to play against itself, while our AI was configured simply to observe. Our AI was now learning its initial lessons by observing a slightly good player rather than a completely random one.
Right after this initial phase of Learning by Observation, the AI is put to play against itself through Reinforcement Learning. The difference is that by the time Reinforcement Learning starts, the AI is not commencing with random behaviour, but at a ‘slightly good’ level.
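The two-phase structure can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not Decisive.AI’s pipeline: the game is the same trivial “higher pick wins” game, the scripted teacher is only slightly good (it usually picks high but sometimes blunders), phase one merely counts what the teacher does, and phase two runs self-play reinforcement starting from those counts instead of from uniform noise.

```python
import random
from collections import Counter

random.seed(1)

ACTIONS = [0, 1, 2]          # higher pick wins this toy game

def teacher():
    # slightly good, not perfect: biased toward high picks, still blunders
    return random.choices(ACTIONS, weights=[1, 2, 3])[0]

def sample(weights):
    return random.choices(ACTIONS, weights=weights)[0]

# --- Phase 1: Learning by Observation -------------------------------
# the learner only watches the scripted player and tallies its choices
counts = Counter(teacher() for _ in range(1000))
weights = [1.0 + counts[a] for a in ACTIONS]   # observed baseline, not random

# --- Phase 2: self-play Reinforcement Learning from that baseline ---
for _ in range(2000):
    a1, a2 = sample(weights), sample(weights)
    if a1 != a2:
        weights[max(a1, a2)] += 0.05           # reinforce the winning pick

best = max(ACTIONS, key=lambda a: weights[a])
```

Because phase two begins at a ‘slightly good’ level rather than a blank slate, far fewer self-play episodes are spent rediscovering the basics.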
The results were astonishing.
The overall training effort was shortened significantly, now taking a quarter (yes, 25%!) of the time it used to take. And within that already shorter training effort, the resources are better allocated: where before it took 80% of the time and computing power to get from random to good, and 20% to get from good to very good, it now takes about 5% of the time and resources to get from the starting point to good, and about 80% to get from there to very good. Better-allocated resources in an already more efficient process. Mind-blowing!
Scientific advances in the AI field are justifiably invested in preventing human bias. This makes total sense. However, Decisive.AI is strategically invested in human bias. We want human-like behaviour in our IAPs. This new training method not only (i) shortens the entire process, (ii) reduces the overall effort, (iii) allocates resources to the higher stages of training rather than the starting point, and (iv) renders higher-quality IAPs overall; most significantly, (v) it introduces a very welcome human bias into the IAP’s baseline right from the start, making them quite simply more appealing to our customers: the video game companies.
We are also exploring other techniques (a bit fancier and borderline esoteric) with our AI specialists and trying them out directly in the games. Once again, the point is that we are all learning in this evolving field of AI and, paired with the fascinating video game industry, we are here to disrupt how AI is produced, provided, and serviced in digital environments.
Reminding ourselves that our Intelligent Artificial Players are perfect in their imperfection is just part of the fun, and we are loving it!