About a week ago I started my two-minute pitch saying: “Decisive AI trains and delivers intelligent agents for computer environments”. Later I realized that while this might make sense, it isn’t clear to anyone except those who already have a pretty good understanding of what training an intelligent agent actually means.
Training is key, because it separates a hard-coded opponent from one that learns. We are all very familiar with video games where a level boss does the exact same thing over and over again, regardless of what you do. In fact, beating the boss is mostly a matter of learning its pattern. This was true 30 years ago and it is true now; the best game of the year, Zelda: Breath of the Wild, is a great example of hard-coded, pattern-following level bosses.
A hard-coded agent needs a programmer to code its behaviour. So it is all about how well the programmer codes this behaviour, and then that’s it: the agent will never do anything else.
In a video game, intelligent agents acting as opponents can be either hard-coded or trained.
A trained agent does not need a programmer to code any behaviour. Instead, it learns by playing the game. Humans also learn this way, but computers need a lot more playing because they start from scratch. Even if you have never played a particular game, you have probably played some other games; and even if you haven’t, games try to represent something from the physical world, so it is not totally new to you. You start with a base of knowledge, which is a big part of why humans learn so much faster.
Okay, so a trained agent will learn by playing the game. This needs two things: a ‘brain’ to learn with, and a way to play the game many, many times. The ‘brain’ part is solved with neural networks and learning algorithms. Easier said than done: it took us about two years to reach a point where we can design the neural network and select, implement, and tune a learning algorithm.
Eventually, what is needed is a “Learning Platform”: a computer environment that facilitates the training by playing the game and feeding the results to the learning algorithm, which then tweaks the neural network a little for every move of the game played.
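To make that loop concrete, here is a deliberately tiny sketch of the idea, not our actual platform: the ‘game’ has one move, the ‘neural network’ is a single value per move, and every finished game is fed back to a learning step that nudges those values toward the rewards earned.

```python
import random

random.seed(0)  # reproducible toy run

class Agent:
    """Toy agent: its 'brain' is one value per move, a stand-in for
    the neural network, small enough to read in one glance."""
    def __init__(self, moves):
        self.values = {m: 0.0 for m in moves}

    def move(self, explore=0.1):
        if random.random() < explore:                  # occasionally try anything
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)   # otherwise best known move

def play_one_game(agent):
    """One tiny 'game': the agent picks a move, and only 'win' pays off."""
    m = agent.move()
    reward = 1.0 if m == "win" else -1.0
    return [(m, reward)]        # the game's moves and their results

def learn(agent, trajectory, lr=0.2):
    """Feed the played game back to the learner: nudge each move's
    value toward the reward it earned."""
    for m, r in trajectory:
        agent.values[m] += lr * (r - agent.values[m])

agent = Agent(["win", "lose"])
for _ in range(200):            # 'playing the game many, many times'
    learn(agent, play_one_game(agent))
```

After a couple of hundred games the agent's value for the winning move climbs toward the reward, which is the whole loop in miniature: play, feed back results, tweak, repeat.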
The basic concept of this training is that there is a reward the algorithm wants. For games, this is typically winning, but it doesn’t have to be. I can set things up so the algorithm gets a reward of +1 when winning, 0 when tying, and -1 when losing. The algorithm wants to maximize the reward (like dogs with treats, or humans with medals).
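As a minimal sketch (the function name and outcome labels are mine, not from any particular library), that reward scheme is just a mapping from outcomes to numbers:

```python
def reward(outcome: str) -> int:
    """The win/tie/lose reward scheme described in the text:
    +1 for a win, 0 for a tie, -1 for a loss.
    The learning algorithm tries to maximize this number."""
    return {"win": 1, "tie": 0, "loss": -1}[outcome]
```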
Setting the reward sounds easy, but it isn’t as trivial as it seems. I remember talking with a game producer who told me he had seen a video of a Mario game played by an AI. It played very well; it always won. But when it played, it would constantly make a weird twisting jump that looked really bad. It was effective at winning, but it didn’t play ‘naturally’ or ‘human-like’.
I am pretty sure that the AI was trained with a reward related only to winning the levels, which makes sense. But if the person judging the behaviour expects something else, then the rewards have to reflect this. If the twisting jump is considered odd, then a small negative reward for doing it would take care of the problem. Or, if time matters (fastest or slowest), you can tweak the reward accordingly.
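A hedged sketch of what that kind of reward shaping could look like; the penalty and bonus values here are invented for illustration, and in practice picking them is itself part of the trainer’s job:

```python
def shaped_reward(outcome: str, did_twist_jump: bool, seconds: float) -> float:
    """Win/tie/lose reward plus two illustrative shaping terms:
    a small penalty for the odd-looking twisting jump, and a small
    bonus for finishing the level quickly. Values are made up."""
    base = {"win": 1.0, "tie": 0.0, "loss": -1.0}[outcome]
    style_penalty = -0.05 if did_twist_jump else 0.0   # discourage 'unnatural' moves
    time_bonus = 10.0 / max(seconds, 10.0) * 0.1       # capped mild preference for speed
    return base + style_penalty + time_bonus
```

The shaping terms are kept much smaller than the win/lose signal on purpose: the agent should still prioritize winning, and only prefer natural-looking, fast play among equally winning strategies.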
But notice that I’m talking about changing the reward to represent what I want as a trainer, not about changing any algorithms. Training an agent is not a ‘fire and forget’ process, but it is completely different from coding an agent.
When the agent is going to play a multiplayer game as a player (like a human would), it is best to train it by playing against itself. While training, we also need to know how the learning is going, so the learning platform tells us how the learning agent performs against a baseline agent. The baseline agent can play using a hard-coded algorithm or even at random; whatever tells us how well our learning agent is doing.
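A sketch of that validation step; `play_game` and the policy arguments are placeholders standing in for whatever game and agent interfaces the platform uses, not a real API:

```python
import random

def random_baseline(legal_moves):
    """The simplest possible baseline: play any legal move at random."""
    return random.choice(legal_moves)

def win_rate(agent_policy, baseline_policy, play_game, n_games=1000):
    """What the platform reports after each validation round: the
    fraction of games the learning agent wins against the baseline.
    play_game is assumed to return 'agent' or 'baseline'."""
    wins = sum(play_game(agent_policy, baseline_policy) == "agent"
               for _ in range(n_games))
    return wins / n_games
```

Plotting this win rate after every batch of training games is what produces curves like the one below.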
Here is an example from an agent we were training, with a validation run every 10,000 training games:
Note the yellow curve. The learning agent starts by losing most of its games, but then it gets better until it wins quite consistently. The progress isn’t linear: sometimes the agent learns something that leads to poor decisions, and it takes more games to unlearn it. This is especially true for games with lots of randomness, where you can win by pure luck or lose while playing great; either way, you learn the wrong lesson.
Training agents makes for better agents, whatever ‘better’ means: playing well, playing fast, playing okay, etc. In a video game, you don’t want a super strong opponent that wins all the time. But the best thing is that the agent is not stuck doing the same thing forever: it can keep learning from new games played against humans.
A big obstacle for training agents is how hungry they are for playing. In my experience, an AI needs 10,000 games of practice for each game a human needs. Yes, it can be that inefficient. This leads to two questions: why is it so bad at learning, and how can it play so much in a reasonable amount of time? I’ll address these in my next blog post. Until then!