  • Anish Diwan

Imitation learning (behavior cloning) in driving games


Video games have been around ever since personal computers first came onto the market. From simple titles such as Pong to involved, strategy-heavy games such as Counter-Strike, video games have evolved alongside their supporting computer technology. My goal for this blog is to connect the two ends of that spectrum and come full circle to the idea of computer software playing computer games on your personal computer. This blog explains how I attempted to develop an AI model that learns to play the 1995 video game Road Rash.


The Road Rash AI

The way to do this is not unlike how you might go about teaching your little brother to play a video game. You show the AI what’s happening on screen, you give it control over keyboard inputs, and then you sit beside it and teach it by example. The more technical term for this process is supervised learning. More specifically, the problem in this case is defined as a supervised learning classification problem: the AI model is given a bunch of labelled data and must learn the features of the input data in such a way that it can accurately predict the labels of arbitrary unlabelled testing samples. In our case, these labels are keyboard inputs such as turn left, turn right, or keep going straight.
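As a concrete picture of what “labelled data” means here, a training set can be sketched as (frame, label) pairs, where each label is an index into the set of keyboard actions. This is only an illustration: the class names and random frames below are stand-ins, not the project’s actual data.

```python
import numpy as np

# Illustrative action classes; the real project uses five driving
# classes, but three are enough to show the frame -> label pairing
ACTIONS = ["left", "straight", "right"]

# A hypothetical dataset: each sample is a grayscale game frame
# (random noise standing in for a real screenshot) plus the index
# of the key that was pressed when that frame was captured
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(4, 60, 80), dtype=np.uint8)
labels = np.array([0, 1, 1, 2])  # left, straight, straight, right

for frame, label in zip(frames, labels):
    print(frame.shape, ACTIONS[label])
```

The classifier’s job is then simply to map an unseen frame to the most likely index in `ACTIONS`.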


Finally, why just driving games? It is not a stringent requirement that driving-related decision making must use supervised learning. However, supervised learning has shown significantly better results in driving-related scenarios in the current literature. One reason supervised learning-based classification does so well in driving scenarios is that driving (especially in video games, where the world is far less unpredictable than real life) requires a fairly simple strategy: stay on the road, stay in your lane, and avoid traffic. In contrast, open-world games or games with sequential, complex objectives invite strategies that can rarely be summed up by a simple classification of in-game decisions. For example, in a game such as Road Rash, if the model learns to stay on the road and avoid incoming vehicles, then it does a fairly good job of driving around. In a game such as Mario, however, the player must execute very specific and correct sequences of in-game decisions to complete even the smallest objectives (such as killing a turtle or getting a power-up). In such a situation, supervised learning-based classification is just not good enough, and methods like reinforcement learning might be more effective.


The Game

Road Rash is honestly one of those games that will probably never cease to be entertaining. Although it is now 27 years old, it still feels more fun than the best modern racing games. From a project point of view, Road Rash is both simple enough to be viable for supervised learning and complex enough to be a challenge. The driving aspect of the game is fairly easy and leads to only five possible input command classes (accelerate, turn left, turn right, accelerate while turning left, and accelerate while turning right). Furthermore, Road Rash also has a combat element that lets the player kick or punch an adjacent rider. There’s just something strangely fun about an AI model smacking an opponent while trying to drive xD. That said, you can use most of these methods/code in any driving or non-driving game. Just be sure that the in-game decisions are not heavily strategy dependent and can be executed with a fairly small number of inputs (prediction classes).


The Supervised Learning Process


The general procedure for developing a supervised learning AI model for classification is as follows. This methodology is obviously not restricted to gaming decision classification and can be used for essentially any kind of classification. The rest of this blog, though, is structured around the problem of using CNNs to classify in-game frames into in-game decisions. Also, a quick side note: a lot of the methods and code used in this project were built upon the “Python Plays GTA5” project from the YouTube channel Sentdex. You should definitely watch that series.

Supervised Learning Process Flowchart


Problem Definition & Data Collection

Road Rash AI Training Classes

In our case, the problem is mostly already defined. Our input data is going to be frames from the actual game. Each frame is accompanied by a label that indicates which in-game decision was made when that frame was recorded. The five possible in-game decisions are straight, left, right, and the combinations of straight with left and right. Braking was intentionally omitted from the in-game choices, as Road Rash does not really require you to brake that often. The remaining choices are related to combat. I decided to train a separate model just for combat, since combat follows an entirely different strategy from driving and does not depend on any driving-related patterns. Additionally, having a separate model for combat means that the driving decisions are restricted to only a few classes, so the driving data is not rendered statistically insignificant by the addition of combat decisions. The training process for both models is essentially the same.
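To make the five driving decisions usable as CNN targets, each recorded keypress can be one-hot encoded. A minimal sketch (the class ordering and names below are my assumption, not necessarily what the project uses):

```python
import numpy as np

# The five driving decisions; ordering here is illustrative
CLASSES = ["straight", "left", "right", "straight_left", "straight_right"]

def one_hot(class_index, num_classes=len(CLASSES)):
    """Return a one-hot label vector for the given decision index."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[class_index] = 1.0
    return vec

# e.g. "accelerate while turning left" -> index 3
label = one_hot(CLASSES.index("straight_left"))
print(label)  # [0. 0. 0. 1. 0.]
```

The combat model would use the same encoding over its own, separate set of classes.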


The in-game frames are collected through cv2 and pywin32, the Python wrapper for the Windows API. Most of the screengrab code is essentially the same as the code used by Sentdex in his series. Since Road Rash is 27 years old, there is no real straightforward way to play it. What I ended up doing was obtaining an original game copy and running it using compatibility mode on Windows 10. Seriously though, compatibility mode is such a beautiful feature. Props to Windows for being the only OS to enable its users to run the most vintage software in the world. But I digress. The only caveat is that compatibility mode for Windows 95 only runs full screen, so you can’t switch to other Windows 10 apps while you are playing the game. I solved that issue simply by running a timer on the screengrab and recording for a fixed amount of time (3 minutes in my case, as that was the average time it took me to finish a race).
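The timed recording loop itself is straightforward. In the sketch below, the actual pywin32/cv2 screen grab (which follows Sentdex’s code) is replaced by a placeholder so the timing logic runs anywhere; `grab_screen`, the frame rate, and the frame size are assumptions for illustration.

```python
import time
import numpy as np

def grab_screen():
    """Placeholder for the pywin32/cv2 screen capture; returns a
    blank grayscale frame instead of reading the real framebuffer."""
    return np.zeros((600, 800), dtype=np.uint8)

def record_session(duration_s=180, fps=10):
    """Capture frames for a fixed duration (3 minutes = one race)."""
    frames = []
    end_time = time.time() + duration_s
    while time.time() < end_time:
        frames.append(grab_screen())
        time.sleep(1.0 / fps)  # throttle to roughly `fps` captures/sec
    return frames

# Short demo run (half a second) rather than a full 3-minute race
clip = record_session(duration_s=0.5, fps=10)
print(len(clip), clip[0].shape)
```

Because the game owns the whole screen, the loop is started just before the race begins and simply expires on its own.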


Post Processing


A lot of the data being recorded is simply not required to accurately train the CNN model. This bulky data translates to slower processing times and slows down the whole training pipeline. Hence, the data is cut down and resized as much as realistically possible. In my case, I converted the RGB images to grayscale, cropped the speedometer out of the frames, and resized the images to a smaller dimension (while still maintaining the aspect ratio). All of this was done very simply using OpenCV. The processed frames were then stored in a NumPy array and saved using Pickle. Finally, before training, I shuffled the data and set aside 20% of it for out-of-sample testing.
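In the project these steps are done with OpenCV; the NumPy stand-ins below sketch the same pipeline (grayscale, crop, downsample, shuffle, 80/20 split) without needing real screenshots. The crop region and target size are illustrative, and a real implementation would use `cv2.cvtColor` and `cv2.resize` instead.

```python
import random
import numpy as np

def preprocess(frame_rgb, crop_bottom=40, step=4):
    """Grayscale -> crop out the speedometer -> downsample."""
    # Luminance-weighted grayscale (what cv2.cvtColor computes)
    gray = frame_rgb @ np.array([0.299, 0.587, 0.114])
    # Drop the bottom rows where the speedometer sits (region is illustrative)
    cropped = gray[:-crop_bottom, :]
    # Naive strided downsampling; cv2.resize would interpolate instead
    return cropped[::step, ::step].astype(np.uint8)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
small = preprocess(frame)
print(small.shape)  # (50, 80)

# Shuffle, then hold out 20% for out-of-sample testing
data = [preprocess(rng.integers(0, 256, (240, 320, 3), dtype=np.uint8))
        for _ in range(10)]
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
print(len(train), len(test))  # 8 2
```

Striding by 4 keeps the aspect ratio automatically, which mirrors the constraint mentioned above.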

Road Rash AI Image Pre-Processing

Model Definition, Training & Evaluation


The CNN architecture I chose to go with is called AlexNet. It was one of the most groundbreaking architectures and, when it was introduced, the best neural network architecture for image classification problems. A few other architectures now outperform AlexNet, but they tend to be much more resource-intensive. Sentdex, on his GitHub, presents a few other alternatives, so be sure to check those out too. AlexNet can be easily implemented using the Python package PyTorch. I trained the model in Google Colab with the free GPU. Since the compute time was restricted to five hours (and maybe because of my small dataset), my out-of-sample accuracy maxed out at 58%. However, the TensorBoard graphs clearly indicate that the model was inching towards higher accuracies and lower losses. I might try to train it locally with a better model in the future. For now though, these results translate to “better than expected” performance when the model is run in the game.
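As a sketch, an AlexNet-style classifier for the five driving classes can be set up in PyTorch roughly as follows. The single grayscale input channel and the 5-class head match the setup described above; the exact layer sizes here are assumptions for illustration, not the trained model’s.

```python
import torch
import torch.nn as nn

class RoadRashNet(nn.Module):
    """AlexNet-style CNN: stacked conv/pool feature extractor
    followed by fully connected layers and a 5-class output."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Adaptive pooling makes the net agnostic to the frame size
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

# One forward pass on a dummy batch of two grayscale frames
model = RoadRashNet()
logits = model(torch.zeros(2, 1, 120, 160))
print(logits.shape)  # torch.Size([2, 5])
```

At run time, `logits.argmax(dim=1)` picks the keyboard input to press for each captured frame.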


Road Rash AI model metrics



