Anish Diwan

Deep RL agents that learn behavior from human demonstrations (DQfD in minecraft)

GitHub Repo (for implementation details, please refer to the appendix)

Have you ever played Minecraft? If you have, then you might already have some idea about its vast, open-world, exploratory characteristics. If you haven't, then here's a quick summary. Minecraft is a sandbox-style video game where players can do all kinds of things. Fundamentally, a Minecraft player gathers resources, mines minerals, crafts items, and learns to survive in the wild. The tasks required to do this are generally sequential in nature and get more and more challenging as the player progresses in the game. For instance, at the start of the game, the player does not have much in the way of tools and has to first chop down trees to get wood. They can then craft items using this wood to make tools and weapons. These tools let them mine minerals and make better tools, build a house, or try to get diamonds (and who doesn't like diamonds!). All of this requires a ton of exploration and sequential decision making.


These vast opportunities for exploration and sequential decision making are the key factors that draw millions of people to Minecraft, and they are the same factors that make Minecraft one of the best platforms for research on intelligent agents. Microsoft Research feels the same way and has developed various software packages and wrappers around Minecraft that enable running experiments in this environment. As an extension of this, NeurIPS hosts an annual research challenge [1] that tasks participants with training agents that solve benchmark tasks in Minecraft -- tasks that are pretty simple for human players but can prove very challenging to teach agents, especially in a sample-efficient manner.


My goal with this article is to play around with the mineRL software package used in the NeurIPS challenge and train an agent to solve some of the simpler tasks in Minecraft using a small set of human demonstration data. I plan to use Deep Q-Learning from Demonstrations (DQfD) [2]. This is an extension of DQN that injects human demonstrations into the experience replay to steer the agent's policy towards the demonstrated behaviour.


Description: Policy learnt towards the end of training using the deep Q-learning from demonstrations algorithm

The next few sections discuss the mathematical background of this work, the DQfD algorithm, the training process, results, and a discussion. The implementation details (including installation, datasets, model specifics and more) are covered in the appendix at the end of the article.


Background


MDPs & RL Framework

Most reinforcement learning algorithms adopt the standard Markov Decision Process formalism as described in [3]. An MDP is a 5-tuple ⟨S, A, R, T, γ⟩ where

  • S is a set of states

  • A is a set of actions

  • R(s, a) is a reward function

  • T(s, a, s') = P(s' | s, a) is the transition function, and

  • γ is a discount factor

The MDP framework works in discrete time. The agent is initialized in some state s_t and takes an action a_t (where t is the current time step; t = 0 when the agent is initialized). Upon taking the action, the agent gets some reward r_{t+1} as per the reward distribution and is then taken to the next state s_{t+1} as per the transition function. The discount factor γ is used to control the agent's outlook on future rewards. Successive rewards are discounted to change the degree to which the agent prioritizes immediate rewards as compared to rewards obtained in the future. This is often very useful in cases where rewards are sparse and the reward obtained at a far-off point in the future is to be attributed to a "good" action in some past state.


A policy π determines the agent's action at every step. The policy is a mapping from states to actions. MDPs operate under the Markov assumption. This states that the agent's future state is conditioned only on its current state and action and NOT on any past history of states and actions (first-order Markov). This essentially implies that the only information needed to take optimal actions is the agent's current state and not its previous states. The agent's goal is to converge to an optimal policy that maximises expected discounted future reward. In simple terms, the agent's goal is to find a mapping π that maps every state to an optimal action such that the expectation over future discounted rewards is maximised.


The agent keeps track of the "value" of every action in every state based on the expected reward obtained on taking that action. This value is used to compare multiple actions in a state and subsequently pick the best one (or potentially explore more by choosing an action non-greedily). The action value function Q_π(s, a) maps to the value of picking action a in state s and is defined as the expected future reward that can be obtained from this state on picking this action and then following policy π for subsequent actions. The optimal policy is π*(s) = argmax_a Q*(s, a), where Q* is the optimal action value function.
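To make the action value idea concrete, here is a minimal, hypothetical sketch of tabular Q-learning with epsilon-greedy action selection. This is not code from this project; the number of actions, learning rate, and exploration rate are all placeholders.

import random
from collections import defaultdict

# Hypothetical constants for illustration only
N_ACTIONS = 4
GAMMA = 0.99   # discount factor
ALPHA = 0.1    # learning rate
EPS = 0.05     # epsilon for epsilon-greedy exploration

# Q[state] -> list of action values, initialised to zero
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def select_action(state):
    """Epsilon-greedy action selection from the current action values."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def q_update(state, action, reward, next_state):
    """One-step Q-learning update towards r + gamma * max_a' Q(s', a')."""
    td_target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (td_target - Q[state][action])

The table Q here is exactly the thing that becomes unmanageable in large state spaces, which is where the next section picks up.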


Deep Q-Networks (DQN)

Before moving ahead with DQfD, we first need to understand a very successful Q-learning algorithm called DQN [4]. DQfD borrows a lot of components from DQN and makes modifications that greatly improve its performance. A major issue with the reinforcement learning framework described in the previous section is that as the dimensionality of the state space and action space increases, it becomes increasingly difficult to store and quickly retrieve action values for every action in every state. Take Minecraft for example. Here, the agent's state is a 64 x 64 x 3 image and the agent can pick a combination of 9 actions. This means that every observation is part of a 12,288-dimensional space and is one of 256^12288 possible states. For each of these states, the agent is expected to keep track of (and iteratively update) 9 action values (assuming that the agent can only take one action at a time and not simultaneous actions -- something that mineRL allows). This is just way too much information to model tabularly.


This is where deep learning comes into play. DQN approximates the action value function using a deep neural network. This means that, instead of manually keeping track of all state-action pairs, we train a neural network to approximate the value function. Q(s, ·; θ) hence refers to the value function parameterized by the network parameters θ. The network outputs the action values for all possible actions in state s. DQN also introduces the concept of a replay memory D_replay that stores previously sampled transitions ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩. This speeds up learning as the agent can always sample from the replay memory to update its Q network. This is also the part of DQN upon which DQfD mainly improves. Apart from the replay memory, DQN is also implemented with a frozen target network. An issue with this setup is that the network is iteratively optimized towards a target that comes from the same network. This could potentially cause chaotic and undirected updates. The frozen target method simply makes a copy of the Q network every τ steps and uses this "frozen" network as the target to optimize the actual Q network. Below is a sketch of the DQN algorithm.
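To make these moving parts concrete in code as well, here is a minimal sketch of a single DQN update step in PyTorch. The tiny fully connected network, transition shapes, and hyperparameters are placeholders rather than anything used for the Minecraft agent.

import random
from collections import deque

import torch
import torch.nn as nn

# Minimal sketch of one DQN update; the network, state size (16) and action
# count (4) are placeholders, not the Minecraft setup.
GAMMA = 0.99
BATCH_SIZE = 32

q_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(q_net.state_dict())   # "frozen" copy of the Q network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

# Replay memory of (s, a, r, s_next, done) tensors; a is a long scalar, done a float scalar
replay_memory = deque(maxlen=100_000)

def dqn_update():
    """Sample a minibatch from the replay memory and take one gradient step."""
    batch = random.sample(replay_memory, BATCH_SIZE)
    s, a, r, s_next, done = [torch.stack(x) for x in zip(*batch)]

    # Q(s, a; θ) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the frozen network: r + γ * max_a' Q_target(s', a')
    with torch.no_grad():
        target = r + GAMMA * (1 - done) * target_net(s_next).max(dim=1).values

    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every τ steps, the frozen network is refreshed:
# target_net.load_state_dict(q_net.state_dict())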



Deep Q-Learning from Demonstrations


A pitfall of DQN is that it relies solely on the agent's exploration to "find" a good policy. It requires the agent to "stumble" across highly rewarding actions in the state space and then exploit those trajectories sufficiently so that the network learns to approximate the optimal action value function at those states. This unfortunately needs an extremely high number of training steps. It is often the case that the algorithm gets "stuck" in sub-optimal policies until it eventually samples a considerable proportion of all possible trajectories. DQfD solves this issue by pre-training the agent using demonstration data. A DQfD agent learns to approximate the optimal action value function by minimizing some notion of loss between its current policy and the policy of the demonstration data. After the agent's policy has been sufficiently pre-trained, the agent continues to improve it by collecting experience in the environment and then updating its Q network with both self-collected experience and demonstrations. DQfD also introduces its own loss function and utilizes prioritized experience replay to balance sampling between self-collected experience and demonstration data. The key novelties of DQfD are as follows.

  • A permanently retained replay memory with transitions from the demonstration data. Pre-training solely on demonstration data before interacting with the environment.

  • A novel definition of the loss, J(Q) = J_DQ(Q) + λ₁ J_n(Q) + λ₂ J_E(Q) + λ₃ J_L2(Q), comprising a double Q-learning loss, an n-step double Q-learning loss, a large margin classification loss, and an L2 regularization term (see the sketch after this list).

  • The idea of the large margin supervised classification loss is to push the value of the demonstrator's action above the values of other actions in the same state.

  • The n-step double Q-learning loss helps the value of the demonstrator's actions propagate further backwards along the trajectory.

  • Demonstrator and self-collected experience transitions are weighted to prioritize the sampling frequency of the demonstrator.
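For illustration, the following is a rough PyTorch sketch of how these loss terms could be combined for a batch of demonstration transitions. The networks, margin value, lambda weights, and the pre-computed n-step quantities are placeholders; the simplifications actually used in this work are described in the appendix.

import torch
import torch.nn as nn

# Rough sketch of the DQfD loss for a batch of demonstration transitions.
# The margin, lambda weights, and the n-step inputs are placeholders.
MARGIN = 1.0
GAMMA = 0.99
LAMBDA_N, LAMBDA_E, LAMBDA_L2 = 1.0, 1.0, 1e-5

def dqfd_loss(q_net, target_net, s, a_expert, r, s_next, done, n_return, s_n, gamma_n):
    q_all = q_net(s)                                          # Q(s, ·) for all actions
    q_sa = q_all.gather(1, a_expert.unsqueeze(1)).squeeze(1)  # Q(s, a_E)

    # 1-step double Q-learning loss: action picked by the online net, valued by the target net
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        td_target = r + GAMMA * (1 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
    j_dq = nn.functional.mse_loss(q_sa, td_target)

    # n-step double Q-learning loss (n_return, s_n and gamma_n = γ^n assumed pre-computed)
    with torch.no_grad():
        a_star_n = q_net(s_n).argmax(dim=1, keepdim=True)
        n_target = n_return + gamma_n * target_net(s_n).gather(1, a_star_n).squeeze(1)
    j_n = nn.functional.mse_loss(q_sa, n_target)

    # Large margin supervised loss: push the expert action's value above all other actions
    margins = torch.full_like(q_all, MARGIN)
    margins.scatter_(1, a_expert.unsqueeze(1), 0.0)           # zero margin for the expert action
    j_e = ((q_all + margins).max(dim=1).values - q_sa).mean()

    # L2 regularisation on the network weights
    j_l2 = sum((p ** 2).sum() for p in q_net.parameters())

    return j_dq + LAMBDA_N * j_n + LAMBDA_E * j_e + LAMBDA_L2 * j_l2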



The Minecraft Environment


The mineRL package offers a multitude of environments. In this article, I will be using one of the simpler environments called TreeChop. As mentioned earlier, chopping down trees to obtain logs is one of the most fundamental tasks in Minecraft. Wood logs allow you to craft tools and progress in the game. In this environment, the agent's goal is to obtain 64 logs by chopping down trees. The agent begins in a forest biome with an axe that it can use to chop down trees. The agent is given +1 reward for obtaining each unit of wood, and the episode terminates once the agent obtains 64 units. The agent's observation is just a 64x64 RGB point-of-view image similar to the one shown below.



mineRL encodes actions as dictionaries. This allows users to experiment with a variety of RL algorithms, even for continuous action spaces. In the TreeChop environment, the agent's action dictionary is like the one shown above. However, the deep network within DQfD returns a finite set of discrete actions. Further, the number of possible combinations of all actions is too large, and keeping them all would drastically complicate the learning process. It is hence necessary to sub-sample from this set of action combinations so that the model only returns those actions which are the most likely to occur and which result in tangible progress towards the goal. In the end, a subset of 14 basic actions and action combinations was retained. These actions are shown in the figure above.
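To give a feel for the action interface, here is roughly what stepping the environment with a dictionary action looks like in mineRL, together with a purely illustrative (and much smaller than 14-entry) mapping from discrete network outputs back to action dictionaries. The full action table lives in the repo; treat this as a sketch only.

import gym
import minerl  # registers the MineRL environments

env = gym.make('MineRLTreechop-v0')
obs = env.reset()

# Start from a "no-op" action dictionary and switch on the desired sub-actions
action = env.action_space.noop()
action['forward'] = 1
action['attack'] = 1
action['camera'] = [0.0, 5.0]   # continuous [pitch, yaw] deltas in degrees

obs, reward, done, _ = env.step(action)

# Purely illustrative (NOT the full 14-action set from the repo): mapping a
# discrete network output back to a mineRL action dictionary
DISCRETE_ACTIONS = {
    0: {'forward': 1},
    1: {'attack': 1},
    2: {'forward': 1, 'jump': 1},
    3: {'forward': 1, 'attack': 1},
    4: {'camera': [0.0, 10.0]},   # turn the camera to the right
}

def to_env_action(index):
    act = env.action_space.noop()
    for key, value in DISCRETE_ACTIONS[index].items():
        act[key] = value
    return act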


Restricting the agent to a smaller set of actions causes a mismatch with the actions in the demonstration data. The players that collected these demonstrations were not restricted in terms of their action choices, and the demo set contains actions in the form of the full mineRL action dictionary. Fortunately, the retained subset of actions accounts for a large majority of the demo data actions. To handle the mismatch, the demo actions were mapped to the agent's available actions. The details of this mapping are explained in the appendix.



Results & Discussion

Description: The agent's policies during learning (top left, top right, bottom left, bottom right: episodes 5, 10, 15, 20)

The model architecture, hyperparameters, and other specifics are provided in the appendix. This section covers the training and results of the DQfD algorithm in the TreeChop environment.


The algorithm seems to learn quite rapidly. The agent first quickly learns a policy of running forward and attacking. This causes it to obtain wood blocks if a tree is right in front of it. However, it often starts digging down and gets stuck in unrecoverable positions. A few more episodes of learning get the agent to a policy where it does not simply attack all the time but only when there are leaves or tree trunks in front of it. However, at this stage, the agent has not learnt to navigate or look around and gets stuck if it cannot attack something directly in front of it.

This policy further improves as the agent's pre-training phase is completed. Around this point, the agent is better able to navigate and attack tree trunks. However, it still gets stuck in awkward positions and has not learnt recovery behaviours. Learning recovery behaviours seems to be a challenging problem in general as the demo data does not really contain a large sample of recovery behaviour (mostly because experts rarely get stuck in such positions). Learning these behaviours would require the agent to explore much more on its own. Towards the end of learning (this experiment was conducted for 20 episodes -- which frankly is not a lot), the agent has further improved its navigation abilities. It still gets stuck, but a bit less than before. Further, it can also recover from positions that are not completely unrecoverable (mostly by jumping over obstacles or attacking to clear a path in front of it).

What's interesting to note is that the agent often attacks leaves instead of tree trunks. The current hypothesis for this is that in a lot of the demo data (and even normal exploration), a wood block is obtained after first clearing out the tree's leaves. Attacking leaves therefore seems to lead to a high probability of reward, and the agent probably does this because the policy network gives this action a high value, seeing as it leads to rewards down the line. However, this preference for leaves leads the agent to ignore tree trunks that are right in front of it. Further learning with more exploration could potentially increase the action value of attacking tree trunks directly instead of leaves and help solve this issue.


Meanwhile, the loss seems to go down roughly linearly and indicates fairly quick learning with good generalisation. However, it seems that 20 episodes are not nearly enough to learn to solve the TreeChop environment. I will continue to experiment with this in the future. For now though, these results seem quite promising and suggest that more training can definitely improve the learnt policies.



References


[1] Neural Information Processing Systems Foundation. NeurIPS 2021 Competition Track. NeurIPS 2021 [Internet]. NeurIPS; 2021 [cited 2023 Mar 20]. Available from: https://neurips.cc/Conferences/2021/CompetitionTrack

[2] Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, Horgan D, Quan J, Sendonaris A, Osband I, Dulac-Arnold G. Deep q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence 2018 Apr 29 (Vol. 32, No. 1).

[3] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018 Nov 13.

[4] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. 2013 Dec 19.

[5] Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. 2015 Sep 9.

[6] Wang Z, Schaul T, Hessel M, Hasselt H, Lanctot M, Freitas N. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning 2016 Jun 11 (pp. 1995-2003). PMLR.

[7] labml.ai repository. Labml.ai annotated PyTorch paper implementations [Internet]. Annotated PyTorch Paper Implementations. GitHub; 2021 [cited 2023 Apr]. Available from: https://nn.labml.ai/


-----------------------------------------------------


Appendix


Installing MineRL

To be very honest, mineRL isn't the best documented package. As a complete beginner, you might find it a little ambiguous and hard to install. There are multiple versions (with some that are deprecated but seem to still be functional), each of them has different installation and running instructions, and some just result in errors halfway through. Below are some tips and tricks that helped me get this stuff installed.


I chose to use Windows 10 with WSL2 (Windows Subsystem for Linux). WSL is one of the best recent additions to Windows and greatly simplifies things in terms of software development. The recent update also enables you to run GUI apps directly from the virtual OS. My suggestion is to use Linux or WSL (mineRL is supposedly tested on WSL2 so this has the best chance of working). PS: I would not recommend virtual machines; I did try installing it on a VM but it's far too slow to be practical.


If you just google "mineRL installation", you'll probably stumble across some past, non-functional version. You can refer to this GitHub repo for the installation instructions. It points to the different versions and their docs and is the best source of information that I could find. If you plan to replicate my work then you should install v0.4. This is a slightly older version but is much better documented and has support for a lot more environments. The latest version (at least at the time of writing this article) is lacking a lot of nice functionality. Below is a step-by-step guide on getting this stuff to work on WSL2.


1. Install WSL (I used it with Ubuntu 18.04 but it should be good with 20.04 too).

2. Enable GUI apps on WSL (this should be enabled by default but you will need to install the vGPU drivers for your specific system). Follow the vGPU tutorial and remember to update and shut down WSL if this applies to your case.

3. Before installing mineRL, you might want to upgrade your pip and pip3 versions. This helps avoid a future issue with opencv-python (module error with skbuild). To do this, first install pip and pip3 and then upgrade them as follows. You might also need to install pyglet (or downgrade to an older version).

$ sudo apt install python-pip
$ sudo apt install python3-pip
$ pip install --upgrade pip
$ pip3 install --upgrade pip

$ pip install pyglet==1.2.4

4. Install mineRL as per the v0.4 docs.

5. Since GUI apps are enabled, you will not need xvfb. However, you might still get errors saying that a desktop was not found. I found that this solution worked for me. Just install xorg via

$ sudo apt-get install xorg openbox

You can use the following python script to test your installation

import gym
import minerl

import logging
logging.basicConfig(level=logging.DEBUG)

env = gym.make('MineRLNavigateDense-v0')

obs = env.reset()

done = False

print("Reset Successful!")

while not done:
    action = env.action_space.sample()
    obs, reward, done, _ = env.step(action)
    print(f"reward: {reward}")
    print(f"action: {action}")
    print("-----")

You should see a tiny Minecraft window and the rewards and actions should get printed to the console. Congratulations! You've managed to install mineRL :)


Finally, install PyTorch using the installation command for your specific system. As of now, PyTorch only supports CUDA 11.7 and 11.8. In my case, the 11.7 version works best (you don't need to install the CUDA toolkit or any other drivers as PyTorch ships with its own CUDA runtime). You can test your PyTorch installation (and whether CUDA is enabled) with the following script.

import torch
x = torch.rand(5, 3)
print(x)
print(torch.cuda.is_available())

Downloading & Sampling Demonstration Data

DQfD works by leveraging human demonstrations. Luckily, mineRL provides one of the biggest imitation learning datasets, with a ton of demonstrations for a bunch of tasks in Minecraft. For the purposes of this project, we'll be considering the TreeChop environment. You can download and then sample data as follows.


1. Make a new directory to house your data. Say the path of this directory is "/home/mineRL/datasets"

2. Set the environment variable MINERL_DATA_ROOT to point to that path. You can do this by adding the following to your .bashrc.

export MINERL_DATA_ROOT=/home/mineRL/datasets

3. Download the data with

$ python3 -m minerl.data.download --environment "MineRLTreechop-v0" 

4. Quick tip: This version of mineRL is from 2021 and depends on a similarly dated version of opencv. This means that the data sampling functions use some deprecated methods from opencv. You can revert to opencv v4.5.4.60 using the following command

$ pip3 install --upgrade opencv-python==4.5.4.60

5. Now try sampling from this dataset using the BufferedBatchIter iterator. The following code should print out some states and rewards to the console. Here, "iterator.buffered_batch_iter" returns a Python generator. You can also call next() on it to sequentially draw batches of transitions.


import minerl

from minerl.data import BufferedBatchIter
data = minerl.data.make('MineRLTreechop-v0')
iterator = BufferedBatchIter(data)

# OPTION 1: ITERATING OVER THE DATASET (comment out one option)
for current_state, action, reward, next_state, done \
    in iterator.buffered_batch_iter(batch_size=1, num_epochs=1):

        # Print the POV @ the first step of the sequence
        print(current_state['pov'][0])

        # Print the final reward of the sequence!
        print(reward[-1])

        # Check if final (next_state) is terminal.
        print(done[-1])

        # ... do something with the data.
        print("At the end of trajectories the length"
              "can be < max_sequence_len", len(reward))
              
# OPTION 2: USING NEXT ON THE GENERATOR (comment out one option) 
gen = iterator.buffered_batch_iter(batch_size=4, num_epochs=1)
current_state, action, reward, next_state, done = next(gen)

6. Finally, the following command should let you view random demonstration data trajectories.

$ python3 -m minerl.viewer MineRLTreechop-v0

Algorithmic Details


Frame Skipping & Demo Action Mapping

Frame skipping is a computational trick where the agent chooses an action in only one out of every k frames and skips computing the action in the next k-1 frames by simply repeating the initial action. This lets the agent step through roughly k times more environment frames for the same amount of computation while having a minimal effect on learning.
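A minimal sketch of what such a frame-skipping step could look like, assuming a gym-style environment whose observations contain a 'pov' image; this is an illustration rather than the exact wrapper used in the implementation.

import numpy as np

# Minimal sketch of an action-repeat / frame-skip step; not the exact wrapper used here.
def skip_step(env, action, k=4):
    """Repeat `action` for k frames, summing rewards and stacking the k pov frames."""
    frames, total_reward, done = [], 0.0, False
    for _ in range(k):
        obs, reward, done, _ = env.step(action)
        frames.append(obs['pov'])
        total_reward += reward
        if done:
            break
    # Pad with the last frame if the episode ended mid-skip
    while len(frames) < k:
        frames.append(frames[-1])
    stacked = np.concatenate(frames, axis=-1)   # (64, 64, 3 * k)
    return stacked, total_reward, done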


The demonstration data transitions and replay memory transitions are pre-processed before being fed to the deep network's optimizer. This pre-processing mainly consists of two steps (concatenation and mapping). To comply with frame skipping, one transition must account for some k steps in the environment. When taking actions and adding the agent's self-explored transitions to the replay memory, the state vector is a concatenation of k pov frames. The agent repeats its action at time step t for the next k frames and adds up the received rewards. The final transition added to the replay memory consists of a k-step concatenation of current states (s_t), next states (s_{t+1}), and rewards, while the action is kept the same.

The demonstration data transitions must also comply with this framework. Hence, a string of k-step sections is sampled from the demo data and aggregated to form the final transition. As with the agent's self-exploration, the current states, next states, and rewards are simply concatenations of k individual ones. The actions, however, can potentially vary across the k steps. Hence, an action concatenation step is carried out. The action dictionaries from the k steps are added up and the number of occurrences of each individual action is counted. The actions which are not taken are removed from this final dictionary. This aggregated action is then mapped to the agent's available actions based on the frequency of occurrence of each individual or combined action. In the event that a demo action maps to two equally frequently occurring possible actions (say the demo action was [forward + jump + attack], which maps to either [forward + jump] or [forward + attack]), one of the possible actions is chosen randomly. Refer to the comments in the demo data script on GitHub for the pseudo-code.
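The following condensed sketch illustrates the aggregation and mapping idea described above. AVAILABLE_ACTIONS is a small illustrative placeholder for the 14 retained actions, the continuous camera action is ignored, and the real logic (including edge cases) lives in the repo's demo data script.

import random
from collections import Counter

# Condensed sketch of aggregating k demo action dictionaries and mapping the result
# onto the agent's restricted action set. AVAILABLE_ACTIONS is an illustrative placeholder.
AVAILABLE_ACTIONS = [
    frozenset(['forward']),
    frozenset(['attack']),
    frozenset(['forward', 'jump']),
    frozenset(['forward', 'attack']),
]

def aggregate_demo_actions(action_dicts):
    """Count how often each binary sub-action is active across the k demo steps."""
    counts = Counter()
    for act in action_dicts:
        for key, value in act.items():
            if key != 'camera' and value:   # ignore the continuous camera action in this sketch
                counts[key] += 1
    return counts

def map_to_agent_action(counts):
    """Pick the available action whose sub-actions occur most often, breaking ties randomly."""
    def score(candidate):
        return sum(counts.get(key, 0) for key in candidate)
    best = max(score(c) for c in AVAILABLE_ACTIONS)
    # e.g. [forward + jump + attack] scores equally for [forward + jump] and [forward + attack]
    candidates = [c for c in AVAILABLE_ACTIONS if score(c) == best]
    return random.choice(candidates)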


Soft Target Network Updates

The original DQN paper proposes the idea of a frozen target network that is updated in a "hard" manner. The idea is to copy over the weights of the policy Q-network every τ steps and use this network within the loss function. This enables the agent to optimize its Q-network towards a fixed optimization target and improves training stability. A modification of this is the "soft" update scheme where instead of freezing the target network every τ steps, the weights of the target network are very slightly influenced by the policy network. This influence is dictated by some hyperparameter β. This approach was introduced in the paper on Deep Deterministic Policy Gradients [5] and resulted in stability improvements during learning. The DQfD implementation in this work hence uses soft target updates.


θ′ ← βθ + (1 − β)θ′    [θ: policy net weights; θ′: target net weights; β ≪ 1]
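In PyTorch, such a soft update can be sketched as follows, assuming policy and target networks with identical architectures; the coefficient name and value here are placeholders.

import torch

BETA = 0.005   # small mixing coefficient (β in the update rule above)

@torch.no_grad()
def soft_update(policy_net, target_net, beta=BETA):
    """theta' <- beta * theta + (1 - beta) * theta', applied parameter-wise."""
    for p, p_target in zip(policy_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - beta).add_(beta * p)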


DQfD Loss Definition

As described in the previous section, the DQfD paper introduces a custom loss definition comprising a 1-step and an n-step double Q-learning loss, a large margin supervised loss, and L2 regularization. This reproduction of the algorithm slightly modifies the loss to comprise a 1-step TD loss, L2 regularization, and the large margin classification loss, with the margin function taking values in {0, 1} depending on whether the agent's and the expert's actions match.


Patches To MineRL Demo Data Sampling

As explained briefly in the results section, the mineRL demo sampling method was dumping huge amounts of demonstration data directly into memory. The generator function buffered_batch_iter was actually just yielding demonstrations from a data_buffer loaded in memory. This was overloading memory and causing the Minecraft server to die abruptly. A fix was implemented by monkey patching a modified buffered_batch_iter method that loads only one trajectory into the buffer at a time, in a sequential manner.
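Schematically, the monkey-patching pattern looks like the following. The attribute and method names on the data pipeline are assumptions based on the mineRL v0.4 data API, and the patched generator body here is only a stand-in for the actual fix in the repo.

import minerl
from minerl.data import BufferedBatchIter

def patched_buffered_batch_iter(self, batch_size, num_epochs=1):
    """Schematic stand-in: load one trajectory into the buffer at a time.

    The attribute and method names on `self.data` are assumptions based on the
    mineRL v0.4 data API; the actual patched generator lives in this project's repo.
    """
    for _ in range(num_epochs):
        for trajectory_name in self.data.get_trajectory_names():
            buffer = list(self.data.load_data(trajectory_name))   # one trajectory only
            for i in range(0, len(buffer) - batch_size + 1, batch_size):
                batch = buffer[i:i + batch_size]
                yield tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)

# Monkey patch: replace the library method with the memory-friendly version
BufferedBatchIter.buffered_batch_iter = patched_buffered_batch_iter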


Model Architecture & Hyperparameters

Similar to the original DQfD paper, this implementation uses the duelling state-advantage convolutional network architecture [6]. This architecture contains a backbone with three convolutional layers that finally returns a 64-channel 5 x 5 feature map. It then contains separate heads for the state value and the action advantages. Finally, the action value of each action is computed by combining both as per the duelling state-advantage definition. A PyTorch implementation of this network is available at [7].
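Below is a rough PyTorch sketch of such a duelling network. The kernel sizes and strides are placeholders chosen only so that the backbone ends at a 64-channel 5 x 5 map for a 64 x 64 input, assuming the four stacked RGB frames are concatenated along the channel axis (12 input channels).

import torch
import torch.nn as nn

class DuellingQNetwork(nn.Module):
    """Duelling state-value / advantage network; kernel sizes and strides are placeholders."""

    def __init__(self, in_channels=12, n_actions=14):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=4), nn.ReLU(),  # -> 32 x 16 x 16
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),           # -> 64 x 7 x 7
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),           # -> 64 x 5 x 5
            nn.Flatten(),
        )
        self.value_head = nn.Sequential(nn.Linear(64 * 5 * 5, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage_head = nn.Sequential(nn.Linear(64 * 5 * 5, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        features = self.backbone(x)
        value = self.value_head(features)              # V(s)
        advantage = self.advantage_head(features)      # A(s, a)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return value + advantage - advantage.mean(dim=1, keepdim=True)

For example, DuellingQNetwork()(torch.rand(1, 12, 64, 64)) returns a 1 x 14 tensor of action values.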


The hyperparameters used during training were as follows.

  • FRAME_STACK = 4 (frame skipping)

  • BATCH_SIZE = 32

  • GAMMA = 0.99 (discount factor)

  • EPS = 0.05 (epsilon greedy exploration)

  • TAU = 0.005 (soft target update coefficient)

  • LR = 1e-4

  • BETA = 0 until pretraining and gradually increased to 0.75

  • Adam optimizer

  • Expert margin = 1 (when agent action not equal to expert action)

  • L2 regularization weight = 1e-5

  • Pre-training steps = 12 episodes

  • Episode steps = 1500





