Anish Diwan

"I know what you think I know" - intentional agents with recursive beliefs in adversarial games



Lately, I have been fascinated by multi-agent systems where learning agents operate under partial observability. I also recently attended an expert talk on such systems, about how agents make decisions based on their beliefs over other agents and how this leads to behaviour that is almost social in nature. So, here's my attempt at creating such intentional/social agents to play a rather convoluted multiplayer game that is riddled with partial observability.


To formalise this, a partially observable multi-agent system is one where two or more independently acting agents interact with the environment (and by extension, with each other) to optimise their own goals while only having access to their own observations of the environment (and not its true state). An independently acting agent is simply one that acts based on its own judgement and is not instructed to make decisions by some central coordinating authority. An agent's observations are akin to sensor readings: they inform the agent of variables that give it a general idea of the true state of the environment, but they are not a complete representation of the exact conditions of the system. Road traffic is a great example of such a system. Each vehicle makes driving decisions independently (disregarding traffic rules). It knows its own speed and position, but it can only sense the positions and speeds of the vehicles around it through its sensors and use those measurements to approximate their true state. Other examples of such systems are card games such as poker and Uno.


What's even more interesting about adversarial games like poker is the added challenge of guessing the other players' intentions. An agent is likely to benefit from changing its playing style depending on what it believes the other agents' intentions are. All the while, its opponents are trying to change their own playing styles depending on what they believe their opponents' intentions are. This recursive modelling of beliefs leads to players that act with intentional strategies and exhibit somewhat social behaviour.


In the rest of this article, I discuss ways to formally model such systems, ways to use machine learning to learn behaviour policies based on such recursive modelling, and some results of this work in a very strategic game with a ton of unknowns.


Background


Markov Decision Processes (MDPs)

A Markov decision process [1] is a framework for modelling sequential decision-making problems. It is founded on the Markov assumption, which requires that state transitions are conditioned only on the current state and action and not on the history of past states and actions. This implies that an agent is assumed to be perfectly capable of taking optimal actions based only on its knowledge of the current state (assuming that it has captured the reward and transition dynamics via sufficient experience gathering).


I provide a more detailed explanation of MDPs in one of my previous articles [MDP explanation]. The rest of this article assumes an understanding of the standard MDP framework.


Partial Observability (POMDPs) & Belief States

Most reinforcement learning algorithms make the rather strong assumption that the agent always has access to the state of the system. This holds in a lot of RL benchmark problems and is often an offhand assumption, especially in recent times when the deep networks behind the RL algorithm are massive, trained with tons of data, and capable of generalising even from "bad quality" inputs. While deep learning is incredibly powerful and has shown great success in recent research, it requires a massive amount of training data and careful tuning, and in my opinion, should not be thought of as the panacea for handling every type of complexity in a problem.


Partial observability, as briefly explained in the introduction of this article, refers to the scenario where the agent does not have access to the full, true state of the system. It is a feature of a large fraction of both real-world and conceptual problems, including autonomous driving, card games, and robot navigation, and it leads to some very interesting challenges.


A partially observable Markov decision process (POMDP) is an extension of the standard MDP to account for partial observability. It is defined as a 7-tuple < S , A , R , T , Ω , O, γ > where, as usual

  • S is a set of states

  • A is a set of actions

  • R(s, a) is a reward function

  • T(s, a, s') = P(s' | s, a) is the transition function, and

  • γ is a discount factor

And the two new elements are

  • Ω is a set of possible observations

  • O(s', a, o) = P(o | s', a) is the conditional observation probability function

A POMDP makes the assumption that the underlying system is still an MDP. However, the agent no longer has access to the true state s ∈ S; it can only see its observation o ∈ Ω. On taking an action a_t in the true (but unknown) state s_t, the agent transitions to the new state s_{t+1} with probability P(s_{t+1} | s_t, a_t) as usual. However, it does not directly see s_{t+1}. Instead, the agent receives an observation o_{t+1} ∈ Ω, which is conditioned on the new state and its previous action with probability P(o_{t+1} | s_{t+1}, a_t).
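
To make the tuple a bit more concrete, here is a minimal sketch (purely illustrative, not tied to any library or to my implementation) of the generative interface a POMDP exposes to an agent: the agent can act, but it only ever gets back an observation and a reward, never the successor state.

```python
import random

class POMDP:
    """A minimal generative-model view of the 7-tuple above (an illustrative sketch).
    Dynamics are stored as nested dictionaries of probabilities:
      transition[s][a]       = {s_next: P(s_next | s, a)}
      observation[s_next][a] = {o: P(o | s_next, a)}
      reward[s][a]           = R(s, a)
    """

    def __init__(self, transition, observation, reward, gamma=0.95):
        self.T, self.O, self.R, self.gamma = transition, observation, reward, gamma

    def step(self, s, a):
        """Sample one step of the hidden dynamics. The agent is only ever shown (o, r);
        the true successor state s_next stays inside the simulator."""
        t = self.T[s][a]
        s_next = random.choices(list(t.keys()), weights=list(t.values()))[0]
        obs = self.O[s_next][a]
        o = random.choices(list(obs.keys()), weights=list(obs.values()))[0]
        return s_next, o, self.R[s][a]
```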


The belief state is a representation of the agent's "beliefs" about its current state: it captures where the agent "thinks" it is. It is a distribution over the state space conditioned on the agent's recent history of observations (an equivalent definition conditions the belief on the previous belief and the latest action-observation pair; this article uses the history definition). The belief state is B(h_t) ≡ P(s_t | h_t), where h_t = [o_t, a_{t-1}, o_{t-1}, a_{t-2}, ...] up to some length n. The equivalent recursive definition is b_t ≡ P(s_t | a_{t-1}, o_t, b_{t-1}). The belief state is an integral part of learning in POMDPs and can be considered almost a surrogate for the true state. Given the transition and observation models, the belief can be calculated analytically through Bayes' rule [bayes beliefs], b_t(s') ∝ P(o_t | s', a_{t-1}) Σ_s P(s' | s, a_{t-1}) b_{t-1}(s). However, it can also be updated through data-driven approaches such as filtering (as we will see in the section on POMCP).


For example, consider a robot in the following box environment. There are five states in the environment, and states 2-4 have a wall around them. The robot cannot perceive its exact state directly. All it can do is take a measurement in the north and south directions with its sensors and detect whether there is a wall in that direction. Its observation is hence a two-dimensional vector [O_N, O_S]. The robot can take an action to move either left or right. This system can be modelled as a POMDP. The belief state at the start of the episode is equiprobable, as the agent has no idea where it is. Say the agent takes no action and receives an observation of [1, 1]. This implies that it is in one of states 2-4, so the belief is 1/3 for states 2, 3, and 4 and zero elsewhere. Now, if the agent receives an observation of [0, 0] upon taking the left action, then it is surely in state 1, so the belief is 1 for state 1 and zero elsewhere.
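
Here is a small sketch of the exact belief update for this toy corridor, assuming noise-free wall sensors and deterministic moves (which is how the example above is set up). The code reproduces the two updates described in the paragraph.

```python
# States 1-5; only states 2, 3 and 4 have walls to the north and south.
# A minimal sketch of the analytical Bayes-filter belief update for the toy corridor.
WALLS = {1: (0, 0), 2: (1, 1), 3: (1, 1), 4: (1, 1), 5: (0, 0)}

def move(s, a):
    """Deterministic transition: a = -1 (left), 0 (stay), +1 (right), clipped at the ends."""
    return min(max(s + a, 1), 5)

def belief_update(belief, a, obs):
    """Bayes filter: b'(s') ∝ P(obs | s') · Σ_s P(s' | s, a) b(s)."""
    predicted = {s: 0.0 for s in WALLS}
    for s, p in belief.items():                  # prediction step
        predicted[move(s, a)] += p
    for s in predicted:                          # correction step (likelihood is 0 or 1 here)
        predicted[s] *= 1.0 if WALLS[s] == tuple(obs) else 0.0
    z = sum(predicted.values())
    return {s: p / z for s, p in predicted.items()}

b = {s: 0.2 for s in WALLS}          # equiprobable initial belief
b = belief_update(b, 0, [1, 1])      # observe walls on both sides -> 1/3 on states 2, 3, 4
b = belief_update(b, -1, [0, 0])     # move left, observe no walls -> certain of state 1
print(b)                             # {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0}
```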


Partially Observable Monte-Carlo Planning (POMCP)

POMCP [ref] is a heuristic search method for online planning in large POMDPs. It is a partial-observability extension of UCT [ref] (Upper Confidence bounds applied to Trees), one of the more popular model-based algorithms for solving large MDPs and itself an extension of Monte-Carlo Tree Search. POMCP combines Monte-Carlo belief state updates with a belief-state version of UCT to ensure tractability, even in large POMDPs. This section assumes a general understanding of MCTS.


Iterative planning algorithms that consider all states equally tend to perform quite poorly in large problems. The increased size of the state, action, or belief space leads to issues termed the curse of dimensionality and the curse of history. The former refers to the situation where the search space of the algorithm is simply too large for it to make any significant progress at estimating optimal solutions. The latter is a challenge specifically arising in belief-contingent problems where the number of possible action-observation histories grows exponentially with the horizon, leading to a very large possible solution search space. POMCP overcomes these challenges as it is a best-first method that rapidly focuses only on promising solution sub-spaces.


POMCP consists of two main subroutines. First, it uses partially observable UC-Trees (PO-UCT) to select actions at each time step in the real world. Second, it uses a particle filter and the simulated tree to update the agent's belief state.


Figure 1 (from the original paper): An illustration of POMCP in an environment with 2 actions, 2 observations, 50 states, and no intermediate rewards. The agent constructs a search tree from multiple simulations and evaluates each history by its mean return (left). The agent uses the search tree to select a real action a and observes a real observation o (middle). The agent then prunes the tree and begins a new search from the updated history hao (right).

Unlike regular UCT, PO-UCT builds a search tree of histories instead of states. A history h_t is defined as h_t = {a_1, o_1, ..., a_t, o_t}. The tree is rooted at the current history and contains a node T(h) = <N(h), V(h)> for each represented history h, where N(h) is the number of times history h has been visited and V(h) is its value (the mean return of all simulation trajectories passing through h). Assuming that the belief state B(s | h) is known exactly (it is later provided by the second subroutine), each simulation starts from an initial state sampled from B(· | h) and performs tree traversal and rollouts just like standard UCT (with a UCB1 tree policy followed by a history-based rollout policy). After multiple simulation iterations, the action that maximises value in the completed tree is selected as the real-world action.
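
To make the first subroutine concrete, here is a rough sketch of a single PO-UCT simulation. It is not the paper's reference implementation: the black-box simulator G(s, a), the node layout, and the uniform-random rollout policy are all simplifying assumptions of mine.

```python
import math, random

class Node:
    """T(h) = <N(h), V(h)>: visit count and mean return for one history in the search tree."""
    def __init__(self):
        self.N, self.V = 0, 0.0
        self.children = {}   # action -> {"N": int, "V": float, "obs": {observation: Node}}

def ucb1(history_node, action_stats, c=1.0):
    if action_stats["N"] == 0:
        return float("inf")
    return action_stats["V"] + c * math.sqrt(math.log(history_node.N) / action_stats["N"])

def rollout(s, depth, actions, G, gamma, max_depth):
    """History-based rollout policy, simplified here to uniform random actions."""
    if depth >= max_depth:
        return 0.0
    s2, _, r = G(s, random.choice(actions))
    return r + gamma * rollout(s2, depth + 1, actions, G, gamma, max_depth)

def simulate(s, node, depth, actions, G, gamma=0.95, max_depth=30):
    """One PO-UCT simulation from a state s sampled from the belief B(.|h) at this node.
    G(s, a) is an assumed black-box simulator returning (s_next, observation, reward)."""
    if depth >= max_depth:
        return 0.0
    if not node.children:                                       # new history node: expand + rollout
        node.children = {a: {"N": 0, "V": 0.0, "obs": {}} for a in actions}
        ret = rollout(s, depth, actions, G, gamma, max_depth)
    else:                                                       # tree policy: UCB1 over actions
        a = max(actions, key=lambda a: ucb1(node, node.children[a]))
        s2, o, r = G(s, a)
        child = node.children[a]["obs"].setdefault(o, Node())   # node for the history hao
        ret = r + gamma * simulate(s2, child, depth + 1, actions, G, gamma, max_depth)
        stats = node.children[a]
        stats["N"] += 1
        stats["V"] += (ret - stats["V"]) / stats["N"]
    node.N += 1
    node.V += (ret - node.V) / node.N
    return ret
```

Running many such simulations from the root, each seeded with a state sampled from the current belief, and then choosing the action with the highest value gives the real-world action.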


The second subroutine of POMCP estimates the belief state with a particle filter. The belief state for history h_t is approximated by K particles, B^i_t ∈ S, 1 ≤ i ≤ K. Each particle is a sampled state, and the belief state is the empirical distribution of the particles, B̂(s, h_t) = (1/K) Σ_i δ(s, B^i_t), where δ is the Kronecker delta. At the start of the algorithm, particles are sampled from the initial belief distribution. Particles are updated after every real-world action: during the simulation phase (the first subroutine), a particle is added to the new belief whenever the simulated observation matches the last real-world observation. This is repeated until all K particles have been added and the belief state is complete, and the whole process repeats at every real-world step.
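
A sketch of the corresponding particle update after a real-world step might look as follows (again assuming the same black-box simulator G; the rejection-style loop is a simplification of what happens during the search itself).

```python
import random

def update_belief_particles(particles, real_action, real_obs, G, K):
    """Monte-Carlo belief update after a real-world step (illustrative sketch).
    G(s, a) is the same assumed black-box simulator used by the search."""
    new_particles = []
    while len(new_particles) < K:
        s = random.choice(particles)       # sample a state from the current belief
        s2, o, _ = G(s, real_action)       # push it through the simulator
        if o == real_obs:                  # keep successors that explain the real observation
            new_particles.append(s2)
    # Note: if no particle ever matches (particle deprivation), POMCP adds domain-specific
    # particle reinvigoration; that safeguard is omitted from this sketch.
    return new_particles
```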


When the search is complete, the agent selects the action a_t with the greatest value and receives a real observation o_{t+1} from the world. At this point, the node T(h_t a_t o_{t+1}) becomes the root of the new search tree, and the belief state B(h_t a_t o_{t+1}) becomes the agent's new belief state. The remainder of the tree is then pruned, as all other histories are now impossible.


Multi-Agent Systems with Partial Observability


Interactive POMDP

Partial observability in multi-agent situations can be handled through various approaches depending on the characteristics of the interaction between agents. An interactive POMDP (IPOMDP) [ref] is a multi-agent characterisation of the POMDP where the actions of one agent may observably affect the distribution of expected rewards for the other agents. It follows that, to act optimally, an agent must also model the actions of the other agents. An IPOMDP is defined such that the state of one agent also includes the state (or belief state) of the other agents. IPOMDPs have already been applied to dyadic games with recursive beliefs where the state space is augmented by relative belief states [ref]. This article borrows the IPOMDP definition from [ref].


"

An IPOMDP is a collection of POMDPs such that the following holds:


Agents are indexed by the finite set I. Each agent i ∈ I is described by a single POMDP < S_i, A_i, R_i, T_i, Ω_i, O_i, γ_i > denoting its actual decision-making process. We first define the physical state space S_i^phys: an element s_i ∈ S_i^phys is a complete setting of all features of the environment that determine the action possibilities A_i and obtainable rewards R_i of i for the present and all possible following histories, from the point of view of i.

The physical state space S_i^phys is augmented by the set D_i of models of the partner agents θ_ij ∈ D_i, called intentional models, which are themselves POMDPs θ_ij = < S_ij, A_ij, R_ij, T_ij, Ω_ij, O_ij, γ_ij >. These describe how agent i believes agent j perceives the world and reaches its decisions.


Note that the intentional models θ_ij themselves contain state spaces that encode the history of the game as observed by agent j from the point of view of agent i. The elements of S_i are called interactive states. Agents themselves act according to the softmax function of history action values and assume that their interactive partner agents do the same.

"


Cognitive Hierarchy

Cognitive hierarchy [ref] is a characteristic of some multi-agent games where each agent assumes that its strategy is the most sophisticated one in play. The relative degree of sophistication of a strategy is defined by its theory of mind (ToM) level. Under cognitive hierarchy, an agent at level k assumes that its opponents are distributed over levels 0 to k-1. The simplest agent in such a setting (level 0) tries to infer its partner's state. One level higher, the agent tries to infer its partner's model of itself, and so on. In our situation, there is also a level -1 agent that acts only on its beliefs about the environment and not on the actions of its opponents.
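
As a toy illustration of level-k reasoning (in a simple matrix game rather than the full IPOMDP setting), the sketch below has level 0 randomise uniformly while each higher level best-responds to a model of the level below. Note the simplification: the actual cognitive hierarchy model assumes opponents are distributed over all levels 0 to k-1, not concentrated at k-1, and the payoffs here are made up.

```python
import numpy as np

# A toy zero-sum 2x2 game (made-up payoffs): PAYOFF[i, j] is the row player's reward
# when the row player picks action i and the column player picks action j.
PAYOFF = np.array([[1.0, -1.0],
                   [-2.0, 2.0]])

def level_k_strategy(k, payoff):
    """Level-k reasoning sketch: level 0 randomises uniformly; a level-k player models its
    opponent as level k-1 and best-responds to that prediction."""
    if k == 0:
        return np.full(payoff.shape[0], 1.0 / payoff.shape[0])   # uniform mixed strategy
    opponent = level_k_strategy(k - 1, -payoff.T)                # opponent plays the negated, transposed game
    expected = payoff @ opponent                                 # expected payoff of each of my actions
    strategy = np.zeros(payoff.shape[0])
    strategy[int(np.argmax(expected))] = 1.0                     # deterministic best response
    return strategy

for k in range(4):
    print(k, level_k_strategy(k, PAYOFF))
```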


This manner of cognitive modelling has not only been shown to hold for real subjects in many adversarial games (further explanations and references in [ref]) but also has the added advantage of preventing intractably large recursive models in the IPOMDP. In fact, the hierarchy has been shown to terminate at a finite level for many real games.



The Secret Hitler Board Game


At first glance, this section has a high potential for sounding ever so slightly sketchy. I can only urge you to humour me for the next few paragraphs.


Secret Hitler is a — very real and quite fun — social deduction game where players are split into two opposing teams, liberals and fascists. Both groups vote to form the government at every turn. The liberals, who are in the majority, aim to pass liberal policies (policies here refer to constitutional policies as opposed to the policies in an RL algorithm) and identify/kill the secret Hitler. The fascists, on the other hand, aim to protect the identity of secret Hitler and instead pass fascist policies. The objective of the game is to correctly identify your team members and work together to achieve the respective agendas of each team. Secret Hitler is a very interesting choice for experiments on partially observable multi-agent systems because of the numerous unknowns, the inherent stochasticity in the game, the large space of belief states, and the potential for building social tendencies such as probing, coaxing, and misdirection in intelligent agents. The game can be played with various rule sets depending on the number of players; however, I have chosen to use a modified rule set for the simplest (5-agent) version of the game. This version is still very challenging, especially at higher levels of ToM. The exact rule set of my Secret Hitler environment is as follows (parts of the game that involve decision making are highlighted in red). A bare-bones code sketch of this environment follows the list.


  • In this version of the game, there are 5 agents in total. 3 of these are liberals, 1 is a regular fascist, and the final agent is secret Hitler. At the beginning of the episode, each agent is assigned a role. We will ultimately learn a policy (the RL type) for each role. The members of the fascist team know the identity of their team member while the members of the liberal team do not know any other players' roles (they hence also do not know the remaining two liberal agents). The objective of the game for each team is to pass their respective constitutional policies. There are 11 fascist and 6 liberal policies in total. The liberals win if 5 liberal policies are passed and the fascists win if 6 fascist policies are passed. These policies are shuffled in random order.

  • At the beginning of the game, an agent chosen at random is assigned to be the President. This role rotates every turn. To begin the game, the president proposes another agent as a Chancellor. The other agents then vote on this proposal and a majority vote passes. In case of a failed vote, the turn is finished and the next agent becomes President. If three consecutive votes fail then a random policy is enacted.

  • The President then draws three policies from the randomly shuffled stack of policies. They then discard one of these three policies and hand over the remaining two to the Chancellor. The Chancellor then enacts one of these two policies. It is important to note that this whole process is carried out in secrecy. This means that none of the other agents see the three policies drawn at random by the President or the one policy that they chose to discard. Similarly, none of the other agents see the two policies passed on to the Chancellor or the one policy that the Chancellor then discards. The agents only observe the final policy enacted by the government. This creates numerous equiprobable beliefs given a history of observations, which means that cognitively advanced agents can use clever tactics such as probing to gain further insights into the true state of the system.

  • After a policy is enacted, the turn is completed. The next agent (based on some preset sequence) now becomes the President and the episode continues.

  • When the fourth and fifth fascist policies are enacted, the current President must also kill another player. This also leads to interesting behaviour where players refrain from voting for suspicious players as Chancellor. If secret Hitler is killed, then the liberals win.
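
Here is the bare-bones sketch referenced above: role assignment, the policy deck, and the secret legislative session. It is only a skeleton (all names and structure are illustrative, not my final environment), but it shows why the other agents' observations are so limited.

```python
import random
from dataclasses import dataclass, field

LIBERAL, FASCIST, HITLER = "liberal", "fascist", "hitler"

@dataclass
class SecretHitlerState:
    """Bare-bones state for the 5-agent variant described above (an illustrative sketch)."""
    roles: list                                   # hidden role of each of the 5 agents
    deck: list                                    # shuffled policy deck: 11 fascist + 6 liberal
    enacted: dict = field(default_factory=lambda: {"liberal": 0, "fascist": 0})
    president: int = 0
    failed_votes: int = 0
    alive: list = field(default_factory=lambda: [True] * 5)

def new_game(seed=None):
    rng = random.Random(seed)
    roles = [LIBERAL, LIBERAL, LIBERAL, FASCIST, HITLER]
    rng.shuffle(roles)
    deck = ["fascist"] * 11 + ["liberal"] * 6
    rng.shuffle(deck)
    return SecretHitlerState(roles=roles, deck=deck, president=rng.randrange(5))

def legislative_session(state, president_discard, chancellor_discard):
    """The secret draw-three / discard / enact step. president_discard ∈ {0, 1, 2} and
    chancellor_discard ∈ {0, 1} index the card to throw away; only the enacted policy is public."""
    hand = [state.deck.pop() for _ in range(3)]
    hand.pop(president_discard)                   # President secretly discards one of three
    enacted = hand[1 - chancellor_discard]        # Chancellor secretly discards one of the remaining two
    state.enacted[enacted] += 1
    return enacted                                # the only thing the other three agents observe
```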


Modelling Secret Hitler As An IPOMDP

Much of the challenge in this problem lies in representing the Secret Hitler environment as an IPOMDP such that it is computationally feasible to solve. Due to the inherently high stochasticity in the game, it is quite difficult to exactly determine the size of the decision tree. The rule set has already been modified to minimise complexity by turning most actions into binary choices. The state vector contains both observable and unobservable features. The observable features are:

  • The agent's role and a vector of known roles of other agents

  • The current number of enacted liberal and fascist policies

  • The number of policies remaining in the draw pile

  • A vector of three random policies in case the agent is the current president (empty vector otherwise)

  • A vector of two policies handed over by the president in case the agent is the current chancellor (empty vector otherwise)

The unobservable features constitute the agent's belief state. Here, it is assumed that an agent's policy with respect to cooperating with or opposing another agent depends directly only on the other agent's role (and perhaps indirectly on the cards they play, but that is something we discount). It stands to reason that an agent only needs to know the other agents' roles to be able to act optimally, even considering recursive modelling. Hence, the unobservable features are taken to be the roles of the agent's opponents. To further simplify matters, the agent only keeps beliefs over two binary variables per opponent. The first, αf, indicates the probability that an agent is a fascist (with one minus that being the probability of being a liberal), and the second, αh, indicates the probability that an agent is secret Hitler. The probability of being secret Hitler is modelled as a separate belief since it is possible for higher-ToM secret Hitler agents to behave sub-optimally to avoid giving away their identity; a separate belief enables an agent to also "suspect" very good behaviour. The true value of αf is 1 for fascists and 0 for liberals, while αh is 1 for secret Hitler and 0 for everyone else. At the start of the episode, the belief state is assumed to be equiprobable.
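
A possible representation of this belief state is sketched below: two arrays of per-opponent probabilities plus a generic odds-form Bayes update. The field names, the 0.5 initial values, and the way the likelihood ratios would be obtained are all placeholder assumptions rather than the final design.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class RoleBelief:
    """An agent's belief over the hidden roles of its four opponents, kept as two binary
    parameters per opponent (a sketch of the α_f / α_h representation described above)."""
    alpha_f: np.ndarray = field(default_factory=lambda: np.full(4, 0.5))   # P(opponent is a fascist)
    alpha_h: np.ndarray = field(default_factory=lambda: np.full(4, 0.5))   # P(opponent is secret Hitler)

    def update(self, opponent, likelihood_ratio_f, likelihood_ratio_h):
        """Odds-form Bayes update for one opponent. Each ratio is
        P(observed behaviour | role) / P(observed behaviour | not role); how these ratios
        are estimated (e.g. from POMCP simulations) is left open in this sketch."""
        for alpha, ratio in ((self.alpha_f, likelihood_ratio_f), (self.alpha_h, likelihood_ratio_h)):
            prior = alpha[opponent]
            alpha[opponent] = prior * ratio / (prior * ratio + (1.0 - prior))
```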


Finally, as per the definition of an IPOMDP, the actions of one agent must observably affect the reward distribution of the other agents. Hence, the value of taking a certain action for an agent must be tied to the subsequent actions of the other agents. For belief updates to converge over time, the α parameters of other agents must ideally also be a component of the value function (as seen in [ref]).


More stuff coming soon!


References


[1] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; 2018.

[2] Cassandra AR, Kaelbling LP, Littman ML. Acting optimally in partially observable stochastic domains. In: AAAI 1994 (Vol. 94, pp. 1023-1028).

[3] Silver D, Veness J. Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems. 2010;23.

[4] Kocsis L, Szepesvári C. Bandit based Monte-Carlo planning. In: European Conference on Machine Learning 2006 (pp. 282-293). Springer.

[5] Gmytrasiewicz PJ, Doshi P. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research. 2005;24:49-79.

[6] Hula A, Montague PR, Dayan P. Monte Carlo planning method estimates planning horizons during interactive social exchange. PLoS Computational Biology. 2015;11(6):e1004254.

[7] Camerer CF, Ho TH, Chong JK. A cognitive hierarchy model of games. The Quarterly Journal of Economics. 2004;119(3):861-898.

-----------------------------------------------------



Appendix

Implementation Details

I am currently in the process of programming both the board game environment and the POMCP algorithm to solve it. This might take a while, depending on my workload, studies, and personal schedule. I'll post more updates as soon as possible. Meanwhile, you can find the code on my GitHub.

