Learning to model other minds

LOLA, a collaboration by researchers at OpenAI and the University of Oxford, lets a reinforcement learning (RL) agent take account of the learning of others when updating its own strategy. Each LOLA agent adjusts its policy in order to shape the learning of the other agents in a way that is advantageous. This is possible since the learning of the other agents depends on the rewards and observations occurring in the environment, which in turn can be influenced by the agent.

This means that the LOLA agent, “Alice,” models how the parameter updates of the other agent, “Bob,” depend on its own policy and how Bob’s parameter update impacts its own future expected reward. Alice then updates its own policy in order to make the learning step of the other agents, like Bob, more beneficial to its own goals.

LOLA agents can discover effective, reciprocative strategies, in games like the iterated prisoner’s dilemma, or the coin game. In contrast, state-of-the-art deep reinforcement learning methods, like Independent PPO, fail to learn such strategies in these domains. These agents typically learn to take selfish actions that ignore the objectives of other agents. LOLA solves this by letting agents act out of a self-interest that incorporates the goals of others. It also works without requiring hand-crafted rules, or environments set up to encourage cooperation.

The inspiration for LOLA comes from how people collaborate with one another: Humans are great at reasoning about how their actions can affect the future behavior of other humans, and frequently invent ways to collaborate with others that leads to a win–win. One of the reasons humans are good at collaborating with each other is that they have a sense of a “theory of mind” about other humans, letting them come up with strategies that lead to benefits for their collaborators. So far, this sort of “theory of mind” representation has been absent from deep multi-agent reinforcement learning. To a state of the art deep-RL agent there is no inherent difference between another learning agent and a part of the environment, say a tree.

The key to LOLA’S performance is the inclusion of term:

\left( \frac{\partial V^1 (\theta^1_i,\theta^2_i) }{\partial \theta^2_i} \right)^T \frac{\partial^2 V^2 (\theta^1_i,\theta^2_i)}{\partial \theta^1_i \partial \theta^2_i} \cdot \delta \eta,

Here the left hand side captures how Alice’s return depends on the change in Bob’s policy. The right hand side, describes how Bob’s learning step depends on Alice’s policy. Multiplying those two components essentially measures how Alice can change Bob’s learning step such that it leads to an increase in Alice’s rewards.

This means that when we train our agents they try to optimize their return after one anticipated learning step of the opponent. By differentiating through this anticipated learning step the agent can actively shape the parameter update of the opponent in a way that increases their own return.

While the formula above assume access to the true gradient and hessian of the two value functions, we can also estimate all relevant terms using samples. In particular the second order term can be estimated by applying the policy gradient theorem, which makes LOLA suitable for any deep reinforcement learning setting.

LOLA can handle this by including a step of opponent modeling where we fit a model of the opponent to the observed trajectories—predicting the parameters of other agents based on their actions. In the future we would like to extend this by also inferring architectures and rewards from the observed learning.

Results

By considering their impact on the learning process of other agents, LOLA agents (left) learn collaborative strategies, while other approaches like Independent Policy Gradient (right) struggle in environments like the coin game.

LOLA lets us train agents that succeed at the coin game, in which two agents, red and blue, compete with one another to pick up red and blue colored coins. Each agent gets a point for picking up any coin, but if they pick up a coin which isn’t their color then the other agent will receive a –2 penalty. Thus, if both agents greedily pick up both coins, everyone gets zero points on average. LOLA agents learn to predominantly pick up coins of their own color, leading to high scores (shown above).

Drawbacks

LOLA works best when using large batch-sizes and full roll-outs for variance reduction. This means that the method is both memory and compute intensive. Furthermore, under opponent modeling LOLA can exhibit instability which we hope to address with future improvements.

Learning to model other minds

More resources

Results

Drawbacks

Authors