Our hypothesis is that true language understanding will come from agents that learn words in combination with how they affect the world, rather than spotting patterns in a huge corpus of text. As a first step, we wanted to see if cooperative agents could develop a simple language amongst themselves.
Training agents to invent a language
We've just released initial results in which we teach AI agents to create language by dropping them into a set of simple worlds, giving them the ability to communicate, and then giving them goals that can be best achieved by communicating with other agents. If they achieve a goal, then they get rewarded. We train them using reinforcement learning and, due to careful experiment design, they develop a shared language to help them achieve their goals.
Our approach yields agents that invent a (simple!) language which is grounded and compositional. Grounded means that words in a language are tied to something directly experienced by a speaker in their environment, for example, a speaker forming an association between the word "tree" and images or experiences of trees. Compositional means that speakers can assemble multiple words into a sentence to represent a specific idea, such as getting another agent to go to a specific location.
To train the agents, we represent the experiment as a cooperative — rather than competitive — multi-agent reinforcement learning problem. The agents exist in a two-dimensional world with simple landmarks, and each agent has a goal. Goals can vary from looking at or moving to a specific location, to encouraging a separate agent to move to a location. Each agent can broadcast messages to the group. Every agent's reward is the sum of the rewards paid out to all agents, encouraging collaboration.
At each time step, our RL agents can take two kinds of actions — (i) environment actions, like moving around or looking at things, and (ii) communication actions, like broadcasting a word to all other agents. (Note that though the agents come up with words that we found to correspond to objects and other agents, as well as actions like 'Look at' or 'Go to', to the agents these words are abstract symbols represented by one-hot vector — we label these one-hot vectors with English words that capture their meaning for the sake of interpretability.) Before an agent takes an action, it observes the communications from other agents from the previous time step as well as the locations of all entities and objects in the world. It stores that communication in a private recurrent neural network, giving it a memory for the words it hears.
We use discrete communication actions (messages formed of separate, word-like symbols) sent over a differentiable communication channel. A communication channel is differentiable if it allows agents to directly inform each other about what message they should have sent at each time step, by slightly altering their messages to make a positive change in the reward both agents expect to receive. Agents accomplish this by calculating the gradient of future reward with respect to changes in the sent messages (i.e. how much rewards would change with different messages). For example, if one agent realizes that it could have performed a task better if a second agent had sent different information, the first agent can tell the second exactly how to modify its messages to make them as useful as possible. In other words, agents ask the question: 'how should I modify my communication output to get the most communal reward in the future?'.
Previous efforts achieved this sort of differentiable communication by having the agents send a vector of real numbers or a continuous approximation to binary values to each other, or used non-differentiable communication and training. We use the Gumbel-Softmax trick, to approximate discrete communication decisions with a continuous representation during training. This gets us the best of both worlds: during training the differentiable channel means agents can rapidly learn how to communicate with each other via using continuous representation, which at the end of training ends up converging on discrete outputs that are more interpretable and show traits like compositionality.
In the video that follows, we show how our agents evolve languages to fit the complexity of their situation, with solitary agents not needing to communicate, two agents inventing one-word phrases to coordinate with each other in simple tasks, and three agents composing multiple words in sentences to accomplish more challenging tasks.
How experimental setup influences how language evolves
All research projects have complications; in this case, our agents frequently invented languages that didn't display the compositional traits we wanted. And even when they succeeded, their solutions had their own idiosyncrasies.
The first problem we ran into was the agents' tendency to create a single utterance and intersperse it with spaces to create meaning. This Morse code language was hard to decipher and non-compositional. To correct this, we imposed a slight cost on every utterance and added a preference for achieving the task quickly. This encouraged the agents to use their communication channel concisely, which led to the development of a larger vocabulary.
Another issue we faced was agents trying to use single words to encode the meaning of entire sentences. This happened when we gave them the ability to use large vocabularies; they'd eventually create a single utterance that encoded the meaning of an entire sentence such as “red agent, go to blue landmark”. While useful for the agents, this approach requires vocabulary size to grow exponentially with the sentence length and doesn't fit with our broader goal of creating AI that is interpretable to humans.) To deter agents from creating this sort of language we incorporated a preference for compact vocabulary sizes through a preference for using already-popular words, inspired by ideas outlined in The evolution of syntactic communication. We incorporate this by putting a reward for speaking a particular word that is proportional to how frequently that word has been spoken previously.
Lastly, we encountered agents inventing landmark references not based on color, but other cues such as spatial relationships. For example, agents would invent words like "top-most" or "left-most" landmark to refer to locations based on a global 2D coordinate system. While such behavior is very inventive, it is fairly specific to our particular environment implementation, and could cause problems if we substantially changed the geography of the worlds the agents live in. To fix this, we placed agents in an ego-centric coordinate frame (so that there is no single shared coordinate frame). This dealt with the odd directions, and led to them referring to landmarks by their color property.
Can't speak? Let me point the way. Can't hear? Let me guide you.
This method of training also works when agents are unable to communicate with each other via text, and have to instead carry out physical actions actions in the simulated environment. In the animations that follow we show agents improvising in this way by pointing or guiding other agents to targets, or in extreme cases pushing sightless agents to their goal.
Inferred language and grounded language
Today, many people have applied machine learning to language-related tasks with great success. Large-scale ML techniques have led to significant advances in translation, verbal reasoning, language understanding, sentence generation, and other areas. All of these approaches work by feeding them extremely large amounts of textual data, from which the systems extract features and discover patterns. While this work has yielded numerous inventions and innovations, it has drawbacks relating to the representational quality of the language that is learned. There's not much indication that if you train a computer on language in this way it will have a deep understanding of how that language is attached to the real world. With our research, we're trying to deal with this grounding problem by training our agents to invent language which is tied to their perception of the world.
Computers whose language models are trained without grounding are much like the character trapped in John Searle’s Chinese Room, where they compare incoming text against a kind of dictionary of semantic meaning which has been created through the analysis of large quantities of text. It's unclear how much of an idea these computers have about what the text represents, as they've never left this room and been able to interact with the world the text describes.
We hope that this research into growing a language will let us develop machines that have their own language tied to their own lived experience. We think that if we slowly increase the complexity of their environment, and the range of actions the agents themselves are allowed to take, it’s possible they’ll create an expressive language which contains concepts beyond the basic verbs and nouns that evolved here.
As the complexity of this invented language increases, it's going to become challenging for us to make these languages interpretable by humans. That's why for our next project, Ryan Lowe and Igor Mordatch are going to investigate ways to connect the invented languages with English via having the agents communicate with English-speaking agents. This will automate the translation of their language into ours. This is an interdisciplinary undertaking, spanning areas of AI, linguistics, and cognitive science, and as part of it we'll be collaborating with researchers at UC Berkeley. If you're interested in developing smarter language models, then consider working at OpenAI.
For more information
You can find out more information about the technical specifics of our research in this research paper: Emergence of Grounded Compositional Language in Multi-Agent Populations, and more about the motivations for it in: A Paradigm for Situated and Goal-Driven Language Learning.