We've built a system for automatically figuring out which object is meant by a word by having a neural network decide if the word belongs to each of about 100 automatically-discovered "types" (non-exclusive categories). For example, given a sentence like "the prey saw the jaguar cross the jungle", rather than trying to reason directly whether jaguar means the car, the animal, or something else, the system plays "20 questions" with a pre-chosen set of categories. This approach gives a big boost in state-of-the-art on several entity disambiguation datasets.
We achieve 94.88% accuracy on CoNLL (YAGO) (previous state of the arts: 91.50%, 91.70%) and 90.85% on TAC KBP 2010 challenge (previous state of the arts: 87.20%, and 87.70%). Previous methods used distributed representations. Types can go almost all the way on these tasks, as perfect type prediction would give accuracies of 98.6-99%.
Our system uses the following steps:
- Extract every Wikipedia-internal link to determine, for each word, the set of conceivable entities it can refer to. For example, when encountering the link
[jaguar](https://en.wikipedia.org/wiki/Jaguar)in a Wikipedia page, we conclude that
https://en.wikipedia.org/wiki/Jaguaris one of the meanings of
- Walk the Wikipedia category tree (using the Wikidata knowledge graph) to determine, for each entity, the set of categories it belongs to. For example, at the bottom of
https://en.wikipedia.org/wiki/Jaguar_Cars's Wikipedia page, are the following categories (which themselves have their own categories, such as Automobiles):
- Pick a list of ~100 categories to be your "type" system, and optimize over this choice of categories so that they compactly express any entity. We know the mapping of entities to categories, so given a type system, we can represent each entity as a ~100-dimensional binary vector indicating membership in each category.
- Using every Wikipedia-internal link and its surrounding context, produce training data mapping a word plus context to the ~100-dimensional binary representation of the corresponding entity, and train a neural network to predict this mapping. This chains together the previous steps: Wikipedia links map a word to an entity, we know the categories for each entity from step 2, and step 3 picked the categories in our type system.
- At test time, given a word and surrounding context, our neural network's output can be interpreted as the probability that the word belongs to each category. If we knew the exact set of category memberships, we would narrow down to one entity (assuming well-chosen categories). But instead, we must play a probabilistic 20 questions: use Bayes' theorem to calculate the chance of the word disambiguating to each of its possible entities.
Here are some other examples of our system in action:
Cleaning the data
Wikidata's knowledge graph can be turned into a source of training data for mapping fine-grained entities to types. We apply its
instance of relation recursively to determine the set of types for any given entity — for example, any descendent node of the human node has type human. Wikipedia can also provide entity-to-type mapping through its
Wikipedia-internal link statistics provide a good estimate of the chance a particular phrase refers to some entity. However, this is noisy since Wikipedia will often link to specific instance of a type rather than the type itself (anaphora — e.g. king → Charles I of England) or link from a nickname (metonymy). This results in an explosion of associated entities (e.g. king has 974 associated entities) and distorted link frequencies (e.g. queen links to the band Queen 4920 times, Elizabeth II 1430 times, and monarch only 32 times).
After this process, king goes from 974 to 14 associated entities, while the number of links from queen to monarch increases from 32 to 3553.
Learning a good type system
We need to select the best type system and parameters such that disambiguation accuracy is maximized. There's a huge number of possible sets of types, making an exact solution intractable. Instead, we use heuristic search or stochastic optimization (evolutionary algorithm) to select a type system, and gradient descent to train a type classifier to predict the behavior of the type system.
We need to select types that are discriminating (so quickly whittle down the possible set of entities), while being easy to learn (so surrounding context is informative for a neural network to infer that a type applies). We inform our search with two heuristics: learnability (average of area under the curve [AUC] scores of a classifier trained to predict type membership), and oracle accuracy (how well we would disambiguate entities if we predicted all types perfectly).
Type system evolution
We train binary classifiers to predict membership in each of the 150,000 most common types in our dataset, given a window of context. The area under the curve (AUC) of the classifier becomes the "learnability score" for that type. High AUC means it's easy to predict this type from context; poor performance can mean we have little training data or that a word window isn't terribly helpful (this tends to be true for unnatural categories like ISBNs). Our full model takes several days to train, so we instead use a much smaller model as a proxy in our "learnability score", which takes only 2.5s to train.
We can now use these learnability scores and count statistics to estimate the performance of a given subset of types as our type system. Below you can run the Cross Entropy Method to discover types in your browser. Note how changing sample size and penalties affects the solution.
To better visualize what parts of the type system design are easy and hard, we invite you to try your hand at designing your own below. After choosing a high-level domain you can start looking at ambiguous examples. The possible answers are shown as circles on the top row, and the correct answer is the colored circle (hover to see its name). The bottom row contains types you can use. Lines connecting the top to the bottom row are inheritance relations. Select the relations you want. Once you have enough relations to separate the right answer from the rest, the example is disambiguated.
Neural type system
Using the top solution from our type system optimization, we can now label data from Wikipedia using labels generated by the type system. Using this data (in our experiments, 400M tokens for each of English and French), we can now train a bidirectional LSTM to independently predict all the type memberships for each word. On the Wikipedia source text, we only have supervision on intra-wiki links, however this is sufficient to train a deep neural network to predict type membership with an F1 of over 0.91.
One of our type systems, discovered by beam search, includes types such as
Games (as well as surprisingly specific ones like
1754 in Canada — indicating 1754 was an exciting year in the dataset of 1,000 Wikipedia articles it was trained on); you can also view the full type system.
Predicting entities in a document usually relies on a "coherence" metric between different entities, e.g. measuring how well each entity fits with each other, which is
O(N^2) in the length of the document. Instead, our runtime is
O(N) as we need only to look up each phrase in a trie which maps phrases to their possible meanings. We rank each of the possible entities according to the link frequency seen in Wikipedia, refined by weighting each entity by its likelihood under the type classifier. New entities can be added just by specifying their type memberships (person, animal, country of origin, time period, etc..).
Our approach has many differences to previous work on this problem. We're interested in how well end-to-end learning of distributed representations performs in comparison to the type-based inference we developed here. The type systems here were discovered using a small Wikipedia subset; scaling to all of Wikipedia could discover a type system with broad application. We hope you find our code useful!
If you'd like to help push research like this forward, please apply to OpenAI!