Google Knowledge Graph Reconciliation
Exploring how Google’s knowledge graph works can provide some insight into how it is growing and improving, and how it may influence what we see on the web. A newly granted Google patent from the end of last month describes one way that Google may increase the amount of data its knowledge graph contains. The process in that patent doesn’t work quite the same way as the one I wrote about in the post How the Google Knowledge Graph Updates Itself by Answering Questions, but taken together, they tell us how the knowledge graph is growing and improving. Part of the process involves the entity extraction that I wrote about in Google Shows Us How It Uses Entity Extractions for Knowledge Graphs. This patent tells us that information that may make its way into Google’s knowledge graph isn’t limited to content on the Web, but may also “originate from another document corpus, such as internal documents not available over the Internet or another private corpus, from a library, from books, from a corpus of scientific data, or from some other large corpus.”

What is Knowledge Graph Reconciliation?
The patent describes how a knowledge graph is constructed and the processes it follows to update and improve itself. The site Wordlift includes some definitions related to entities and the Semantic Web. The definition it provides for reconciling entities is “providing computers with unambiguous identifications of the entities we talk about.” This patent from Google uses the word “reconciliation” more broadly, applying it to knowledge graphs to make sure they take advantage of all of the information about entities that may be entered into them from web sources. The process involves finding entities and facts about entities that are missing from a knowledge graph, and using web-based sources to add that information.

Problems with knowledge graphs
Large data graphs like Google’s Knowledge Graph store data and rules that describe knowledge about the data in a way that allows the information they provide to be built upon. A patent granted to Google describes how Google may build upon data within a knowledge graph so that it contains more information. The patent doesn’t just cover information from within the knowledge graph itself, but can look to outside sources such as online news.

Tuples as Units of Knowledge Graphs
The patent presents some definitions that are worth learning. One of those concerns facts involving entities: “A fact for an entity is an object related to the entity by a predicate. A fact for a particular entity may thus be described or represented as a predicate/object pair.” The relationship between an entity (a subject) and a fact about the entity (a predicate/object pair) is known as a tuple. In a knowledge graph, entities such as people, places, things, and concepts may be stored as nodes, and the edges between those nodes may indicate the relationships between them. For example, the nodes “Maryland” and “United States” may be linked by the edges “in country” and/or “has state.” A basic unit of such a data graph is a tuple that includes two entities, a subject entity and an object entity, and a relationship between the entities. Tuples often represent real-world facts, such as “Maryland is a state in the United States” (a subject, a verb, and an object). A tuple may also include information such as:
- Context information
- Statistical information
- Audit information
- Metadata about the edges
- etc.
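The tuple described above can be sketched as a small data structure. This is a hedged illustration, not the patent's actual representation; the class and field names are my own:

```python
from dataclasses import dataclass, field

# A minimal sketch of a knowledge-graph tuple: a subject entity, a
# relationship (predicate), and an object entity, plus the optional
# metadata the patent mentions. All names here are illustrative.

@dataclass(frozen=True)
class Triple:
    subject: str        # e.g. "Maryland"
    predicate: str      # e.g. "in country"
    obj: str            # e.g. "United States"

@dataclass
class Fact:
    triple: Triple
    context: dict = field(default_factory=dict)   # context information
    stats: dict = field(default_factory=dict)     # statistical information
    provenance: str = ""                          # audit information / source

fact = Fact(Triple("Maryland", "in country", "United States"),
            provenance="https://example.com/maryland")
```

Keeping the core triple immutable while attaching mutable metadata separately mirrors the patent's distinction between the fact itself and information about the fact.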
For example, if the potential tuples include the tuple &lt;Maryland, in country, United States&gt;, the system may generate an inverse tuple of &lt;United States, has state, Maryland&gt;. Sometimes inverse tuples may be generated for some predicates but not for others. For example, tuples with a date or measurement as the object may not be good candidates for inverse occurrences, and may not have many inverse occurrences.
For example, the tuple &lt;Planet of the Apes, released in, 2001&gt; is not likely to have an inverse occurrence of &lt;2001, is the year of release, Planet of the Apes&gt; in the target data graph.

Clustering of tuples is also discussed in the patent. We are told that the system may then cluster the potential tuples by:
- source
- provenance
- subject entity type
- subject entity name
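The two steps above, generating inverse tuples for predicates that support them and clustering potential tuples by those four keys, might look something like this rough sketch. The inverse-predicate table and the record fields are assumptions for illustration:

```python
from collections import defaultdict

# Map each invertible predicate to its inverse; date- or measurement-valued
# predicates (e.g. "released in") are left out, since the patent notes they
# are poor candidates for inversion. This table is an illustrative assumption.
INVERSES = {"in country": "has state", "has state": "in country"}

def with_inverses(tuples):
    """Return the input tuples plus an inverse tuple for each invertible one."""
    out = list(tuples)
    for subj, pred, obj in tuples:
        if pred in INVERSES:
            out.append((obj, INVERSES[pred], subj))
    return out

def cluster(records):
    """Group tuple records by source, provenance, subject type, and subject name."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["source"], rec["provenance"],
               rec["subject_type"], rec["subject_name"])
        buckets[key].append(rec)
    return buckets

tuples = [("Maryland", "in country", "United States"),
          ("Planet of the Apes", "released in", "2001")]
expanded = with_inverses(tuples)
# the Maryland tuple gains an inverse; the release-date tuple does not
```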
The process behind the knowledge graph reconciliation patent:
- Potential entities may be identified from facts generated from web-based sources
- Facts from those sources are analyzed and cleaned, generating a small source data graph that includes entities and facts from those sources
- The source graph may be generated for a potential source entity that does not have a matching entity in the target data graph
- The system may repeat the analysis and generation of source data graphs for many source documents, generating many source graphs, each for a particular source document
- The system may cluster the source data graphs together by type of source entity and source entity name
- The entity name may be a string extracted from the text of the source
- Thus, the system generates clusters of source data graphs of the same source entity name and type
- The system may split a cluster of source graphs into buckets based on the object entity of one of the relationships, or predicates
- The system may use a predicate that is determinative for splitting the cluster
- A determinative predicate generally has a unique value, e.g., object entity, for a particular entity
- The system may repeat the dividing a predetermined number of times, for example using two or three different determinative predicates, splitting the buckets into smaller buckets. When the iteration is complete, graphs in the same bucket share two or three common facts
- The system may discard buckets without sufficient reliability and discard any conflicting facts from graphs in the same bucket
- The system may merge the graphs in the remaining buckets, and use the merged graphs to suggest new entities and new facts for the entities for inclusion in a target data graph
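The bucket-splitting steps above can be sketched in a few lines. This is a simplified illustration under assumed data structures (each source graph reduced to a dict of facts) and an assumed reliability threshold; the patent does not give exact thresholds:

```python
from collections import defaultdict

def split_on(buckets, predicate):
    """Split each bucket by the object of one determinative predicate."""
    new = []
    for bucket in buckets:
        by_obj = defaultdict(list)
        for graph in bucket:
            by_obj[graph["facts"].get(predicate)].append(graph)
        new.extend(by_obj.values())
    return new

def reconcile(graphs, determinative_predicates, min_support=2):
    """Iteratively split a cluster, then keep buckets with enough corroboration."""
    buckets = [graphs]
    for pred in determinative_predicates:   # e.g. two or three iterations
        buckets = split_on(buckets, pred)
    # graphs in a surviving bucket now share two or three common facts
    return [b for b in buckets if len(b) >= min_support]

graphs = [
    {"name": "Planet of the Apes", "facts": {"release year": "1968", "director": "Franklin J. Schaffner"}},
    {"name": "Planet of the Apes", "facts": {"release year": "1968", "director": "Franklin J. Schaffner"}},
    {"name": "Planet of the Apes", "facts": {"release year": "2001", "director": "Tim Burton"}},
]
kept = reconcile(graphs, ["release year", "director"])
# only the 1968 bucket has two corroborating source graphs
```

Because "release year" is determinative (a film has one release year), the 1968 and 2001 films land in different buckets even though they share a name, which is exactly the disambiguation the patent is after.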
How Googlebot may be Crawling Facts to Build a Knowledge Graph
This is where some clustering comes into play. Imagine that the web sources are about science fiction movies, and that they contain information about movies in the “Planet of the Apes” series, which has been remade at least once; there are a number of related movies in the series, and movies with the same names. The information about those movies may be found from sources on the Web, clustered together, and put through a reconciliation process because of the similarities. Relationships between the many entities involved may be determined and captured. We are told about the following steps:
- Each source data graph is associated with a source document, includes a source entity with an entity type that exists in the target data graph, and includes fact tuples
- The fact tuples identify a subject entity, a relationship connecting the subject entity to an object entity, and the object entity
- The relationship is associated with the entity type of the subject entity in the target data graph
- The computer system also includes instructions that, when executed by the at least one processor, cause the computer system to perform operations that include generating a cluster of source data graphs, the cluster including source data graphs associated with a first source entity of a first source entity type that share at least two fact tuples that have the first source entity as the subject entity and a determinative relationship as the relationship connecting the subject entity to the object entity
- The operations also include generating a reconciled graph by merging the source data graphs in the cluster when the source data graphs meet a similarity threshold and generating a suggested new entity and entity relationships for the target data graph based on the reconciled graph
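The merge step described above might be sketched as follows. The similarity measure here (fraction of shared facts) is my own stand-in, since the patent excerpt does not spell one out, and discarding conflicting facts follows the earlier bucket-cleaning step:

```python
def similarity(a, b):
    """Assumed similarity: fraction of predicate/object pairs two graphs share."""
    shared = set(a.items()) & set(b.items())
    return len(shared) / max(len(a), len(b))

def merge_bucket(graphs, threshold=0.5):
    """Merge similar source graphs into one reconciled graph,
    discarding any facts that conflict between sources."""
    base = graphs[0]
    merged, conflicts = dict(base), set()
    for g in graphs[1:]:
        if similarity(base, g) < threshold:
            continue                      # not similar enough to merge
        for pred, obj in g.items():
            if pred in conflicts:
                continue
            if pred in merged and merged[pred] != obj:
                del merged[pred]          # discard conflicting fact
                conflicts.add(pred)
            else:
                merged[pred] = obj
    return merged

g1 = {"release year": "1968", "genre": "science fiction"}
g2 = {"release year": "1968", "rating": "G"}
merged = merge_bucket([g1, g2])
# the two graphs corroborate on the release year, so their other
# facts are combined into one reconciled graph
```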
More Features to Knowledge Graph Reconciliation
There appear to be 9 movies in the Planet of the Apes series and the rebooted series. The first “Planet of the Apes” was released in 1968, and the second “Planet of the Apes” was released in 2001. Since they have the same name, things could get confusing if they weren’t separated from each other. The system can use facts about those movies to break the “Planet of the Apes” cluster down into buckets, based upon facts that tell us there was an original series and a rebooted series involving the “Planet of the Apes.”
For example, generating the cluster can include generating a first bucket for source data graphs associated with the first source entity and the first source entity type; splitting the first bucket into second buckets based on a first fact tuple, the first fact tuple having the first source entity as the subject entity and a first determinative relationship, so that source data graphs sharing the first fact tuple are in the same second bucket; and generating final buckets by repeating the splitting a quantity of times, each iteration using another fact tuple for the first source entity that represents a distinct determinative relationship, so that source data graphs sharing the first fact tuple and the other fact tuples are in the same final bucket, wherein the cluster is one of the final buckets.

So this aspect of knowledge graph reconciliation involves understanding related entities, including some that may share the same name, and removing ambiguity from how they might be presented within a knowledge graph. Another aspect of knowledge graph reconciliation may involve merging data, such as seeing when one of the versions of the movie “Planet of the Apes” has more than one actor in it, and merging that information together to make the knowledge graph more complete. The image below from the patent shows how that can be done:

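The actor example above involves a relationship that can legitimately have many values, so differing objects are combined rather than treated as a conflict. A hypothetical sketch, with the predicate names as my own assumptions:

```python
# Predicates that may hold many values at once; a film has many actors,
# but only one release year. This set is an illustrative assumption.
MULTIVALUED = {"has actor"}

def merge_multivalued(graphs):
    """Merge fact lists, collecting multi-valued predicates into sets."""
    merged = {}
    for g in graphs:
        for pred, obj in g:
            if pred in MULTIVALUED:
                merged.setdefault(pred, set()).add(obj)
            else:
                merged.setdefault(pred, obj)
    return merged

g1 = [("release year", "1968"), ("has actor", "Charlton Heston")]
g2 = [("release year", "1968"), ("has actor", "Roddy McDowall")]
merged = merge_multivalued([g1, g2])
# both actors survive the merge; the shared release year appears once
```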
Inverse Tuples Generated and Discarded

Advantages of the Knowledge Graph Reconciliation Patent Process
- A data graph may be extended more quickly by identifying entities in documents and facts concerning the entities
- The entities and facts may be of high quality due to the corroborative nature of the graph reconciliation process
- The identified entities may be identified from news sources, to more quickly identify new entities to be added to the data graph
- Potential new entities and their facts may be identified from thousands or hundreds of thousands of sources, providing potential entities on a scale that is not possible with manual evaluation of documents
- Entities and facts added to the data graph can be used to provide more complete or accurate search results
Takeaways
The patent points out in one place that human evaluators may review additions to a knowledge graph. It is interesting to see how the process can use sources such as news to add new entities and facts about those entities. Being able to use web-based news to grow the knowledge graph means that it isn’t relying upon human-edited sources such as Wikipedia, and the knowledge graph reconciliation process was interesting to learn about as well.