Knowledge Graph Research

4. Building the graph: Entity Linking

We perform the task of named entity linking (NEL) subconsciously throughout our days. Say, for example, we are in a stirring debate with a friend about the dubious motivations of the Founding Fathers. Our friend mentions that Jefferson spent a staggering amount of money in his life and died in significant debt. We know, because of our brains’ version of NEL, that our friend is referring to the person Thomas Jefferson, or the third president, or, annoyingly, TJ. We can associate multiple identifiers with one entity, Thomas Jefferson. The computational task of NEL seeks to do the same. In the PubChem database, RDX has 117 synonymous identifiers, such as hexogen and cyclotrimethylenetrinitramine.[1] Without NEL, facts extracted under these identifiers would remain unassociated with RDX in our knowledge graph.

Closely related to NEL is coreference resolution (CR). While NEL links specific entity mentions to a node in a knowledge base, CR resolves ambiguities at the text level. Suppose now that you are reading an article about Thomas Jefferson. The author may use phrases like “him,” “the former president,” and “the young congressman” to refer to Thomas Jefferson. The goal of CR is to detect these references and correctly resolve them to Thomas Jefferson.

Martins et al.[2] approach NEL and named entity recognition (NER) at the same time. Their combined approach begins by detecting an entity in a text stream. Once an entity is detected, their model first predicts whether that entity is represented in the knowledge base. If it is not, a node is added. If it is, a set of candidate entities is assembled and ranked using the sentence representation, the embedding of each candidate entity, the prior probability of each entity, and the last action taken by the NER classifier. The mention is then linked to the candidate with the highest score. Their model performed competitively with other NEL models and showed that combining the NEL and NER tasks can improve the performance of both.
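The synonym problem described in this section can be sketched with a small alias table. This is only a minimal illustration of the idea of linking, not the model of Martins et al.: the table `ALIASES`, the functions `link` and `link_facts`, and the example facts are all assumptions made up for this sketch, and only three of RDX's synonyms are listed.

```python
# Toy alias table: every surface form maps to one canonical node.
# The entries are hand-written for illustration.
ALIASES = {
    "rdx": "RDX",
    "hexogen": "RDX",
    "cyclotrimethylenetrinitramine": "RDX",
    "thomas jefferson": "Thomas Jefferson",
    "jefferson": "Thomas Jefferson",
    "tj": "Thomas Jefferson",
}


def link(mention):
    """Return the canonical node for a mention, or None if it is unknown."""
    return ALIASES.get(mention.strip().lower())


def link_facts(facts):
    """Rewrite (subject, relation, object) facts so subjects use canonical nodes."""
    linked = []
    for subject, relation, obj in facts:
        node = link(subject)
        # Keep the original string when no node is known. A real system
        # might instead add a new node to the knowledge base.
        linked.append((node or subject, relation, obj))
    return linked


facts = [
    ("Hexogen", "is_a", "explosive"),
    ("cyclotrimethylenetrinitramine", "is_a", "nitramine"),
]
print(link_facts(facts))
```

After linking, both facts share the subject RDX, so they attach to the same node instead of two unrelated ones. A lookup table like this cannot handle ambiguous mentions, which is where candidate ranking comes in.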


[1] https://pubchem.ncbi.nlm.nih.gov/compound/Cyclonite#section=Depositor-Supplied-Synonyms&fullscreen=true

[2] https://www.aclweb.org/anthology/P19-2026/

3. Building the graph: Pre-Processing through Entity Recognition

Several publications present recipes for creating knowledge graphs[1] [2] [3]. The process for creating a basic knowledge graph from unstructured text is composed of, at a minimum, six subtasks:

  1. Data collection and pre-processing.
  2. Semantic model creation.
  3. Entity recognition.
  4. Entity linking and coreference resolution.
  5. Relation extraction.
  6. Triple evaluation and graph construction.

Collecting a repository of data is, of course, the first step in the knowledge graph process. Two things to consider when compiling this repository are the format and the state of the documents. Text-based file types like HTML, XML, and TXT are preferable to presentation-style formats like PDF. PDFs require an initial step in the pre-processing pipeline to extract the relevant information from each one: pre-pre-processing, so to speak. Sciencebeam[4], Grobid[5], and PDFMiner[6] are open-source tools built specifically to extract document elements, like body text, from PDF documents. Even the best PDF extraction tools, however, produce unreliable results when used on scanned PDFs.

Once the data is in a text-based file format, we are ready to begin pre-processing. The purpose of pre-processing is to prepare the raw text data for more sophisticated methods later in the pipeline. The typical pre-processing tasks are text cleaning, tokenization, part-of-speech (POS) tagging, and dependency parsing. Most NLP methods rely on text that has been split into tokens. Generally, tokens are words, but most modern tokenizers, like spaCy’s[7], make exceptions that improve the performance of later NLP methods. For example, contractions like “don’t” are tokenized as their parts, “do” and “n’t.” Tokens and sequences of tokens become the linguistic features for later NLP tasks. Other NLP toolkits include their own tokenizers. Examples are the Allen Institute for AI’s AllenNLP[8], NLTK[9], and Stanford’s CoreNLP[10]. Figure 3.1 shows an excerpt from the Old Royal Naval College Wikipedia page[11] tokenized with NLTK’s word tokenizer.

Figure 3.1
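To make the contraction rule concrete, here is a small regex-based tokenizer written with only the Python standard library. It is a simplified stand-in for illustration and not the algorithm used by spaCy or NLTK. The pattern, the `tokenize` function, and the example sentence are assumptions of this sketch.

```python
import re

# Alternatives are tried left to right, so the contraction rules
# must come before the plain word rule.
TOKEN_PATTERN = re.compile(
    r"""
    \w+(?=n['’]t\b)                  # stem of a negative contraction: do + n't
    | n['’]t\b                       # the negation itself
    | ['’](?:s|re|ve|ll|d|m)\b       # other clitics such as 's and 're
    | \w+                            # ordinary words and numbers
    | [^\w\s]                        # any single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Tourists don't skip the Old Royal Naval College."))
```

The word "don't" comes out as the two tokens "do" and "n't", and the final period becomes its own token. Production tokenizers handle many more cases (URLs, abbreviations, hyphenation), so treat this only as a picture of the idea.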

Part-of-speech (POS) tagging is the task of predicting the correct part of speech for each token. All of the NLP toolkits mentioned above contain modules for POS tagging. Parts of speech are useful for extracting specific patterns of language in relation extraction. POS tags also serve as features for dependency parsing, which attempts to build structured sentence trees from unstructured text. Once again, most NLP toolkits contain dependency parsing modules. The structured sentences produced by dependency parsing can improve performance in relation extraction. Figure 3.2 shows how the excerpt is parsed using spaCy.

Figure 3.2
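The sketch below shows one way a dependency parse can feed relation extraction. To stay self-contained it does not call a parser: the parse is written by hand, and the `Token` class, the label names (`nsubj`, `dobj`, `ROOT`), and the `subject_verb_object` function are assumptions chosen to resemble common toolkit conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str   # part-of-speech tag
    head: int  # index of the syntactic head; the root points to itself
    dep: str   # dependency label linking this token to its head


# Hand-annotated parse of "Wren designed the college".
# A real pipeline would obtain this from a dependency parser.
PARSE = [
    Token("Wren", "PROPN", 1, "nsubj"),
    Token("designed", "VERB", 1, "ROOT"),
    Token("the", "DET", 3, "det"),
    Token("college", "NOUN", 1, "dobj"),
]


def subject_verb_object(parse):
    """Yield (subject, verb, object) triples found in a dependency parse."""
    for index, token in enumerate(parse):
        if token.pos != "VERB":
            continue
        # Direct dependents of this verb, excluding the verb itself.
        children = [t for t in parse if t.head == index and t is not token]
        subjects = [t.text for t in children if t.dep == "nsubj"]
        objects = [t.text for t in children if t.dep == "dobj"]
        for subject in subjects:
            for obj in objects:
                yield (subject, token.text, obj)


print(list(subject_verb_object(PARSE)))
# [('Wren', 'designed', 'college')]
```

The POS tag selects the verb and the dependency labels select its subject and object, which is the kind of pattern that relation extraction builds on.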

Creating a semantic schema is important for a useful knowledge graph. A semantic schema can be as simple as an unordered list of entity types or as complicated as a full axiom-rich ontology. Semantic schemas provide a framework for the interpretation of unstructured data. Building an ontology is a highly skilled and expensive task, requiring domain expertise and ontology engineering experience. We will dive deeper into ontology learning in a later section. A simple, manually crafted taxonomy is enough to begin knowledge graph development. In our London trip example, a simple schema to identify the relevant entities could be [Lodging, Attractions, Transport]. This schema gives a template for the data: each relevant entity will be recognized as one of these classes. A simple schema for energetics formulations may be [Component, Processing_Technique, Property].
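A flat schema like the energetics one can be written down in a few lines. The sketch below encodes it as a Python enumeration and rejects labels that fall outside it. The names `EntityType` and `validate` are made up for this example, and treating RDX as a Component is an illustrative assumption.

```python
from enum import Enum


class EntityType(Enum):
    """The three classes of the simple energetics schema."""
    COMPONENT = "Component"
    PROCESSING_TECHNIQUE = "Processing_Technique"
    PROPERTY = "Property"


def validate(entity, label):
    """Attach a schema class to an entity, rejecting labels outside the schema."""
    try:
        return (entity, EntityType(label))
    except ValueError:
        raise ValueError(f"{label!r} is not a class in the schema") from None


print(validate("RDX", "Component"))
```

Calling `validate("RDX", "Lodging")` raises a `ValueError`, because Lodging belongs to the travel schema and not to this one. That is the sense in which a schema acts as a template for the data.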

Named entity recognition (NER) is the first step that attempts to identify entities in unstructured text. Chemical names and formulas are rare in common-language texts. As a result, most off-the-shelf NER models perform poorly on chemical and other scientific documents. In recent years, researchers have begun developing NER models specific to chemistry and other scientific domains. ChemDataExtractor[12], developed by researchers at the University of Cambridge, is an open-source tool designed to process scientific texts. Its main offering is an NER module trained specifically for chemical names, formulas, and abbreviations, but it also includes modules for extracting text from various file types, performing pre-processing steps, extracting tabular data, and more. LSTMVoter[13] is another recent NER tool, developed by researchers at Goethe University in Frankfurt, Germany. It uses labeled sequences of tokens and recurrent neural networks to predict chemical mentions and classify them as one of seven types: abbreviation, family, formula, identifier, multiple, systematic, and trivial.

The next post will look at the final three subtasks for building basic knowledge graphs.


[1] https://www.ontotext.com/knowledgehub/webinars/knowledge-graphs-5-use-cases-and-10-steps-to-get-there/

[2] https://www.poolparty.biz/the-knowledge-graph-cookbook/

[3] https://arxiv.org/abs/2003.02320

[4] https://github.com/elifesciences/sciencebeam

[5] https://grobid.readthedocs.io/en/latest/

[6] https://pypi.org/project/pdfminer/

[7] https://spacy.io/usage/spacy-101#annotations

[8] https://allennlp.org/

[9] https://www.nltk.org/

[10] https://stanfordnlp.github.io/CoreNLP/

[11] https://en.wikipedia.org/wiki/Old_Royal_Naval_College

[12] http://chemdataextractor.org/

[13] https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0327-2


2. Background

This is a warm-up post. It gives some simple definitions. Everything mentioned here will be investigated further in later posts.

Researchers have not yet adopted a simple, unified definition of a knowledge graph. There are as many definitions as there are groups using them. We tend to think of graphs as visualizations of trends, such as the rising and falling of stock prices, or as plots of a function like y = x². Graphs in the mathematical sense are different: they are abstractions that approximate real-world phenomena. This is what “graph” means in “knowledge graph.” More formally, a graph is “a collection of points and lines connecting some (possibly empty) subset of them.”[1]

Analogy: Imagine a map of the London Underground. Stations on the map like Oxford Circus and Camden Town represent nodes. The train lines that connect them represent edges.

Use Case: Think of an energetic molecule as a graph. Nodes in the graph represent individual atoms and edges represent their bonds.
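The Underground analogy translates directly into code. The sketch below stores a hand-picked fragment of the network as an edge list, builds an adjacency map, and finds a route with breadth-first search. The station list is deliberately tiny and ignores which line each connection belongs to. The names `EDGES`, `build_adjacency`, and `shortest_path` are assumptions of this example.

```python
from collections import deque

# A small, simplified fragment of the network: each pair is one edge.
EDGES = [
    ("Oxford Circus", "Warren Street"),
    ("Warren Street", "Euston"),
    ("Euston", "Camden Town"),
]


def build_adjacency(edges):
    """Map every node to the set of nodes it shares an edge with."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)  # edges here are undirected
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(adjacency.get(path[-1], ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


network = build_adjacency(EDGES)
print(shortest_path(network, "Oxford Circus", "Camden Town"))
```

The same two structures, a set of nodes and a set of edges, would describe a molecule just as well, with atoms as nodes and bonds as edges.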

What is ontology? What is an ontology? How does it relate to taxonomy?

Taxonomy and ontology are important ingredients in knowledge graph work. Together they form what is called a semantic schema, because they specify the structure of concepts in a domain. Taxonomies are schemes for classification. The Dewey Decimal System[2] is an example. It has ten broad categories, such as language, technology, and religion. These categories are further divided into subcategories, which lets libraries locate works precisely. Literature (Dewey Decimal 8XX), for example, has a subcategory, “Other Languages” (89X). One could continue down this path to “Russian Fiction: 1800-1917” (891.733).[3]
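A taxonomy is a tree, so a nested dictionary is enough to hold the Dewey path mentioned above. This is a minimal sketch containing only the three levels named in the text. The structure of `TAXONOMY` and the helper `path_to` are assumptions of this example, not an official encoding of the classification.

```python
# Each node has a label and a dictionary of child nodes keyed by class code.
TAXONOMY = {
    "8XX": {
        "label": "Literature",
        "children": {
            "89X": {
                "label": "Other Languages",
                "children": {
                    "891.733": {
                        "label": "Russian Fiction: 1800-1917",
                        "children": {},
                    },
                },
            },
        },
    },
}


def path_to(code, tree, trail=()):
    """Return the labels from the root down to `code`, or None if absent."""
    for key, node in tree.items():
        here = trail + (node["label"],)
        if key == code:
            return list(here)
        found = path_to(code, node["children"], here)
        if found is not None:
            return found
    return None


print(path_to("891.733", TAXONOMY))
# ['Literature', 'Other Languages', 'Russian Fiction: 1800-1917']
```

Every item has exactly one parent, which is what separates a taxonomy from the richer ontologies discussed next.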

Ontologies in computer science are formal conceptualizations of a domain.[4] They are like maps or blueprints of a domain of knowledge. What counts as an ontology is muddy. Some ontologies are simple lists of domain words, while others are full hierarchies of concepts with their associated properties and the relations between them. Simple ontologies, such as lists of domain terms and simple concept hierarchies, are called lightweight or informal. Words like heavyweight and formal indicate that an ontology is rich in conditional rules, called axioms. Ontologies are built as graphs in which concepts are nodes and relations are edges.

Analogy: Think of a children’s picture book – brand new. An ontology is the black outline on the white page. This outline specifies the structure of the picture. Any color added will then be interpreted through the lens of the outline.

Use Case: Somebody tasks you with characterizing the appearance of historical figures for an exhibit at a museum. You decide to build an ontology to address this problem. The ontology will not include any information specific to any one figure. Instead, it will be a blueprint for characterizing all of them. Body characteristics and societal characteristics may make up your highest ontological level. From there, you may specify facial features, unique features, and general body features as the next level. As you move down the ontological tree, the attributes become more specific. Of course, attributes other than bodily features also influence a figure’s appearance. Things like society of origin, era of history, status in society, and occupation are sometimes crucial for describing someone’s appearance when specific details are unknown. Perhaps in some societies a historical figure’s occupation indicates something about their hair, or their status in society indicates something about their dress. These relations are included in a formal ontology.

What makes a graph a knowledge graph? Hogan et al., in their comprehensive study of knowledge graphs, define a knowledge graph as “a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities.”[5] A knowledge graph, then, is just a graph that contains knowledge, embedded in its interconnectivity. Furthermore, the graph must be a directed graph, that is, a graph whose edges point from one node to another. James McCusker and his research team at Rensselaer Polytechnic Institute present a unified view of knowledge graphs, not by defining them exactly, but by outlining five requirements for them: 1) meaning is expressed in the structure of the graph, 2) statements are unambiguous, 3) a minimum set of relation types is used to construct the graph, 4) justification, or provenance, must be embedded in the graph, and 5) a measure of uncertainty should be assigned to each statement.[6] Said another way, knowledge graphs must contain knowledge and a path to where that knowledge comes from, and may additionally contain a measure of uncertainty. They must be created with the simplest set of parts possible, and any node-relation-node set must correspond to only one real-world relation.

Analogy: In the children’s picture book analogy, the finished colored-in picture represents the knowledge graph.

Use Case: Imagine now a map of the whole city of London. You have been tasked with building a tool to help tourists plan their days in the city. You reckon that London travel books, webpages of attractions, and reviews of those attractions will provide sufficient coverage of the domain. You consider anything relevant to travel an entity. These could be attractions like the London Eye and the Old Royal Naval College. They could be specific activities like museum exhibits and Thames River boat tours. They could also be entities supplementary to travel, like London Underground stations, hotels, and restaurants. An ontology will specify the attributes of each entity, providing the context for the computer to fill a knowledge graph. A useful attribute could be walking_distance_to_train, which would relate an attraction to the time it takes to walk to an Underground station.
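The requirements discussed in this section (directed statements, provenance, and a measure of uncertainty) suggest a simple data structure. The sketch below is one possible shape for it and not a standard API: the `Statement` and `KnowledgeGraph` classes are hypothetical, and the source names and confidence values are placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str       # node the edge starts from
    relation: str      # edge label
    obj: str           # node the edge points to
    source: str        # provenance: where the statement came from
    confidence: float  # uncertainty measure between 0 and 1


class KnowledgeGraph:
    def __init__(self):
        self.statements = []

    def add(self, subject, relation, obj, source, confidence):
        """Store one directed statement with its provenance and confidence."""
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie between 0 and 1")
        self.statements.append(
            Statement(subject, relation, obj, source, confidence)
        )

    def outgoing(self, subject, relation=None):
        """Statements whose edge starts at `subject`, optionally filtered."""
        return [
            s for s in self.statements
            if s.subject == subject and (relation is None or s.relation == relation)
        ]


kg = KnowledgeGraph()
# Placeholder sources and confidence values, for illustration only.
kg.add("London Eye", "is_a", "Attraction", "example-guidebook", 0.95)
kg.add("London Eye", "located_in", "London", "example-webpage", 0.90)

for statement in kg.outgoing("London Eye"):
    print(statement.relation, "->", statement.obj, f"({statement.source})")
```

Because edges are directed, `kg.outgoing("London")` returns an empty list: London is the target of a statement here, not its starting point. An attribute such as walking_distance_to_train would be stored the same way, with the walking time as the object.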


[1] https://mathworld.wolfram.com/Graph.html

[2] https://www.britannica.com/science/Dewey-Decimal-Classification

[3] https://www.librarything.com/mds/891.733

[4] https://tomgruber.org/writing/onto-design.pdf

[5] https://arxiv.org/abs/2003.02320

[6] http://www.semantic-web-journal.net/system/files/swj1954.pdf


1. Start here

What are knowledge graphs? This series of posts is meant to answer that question, give a snapshot of the current state of knowledge graph research, provide a template for building your own, and produce a rich collection of resources for further learning.