Published Versions 2 Vol 1 (3) : 262-294 2019
Microsoft Concept Graph:Mining Semantic Concepts for Short Text Understanding
： 2018 - 12 - 20
： 2019 - 01 - 22
： 2019 - 01 - 29
3309 168 0
Abstract & Keywords
Abstract: Knowlege is important for text-related applications. In this paper, we introduce Microsoft Concept Graph, a knowledge graph engine that provides concept tagging APIs to facilitate the understanding of human languages. Microsoft Concept Graph is built upon Probase, a universal probabilistic taxonomy consisting of instances and concepts mined from the Web. We start by introducing the construction of the knowledge graph through iterative semantic extraction and taxonomy construction procedures, which extract 2.7 million concepts from 1.68 billion Web pages. We then use conceptualization models to represent text in the concept space to empower text-related applications, such as topic search, query recommendation, Web table understanding and Ads relevance. Since the release in 2016, Microsoft Concept Graph has received more than 100K pageviews, 2M API calls and 3K registered downloads from 50K visitors over 64 countries.
Keywords: Knowledge Extraction; Conceptualization; Text Understanding
1. Introduction
Concepts are indispensable for humans and machines to understand the semantic meanings underlined in the raw text. Humans understand an instance, especially an unfamiliar instance, by its basic concept in an appropriate level, which is defined as Basic-level Categorization (BLC) by psychologists and linguists. For example, people may not understand “Gor Mahia”, but with the concept “football club” described in Wikipedia, people can capture the semantic meaning easily. Psychologist Gregory Murphy began his highly acclaimed book [1] with the statement “Concepts are the glue that holds our mental world together”. Nature magazine book review [2] calls it an understatement, because “Without concepts, there would be no mental world in the first place”. To enable machines understanding the concept of an instance like human beings, one need a knowledge graph consisted of instances, concepts, as well as their relations. However, we observe two major limitations in existing knowledge graphs, which motivate us to build a brand-new knowledge taxonomy, Probase [3], to tackle general purpose understanding of human language.
1). Previous taxonomies have limited concept space. Many existing taxonomies are constructed by human experts and are difficult to be scaled up. For example, Cyc project [4] contains about 120,000 concepts after 25 years of evolution. Some other projects, like Freebase [5], resort to crowd sourcing efforts to increase the concept space, which still lacks general coverage of many other domains and thus holds a barrier for general purpose text understanding. There are also automatized constructed knowledge graphs, such as YAGO [6] and NELL [7]. Nevertheless, the coverages of these concept spaces are still limited. The number of concepts of Probase and some other popular open-domain taxonomies are shown in Table 1, which demonstrate that Probase has a much larger concept space.
Table 1.   Concept space comparison of existing taxonomies.
 Existing Taxonomies Number of Concepts Freebase [5] 1,450 WordNet [8] 25,229 WikiTaxonomy [9] 111,654 YAGO [6] 352,297 DBPedia [10] 259 Cyc [4] $$≈$$ 120,000 NELL [7] 123 Probase [3] $$≈$$5,400,000
2) Previous taxonomies treat knowledge as black and white. Traditional knowledge base aims at providing standard, well-defined and consistent reusable knowledge, and treats the knowledge as black and white. However, treating knowledge as black and white has obviously restrictions, especially when the concept space is extremely large, because different knowledge has different confidence intervals or probabilities, and the best threshold of probabilities depends on the specific application. Different from previous taxonomies, our philosophy is to annotate knowledge facts with probabilities and let the application itself to decide the best way of using it. We believe this design is more flexible and can be utilized to benefit a broader range of text-based applications.
Based on the Probase knowledge taxonomy, we propose a novel conceptualization model to learn text representation in the Probase concept space. The conceptualization model (a.k.a. the Concept Tagging Model) aims to map text into semantic concept categories with some probabilities. It provides computers the commonsense computing capability and makes machines "aware" of the mental world of human beings, through which machines can better understand human communication in text. Based on the conceptualization model, the Probase taxonomy can be applied to facilitate various text-based applications, including topic search, query recommendation, Ads relevance, Web table understanding, etc.
An overview of the entire framework is illustrated in Figure 1, which mainly consists of three layers: (1) knowledge construction layer, (2) conceptualization layer, and (3) application layer.
1). Knowledge construction layer. In the knowledge construction layer, we address the construction of Probase knowledge network. First, the isA facts are mined automatically from the web via a semantic iteration extraction procedure. Second, the taxonomy is constructed following a rule-based taxonomy construction algorithm. Finally, we calculate the probability score for each related instances/concepts in the taxonomy.
2). Conceptualization layer . Based on the constructed sematic knowledge network, we design a conceptualization model to represent raw text in the Probase concept space. The model can be divided into three major components, namely (1) single instance conceptualization, (2) context-aware single instance conceptualization, and (3) short text conceptualization.
3). Application layer . The conceptualization model enables us to generate “Bag-of-Concepts” representation of raw text, which can be utilized to enhance various categories of applications, including but not limited to topic search, query recommendation, Ads relevance calculation and Web table understanding. We also build a Probase browser and a tagging model demo in the application layer, which provide users a quick insight for a specific query.

Figure 1.   The framework overviews
The rest of this paper is organized as follows. Section 2 discusses related works, and Section 3 presents the construction of Probase semantic network. Section 4 introduces the conceptualization model built upon the sematic network; and Section 5 shows some example applications. In addition, Section 6 lists the data sets and statistics. Finally, we make a conclusion in Section 7.
2. Related Work
Knowledge graph has attracted great interests in many research fields. There are many existing knowledge graphs built either manually or automatically.
1). WordNet [8] Different from traditional thesaurus, Wordnet groups words together based on their meanings instead of morphology. Terms are grouped into Synsets, where each term represents a distinct concept. Synsets are interlinked by conceptual-semantics and lexical relations. WordNet have more than 117k Synsets for 146k terms. Wordnet is widely used for term disambiguation as it defines various senses for a term.
2). DBpedia [10] is a project of extracting Wikipedia Infobox into knowledge facts which can be semantically queried using SPARQL. The knowledge in DBpedia is extracted from Wikipedia and collaboratively edited by the community. DBpedia has released 670 million triples in RDF format. DBpedia provides an ontology including 280 classes, 3.5 million entities and 9k attributes.
3). YAGO [6], abbreviation for Yet Another Great Ontology, is an open sourced huge semantic knowledge base, which has fused knowledge extracted from Wordnet, GeoNames and Wikipedia. YAGO combines the taxonomy of WordNet with the Wikipedia category to assign entities to more than 200k classes. YAGO is comprised by more than 120 million facts about 3 million entities. Based on manual evaluation, the accuracy of YAGO is about 95%. YAGO also attaches temporal and special information to the entities and relations by some declarative extraction rules.
4). Freebase [5] is a large knowledge base containing a lot of human labeled data contributed by community members. Freebase extracts data from many sources including Wikipedia, NNDB, Fashion Model Directoryand MusicBrainz. The Freebase has more than 125 million facts, 4,000 types and 7,000 properties. After 2016, all data in Freebase have been transferred to Wikidata.
5). ConceptNet [11] is a semantic network aiming to build a large commonsense knowledge base in a crowd sourcing manner, which focuses on the commonsense relations including isA, partOf, usedFor and capableOf. ConceptNet contains over 21 million edges and over 8 million nodes.
6). NELL [7] is a continuously running system which extracts facts from text in hundreds of millions of Web pages in an iterative way, while patterns are boosted during the process. NELL has accumulated more than 50 million candidate beliefs and has extracted 2,810,379 asserted instances of 1,186 different categories and relations.
7). WikiTaxonomy [9] presents a taxonomy automatically generated from the categories in Wikipedia, which generates 105,418 isA links from a network of 127,325 categories and 267,707 links. The system achieves 87.9 balanced F-measure according to a manual evaluation on the taxonomy.
8). KnowItAll [12] aims to automate the process of extracting large collections of facts from the Web in an unsupervised, domain-independent, and scalable manner. The system has three major components: Pattern Learning (PL), Subclass Extraction (SE), and List Extraction (LE), achieving great improvement on the recall while maintaining precision and enhancing the extraction rate.
The existing knowledge graphs suffer from low coverage and lack of well-organized taxonomies. Moreover, none of them focuses on extracting the super-concepts and sub-concepts of an instance. To automatically generate the taxonomy, we propose Probase which automatically construct the semantic network of isA facts.
3. Semantic Network Construction
The entire data construction procedure of the Probase semantic network will be introduced in detail in the following subsections. First, in Section 3.1, we describe the iterative data extraction process; then, the taxonomy construction step is introduced in Section 3.2; Section 3.3 introduces the algorithm of probability calculation.
3.1   Iterative Extraction
The Probase semantic network [3] is built upon isA facts extracted from the Web. The isA facts can be formulated as pairs consisting of a super-concept and a sub-concept. For example, “cat isA animal” forms a pair, where “cat” is the sub-concept and “animal” is the super-concept. In this work, we propose a method based on semantic iteration to mine the isA pairs from Web.
3.1.1   Syntactic vs. Semantic Iteration
Before our work, the state-of-the-art information extraction methods [7, 8, 12] rely on a syntactic integrative (bootstrapping) approach. It starts with a set of seed patterns to discover example pairs with high confidence. Then, we can derive new patterns from the extracted pairs and apply the new patterns to extract more pairs. The iterative process is continued until a certain stop criterion is reached. However, we observe in practice that the syntactic iteration methods have indispensable barriers in deep knowledge acquisition because high quality syntactic patterns are rare. Therefore, we propose a semantic interactive approach by which the new pairs can be extracted with high confidence based on knowledge accumulated from existing pairs.
3.1.2   The Semantic Iteration Algorithm
First, we extract a set of candidate pairs by Hearst patterns [13] (Table 2). For instance, if we have sentence “… animals such as cat …”, we can extract a pair <animal, cat> which stands cat isA animal. Sometimes, there is ambiguity in the pattern matching process. For instance, in the sentence “… animals other than dogs such as cats …”, we can extract two candidate pairs <animals, cats>, and <dogs, cats>. The algorithm must have the ability to decide which one is the correct super-concept among “animals” and “dogs”. Another example is “… companies such as IBM, Nokia, Proctor and Gamble …”. In this case, we have multiple choices in the sub-concept, namely, <companies, Proctor> and < companies, Proctor and Gamble>. Again, the algorithm should automatically determine their correctness.
Table 2.   Hearst patterns examples
 ID Pattern 1 NP such as {NP,}*{(or | and)} NP 2 Such NP as {NP,}*{(or | and)} NP 3 NP{,} including {NP,}* {(or | and)} NP 4 NP{,NP}*{,} and other NP 5 NP{,NP}*{,} or other NP 6 NP{,} especially {NP,}*{(or | and)} NP
We denote the candidate set of super-concepts for sentence as $$X_s$$, and the candidate set of sub-concepts for sentence as $$Y_s$$. $$Г$$is the set of isA pairs that have already extracted as ground truth. The remaining task for the algorithm is to detect the correct super-concepts and sub-concepts from $$X_s$$and $$Y_s$$, respectively, based on the knowledge accumulated in . For each sentence , if we have multiple choices for the super-concepts, we must choose one as the correct super-concept. It is based on the observation that when ambiguity exists in super-concept matching, there is only one correct answer. After determining the correct super-concept, the goal is to filter out the correct sub-concepts from candidates in . Unlike the super-concept case, we may expect more than one valid sub-concept, and the result sub-concepts should be a subset of $$Y_s$$.
We denote the candidate set of super-concepts for sentence $$s$$ as $${\mathrm{X}}_{s}$$, and the candidate set of sub-concepts for sentence $$s$$ as $${\mathrm{Y}}_{s}$$.$$\mathrm{ }\mathrm{Г}$$ is the set of isA pairs that have already extracted as ground truth. The remaining task for the algorithm is to detect the correct super-concepts and sub-concepts from $${\mathrm{X}}_{s}$$ and $${\mathrm{Y}}_{s}$$ respectively, based on the knowledge accumulated in $$\mathrm{Г}$$. For each sentence $$s$$, if we have multiple choices for the super-concepts, we must choose one of it as the correct super-concept. It is based on the observation that when ambiguity exists in super-concept matching, there is only one correct answer. After determining the correct super-concept, the goal is to filter out the correct sub-concepts from candidates in $${\mathrm{Y}}_{s}$$. Unlike the super-concept case, we may expect more than one valid sub-concept, and the result sub-concepts should be a subset of $${\mathrm{Y}}_{s}$$.
3.1.3   Super-Concept Detection
Let $${X}_{s}=\left\{{\mathrm{x}}_{1},{\mathrm{x}}_{2},\dots ,{\mathrm{x}}_{m}\right\}$$, we compute likelihood $$p(x_k |Y_s)$$for each candidate $${\mathrm{x}}_{k}$$. We assume $${\mathrm{y}}_{1},{\mathrm{y}}_{2},\dots ,{\mathrm{y}}_{n}\in {Y}_{s}$$ are independent with each other given any super-concept $${\mathrm{x}}_{k}$$, then we have
$p(x_k |Y_s )∝p(x_k)p(Y_s |x_k)=p(x_k)∏_(i=1)^np(y_i |x_k)$
（1）
where $$p\left({\mathrm{x}}_{k}\right)$$ is the percentage of pairs in $$\mathrm{Г}$$ that contain $${\mathrm{x}}_{k}$$ as the super-concept, and $$p(y_i |x_k)$$ is the percentage of pairs in $$\mathrm{Г}$$ that have $${\mathrm{y}}_{i}$$ as the sub-concept when the super-concept is $${\mathrm{x}}_{k}$$. There are pairs that do not exist in $$\mathrm{Г}$$, and therefore, we let $$p(y_i |x_k)=ε$$ if no existence can be found for that pair; where $$\mathrm{\epsilon }$$ is set to be a small positive number. Without losing generality, we assume $${\mathrm{x}}_{1}$$ and $${\mathrm{x}}_{2}$$ are the top two candidates with the highest probability scores, and $$p(x_1 |Y_s )>p(x_2 |Y_s )$$. Then, we can compute the ratio of two probabilities as follows:
$r(x_1,x_2)=( p(x_1)∏_(i=1)^np(y_i |x_1))/( p(x_2)∏_(i=1)^np(y_i |x_2))$
(2)
$${\mathrm{x}}_{1}$$ will be chosen as the correct super-concept if $$r(x_1,x_2)$$ is larger than a pre-defined threshold. If not, this sentence is skipped in the current iteration and may be recovered in a later round when $$\mathrm{Г}$$ is more informative. Intuitively, the likelihood p (animals|cats) should be much larger than p (dogs|cats); so, we can correctly select “animals” as the result super-concept.
3.1.4   Sub-Concept Detection
In our implementation, sub-concept detection is based on the features extracted from the original sentences, which is motivated by two basic observations.
Observation 1 : The closer a candidate sub-concept is to the pattern keyword, the more likely it is a valid sub-concept.
Observation 2 : If we are certain that a candidate sub-concept at the k-th position (calculated by the distance from the pattern keyword) is valid, the candidate sub-concepts from position 1 to position k – 1 are also likely to be correct.
For example, consider the following sentence: “… representatives in North America, Europe, the Middle East, Australia, Mexico, Brazil, Japan, China, and other countries …”. Here “and other” is the pattern keyword; China is the candidate sub-concept closest to the pattern keyword. Obviously, China is a correct sub-concept. In addition, if we know that Australia is a legal sub-concept of “countries”, we can be more confident that Mexico, Brazil, and Japan are all correct sub-concepts of the same category; while North America, Europe, and the Middle East are less likely to be the instances of the same category.
Specifically, we find the largest k such that the likelihood $$p(y_k |x)$$ is above a pre-defined threshold. Then, $$y_1,y_2,…,y_k$$are all treated as valid sub-concepts of the super-concept $$x$$. Note that there are sometimes ambiguity in the sub-concept . For instance, in the sentence “… companies such as IBM, Nokia, Proctor and Gamble…” at position 3, the noun phrase can be “Proctor” or “Proctor and Gamble”. Suppose the candidate at the position $$k$$ to be $$c_1, c_2, ...$$, we calculate the conditional probability of each candidate given the super-concept $$x$$ as well as all sub-concepts before position $$k$$:
$p(y_k=c_i |x,y_1,y_2,…,y_(k-1) )=p(c_i│x) ∏_(j=1)^(k-1)p(y_j |c_i,x)$
(3)
As before, $$y_1,y_2,…,y_(k-1)$$are assumed to be independent given the super-concept $$x$$; $$p(c_i |x)$$ is the percentage of existing pairs in $$Г$$where $$c_i$$is a sub-concept when x is the super-concept; $$p(y_j|c_i,x)$$ is the likelihood that $$y_j$$is a valid sub-concept given x is the super-concept and $$c_i$$ is another valid sub-concept in the same sentence. Suppose $$c_1$$ and $$c_2$$ are the top 2 ranked concepts, and we pick $$c_1$$as the final concept if $$r(c_1,c_2)$$exceeds a certain ratio. The value of can be calculated as:
$r(c_1,c_2)=( p(c_1 |x)∏_(j=1)^(k-1)p(y_j |c_1,x))/(p(c_2 |x)∏_(j=1)^(k-1)p(y_j |c_2,x))$
(4)
Intuitively, the probability p(Proctor and Gamble | companies) should be much larger than p(Proctor | companies) after $$\mathrm{Г}$$ accumulating enough knowledge, thus the algorithm is able to select the correct candidate automatically.
3.2   Taxonomy Construction
The taxonomy construction procedure mainly consists of three steps, (1) local taxonomy construction, (2) horizontal merge and (3) vertical merge. First, we build the local taxonomy for each sentence. Then, we perform horizontal merge and vertical merge in sequence on the local taxonomy collection to construct a global taxonomy.
1). Local Taxonomy Construction Figure 2 illustrates the process to build a local taxonomy from each sentence. For example, in the sentence “A such as B, C, D”, we can get three pairs <A, B>, <A, C>, and <A, D> where A is the super-concept and B, C, D are the sub-concepts. We can build a local taxonomy with A as the root node and B, C, D as child nodes.

Figure 2.   Local taxonomy construction.
2). Horizontal Merge After the local taxonomy for each sentence is built, the next step is horizontal merge. Suppose we have two local taxonomies $${T}_{A}^{1}$$ and $${T}_{A}^{2}$$ whose root is $${A}^{1}$$ and $${A}^{2}$$, respectively, where $${A}^{1}$$and $${A}^{2}$$ correspond to the same surface form. If we are sure that $${A}^{1}$$ and $${A}^{2}$$ express the same sense, we can merge the two trees horizontally, as shown in Figure 3. We design a similarity function to calculate the probability that $${A}^{1}$$ and $${A}^{2}$$ are semantically eq ual. Intuitively, if the children of $${A}^{1}$$ and $${A}^{2}$$ are more overlapped, we can be more confident that they are of the same sense. Therefore, we adopt absolute overlap for the similarity calculation function, i.e., $$f\left({A}^{1}, {A}^{2}\right)=\left|{A}^{1}\cap {A}^{2}\right|$$; and a constant threshold $$\delta$$ is applied to determine whether two local taxonomies can be horizontally merged.

Figure 3.   Horizontal merge.
3). Vertical Merge Given two local taxonomies rooted by $${A}^{i}$$ and $${B}^{1}$$, where $$B$$ is a child of $${A}^{i}$$. If we are confident that B is of the same sense as $${B}^{1}$$, we can merge the two taxonomies vertically. As shown in Figure 4, after merging, $${A}^{i}$$ is the root; the original subtree $$B$$ is merged with the taxonomy rooted by $${B}^{1}$$; and the other subtrees $$C$$ and $$D$$ remain at the same position. To determine if B and $${B}^{1}$$ are of the same sense, we calculate the absolute overlap of $${B}^{1}$$’s children and B’s siblings (emphasized by colored nodes in Figure 4).

Figure 4.   Vertical merge: single sense alignment.
There is another case which is more complicated (Figure 5). Both $${T}_{B}^{1}$$ and $${T}_{B}^{2}$$ can be vertically merged with $${T}_{A}^{i}$$, as the child nodes of $${B}^{1}$$ and $${B}^{2}$$ have considerable overlap with B’s siblings in $${T}_{A}^{i}$$. However, the child nodes of $${B}^{1}$$ and $${B}^{2}$$ do not have enough overlap, indicating that $${B}^{1}$$ and $${B}^{2}$$ may express different senses. In this case, we split two senses in the merged taxonomy, i.e., $${T}_{B}^{1}$$ and $${T}_{B}^{2}$$ are merged as two distinct sub-trees in taxonomy $${T}_{A}^{i}$$.

Figure 5.   Vertical merge: multiple sense alignment.
3.3   Probability Calculation
Probability is an important feature of our taxonomy design. Different from previous approaches which treat knowledge as black and white, our solution is to live with noisy data with probability scores and let the application make the best use of it. We provide two kinds of probability scores, namely plausibility and typicality.
Plausibility is the joint probability of an isA pair. For each isA claim $$\mathrm{E}$$, we use $$p\left(\mathrm{E}\right)$$ to denote its probability to be true. Here we adopt a simple noisy-or model to calculate the probability. Assume Eis derived from a set of sentences {$${s}_{1}. {s}_{2}$$, …, $${s}_{n}$$}, a claim is false if every piece of evidence from {$${s}_{1}. {s}_{2}$$, …, $${s}_{n}$$} is false. Assuming every piece of evidence is independent, we have:
$p(E)=1-p(E ̅ )=1-∏_(i=1)^n(1-p(s_i))$
(5)
where $$p(s_i)$$ is the probability of evidence $${s}_{i}$$ to be true. We characterize each $${s}_{i}$$ by a set of features $${F}_{i}$$ including: (1) the PageRank score of the webpage from which sentence $${s}_{i}$$ is extracted; (2) the Hearst pattern used to extract evidence pair <x, y> from $${s}_{i}$$; (3) the number of sentences with x as the super-concept; (4) the number of sentences with y as the sub-concept; (5) number of sub-concepts extracted from sentence $${s}_{i}$$; (6) position of y in the sub-concepts list from sentence $${s}_{i}$$. Supposing the features are independent, we apply Naïve Bayes [14] to derive $$p(s_i)$$ based on the corresponding feature set $${F}_{i}$$. We exploit WordNet to build a training set used for learning the Naïve Bayes Model. Given a pair <x, y> that appears in the WordNet, if there is a path between x and y in the WordNet taxonomy, (i.e., x is an ancestor of y), we regard the pair as a positive example; otherwise, we treat the pair as negative.
Typicality is the conditional probability between a super-concept and its instance (sub-concept). Intuitively, robin is more typical of the concept bird than is ostrich, while Microsoft is more typical of the concept company than is Xyz inc. Therefore, we need a probability score to stand for the typicality of an instance (sub-concept) to its corresponding super-concept. The typicality measure of an instance $$\mathrm{i}$$ to a super-concept $$\mathrm{x}$$ is formulated as:
$$\gamma \left(\mathrm{i}|\mathrm{x}\right)=\frac{\mathrm{n}\left(\mathrm{x},\mathrm{i}\right)\bullet \mathrm{p}\left(\mathrm{x},\mathrm{i}\right)}{{\sum }_{{\mathrm{i}}^{\mathrm{\text{'}}}\in {\mathrm{I}}_{\mathrm{x}}}\mathrm{n}\left(\mathrm{x},{\mathrm{i}}^{\mathrm{\text{'}}}\right)\mathrm{p}\left(\mathrm{x},{\mathrm{i}}^{\mathrm{\text{'}}}\right)}$$ ,
where $$x$$ is a super-concept, $$\mathrm{i}$$ is an instance of $$x$$, $$\mathrm{n}\left(\mathrm{x},\mathrm{i}\right)$$ is the number of appearance that support $$\mathrm{i}$$ as an instance of $$\mathrm{x}$$; $$\mathrm{p}\left(\mathrm{x},\mathrm{i}\right)$$ is the plausibility of pair <x, i>, and $${\mathrm{I}}_{\mathrm{x}}$$ is a set consisting of all instances of super-concept$$x$$ . For example, suppose x = Company, i = Microsoft, j= Xyz Inc. It can be imagined that $$\mathrm{n}\left(x,\mathrm{i}\right)$$ should be much larger than $$\mathrm{n}\left(x,\mathrm{j}\right)$$, so Microsoft should obtain a much bigger typicality score than Xyz Inc with respect to Company. Similarly, we can also define the typicality score denoting the probability of a concept $$\mathrm{x}$$ to instance $$\mathrm{i}$$.
$$\gamma \left(\mathrm{x}|\mathrm{i}\right)=\frac{\mathrm{n}\left(\mathrm{x},\mathrm{i}\right)\bullet \mathrm{p}\left(\mathrm{x},\mathrm{i}\right)}{{\sum }_{{\mathrm{x}}^{\mathrm{\text{'}}}\in {\mathrm{I}}_{\mathrm{i}}}\mathrm{n}\left({\mathrm{x}}^{\mathrm{\text{'}}},\mathrm{i}\right)\mathrm{p}\left({\mathrm{x}}^{\mathrm{\text{'}}},\mathrm{i}\right)}$$ .
4. Conceptualization
In this section, we introduce the conceptualization model which leverages the Probase semantic knowledge network to facilitate text understanding. Conceptualization model (a.k.a. the Concept Tagging model) aims to map text into semantic concept categories with some probabilities. It provides computers the commonsense of semantics and makes machines "aware" of the mental world of human beings, through which the machines can better understand human communication in text.
We consider three sub-tasks for building a conceptualization model. (1) Single Instance Conceptualization, which returns Basic-Level Categorization (BLC) for a single instance. As an example, “Microsoft” could be automatically mapped to Software Company and Fortune 500 company, etc., with some prior probabilities. (2) Context-Aware Single Instance Conceptualization, which produces the most appropriate concepts based on different contexts. As an example, “Apple” could be mapped to Fruit or Company without context, but with context word “pie”, “Apple” should be mapped to Fruit with higher probability. (3) Short Text Conceptualization, which returns the types and concepts related to a short sequence of text. For example, in the sentence “He is playing game on Apple iPhone and eating an apple”, the first “Apple” is Company while the second “Apple” is Fruit.
4.1   Single Instance Conceptualization
Assume e is an instance, and c is a super-concept; we can obtain the Basic-Level Categorization (BLC) results based on typicality scores $$P\left(e|c\right)$$ and $$P\left(c|e\right)$$, where $$P\left(e|c\right)$$ denotes the typicality of an instance e to concept c, and $$P\left(c|e\right)$$ denotes the typicality of a concept c to instance e. We propose several metrics as representative measures used for BLC [15].
MI is the mutual information between $$e$$ and $$c$$, defined as:
$$MI\left(e,c\right)=P\left(e,c\right)log\frac{P\left(e,c\right)}{P\left(e\right)P\left(c\right)}$$ ,
PMI denotes the pointwise mutual information, which is a widely used measure of the association between two discrete variables.
$$PMI\left(e,c\right)=log\frac{P\left(e,c\right)}{P\left(e\right)P\left(c\right)}=log\frac{P\left(e|c\right)P\left(c\right)}{P\left(e\right)P\left(c\right)}=log P\left(e|c\right)-logP\left(e\right)$$ ,
NPMI is a normalized version of PMI, which is proposed to make PMI less sensitive to occurrence frequency and is more interpretable.
$$NPMI\left(e,c\right)=\frac{PMI\left(e,c\right)}{-logP\left(e,c\right)}=\frac{log P\left(e|c\right)-logP\left(e\right)}{-logP\left(e,c\right)}$$ ,
Both PMI and NPMI suffer from the trade-off between general and specific concepts. On the one hand, general concepts may be the correct answer, but they do not have the capability to distinguish instances of different sub-categories; on the other hand, specific concepts may be more representative, but the coverage is low. Therefore, we further propose PMIk and Graph Traversal measures to tackle this problem.
PMIk makes a compromise to avoid producing either too general or too specific concepts.
$$Rep\left(e,c\right)=P\left(c|e\right)P\left(e|c\right)$$ ,
If we take the logarithm of scoring function $$Rep\left(e,c\right)$$, we get:
$$\mathrm{log}Rep\left(e,c\right)=log\frac{{P\left(e,c\right)}^{2}}{P\left(e\right)P\left(c\right)}=PMI\left(e,c\right)+\mathrm{log}P\left(e,c\right)$$ ,
This in fact corresponds to PMI2, which is a normalized form in the PMIk Family. [16]
Graph Traversal is a common way to calculate the relatedness of two nodes in a large network. The scores calculated by general graph transversal can be formulate as:
where $${P}_{k}\left(e,c\right)$$ is the probability of staring from $$e$$ to $$c$$ and back to $$e$$ in 2k steps. When k = 2 which represents a 4-step random walk, we have:
Thus, it is verified that the simple, easy-to-compute scoring method of $$Rep\left(e,c\right)$$ is equivalent to a graph traversal approach under the constraint of 4-step random walk.
4.2 Context-Aware Single Instance Conceptualization
One instance may be mapped to distinct concepts according to different contexts. For example, for “apple ipad”, we want to annotate “apple” with company or brand, and “ipad” with device or product. The major challenge is how to distinguish and detect the correct concepts in different contexts. We propose a framework of context-aware single instance conceptualization [17] which consists of two parts: (1) offline knowledge graph construction and (2) online concept inference.
4.2.1   Offline Knowledge Graph Construction
There are millions of concepts in the Probase semantic network. We first perform clustering on all the concepts to construct the concept clusters. Concretely, we adopt a K-Medoids clustering algorithm [18] to group the concepts into 5,000 clusters. We adopt concept clusters instead of concepts themselves in the graph for noise reduction and computation efficiency. In the rest of this section, we use concept to denote concept cluster. We build a large knowledge graph offline, which is comprised of three kinds of sub-graphs, including (1) instance to concept graph, (2) concept to concept graph and (3) non-instance term to concept graph. Figure 6 shows a piece of the graph around the term watch.

Figure 6.   A subgraph of heterogeneous semantic network around watch.
Instance to concept graph. We directly use the Probase semantic network as the instance to concept graph. $$P\left(c|e\right)$$ is served as the typicality probability of concept (cluster) $$c$$ to instance $$e$$, which can be computed by
$P\left(c|e\right)=\sum _{{c}^{*}\in c}{P}^{*}\left({c}^{*}|e\right) ,$
where $${c}^{*}$$ represents a concept belonging to cluster $$c$$, $${P}^{*}\left({c}^{*}|e\right)$$ is the typicality of the concept $${c}^{*}$$to instance $$e$$.
Concept to concept graph . We assign a transition probability $$P\left({c}_{2}|{c}_{1}\right)$$ to the edge between two concepts $${c}_{1}$$ and $${c}_{2}$$. The probability is derived by aggregating the co-occurrences between all (unambiguous) instances of the two concepts.
$$P\left({c}_{2}|{c}_{1}\right)=\frac{{\sum }_{{e}_{i}\in {c}_{1},{e}_{j}\in {c}_{2}}n\left({e}_{i},{e}_{j}\right)}{{\sum }_{c\in C}{\sum }_{{e}_{i}\in {c}_{1},{e}_{j}\in c}n\left({e}_{i},{e}_{j}\right)}$$ ,
where C denotes a set containing all concepts (clusters); and the denominator is applied for normalization. In practice, we only take the top 25 related concepts for each concept for edge construction.
Non-instance term to concept graph There are terms that cannot be matched to any instances or concepts, i.e., verbs or adjectives. For better understanding of the short text, we also mine lexical relationships between non-instance terms and concepts. Specifically, we use the Stanford Parser [19] to obtain POS tagging and dependency relationships between tokens in the text, and the POS tagging results reveal the type (e.g., adjective, verb, etc.) of each token. Our goal is to obtain the following two distributions.
$$P(z|t)$$ stands for the prior probability that a term $$t$$ is of a particular type $$z$$. For instance, the word watch is a verb with probability $$P\left(verb|watch\right)=0.8374$$. We compute the probability as $$P\left(z|t\right)=\frac{n\left(t,z\right)}{n\left(t\right)}$$, where $$n\left(t,z\right)$$ is the frequency term $$t$$ annotated as type $$z$$ in the corpus, and $$n\left(t\right)$$ is the total frequency of term $$t$$.
$$P(c|t,z)$$ denotesthe probability of a concept $$c$$, given the term $$t$$ of a specific type $$z$$. For example, $$P\left(movie|watch,verb\right)$$ depicts how likely the verb watch is associated with the concept movie lexically. Speciﬁcally, we detect co-occurrence relationships between instances, verbs and adjectives in Web documents parsed by the Stanford Parser. To obtain meaningful co-occurrence relationships, we require that the co-occurrence be embodied by dependency, rather than merely appearing in the same sentence. We first obtain $$P\left(e|t,z\right)$$, which denotes how likely a term $$t$$ of type $$z$$ co-occurs with instance $$e$$:
$$P\left(e|t,z\right)=\frac{{n}_{z}\left(e,t\right)}{{\sum }_{{e}^{*}}{n}_{z}\left({e}^{*},t\right)}$$ ,
where $$P(c|t,z)$$ is the frequency of term $$t$$ and instance $$e$$ having a dependency relationship when $$t$$ is of type $$z$$. Then, taking instances as the bridge, we calculate the relatedness between non-instance terms (adjectives/verbs) and concepts.
$$P\left(c|t,z\right)=\sum _{e\in c}P\left(c,e|t,z\right)=\sum _{e\in c}P\left(c|e\right)×P\left(e|t,z\right)$$ .
In the above equation, $$e\in c$$ means that $$e$$ is an instance of concept $$c$$, and we make an assumption that a concept $$c$$ is conditionally independent with term $$t$$ and type $$z$$ when the instance $$e$$ is given.
4.2.2 Online Concept Inference
We adopt the heterogeneous semantic graph built offline to annotate the concepts for a query. First, we segment the query into a set of candidate terms which use Probase as lexicon and identify all occurrences of terms in the query. Second, we retrieve the subgraph out of the heterogeneous semantic graph by concentrating on the query terms. Finally, we perform multiple random walks on the sub-graph to find the concepts with the highest weights after convergence.
4.3   Short Text Conceptualization
Short text is hard to understand in three aspects: (1) Compared with long sentences, short sentences lack syntactic features and cannot directly apply POS tagging; (2) Short texts do not have sufficient statistical signals; (3) Short text is usually ambiguous due to the lack of context terms. Many research works focus on statistical approaches like topic models [20], which extract latent topics as implicit semantics. However, we argue that semantic knowledge is needed to get a better understanding of short texts. Thus, we aim to utilize the semantic knowledge network to enhance short text understanding [21]. Different from the knowledge graph built for context-aware single instance conceptualization (Section 4.2), we add a novel sub-graph, typed-term co-occurrence graph, which is an undirected graph where nodes are typed-terms and edge weights represent the lexical co-occurrence between two typed-terms. Nevertheless, the number of typed-terms is extremely large, which will increase storage cost and aﬀect the eﬃciency of calculation on the network. Therefore, we compress the original typed-term co-occurrence network by retrieving Probase concepts for each instance. Then, the typed-terms can be replaced by the related concept clusters and the edge weights are aggregated accordingly. Figure 7 shows an example of the compression procedure.

Figure 7.   The compression procedure of typed-term co-occurrence network.
Given a piece of short text, each term is associated with the type and corresponding concepts. We define the types as instance, verb, adjective and concept. For each term with type instance, we also learn the corresponding concepts for the term. Given a piece of short text, the online inference procedure contains three steps, including text segmentation, type detection and concept labeling. An example is illustrated in Figure 8. Given the query “book disneyland hotel california”, the algorithm first segments it as “bookdisneylandhotelcalifornia ”; then, it detects the type for each segment; at last, the concepts are labeled for each instance. As shown in the figure, the output is “book[v] disneyland[e](park) hotel[c] california[e](state)”, which means that book is a verb, disneyland is an instance of the concept park, hotel is a concept, and california is an instance of the concept state.

Figure 8   An example of short text understanding.
4.3.1   Text Segmentation
The goal of text segmentation is to divide a short text into a sequence of meaningful components. To determine a valid segmentation, we define two heuristic rules: (1) except for stop words, each word belongs to one and only one term; (2) terms are coherent (i.e., terms mutually reinforce each other). Intuitively, {april in parislyrics } is a better segmentation of “april in paris lyrics” than {april in parislyrics }, since “lyrics” is more semantically related to songs than months or cities. Similarly, {vacationapril in paris} is a better segmentation of “vacation april in paris” than {vacationapril in paris }, because “vacation”, “april”, and “paris” are highly coherent with each other, while “vacation” and “april in paris” is less coherent.
The text segmentation algorithm can be conducted in the following steps. First, we construct a term graph (TG) which consists of candidate terms and their relationships. Next, we add an edge between two candidate terms when they are not mutually exclusive and set the edge weight to reﬂect the strength of mutual reinforcement. Finally, the problem of ﬁnding the best segmentation is transformed into the task of ﬁnding a clique in the original TG, with 100% word coverage while maximizing the average edge weights.
4.3.2   Type Detection
The type detection procedure annotates the type for each term as verb, adjective, instance or concept. For example, term “watch” appears in the instance-list, concept-list, as well as verb-list of our vocabulary, thus the possible typed-terms of “watch” are watch[c], watch[e] and watch[v]. For each term, the type detection algorithm determines the best typed-term from the set of possible candidates. In the case of “watch free movie”, the best typed-terms for “watch”, “free”, and “movie” are watch[v], free[adj] and movie[c], respectively.
Traditional approaches resort to POS tagging algorithms which consider lexical features only, e.g., Markov Model [22]. However, such surface features are insuﬃcient to determine types of terms especially in the case of short text. In our solution, we calculate the probability by considering both traditional lexical features and semantic coherence features. We formulate the problem of type detection into a graph model and propose two models, namely Chain model and Pairwise model.
Chain model borrows the idea of ﬁrst order bilexical grammar and considers topical coherence between adjacent typed-terms. In particular, we build a chain-like graph where nodes are typed-terms retrieved from the original short text. Edges are added between each pair of typed-terms mapped from adjacent terms in the original text sequence, and the edge weights between typed-terms are calculated by aﬃnity scores (see the example in Figure 9(a)). Chain model only considers semantic relations between consecutive terms which will lead to mistakes.
Pairwise model adds edges between typed-terms mapped from each pair of terms rather than adjacent terms only. As an example, in Figure 9 (b), there are edges between non-adjacent terms “watch” and “movie”. The goal of the Pairwise Model is to ﬁnd the best sequence of typed-terms which guarantees that the maximum spanning tree (MST) constructed by the selected sequence has the largest total weight. As shown in Figure 9 (b), as long as the total weight of edges between (watch[v], movie[c]) and (free[adj], movie[c]) is the largest, {watch[v],free[adj],movie[c]} can be successfully recognized as the best sequence of type detection for the query “watch free movie”. We employ the Pairwise model in our system as it achieves higher accuracy experimentally compared with the Chain model.

Figure 9.   Example of Chain Model and Pairwise Model.
4.3.3   Concept Labeling
The goal of concept labeling is to re-rank the candidate concepts according to the context for each instance. The most challenging task in concept labeling is to deal with ambiguous instances. Our intuition is that a concept is appropriate for an instance only if it is a common semantic concept of that instance and is supported by the surrounding context at the same time. Take “hotel California eagles” as an example, where both animal and music band are popular semantic concepts to describe “eagles”. If we find a concept song in the context, we can be more confident that music band should be the correct concept for “eagles”.
After type detection, we have obtained a set of instances for concept labeling. For each instance, we collect a set of concept candidates and perform instance disambiguation based on a Weighted-Vote approach, which is a combination of Self-Vote and Context-Vote. Self-Vote denotes the original affinity weight (calculated by normalized co-occurrence) of a concept cluster c associated to the target instance; while Context-Vote leverages the affinity weights between the target instance and other concepts found in the context. In the case of “hotel california eagles”, the original concept vector of the instance eagles is (<animal, 0.2379>, <band, 0.1277>, <bird, 0.1101>, <celebrity, 0.0463>, …) and the concept vector for context “hotel california” is (<singer, 0.0237>, <band, 0.0181>, <celebrity, 0.0137>, <album, 0.0132>, …). After disambiguation by Weighted-Vote, the final conceptualization result of eagles is (<band, 0.4562>, <celebrity, 0.1583>, <animal, 0.1317>, …).
5. Application
Probase Browser shows the backbone of the Probase taxonomy and provides an interface to search for super-concepts, sub-concepts, as well as similar concepts corresponding to the given query. Figure 10 is a snapshot of the Probase browser.

Figure 10.   A snapshot of the Probase browser.
Tagging Service provides thecapability of tagging a piece of text with a concept vector, based on which the semantic similarity can be calculated, and various text processing applications can be affiliated. Figure 11 shows a snapshot of the instance conceptualization demo when query a single instance “python”. Figure 12 shows the snapshot given “apple” in the context “pie” and “ipad” respectively. As shown in the figure, our tagging service can map the term “apple” into correct concepts under different context. An example of short text conceptualization is illustrated in Figure 13 by querying the tagging service with “apple engineer is eating the apple”. The result shows the capability of our tagging algorithm to distinguish different senses of “apple” in the short text scenario.

Figure 11.   Snapshot of single instance conceptualization.

Figure 12.   Snapshot of context-aware single instance conceptualization.

Figure 13.   An example of short text conceptualization.
Topic Search [23] aims at understanding the topic underlined in each query for better search relevance. Figure 14 shows a snapshot when user queries “database conference in Asian cites”. As shown in the figure, the search results correctly rank VLDB 2002, 2006, and 2010 at the top, which are hold in Hong Kong, Seoul, and Singapore, respectively. Traditional Web search takes queries as sequences of keywords instead of understanding the semantic meanings, so it is hard to generate the correct answers for the example query. To achieve this goal, we present a framework that improves Web search experiences using Probase knowledge and the conceptualization models. First, it classifies Web queries into different patterns according to the concepts and entities in addition to keywords contained in the queries. Then, it produces answers by interpreting the queries with the help of Probase semantic concepts. Our preliminary results showed that the framework was able to understand various types of topic-like queries and achieved much higher user satisfaction.

Figure 14.   The framework of topic search.
Understanding Web Tables [24]. We use Probase to help interpreting and understanding Web tables, which unlocks the wealth of information hidden in the Web pages. To tackle this problem, we build a pipeline for detecting web tables, understanding their contents, and applying the derived knowledge to support semantic search. From 300 million web documents, we extract 1.95 billion raw HTML tables. Many of them do not contain useful or relational information (e.g., be used for page layout purpose); others have structures that are too complicated for machines to understand. We use a rule-based filtering method to acquire 65.5 million tables (3.4% of all the raw tables) that contain potentially useful information. We adopt Probase taxonomy to facilitate the understanding of table content, by associating the table content with one or more semantic concepts in Probase. Based on the knowledge mined from Web tables, we build a semantic search engine over tables to demonstrate how structured data can empower information retrieval on the Web. A snapshot of our sematic search engine is shown in Figure 15. As illustrated in the figure, when user queries “American politicians birthday”, the search engine returns with aggerated web tables consisting of birthday and other related knowledge of various American politicians.
Understanding Web Tables [24]. We use Probase to help interpreting and understanding web tables, which unlocks the wealth of information hidden in the web pages. To tackle this problem, we build a pipeline for detecting web tables, understanding their contents, and applying the derived knowledge to support semantic search.  From 0.3 billion web documents, we extract 1.95 billion raw HTML tables. Many of them do not contain useful or relational information (e.g., be used for page layout purpose); others have structures that are too complicated for machines to understand. We use a rule-based filtering method to acquire 65.5 million tables (3.4% of all the raw tables) that contain potentially useful information. We adopt Probase taxonomy to facilitate the understanding of table content, by associating the table content with one or more semantic concepts in Probase. Based on the knowledge mined from web tables, we build a semantic search engine over tables to demonstrate how structured data can empower information retrieval on the web. A snapshot of our sematic search engine is shown in Figure 15. As illustrated in the figure, when user queries “American politicians birthday”, the search engine returns with aggerated web tables consisting of birthday and other related knowledge of various American politicians.

Figure 15.   Snapshot of the web tables.
Channel-based Query Recommendation [25] aims to anticipate user search needs when browsing different channels, by recommending the hottest and highly related queries for a given channel. As shown in Figure 16, there are three channels, News, Sports and Entertainment, and several queries are recommended under each channel to enable the users to explore the hottest topics related to the target channel. One of the main challenges is how to represent queries and channels which are short pieces of text. Traditional representation methodology treats text as bag of words, which suffers from mismatch of surface forms of words, especially when text is short. Therefore, we leverage the Probase taxonomy to obtain a “Bag of Concepts” representation for each query and channel, in order to avoid surface mismatching and handle the problem of synonym and polysemy. By leveraging the large taxonomy knowledgebase of Probase, we learn a concept model for each category, and conceptualize a short text to a set of relevant concepts. Moreover, a concept-based similarity mechanism is presented to classify the given short text to the most similar category. Experiments showed that our framework could map queries to channels with a high degree of precision (avg. precision = 90.3%).
Channel-based Query Recommendation [25] aims to anticipate user search needs when browsing different channels, by recommending the hottest and highly related queries for a given channel. As shown in Figure 16, there are three channels, News, Sports and Entertainment, and several queries are recommended under each channel to enable the users to explore the hottest topics related to the target channel. One of the main challenges is how to represent queries and channels which are short pieces of text. Traditional representation methodology treats text as bag of words, which suffers from mismatch of surface forms of words, especially when text is short. Therefore, we leverage the Probase taxonomy to obtain a “Bag of Concepts” representation for each query and channel, in order to avoid surface mismatching and handle the problem of synonym and polysemy. By leveraging the large taxonomy knowledgebase of Probase, we learn a concept model for each category, and conceptualize a short text to a set of relevant concepts. Moreover, a concept-based similarity mechanism is presented to classify the given short text to the most similar category. Experiments showed that our framework could map queries to channels with a high degree of precision (avg. precision = 90.3%).

Figure 16.   Query recommendation snapshot.
Ads Relevance [3].In sponsored search, the search engine maps each query to the related ad bidding keywords. Since both the query and bidding keywords are short, the traditional bag-of-words approaches do not work well in this scenario. Therefore, we can leverage Probase concept taxonomy to enhance ads relevance calculation. For each short text, we first identify instances from it, and map it to basic-level concepts with score $$Rep\left(e,c\right)$$; then, we merge the concepts to generate a concept vector representing the semantics of the target text. Finally, the relevance score can be calculated through the cosine similarity measure between the concept vectors of the query and bidding keywords. We conduct our experiments on real Ads click logs collected from Microsoft Bing search engine. We calculate the relevance score of each candidate pair of query and bidding keywords, divide the pairs into 10 buckets based on the relevance score, and report the average CTR within each bucket. The result is demonstrated in Figure 17, which shows that the CTR numbers have strong linear correlation with the relevance scores calculated by our model.

Figure 17.   The correlation of CTR with Ads relevance score.
5. Data and Analysis
5.1   Data
The Microsoft Concept Graph can be download at https://concept.research.microsoft.com/, which is a sub-graph of the semantic network we introduce in this paper. The core taxonomy of Microsoft Concept Graph contains above 5.4 million concepts, whose distribution is shown in Figure 18, where Y axis is the number of instances each concept contains (logarithmic scale), and on the X axis are the 5.4 million concepts ordered by their size. Our concept space is much large than other existing knowledge bases. (Freebase contains no more than 2,000 concepts, and Cyc has about 120,000 concepts).

Figure 18.   The distribution of concepts in Microsoft Concept Graph.
5.2   Concept Coverage
Based on the query logs of Microsoft Bing search engine, we estimate the coverage of concepts mined by our methodology. If at least one concept can be found in a query, the query is considered as covered. We thus calculate the percentage of queries that can be covered by Probase and compare the metrics against the other four taxonomies. We utilize the top 50 million queries in Bing’s query log to compute the coverage metrics, and the results are shown in Figure 19. We can see clearly from the figure that Probase has the largest coverage, YAGO ranks the second, and Freebase has a rather low coverage although its instance space is large. It demonstrates that Freebase’s instance distribution is very skew (most instances are distributed in several top categories and lack the general coverage of other concepts).

Figure 19.   Concept coverage of different taxonomy.
5.3   Precision of isA pairs
To estimate the correctness of extracted isA pairs in Probase, we create a benchmark data set of 40 concepts in various domains. For each concept, we randomly pick up 50 instances (sub-concepts) and ask human judgers to evaluate their correctness. Figure 20 depicts the precision of isA pairs for all the 40 concepts we manually evaluate. The overall precision is 92.8% by averaging over all the concepts.

Figure 20.   Precision of extracted isA pairs on 40 concepts.
We further draw the curve of average precision after each iteration (shown in Figure 21). Moreover, the number of concepts and isA pairs discovered after each iteration is drawn in Figure 22. It is obvious that the precision degrades after each round while the number of discovered concepts and isA pairs increases. So, the best iteration number must be chosen as a trade-off between precision and facts number (set as 11 in our implementation).

Figure 21.   Precision of isA pairs after each iteration.

Figure 22.   The number of discovered concepts and isA pairs after each iteration.
5.4   Conceptualization Experiment
In this section, we mainly present the experimental results of conceptualization for both single instance and short text.
(1) Single Instance Conceptualization
Data Set Preparation. We asked human labeler to manually annotate the correctness of the concepts. The label is defined in four categories as shown in Table 3. Each (instance, concept) is assigned to three labelers to annotate. The final label is defined by the majority label. We will ask the fourth annotator to make the final vote for records without majority label. In all, there are 5,049 labeled records.
Table 3. Labeling guideline for conceptualization.

 Label Meaning Examples Excellent Good matched concepts at the basic level (bluetooth, wireless communication protocol) Good A little general or speciﬁc (bluetooth, accessory) Fair Too general or speciﬁc (bluetooth, feature) Bad Non-sense concepts (bluetooth, issue)
Measurement. We employ precision@K and nDCG to evaluate the effectiveness. For precision@K, we treat Good/Excellent as score 1, and Bad/Fair as 0. We calculate the precision of top-K concepts as follows:
$$Precision@k=\frac{{\sum }_{i=1}^{k}{rel}_{i}}{k}$$ ,
where reli is the score we deﬁne above. For nDCG, we treat Bad as 0, Fair as 1, Good as 2, and Excellent as 3. Then we calculate nDCG as follows:
$$nDCG=\frac{{rel}_{i}+{\sum }_{i=2}^{k}\frac{{rel}_{i}}{logi}}{{ideal_rel}_{i}+{\sum }_{i=2}^{k}\frac{{ideal_rel}_{i}}{logi}}$$ ,
where reli is the relevance score of the result at rank i, and ideal reli is the relevance score at rank i of an ideal list, obtained by sorting all relevant concepts in decreasing order of the relevance score.
Experimental result. Figure 23 shows the evaluation of the top 20 results using both Precision and nDCG with and without smoothing. We compare our proposed scoring functions with various baselines: MI, NPMI, PMI3.Where PMI3 is defined as:
$${PMI}^{3}\left(e,c\right)=log\frac{{p\left(e,c\right)}^{3}}{p\left(e\right)p\left(c\right)}$$ .
From the result, we can see that our proposed scoring function outperforms baseline on both precision and nDCG.
(2) Short Text Conceptualization
Data Set Preparation. To validate our generalizability, we build two data sets to evaluate our algorithm including user search query and tweets. To build query data set, we first manually picked 11 ambiguous terms including “apple”, “fox” with instance ambiguity, “watch”, “book”, “pink”, “blue”, “population”, “birthday” with type ambiguity, and “April in Paris”, “hotel California” with segmentation ambiguous. Then we randomly selected 1100 queries containing 11 ambiguous terms. Moreover, we randomly sampled another 400 queries without any restriction. In all, there are 1500 queries. To build tweets data set, we also randomly sampled 1,500 tweets using Twitter’s API. To clean the tweets, we removed some tweet-specific features, such as @username, hashtags, urls, etc. We asked human labeler to manually annotate the correctness. For each record, we assign at lease three labelers and pick up the majority vote as final label.
Experimental Result. To evaluate the effectiveness of the algorithm, we compared our methods with several baseline methods. [27] which conducts instance disambiguation in queries based on similar instances, [28] which conducts instance disambiguation in queries based on related instances, and [26] which conducts instance disambiguation in tweets based on both similar and related instances. Table 4 presents the results which show that our conceptualization is not only effective but also robust.

Figure 23.   Precision and nDCG Comparison.
Table 4.   Precision of short text understanding.
 [27] [28] [26] Our query 0.694 0.701 - 0.943 tweet - - 0.841 0.922
6. Conclusion
In this paper, we present the end-to-end framework of building and utilizing the Microsoft Concept Graph, which consists of three major layers, namely semantic network construction, concept conceptualization, and applications. In the sematic network construction layer, we focus on the knowledge extraction and taxonomy construction procedures to build an open-domain and probabilistic taxonomy known as Probase. Like human mental process, we then represent text by concept space using conceptualization models, and empower many applications including topic search, query recommendation, Ads relevance calculation as well as Web table understanding. The system has received wide public attention ever since released in 2016 (100K+ pageviews, 2M+ API calls and 3K+ registered downloads from 50K+ visitors over 64 countries).
Acknowledgments
[1]. G. Murphy. The big book of concepts. Cambridge, MA:　The MIT Press, 2004. isbn:　9780262632997.
[2]. P. Bloom. Glue for the mental world. Nature, Jan 2003, 421:212–213. doi: 10.1038/421212a.
[3]. W. Wu, H. Li, H. Wang, K. Zhu. Probase: A probabilistic taxonomy for text understanding. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012. doi: 10.1145/2213836.2213891
[4]. D. B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Reading, MA:　Addison-Wesley, 990. isbn: 9780201517521.
[5]. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: Ａ collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008. doi: 10.1145/1376616.1376746.
[6]. F. M. Suchanek, G. Kasneci, & G. Weikum. Yago: A core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, ACM, 2007, pp. 697–706. doi: 10.1145/1242572.1242667.
[7]. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R.H. Jr., & T.M. Mitchell. Toward an architecture for neverending language learning. In: Proceedings of the 24th Conference on Artificial Intelligence, AAAI, 2010, pp. 1306-1313 . doi: 10.1145/1807406.1807449.
[8]. Miller, George A. WordNet: A lexical database for English. Communications of the ACM 38(11)(1995), 39-41. doi: 10.1145/219717.219748
[9]. S.P. Ponzetto, & M. Strube. Deriving a large-scale taxonomy from wikipedia. In: Proceedings of the 22nd National Conference on Artificial Intelligence, AAAI, 2007, pp. 1440–1445. Available at: http://www.aaai.org/Papers/AAAI/2007/AAAI07-228.pdf.
[10] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, & Z. Ives. DBpedia: A nucleus for a web of open data. In: Proceedings of the 6th International Semantic Web Conference/Asian Semantic Web Conference, Springer, 2007, pp. 722–735. doi: 10.1007/978-3-540-76298-0_52.
[11]. R. Speer, & C. Havasi. Representing general relational knowledge in ConceptNet 5. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, 2012, pp. 3679–3686. Available at: http://www.lrec-conf.org/proceedings/lrec2012/pdf/1072_Paper.pdf
[12]. O. Etzioni, M. Cafarella, D. Downey, A. P. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in knowitall: preliminary results. Proceedings of the 13th international conference on World Wide Web, ACM, 2004, pp. 100-110. doi: 10.1145/988672.988687
[13]. M.A. Hearst. Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th conference on Computational linguistics-Volume 2. ACL, 1992, pp. 539–545.doi: 10.3115/992133.992154.
[14]. A. McCallum, & K. Nigam. A comparison of event models for naive bayes text classification. In: Proceedings of AAAI-98 Workshop on Learning for Text Categorization , AAAI, 1998, pp. 41--48.
[15]. Z. Wang, H.X. Wang, J. Wen, & Y. Xiao. An inference approach to basic level of categorization. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM), ACM, 2015, 653-662. doi: 10.1145/2806416.2806533
[16]. B. Daille. Approche mixte pour l’extraction de terminologie: Statistique lexicale et filtres linguistiques. PhD dissertation, Université Paris VII, 1994.
[17]. Z. Wang, K. Zhao, H. Wang, X. Meng, & J. Wen. Query understanding through knowledge-based conceptualization. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence, AAAI, 2015, pp. 3264-3270.
[18]. P. Li, H. Wang, K. Zhu, Z. Wang, & X. Wu. Computing term similarity by large probabilistic isA knowledge. In: Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, ACM, 2013, 1401-1410. doi: 10.1145/2505515.2505567.
[19]. D. Marneffe, B. MacCartney, & C.D. Manning. Generating typed dependency parses from phrase structure parses. In: Proceedings of the 5th International Conference on Language Resources and Evaluations, 2006, pp. 449–454.
[20]. D. Blei, A.Y. Ng, & M.I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan) (2003), 993–1022. Available: http://jmlr.csail.mit.edu/papers/v3/blei03a.html.
[21]. W. Hua, Z. Wang, H. Wang, K. Zheng, and X. Zhou, Short text understanding through lexical-semantic analysis. In: IEEE 31st International Conference on Data Engineering, IEEE, pp. 495-506. doi: 10.1109/ICDE.2015.7113309.
[22]. J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language 6(3) (1992), 225–242. doi: 10.1016/0885-2308(92)90019-Z.
[23]. J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language 6(3) (1992), 225–242. doi: 10.1016/0885-2308(92)90019-Z. J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language6(3) (1992), 225–242. doi: 10.1016/0885-2308(92)90019-Z.
[24]. J. Wang, H. Wang, Z. Wang, and K. Q. Zhu. Understanding tables on the web. International Conference on Conceptual Modeling. Springer, 2012. pp. 1–14. doi:10.1007/978-3-642-34002-4_11.
[25]. F. Wang, Z. Wang, Z. Li, & J. Wen. Concept-based short text classification and ranking. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 2014. doi: 10.1145/2661829.2662067.
[26]. P. Ferragina, & U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by wikipedia entities). In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, ACM, 2010, ,pp. 1625-1628. doi: 10.1145/1871437.1871689.
[27]. Y. Song, H. Wang, Z. Wang, H. Li, & W. Chen. Short text conceptualization using a probabilistic knowledgebase. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, AAAI, 2011, pp. 2330-2336. doi: 10.5591/978-1-57735-516-8/IJCAI11-388.
[28]. D. Kim, H. Wang, and A. Oh. Context-dependent conceptualization. In Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence, AAAI, 2013, pp 2654-2661.
Dawei Zhang received his Master degree from Peking University. He is currently the Founder of MIX Labs. His research interests include machine learning, natural language processing and knowledge graph.
Article and author information
Cite As
L. Ji, Y. Wang, B Shi, D. Zhang, Z. Wang, & J. Yan. Microsoft concept graph: Mining semantic concepts for short text understanding. Data Intelligence 1( 2019). doi: 10.1162/dint_a_00013
Lei Ji
All of the authors contributed equally to the work. L. Ji (leiji@microsoft.com) is the leader and is developing new component for V2.0. Y. Wang (Yujing.Wang@microsoft.com) and L. Ji summarized the methodology part of this paper. All the authors have made meaningful and valuable contributions in revising and proofreading the manuscript.
leiji@microsoft.com
Lei Ji received her Bachelor and Master degrees from Beijing Institute of Technology and is currently a Ph.D. student in computer science from Institute of Computing Technology Chinese academy of Science. She is currently a researcher at Microsoft Research Asia. Her research interests include knowledge graph, cross-modality visual and language learning.
0000-0002-7569-3265
Yujing Wang
All of the authors contributed equally to the work. Y. Wang (Yujing.Wang@microsoft.com) and L. Ji summarized the methodology part of this paper. Y. Wang (Yujing.Wang@microsoft.com) and B. Shi (botianshi@bit.edu.cn) summarized the applications and experiment. All the authors have made meaningful and valuable contributions in revising and proofreading the manuscript.
Yujing Wang received her Bachelor and Master degrees from Peking University. She is currently a Researcher at Microsoft Research Asia. Her research interests include AutoML, text understanding, and knowledge graph.
Botian Shi
All of the authors contributed equally to the work. Y. Wang (Yujing.Wang@microsoft.com) and B. Shi (botianshi@bit.edu.cn) summarized the applications and experiment. All the authors have made meaningful and valuable contributions in revising and proofreading the manuscript.
Botian Shi is currently a Ph.D. student in computer science from Beijing Institute of Technology, who conduct this work during internship at MSRA. His research interest include natural language processing, multi-modal knowledge based image/video understanding.
Dawei Zhang
All of the authors contributed equally to the work. J. Yan (jun.yan@yiducloud.cn), Z.Y. Wang(wzhy@outlook.com), D.W. Zhang (zhangdawei@outlook.com) used to be the leaders of the Microsoft Concept Graph system, and now L. Ji (leiji@microsoft.com) is the leader and is developing new component for V2.0.
Dawei Zhang received his Master degree from Peking University. He is currently the Founder of MIX Labs. His research interests include machine learning, natural language processing and knowledge graph.
Zhongyuan Wang
All of the authors contributed equally to the work. J. Yan (jun.yan@yiducloud.cn), Z.Y. Wang(wzhy@outlook.com), D.W. Zhang (zhangdawei@outlook.com) used to be the leaders of the Microsoft Concept Graph system, and now L. Ji (leiji@microsoft.com) is the leader and is developing new component for V2.0.
Zhongyuan Wang received his Bachelor, Master, and Ph.D. degrees in computer science at Renmin University of China in 2007, 2010, and 2015 respectively. He is currently a Senior Researcher and Senior Director of Meituan-Dianping. His research interests include knowledge base, web data mining, online advertising, machine learning, and natural language processing.
Jun Yan
All of the authors contributed equally to the work. J. Yan (jun.yan@yiducloud.cn), Z.Y. Wang(wzhy@outlook.com), D.W. Zhang (zhangdawei@outlook.com) used to be the leaders of the Microsoft Concept Graph system, and now L. Ji (leiji@microsoft.com) is the leader and is developing new component for V2.0.
Jun Yan received his Ph.D. degree in school of mathematical science at Peking University. He is currently the director of AI Lab of Yiducloud. His research interests include medical AI research, knowledge graph mining, and online advertising.
Publication records
Published: None （Versions2
References
Data Intelligence