Tailored semantic annotation for semantic search

This paper presents a novel method for semantic annotation and search of a target corpus using several knowledge resources (KRs). This method relies on a formal statistical framework in which KR concepts and corpus documents are homogeneously represented using statistical language models. Under this framework, we can perform all the necessary operations for an efficient and effective semantic annotation of the corpus. Firstly, we propose a coarse tailoring of the KRs w.r.t. the target corpus with the main goal of reducing the ambiguity of the annotations and their computational overhead. Then, we propose the generation of concept profiles, which allow measuring the semantic overlap of the KRs as well as performing a finer tailoring of them. Finally, we propose how to semantically represent documents and queries in terms of the KR concepts and the statistical framework to perform semantic search. Experiments have been carried out with a corpus about web resources which includes several Life Sciences catalogues and Wikipedia pages related to web resources in general (e.g., databases, tools, services, etc.). Results demonstrate that the proposed method is more effective and efficient than state-of-the-art methods relying on either context-free annotation or keyword-based search.


Introduction
Semantic annotation is the process of linking the meaning of unstructured data to concepts that are unambiguously described in a knowledge resource (KR). Automatic semantic annotation plays a crucial role in a great variety of Semantic Web applications such as linked data generation, open information extraction, ontology alignment, and semantic search. Specifically, semantic search allows users to express their information needs in terms of concepts taken from one or several KRs. Unlike traditional keyword-based search, semantic search can make use of the KR semantic relationships to perform new tasks, such as refining user queries with broader or more specific concepts of the KR, browsing the whole content of the collection through the taxonomies provided by the KRs, and providing friendlier visualizations to explore the retrieved documents [6]. Successful applications like PubMed/Medline [1], the most popular search engine for the biomedical community, have demonstrated the enormous potential that semantic annotations have for end-users and third-party information consumer applications. Unfortunately, PubMed/Medline relies on manual semantic indexing performed by experts, which cannot be extrapolated to other domains and other scenarios that require massive annotation of texts, such as opinion analysis. As a consequence, there is currently a great demand for fully automatic annotation methods.
Automatic semantic annotation has been widely applied to Life Sciences. For example, the biomedical community is interested in finding out new relationships between biological systems and clinical research. Many text mining approaches rely on the semantic annotation of the scientific literature in order to identify relevant biomedical entities such as proteins, genes, diseases, etc. and their relationships [41]. Outside the Life Sciences area, semantic annotation has mainly focused on Named Entities such as people, organizations, places, etc. Most of these methods rely on the dictionary look-up approach, which consists in identifying the entities mentioned in a text by looking for slight variants of them in the KR lexicon. It is well known that for open target collections and large KRs, a text chunk can match several concepts of the KR, which leads to the ambiguity issue. Even in specialized scenarios like Biomedicine, ambiguity can produce enough noise to hamper the effectiveness of semantic searches (this will be further discussed in Section 4).
Several decades of research on word sense disambiguation (WSD) have demonstrated how hard it is to deal with ambiguity in natural language processing. Traditionally, WSD has been defined in terms of an inventory of word senses (e.g., WordNet). A WSD method aims at selecting the right senses for the words present in a text. Two main trends have been explored in the literature [39], namely, supervised approaches, which learn how to disambiguate a word given a series of examples about its senses, and knowledge-based approaches, which use the KR information to select the right sense without supervision. Nowadays the application of existing WSD methods to automatic semantic annotation is an open challenge due to three main issues: (1) semantic annotation must deal with arbitrary and usually large KRs, (2) the WSD method must be highly scalable in order to annotate very large collections, and (3) it should deal with the incompleteness of a KR, which usually does not contain all the possible senses of a term. Currently, the first issue makes supervised methods impractical, as we cannot gather examples of use for all the concepts in the KRs. The second issue makes current knowledge-based methods very time consuming, as they need either to compare very large profiles of terms (e.g., [2,22,4]) or to compute large graphs with all the senses involved in each sentence (e.g., [40,16]). As for the third issue, most WSD methods will treat many strings as unambiguous because only one of their senses is covered in the KR.
In this paper we propose a novel method to perform context-based semantic annotation. The goal of this method is not only to find the concepts that lexically best fit the target text, but also to ensure that their latent meanings fit those of the corpus to be annotated. We evaluate our semantic annotation method for semantic search tasks, in particular, for web resource discovery, where user queries are heterogeneous and usually expressed at a high level of abstraction.
The outline of the paper is as follows: in Section 2 we state the main contributions and novelties of the proposed method. Section 3 introduces some notation and background about the underlying foundations and Section 4 discusses the main semantic annotation issues overcome by our method. In Section 5 we present the proposed method and then explain each of the components. Section 6 is devoted to the experimental evaluation. In Section 7 we review current related work and in Section 8 we discuss the main conclusions and future work.

Contribution
The main contribution of the paper is a novel method for performing context-based semantic annotation and search based on a statistical formal background, more specifically, on statistical language models. The main novelties of this method are:
• It is able to deal with several arbitrary and large KRs.
• Unlike current Wikipedia and UMLS®-based annotators, our method is independent of the specific characteristics of each KR. For example, it does not make use of disambiguation pages, internal links and other Wikipedia-specific features.
• It is able to deal with both global and local contexts in order to validate the generated annotations.
• It uses a statistical framework for performing all the required operations for semantic indexing and search.
As far as we know, this is the first time language models are used to define a formal framework for semantic annotation and search. Statistical language models have provided in the last decades a sound background to perform most text processing tasks, such as information retrieval, text categorization and automatic text translation. Language models define a theoretical framework to represent and operate over text semantics in terms of word distributions. The main advantage of these models is that they do not require any kind of natural language processing, making them quite attractive to define scalable methods for automatic semantic annotation and semantic search. Moreover, in this paper we show how language models can be naturally used to tailor KRs to a target corpus in order to reduce ambiguity and increase efficiency.

Background
In this section we introduce the concepts and foundations that underlie the developed method. First, we define the concept of KR. Then, we define the notion of semantic annotation. Finally, we introduce statistical language models as the main foundation of our approach.

Knowledge Resource
In the following, we formalize the concept of KR and the minimal elements it must provide in order to be useful for semantic annotation and search. In order to find out candidate concepts for a text chunk, the KR must provide a lexicon describing its concepts. We assume that there exists a function lex(c) that returns the set of strings describing the concept c (e.g., labels, synonyms, etc.). This set of strings can contain different lexical variants of c and synonyms of these variants. Moreover, we also assume that the KR provides a function def(c) that provides a short description of the concept.
The concepts in a KR can be taxonomically related by their subsumption (is-a) or by "broader-than" relationships. The taxonomic relationship between two concepts c and c′ is represented as c ⪯ c′. A KR can provide other concept relationships but they are not considered in our approach.
In this work, we make use of the largest and most popular KRs currently used for semantic search: UMLS® and Wikipedia. For the latter, we adopt Wikinet [38] since it fits our definition of KR.
For illustration purposes, we show an example of the information that Wikinet provides for the concept with identifier "W11258494":
lex(c) = { residue, chemical residue }
def(c) = "In chemistry, residue is the material remaining after a distillation or an evaporation of a methyl group. It may also refer to the undesired byproducts of a reaction"

Semantic Annotation
Performing the semantic annotation of a document d consists in finding mappings between text chunks t of d (i.e., sequences of adjacent terms) and the concepts that best semantically describe the contents of d. As concepts of a KR are usually expressed as noun phrases, text chunks are usually associated to these syntactic structures. We formally define semantic annotation as follows:
Definition 3.2. Given a knowledge resource KR, and d = (w_1, ..., w_n) a document (i.e., input set of sequences over terms from the vocabulary V), a semantic annotation is a pair ⟨c, t⟩ where c ∈ KR and t is a subsequence of d such that there exists a mapping from lex(c) to a subset t′ ⊆ t.
Next, we show an example of the semantic annotation of the following sentence, which is a description of a database: "BriX is a database containing some protein fragments from 4 to 14 residue from protein homology". Each annotation includes the KR name (i.e., src), the concept identifier (i.e., cui), the semantic type and semantic group (i.e., type and grp, resp.), and the offset and length of the annotation in the text (i.e., offset and len).
A semantic annotation is ambiguous if more than one concept has been assigned the exact same subset of tokens. In the previous example, the string homology has been annotated with three different concepts from UMLS® that belong to different semantic types (i.e., Quantitative concept, Qualitative concept and Gene or Genome).
Current automatic annotation is performed independently from the context in which concepts are identified, assuming that the lexicons are well suited to the corpus to be annotated. However, the semantics of a concept may not fit the context in which it occurs. Additionally, we have the problem of erroneously assigning a unique concept to a text chunk because the correct concept is not present in the KR. To detect these cases we also need to take into account the context of the generated annotations. Next, we present the main foundation used to validate semantic annotations, which uses statistical language models to characterize both the KR concepts and the context where the annotations take place.
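The notion of ambiguity above can be checked mechanically once annotations are stored as records. The following is a minimal sketch, not the authors' implementation: the `Annotation` class and the `ambiguous_spans` helper are hypothetical names, the fields mirror the ones listed in the example (with `sem_type` and `length` standing in for type and len), and any identifiers passed to it are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    src: str       # KR name
    cui: str       # concept identifier
    sem_type: str  # semantic type ("type" in the paper's example)
    grp: str       # semantic group
    offset: int    # start of the annotated chunk in the text
    length: int    # length of the annotated chunk ("len" in the example)


def ambiguous_spans(annotations):
    """Return the text spans that were assigned more than one concept."""
    by_span = defaultdict(set)
    for a in annotations:
        by_span[(a.offset, a.length)].add((a.src, a.cui))
    return {span: cuis for span, cuis in by_span.items() if len(cuis) > 1}
```

A span is keyed by its offset and length, so two annotations count as competing only when they cover exactly the same tokens, as in the definition above.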

Statistical Language Models
In order to characterize the KRs and the corpora to be annotated, as well as to capture the main contexts they can generate, we adopt a statistical framework based on language models. A statistical language model assigns a probability to a sequence of n words p(w_1, ..., w_n) by means of a probability distribution.
Let V = {w_1, ..., w_N} be the vocabulary used in the KRs as well as the corpora to be annotated. We consider that any text description d consists of an observed sequence of terms (w_1, ..., w_k) with w_i ∈ V for which a language model θ_d can be associated. This language model represents the word distribution {p(w|θ_d)}_{w∈V}. When this distribution is estimated via Maximum Likelihood Estimation (MLE), we denote the model as θ̂_d. MLE only uses the relative frequency of the terms in d (i.e., p(w|d) ∝ tf(w, d)). Due to the sparsity of θ̂_d, several smoothing approaches have been proposed to estimate more appropriate models for d (e.g., Dirichlet prior and Jelinek-Mercer). Basically, the goal of these techniques is to build an approximate model θ_d by using the global information provided by a background corpus over the same vocabulary V. As our aim is to validate the annotations by characterizing the assigned KR concepts (i.e., building richer concept profiles) and capturing the context where the annotation occurs, we will focus on smoothing techniques based on statistical translation [28].
A translation model estimates the translation probabilities between the words of a given corpus G. We represent this translation model as T_G = {p(w|w′)}_{w,w′∈V}, where p(w|w′) indicates the probability of observing w if we have observed w′ in a given context. Statistical translation has been used in information retrieval (IR) for query expansion [21] and recommendation systems [45]. In our paper, translation models are mainly used to get richer profiles for the KR concepts.
The MLE estimation of a translation model T_G, denoted T̂_G, can be performed by applying the following formulas:
This estimation requires a set of local contexts W taken from the target corpus G, in which word co-occurrence is estimated. In this work, we define these local contexts by moving a window of fixed size across the whole collection [18]. In this case, p(s) = 1/|s| and p(w|s) is estimated by counting the occurrences of w in the context s.
The computation of translation models can be efficiently performed when the size of the local contexts is relatively small (around 4-6 words). Moreover, the implementation of this computation can be massively distributed and parallelized [31].
Several techniques have been proposed to smooth translation models, all of them relying on random walk techniques. Thus, a k-step random walk of T_G with diffusion factor α can be calculated as follows:
When k → ∞ we obtain the eigen-based smoothing of T_G, which has been widely adopted for document classification and spectral clustering [45]. In this paper, we will use these kernels only for smoothing semantic query models across concept taxonomies. For computational reasons, translation models of the corpora and the KRs will be smoothed with a 1-step random walk (i.e., k = 1).
There exist alternative ways to refine the language models associated to documents and queries. In this paper we will use an adaptation of the parsimonious methods used in IR [21]. Basically, these methods assume that the observed model for documents θ_d (resp. queries) is a mixture of a document-specific model (θ^s_d) and a background model (θ_B). To determine the specific model, an Expectation Maximization algorithm [14], alternating an E-step and an M-step, is applied in order to maximize the likelihood w.r.t. the observed model. Language models obtained with parsimonious smoothing play a similar role to the application of the inverse document frequency (IDF) in vector space models: meaningless terms will present higher probabilities in θ_B and lower probabilities in θ^s_d.
All the language models defined over the vocabulary V fall in a (|V| − 1)-simplex space, which can be used to measure the distance between them. Thus, we can measure the distance between the models of KR concepts, corpus documents and queries. For this purpose, in this paper we adopt the Fisher geodesic distance [27], which is defined as follows:
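As a small illustration of the two estimators mentioned above, the sketch below computes the MLE unigram model from relative term frequencies and a Jelinek-Mercer interpolation with a background model. The function names and the default weight `lam` are illustrative assumptions; the paper itself relies on translation-based smoothing rather than on this interpolation.

```python
from collections import Counter


def mle_model(tokens):
    """MLE unigram model: p(w|d) is the relative frequency of w in d."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(doc_model, background, lam=0.5):
    """Linear interpolation of the document model with a background model."""
    vocab = set(doc_model) | set(background)
    return {
        w: (1 - lam) * doc_model.get(w, 0.0) + lam * background.get(w, 0.0)
        for w in vocab
    }
```

Terms unseen in the document receive non-zero probability as long as the background model covers them, which is what alleviates the sparsity of the MLE model.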
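The window-based estimation can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact estimator: it takes p(w|w′) = Σ_s p(w|s) p(s|w′), with p(s|w′) obtained by Bayes' rule from p(w′|s) and a uniform prior over contexts. The function names and the default window size are hypothetical.

```python
from collections import Counter, defaultdict


def windows(tokens, size=5):
    """Slide a fixed-size window over the token sequence."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def translation_model(tokens, size=5):
    """Estimate p(w | w') from co-occurrence inside local contexts."""
    # acc[w'][w] accumulates p(w|s) * p(w'|s) over all contexts s.
    acc = defaultdict(Counter)
    for s in windows(tokens, size):
        counts = Counter(s)
        n = len(s)
        for w_prime, c_prime in counts.items():
            for w, c in counts.items():
                acc[w_prime][w] += (c / n) * (c_prime / n)
    # Normalising each row by sum_s p(w'|s) turns it into p(w | w').
    model = {}
    for w_prime, row in acc.items():
        z = sum(row.values())
        model[w_prime] = {w: v / z for w, v in row.items()}
    return model
```

Each context contributes a number of updates quadratic in the window size, which is why small windows keep the computation cheap, and the contexts can be processed independently before the accumulators are merged.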
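A minimal sketch of both operations follows. The EM updates use the usual parsimonious language model formulation from IR, which is an assumption about the adaptation used here; the mixing weight `lam` and the iteration count are illustrative. The distance is the Fisher geodesic distance on the multinomial simplex, D(θ, θ′) = 2 arccos(Σ_w sqrt(p(w|θ) p(w|θ′))).

```python
import math


def parsimonious(observed, background, lam=0.1, iters=50):
    """Refine an observed model into a specific model against a background."""
    specific = dict(observed)
    for _ in range(iters):
        # E-step: share of each term's observed mass explained by the
        # specific model rather than by the background model.
        e = {}
        for w, p_obs in observed.items():
            num = lam * specific.get(w, 0.0)
            den = num + (1 - lam) * background.get(w, 0.0)
            e[w] = p_obs * num / den if den > 0 else 0.0
        # M-step: renormalise the expected mass.
        z = sum(e.values())
        if z == 0:
            break
        specific = {w: v / z for w, v in e.items()}
    return specific


def fisher_distance(p, q):
    """Fisher geodesic distance between two multinomial models."""
    bc = sum(math.sqrt(p[w] * q[w]) for w in p.keys() & q.keys())
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))
```

The distance is 0 for identical models and reaches π when the two models share no vocabulary, so thresholds over it can be read on a fixed scale.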

Semantic annotation issues
The main issue to be addressed when performing the semantic annotation of a document is the treatment of ambiguous, spurious and wrong annotations, especially when performing context-free semantic annotation.
An ambiguous annotation arises when a sequence of words in a text is assigned to more than one concept from the KR. There are two main factors that characterize ambiguous annotations: the size of the matched text, and the specificity of the terms involved in the annotation. The latter factor can be measured with the inverse document frequency (IDF). Figure 1 shows the percentage of ambiguous annotations w.r.t. the number of words they comprise for the two evaluated KRs (UMLS® and Wikinet). Figure 2 shows the percentage of ambiguous one-word annotations w.r.t. the word IDF, also for UMLS® and Wikinet. As expected, most ambiguous annotations fall in the short-size and low-IDF regions. This fact has a great impact on the semantic annotation process, as ambiguous annotations occur very frequently, producing considerable noise in the resulting annotated collection. Moreover, any WSD method will considerably overload the annotation process.
Wrong annotations are those that involve a concept whose meaning does not fit at all the context in which it is identified. These annotations are very frequent in acronyms and named entities such as programs, databases, algorithms, tools, and so on. Notice that WSD methods cannot reject wrong annotations since they are devised to choose at least one of the possible senses assigned to a word. However, the right sense of the word might not be included in the KR, in which case the word appears to be non-ambiguous for the KR.
Finally, spurious annotations are those that do not provide any value for performing semantic searches. Notice that the KRs have not been devised for semantic annotation but for representing knowledge. A KR can contain concepts that only make sense within the KR, as they are used to organize and classify concept descriptions. These annotations also overload the semantic annotation process, apart from introducing more noise into the annotated collection.
The method for semantic annotation and search that we present in the following section is aimed at reducing as much as possible the number of ambiguous, spurious and wrong annotations.

Method
In this section we present our method for semantic annotation and search, which is based on a fine tailoring of the KR based on the target corpus statistics. Moreover, we also propose to validate the annotations generated with the tailored KR by taking into account the contexts where they occur. Our hypothesis is that the better the tailoring process is, the less overhead and the more effectiveness we obtain in the semantic annotation process, thus reducing the number of ambiguous, spurious and wrong annotations.
Figure 3 sketches the proposed method. Starting from the original KRs and the corpus to be annotated, the first step consists in tailoring the KRs according to the corpus contents (step 1). This step is optional and aims to get coarse refinements of very large and heterogeneous KRs like Wikipedia. From the tailored (or original) KRs we estimate the language models for their concepts, that is, the concept profiles (step 2). These profiles can be similarly tailored using the corpus (step 3). The profiles can also be used to align concepts with similar lexica by assessing the overlap of their contents (step 4). Alignments can tell us how complementary the KRs are and help reduce their redundancies. Once the concept profiles and their alignments are calculated, the semantic annotation of the corpus can be performed (step 5). The annotated corpus is then used to build the semantic document models (i.e., expressed in terms of the KR concepts), which will be the basis for performing semantic searches. Queries are built by users by picking up concepts of interest from the tailored KRs (step 6). From these sets of concepts, an expanded query model is generated. Finally, document models are ranked according to their distance to the expanded query model, and presented to the user (step 6). In the following sections, we explain each of the main components in detail.

Tailoring of a KR
We aim at selecting those concepts in the KR that are semantically related to the target corpus G. For this purpose, we first calculate the unigram model of the KR lexicon (θ̂_KR), which considers the texts returned by the functions lex(c) and def(c). This model is then refined by applying the EM procedure of Section 3.3 taking as background the corpus model θ_G. Let θ^s_KR be the resulting refined model. Then, each concept c ∈ C_KR is selected if its definition is more likely to be generated from the corpus model than from the refined KR model, that is, if p(def(c)|θ_G) > p(def(c)|θ^s_KR). Assuming the independence of the terms, p(def(c)|θ) can be estimated as p(def(c)|θ) = ∏_{w ∈ def(c)} p(w|θ). This test is aimed at filtering out those KR concepts that are completely out of context w.r.t. the target corpus. In very large and heterogeneous KRs like Wikinet, this coarse tailoring allows the system to manage a much smaller KR to efficiently perform semantic annotation.
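The selection test can be sketched as a comparison of log-likelihoods under the term-independence assumption. This is an illustrative sketch: `tailor_kr` and `log_likelihood` are hypothetical names, and the probability `floor` for unseen terms is an assumption added to keep the logarithm defined.

```python
import math


def log_likelihood(tokens, model, floor=1e-9):
    """log p(tokens | model) assuming independent terms."""
    return sum(math.log(model.get(w, floor)) for w in tokens)


def tailor_kr(definitions, corpus_model, kr_specific_model):
    """Keep concepts whose definition is better explained by the corpus.

    definitions maps a concept identifier to the token list of def(c).
    """
    return {
        c
        for c, toks in definitions.items()
        if toks
        and log_likelihood(toks, corpus_model)
        > log_likelihood(toks, kr_specific_model)
    }
```

Working in log space avoids numerical underflow for long definitions, and concepts without a definition are never selected, which matches the later observation that the tailoring depends on the number of definitions in the KR.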

Concept profile construction
For each concept in a KR, we build a concept profile based on language models as follows:
The model of the concept profile is based on a mixture of the models θ_{lex(c)} and θ_{def(c)} obtained from the lexical variants of the concept and the concept definition, respectively. They are calculated as follows:
These models are in turn a mixture of the MLE model and a smoothed model obtained by applying the translation model generated from the KR concept definitions (i.e., T_KR) to the MLE model. In this case, we apply a 1-step random walk as shown in formula 4.
The parameter α weights the contribution of the lexical variants vs. the definition of the concept and the parameter β weights the contribution of the smoothed model generated by applying the translation model.
Moreover, we have also devised an extended version of the concept profiles based on the direct parents and children of the concept:
The parameter γ serves to calibrate the contributions of the models of the parents and children. The prior p(θ_c) is assumed to be uniform. Parameters α, β and γ will be empirically set.
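The basic (non-extended) profile can be sketched as two nested mixtures. The exact formulas are an assumption here: α is placed on the lexical side, β on the translated side, and words without a row in the translation model keep their own mass. The default values follow the ones reported in the experiments, and all function names are hypothetical.

```python
def mix(p, q, weight):
    """Return (1 - weight) * p + weight * q over the union of supports."""
    return {
        w: (1 - weight) * p.get(w, 0.0) + weight * q.get(w, 0.0)
        for w in set(p) | set(q)
    }


def translate(model, translation):
    """One-step random walk: p'(w) = sum over w' of p(w | w') * p(w')."""
    out = {}
    for w_prime, mass in model.items():
        # Words missing from the translation model keep their own mass.
        row = translation.get(w_prime, {w_prime: 1.0})
        for w, p in row.items():
            out[w] = out.get(w, 0.0) + p * mass
    return out


def concept_profile(lex_model, def_model, translation, alpha=0.45, beta=0.50):
    """Mixture of smoothed lexical and definition models for one concept."""
    lex_s = mix(lex_model, translate(lex_model, translation), beta)
    def_s = mix(def_model, translate(def_model, translation), beta)
    return mix(def_s, lex_s, alpha)
```

The translation step is what adds related words that never occur in lex(c) or def(c) to the profile.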

Tailoring concept profiles
Given a concept c and its profile θ_c, we can measure the distance of the concept's profile w.r.t. the corpus as D(θ_c, θ_{c,G}), where θ_{c,G} is the joint distribution of θ_c and the translation model of the corpus:
where T_G is the translation model generated from the target corpus G.
Finally, to obtain the tailored profiles, we filter out all the concepts whose profile produces a distance above a given threshold. We show an excerpt of the tailored biochemistry residue profile, where the word protein dominates.

Measuring the semantic overlap of the KRs
Modeling the concepts in a KR as concept profiles gives us several advantages. For example, for each pair of KRs, we can estimate the level of redundancy between them by checking their associated concept profiles. Thus, to obtain the alignments, for each pair of concepts (c, c′) such that c ∈ KR, c′ ∈ KR′ and lex(c) ≈ lex(c′), we can estimate their semantic overlap with D(θ_c, θ_c′) and a predefined threshold over it. Those pairs of concepts with similar lexica having a high overlap in their profiles are candidates to represent the same meaning.

Context-based semantic annotation
In this paper, we adopt the IR-based approach described in [7], which maps text chunks t to the KR lexicon strings of each concept c according to the following information-theoretical measure:
The function info(s) = Σ_{w∈s} −log(p(w|B)) estimates the information of a string s in terms of its probability in a background corpus (e.g., Wikipedia).
Notice that highly frequent words in the KR contribute little to the final score of the strings containing them. As a result, sim(t, c) returns a list of candidate concepts for t with a normalized score between 0 and 1. This approach is similar to dictionary look-up approaches, but it is more flexible, as it allows selecting candidate concepts c whose lex(c) best discriminates them and partially matches t.
To deal with the problem of ambiguous and wrong annotations we resort to the context-based validation of the candidate concepts. For this, we measure the distance between the local context of the annotation θ_ann and the tailored profile for the candidate concept θ_{c,G}, resulting in the final score D(θ_ann, θ_{c,G}). To validate the annotation, we filter out all the concepts producing a distance above some threshold. In case of an ambiguous annotation, the selected concepts are:
The local context θ_ann for the annotation is obtained by taking a window of fixed size around the annotation and building its corresponding language model estimated via MLE.
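The validation step can be sketched as follows, assuming the Fisher geodesic distance of Section 3.3 as D. The helper names, the window size and the choice to include the annotated tokens themselves in the context are illustrative assumptions; the function returns the surviving candidates ordered by increasing distance.

```python
import math
from collections import Counter


def fisher_distance(p, q):
    """Fisher geodesic distance between two multinomial models."""
    bc = sum(math.sqrt(p[w] * q[w]) for w in p.keys() & q.keys())
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


def context_model(tokens, start, end, window=5):
    """MLE model of a fixed-size window around tokens[start:end]."""
    ctx = tokens[max(0, start - window):end + window]
    counts = Counter(ctx)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def validate(candidates, ctx_model, threshold):
    """Drop candidates too far from the context; rank the rest.

    candidates maps a concept identifier to its tailored profile.
    """
    scored = {c: fisher_distance(ctx_model, prof)
              for c, prof in candidates.items()}
    kept = [c for c, d in scored.items() if d <= threshold]
    return sorted(kept, key=scored.get)
```

An empty result means that every candidate was rejected, which is how wrong annotations can be discarded even when the KR offers a single sense for the string.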

Semantic search
The semantic search proposed in this paper relies on the distribution space where concept, document and query language models are placed. Basically, a semantic search consists of picking up a set of concepts from the KRs, building the corresponding query model, and selecting the nearest document models. The next subsections describe this process in detail.

Semantic representation of documents
Once the documents have been semantically annotated, they can be represented with the corresponding distribution of concepts involved in the annotations. In this way, each d ∈ G has an associated semantic model θ̂_d, which is estimated as follows:
This model clearly benefits the most frequent concepts, which are usually those with broader meanings. In order to capture the topicality of the concepts, we apply the parsimonious method previously described (Section 3.3), taking as background model the distribution of concepts in the target corpus G. The resulting model θ^s_d is then used for indexing the document d.

Semantic query models
A semantic search (query) consists of the set of concepts q = {c_1, ..., c_k} that best fit the user's information need. Without any prior knowledge about the relevance of these concepts w.r.t. the user requirements, we assume that the basic query model follows the uniform distribution, that is, p(c|θ̂_q) = 1/|q|. However, as the target corpus is biased towards very frequent concepts, we need to capture the topicality of the query's concepts. Again, we apply the parsimonious smoothing to the query model to favor more specific concepts. In this case, we also use the concept distribution of the target corpus as background model. The resulting model is denoted as θ^s_q.
As mentioned in the introduction, semantic search can take advantage of the KRs by expanding queries with their concept taxonomic relationships (⪯). For any query, we can consider the downwards expansion of a query q as:
We can also consider the upwards expansion of a query q as:
and a combination of both expansions, represented with q. Now the problem is how to smooth the original query model in order to take into account the new expanded concepts. For this purpose, we use a smoothing operator based on random walks [42] following the regularization framework presented in [47]. Firstly, we define the affinity matrix M to embed the taxonomic relations involved in the query as follows:
From this matrix, we obtain the translation model T as follows:
where δ is the diffusion factor (i.e., how much mass from the original query is diffused to the expanded concepts), and I is the identity matrix.
In this way, the model for the expanded query q is generated by applying this translation model as follows:
Finally, the semantic search is performed by computing the distance D(θ_q, θ^s_d) over all the indexed documents d, ranking them from lower to higher values. The implementation details of this method are given in Section 6.7.
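The final ranking step can be sketched as follows. This is a simplified illustration that leaves out both the parsimonious refinement and the taxonomy-based expansion: documents are represented by the relative frequency of their annotated concepts (an assumption about the document model), the query is the uniform model over the chosen concepts, and D is taken to be the Fisher geodesic distance. All function names are hypothetical.

```python
import math
from collections import Counter


def fisher_distance(p, q):
    """Fisher geodesic distance between two multinomial models."""
    bc = sum(math.sqrt(p[c] * q[c]) for c in p.keys() & q.keys())
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


def document_model(concept_ids):
    """Relative frequency of the concepts annotated in a document."""
    counts = Counter(concept_ids)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def query_model(concepts):
    """Uniform model over the query concepts: p(c | q) = 1 / |q|."""
    unique = set(concepts)
    return {c: 1.0 / len(unique) for c in unique}


def rank(query_concepts, annotated_docs):
    """Rank documents by increasing distance to the query model."""
    q = query_model(query_concepts)
    scored = [(doc_id, fisher_distance(q, document_model(cs)))
              for doc_id, cs in annotated_docs.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

Because queries and documents live in the same concept simplex, replacing either model with its refined or expanded version only changes the inputs to `fisher_distance`, not the ranking procedure.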

Experiments
We have performed several experiments in order to evaluate each of the phases of the proposed method. First, we describe the general setup in which the experiments take place. Then, for each experiment, we describe its objective and the specific datasets and resources used.

Datasets and characteristics of the KRs
For the experiments, we have considered a dataset, two large KRs, a pool of queries and four gold standards (GS) that involve several domains. The dataset used for annotation and semantic search, WebRes, is composed of metadata from 10,692 web resources for Life Sciences. As for the KRs, we have selected two well-known knowledge resources: UMLS® [10] and Wikinet [38]. We evaluate our semantic annotation method and compare it against others using the GS MSH-WSD [25], which is used by state-of-the-art disambiguation approaches and has been specifically designed to evaluate hard disambiguation cases over UMLS®. We have built two GSs, GS_UMLS and GS_Wikinet, to evaluate the performance of the semantic annotation over WebRes. For the semantic search evaluation, we have created a query pool that consists of descriptions of bioinformatics tasks and a GS, GS_query, to evaluate the retrieval results. All the datasets used are freely available.
Regarding the KRs, Table 1 shows the number of concepts of each KR, the size of their lexicon, the number of concept definitions and the number of "is-a" relationships. The characteristics of the annotation dataset and the three GSs will be explained in more detail in the experiments that make use of them. In the previous resources, we distinguish two main domains that overlap: Biomedicine, which combines vocabularies from Biology and Medicine, and Bioinformatics, which combines vocabularies from Biology and Computer Science. UMLS® and MSH-WSD are both located in the Biomedicine domain, whereas Wikinet does not have a specific location because it covers several domains but with low specificity. The WebRes dataset overlaps only partially with the Biomedicine and Bioinformatics domains, and GS_UMLS, GS_Wikinet and GS_query are located inside WebRes and overlap with the two main domains. This heterogeneous setup makes semantic annotation w.r.t. the KRs especially hard because the WebRes dataset overlaps only partially with the reference KRs. The aim of the following experiments is to show that context-validated semantic annotation, using profiles based on statistical language models together with the tailoring of both the KR and the profiles, improves semantic annotation in such a heterogeneous scenario and, therefore, semantic search.

Tailoring of the KRs
The tailoring of a KR (Section 5.1) consists in selecting those concepts from the KR that are semantically related to the target corpus. This filtering process can reduce the overhead of the semantic annotation process, especially when the KR is very large. We have applied the tailoring to both UMLS® and Wikinet. As a result, we obtain 171,274 concepts for UMLS® and 510,390 for Wikinet. Recall that this process selects only concepts whose definition is more likely to be generated from the corpus than from the KR. Therefore, the tailoring depends on the number of definitions of the KR. The proportion of the concepts selected w.r.t. the definitions for UMLS® is 92%, which indicates that UMLS® is well suited to the corpus and the tailoring process discards few concepts. However, for Wikinet this proportion is only 12.3%, which means it contains a lot of noise (i.e., concepts not related to the target corpus) that has been removed through the tailoring process. Therefore, from now on we use the tailored version of Wikinet, Wikinet_T, and UMLS® without tailoring.

Concept profile evaluation
The KR concept profiles play a crucial role in the semantic annotation process, as they serve to disambiguate ambiguous semantic annotations (see Definition 3.3). In this section we evaluate the quality of the concept profiles by means of two experiments.
In the first experiment, we compare our approach for context-validated semantic annotation with state-of-the-art WSD methods. Recall that knowledge-based WSD methods deal with the problem of selecting one of the senses of a word from an inventory of words and their senses, whereas our method is designed to perform an unsupervised, full-fledged semantic annotation. The phase that resembles WSD is the context validation phase, where we have a profile based on translation models for each candidate concept and compare it with the context around the annotation to select the valid concepts for that annotation.
We use the MSH-WSD dataset [25] for evaluating this phase. This corpus contains 203 strings that are associated with more than one possible MeSH code in the UMLS R Metathesaurus (106 of these are ambiguous abbreviations, 88 are ambiguous terms and 9 are a combination of both). The corpus contains up to 100 examples for each possible sense, and a total of 37,888 examples of ambiguous strings taken from Medline.
We evaluate the context validation phase with both the concept profiles (TrM) and the extended concept profiles (TrMExt) described in Section 5.2. The parameters α, β and γ have been empirically set to 0.45, 0.50 and 0.40, respectively. Performance is compared against various alternative approaches. Accuracy results of the experiments are shown in Table 2. Both MRD [33] and 2-MRD [34] are unsupervised approaches based on building concept vector profiles normalized by IDF and comparing them with the context vector using cosine similarity. PPR [3] is also unsupervised and relies on a graph-based algorithm similar to PageRank that converts UMLS R into a graph where the possible meanings of ambiguous words are nodes and the relations between them are edges. AEC [23] and UB [11] are supervised learning algorithms that alleviate the problem of requiring manually annotated training data by querying Medline documents. Our methods present very good scores against the unsupervised approaches of the literature and come close to the semi-supervised ones. Moreover, the extended version improves on the original one, that is, including information about the concept hierarchy in the profiles helps disambiguation.
The aim of the second experiment is to evaluate our method for context-validated semantic annotation in the web resource discovery domain, which is hampered by the heterogeneity of data and where the use of general words introduces a lot of ambiguity. Thus, we have built a dataset of 2,260 web resources from BioCatalogue [8], which is a popular registry in the Life Sciences domain. The web resource metadata registered in this repository consists of well-defined fields, such as categories and tags, and textual descriptions.
To evaluate the semantic annotation over the previous dataset, we have manually created two GSs for the two KRs, GS UMLS and GS Wikinet, with those annotations matching a single word, as single-word concepts are much more prone to ambiguity and errors. GS UMLS contains 11,041 single-word semantic annotations and GS Wikinet contains 5,386.
We have evaluated five configurations of our semantic annotation method: context-free, context-validated using the TrM method for the concept profiles, context-validated using the TrMExt method, and the tailored versions of the last two methods, that is, where the concept profiles have been filtered as indicated in Section 5.3. The threshold used to filter concept profiles is 0. First, we present Table 3, which shows the average number of concept profiles in the original and tailored versions. As observed, the reduction in the number of concept profiles in the tailored versions is very significant, with every configuration reaching a reduction of around 90% or more.
Table 4 shows the results of the semantic annotation evaluation for the previous five configurations using both UMLS R and Wikinet T. We use the standard measures of precision, recall, F-measure and accuracy to evaluate the resulting annotations.
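For reference, the four measures can be computed from binary valid/invalid decisions as in the following sketch. It assumes that each candidate annotation carries a gold label and a predicted label; the function name `annotation_scores` is hypothetical, and the macro averaging mentioned in the caption of Table 4 is not reproduced because the averaging unit is not specified in this section.

```python
def annotation_scores(gold, predicted):
    """Precision, recall, F-measure and accuracy for binary valid/invalid decisions."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # valid, kept
    fp = sum(1 for g, p in pairs if not g and p)      # invalid, kept
    fn = sum(1 for g, p in pairs if g and not p)      # valid, discarded
    tn = sum(1 for g, p in pairs if not g and not p)  # invalid, discarded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f_measure, accuracy


# Toy labels (hypothetical): True means the annotation is valid / was kept.
gold = [True, True, False, False, True]
predicted = [True, False, False, True, True]
p, r, f, acc = annotation_scores(gold, predicted)
```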
The results show that all the proposed methods improve on the results of the context-free annotation method. In general, we observe that the extended versions of the methods do not improve results in any of the cases, which means that the information provided by the concept hierarchy is not decisive for validation in this dataset. This may be due to the mismatch of domains between the annotation dataset WebRes and both UMLS R and Wikinet. In this case, including information about the hierarchy in the concept profiles seems to introduce noise that does not help disambiguation, as opposed to the performance of the extended version in MSH-WSD (see Table 2), where the domains of the GS and the annotation dataset are the same.
The tailored versions suffer a decrease in all the measures, but results are still comparable to state-of-the-art WSD approaches. The lower performance is more noticeable in UMLS R, especially w.r.t. recall. This indicates that the concept tailoring in UMLS R may be too aggressive, so that potentially good concepts are being filtered out, whereas the concept tailoring in Wikinet T seems to work better, as it is able to keep performance while reducing the number of concept profiles. Notice that in this dataset, resource descriptions focus on software aspects and, therefore, the contexts are not related to biological terms. Still, the reduction in the number of concept profiles of the tailored versions (see Table 3) makes them an ideal choice when dealing with huge amounts of concept profiles.
From now on, when we refer to the semantic annotation process or the concept profiles, we mean the context-validated semantic annotation using the concept profiles generated by the method TrM T , which is the method that offers the best trade-off between all the measures.

Alignments between KRs
In this experiment, we measure the overlap between Wikinet T and UMLS R by obtaining a set of concept alignments. For each pair of concepts $(c, c')$ such that $c \in C_{UMLS}$, $c' \in C_{Wikinet_T}$ and $lex(c) \approx lex(c')$, we estimate their semantic overlap by comparing their profiles, $D(\theta_c, \theta_{c'})$, and filtering out those pairs below a predefined threshold. As a result, we obtain a set of 6,058 alignments. Notice that the resulting set of alignments is rather small, which indicates that both KRs are complementary. From this set, we distinguish the alignments between concepts of only one word, $S_{one}$ (91 alignments), and concepts with more than one word, $S_n$ (5,967 alignments). The set $S_{one}$ was manually assessed and has a precision of 54%, whereas for the set $S_n$ we manually assessed a hundred random samples, resulting in a precision of 87%. This confirms the hypothesis that short concepts are more difficult to disambiguate and, in this case, to align correctly.

Semantic annotation evaluation
In this experiment we evaluate the impact of context-free vs. context-validated annotations. The dataset to be annotated is composed of metadata from 10,692 web resources, of which 6,226 are related to the Life Sciences domain and 4,466 are general-domain resources registered in Wikipedia. We have downloaded the metadata of the Life Sciences web resources from BioCatalogue (more than 2,200 web resources), myExperiment [20] (more than 2,000 workflows), and SSWAP [19] (more than 2,700 web resources). With respect to the web resources registered in Wikipedia, we have considered those entries that describe web resources, independently of their domain. In order to select those entries, we have applied category filters and lexical patterns to identify expressions related to web resources, e.g., "is a web service", "is a database", etc.
Table 5 shows the number of different concepts in the annotations of the dataset, the total number of annotations, and their ambiguity in context-free versus context-validated annotations. The experiments are reproduced for two configurations of the KRs, with and without tailoring of Wikinet. We observe that the number of context-validated annotations is reduced to roughly a third of the number of context-free annotations. However, the most remarkable fact is that the ambiguity of annotations is much higher in context-free annotations, and this affects the semantic search, as will be demonstrated in the next section. In the context-validated annotations, the TrM T method reduces the ambiguity and also produces fewer annotations. Similarly, when using the tailored version of Wikinet, we observe that the ambiguity is reduced and fewer annotations are produced, which may affect recall. However, as shown previously in Table 4, the trade-off between precision and recall when using tailoring over the KR (i.e., Wikinet T) and over the method for profile generation (i.e., TrM T) is good.

Table 5: Results of the semantic annotation process using different configurations of the KRs for context-free vs. context-validated annotations. T means tailored version.

Time performance evaluation
The proposed context-validated semantic annotation process does not incur the computational overhead that many WSD methods do. Table 6 shows the time performance of each of the components for the semantic annotation of the dataset of 10,692 web resources. The first three phases are done only once, off-line. In the profile generation phase, we distinguish between the normal and the extended version because in the extended version all the direct parents and children of the concept are considered for generating the profile, thus incurring extra time. We also distinguish between UMLS R and Wikinet because the performance is significantly different. While the profile generation is faster in Wikinet, probably because of shorter concept labels and definitions, the opposite happens for the extended version. This is due to the fact that the average number of direct parents and children in Wikinet is three times that of UMLS R, making the extended version slower in Wikinet. In the profile tailoring phase we also make the distinction because extended profiles are considerably larger, thus affecting the tailoring performance. Finally, it is worth mentioning that the context-free and the context-validated annotations have a similar performance, which we measure in annotated documents per second.

Semantic search evaluation
The experiments carried out to evaluate the semantic search consist in the execution of a set of heterogeneous queries (i.e., task description examples) over the dataset of 10,692 web resources. These queries capture different ways to describe bioinformatics tasks (see Table 7), thus reflecting the variability in the users' information needs. The query pool was created by selecting more than 250 short descriptions extracted from other Life Sciences resource catalogues such as OBRC (Online Bioinformatics Resource Collection) and ExPaSy (SIB Bioinformatics Resource Portal). Thus, we have selected as queries the short descriptions of the resources registered in these catalogues. All the queries have been semantically annotated and expanded with related concepts in the KR, as described in Section 5. As a result, each query has an associated semantic query model. To evaluate the retrieval results, we have built an assessment dataset, GS query, with relevant descriptions associated with each task. This dataset was set up by selecting predefined categories and tags from the target catalogues which are relevant to each task.
In these experiments, we implemented a search engine based on language models, indexed under a traditional inverted file [32]. Thus, indexed descriptions are retrieved and ranked according to their similarity to the query, in this case calculated with the distance between models (Section 5.6.2). On top of this basic search engine, we implemented both a keyword-based and a semantic-based search method. The former defines language models directly from words, whereas the latter uses the semantic models defined in Section 3.3. The keyword-based method is used as a baseline to demonstrate that semantic annotations improve the retrieval effectiveness. Table 8 shows the precision at 5, 10 and 20, and the Mean Average Precision (MAP) for the query results using the keyword-based method evaluated against GS query.
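As a reminder of how these ranking measures are obtained, the sketch below computes precision at k and (mean) average precision from a ranked list and a set of relevant items. It follows the standard definitions; the function names and the toy ranking are hypothetical and are not taken from the evaluation code of the experiments.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant item appears,
    normalised by the total number of relevant items."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy query (hypothetical): two of the three relevant items are retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p_at_5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

Because average precision is normalised by all relevant items, including those never retrieved, it is sensitive to recall, whereas P@k only looks at the top of the ranking.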
We have evaluated the semantic-based search using different configurations in order to evaluate the impact of using tailored KRs and contexts on the retrieval results. Table 9 shows the precision at 5, 10 and 20, and the MAP measure of the query results using the different configurations. As can be seen, in general the semantic search presents higher precision scores at the top-ranked positions than the keyword-based search while using smaller models (39,253 terms against 14,678 concepts in the best-performing configuration, tailoring with TrM concept profiles). Next, we analyze in detail the different configurations evaluated in these experiments.
With respect to the consideration of the context during the semantic annotation, we have executed the queries without validation, and validating annotations with the two best configurations of concept profiles, the TrM concept profile model and its tailored version TrM T. The results show that the precision scores are better when validating the annotation contexts. In contrast, the MAP measure is better when not considering the context because the recall is higher when all senses are included. Regarding the results for the two different context models, there is not much difference between them, although the tailored version obtains slightly worse precision at the first positions.
Regarding the tailoring of KRs, the use of a tailored KR considerably reduces the size of the semantic index and also the ambiguity of the annotations (see Table 5), while the results are not affected by the reduction of annotations. Moreover, the precision at the top-ranked positions is slightly higher when using the tailored version of Wikinet.
Finally, we have also analyzed the impact of the query expansion on the retrieval results. We have executed the queries for the best configuration in Table 9 but without expanding the query. The resulting precision scores are slightly lower (P@5=0.74, P@10=0.7, P@20=0.66, MAP=0.19).
In conclusion, semantic search obtains better results than keyword-based search while using considerably smaller indexes. We have demonstrated that using tailored KRs in the semantic search reduces the size of [...]
[...] an initial search, to the user's query, and then performs a keyword-based retrieval. Other approaches do not consider the conceptual representation of documents, and only use the knowledge in KRs to expand the query. For example, [24] uses the concepts representing the user's intent to expand the query with the terms associated with those concepts in the KR; the retrieval is then based on keyword matching. Currently, few approaches consider the conceptual representation of both the user's requirements and the documents. There are approaches in which the documents are semantically represented as entity-relationship graphs and which make use of graph-based query languages to perform semantic search [26,15]. In the Life Sciences domain, SADI [46] performs semantic search via SPARQL queries over web services that have been previously semantically represented in RDF. These approaches require the documents to be in RDF format, which is not very frequent in general, even though there is current research in this direction [17].

Semantic annotation
With the proliferation of the Web of Data and initiatives such as the Linked Data project, which promotes a series of best practices to publish and link entities across the Web in a machine-understandable way, many KRs, ranging from lexicons, terminologies and thesauri to expressive ontologies, are publicly accessible and ready to be used for annotation purposes, such as DBpedia, YAGO, Freebase and schema.org. Especially in the biomedical domain, we can find several specialized lexical/ontological resources such as MeSH, SNOMED, UMLS R and BioPortal, among others. The use of knowledge-based semantic annotation can have a great impact on semantic search, as both the user query and the documents are represented in a conceptual space.
For semantic annotation, the available tools range from simple dictionary-based approaches to more sophisticated NLP approaches that use NER tools, POS tagging, dependency parsing, etc. Some examples include DBpedia Spotlight [37], The Wiki Machine, AlchemyAPI and Open Calais for annotating general-purpose entities, and MetaMap [5] and Whatizit [44] for annotating biomedical entities.
Most of the unsupervised semantic annotation methods rely on a dictionary look-up strategy. Basically, it consists of finding occurrences of concept strings in a text chunk by applying strict string matching. To allow some small variations in the matching (e.g., plural forms), concept strings can be translated into regular expressions, which are applied to the text chunks to obtain the mappings [44,13]. Other approaches adopt an information retrieval (IR) strategy [5,7]. Basically, it consists of viewing the text chunk T as a query, and the concept strings as documents to be retrieved. This strategy notably increases the recall since it disregards the order and continuity of the matched words. To allow more flexibility in the matching, the query generated by T can be expanded with the variants of each word $w_i$ (e.g., plurals, hyphenation, abbreviations, etc.) to perform the retrieval.
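The dictionary look-up strategy with regular expressions can be illustrated as follows. This is only a sketch under simple assumptions: each concept label is turned into a pattern that tolerates an optional plural "s" on every word and whitespace or hyphens between words. The names `concept_pattern` and `annotate`, the lexicon and the concept identifiers are hypothetical and do not correspond to any of the cited tools.

```python
import re


def concept_pattern(label):
    """Compile a label into a regex allowing plural forms and hyphenation."""
    words = [re.escape(word) + r"s?" for word in label.lower().split()]
    return re.compile(r"\b" + r"[\s-]+".join(words) + r"\b", re.IGNORECASE)


def annotate(text, lexicon):
    """Return sorted (start, end, concept_id) tuples for every label match.

    `lexicon` maps a concept identifier to the list of its labels.
    """
    matches = []
    for concept_id, labels in lexicon.items():
        for label in labels:
            for match in concept_pattern(label).finditer(text):
                matches.append((match.start(), match.end(), concept_id))
    return sorted(matches)


# Toy lexicon (hypothetical identifiers and labels).
lexicon = {"C_DB": ["database"], "C_WS": ["web service"]}
annotations = annotate("Web services and databases for genes", lexicon)
```

Such a matcher produces context-free annotations: every concept whose label matches is returned, so ambiguous strings yield several candidates that a later validation step has to filter.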
The majority of these approaches still perform poorly with ambiguous annotations. Some of them make use of contextual information (e.g., words around the annotation) to improve disambiguation. Still, results are not satisfactory, mainly for two reasons: 1) the KR does not have the appropriate sense and 2) the method for comparing the contexts is too trivial. This issue has been thoroughly studied by WSD methods, which are explained in the following section.

Word sense disambiguation
WSD is one of the key tasks in natural language processing (NLP) applications. Although WSD is focused on choosing the right sense for each word in a sentence, it can be extrapolated to some extent to the problem of disambiguating semantic annotations. More specifically, knowledge-based WSD methods [39] can be adapted to choose the concepts that best fit the text where they are identified. Most knowledge-based WSD methods are almost unsupervised, as they mainly rely on the information provided by the lexical knowledge resource (mainly WordNet and its variants). Some additional heuristics such as the most frequent sense (MFS) are often included to help in the final decisions, hence the "almost". Early approaches to knowledge-based WSD consisted of variations of the Lesk algorithm [29], which basically compares the glosses of the senses provided by the KR with the words surrounding the word to be disambiguated. In this way, the disambiguation problem consists of measuring the overlap between the term-vectors associated with each concept in the KR (also called topic signatures [2]) and the term-vector of the target word context, and then selecting the concept that gives the highest score. These approaches assume that the richer the topic signatures are, the better the chance of choosing the right one. Thus, in [2] term-vectors are built by querying Google with monosemous synonyms or hyponyms of each concept, and then weighting them with a tf-idf scheme. In [22] a similar approach is proposed to build term-vectors for UMLS R concepts by querying PubMed with MeSH terms. In [4] term-vectors are built with the words of the glosses of the hyperonyms and hyponyms of each word sense, also weighted with a tf-idf scheme. More recent approaches attempt to extract the knowledge encapsulated in the KR to get more evidence for decision making. For example, in [4] implicit relations are found by comparing the topic signatures of all the senses involved in a sentence. In [12] topic signatures are used to discover relations between word senses. Such discovered relations have proven useful for WSD when applying random walk techniques over the resulting word sense graphs [40,16]. In the context of semantic annotation with arbitrary KRs, knowledge-based methods are difficult to apply, mainly because they have been developed to take advantage of the particular characteristics of the lexical KR they are aimed at, such as the rich WordNet relations or the link structure of Wikipedia [35]. Moreover, they are not aimed at validating the generated annotations but at choosing one of the existing senses, which can lead to wrong annotations if the right sense is not covered by the KR. Our approach for validation is inspired by the Lesk principle combined with topic signatures. However, we rely on a statistical framework to generate the concept language models and to compare them with the corpus contexts, which are also represented as language models.
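The gloss-overlap principle behind Lesk-style methods can be illustrated with a simplified variant that counts the words shared by each sense gloss and the context, and selects the sense with the largest overlap. This sketch is not the validation method proposed in this paper (which relies on language models) nor the exact algorithm of [29]; the function name, the stopword list and the glosses are hypothetical.

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "and", "or", "to", "by", "is"})


def content_words(text):
    """Lower-cased, whitespace-tokenised words without stopwords."""
    return {tok for tok in text.lower().split() if tok not in STOPWORDS}


def simplified_lesk(context, glosses):
    """Return the sense whose gloss shares the most content words with the context.

    `glosses` maps a sense identifier to its gloss. Ties are resolved in favour
    of the first sense in the mapping, so a sense is returned even when no
    gloss overlaps with the context.
    """
    context_words = content_words(context)
    return max(glosses, key=lambda s: len(context_words & content_words(glosses[s])))


# Toy sense inventory (hypothetical glosses) for the ambiguous word "cold".
glosses = {
    "cold_temperature": "low temperature of the air or weather",
    "common_cold": "viral infection of the nose and throat",
}
sense = simplified_lesk(
    "the patient had a cold with sore throat and viral infection", glosses
)
```

Note that this procedure always returns one of the listed senses, which illustrates the limitation discussed above: when the right sense is missing from the KR, a wrong annotation is produced instead of being rejected.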

Conclusions
In this paper we have proposed a novel method for semantic annotation and search based on statistical language models. Our main hypothesis is that reconciling the vocabulary in the KRs and the target corpus can lead to more precise and useful annotations. We achieved such reconciliation by means of statistical translation models, which make it possible to define rich language models for both the KR concepts and the corpus contexts. From the experiments we can draw several conclusions:
• Coarse tailoring is useful for very large and heterogeneous KRs like Wikinet, since we can easily reject those parts of the KR that have nothing to do with the target corpus. However, more specialized KRs like UMLS R are much more homogeneous and benefit little from the coarse tailoring.
• In some scenarios it is necessary to combine more than one KR in order to get a proper coverage of the target corpus. Otherwise, semantic search will be less effective than keyword-based search. In our experiments, web resource catalogues combine computer science and biomedical terminologies, which cannot be properly covered with a single KR. We have shown that UMLS R and Wikinet complement each other quite well for this domain.
• Language models generated with translation models have proved very useful in tailoring and validating semantic annotations.
• Results show a dramatic reduction in the number of obtained annotations (and therefore in the size of the semantic search structures), while precision increases with little loss in recall.
As future work, there are several interesting research lines derived from this work. First, we plan to study new approaches for concept profile construction that combine topic-based models like Latent Dirichlet Allocation (LDA) [9] with the translation models proposed in this paper. LDA has been shown to be very useful in WSD tasks [30] and provides a statistical framework that captures word co-occurrence patterns at collection level. We also plan to apply topic-based models for performing semantic searches. This idea has been previously explored in [43] by using context-free annotations with good results. The main limitation of this approach is that topics must be defined a priori and they are dependent on the application domain. We will investigate how to automatically generate topics of interest from the profiles of the KRs and the corpus at hand. Finally, we will study how to better exploit the KR taxonomic relationships in order to enhance the KR translation models and the generated concept profiles. Moreover, we will consider the construction of the graph of concepts induced by their context relationships, similarly to some knowledge-based graph WSD approaches [40].

Definition 3.1. A knowledge resource is a formalization of the semantics of a domain by means of a set of concepts $C = \{c_1, \ldots, c_n\}$. A concept $c \in C$ represents the semantic definition of a meaningful entity in a specific domain.

Figure 1: Ambiguity w.r.t. the length of the matched text.

Figure 3: Summary of the proposed method for semantic annotation and search. The phases are: 1) tailoring of the KRs, 2) concept profile construction, 3) tailoring of concept profiles, 4) semantic overlap of the KRs, 5) context-based semantic annotation and 6) semantic search.

Table 1: Features of the KRs. *Only English lexicon.

Table 2: WSD evaluation results in terms of accuracy on the MSH-WSD dataset. MRD stands for Machine Readable Dictionary, 2-MRD stands for 2nd Order Co-occurrence MRD, PPR stands for Personalised Page Rank, AEC stands for Automatic Extracted Corpus, UB stands for Uniform Bias, TrM stands for Translation Model and TrMExt stands for Translation Model Extended.

Table 3: Average size of the concept profiles used for the validation of semantic annotations in each method.

Table 4: Macro-average precision (P), recall (R), F-measure (F) and accuracy (Acc) of semantic annotations with different configurations of context validation.

Table 6: Performance in concepts per second (c/sec) and documents per second (d/sec) of each of the phases of the semantic annotation for the web resources dataset (1 CPU).

Table 7: Bioinformatics base tasks considered for evaluation.

Table 8: Precision at n (P@n) for the top-5, top-10, and top-20 results, and MAP measure for the keyword-based search.