
dc.contributor.author: García Castro, Leyla Jael
dc.contributor.author: Berlanga Llavori, Rafael
dc.contributor.author: Garcia, Alexander
dc.date.accessioned: 2016-06-24T09:27:44Z
dc.date.available: 2016-06-24T09:27:44Z
dc.date.issued: 2015-10
dc.identifier.citation: GARCÍA CASTRO, Leyla Jael; BERLANGA LLAVORI, Rafael; GARCIA, Alexander. In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central. Journal of Biomedical Informatics (2015), v. 57, pp. 204-218 [ca_CA]
dc.identifier.uri: http://hdl.handle.net/10234/161076
dc.description.abstract:
Motivation: Although publishers provide full-text articles in electronic formats, it remains a challenge to find related work beyond the title and abstract context. Identifying related articles based on their abstracts is indeed a good starting point; this process is straightforward and does not consume as many resources as full-text-based similarity requires. However, further analyses may require an in-depth understanding of the full content. Two articles with highly related abstracts can differ substantially in their full content. This manuscript addresses two main issues: how similarity differs when considering title-and-abstract versus full text, and which semantic similarity metric gives better results on full-text articles.
Methods: We benchmarked three similarity metrics (BM25, PMRA, and cosine) to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. First, we used entity recognition software to semantically annotate titles and abstracts, as well as full text, with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts, term frequency, and inverse document frequency; we then applied the similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses to understand how similarity varies when considering full-text articles.
Results: We found that the PubMed Related Articles (PMRA) similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full text become useful when a research question requires going beyond title and abstract, particularly regarding connectivity across articles.
Availability: Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323. [ca_CA]
dc.description.sponsorShip: The authors acknowledge the support of the members of the Temporal Knowledge Bases Group at Universitat Jaume I. Funding: LJGC and AGC are both self-funded; RB is funded by the “Ministerio de Economía y Competitividad” under contract number TIN2011-24147. [ca_CA]
dc.format.extent: 22 p. [ca_CA]
dc.format.mimetype: application/pdf [ca_CA]
dc.language.iso: eng [ca_CA]
dc.publisher: Elsevier [ca_CA]
dc.relation.isPartOf: Journal of Biomedical Informatics (2015), v. 57 [ca_CA]
dc.rights.uri: http://rightsstatements.org/vocab/CNE/1.0/ [*]
dc.subject: Semantic similarity [ca_CA]
dc.subject: Scientific publications [ca_CA]
dc.subject: Similarity metrics [ca_CA]
dc.subject: Semantic annotations [ca_CA]
dc.subject: Related articles [ca_CA]
dc.title: In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central [ca_CA]
dc.type: info:eu-repo/semantics/article [ca_CA]
dc.identifier.doi: http://dx.doi.org/10.1016/j.jbi.2015.07.015
dc.rights.accessRights: info:eu-repo/semantics/openAccess [ca_CA]
dc.relation.publisherVersion: http://www.sciencedirect.com/science/article/pii/S1532046415001550 [ca_CA]
dc.edition: Preprint [ca_CA]
dc.type.version: info:eu-repo/semantics/publishedVersion
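
Illustrative note on the abstract: the document profiles it describes (identified concepts with term frequency and inverse document frequency) and the cosine metric can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the concept identifiers are made-up placeholders rather than real UMLS concepts, the function names are hypothetical, and the weighting tf * log(N / df) is one common TF-IDF variant that the record does not confirm. BM25 and PMRA are not shown.

```python
import math
from collections import Counter


def build_profiles(annotated_docs):
    """Map {doc_id: [concept_id, ...]} to {doc_id: {concept_id: tf * idf}}.

    Assumption: idf = log(N / df), a common variant chosen for illustration.
    """
    n_docs = len(annotated_docs)
    # Document frequency: number of documents in which each concept occurs.
    doc_freq = Counter()
    for concepts in annotated_docs.values():
        doc_freq.update(set(concepts))
    profiles = {}
    for doc_id, concepts in annotated_docs.items():
        term_freq = Counter(concepts)
        profiles[doc_id] = {
            concept: count * math.log(n_docs / doc_freq[concept])
            for concept, count in term_freq.items()
        }
    return profiles


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two sparse concept-weight vectors."""
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder concept annotations (hypothetical, not real UMLS identifiers).
docs = {
    "doc_a": ["C001", "C001", "C002"],
    "doc_b": ["C001", "C002", "C003"],
    "doc_c": ["C004", "C004"],
}
profiles = build_profiles(docs)
sim_ab = cosine_similarity(profiles["doc_a"], profiles["doc_b"])
sim_ac = cosine_similarity(profiles["doc_a"], profiles["doc_c"])
```

In this toy data, doc_a and doc_c share no concepts, so their similarity is zero, while doc_a and doc_b overlap partially. As a plain arithmetic check on the reported figures, f1_score(1.0, 0.1) is about 0.18, which is consistent with a precision close to 1, a recall around 0.1, and an F1 around 0.2.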


This item appears in the following collection(s)

  • LSI_Articles [361]
    Journal articles written by faculty of the Departament de Llenguatges i Sistemes Informàtics
