Flexible techniques for heterogeneous XML data retrieval

Sanz Blasco, Ismael

dc.contributor	Universitat Jaume I. Departament d'Enginyeria i Ciència dels Computadors
dc.contributor.author	Sanz Blasco, Ismael
dc.date.accessioned	2011-04-12T20:00:56Z
dc.date.accessioned	2024-04-22T12:11:55Z
dc.date.available	2007-12-05
dc.date.available	2024-04-22T12:11:55Z
dc.date.issued	2007-10-31
dc.date.submitted	2007-12-05
dc.identifier.isbn	9788469105832
dc.identifier.uri	http://www.tdx.cat/TDX-1205107-095639
dc.identifier.uri	http://hdl.handle.net/10803/10373
dc.description.abstract	The progressive adoption of XML by new communities of users has motivated the appearance of applications that require the management of large and complex collections, which present a large amount of heterogeneity. Some relevant examples are present in the fields of bioinformatics, cultural heritage, ontology management and geographic information systems, where heterogeneity is not only reflected in the textual content of documents, but also in the presence of rich structures which cannot be properly accounted for using fixed schema definitions. Current approaches for dealing with heterogeneous XML data are, however, mainly focused at the content level, whereas at the structural level only a limited amount of heterogeneity is tolerated; for instance, weakening the parent-child relationship between nodes into the ancestor-descendant relationship. <br/>The main objective of this thesis is devising new approaches for querying heterogeneous XML collections. This general objective has several implications: First, a collection can present different levels of heterogeneity in different granularity levels; this fact has a significant impact in the selection of specific approaches for handling, indexing and querying the collection. Therefore, several metrics are proposed for evaluating the level of heterogeneity at different levels, based on information-theoretical considerations. These metrics can be employed for characterizing collections, and clustering together those collections which present similar characteristics. <br/>Second, the high structural variability implies that query techniques based on exact tree matching, such as the standard XPath and XQuery languages, are not suitable for heterogeneous XML collections. As a consequence, approximate querying techniques based on similarity measures must be adopted. Within the thesis, we present a formal framework for the creation of similarity measures which is based on a study of the literature that shows that most approaches for approximate XML retrieval (i) are highly tailored to very specific problems and (ii) use similarity measures for ranking that can be expressed as ad-hoc combinations of a set of --basic' measures. Some examples of these widely used measures are tf-idf for textual information and several variations of edit distances. Our approach wraps these basic measures into generic, parametrizable components that can be combined into complex measures by exploiting the composite pattern, commonly used in Software Engineering. This approach also allows us to integrate seamlessly highly specific measures, such as protein-oriented matching functions.<br/>Finally, these measures are employed for the approximate retrieval of data in a context of highly structural heterogeneity, using a new approach based on the concepts of pattern and fragment. In our context, a pattern is a concise representations of the information needs of a user, and a fragment is a match of a pattern found in the database. A pattern consists of a set of tree-structured elements --- basically an XML subtree that is intended to be found in the database, but with a flexible semantics that is strongly dependent on a particular similarity measure. For example, depending on a particular measure, the particular hierarchy of elements, or the ordering of siblings, may or may not be deemed to be relevant when searching for occurrences in the database. <br/>Fragment matching, as a query primitive, can deal with a much higher degree of flexibility than existing approaches. In this thesis we provide exhaustive and top-k query algorithms. In the latter case, we adopt an approach that does not require the similarity measure to be monotonic, as all previous XML top-k algorithms (usually based on Fagin's algorithm) do. We also presents two extensions which are important in practical settings: a specification for the integration of the aforementioned techniques into XQuery, and a clustering algorithm that is useful to manage complex result sets.<br/>All of the algorithms have been implemented as part of ArHeX, a toolkit for the development of multi-similarity XML applications, which supports fragment-based queries through an extension of the XQuery language, and includes graphical tools for designing similarity measures and querying collections. We have used ArHeX to demonstrate the effectiveness of our approach using both synthetic and real data sets, in the context of a biomedical research project.	cat
dc.format.mimetype	application/pdf
dc.language.iso	eng
dc.publisher	Universitat Jaume I
dc.source	TDX (Tesis Doctorals en Xarxa)
dc.subject	similarity
dc.subject	approximate query processing
dc.subject	heterogeneous data management
dc.subject	XML
dc.subject.other	Bases de dades / bases de datos / databases
dc.title	Flexible techniques for heterogeneous XML data retrieval
dc.type	info:eu-repo/semantics/doctoralThesis
dc.type	info:eu-repo/semantics/publishedVersion
dc.subject.udc	004	cat
dc.contributor.director	Berlanga Llavori, Rafael
dc.contributor.director	Mesiti, Marco
dc.rights.license	ADVERTIMENT. L'accés als continguts d'aquesta tesi doctoral i la seva utilització ha de respectar els drets de la persona autora. Pot ser utilitzada per a consulta o estudi personal, així com en activitats o materials d'investigació i docència en els termes establerts a l'art. 32 del Text Refós de la Llei de Propietat Intel·lectual (RDL 1/1996). Per altres utilitzacions es requereix l'autorització prèvia i expressa de la persona autora. En qualsevol cas, en la utilització dels seus continguts caldrà indicar de forma clara el nom i cognoms de la persona autora i el títol de la tesi doctoral. No s'autoritza la seva reproducció o altres formes d'explotació efectuades amb finalitats de lucre ni la seva comunicació pública des d'un lloc aliè al servei TDX. Tampoc s'autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant als continguts de la tesi com als seus resums i índexs.
dc.rights.accessLevel	info:eu-repo/semantics/openAccess	cat
dc.local.notes	berlanga@icc.uji.es, mesiti@dico.unimi.it

Ficheros en el ítem

Ficheros	Tamaño	Formato	Ver
No hay ficheros asociados a este ítem.

Este ítem aparece en la(s) siguiente(s) colección(ones)

Programa de Doctorat en Informàtica [94]

Mostrar el registro sencillo del ítem