Methods and Models for the Analysis of Biological Significance Based on High-Throughput Data

Author

Mosquera Mayo, José Luís

Director

Sànchez, Àlex (Sànchez Pla)

Tutor

Oller i Sala, Josep Maria

Date of defense

2014-12-12

Legal Deposit

B 8319-2015

Pages

334 p.



Department/Institute

Universitat de Barcelona. Departament d'Estadística

Abstract

The advent of high-throughput technologies has generated a huge quantity of omics data. The results of these experiments are usually long lists of genes that can be used as biomarkers. A major challenge for researchers is to attribute a biological interpretation or significance to these lists of potential biomarkers, either by using biological information stored in bioinformatics resources such as the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG), or by combining them with other types of omics data. This dissertation had two main objectives: first, to study the mathematical properties of two types of semantic similarity measures for exploring GO categories, and second, to classify GO tools for enrichment analysis and study their evolution. The first measure considered was a semantic similarity measure proposed by Lord et al. It is a node-based approach grounded in graph theory. The second was in fact a group of pseudo-distances proposed by Joslyn et al. These are edge-based approaches grounded in the algebraic view of partially ordered set (POSET) theory. To reach these objectives, the main methods of graph theory and POSET theory were first reviewed and described. This review showed that there are two ways of mapping objects (e.g. genes) to the terms of an ontology (e.g. GO). The first formulation, called the Object-Ontology Complex (OOC), was proposed by Carey in order to perform statistical computations. The second formulation, called the POSET Ontology (POSO), was introduced by Joslyn et al. To classify the GO tools for enrichment analysis, the first 26 GO tools available at the website of The GO Consortium were surveyed. The survey yielded a list of 205 features, which was used to build a Standard Functionalities Set. Based on these functionalities, the 26 GO tools were classified according to their capabilities.
The study of the evolution of the GO tools was based on monitoring these 26 tools. The statistical analysis consisted of descriptive statistics, an inferential analysis and a multivariate analysis. With regard to the first objective, Lord's measure was shown to be the same as Resnik's measure, which had been published earlier. A certain level of analogy was observed between the formalizations of the OOC and the POSO for mapping genes to the terms of an ontology. A property and a corollary were proposed for calculating node-based semantic similarity measures from a matrix point of view. It was proved that both Lord's measure and Joslyn's measure can be redefined in terms of a metric distance. An R package called sims was developed for computing semantic similarity measures between terms of an arbitrary ontology and for comparing semantic similarity profiles based on the GO terms associated with two lists of genes. Based on the classification of the GO tools, a web-based tool called SerbGO was developed for selecting and comparing GO tools. The statistical analysis of the evolution of the GO tools suggested that their promoters have introduced improvements over time, but no clear models of GO tools were detected. In line with the results of the statistical analysis, an ontology called DeGOT was built to provide a structured vocabulary for developers who are improving existing GO tools for enrichment analysis or designing a new one. DeGOT can be used to support queries and the comparison of results in SerbGO.
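
As a rough illustration of the node-based measure mentioned above, the sketch below computes Resnik's similarity in its commonly stated form: the largest information content, -log p(t), among the common ancestors of two terms. The abstract does not give the formula, so this definition is an assumption, and the toy ontology, the gene annotations and all function names are hypothetical. It is not the sims package and does not reproduce the matrix-based property proposed in the thesis.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms (a DAG rooted at "root").
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a"],
}

# Hypothetical direct annotations: gene -> set of terms it is annotated with.
ANNOTATIONS = {
    "g1": {"c"},
    "g2": {"d"},
    "g3": {"b"},
    "g4": {"a"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_probabilities():
    """p(t): fraction of genes annotated to t directly or through a descendant."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set()
        for t in terms:
            covered |= ancestors(t)
        for t in covered:
            counts[t] += 1
    n = len(ANNOTATIONS)
    return {t: counts[t] / n for t in PARENTS}


def resnik(t1, t2):
    """Largest information content, -log p(t), over the common ancestors."""
    p = term_probabilities()
    common = ancestors(t1) & ancestors(t2)
    # Terms with no annotated genes are skipped because their IC is undefined.
    return max(-math.log(p[t]) for t in common if p[t] > 0)
```

In this toy example, terms "c" and "d" share the ancestors "a" and "root", so their similarity is the information content of "a". Terms "d" and "b" share only the root, whose information content is zero.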


The advent of high-throughput technologies has generated a vast quantity of omics data. The results of these experiments are long lists of genes that can be used as biomarkers. One of the great challenges for experimental researchers is to attribute a biological interpretation or significance to these potential biomarkers, either by extracting the biological information stored in resources such as the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG), or by combining them with other omics data. The objectives of the thesis were: first, to study the mathematical properties of two types of semantic similarity measures for exploring GO categories, and second, to classify GO tools for enrichment analysis and study their evolution. The first semantic similarity measure considered, proposed by Lord et al., was based on graph theory, and the second was a group of pseudo-distances, proposed by Joslyn et al., based on the theory of Partially Ordered Sets (POSETs). The study of the GO tools was based on the first 26 tools available on the website of The GO Consortium. The measure of Lord et al. was found to be the same as Resnik's measure, published earlier. An analogy was observed in the way genes are mapped to the GO via graphs and/or via POSETs. A property and a corollary were proposed that allow the first semantic similarity measure to be calculated in matrix form. Both measures were shown to be associated with a metric distance. An R package called sims was developed, which computes semantic similarities for an arbitrary ontology and compares semantic similarity profiles of the GO. A Standard Functionalities Set was proposed for classifying GO tools, and a web application called SerbGO was developed for selecting and comparing GO tools.
The statistical study revealed that the promoters of the GO tools have introduced improvements over time, but no well-defined models were detected. An ontology called DeGOT was developed, which provides developers with a vocabulary for introducing improvements to the tools or designing a new one.

Keywords

Genomics; Biochemical markers; Semantics; Ontologies (Information retrieval)

Subjects

311 - Statistics

Knowledge Area

Experimental Sciences and Mathematics

Documents

JLMM_PhD_THESIS.pdf

10.13Mb

 

Rights

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons licence: http://creativecommons.org/licenses/by-nc-sa/3.0/es/
