Universitat de Barcelona. Facultat de Biologia
A number of metagenomic studies have shown that the gain and loss of specific genes has allowed microbiomes to adapt to the environmental conditions typical of their habitats. However, little is yet known about how the regulation of gene expression contributes to this adaptation. Here we have characterized and analyzed the metaregulomes of three different environments (a sample taken from an acid mine at Iron Mountain, California; samples of whale remains in sediments collected near the Antarctic Peninsula and off the west coast of the United States; and farm soil collected in Minnesota, USA), and we have also assessed their impact on adaptation to variable physicochemical conditions. To this end, we developed a computational protocol to extract regulatory regions and their corresponding transcription factor binding sites. Taking the density of binding sites per promoter as a measure of the potential and complexity of gene regulation, we found that it remains constant across the three niches analyzed, despite their different physicochemical conditions and species composition. However, each environment distributes its regulatory potential differently across its functional space. Among the functions with the highest regulatory potential in each niche, we observed a marked enrichment in processes related to sensing and buffering the dynamic factors of each environment, such as the availability of cofactors in the deep sea, of oligosaccharides in agricultural soil, and the regulation of pH in the former iron mine. Alongside this study, we also evaluated the regulatory capacity of E. coli strains with different biolocalization. Taken together, the results of these studies highlight the impact of gene regulatory potential on the adaptation of bacteria to different habitats.
Adaptation occurs through the distribution of regulatory potential: among the specific functions with complex transcription factor networks, we observed a high prevalence of genes that respond to environmental factors critical for the growth of any microbial community.
The inability to culture over 90% of microbial species, together with the impossibility of reproducing interspecies interactions under controlled conditions, has limited our knowledge of microbial communities in their natural habitats. In this context, metagenomic studies have shed light on (1) the composition of natural microbial communities, (2) non-cultivable species (Allen & Banfield, 2005) and (3) functional fingerprints related to specific habitats (Tringe et al., 2005; Gianoulis et al., 2009). More recently, single-cell genomics has made it possible, for the first time, to sequence the complete genomes of bacterial cells collected directly from their natural habitats. Both approaches generate massive amounts of sequence information on communities and organisms living under different physicochemical conditions. This makes it possible to search for the molecular and genetic basis of adaptation by comparing the genomes of different species sharing the same environment, and of similar species living in different conditions. A number of metagenomic studies have shown that microbiomes adapt to specific environmental conditions by gaining and losing specific genes, and with them particular molecular functions. However, little is known about the extent to which the regulation of the expression of these genes contributes to adaptation. To address this gap, my PhD thesis studies whether and how the environment shapes the regulatory regions of organisms to allow adaptation to variable external factors. This is done through the identification, analysis and comparison of regulatory regions (promoters) of Bacteria and Archaea species sharing the same environment, and of the same (or similar) species living under different physicochemical conditions.
Our general approach is based on the analysis of proximal regulatory regions from free-living (metagenomic) and laboratory prokaryotic organisms with respect to their regulatory potential, i.e. their density of transcription factor binding sites (TFBSs). More specifically, we want to answer the following questions: (1) Do prokaryotic communities from different environments distribute and organize their regulatory potential differently? (2) Can these differences be explained by, and correlated with, variable physicochemical factors of the environment they live in? (3) Does the same bacterium (or clade) distribute its regulatory potential differently in environments with distinct external conditions? Below we explain the work carried out and the results obtained over the last four years of my PhD thesis. The work is divided into two main blocks. In the first, we investigate how prokaryotes in different environments distribute their regulatory potential and how this correlates with the variable external factors of each niche. This part is finished and the manuscript has already been submitted for publication. The second block addresses the same main question, but by comparing the distribution of regulatory potential within the same species inhabiting different environments.
I. REGULATORY POTENTIAL VARIABILITY AT THE COMMUNITY LEVEL.
1. Selection of regulatory regions through metagenomics sequencing data.
The first step of this project consisted of identifying putative promoters from metagenomic sequencing reads. The pipeline we developed can be applied to Sanger sequencing reads as well as to partial genomes assembled from shorter reads obtained by Illumina or 454 sequencing. All the promoter data sets we have identified with this protocol in three metagenomic samples (Acid Mine (AM), Whale Falls (WhF) and Waseca Farm Soil (WFS)) will be made publicly available in due time.
These data sets have been carefully curated at different levels, among other things to ensure their reliability by comparison with known sets of TFBSs.
2. TFBS occurrences per promoter as a way to estimate regulatory potential.
When the expression of a gene is regulated by a large number of transcription factors (TFs), the response of that gene becomes more versatile, which helps it sense the physicochemical parameters specific to the habitat. In addition, we suspect that species able to adjust their transcription requirements in the face of dynamic environmental factors have a greater chance of being selected to grow under these conditions. In other words, we expect the habitat to shape regulatory regions by positively selecting TFBSs upstream of genes in charge of sensing key environmental factors strongly related to microorganism survival rates. Based on this hypothesis, we set out to identify putative TFBSs in prokaryotic promoters as a way to estimate the regulatory potential of a gene. First, we applied a de novo method (Li et al., 2002) that is suitable for newly sequenced genomes without any previous information about their TFBSs, and that is also useful for comparing highly diverse communities because it avoids biases towards well-known or abundant species. This method searches the set of promoters we previously selected for overrepresented palindromic sequences, which are the preferred DNA structures for TF binding in prokaryotes (Rodionov, 2007). Furthermore, in order to assess the presence of false positives, we performed several quantitative and qualitative comparisons with independent data and methodologies. From a quantitative point of view, (1) we first observed that the global average of 10 TFBSs per promoter (minimum 0, maximum 25) that we identified across all three environments agrees with previous estimates obtained with different bacterial species and methodologies (Liu et al., 2008; Sun et al., 2007).
(2) We also evaluated the performance of our methodology by comparing our results with those of an independent method, MotifClick, which predicts cis-regulatory regions using a graph-based polynomial-time algorithm (Zhang et al., 2011). After running both predictors over intergenic E. coli regions, we observed that the TFBS densities obtained with the two strategies were significantly correlated (rho = 0.52, p-value < 2.2 x 10^-6). (3) Furthermore, we assessed the biological significance of our predictions with a randomization test, in which the same prediction pipeline was applied to our collection of promoters after completely shuffling their nucleotide sequences, i.e. removing all biological information. From a qualitative point of view, (1) we screened for matches between our predicted TFBSs and those reported in the RegPrecise database (Novichkov et al., 2010), which consists of manually curated site reconstructions in various bacterial genomes. Our approach recovered at least one of the described binding sequences for every TF in RegPrecise. This overlap involves 28% of our predictions. (2) In addition, we searched for a particular type of false prediction, namely regulatory palindromic repeats with no binding potential known as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs). The results obtained on our promoters with the CRISPRFinder web tool (Grissa et al., 2007) showed a negligible number of these regions among our TFBS predictions: they affected less than 1% of the promoters, which were removed from the analysis. In summary, all these evaluation tests suggest that our TFBS prediction method is reliable both quantitatively and qualitatively, as they point to a small fraction of false positives.
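The palindrome-based density estimate and the shuffling control described in this section can be illustrated with a minimal Python sketch. It only counts reverse-complement palindromic k-mers per promoter and compares the mean count with that of shuffled copies of the same sequences. It is a simplified stand-in, not the statistical overrepresentation method of Li et al. (2002) used in the thesis. The function names, the word length k = 6, the number of shuffles and the example sequences are assumptions made for illustration.

```python
import random

# Complement table for upper-case DNA; other symbols (e.g. N) are rejected below.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindrome(kmer):
    """True if the k-mer is made of A/C/G/T and equals its reverse complement."""
    if not kmer or not set(kmer) <= set("ACGT"):
        return False
    return kmer == kmer.translate(COMPLEMENT)[::-1]


def palindrome_density(promoter, k=6):
    """Number of palindromic k-mers in one promoter (overlapping windows)."""
    return sum(is_palindrome(promoter[i:i + k])
               for i in range(len(promoter) - k + 1))


def mean_density(promoters, k=6):
    """Average palindromic k-mer count per promoter."""
    return sum(palindrome_density(p, k) for p in promoters) / len(promoters)


def shuffled(seq, rng):
    """Copy of seq with its nucleotides in random order (composition kept)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def shuffle_control(promoters, k=6, n_shuffles=100, seed=0):
    """Compare the observed mean density with shuffled promoter sets.

    Returns the observed mean, the list of shuffled means, and an empirical
    p-value: the share of shuffles at least as palindrome-rich as the data.
    """
    rng = random.Random(seed)
    observed = mean_density(promoters, k)
    null = [mean_density([shuffled(p, rng) for p in promoters], k)
            for _ in range(n_shuffles)]
    p_value = (1 + sum(d >= observed for d in null)) / (1 + n_shuffles)
    return observed, null, p_value


if __name__ == "__main__":
    # Hypothetical toy promoters, not data from the thesis.
    toy = ["GAATTCGGATCC", "TTGACAGCTAGC"]
    obs, null, p = shuffle_control(toy, k=6, n_shuffles=200, seed=1)
    print(obs, sum(null) / len(null), p)
```

Shuffling each promoter separately keeps its nucleotide composition, so a difference between the observed and shuffled densities would reflect sequence order rather than base content.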
Most importantly, any remaining false positives are not expected to affect our final conclusions, because these derive from comparisons within and between environments and do not rely on absolute TFBS counts.
3. Analysis of functional and taxonomical organization of regulatory potential.
The overall distributions of regulatory potential per habitat (determined as explained in section 2) are similar in AM, WFS and WhF: the average (± standard deviation) is about 10.4 (±3.49), 9.49 (±3.47) and 10.08 (±3.31) TFBSs per promoter for the Acid Mine Drainage, Waseca Farm Soil and Whale Falls samples, respectively. We therefore examined how regulatory potential varies with taxon and function within and among environments.
a) Regulatory potential distribution per taxon. We observed similar profiles for the majority of the species assigned by MEGAN (Huson et al., 2007) in AM, WFS and WhF (median of ~9-12 TFBSs per promoter); only a few exceptions fell outside this range.
b) Regulatory potential distribution per function. The functional enrichment analysis was done by first ranking all promoters by their number of predicted TFBSs. We then retained significant cases based on two criteria: (1) functions with p << 0.05 within an environment and (2) functions with orthologs in the other environments. The selected groups were compared again, this time among environments. In contrast to the taxon analysis, here we observed functions with significantly higher regulatory potential in some habitats than in others. We found, for instance, high regulatory potential in processes involved in adaptation to variable environmental factors, such as cofactor availability in the deep sea and fluctuations in the concentration of di- and oligosaccharides in soil. Another interesting observation is that the regulatory potential of orthologous genes varies with the environment; some examples worth highlighting are listed below.
- In the Whale Falls samples, besides cofactor metabolism, we found high regulation in functions involved in the oxidative stress response, i.e. the hydrogen peroxide-inducible gene activator, the NnrS protein (in response to NO) and the non-specific DNA-binding protein (Dps). This response is also needed to protect genomic DNA during prolonged non-growing phases, which are typical of oceanic environments (Storz & Imlay, 1999).
- In Waseca Farm Soil, functions related to carbohydrate metabolism appear as highly regulated, more precisely di- and oligosaccharide metabolism. This could be consistent with fluctuations in soil organic matter, such as plant debris.
- In Acid Mine, we identified high regulatory potential in the promoters of genes related to the Ton and TolB transport systems. These genes help avoid toxicity by maintaining metal homeostasis inside the cell (Osorio et al., 2008), in particular that of iron. The high regulatory potential of the TonB-dependent receptor and the iron chelator utilization proteins (particularly in the Leptospirillum group), compared with their orthologs in Waseca Farm Soil, for example, might provide homeostasis and, therefore, plasticity to acid mine bacteria living under variable ferric concentrations.
In brief, through these analyses we found specific functional enrichments among the highly regulated functions of each metagenome, which point to possible points of interaction between gene regulation and dynamic parameters of the niche. Moreover, these results highlight the impact of gene regulation on the adaptation of microbes to their habitat. With the outcomes described above, we have thus found important clues towards answering two of the three main questions posed at the beginning of this summary, both related to how regulatory potential is distributed at the community level.
II. REGULATORY POTENTIAL VARIABILITY IN THE SAME (OR HIGHLY RELATED) SPECIES LIVING IN DIFFERENT HABITATS.
The second block of my PhD addresses the same general question about the relationship between gene regulation and adaptation, but now by studying the variability found within the same species living in different habitats. Here we present promising preliminary results that are currently being confirmed and expanded. We have analysed the regulatory potential of nine Escherichia coli strains (downloaded from the IMG database, http://img.jgi.doe.gov/). E. coli is an excellent model for testing our hypothesis at the species level, first because the regulatory networks of several strains are well characterized, and also because it is a very versatile microorganism with respect to niche specificity. Given all the information available, this species is therefore a very good candidate for studying TFBS abundances under different environmental conditions.
4. De novo TFBS predictions in different E. coli strains.
We first estimated the regulatory potential of several Escherichia coli strains isolated from different sources: human and animal gut (labeled as 2513237219, 2518645559, 2513237251, 2506520037), urinary bladder (2512047041, 2511231170, 2511231198), cerebrospinal fluid (2511231131) and a strain first isolated in 1977 that was later engineered for ethanol production (2513237200). These data were downloaded from the IMG/JGI database (http://img.jgi.doe.gov/). After running two different approaches to predict TFBSs per promoter, we observed a high correlation coefficient between both predictions for all the strains analyzed here except the engineered one (2513237200). Despite the high sequence identity among strains, the most highly regulated functions differ among them. The most remarkable case is colanic acid biosynthesis: the strains located in the gut show complex regulation of some genes of this pathway, unlike the strains located in the urinary tract and cerebrospinal fluid.
However, for genes related to pyrimidine metabolism we observed similar regulatory potential in the strains located in the gut and in the urinary tract.
5. Identification of TFBSs in different strains of E. coli using homology mapping of experimentally validated binding sequence matrices.
The second approach is suited to the analysis of regulatory regions in well-characterized species, such as E. coli, because it relies on mapping TFBSs with position-specific scoring matrices (PSSMs) derived mostly from experimental information. More specifically, we predicted TFBSs using the PSSMs of TFs from RegulonDB (http://regulondb.ccg.unam.mx:80/index.html). We first calculated, with the MATSCAN software (Blanco et al., 2006), the mapping scores of known TFBSs in the nine Escherichia coli strains isolated from different sources described in section 4 (http://img.jgi.doe.gov/). When these strains were then clustered using the TFBS scores as input, they grouped by body location, and we also observed a separation by their capacity to synthesize selenocysteine or not. We also analyzed TFBS abundances per TF in these E. coli strains. After analyzing 86 TFs, we observed that the TFBS distributions per TF are very similar among strains, with a few exceptions. The most outstanding difference is found in the strain that does not synthesize selenocysteine and is related to FhlA. FhlA is a DNA-binding transcriptional activator required for the induction of formate dehydrogenase H (FDH-H) expression. The FDH-H enzyme, in turn, contains selenium in the form of cotranslationally incorporated selenocysteine. The dependence of this strain on external selenocysteine increases the need for tighter control of the FDH-H synthetic pathway. Moreover, maintaining a stock of the FDH-H enzyme inside the cell is also critical under fermentative growth conditions, which are frequent in these host-associated microorganisms.
6. Conclusions and perspectives.
With this PhD thesis we intend to tackle some questions that remain unsolved, such as (1) whether the environment influences the structure and complexity of regulatory regions (e.g. the regulation of gene expression) within the same microbial community, and how this affects the regulation of essential functions of the community in general, and (2) whether the core genome of different strains or highly related species shows differences in terms of gene regulation, and how this correlates with media conditions (natural or not). For instance, we were able to find niche-specific gene regulatory potential as well as intraspecific plasticity of gene regulation. As far as we know, this work constitutes the first study of how the regulatory regions of a gene are shaped by environmental factors. Furthermore, this kind of analysis could have wide applicability in biomedicine, for instance in the in silico design of microbial communities for specific environments as therapeutic strategies.
References
Allen EE, Banfield JF (2005). Community genomics in microbial ecology and evolution. Nat Rev Microbiol 3: 489-98.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM (2005). Comparative metagenomics of microbial communities. Science 308: 554-557.
Gianoulis TA, Raes J, Patel PV, Bjornson R, Korbel JO, Letunic I, Yamada T, Paccanaro A, Jensen LJ, Snyder M, Bork P, Gerstein MB (2009). Quantifying environmental adaptation of metabolic pathways in metagenomics. Proc Natl Acad Sci U S A 106: 1374-1379.
Rodionov DA (2007). Comparative genomic reconstruction of transcriptional regulatory networks in bacteria. Chem Rev 107: 3467-3497.
Li H, Rhodius V, Gross C, Siggia ED (2002). Identification of the binding sites of regulatory proteins in bacterial genomes. Proc Natl Acad Sci U S A 99: 11772-11777.
Blanco E, Messeguer X, Smith TF, Guigo R (2006). Transcription factor map alignment of promoter regions. PLoS Comput Biol 2(5).
Huson DH, Auch AF, Qi J, Schuster SC (2007). MEGAN analysis of metagenomic data. Genome Res 17: 377-386.
Storz G, Imlay JA (1999). Oxidative stress. Curr Opin Microbiol 2: 188-194.
Osorio H, Martinez V, Nieto PA, Holmes DS, Quatrini R (2008). Microbial iron management mechanisms in extremely acidic environments: comparative genomics evidence for diversity and versatility. BMC Microbiol 8: 203.
Novichkov PS, Laikova ON, Novichkova ES, Gelfand MS, Arkin AP, Dubchak I et al (2010). RegPrecise: a database of curated genomic inferences of transcriptional regulatory interactions in prokaryotes. Nucleic Acids Res 38: D111-118.
Grissa I, Vergnaud G, Pourcel C (2007). CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 35: W52-57.
Liu J, Xu X, Stormo GD (2008). The cis-regulatory map of Shewanella genomes. Nucleic Acids Res 36: 5376-5390.
Sun J, Tuncay K, Haidar AA, Ensman L, Stanley F, Trelinski M et al (2007). Transcriptional regulatory network discovery via multiple method integration: application to E. coli K12. Algorithms Mol Biol 2: 2.
Zhang S, Li S, Niu M, Pham PT, Su Z (2011). MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinformatics 12: 238.
Genomics; Adaptation (Biology); Microbiology
575 - General genetics. General cytogenetics. Immunogenetics. Evolution. Phylogeny
Experimental Sciences and Mathematics
Thesis carried out at the Barcelona Supercomputing Center (BSC-CNS)