dc.description.abstract
[eng] Cancer, a complex disease, arises from accumulated somatic genomic and epigenomic changes within tumor cells, typically acquired during an individual’s lifetime. These alterations confer growth advantages, transforming normal cells into cancerous ones. Differences among tumors originating in the same tissue have been demonstrated and characterized in diverse studies using large patient cohorts, such as those of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). Furthermore, each tumor is known to comprise many cell populations, each accumulating different somatic mutations. This knowledge has called into question the traditional classification of tumors and how they are treated. Advances in genome technologies, such as next-generation sequencing, have played an important role in generating vast amounts of tumor data, enabling sophisticated and ambitious bioinformatic analyses. These technologies have been essential for understanding tumor formation and progression, and for the potential translation of that knowledge into the clinic.
Using large-scale public cancer data initiatives, and combining genomic and transcriptomic analyses, we have been involved in diverse cancer-related studies, primarily focused on the identification and interpretation of somatic genomic events. The general goal of the work described in this thesis is therefore to expand the understanding of the genomic basis of tumors through the analysis of somatic events, such as somatic processed pseudogenes, and of previously unexplored genomic elements, namely micropeptides.
First, in collaboration with Dr. Elias Campo from IDIBAPS, we participated in a longitudinal study of Chronic Lymphocytic Leukemia (CLL). In particular, we focused on the analysis of somatic structural variants (SVs), to define and quantify their cell frequency and incorporate them into the study of the subclonal architecture of CLL patients. Using diverse variant calling pipelines and experimental validations, we first identified SVs and observed that their number increased during tumor progression, particularly once patients transformed into a more aggressive form known as Richter’s syndrome. We then designed a strategy to calculate SV variant allele frequencies, which involved exploring coverage variability and read alignment within the mutated genomic regions. Based on this analysis, we observed stable or decreased SV frequencies at diagnosis, in contrast with an increase at Richter transformation.
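As an illustration of what an SV variant allele frequency expresses, the following is a minimal sketch and not the strategy developed in the thesis, whose details are not given here. It assumes the frequency can be approximated as the fraction of reads at a breakpoint that support the variant; the function names, the read counts, and the `tolerance` threshold are all hypothetical.

```python
def sv_vaf(alt_reads: int, ref_reads: int) -> float:
    """Approximate SV allele frequency at one breakpoint.

    Assumption: alt_reads are reads supporting the SV (e.g. split or
    discordant reads) and ref_reads are reads spanning the breakpoint
    that support the reference allele.
    """
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no informative reads at this breakpoint")
    return alt_reads / total


def classify_change(vaf_before: float, vaf_after: float,
                    tolerance: float = 0.05) -> str:
    """Label the change between two time points (hypothetical threshold)."""
    delta = vaf_after - vaf_before
    if delta > tolerance:
        return "increase"
    if delta < -tolerance:
        return "decrease"
    return "stable"


if __name__ == "__main__":
    # Made-up read counts for one SV at two time points.
    earlier = sv_vaf(alt_reads=12, ref_reads=36)
    later = sv_vaf(alt_reads=30, ref_reads=20)
    print(classify_change(earlier, later))
```

A real estimate would also need to account for coverage variability, tumor purity and copy number, which this sketch ignores.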
Another part of the thesis was conducted in the context of the Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative, where we studied the landscape of processed pseudogenes (PPs) in 2585 cancer genomes and assessed their potential functional impact. PPs are mRNA copies randomly integrated into the genome through retrotransposition. Prior to our study, these events had been described as somatic in only a few tumor types. To detect such somatic events, we established a protocol that combines automatic rules applied to somatic structural variants with manual inspection of the genomes. We found evidence for 433 candidate somatic PPs across 251 tumor genomes, uncovering them in cancer types not examined before. Additionally, as a first approach to studying their functional impact, we explored RNA-seq data and found evidence of expression for 17 PPs across 6 tumor types. Reconstruction of the potential PP-host gene fusion transcripts allowed us to predict that these insertions generally generate premature stop codons within the coding region of the host gene.
Finally, we focused on the identification of novel micropeptides, a recently discovered class of genetic elements. Micropeptides are encoded by small open reading frames (smORFs) of fewer than 300 nucleotides and can be stable, functional small proteins. Among other functions, these small peptides have been shown to suppress cancer growth and to play important roles in cancer. We used publicly available genomic and transcriptomic data to identify new micropeptides, focusing on non-annotated DNA regions. First, in collaboration with Dr. Maria Abad from VHIO, we defined a catalog of more than 1,000,000 candidate micropeptide sequences in non-annotated regions. To do so, we performed de novo transcriptome assembly of 6 RNA-seq samples from human pancreatic adenocarcinoma tissues, merged the predicted transcripts, and translated their sequences in silico. The results were filtered by expression level and to remove sequences overlapping known coding sequences. The resulting catalog was then used to analyze pancreatic tumor samples by mass spectrometry. Second, complementing this collaboration, we led a separate study on the identification of new smORFs within non-annotated regions of the human genome. Based on evolutionary conservation features at the DNA and protein level, we identified a set of 8,289 candidate smORFs within intergenic regions of the human genome. We then analyzed their potential transcription in 135 normal samples from the GTEx project, covering 28 tissues. From these data, we found expression evidence for 260 candidate smORFs in at least one normal sample. Lastly, to explore the role of micropeptides in cancer, we analyzed the recurrence of somatic single-nucleotide variants (SNVs) from the ICGC. However, to date, we have not identified any cancer driver mutations within these smORFs. We hope that extending this comparison to other collections of cancer-related somatic variants will identify candidate cancer smORFs.
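To make the notion of in-silico translation of short open reading frames concrete, here is a minimal sketch. It is not the pipeline used in the thesis, which relied on de novo transcriptome assembly, expression filtering, and evolutionary conservation. It assumes forward-strand scanning only, an ATG start codon, the standard genetic code, and a length limit that counts the stop codon; the function name `find_smorfs` and its parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_smorfs(seq: str, max_nt: int = 300):
    """Return (start, end, peptide) for forward-strand ORFs shorter than max_nt.

    Assumptions: an ORF runs from the first ATG in a frame to the next
    in-frame stop codon, and its length includes that stop codon.
    """
    seq = seq.upper()
    results = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif CODON_TABLE.get(codon) == "*":
                end = i + 3
                if end - start < max_nt:
                    # Translate up to, but not including, the stop codon;
                    # unknown codons (e.g. containing N) become "X".
                    peptide = "".join(
                        CODON_TABLE.get(seq[j:j + 3], "X")
                        for j in range(start, i, 3)
                    )
                    results.append((start, end, peptide))
                start = None
    return results


if __name__ == "__main__":
    print(find_smorfs("CCATGGCTTGGTAAGG"))
```

Candidates found this way would still need the kinds of filtering described above (overlap with annotated coding sequences, expression, conservation) before being treated as putative micropeptides.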
Collectively, this thesis offers a comprehensive description of somatic genomic events in cancer, focusing on structural variation and processed pseudogenes, as well as an evaluation of novel gene elements, providing a foundation for future investigations.
ca