Abstract
23 min readArticle Figures and data Abstract Editor's evaluation eLife digest Introduction Results Discussion Materials and methods Appendix 1 Appendix 2 Appendix 3 Appendix 4 Appendix 5 Appendix 6 Appendix 7 Appendix 8 Appendix 9 Appendix 10 Appendix 11 Appendix 12 Data availability References Decision letter Author response Article and author information Metrics Abstract Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40–60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data. Editor's evaluation In this paper, the authors develop a sensitive and specific computational workflow for comprehensively summarizing known and unknown gene content across large collections of genomes and metagenomes. In addition to clustering and categorizing genes on a large scale, the authors show how to use their approach to both explore lineage-specific genes and generate hypotheses for the function of unknown genes. https://doi.org/10.7554/eLife.67667.sa0 Decision letter Reviews on Sciety eLife's review process eLife digest It is estimated that scientists do not know what half of microbial genes actually do. When these genes are discovered in microorganisms grown in the lab or found in environmental samples, it is not possible to identify what their roles are. Many of these genes are excluded from further analyses for these reasons, meaning that the study of microbial genes tends to be limited to genes that have already been described. These limitations hinder research into microbiology, because information from newly discovered genes cannot be integrated to better understand how these organisms work. Experiments to understand what role these genes have in the microorganisms are labor-intensive, so new analytical strategies are needed. To do this, Vanni et al. developed a new framework to categorize genes with unknown roles, and a computational workflow to integrate them into traditional analyses. When this approach was applied to over 400 million microbial genes (both with known and unknown roles), it showed that the share of genes with unknown functions is only about 30 per cent, smaller than previously thought. The analysis also showed that these genes are very diverse, revealing a huge space for future research and potential applications. Combining their approach with experimental data, Vanni et al. were able to identify a gene with a previously unknown purpose that could be involved in antibiotic resistance. This system could be useful for other scientists studying microorganisms to get a more complete view of microbial systems. In future, it may also be used to analyze the genetics of other organisms, such as plants and animals. Introduction Thousands of isolate, single-cell, and metagenome-assembled genomes are guiding us toward a better understanding of life on Earth (Almeida et al., 2019; Cross et al., 2019; Hug et al., 2016; Kopf et al., 2015; Pachiadaki et al., 2019; Pasolli et al., 2019; Sunagawa et al., 2015). At the same time, the ever-increasing number of genomes and metagenomes, unlocking uncharted regions of life’s diversity, (Brown et al., 2015; Eloe-Fadrosh et al., 2016; Hug et al., 2016) are providing new perspectives on the evolution of life (Parks et al., 2018; Spang et al., 2015). However, our rapidly growing inventories of new genes have a glaring issue: between 40% and 60% cannot be assigned to a known function (Almeida et al., 2021; Bernard et al., 2018; Carradec et al., 2018; Price et al., 2018). Current analytical approaches for genomic and metagenomic data (Chen et al., 2019; Franzosa et al., 2018; Huerta-Cepas et al., 2017; Mitchell et al., 2020; Quince et al., 2017) generally do not include this uncharacterized fraction in downstream analyses, constraining their results to conserved pathways and housekeeping functions (Quince et al., 2017). This inability to handle the unknown is an immense impediment to realizing the potential for discovery of microbiology and molecular biology at large (Bernard et al., 2018; Hanson et al., 2009). Predicting function from traditional single sequence similarity appears to have yielded all it can (Arnold, 2018; Arnold, 1998; Brandenberg et al., 2017), thus several groups have attempted to resolve gene function by other means. Such efforts include combining biochemistry and crystallography Jaroszewski et al., 2009; using environmental co-occurrence Buttigieg et al., 2013; by grouping those genes into evolutionarily related families (Bateman et al., 2010; Brum et al., 2016; Wyman et al., 2018; Yooseph et al., 2007); using remote homologies (Bitard-Feildel and Callebaut, 2017; Lobb et al., 2015); or more recently using deep learning approaches (Bileschi et al., 2019; Liu, 2017). In 2018, (Price et al., 2018) developed a high-throughput experimental pipeline that provides mutant phenotypes for thousands of bacterial genes of unknown function being one of the most promising methods to tackle the unknown. Despite their promise, experimental methods are labor-intensive and require novel computational methods that could bridge the existing gap between the genes with known and unknown function. Here, we present a conceptual framework and its translation to a computational workflow that closes the gap by connecting genomic and metagenomic data by the exploitation of groups of homologous genes, and that facilitates the integration of genes of unknown function into ecological, evolutionary, and biotechnological investigations. The conceptual framework is based on the partitioning of the known and unknown fractions into four different categories that reflects the level of darkness (Figure 1A). The category “Known” (K) contains these sequences predicted to harbor domains of known function described by Pfam (Finn et al., 2016). Some sequences without Pfam domains of unknown function, like the ones encoding for small and intrinsically disordered proteins, can have homology to characterized proteins, hence we classify them as ‘Known without Pfam’ (KWP). Lastly, sequences that do not meet either of these criteria are classified as "Genomic Unknown" (GU) if they can be found in genomes from reference databases or contain domains of unknown function (DUF); and ‘Environmental Unknown’ (EU) when only have been observed in environmental samples. By contextualizing the different categories with information from several sources (Figure 1C), we hope this will prove an invaluable resource for including genes of unknown function in evolutionary and ecological studies (Delmont et al., 2022; Gaïa et al., 2021; Holland-Moritz et al., 2021) as well as enhancing the current methods for its experimental characterization. Figure 1 Download asset Open asset Conceptual framework to unify the known and unknown sequence space and integration of the framework in the current analytical workflows. (A) Link between the conceptual framework and the computational workflow to partition the sequence space in the four conceptual categories. AGNOSTOS infers, validates and refines the GCs and combines them in gene cluster communities (GCCs). Then, it classifies them in one of the four conceptual categories based on their level of ‘darkness’. Finally, we add context to each GC based on several sources of information, providing a robust framework for generating hypotheses that can be used to augment experimental data. (B) The computational workflow provides two mechanisms to structure sequence space using GCs, de novo creation of the GCs (DB creation), or integrating the dataset in an existing GC database (DB update). The structured sequence space can then be plugged into traditional analytical workflows to annotate the genes within each GC of the known fraction. With AGNOSTOS, we provide the opportunity to integrate the unknown fraction into microbiome analyses easily. (C) The versatility of the GCs enables analyses at different scales depending on the scope of our experiments. We can group GCs in gene cluster communities based on their shared homologies to perform coarse-grained analyses. On the other hand, we can design fine-grained analyses using the relationships between the genes in a GC, that is detecting network modules in the GC inner sequence similarity network. Additionally, given that GCs are conserved across environments, organisms and experimental conditions give us access to an unprecedented amount of information to design and interpret experimental data. The application of our approach to 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes revealed that the unknown fraction (1) is smaller than has been previously reported (Salazar et al., 2019; Thomas and Segata, 2019), (2) is exceptionally diverse, and (3) is phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. Finally, we show how we can connect all the knowledge produced by our approach to augment the results from experimental data and add context to genes of unknown function through hypothesis-driven molecular investigations. Results AGNOSTOS, a computational workflow to unify the known and the unknown sequence space Driven by the concepts defined in the conceptual framework, we developed AGNOSTOS, a computational workflow that infers, validates, refines, and classifies groups of homologous genes or gene clusters (GCs) in the four proposed categories (Figure 1A; Figure 1B; Appendix 1—figure 1). AGNOSTOS produces GCs with a highly conserved intra-homogeneous structure (Figure 1B), both in terms of sequence similarity and domain architecture homogeneity; it exhausts any existing homology to known genes and provides a proper delimitation of the unknown genes before classifying each GC in one of the four categories (Materials and methods). In the last step, we decorate each GC with a rich collection of contextual data compiled from different sources or generated by analyzing the GC contents in different contexts (Figure 1A). For each GC, we also offer several products that can be used for analytical purposes like improved representative sequences, consensus sequences, sequence profiles for MMseqs2 (Steinegger and Söding, 2017) and HHblits (Steinegger et al., 2019a), or the GC members as a sequence similarity network (Methods). To complement the collection, we also provide a subset of what we define as high-quality GCs. The defining criteria are (1) the representative is a complete gene and (2) more than one-third of genes within a GC are complete genes. First, we used AGNOSTOS in a metagenomic context to show its ability to process complex and noisy datasets. We explored the unknown sequence space of 1,749 human and marine metagenomes. In total, we predicted 322,248,552 genes from the environmental dataset and assigned Pfam annotations to 44% of them (Appendix 1—figure 2A). Next, AGNOSTOS clustered the predicted genes in 32,465,074 GCs and flagged those gene clusters that contain less than ten genes (Skewes-Cox et al., 2014). Flagged gene clusters are not processed through the validation workflow because they don’t have enough sequences to provide reliable results. We flagged 29,461,177 gene clusters, with 19,911,324 singletons (Appendix 1—figure 2A; Appendix 2). A total of 3,003,897 GCs (83% of the original genes) will go through the validation steps. The validation process selected 2,940,257 good-quality clusters (Figure 1B; Appendix 1—table 1; Appendix 3), of which 43% were identified as unknowns after the classification and remote homology refinement steps (Appendix 1—figure 2A, Appendix 4). Lastly, we demonstrate how AGNOSTOS can integrate a new dataset into the already existing metagenomic database by integrating 28,941 genomes from the GTDB_r86 (Appendix 1—figure 2A). With this integration we can build links between the environmental and genomic sequence space by expanding the final collection of GCs with the genes predicted from GTDB_r86. Surprisingly, the integration showed that the environmental GCs (human and marine) not only included a large proportion of the genes predicted from GTDB_r86 (72%) but also that most of these genes (90%) are part of the known sequence space (Appendix 1—figure 2A; Appendix 5). Only 22% of the GTDB_r86 genes created new GCs (2,400,037), with most of these genes (84%) classified as unknowns by AGNOSTOS (Appendix 1—figure 2A; Appendix 5). Finally, only a small proportion of the genes from GTDB_r86 (6%) resulted in singletons (Appendix 1—figure 2A; Appendix 5). The final dataset after being validated and categorized resulted in 5,287,759 GCs (Appendix 1—figure 2A), with both datasets sharing only 922,599 GCs (Appendix 1—figure 2B). The integration of GTDB_r86 into the metagenomic database increased the proportion of GCs in the unknown sequence space from 43% to the 54%. Additionally, AGNOSTOS identified a subset of 203,217 high-quality GCs (Appendix 1—table 2). In these high-quality GCs, we identified 12,313 clusters potentially encoding for small proteins ( ≤ 50 amino acids). Most of these GCs are unknown (66%), which agrees with recent findings on novel small proteins from metagenomes (Sberro et al., 2019). We also observed that the KWP category contains the largest proportion of incomplete genes (Appendix 1—table 3), disrupting the detection and assignment of Pfam domains. But it also incorporates sequences with an unusual amino acid composition that have homology to proteins with high levels of disorder in the DPD database (Perdigão et al., 2017) and has characteristic functions of intrinsically disordered proteins (Habchi et al., 2014) like cellular processes and signaling as predicted by eggNOG annotations (Appendix 1—table 4). As part of the workflow, each GC is complemented with a rich set of information, as shown in Figure 1A (Supplementary file 2A; Appendix 6). Beyond the twilight zone with AGNOSTOS, communities of gene clusters To find relationships between gene clusters, we implemented in AGNOSTOS a method to group them in gene cluster communities (GCCs) (Figure 2A) using remote homologies. As we are dealing with the unknown, we identified GCCs independently in each category. AGNOSTOS uses the gene clusters from category K as a reference to automatically identify the best parameters for the clustering of the gene cluster homology network (Materials and methods; Figure 2B; Appendix 7) and applies the learnt parameters to the other categories. The grouping reduced the final collection of GCs by 87%, producing 673,601 GCCs (Materials and methods; Figure 2B; Appendix 7). Figure 2 Download asset Open asset Overview and validation of the workflow to aggregate GCs in communities. (A) We inferred a gene cluster homology network using the results of an all-vs-all HMM gene cluster comparison with HHBLITS. The edges of the network are based on the HHblits-score/Aligned-columns. Communities are identified by an iterative screening of different MCL inflation parameters and evaluated using five different metrics that consider the inter- and intra-community properties. (B) Comparison of the number of GCs and GCCs for each of the functional categories. (C) Validation of the GCCs inference based on the environmental genes annotated as proteorhodopsins. Ribbons in the alluvial plot are genes, and each stacked bar corresponds (from left to right) to the (1) gene taxonomic classification at the domain level, (2) GC membership, (3) GCC membership and (4) MicRhoDE operational classification. (D) Validation of the GCCs inference based on ribosomal proteins based on standard and high-quality GCs. We validated our approach to capture remote homologies between related GCs using two well-known gene families present in our environmental datasets, proteorhodopsins (Béjà et al., 2000; Béjà et al., 2001) and bacterial ribosomal proteins (Méheust et al., 2019). Theoretically, we would expect to have one or a very low number of GCCs for each of these gene families. We identified 64 GCs (12,184 genes) and 3 GCCs (Appendix 7) containing genes classified as proteorhodopsin (PR). One community from the category K contained 99% of the PR annotated genes (Figure 2C), except 85 genes taxonomically annotated as viral and assigned to the PR Supercluster I (Boeuf et al., 2015) within two GU communities (five GU gene clusters; Appendix 7). For the ribosomal proteins, the results were not so satisfactory. We identified 1843 GCs (781,579 genes) and 98 GCCs. The number of GCCs is larger than the expected number of ribosomal protein families (16) used for validation. When we used high-quality GCs (Appendix 7), we got closer to the expected number of GCCs (Figure 2D). With this subset, we identified 26 GCCs and 145 GCs (1687 genes). The cross-validation of our method against the approach used in Méheust et al., 2019 (Appendix 7) confirms the intrinsic complexity of analyzing metagenomic data. Both approaches showed a high agreement in the GCCs identified (Appendix 7—table 2). Still, our method inferred fewer GCCs for each of the ribosomal protein families (Appendix 7—figure 4), coping better with the complexities of a metagenomic setup, such as incomplete genes (Appendix 1—table 5). AGNOSTOS uncovers a smaller yet highly diverse unknown sequence space Among our primary design goals in developing AGNOSTOS is to unearth ecological and evolutionary information while unifying the known and unknown sequence space of genomes and metagenomes. One can use AGNOSTOS GCs and GCCs as analogs to operational protein families (Schloss and Handelsman, 2008) with the benefits of using high-quality gene clusters and remote homology searches. Our exhaustive analysis of the AGNOSTOS inferred GCs and GCCs described in the following sections, revealed that the unknown sequence space is smaller than what has been reported so far through traditional genomic and metagenomic analysis approaches, where the unknown fraction can reach up to 60% in marine metagenomes (Salazar et al., 2019) or up to 40% in human metagenomes (Thomas and Segata, 2019, Figure 3A). Our workflow recruited as much as 71% of genes in human-related metagenomic samples and 65% of the genes in marine metagenomes into the known sequence space. In both human and marine microbiomes, the genomic unknown fraction showed a similar proportion of genes (21%, Figure 3A). The number of genes corresponding to EU gene clusters is higher in marine metagenomes; 12% of the genes are part of this GC category. We obtained a comparable result when we evaluated the genes from the GTDB_r86, 75% of bacterial and 64% of archaeal genes were part of the known sequence space. Archaeal genomes contained more unknowns than those from Bacteria, where 30% of the genes are classified as genomic unknowns in Archaea, and only 20% in Bacteria (Figure 3A; Supplementary file 1B). Figure 3 Download asset Open asset The extent of the known and unknown sequence space. (A) Proportion of genes in the known and unknown. (B) Accumulation curves for the known and unknown sequence space at the GC- level for the metagenomic and genomic data. from TARA, MALASPINA, OSD2014 and HMP-I/II projects. (C) Collector curves comparing the human and marine biomes. Colored lines represented the mean of 1000 permutations and shaded in gray the standard deviation. Non-abundant singleton clusters were excluded from the accumulation curves calculation. (D) Amino acid distribution in the known and unknown sequence space. In all cases, the four categories have been simplified as known (K, KWP) and unknown (GU, EU). To further evaluate the differences between the known and unknown sequence space, we calculated the accumulation rates of GCs and GCCs combining the categories K and KWP as knowns, and GU and EU as unknowns to get a general overview of both fractions. For the metagenomic dataset we used 1264 metagenomes (18,566,675 GCs and 282,580 GCCs) and for the genomic dataset 28,941 genomes (9,586,109 GCs and 496,930 GCCs). The rate of accumulation of unknown GCs was three times higher than the known (2 X for the genomic), and in both cases the curves were far from reaching a plateau (Figure 3B). This is not the case for the GCC accumulation curves (Appendix 1—figure 4B), which reached a plateau. The accumulation rate is largely determined by the number of singletons, especially singletons from EUs (Appendix 8 and Appendix 1—figure 5). While the accumulation rate of known GCs between marine and human metagenomes is almost identical, there are striking differences for the unknown GCs (Figure 3C). These differences are maintained even when we remove the virus-enriched samples from the marine metagenomes (Appendix 1—figure 4A). Although the marine metagenomes include a large variety of environments, from coastal to the deep sea, the known space remains quite constrained. Next, we wanted to know how much of the sequence space we integrated with AGNOSTOS was found in other databases (Appendix 9). Despite only including marine and human metagenomes in our database, we already cover, in average, 76% of the sequence space of seven datasets spanning different databases and environments (Appendix 9—figure 1). By screening MGnify (Mitchell et al., 2020) (release 2018_09; 11 biomes; 843,535,6116 proteins) we identified freshwater, soil and human non-digestive as the biomes less covered by our data (Appendix 1—figure 6). Two of the seven analyzed datasets are designed to study genes of unknown function (Appendix 9—table 1). On Wyman et al., 2018, where they defined Function Unknown Families of homologous proteins (FunkFams), we identified 20% of their FunkFams to be members of the known sequence space. On Price et al., 2018, we classified as known, 44% of the genes of unknown function used in their experimental conditions. One indirect consequence of our approach is that we can provide a detailed view of the sequence space at the amino acid level. We estimated the number of amino acids belonging to the known (K and KWP combined) or unknown (GU and EU combined) sequence space, and how many of these amino acids are contained in a Pfam domain. From the 90,128,659,316 amino acids analyzed, most of the amino acids in metagenomes (74%) and genomes (80%) are in the known sequence space (Figure 3D; Supplementary file 1B) while only 22% in metagenomes and 15% in genomes are part of the unknowns. In both cases, approximately 40% of the amino acids in the known sequence space were part of a Pfam domain (Figure 3D; Supplementary file 1B). While this result is expected based on the large number of genes present in the known space (Figure 3A), what is surprising is the low proportion of amino acids (2%) corresponding to DUF Pfam domains in genomes and metagenomes (Figure 3D). If we use as reference the proportions of amino acids observed in the known sequence space, we can hypothesize that there are still many DUFs to be unearthed. With AGNOSTOS and its thorough validation and characterization of the genomic and environmental unknowns, we provide the basic building blocks (gene clusters’ multiple sequence alignments) to identify conserved regions that might become new potential DUFs. The unknown sequence space has a limited ecological distribution in human and marine environments Although the role of the unknown fraction in the environment is still a mystery, the large number of gene counts and abundance observed underlines its inherent ecological relevance (Figure 4A). In some metagenomes, the genomic unknown fraction can account for more than 40% of the total gene abundance observed (Figure 4A). The environmental unknown fraction is also relevant in several samples, where singleton GCs are the majority (Figure 4A). We identified two metagenomes with an unusual composition in terms of environmental unknown singletons. The marine metagenome corresponds to a sample from Lake Faro (OSD42), a meromictic saline with a unique extreme environment where Archaea plays an important role (La Cono et al., 2013). The HMP metagenome (SRS143565) that corresponds to a human sample from the right cubital fossa from a healthy female subject. To understand this unusual composition, we should perform further analyses to discard potential technical artifacts like sample contamination. Figure 4 Download asset Open asset Distribution of the unknown sequence space in the human and marine metagenomes. (A) Ratio between the proportion of the number of genes and their estimated abundances per cluster category and biome. Columns represented in the facet depicts three cluster categories based on the size of the clusters. (B) Relationship between the ratio of Genomic unknowns and Environmental unknowns in the HMP-I/II metagenomes. Gastrointestinal tract metagenomes are enriched in Genomic unknown sequences compared to the other body sites. (C) Relationship between the ratio of Genomic unknowns and Environmental unknowns in the TARA Oceans metagenomes. Girus- and virus-enriched metagenomes show a higher proportion of both unknown sequences (genomic and environmental) than the Archaea|Bacteria enriched fractions. (D) Environmental distribution of GCs and GCCs based on Levin’s niche breadth index. We obtained the significance values after generating 100 null gene cluster abundance matrices using the quasiswap algorithm. The ratio between the unknown and known GCs is useful to reveal which metagenomes are enriched in GCs of unknown function (upper left quadrant in Figure 4B–C) and it can be used as a proxy to assess the sequence contained in a metagenome. In human metagenomes, this ratio can distinguish between body sites, with the gastrointestinal tract, an ecologically complex environment (Qin et al., 2010), significantly enriched with genomic unknowns. Furthermore, it is not surprising that the human and marine metagenomes with the largest ratio of unknowns are those samples enriched with viral sequences. Specifically, in the HMP metagenomes are those samples identified to contain crAssphages (Dubinkina et al., 2016; Edwards et al., 2019) and HPV viruses (Ma et al., 2014, Supplementary file 1C; Appendix 1—figure 7). In marine metagenomes (Figure 4C), the highest ratio in genomic and environmental unknowns correspond to the ones enriched with viruses and giant viruses. We performed a large-scale analysis to investigate the occurrence patterns of the GCs in the environment by analyzing their abundance and distribution breadth. The narrow distribution of the unknown fraction (Figure 4D) suggests that these GCs might provide a selective advantage and be necessary to adapt to specific environmental conditions. But the pool of broadly distributed environmental unknowns is the most exciting result. We identified traces of potential ubiquitous organisms left uncharacterized by traditional approaches, as more than 80% of these GCs cannot be associated with a metagenome-assembled genome (MAG) (Appendix 1—table 6, Appendix 10). The genomic unknown sequence space is lineage-specific With the inclusion of the genomes from GTDB_r86, we have access to a phylogenomic framework that can be used to assess how taxonomically restricted is a GC within a lineage, hereafter referred to as lineage-specific genes (Johnson, 2018; Mendler et al., 2019) and how conserved (phylogenetic conservation) a GC is across the different clades in the GTDB_r86 phylogenomic tree (Martiny et al., 2013). We identified 781,814 lineage-specific GCs and 464,923 phylogenetically conserved (p < 0.05) GCs in Bacteria (Supplementary file 1D; Appendix 11 for Archaea). The number of lineage-specific GCs increases with the Relative Evolutionary Distance (Parks et al., 2018, Figure 5A) and differences between the known and the unknown fraction start to be evident at the family level resulting in 4 X more unknown lineage-specific GCs at the species level. In general terms, the unknown GCs are more phylogenetically conserved (GCs shared among members of deep clades) than the known (Figure 5B, p < 0.0001), revealing the importance of the genome’s uncharacterized fraction. However, the lineage-specific unknown GCs are less phylogenetically conserved (Figure 5B) than the known, agreeing with the large number of lineage-specific GCs observed at genus and species level (Figure 5A). Figure 5 Download asset Open asset Phylogenomic exploration of the unknown sequence space. (A) Distribution of the lineage-specific GCs by taxonomic level. Lineage-specific unknown GCs are more abundant in the lower taxonomic levels (genus
Discussion(0)
No comments yet. Be the first to comment.