Abstract
Core gene phylogenies provide a window into early evolution, but different gene sets and analytical methods have yielded substantially different views of the tree of life. Trees inferred from a small set of universal core genes have typically supported a long branch separating the archaeal and bacterial domains. By contrast, recent analyses of a broader set of non-ribosomal genes have suggested that Archaea may be less divergent from Bacteria, and that estimates of inter-domain distance are inflated due to accelerated evolution of ribosomal proteins along the inter-domain branch. Resolving this debate is key to determining the diversity of the archaeal and bacterial domains, the shape of the tree of life, and our understanding of the early course of cellular evolution. Here, we investigate the evolutionary history of the marker genes key to the debate. We show that estimates of a reduced Archaea-Bacteria (AB) branch length result from inter-domain gene transfers and hidden paralogy in the expanded marker gene set. By contrast, analysis of a broad range of manually curated marker gene datasets from an evenly sampled set of 700 Archaea and Bacteria reveals that current methods likely underestimate the AB branch length due to substitutional saturation and poor model fit; that the best-performing phylogenetic markers tend to support longer inter-domain branch lengths; and that the AB branch lengths of ribosomal and non-ribosomal marker genes are statistically indistinguishable. Furthermore, our phylogeny inferred from the 27 highest-ranked marker genes recovers a clade of DPANN at the base of the Archaea and places the bacterial Candidate Phyla Radiation (CPR) within Bacteria as the sister group to the Chloroflexota.
Editor's evaluation
This contribution is of interest to molecular phylogeny scientists, in particular, and to a broad public interested in early evolution, in general, as it elegantly supports the long-standing (but recently challenged) hypothesis that bacteria and archaea are separated by a long branch. https://doi.org/10.7554/eLife.66695.sa0
Introduction
Much remains unknown about the earliest period of cellular evolution and the deepest divergences in the tree of life. Phylogenies encompassing both Archaea and Bacteria have been inferred from a ‘universal core’ set of 16–56 genes encoding proteins involved in translation and other aspects of the genetic information processing machinery (Ciccarelli et al., 2006; Fournier and Gogarten, 2010; Harris et al., 2003; Hug et al., 2016; Koonin, 2003; Mukherjee et al., 2017; Petitjean et al., 2014; Ramulu et al., 2014; Raymann et al., 2015; Theobald, 2010; Williams et al., 2020). While representing a small fraction of the total genome of any organism (Dagan and Martin, 2006), these genes are thought to predominantly evolve vertically and are thus best suited for reconstructing the tree of life (Ciccarelli et al., 2006; Creevey et al., 2011; Puigbò et al., 2009; Ramulu et al., 2014; Theobald, 2010). In these analyses, the branch separating Archaea from Bacteria (hereafter, the AB branch) is often the longest internal branch in the tree (Cox et al., 2008; Gogarten et al., 1989; Hug et al., 2016; Iwabe et al., 1989; Pühler et al., 1989; Williams et al., 2020). In molecular phylogenetics, branch lengths are usually measured in expected numbers of substitutions per site, with a long branch corresponding to a greater degree of genetic change. Long branches can therefore result from high evolutionary rates, long periods of absolute time, or a combination of the two.
If a sufficient number of fossils are available for calibration, molecular clock models can, in principle, disentangle the contributions of these effects. However, limited fossil data (Sugitani et al., 2015) is currently available to calibrate early divergences in the tree of life (Betts et al., 2018; Horita and Berndt, 1999; Lepland et al., 2002; van Zuilen et al., 2002), and as a result, the ages and evolutionary rates of the deepest branches of the tree remain highly uncertain. Recently, Zhu et al., 2019 inferred a phylogeny from 381 genes distributed across Archaea and Bacteria using the supertree method ASTRAL (Mirarab et al., 2014). These markers increase the total number of genes compared to other universal marker sets and comprise not only proteins involved in information processing but also proteins affiliated with most other functional COG categories, including metabolic processes (Supplementary file 1). The genetic distance (AB branch length) between the domains (Zhu et al., 2019) was estimated from a concatenation of the same marker genes, resulting in a much shorter AB branch length than observed with the core universal markers (Hug et al., 2016; Williams et al., 2020). These analyses were consistent with the hypothesis (Petitjean et al., 2014; Zhu et al., 2019) that the apparent deep divergence of Archaea and Bacteria might be the result of an accelerated evolutionary rate of genes encoding translational and in particular ribosomal proteins along the AB branch as compared to other genes. Interestingly, the same observation was made previously using a smaller set of 38 non-ribosomal marker proteins (Petitjean et al., 2014), although the difference in AB branch length between ribosomal and non-ribosomal markers in that analysis was reported to be substantially lower (roughly twofold, compared to roughly 10-fold for the 381 protein set [Petitjean et al., 2014; Zhu et al., 2019]). 
A higher evolutionary rate of ribosomal genes might result from the accumulation of compensatory substitutions at the interaction surfaces among the protein subunits of the ribosome (Petitjean et al., 2014; Valas and Bourne, 2011) or as a compensatory response to the addition or removal of ribosomal subunits early in evolution (Petitjean et al., 2014). Alternatively, differences in the inferred AB branch length might result from varying rates or patterns of evolution between the traditional core genes (Spang et al., 2015; Williams et al., 2020) and the expanded set (Zhu et al., 2019). Substitutional saturation (multiple substitutions at the same site) (Jeffroy et al., 2006) and across-site compositional heterogeneity can both impact the inference of tree topologies and branch lengths (Foster, 2004; Lartillot et al., 2007; Lartillot and Philippe, 2004; Quang et al., 2008; Wang et al., 2008; Williams et al., 2021). These difficulties are particularly significant for ancient divergences (Gouy et al., 2015). Failure to model site-specific amino acid preferences has previously been shown to lead to underestimation of the AB branch length due to a failure to detect convergent changes (Tourasse and Gouy, 1999; Williams et al., 2020), although the published analysis of the 381 marker set did not find evidence of a substantial impact of these features on the tree as a whole (Zhu et al., 2019). Those analyses also identified phylogenetic incongruence among the 381 markers, but did not determine the underlying cause (Zhu et al., 2019). 
This recent work (Zhu et al., 2019) raises two important issues regarding the inference of the universal tree: first, that estimates of the genetic distance between Archaea and Bacteria from classic ‘core genes’ may not be representative of ancient genomes as a whole, and second, that there may be many more suitable genes to investigate early evolutionary history than generally recognized, providing an opportunity to improve the precision and accuracy of deep phylogenies. Here, we investigate these issues in order to determine how different methodologies and marker sets affect estimates of the evolutionary distance between Archaea and Bacteria. First, we examine the evolutionary history of the 381-gene marker set (hereafter, the expanded marker gene set) and identify several features of these genes, including instances of inter-domain gene transfers and mixed paralogy, that may contribute to the inference of a shorter AB branch length in concatenation analyses. Then, we re-evaluate the marker gene sets used in a range of previous analyses to determine how these and other factors, including substitutional saturation and model fit, contribute to inter-domain branch length estimations and the shape of the universal tree. Finally, we identify a subset of marker genes least affected by these issues and use these to estimate an updated tree of the primary domains of life and the length of the branch that separates Archaea and Bacteria.
Results and discussion
Genes from the expanded marker set are not widely distributed in Archaea
The 381-gene set was derived from a larger set of 400 genes used to estimate the phylogenetic placement of new lineages as part of the PhyloPhlAn method (Segata et al., 2013) and applied to a taxonomic selection that included 669 Archaea and 9906 Bacteria (Zhu et al., 2019).
Perhaps reflecting the focus on Bacteria in the original application, the phylogenetic distribution of the 381 marker genes in the expanded set varies substantially (Supplementary file 1), with many being poorly represented in Archaea. Specifically, 41% of the published gene trees (https://biocore.github.io/wol/; Zhu et al., 2019) contain fewer than 25% of the sampled archaea, with 14 and 68 of these trees including 0 or ≤10 archaeal homologues, respectively. Across all of the gene trees, archaeal homologues comprise 0–14.8% of the dataset (Supplementary file 1). Manual inspection of subsampled versions of these gene trees suggested that 317/381 did not possess an unambiguous branch separating the archaeal and bacterial domains (Supplementary file 1). These distributions suggest that many of these genes are not broadly present in both domains, and that some might be specific to Bacteria.
Conflicting evolutionary histories of individual marker genes and the inferred species tree
In the published analysis of the 381-gene set (Zhu et al., 2019), the tree topology was inferred using the supertree method ASTRAL (Mirarab et al., 2014), with branch lengths inferred on this fixed tree from a marker gene concatenation (Zhu et al., 2019). The topology inferred from this expanded marker set (Zhu et al., 2019) is similar to previous trees (Castelle and Banfield, 2018; Hug et al., 2016) and recovers Archaea and Bacteria as reciprocally monophyletic domains, albeit with a shorter AB branch than in earlier analyses. However, the individual gene trees (Zhu et al., 2019) differ regarding domain monophyly: Archaea and Bacteria are recovered as reciprocally monophyletic groups in only 22 of the 381 published (Zhu et al., 2019) maximum likelihood (ML) gene trees of the expanded marker set (Supplementary file 1).
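To make the criterion of reciprocal domain monophyly concrete, the following minimal Python sketch checks whether a gene tree contains a branch that separates all archaeal sequences from all bacterial sequences. It is an illustration only, not the procedure used in the study: the nested-tuple tree representation, the function names, and the toy taxon labels (a1, b1, and so on) are all assumptions made for this example.

```python
# Illustrative sketch (not the authors' pipeline): a tree is a nested tuple of
# leaf names, e.g. ((("a1", "a2"), "a3"), (("b1", "b2"), "b3")).

def clade_leaf_sets(tree):
    """Return (leaves of this subtree, leaf sets of every clade inside it)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def domains_reciprocally_monophyletic(tree, archaea):
    """True if one branch splits the tree into exactly Archaea | Bacteria.

    Every leaf not listed in `archaea` is treated as bacterial. Because the
    tree is treated as unrooted, it is enough for either domain to form a clade.
    """
    all_leaves, clades = clade_leaf_sets(tree)
    arch = frozenset(archaea) & all_leaves
    bact = all_leaves - arch
    if len(arch) < 2 or len(bact) < 2:
        # Mirrors the text: families with 0 or 1 archaeal homologue are untestable.
        raise ValueError("need at least two sequences from each domain")
    return arch in clades or bact in clades


archaea = {"a1", "a2", "a3"}
# Hypothetical vertically inherited gene: the domains form separate clades.
vertical = ((("a1", "a2"), "a3"), (("b1", "b2"), "b3"))
# Hypothetical inter-domain transfer: b3 nests among the archaeal sequences.
transfer = ((("a1", "b3"), "a2"), (("b1", "b2"), "a3"))

print(domains_reciprocally_monophyletic(vertical, archaea))
print(domains_reciprocally_monophyletic(transfer, archaea))
```

A check of this kind only describes the topology of a single point-estimate tree; it says nothing about whether a violation of monophyly is statistically supported, which is why the AU tests described next are needed.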
Since single-gene trees often fail to strongly resolve ancient relationships, we used approximately unbiased (AU) tests (Shimodaira, 2002) to evaluate whether the failure to recover domain monophyly in the published ML trees is statistically supported. For computational tractability, we performed these analyses on a 1000-species subsample of the full 10,575-species dataset that was compiled in the original study (Zhu et al., 2019). For 79 of the 381 genes, we could not perform the test because the gene family did not contain any archaeal homologues (56 genes) or contained only one archaeal homologue (23 genes); in total, the 1000-species sample included 74 archaeal genomes. For the remaining 302 genes, domain monophyly was rejected at the 5% significance level (with Bonferroni correction, p<0.0001656) for 151 out of 302 (50%) genes. As a comparison, we performed the same test on several smaller marker sets used previously to infer a tree of life (Coleman et al., 2021; Petitjean et al., 2014; Williams et al., 2020); none of the markers in those sets rejected reciprocal domain monophyly (p>0.05 for all genes; with Bonferroni correction, Coleman: >0.001724; Petitjean: >0.001316; Williams: >0.00102; Figure 1A). In what follows, we refer to four published marker gene sets as (i) the expanded set (381 genes; Zhu et al., 2019); (ii) the core set (49 genes; Williams et al., 2020), encoding ribosomal proteins and other conserved information-processing functions; itself a consensus set of several earlier studies (Da Cunha et al., 2017; Spang et al., 2015; Williams et al., 2012); (iii) the non-ribosomal set (38 genes, broadly distributed and explicitly selected to avoid genes encoding ribosomal proteins; Petitjean et al., 2014); and (iv) the bacterial set (29 genes used in a recent analysis of bacterial phylogeny; Coleman et al., 2021).
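The Bonferroni-corrected thresholds quoted for the AU tests follow from dividing the 5% significance level by the number of genes tested in each marker set. The short Python sketch below reproduces that arithmetic; the dictionary keys and function name are labels chosen for this illustration. Note that the expanded set uses the 302 testable genes rather than all 381.

```python
# Worked arithmetic for the Bonferroni-corrected AU-test thresholds.
ALPHA = 0.05

# Number of genes actually tested in each marker set, as given in the text.
genes_tested = {
    "expanded": 302,       # 381 genes minus 79 with zero or one archaeal homologue
    "core": 49,            # Williams et al., 2020
    "non-ribosomal": 38,   # Petitjean et al., 2014
    "bacterial": 29,       # Coleman et al., 2021
}


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


thresholds = {
    name: bonferroni_threshold(ALPHA, n) for name, n in genes_tested.items()
}

for name, threshold in thresholds.items():
    print(f"{name}: reject domain monophyly only if p < {threshold:.4g}")
```

A gene is counted as rejecting domain monophyly only when its AU-test p-value falls below the threshold for its marker set.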
Figure 1. Vertically evolving marker genes support a greater evolutionary distance between Archaea and Bacteria. (A) Expanded set genes that reject domain monophyly (p<0.05, approximately unbiased [AU] test, with Bonferroni correction [see main text]) support significantly shorter Archaea-Bacteria (AB) branch lengths when constrained to follow a domain monophyletic tree (p=3.653 × 10⁻⁶, Wilcoxon rank-sum test). None of the marker genes from several other published analyses significantly reject domain monophyly (Bonferroni-corrected p>0.05, AU test, for all genes tested), consistent with vertical inheritance from the LUCA (last universal common ancestor) to the last common ancestors of Archaea and Bacteria, respectively. (B) Two measures of evolutionary proximity (Zhu et al., 2019), AB branch length and relative AB distance, are positively correlated (R = 0.7426499, p<2.2 × 10⁻¹⁶). We considered two complementary proxies of marker gene verticality: ∆LL (C: against AB branch length; D: against relative AB length), which reflects the degree to which marker genes reject domain monophyly (C: p=0.009013 and R = –0.2317894; D: p=0.0001051 and R = –0.2213292); and the between-domain split score (E: against AB branch length; F: against relative AB length), which quantifies the extent to which marker genes recover monophyletic Archaea and Bacteria; a higher split score (see Materials and methods) indicates the splitting of domains into multiple gene tree clades due to gene transfer, reciprocal sorting out of paralogs, or lack of phylogenetic resolution (E: p=0.0005304 and R = –0.3043537; F: p=2.572 × 10⁻⁶ and R = –0.2667739).
We also considered a split score based on within-domain relationships (G); between- and within-domain split scores are positively correlated (R = 0.836679, p<2.2 × 10⁻¹⁶, Pearson’s correlation), indicating that markers that recover Archaea and Bacteria as monophyletic also tend to recover established within-domain relationships. (H) Inferred AB length decreases as marker genes of lower verticality (larger ∆LL) are added to the concatenate. Marker genes were sorted by ∆LL, the difference in log-likelihood between the maximum likelihood gene family tree under a free topology search and the log-likelihood of the best tree constrained to obey domain monophyly. Note that 79/381 expanded set markers had zero or one archaea in the 1000-species subsample and so could not be included in these analyses; of the remaining 302 markers, 176 have AB branch lengths very close to 0 in the constraint tree (as seen in panel A). In these plots, we removed all markers with an AB branch length of <0.00001; see Figure 1—figure supplements 1–13 for all plots. Nonlinear trendlines were estimated using LOESS regression.
To investigate why 151 of the marker genes rejected the reciprocal monophyly of Archaea and Bacteria, we returned to the full dataset (Zhu et al., 2019), annotated each sequence in each marker gene family by assigning proteins to KOs, PFAMs and Interpro domains, among others (Supplementary file 1, see Materials and methods for details), and manually inspected the tree topologies (Supplementary file 1). This revealed that the major causes of domain polyphyly observed in gene trees were inter-domain gene transfer (in 359 out of 381 gene trees [94.2%]) and mixing of sequences from distinct paralogous families (in 246 out of 381 gene trees [64.6%]).
For instance, marker genes encoding ABC-type transporters (p0131, p0151, p0159, p0174, p0181, p0287, p0306, p0364), tRNA synthetases (i.e., p0000, p0011, p0020, p0091, p0094, p0202), and aminotransferases and dehydratases (i.e., p0073/4-aminobutyrate aminotransferase; p0093/3-isopropylmalate dehydratase) often comprised a mixture of paralogs. Together, these analyses indicate that the evolutionary histories of the individual markers of the expanded set differ from each other and from the species tree. The original study investigated and acknowledged (Zhu et al., 2019) the varying levels of congruence between the marker phylogenies and the species tree, but did not investigate the underlying causes. Our analyses establish the basis for these disagreements in terms of gene transfers and the mixing of orthologs and paralogs within and between domains. The estimation of genetic distance based on concatenation relies on the assumption that all of the genes in the supermatrix evolve on the same underlying tree; genes with different gene tree topologies violate this assumption and should not be concatenated because the topological differences among sites are not modeled, and so the impact on inferred branch lengths is difficult to predict. In practice, it is often difficult to be certain that all of the markers in a concatenate share the same gene tree topology, and the analysis proceeds on the hypothesis that a small proportion of discordant genes are not expected to seriously impact the inferred tree. However, the concatenated tree inferred from the expanded marker set differs from previous trees in that the genetic distance between Bacteria and Archaea is greatly reduced, such that the AB branch length appears comparable to distances among bacterial phyla (Zhu et al., 2019). 
Since an accurate estimate of the AB branch length has a major bearing on unanswered questions regarding the root of the universal tree (Gouy et al., 2015), we next evaluated the impact of the conflicting gene histories within the expanded marker set on inferred AB branch length.
The inferred branch length between Archaea and Bacteria is shortened by inter-domain gene transfer and hidden paralogy
To investigate the impact of gene transfers and mixed paralogy on the AB branch length inferred by gene concatenations (Zhu et al., 2019), we compared branch lengths estimated from markers on the basis of whether or not they rejected domain monophyly in the expanded marker set (Figure 1A). To estimate AB branch lengths for genes in which the domains were not monophyletic in the ML tree, we first performed a constrained ML search to find the best gene tree that was consistent with domain monophyly for each family under the LG + G4 + F model in IQ-TREE 2 (Minh et al., 2020). While it may seem strained to estimate the length of a branch that does not appear in the ML tree, we reasoned that this approach would provide insight into the contribution of these genes to the AB branch length in the concatenation, in which they conflict with the overall topology. AB branch lengths were significantly (p=3.653 × 10⁻⁶, Wilcoxon rank-sum test) shorter for markers that rejected domain monophyly (Bonferroni-corrected p<0.0001656; Figure 1A): the mean AB branch length was 0.00668 substitutions/site for markers that significantly rejected domain monophyly and 0.287 substitutions/site for markers that did not reject domain monophyly. This behavior might result from marker gene transfers reducing the number of fixed differences between the domains, so that the AB branch length in a tree in which Archaea and Bacteria are constrained to be reciprocally monophyletic will tend towards 0 as the number of transfers increases.
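As a rough illustration of the constrained search described above, the Python sketch below builds a domain-monophyly constraint tree in Newick format and computes ∆LL, taken here (following the Figure 1 legend) as the log-likelihood of the unconstrained ML tree minus that of the best constrained tree. The function names, marker identifiers, and log-likelihood values are hypothetical, and the sketch assumes taxon labels contain no characters that need quoting in Newick. A constraint file of this shape is the kind of input accepted by the constraint-tree option of IQ-TREE 2 (`-g`); the exact commands used in the study are given in its Materials and methods, not here.

```python
# Illustrative sketch only; names and numbers below are made up for the example.

def domain_constraint_newick(archaea, bacteria):
    """Multifurcating constraint: Archaea and Bacteria each form one clade.

    Relationships within each domain are left unresolved, so the ML search is
    free to optimise them while keeping the two domains apart.
    """
    if len(archaea) < 2 or len(bacteria) < 2:
        raise ValueError("need at least two sequences from each domain")
    return "((" + ",".join(archaea) + "),(" + ",".join(bacteria) + "));"


def delta_ll(loglik_unconstrained, loglik_constrained):
    """Log-likelihood cost of forcing domain monophyly (larger = less vertical)."""
    return loglik_unconstrained - loglik_constrained


constraint = domain_constraint_newick(["a1", "a2", "a3"], ["b1", "b2", "b3"])
print(constraint)

# Hypothetical (unconstrained, constrained) log-likelihoods per marker.
logliks = {
    "markerA": (-10250.4, -10251.1),
    "markerB": (-8420.9, -8533.2),
    "markerC": (-15002.0, -15019.6),
}

# Rank markers from most to least vertical, i.e. by increasing delta-LL.
ranked = sorted(logliks, key=lambda marker: delta_ll(*logliks[marker]))
for marker in ranked:
    print(marker, round(delta_ll(*logliks[marker]), 1))
```

Ranking markers by increasing ∆LL gives the order in which genes are added to the concatenate in Figure 1H, with the most vertically evolving markers first.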
To test the hypothesis that phylogenetic incongruence among markers might reduce the inferred AB distance, we evaluated the relationship between AB distance and two complementary metrics of marker gene verticality: ∆LL, the difference in log-likelihood between the constrained ML tree and the ML gene tree for the extent to which a marker gene the reciprocal monophyly of Bacteria and and the et al., 2020), which measures the extent to which marker genes recover established relationships for taxonomic levels of interest at the level of or distributions of gene trees to for phylogenetic (see Materials and We evaluated split scores at both the between-domain and within-domain (Figure 1—figure supplements ∆LL and between-domain split score were positively correlated with each other (Figure 1—figure and correlated with both AB length (Figure and and relative AB distance (Figure and an (Zhu et al., 2019) that distances within and between domains. Interestingly, between-domain and within-domain split scores were strongly positively correlated (Figure and the same relationships between within-domain split AB branch and relative AB distance were observed (Figure 1—figure supplements and these suggest that genes that recover the reciprocal monophyly of Archaea and Bacteria also evolve more vertically within each and that these vertically evolving marker genes support a longer AB branch and a greater AB Zhu et al., 2019 also recovered a significant relationship between gene verticality and relative AB distance (see with these AB branch lengths estimated using concatenation as numbers of markers (i.e., markers with higher ∆LL) were added to the concatenate (Figure 1). These suggest that inter-domain gene transfers reduce the overall AB branch length when included in a for the relationship between marker gene verticality and AB branch length could be that vertically evolving genes higher rates of sequence evolution. 
For a set of genes that at the same on the species tree, the mean distance in substitutions per site, for gene trees using the et al., a of evolutionary distances were significantly positively correlated with ∆LL and between-domain split score R = split R = Figure 1—figure supplements and indicating that vertically evolving genes evolve that of ∆LL and split score the longer AB branches of vertically evolving genes not appear to result from a evolutionary rate for these genes. these indicate that the of genes that not support the reciprocal monophyly of Archaea and Bacteria, or taxonomic in the universal concatenate the reduced estimated AB branch length. ancient vertically evolving genes To estimate the AB branch length and the phylogeny of using a dataset that some of the issues identified we performed a of several previous studies to identify a consensus set of vertically evolving marker genes. We identified markers from these analyses by to the COG (Supplementary file et al., et al., 2019), sequences from a representative sample of archaeal and bacterial genomes (Supplementary file and performed and to a set of markers that recovered archaeal and bacterial monophyly (see Materials and to non-ribosomal markers had a greater number of gene and of mixed In particular, for the original set of COG families (see in Materials and we rejected families based on the inferred ML trees due to a degree of paralogous gene or branch For the remaining markers, the ML trees contained evidence of recent monophyly was in of the non-ribosomal and of the ribosomal We manually removed the individual sequences that domain monophyly and tree inference (see Materials and These that of marker genes is important for deep phylogenetic analyses, particularly when using non-ribosomal of within-domain split scores for these markers (Supplementary file that markers that established relationships within each domain also supported a longer AB branch length (Figure the AB branch length inferred 
from a concatenation of the marker genes of recent from substitutions/site to substitutions/site consistent with the hypothesis that inter-domain reduce the overall estimate of AB branch length when included in Figure 2 with 1 see all Download asset Open asset The relationship between marker gene Archaea-Bacteria (AB) branch and functional (A) Vertically evolving phylogenetic markers have longer AB The the relationship between a for marker gene within-domain split score lower split score of established within-domain relationships, see Materials and and AB branch length (in expected number of for the marker genes. Marker genes with higher split scores split established monophyletic groups into multiple have shorter AB branch lengths R = scores of ribosomal and non-ribosomal markers were statistically Figure 1). (B) vertically evolving marker genes, ribosomal genes not have a longer AB branch length. The functional of markers against AB branch length using vertically evolving We did not a significant difference between AB branch lengths for ribosomal and non-ribosomal genes Wilcoxon rank-sum test). of AB branch lengths for ribosomal and non-ribosomal marker genes are similar universal marker sets many ribosomal proteins (Ciccarelli et al., 2006; Fournier and Gogarten, 2010; Harris et al., 2003; Hug et al., 2016; et al., 2021; Williams et al., 2020). If ribosomal proteins accelerated evolution the divergence of Archaea and Bacteria, this might lead to the inference of an long AB branch length (Petitjean et al., 2014; Zhu et al., 2019). To investigate we the inter-domain branch lengths for the 38 and 16 ribosomal and non-ribosomal genes, the marker genes set. 
We evidence that there was a longer AB branch with ribosomal markers than for other vertically evolving genes (Figure mean AB branch length for ribosomal proteins mean for non-ribosomal To investigate we compared AB branch lengths inferred from of the ribosomal and non-ribosomal of the vertically evolving genes 1). AB branch lengths from the ribosomal and non-ribosomal were similar with some support for a longer AB branch length from vertically evolving non-ribosomal genes. these data not support an accelerated evolutionary rate for ribosomal genes compared to other of genes on the AB branch. 1 Archaea-Bacteria (AB) branch lengths and AB branch lengths as a proportion of total tree length inferred from ribosomal and non-ribosomal are The data not support a evolutionary rate for ribosomal proteins on the AB branch compared to other of ancient AB branch tree branch length as a proportion of total tree marker marker Substitutional saturation and poor model contribute to underestimation of AB branch length For the 27 most vertically evolving genes as by within-domain split we performed an of single-gene tree inference and review to identify and the remaining sequences that had evidence of or represented paralogs. The resulting single-gene trees are in the Data To evaluate the relationship between evolutionary rate and AB branch we two sites sites with the of being in the rate and sites with the of being in the rate and compared relative branch lengths inferred from the concatenate using IQ-TREE 2 to infer site-specific rates (Figure