Editor's evaluation: A computational screen for alternative genetic codes in over 250,000 genomes

Eugene V Koonin

doi:10.7554/elife.71402.sa0

Editor's evaluation: A computational screen for alternative genetic codes in over 250,000 genomes — Eugene V Koonin (2021) | RDL Network

Abstract


The genetic code has been proposed to be a ‘frozen accident,’ but the discovery of alternative genetic codes over the past four decades has shown that it can evolve to some degree. Since most examples were found anecdotally, it is difficult to draw general conclusions about the evolutionary trajectories of codon reassignment and why some codons are affected more frequently. To fill in the diversity of genetic codes, we developed Codetta, a computational method to predict the amino acid decoding of each codon from nucleotide sequence data. We surveyed the genetic code usage of over 250,000 bacterial and archaeal genome sequences in GenBank and discovered five new reassignments of arginine codons (AGG, CGA, and CGG), representing the first sense codon changes in bacteria. In a clade of uncultivated Bacilli, the reassignment of AGG to become the dominant methionine codon likely evolved by a change in the amino acid charging of an arginine tRNA. The reassignments of CGA and/or CGG were found in genomes with low GC content, an evolutionary force that likely helped drive these codons to low frequency and enable their reassignment.

Editor's evaluation

This work is a substantial contribution to the important and fascinating field of genetic code diversification.

https://doi.org/10.7554/eLife.71402.sa0

eLife digest

All life forms rely on a ‘code’ to translate their genetic information into proteins. This code relies on limited permutations of three nucleotides – the building blocks that form DNA and other types of genetic information. Each ‘triplet’ of nucleotides – or codon – encodes a specific amino acid, the basic component of proteins.
Reading the sequence of codons in the right order will let the cell know which amino acid to assemble next on a growing protein. For instance, the codon CGG – formed of the nucleotides guanine (G) and cytosine (C) – codes for the amino acid arginine.

From bacteria to humans, most life forms rely on the same genetic code. Yet certain organisms have evolved to use slightly different codes, where one or several codons have an altered meaning.

To better understand how alternative genetic codes have evolved, Shulgina and Eddy set out to find more organisms featuring these altered codons, creating new software called Codetta that can analyze the genome of a microorganism and predict the genetic code it uses. Codetta was then used to sift through the genetic information of 250,000 microorganisms. This was made possible by the sequencing, in recent years, of the genomes of hundreds of thousands of bacteria and other microorganisms – including many never studied before.

These analyses revealed five groups of bacteria with alternative genetic codes, all of which had changes in the codons that code for arginine. Amongst these, four had genomes with a low proportion of guanine and cytosine nucleotides. This may have made some guanine and cytosine-rich arginine codons very rare in these organisms and, therefore, easier to be reassigned to encode another amino acid.

The work by Shulgina and Eddy demonstrates that Codetta is a new, useful tool that scientists can use to understand how genetic codes evolve. In addition, it can also help to ensure the accuracy of widely used protein databases, which assume which genetic code organisms use to predict protein sequences from their genomes.

Introduction

The genetic code defines how mRNA sequences are decoded into proteins. The ancient origin of the standard genetic code is reflected in its near-universal usage, once proposed to be a ‘frozen accident’ that is too integral to the translation of all proteins to change (Crick, 1968).
However, the discovery of alternative genetic codes in over 30 different lineages of bacteria, eukaryotes, and mitochondria over the past four decades has made it clear that the genetic code is capable of evolving to some degree (Knight et al., 2001a; Kollmar and Mühlhausen, 2017). The first alternative genetic codes were discovered by comparing newly sequenced genomes to amino acid sequences obtained by direct protein sequencing. Nonstandard codon translations were found this way in human mitochondria (Barrell et al., 1979), Candida yeasts (Kawaguchi et al., 1989), green algae (Schneider et al., 1989), and Euplotes ciliates (Meyer et al., 1991). Some reassignments of stop codons to amino acids were detected from DNA sequence alone, based on the appearance of in-frame stop codons in critical genes (Yamao et al., 1985; Caron and Meyer, 1985; Cupples and Pearlman, 1986; Keeling and Doolittle, 1996; McCutcheon et al., 2009; Campbell et al., 2013; Záhonová et al., 2016). As DNA sequence data have accumulated faster than direct protein sequences, computational methods have been developed to predict the genetic code from DNA sequence. The core principle of most methods is to align genomic coding regions to homologous sequences in other organisms (creating multiple sequence alignments) and then to tally the most frequent amino acid aligned to each of the 64 codons. This approach led to the discovery of new genetic codes in screens of ciliates (Swart et al., 2016; Heaphy et al., 2016), yeasts (Riley et al., 2016; Krassowski et al., 2018), green algal mitochondria (Noutahi et al., 2019; Žihala and Eliáš, 2019), invertebrate mitochondria (Telford et al., 2000; Abascal et al., 2006a; Li et al., 2019), and stop codon reassignments in metagenomic data (Ivanova et al., 2014) and the development of software for specific phylogenetic groups (Abascal et al., 2006b; Mühlhausen and Kollmar, 2014; Noutahi et al., 2017). 
Some approaches, such as FACIL (Dutilh et al., 2011), have expanded phylogenetic breadth by using profile hidden Markov model (HMM) representations of conserved proteins from phylogenetically diverse databases such as Pfam (El-Gebali et al., 2019). However, a systematic survey of genetic code usage across the tree of life has not yet been possible. Existing methods are generally either (1) phylogenetically restricted to clades where multiple sequence alignments can be built for a predetermined set of proteins or (2) lacking sufficiently robust and objective statistical footing to enable a large-scale screen with high accuracy. A potentially incomplete set of alternative genetic codes limits our ability to understand the evolutionary processes behind codon reassignment. One open question is why some codon reassignments reappear independently. Reassignment of the stop codons UAA and UAG to glutamine is the most common change in eukaryotic nuclear genomes, appearing at least five independent times (Schneider et al., 1989; Keeling and Doolittle, 1996; Keeling and Leander, 2003; Karpov et al., 2013; Swart et al., 2016). In bacteria, all of the known changes reassign the stop codon UGA to either glycine in the Absconditabacteria and Gracilibacteria (Campbell et al., 2013; Rinke et al., 2013) or tryptophan in the Mycoplasmatales, Entomoplasmatales (Bové, 1993), and several insect endosymbiotic bacteria (McCutcheon et al., 2009; McCutcheon and Moran, 2010; Bennett and Moran, 2013; Salem et al., 2017). These recurring changes may reflect constraints imposed by the existing translational machinery. The mechanism of codon reassignment may involve changes to tRNA anticodons or tRNA wobble nucleotide modifications (which together dictate anticodon-codon pairing), aminoacyl-tRNA synthetase recognition of cognate tRNAs, release factor binding of stop codons, among others, each of which may bias which reassignments are easier to evolve. 
However, without a complete picture of genetic code diversity, it is hard to disentangle patterns of codon reassignment from observation bias. For instance, in-frame stop codons caused by a stop codon reassignment may be more easily detectable than a subtle change in amino acid conservation indicative of a sense codon reassignment. Another open question is how a new codon meaning can evolve without disrupting the translation of most proteins. Reassigning a codon leads to the incorporation of the incorrect amino acid at all preexisting codon positions (Crick, 1968). Three evolutionary models differ in the pressure that drives substitutions to remove the codon from positions that cannot tolerate the new translation. In the ‘codon capture’ model, the codon is first driven to near extinction by pressures unrelated to reassignment, such as biased genomic GC content or genome reduction, which then minimizes the impact of reassignment on protein translation (Osawa and Jukes, 1989). This model was first proposed for the reassignment of the stop codon UGA to tryptophan in Mycoplasma capricolum, whose low genomic GC content (25% GC) in combination with small genome size (1 Mb) was thought to have driven the stop codon UGA to extremely low usage in favor of UAA and allowed ‘capture’ of UGA by a tryptophan tRNA (Bové, 1993; Osawa and Jukes, 1989). For larger nuclear genomes, other models have been proposed where codon usage changes occur concurrently with, and are driven by, changes in decoding capability. In the ‘ambiguous intermediate’ model, a codon is decoded stochastically as two different meanings in an intermediate step of codon reassignment, and this translational pressure induces codon substitutions at positions where ambiguity is deleterious (Schultz and Yarus, 1994; Massey et al., 2003). 
Extant examples of ambiguous translation support the plausibility of this model, such as yeasts that translate the codon CUG as both leucine and serine by stochastic tRNA charging (Gomes et al., 2007) or by competing tRNA species (Mühlhausen et al., 2018). Alternatively, the ‘tRNA loss-driven reassignment’ model proposes an intermediate stage where a codon cannot be translated efficiently, perhaps due to tRNA gene loss or mutation, creating pressure for synonymous substitutions specifically away from that codon, allowing it to be captured later by a different tRNA (Mühlhausen et al., 2016; Sengupta and Higgs, 2005). These three models are not mutually exclusive, and substitutions at the reassigned codon can occur due to a combination of these pressures.

Here, we describe Codetta, a computational method for predicting the genetic code that can scale to analyze thousands of genomes. We perform the first survey of genetic code usage in all bacterial and archaeal genomes, reidentifying all known codes in the dataset and discovering the first examples of sense codon changes in bacteria. All five reassignments affect arginine codons (AGG, CGA, and CGG) and provide clues to help us understand how alternative genetic codes evolve.

Results

Codetta: A computational method to infer the genetic code

We developed Codetta, a computational method that takes DNA or RNA sequences from a single organism and predicts an amino acid translation for each of the 64 codons. Codetta can analyze sequences from all domains of life, including bacteria, archaea, eukaryotes, organelles, and viruses; its ability to confidently predict codon decodings depends on having protein-coding regions with recognizable homology. The general idea is to align the input nucleotide sequence to probabilistic profiles of conserved protein domains in order to obtain, for each of the 64 codons, a set of profile positions aligned to that codon.
Each profile position has 20 probabilities describing the expected amino acid. For each of the 64 codons, we aggregate over the set of aligned profile positions to infer the single most likely amino acid decoding of the codon. Most previous approaches for genetic code prediction use the same basic idea (Telford et al., 2000; Abascal et al., 2006b; Dutilh et al., 2011; Mühlhausen and Kollmar, 2014; Swart et al., 2016; Heaphy et al., 2016; Riley et al., 2016; Krassowski et al., 2018; Noutahi et al., 2019), typically aligning the input sequence to multiple sequence alignments and using a simple rule to select the best amino acid for each codon. With Codetta, we extend this idea to systematic high-throughput analysis by using a probabilistic modeling approach to infer codon decodings and by taking advantage of the large collection of probabilistic profiles of conserved protein domains (profile HMMs) in the Pfam database (El-Gebali et al., 2019). Profile HMMs are built from multiple sequence alignments, and the emission probabilities at each consensus column are estimates of the expected amino acid frequencies. The Pfam database contains over 17,000 profile HMMs of conserved protein domains from all three domains of life, which are expected to align to about 50% of coding regions in a genome (El-Gebali et al., 2019). We align Pfam profile HMMs to a six-frame standard genetic code translation of the input DNA/RNA sequence using the HMMER hmmscan program (Eddy, 2011; Figure 1A). Since we rely on a preliminary standard code translation, conserved protein domains could fail to align in organisms using radically different genetic codes. 
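The preliminary six-frame translation mentioned above can be illustrated with a short Python sketch. This is not Codetta's own code: the function names (`translate_frame`, `six_frame`) are hypothetical, and the only behaviors taken from the article are the use of the standard genetic code, the six reading frames, and the rendering of canonical stop codons as ‘X’ (as described in the Figure 1 caption) so that the aligner treats them as an unknown amino acid.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three canonical stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_frame(seq, stop_symbol="X"):
    """Translate one reading frame; stops and unrecognized codons become 'X'."""
    residues = []
    for i in range(0, len(seq) - 2, 3):
        aa = STANDARD.get(seq[i:i + 3], "X")  # codons with N etc. map to 'X'
        residues.append(stop_symbol if aa == "*" else aa)
    return "".join(residues)


def six_frame(nucleotides):
    """Return the six translations: three forward frames, then three reverse."""
    dna = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate_frame(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]
```

For example, `six_frame("ATGTGGTGA")[0]` yields `"MWX"`: the in-frame TGA is not treated as a terminator but is passed through as an unknown residue, which is what lets a protein domain profile align across a reassigned stop codon.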
In the set of statistically significant hmmscan alignments (E-value $< 10^{-10}$), we make the simplifying approximation of considering each aligned consensus column independently, so the alignments are viewed as a set of pairwise associations between a codon $Z$ (64 possibilities) and a consensus column of a Pfam domain profile (denoted $C$, an index identifying a Pfam consensus column).

Figure 1. Schematic of the genetic code inference method implemented in Codetta. (A) A fragment of the Mycoplasma capricolum genome is used to demonstrate alignment of a Pfam domain (PAD_porph) to a preliminary standard code translation of the input DNA sequence (one of six frames shown). All canonical stop codons, including UGA (TGA in genome sequence, reassigned to tryptophan in M. capricolum), are translated as ‘X’ in the preliminary standard code translation, which hmmscan (the program used to align Pfam domains) treats as an unknown amino acid. Each consensus column in the PAD_porph domain has a characteristic emission probability for each of the 20 canonical amino acids, represented by a heatmap. (B) Pfam consensus columns aligning to UGA codons across the entire genome comprise the $\vec{C}^Z$ set for UGA ($N = 452$ Pfam consensus columns). The Pfam emission probabilities $P(A \mid C_i^Z)$ for all 452 aligned consensus columns are used to compute the decoding probabilities $P(M \mid \vec{C}^Z)$. The most likely amino acid translation of UGA is inferred to be tryptophan, with decoding probability greater than the cutoff of 0.9999.

From these data, we infer each of the 64 codons one at a time (Figure 1B). For a codon $Z$ (e.g., UGA), the observed data $\vec{C}^Z$ are a set of $N$ consensus columns $C_i^Z$ ($i = 1 \ldots N$) that associate to $Z$ in the provisional alignments. We model the main data-generative process abstractly, imagining that each column $C_i^Z$ was drawn from the pool of all possible consensus columns by codon $Z$, which is translated as an unknown amino acid $A$.
Each column has an affinity for codon $Z$ proportional to the column’s emission probability for the amino acid $A$, $P(A \mid C)$. A consensus column strongly conserved for a particular amino acid $A$ will tend to only associate with codons that translate to $A$; moreover, consensus columns weakly conserved for $A$ may also associate with probability proportional to their conservation for $A$. Thus, this abstract-matching process generates an observed $C_i^Z$ column association with the codon $Z$ (translated as amino acid $A$) with probability

$$P(C_i^Z \mid A) = \frac{P(A \mid C_i^Z)\, P(C_i^Z)}{P(A)}.$$

Here, $P(A \mid C_i^Z)$ is the emission probability for amino acid $A$ at the Pfam consensus column $C_i^Z$. $P(A)$ is the average emission probability for amino acid $A$ over the pool of all possible consensus columns $C$, which we take to be all columns aligned to the target genome in order to better reflect genome-specific biases in amino acid usage.

Given the data $\vec{C}^Z$ and this abstract generative model, we infer the most likely decoding $M$ for codon $Z$ out of 21 possibilities $M \in \{\mathrm{Ala}, \mathrm{Cys}, \ldots, \mathrm{Tyr}, ?\}$ (Figure 1B). The $M = ?$ model of nonspecific translation draws columns randomly and serves to catch codons that do not encode a specific amino acid, such as stop codons and ambiguously translated codons. For a given decoding $M$, the probability of the observed columns $\vec{C}^Z$ is then

$$P(\vec{C}^Z \mid M) = \begin{cases} \displaystyle\prod_{i=1}^{N} \frac{P(A = M \mid C_i^Z)\, P(C_i^Z)}{P(A = M)} & \text{if } M \in \{\mathrm{Ala}, \mathrm{Cys}, \ldots, \mathrm{Tyr}\} \\[2ex] \displaystyle\prod_{i=1}^{N} P(C_i^Z) & \text{if } M = \,? \end{cases}$$

Setting the prior probability of each decoding, $P(M)$, to be uniform, we compute the probability of the decoding $M$ as

$$P(M \mid \vec{C}^Z) = \frac{P(\vec{C}^Z \mid M)}{\sum_{M'} P(\vec{C}^Z \mid M')}.$$

We assign an amino acid translation to a codon if it attains a decoding probability above some threshold (typically 0.9999). We assign a ‘?’ if no amino acid decoding satisfies the probability threshold (including the case where ‘?’ itself has high probability).
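The decoding probability defined above can be sketched in a few lines of Python. This is an illustrative reading of the equations rather than Codetta's implementation: the function names and the dictionary-based input format are assumptions, and emission and background probabilities are assumed to be strictly positive. Because the $P(C_i^Z)$ factors appear in all 21 models, they cancel in the posterior, so the sketch works with log ratios of emission to background probability.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def decoding_probabilities(columns, background):
    """Posterior over the 21 decodings of one codon under a uniform prior.

    columns:    list of dicts, one per aligned consensus column, mapping each
                amino acid letter to its emission probability P(A | C_i).
    background: dict mapping each amino acid letter to P(A), the average
                emission probability over all aligned columns.
    """
    # Log-likelihood of each model relative to the '?' model, whose
    # relative log-likelihood is therefore 0.
    log_lik = {"?": 0.0}
    for aa in AMINO_ACIDS:
        log_lik[aa] = sum(math.log(col[aa]) - math.log(background[aa])
                          for col in columns)
    # Subtract the maximum before exponentiating to avoid overflow.
    top = max(log_lik.values())
    weights = {m: math.exp(v - top) for m, v in log_lik.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def call_codon(columns, background, threshold=0.9999):
    """Return the inferred amino acid, or '?' if none exceeds the threshold."""
    posterior = decoding_probabilities(columns, background)
    best = max(posterior, key=posterior.get)
    if best != "?" and posterior[best] > threshold:
        return best
    return "?"
```

Working in log space matters in practice: with hundreds of aligned columns (452 for UGA in the Figure 1 example), the raw products would underflow double precision.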
A ‘?’ assignment tends to happen if the codon is rare, with few aligned Pfam consensus columns on which to base the inference, or if the codon is ambiguously translated such that no single amino acid model reaches high probability. Because we do not model stop codons explicitly, we expect ‘?’ to be the inferred meaning since stop codons ideally would have few or no aligned Pfam consensus columns.

To assess how many columns in $\vec{C}^Z$ are needed for reliable codon assignment, we constructed synthetic $\vec{C}^Z$ datasets ranging from 1 to 500 consensus columns by subsampling the Pfam consensus columns aligned to each of the 61 sense codons in the Escherichia coli genome. We calculated the per-codon error rate (fraction of samples predicting the incorrect amino acid) and the per-codon power (fraction of samples predicting the correct amino acid) using a probability threshold of 0.9999 (Figure 1-figure supplement 1). Lack of an amino acid inference (‘?’) was considered neither an error nor a correct prediction. Per-codon error rates were <0.00002 for all sizes of $\vec{C}^Z$. Depending on the codon, we found that about 8–34 aligned consensus columns suffice for >95% power to infer the correct amino acid. Accuracy may differ in real genomes for various biological reasons, but these results gave us confidence to proceed.

Genetic code prediction of 462 yeast species confirms known distributions of CUG reassignment

We further validated Codetta on the budding yeasts (Saccharomycetes, 462 sequenced species) that vary in their translation of CUG as either serine, leucine, or alanine depending on the species (Mühlhausen et al., 2016; Krassowski et al., 2018; Mühlhausen et al., 2018). In some CUG-Ser clade species, such as Candida albicans, CUG codons are stochastically decoded as a mix of serine (97%) and leucine (3%) because the CUG-decoding tRNACAG is aminoacylated by both the seryl- and leucyl-tRNA synthetases (Suzuki et al., 1997; Gomes et al., 2007).
Codetta is not designed to predict ambiguous decoding and is expected to assign either the dominant amino acid or a ‘?’ in cases like C. albicans. For 453 species, the predicted CUG translation was consistent with the known phylogenetic distribution of CUG reassignments (Figure 2A). This includes C. albicans, which was predicted to use the predominant serine translation (Gomes et al., 2007). For the remaining nine species, Codetta did not put a high probability on any amino acid decoding of CUG (inferred meaning of ‘?’). Two of these species – Babjeviella inositovora and Cephaloascus fragrans – are basal members of the CUG-Ser clade. Both of these genomes contain a CUG-decoding tRNACAG gene with features of serine identity (see Materials and methods) and B. inositovora has previously been shown to translate CUG codons primarily as serine by whole proteome mass spectrometry (Krassowski et al., 2018; Mühlhausen et al., 2018), that CUG is decoded as serine in these Codetta did not infer an amino acid for CUG because the aligned Pfam consensus columns were not conserved for a single amino acid (Figure supplement 1). Figure with 1 supplement see all Download asset Open asset of CUG reassignments in (A) CUG translation inferred by Codetta of 462 species, by phylogenetic clade. was from et al., of clade is the (B) Codetta CUG inference and of tRNACAG genes in and genomes. tRNACAG genes were using and were as or based on the of tRNA identity (C) to and charging of leucine and serine tRNACAG genes in of the two tRNACAG are shown with features used for in In the tRNA in of the target tRNA and the most other tRNA as by sequence with the were used as for In the tRNA charging a was used to help the expected sizes for and of the tRNA. 
Figure data 1 of all yeast genomes with phylogenetic and Codetta CUG Download Figure data of Download Figure data of in DNA Download The other species without an inferred amino acid for CUG all to the and species in these clades were predicted to translate CUG as of tRNA genes revealed that out of species in this clade encode two types of tRNACAG one predicted to be and one that CUG may be ambiguously translated as both serine and leucine competing in some of these species (Figure We used to the of both tRNACAG genes in some of these species a of not but could reliable of both and tRNACAG genes only in the serine tRNACAG could be detected in other species) (Figure To both are we acid that aminoacylated and but not the amino acid. We found that both serine and leucine tRNACAG are in (Figure likely in the translation of CUG codons. CUG is translated ambiguously in this it would why Codetta did not a high probability on any single amino acid decoding for some The of serine and leucine tRNACAG genes in some and yeasts was by Krassowski et al., and Mühlhausen et al., we were translation of CUG was in (Mühlhausen et al., for only of the serine tRNACAG could be detected (Krassowski et al., and incorporation of serine at protein positions by CUG (Mühlhausen et al., 2018). In to these we used a where the leucine tRNACAG to be more we did not the of the two tRNACAG in a of the in Figure that the of the leucine tRNACAG is at least times than the serine tRNACAG in the These results that Codetta can infer canonical and codon translations and can such as ambiguous translation it translation. All of the remaining codons were inferred to use the expected translation in all species, with the In three species to a of with low genomic GC content et al., 2019), the arginine codons and/or CGG had a ‘?’ inference due to few aligned Pfam consensus columns due to rare usage of codons. 
In other species, either the stop codon UAG or UGA was inferred to code for tryptophan due to some aligned Pfam consensus columns. We could not find any nuclear tRNA and we that these are due to the alignment of Pfam domains to in-frame stop codons in stop codons do not randomly but are most likely to from from certain codons as the tryptophan screen of bacterial and archaeal genomes previously known alternative genetic codes To the diversity of genetic codes in bacterial and archaeal genomes, we used Codetta to analyze genomes from including and from and metagenomic of our analysis 1 and are shown for a of the to the of sequenced organisms by a single for each bacteria, Results for the dataset and the are in 1 A of all bacterial clades previously known to use a codon reassignment. For each the shown most to the known phylogenetic distribution from the For each codon reassignment, we the of sequenced species by Codetta and how many were inferred to use the expected amino acid or had no inferred amino acid. of the species to reassigned clades were predicted to use an amino acid at the reassigned codon. et al., McCutcheon et al., Bennett and Moran, McCutcheon and Moran, Salem et al., Rinke et al., and Campbell et al., amino and A of codon from the bacterial and archaeal genomes by Codetta, to one The Codetta inference for each codon is a genetic code by the known bacterial genetic codes in 1 over the stop codons are with sense codons. can be calculated from (N codons N amino amino (N codons N To see if our screen known alternative genetic codes, we a of all bacterial and archaeal clades known to use alternative genetic codes and it over the all remaining organisms with the standard genetic code. This in a genetic code for each For most species using known alternative genetic codes in our our at the reassigned codon with the expected amino acid translation 1). 
were no of reassigned codons predicted to translate as an amino acid, but were a few cases of reassigned UGA codons that had no amino acid meaning inferred Since the codons could a of by Codetta, we more at these In the and which are to translate the canonical stop codon UGA as tryptophan (Bové, 1993), species had no inferred amino acid meaning for UGA due to than four aligned Pfam consensus columns. All of these genomes a tRNA gene and all but one contain a release factor gene (which translation at of these species are in the et al., a of over bacterial and archaeal genomes, where are into a different order order We at least five perhaps all as a in the and we that UGA is a stop codon in these In the which are to translate the
