Bioinformatics Laboratory

Bacterial genomics

Comparative Analysis of the Core Proteomes among the Pseudomonas Major Evolutionary Groups Reveals Species-Specific Adaptations for Pseudomonas aeruginosa and Pseudomonas chlororaphis

The Pseudomonas genus includes many species living in diverse environments and hosts. It is important to understand which the major evolutionary groups are and which genomic/proteomic components they share or have uniquely. Towards this goal, we analyzed 494 complete Pseudomonas proteomes and identified 297 core-orthologues. The subsequent phylogenomic analysis revealed two well-defined species (Pseudomonas aeruginosa and Pseudomonas chlororaphis) and four wider phylogenetic groups (Pseudomonas fluorescens, Pseudomonas stutzeri, Pseudomonas syringae, Pseudomonas putida) with a sufficient number of proteomes. As expected, the genus-level core proteome was highly enriched for proteins involved in metabolism, translation, and transcription. In addition, 39–70% of the core proteins in each group had a significant presence in each of the other groups. Group-specific core proteins were also identified, with P. aeruginosa having the highest number of these and P. fluorescens having none. We identified several P. aeruginosa-specific core proteins (such as CntL, CntM, PlcB, Acp1, MucE, SrfA, Tse1, Tsi2, Tse3, and EsrC) that are known to play an important role in its pathogenicity. Finally, a holin family bacteriocin and a mitomycin-like biosynthetic protein were found to be core-specific for P. chlororaphis, and we hypothesize that these proteins may confer a competitive advantage against other root-colonizers.

Published in Diversity as a feature paper.

Graphical abstract

Figure 1. Workflow of the phylogenomic and core proteome analyses.

Figure 2. The phylogenomic neighbor-joining tree of the 494 complete proteomes. The tree was based on 198 core proteins (core set 1), using the Kimura model and 500 bootstrap replicates. The various genus core sets are indicated on the tree.

Figure 3. The Pseudomonas genus core proteome, depending on the sampling depth (i.e., number of proteomes analyzed).
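The dependence of the core proteome on sampling depth can be illustrated with a minimal sketch. It assumes each proteome is represented as a set of orthologous-group identifiers; the function name, the number of random replicates, and the toy data are all hypothetical and are not taken from the published workflow.

```python
import random
from statistics import mean


def core_size_by_depth(proteomes, replicates=20, seed=0):
    """Mean core-proteome size for random subsets of increasing size.

    proteomes: dict mapping a proteome name to a set of orthogroup ids
    (an assumed representation, used only for this illustration).
    """
    rng = random.Random(seed)
    names = sorted(proteomes)
    curve = {}
    for depth in range(1, len(names) + 1):
        sizes = []
        for _ in range(replicates):
            subset = rng.sample(names, depth)
            # The core is whatever every sampled proteome has in common.
            core = set.intersection(*(proteomes[n] for n in subset))
            sizes.append(len(core))
        curve[depth] = mean(sizes)
    return curve


# Toy data: three proteomes that share only orthogroup "A".
toy = {
    "p1": {"A", "B", "C"},
    "p2": {"A", "B", "D"},
    "p3": {"A", "C", "D"},
}
curve = core_size_by_depth(toy)
```

With real data, the curve would be expected to flatten as more proteomes are added, which is the behaviour the figure describes.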

Download supplementary materials here

A Comparative Analysis of the Core Proteomes within and among the Bacillus subtilis and Bacillus cereus Evolutionary Groups Reveals the Patterns of Lineage- and Species-Specific Adaptations

By integrating phylogenomic and comparative analyses of 1104 high-quality genome sequences, we identify the core proteins and the lineage-specific fingerprint proteins of the various evolutionary clusters (clades/groups/species) of the Bacillus genus. As fingerprints, we denote those core proteins of a certain lineage that are present only in that particular lineage and absent in any other Bacillus lineage. Thus, these lineage-specific fingerprints are expected to be involved in particular adaptations of that lineage. Intriguingly, with a few notable exceptions, the majority of the Bacillus species demonstrate a rather low number of species-specific fingerprints, with the majority of them being of unknown function. Therefore, species-specific adaptations are mostly attributed to highly unstable (in evolutionary terms) accessory proteomes and possibly to changes at the gene regulation level. A series of comparative analyses consistently demonstrated that the progenitor of the Cereus Clade underwent an extensive genomic expansion of chromosomal protein-coding genes. In addition, the majority (76–82%) of the B. subtilis proteins that are essential or play a significant role in sporulation have close homologs in most species of both the Subtilis and the Cereus Clades. Finally, the identification of lineage-specific fingerprints by this study may allow for the future development of highly specific vaccines, therapeutic molecules, or rapid and low-cost molecular tests for species identification.
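The fingerprint definition above maps naturally onto set operations. The following is a minimal sketch of the strict definition as stated in the abstract (core in one lineage, absent from every strain of every other lineage); the data layout, function names, and toy orthogroup identifiers are assumptions made for illustration, and the relaxed variant used in the figures is not covered.

```python
def core_proteins(strains):
    """Orthogroups present in every strain of one lineage.

    strains: dict mapping a strain name to a set of orthogroup ids.
    """
    sets = list(strains.values())
    return set.intersection(*sets) if sets else set()


def fingerprints(lineages):
    """Core orthogroups of each lineage that occur in no other lineage.

    lineages: dict mapping lineage name -> {strain name -> set of orthogroup ids}.
    """
    cores = {name: core_proteins(strains) for name, strains in lineages.items()}
    # Everything seen in at least one strain of a lineage (core plus accessory).
    pan = {name: set().union(*strains.values()) for name, strains in lineages.items()}
    result = {}
    for name, core in cores.items():
        elsewhere = set().union(*(pan[other] for other in pan if other != name))
        result[name] = core - elsewhere
    return result


# Toy example with hypothetical orthogroup ids.
lineages = {
    "subtilis": {"s1": {"OG1", "OG2", "OG3"}, "s2": {"OG1", "OG2", "OG4"}},
    "cereus": {"c1": {"OG1", "OG4", "OG5"}, "c2": {"OG1", "OG5"}},
}
fp = fingerprints(lineages)
```

In this toy case OG1 is core in both lineages and therefore not a fingerprint, while OG2 and OG5 are lineage-specific.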

Published in Microorganisms.

Figure 1. The phylogenomic maximum likelihood tree (IQ-Tree2) of the 1104 Bacillus proteomes. The tree was based on 114 core proteins and 20,041 variable sites, using the LG + I + F + G4 model and aLRT. For ease of visualization, the entire Subtilis and Cereus Clades are collapsed. Next to each leaf of the tree, the chromosome size, and the number of all chromosomally encoded proteins are given.

Figure 2. Phylogenomic ML tree (IQ-Tree2-Q.Plant + I + F + G4-aLRT) of the Subtilis Clade based on 457 core protein-orthologous groups from 634 proteomes. For ease of visualization, certain evolutionary clusters have been collapsed. The full tree is available as Supplementary Figure S2. Next to the species name, in parentheses, is the number of complete genomes that are available and, on their right, is the number of genomes used in the normalized dataset. Further to the right of the species names and at the common ancestor of a lineage, with blue and red colors we denote the number of core and relaxed/strict fingerprint proteins for each lineage (based on the normalized dataset).

Figure 3. Phylogenomic ML tree (IQ-Tree2-Q.Plant + I + F + G4-aLRT) of the Cereus Clade, based on 812 core protein-orthologous groups from 445 proteomes. For ease of visualization, certain evolutionary clusters have been collapsed. The full tree is available as Supplementary Figure S3. Next to the species name, in parentheses, is the number of complete genomes that are available and, on their right, is the number of genomes used in the normalized dataset. Further to the right of the species names and at the common ancestor of a lineage, we denote with blue and red colors the number of core and (relaxed/strict) fingerprint proteins for each lineage (based on the normalized dataset).

Figure 4. Boxplot of the total number of proteins (y-axis) for every available strain (dot—each genome) of a species (x-axis) and its normalized core proteome (green bar).

Figure 5. (A) The phylogenetic distribution of core proteins of the Subtilis Clade in the species of the Cereus Clade. (B) The phylogenetic distribution of core proteins of the Cereus Clade in the species of the Subtilis Clade. The bins on the x-axis correspond to the number of species (in the other Clade), while the y-axis corresponds to the absolute number of core proteins (for that bin). For example, the first graph of Figure 5A shows that 1072 of the core proteins of the Subtilis Clade are also present in 16–17 species of the Cereus Clade. The ratio of the low-presence to the high-presence bin is shown in the box at the top of the graph. Stars identify any ratio whose difference from the background (in all categories) is statistically significant (based on the hypergeometric test; p-value < 0.05).
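For readers unfamiliar with the hypergeometric test mentioned in the caption, here is a minimal sketch of a one-sided (upper-tail) p-value computed with the Python standard library. How the published analysis defined the population, the category, and the sample is not stated here, so the parameter names and the toy numbers are assumptions.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: total number of proteins (background)
    successes:  proteins in the background that belong to the category
    draws:      proteins in the bin being tested
    k:          proteins in that bin that belong to the category
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


# Toy numbers: 10 proteins, 4 in the category, 3 drawn, at least 2 in the category.
p_value = hypergeom_sf(2, 10, 4, 3)
```

A result below 0.05 would correspond to a starred ratio in the figure, under the assumptions above.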

Figure 6. The phylogenetic distribution pattern of: (A) 155 B. subtilis proteins important for sporulation; (B) 256 proteins that are essential in B. subtilis. Presence of a close homolog in a given species of the Subtilis and Cereus Clades was determined based on 50% amino acid identity over 50% of the protein’s length. The clustering of proteins (based on their distribution) was performed with average linkage and Euclidean distance, using the seaborn.clustermap function of the Python seaborn package. A more detailed view of the cluster-heatmaps (including the individual gene names and species) is available in Supplementary Figures S4 and S5. Each row corresponds to a gene and each column corresponds to a certain species. The color in the heatmap corresponds to the % presence (how many strains of the species) of that gene in that certain species.
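The homolog criterion and the per-species % presence described in the caption can be sketched as follows. This is an illustration only: it assumes that coverage is measured against the length of the B. subtilis query protein and that both thresholds are inclusive, neither of which is specified above, and the function names and hit tuples are hypothetical.

```python
def is_close_homolog(identity_pct, aligned_len, query_len,
                     min_identity=50.0, min_coverage=50.0):
    """True if a hit meets both the identity and the coverage threshold.

    Coverage is computed relative to the query length (an assumption).
    """
    coverage = 100.0 * aligned_len / query_len
    return identity_pct >= min_identity and coverage >= min_coverage


def percent_presence(hits_per_strain):
    """Percentage of strains of a species with at least one close homolog.

    hits_per_strain: dict mapping strain name to a list of
    (identity_pct, aligned_len, query_len) tuples for one query protein.
    """
    if not hits_per_strain:
        return 0.0
    present = sum(
        any(is_close_homolog(*hit) for hit in hits)
        for hits in hits_per_strain.values()
    )
    return 100.0 * present / len(hits_per_strain)


# Toy hits for one query protein of length 200 in four strains of one species.
species_hits = {
    "strain_a": [(80.0, 150, 200)],                   # passes both thresholds
    "strain_b": [(30.0, 200, 200)],                   # identity too low
    "strain_c": [],                                   # no hit at all
    "strain_d": [(60.0, 90, 200), (55.0, 120, 200)],  # second hit passes
}
presence = percent_presence(species_hits)
```

Each value computed this way would correspond to one cell of the heatmap (one gene in one species).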

Download supplementary materials here

A panoramic view of the genomic landscape of the genus Streptomyces

We delineate the evolutionary plasticity of the ecologically and biotechnologically important genus Streptomyces, by analysing the genomes of 213 species. Streptomycetes genomes demonstrate high levels of internal homology, whereas the genome of their last common ancestor was already complex. Importantly, we identify the species-specific fingerprint proteins that characterize each species. Even among closely related species, we observed high interspecies variability of chromosomal protein-coding genes, species-level core genes, accessory genes and fingerprints. Notably, secondary metabolite biosynthetic gene clusters (smBGCs), carbohydrate-active enzymes (CAZymes) and protein-coding genes bearing the rare TTA codon demonstrate high intraspecies and interspecies variability, which emphasizes the need for strain-specific genomic mining. Highly conserved genes, such as those specifying genus-level core proteins, tend to occur in the central region of the chromosome, whereas those encoding proteins with evolutionarily volatile species-level fingerprints, smBGCs, CAZymes and TTA-codon-bearing genes are often found towards the ends of the linear chromosome. Thus, the chromosomal arms emerge as the part of the genome that is mainly responsible for rapid adaptation at the species and strain level. Finally, we observed a moderate, but statistically significant, correlation between the total number of CAZymes and three categories of smBGCs (siderophores, e-Polylysin and type III lanthipeptides) that are related to competition among bacteria.

Accepted for publication in Microbial Genomics.

Figure 1. (a) The Streptomyces phylogenomic species tree is based on 213 high-quality Streptomyces genomes (one representative from each species) and five other Streptomycetaceae (as outgroups) and calculated from 318 core protein orthologous groups (78 205 amino acid sites) using the LG+I+F+G4 model in the IQ-Tree2 software. Next to the tree, on the first column, we show the various species names. On the second column is the FastANI cluster of the corresponding species. On the third column is the number of high-quality genomes in that species. The next six columns correspond to the average number of chromosomal proteins, the number of normalized core proteins, the average number of accessory proteins, the number of smBGCs in the complete genomes, the number of relaxed fingerprint proteins, and the number of smBGCs that are identified as fingerprints, for each species. All these species data are also summarized as a table in File S1, spreadsheet 5. (b) Violin plots that show the distribution of several genomic characteristics of the analysed Streptomyces species (from top to bottom): the chromosomally encoded proteins for the 213 species; the core proteins identified in each of the 61 species, with five genomes (complete and draft) each; the accessory proteins in each of the 61 species; the relaxed fingerprints for the 61 species; the maximum number of smBGCs in the high-quality genomes of each of the 213 Streptomyces species; intra-species smBGC average variation, based on Jaccard distance, of the 12 Streptomyces species that had five complete genomes each; the maximum number of CAZymes in the high-quality genomes of each of the 213 Streptomyces species; intra-species CAZyme average variation (based on Jaccard distance), of the 12 Streptomyces species that had five complete genomes each; the number of TTA-bearing proteins in the representative high-quality genomes of each of the 213 Streptomyces species; and intra-species TTA-bearing proteins average conservation variation (based on Jaccard distance), of the 12 Streptomyces species that had five complete genomes each. The dashed line within each violin plot represents the mean value.
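The intra-species variation based on Jaccard distance can be illustrated with a minimal sketch. It assumes that each genome is summarized as a set of feature identifiers (for example smBGC types) and that the species-level value is the mean over all genome pairs; both the representation and the averaging scheme are assumptions, and the names below are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_jaccard(genomes):
    """Average Jaccard distance over all pairs of genomes of one species.

    genomes: dict mapping genome name to a set of feature ids.
    """
    pairs = list(combinations(genomes.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Toy species with three genomes and hypothetical smBGC type labels.
species = {
    "genome_1": {"terpene", "siderophore"},
    "genome_2": {"siderophore", "lanthipeptide"},
    "genome_3": {"terpene", "siderophore"},
}
variation = mean_pairwise_jaccard(species)
```

A value near 0 indicates that the genomes of a species carry nearly the same set of features, while a value near 1 indicates almost no overlap.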

Figure 2. (a) Number of smBGCs in the 213 Streptomyces species. Only one high-quality genome with the highest number of smBGCs for each species was analysed with antiSMASH. To the left is the phylogenomic tree of the 213 species. The yellow horizontal bars represent the total number of chromosomally encoded proteins for that species representative. Next to it, the light blue horizontal bar represents the total number of smBGCs for that species representative. The smBGC heatmap only includes the most frequently found (present in at least 40% of species) smBGCs. Above each column is the type of smBGC and, in parentheses, the percentage of species in which it is present. (b) Number of carbohydrate-active enzymes (CAZymes) in the various species. The horizontal light blue bars represent the total number of CAZymes for that species representative. The CAZyme heatmap shows the abundance of each of the six CAZyme categories in each species. All these species data are also summarized in File S1, spreadsheet 5. The orange vertical bar to the right of the phylogenomic tree represents the Streptomyces lineage that has a significantly higher number of genes for these enzymes, compared to the other Streptomyces species. Red stars denote the 12 species that were used to calculate the intra-species heterogeneity in the number of smBGCs, CAZymes and the conservation of TTA-bearing proteins.

Figure 3. Presence of chromosomally encoded TTA-bearing ORFs in the 213 Streptomyces species (green bars). Only one high-quality genome for each species was analysed. To the immediate right of the barchart is a heatmap of the statistically significant enrichment of the different COG categories in TTA-bearing ORFs per species (fold-change on log 2 scale). Over-representation in a particular category is marked with red colour, while under-representation is marked with blue colour. Above the heatmap is the total number of species in which a certain category is found to be over/under-represented. To the far right is a presence/absence matrix of the 11 TTA-codons whose position is conserved in at least 50% of the species that have the orthogroup. Presence of TTA-bearing ORFs is shown in red, while absence is shown in white. Orthogroups of TTA-bearing ORFs that are not present in a certain species are marked black. All these species data are also summarized in File S1, spreadsheet 5.

Download supplementary materials here or access them in Figshare.

pyPGCF: a python software for phylogenomic analysis, species demarcation, identification of core and fingerprint proteins of bacterial genomes that are important for plants

This computational protocol describes how to use pyPGCF, a Python software package that runs in a Linux environment, in order to analyse bacterial genomes and perform i) phylogenomic analysis, ii) species demarcation, iii) identification of core proteins of the bacterial genus and the individual species, iv) identification of species-specific fingerprint proteins that are found in all strains of a species and at the same time are absent from all other species of the genus, v) functional annotation of the core and fingerprint proteins with eggNOG, vi) identification of secondary metabolite biosynthetic gene clusters (smBGCs) with antiSMASH. This software has already been used to analyse bacterial genera and species that are important for plants (e.g. Pseudomonas, Bacillus). In addition, we provide a test set and example commands on how to analyse 165 genomes from 55 species of the genus Bacillus. The main advantages of pyPGCF are that i) it uses adjustable orthology cutoffs, ii) it identifies species-specific fingerprints, iii) its computational cost scales linearly with the number of genomes being analysed. Therefore, pyPGCF is able to deal with a very large number of bacterial genomes in reasonable timescales.

Submitted to Methods in Molecular Biology.

Figure 1. Workflow of the software. A) operations of the species_demarcation module. B) operations of the orthologues module. C) operations of the core module. D) operations of the phylogenomic module. E) operations of the eggnog module. F) operations of the smbgc module.

Download explanatory videos here or access them in Figshare.

The Notable Achievements and the Prospects of Bacterial Pathogen Genomics

Throughout the entirety of human history, bacterial pathogens have played an important role and even shaped the fate of civilizations. The application of genomics within the last 27 years has radically changed the way we understand the biology and evolution of these pathogens. In this review, we discuss how the short- (Illumina) and long-read (PacBio, Oxford Nanopore) sequencing technologies have shaped the discipline of bacterial pathogen genomics, in terms of fundamental research (i.e., evolution of pathogenicity), forensics, food safety, and routine clinical microbiology. We have mined and discuss some of the most prominent data/bioinformatics resources such as NCBI pathogens, PATRIC, and Pathogenwatch. Based on this mining, we present some of the most popular sequencing technologies, hybrid approaches, assemblers, and annotation pipelines. A small number of bacterial pathogens are of very high importance, and we also present the wealth of the genomic data for these species (i.e., which ones they are, the number of antimicrobial resistance genes per genome, the number of virulence factors). Finally, we discuss how this discipline will probably be transformed in the near future, especially by transitioning into metagenome-assembled genomes (MAGs), thanks to long-read sequencing.

Published in Microorganisms.

Figure 1. Most frequently used sequencing platforms according to PATRIC, for bacterial pathogens, (A) used as single technology and (B) used in combinations (hybrid approaches).

Figure 2. The number of genomes in each bacterial taxonomic group of the NCBI pathogens. (A) The total number of genomes reported in each taxonomic group. (B) The number of complete genomes in each taxonomic group.

Figure 3. Most of the commonly used assemblers reported in the NCBI bacterial pathogens database as of March 2022.