Pan-genome

File:Example of Anvi’o software output.png genomes made with Anvi'o {{cite journal | vauthors = Eren AM, Kiefl E, Shaiber A, Veseli I, Miller SE, Schechter MS, Fink I, Pan JN, Yousef M, Fogarty EC, Trigodet F, Watson AR, Esen ÖC, Moore RM, Clayssen Q, Lee MD, Kivenson V, Graham ED, Merrill BD, Karkman A, Blankenberg D, Eppley JM, Sjödin A, Scott JJ, Vázquez-Campos X, McKay LJ, McDaniel EA, Stevens SL, Anderson RE, Fuessel J, Fernandez-Guerra A, Maignien L, Delmont TO, Willis AD | display-authors = 6 | title = Community-led, integrated, reproducible multi-omics with anvi'o | journal = Nature Microbiology | volume = 6 | issue = 1 | pages = 3–6 | date = January 2021 | pmid = 33349678 | doi = 10.1038/s41564-020-00834-3 | pmc = 8116326 }} software whose development is led by A. Murat Eren. Genomes obtained from Tettelin et al. (2005).{{Cite journal|last1=Tettelin|first1=Hervé|last2=Masignani|first2=Vega|last3=Cieslewicz|first3=Michael J.|last4=Donati|first4=Claudio|last5=Medini|first5=Duccio|last6=Ward|first6=Naomi L.|last7=Angiuoli|first7=Samuel V.|last8=Crabtree|first8=Jonathan|last9=Jones|first9=Amanda L.|last10=Durkin|first10=A. Scott|last11=DeBoy|first11=Robert T.|date=2005-09-27|title=Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial "pan-genome"|journal=Proceedings of the National Academy of Sciences|language=en|volume=102|issue=39|pages=13950–13955|doi=10.1073/pnas.0506758102|issn=0027-8424|pmid=16172379|pmc=1216834|bibcode=2005PNAS..10213950T|doi-access=free}} Each circle corresponds to one genome and each radius represents a gene family. The core genome families are located at the bottom and at the right. Some families in the core may have more than one homologous gene per genome. The shell genome is seen in the middle, at the left of the figure. Families from the dispensable genome and singletons are shown at the top left.

In the fields of molecular biology and genetics, a pan-genome (pangenome or supragenome) is the entire set of genes from all strains within a clade. More generally, it is the union of all the genomes of a clade.{{cite journal | vauthors = Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R | title = The microbial pan-genome | journal = Current Opinion in Genetics & Development | volume = 15 | issue = 6 | pages = 589–94 | date = December 2005 | pmid = 16185861 | doi = 10.1016/j.gde.2005.09.006 | author-link5 = Rino Rappuoli }}{{cite journal | vauthors = Vernikos G, Medini D, Riley DR, Tettelin H | title = Ten years of pan-genome analyses | journal = Current Opinion in Microbiology | volume = 23 | pages = 148–54 | date = February 2015 | pmid = 25483351 | doi = 10.1016/j.mib.2014.11.016 }}{{cite journal | vauthors = Marroni F, Pinosio S, Morgante M | title = Structural variation and genome complexity: is dispensable really dispensable? | journal = Current Opinion in Plant Biology | volume = 18 | pages = 31–36 | date = April 2014 | pmid = 24548794 | doi = 10.1016/j.pbi.2014.01.003 | bibcode = 2014COPB...18...31M }} The pan-genome can be broken down into a "core pangenome" that contains genes present in all individuals, a "shell pangenome" that contains genes present in two or more strains, and a "cloud pangenome" that contains genes only found in a single strain.{{cite journal | vauthors = Wolf YI, Makarova KS, Yutin N, Koonin EV | title = Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer | journal = Biology Direct | volume = 7 | pages = 46 | date = December 2012 | pmid = 23241446 | pmc = 3534625 | doi = 10.1186/1745-6150-7-46 | doi-access = free }} Some authors also refer to the cloud genome as "accessory genome" containing 'dispensable' genes present in a subset of the strains and strain-specific genes. 
The use of the term 'dispensable' has been questioned, at least for plant genomes, because accessory genes play "an important role in genome evolution and in the complex interplay between the genome and the environment". The field of study of pangenomes is called pangenomics.

The genetic repertoire of a bacterial species is much larger than the gene content of an individual strain.{{cite journal |vauthors=Mira A, Martín-Cuadrado AB, D'Auria G, Rodríguez-Valera F |date=2010|title=The bacterial pan-genome:a new paradigm in microbiology |url=https://pubmed.ncbi.nlm.nih.gov/20890839/ |journal= Int Microbiol|volume=13 |issue=2|pages=45–57 |doi=10.2436/20.1501.01.110 | pmid=20890839}} Some species have open (or extensive) pangenomes, while others have closed pangenomes. For species with a closed pan-genome, very few genes are added per sequenced genome (after sequencing many strains), and the size of the full pangenome can be theoretically predicted. Species with an open pangenome have enough genes added per additional sequenced genome that predicting the size of the full pangenome is impossible. Population size and niche versatility have been suggested as the most influential factors in determining pan-genome size.

Pangenomes were originally constructed for species of bacteria and archaea, but more recently eukaryotic pan-genomes have been developed, particularly for plant species. Plant studies have shown that pan-genome dynamics are linked to transposable elements.{{cite journal | vauthors = Morgante M, De Paoli E, Radovic S | title = Transposable elements and the plant pan-genomes | journal = Current Opinion in Plant Biology | volume = 10 | issue = 2 | pages = 149–55 | date = April 2007 | pmid = 17300983 | doi = 10.1016/j.pbi.2007.02.001 | bibcode = 2007COPB...10..149M }}{{cite journal |vauthors=Gordon SP, Contreras-Moreira B, Woods DP, Des Marais DL, Burgess D, Shu S, Stritt C, Roulin AC, Schackwitz W, Tyler L, Martin J, Lipzen A, Dochy N, Phillips J, Barry K, Geuten K, Budak H, Juenger TE, Amasino R, Caicedo AL, Goodstein D, Davidson P, Mur LA, Figueroa M, Freeling M, Catalan P, Vogel JP | display-authors = 6 | title = Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure | journal = Nature Communications | volume = 8 | issue = 1 | pages = 2184 | date = December 2017 | pmid = 29259172 | pmc = 5736591 | doi = 10.1038/s41467-017-02292-8 | bibcode = 2017NatCo...8.2184G }}{{cite journal | vauthors = Gordon SP, Contreras-Moreira B, Levy JH, Djamei A, Czedik-Eysenberg A, Tartaglio VS, Session A, Martin J, Cartwright A, Katz A, Singan VR, Goltsman E, Barry K, Dinh-Thi VH, Chalhoub B, Diaz-Perez A, Sancho R, Lusinska J, Wolny E, Nibau C, Doonan JH, Mur LA, Plott C, Jenkins J, Hazen SP, Lee SJ, Shu S, Goodstein D, Rokhsar D, Schmutz J, Hasterok R, Catalan P, Vogel JP | display-authors = 6 | title = Gradual polyploid genome evolution revealed by pan-genomic analysis of Brachypodium hybridum and its diploid progenitors | journal = Nature Communications | volume = 11 | pages = 3670 | date = July 2020 | issue = 1 | pmid = 32728126 | doi = 10.1038/s41467-020-17302-5 | pmc = 7391716 | bibcode = 2020NatCo..11.3670G | doi-access = 
free }}{{cite journal | vauthors = Contreras-Moreira B, Cantalapiedra CP, García-Pereira MJ, Gordon SP, Vogel JP, Igartua E, Casas AM, Vinuesa P | display-authors = 6 | title = Analysis of Plant Pan-Genomes and Transcriptomes with GET_HOMOLOGUES-EST, a Clustering Solution for Sequences of the Same Species | journal = Frontiers in Plant Science | volume = 8 | pages = 184 | date = February 2017 | pmid = 28261241 | pmc = 5306281 | doi = 10.3389/fpls.2017.00184 | doi-access = free }} The significance of the pan-genome arises in an evolutionary context, especially with relevance to metagenomics,{{cite journal | vauthors = Reno ML, Held NL, Fields CJ, Burke PV, Whitaker RJ | title = Biogeography of the Sulfolobus islandicus pan-genome | journal = Proceedings of the National Academy of Sciences of the United States of America | volume = 106 | issue = 21 | pages = 8605–10 | date = May 2009 | pmid = 19435847 | pmc = 2689034 | doi = 10.1073/pnas.0808945106 | bibcode = 2009PNAS..106.8605R | doi-access = free }} but is also used in a broader genomics context.{{cite journal | vauthors = Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL | title = De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae | journal = Genome Research | volume = 19 | issue = 2 | pages = 294–305 | date = February 2009 | pmid = 19015323 | pmc = 2652211 | doi = 10.1101/gr.083311.108 }} An open access book reviewing the pangenome concept and its implications, edited by Tettelin and Medini, was published in the spring of 2020.{{cite book | vauthors = Tettelin H, Medini D |date=2020| veditors = Tettelin H, Medini D |title=The Pangenome|language=en-gb|doi=10.1007/978-3-030-38281-0|pmid=32633908|isbn=978-3-030-38280-3|url=http://library.oapen.org/bitstream/20.500.12657/37707/1/2020_Book_ThePangenome.pdf|s2cid=217167361}}

Etymology

The term 'pangenome' was defined with its current meaning by Tettelin et al. in 2005; it derives 'pan' from the Greek word παν, meaning 'whole' or 'everything', while the genome is a commonly used term to describe an organism's complete genetic material. Tettelin et al. applied the term specifically to bacteria, whose pangenome "includes a core genome containing genes present in all strains and a dispensable genome composed of genes absent from one or more strains and genes that are unique to each strain."

Parts of the pangenome

File: Parts of the pangenome.png

= Core =

The core is the part of the pangenome that is shared by every genome in the tested set. Some authors divide the core pangenome into the hard core, those families of homologous genes that have at least one copy in every genome (100% of genomes), and the soft core or extended core,{{cite journal | vauthors = Halachev MR, Loman NJ, Pallen MJ | title = Calculating orthologs in bacteria and Archaea: a divide and conquer approach | journal = PLOS ONE | volume = 6 | issue = 12 | pages = e28388 | date = 2011 | pmid = 22174796 | pmc = 3236195 | doi = 10.1371/journal.pone.0028388 | bibcode = 2011PLoSO...628388H | doi-access = free }} those families distributed above a certain threshold (90%). In a study of the pangenomes of Bacillus cereus and Staphylococcus aureus, which included isolates from the International Space Station, the thresholds used for segmenting the pangenomes were as follows: "Cloud", "Shell", and "Core" corresponding to gene families with presence in <10%, 10–95%, and >95% of the genomes, respectively.{{cite journal | vauthors = Blaustein RA, McFarland AG, Ben Maamar S, Lopez A, Castro-Wallace S, Hartmann EM | title = Pangenomic Approach To Understanding Microbial Adaptations within a Model Built Environment, the International Space Station, Relative to Human Hosts and Soil | journal = mSystems | volume = 4 | issue = 1 | pages = e00281-18 | date = 2019 | pmid = 30637341 | pmc = 6325168 | doi = 10.1128/mSystems.00281-18 }}

The size of the core genome, and its proportion of the pangenome, depends on several factors, especially the phylogenetic similarity of the genomes considered. For example, the core of two identical genomes would also be the complete pangenome. The core of a genus cannot be larger than the core genome of any of its species, and is usually smaller. Genes that belong to the core genome are often related to housekeeping functions and primary metabolism of the lineage; nevertheless, the core genome can also contain some genes that differentiate the species from other species of the genus, i.e. genes that may be related to pathogenicity or niche adaptation.{{cite journal | vauthors = Mosquera-Rendón J, Rada-Bravo AM, Cárdenas-Brito S, Corredor M, Restrepo-Pineda E, Benítez-Páez A | title = Pangenome-wide and molecular evolution analyses of the Pseudomonas aeruginosa species | journal = BMC Genomics | volume = 17 | issue = 45 | pages = 45 | date = January 2016 | pmid = 26754847 | pmc = 4710005 | doi = 10.1186/s12864-016-2364-4 | doi-access = free }}

= Shell =

The shell is the part of the pangenome shared by the majority of the genomes in a pangenome.{{cite journal | vauthors = Snipen L, Ussery DW | title = Standard operating procedure for computing pangenome trees | journal = Standards in Genomic Sciences | volume = 2 | issue = 1 | pages = 135–41 | date = January 2010 | pmid = 21304685 | pmc = 3035256 | doi = 10.4056/sigs.38923 | bibcode = 2010SGenS...2..135S }} There is no universally accepted threshold to define the shell genome; some authors consider a gene family part of the shell pangenome if it is shared by more than 50% of the genomes in the pangenome.{{cite journal | vauthors = Sélem-Mojica N, Aguilar C, Gutiérrez-García K, Martínez-Guerrero CE, Barona-Gómez F | title = EvoMining reveals the origin and fate of natural product biosynthetic enzymes | journal = Microbial Genomics | volume = 5 | issue = 12 | pages = e000260 | date = December 2019 | pmid = 30946645 | pmc = 6939163 | doi = 10.1099/mgen.0.000260 | doi-access = free }} A family can become part of the shell through several evolutionary dynamics, for example by gene loss in a lineage where it was previously part of the core genome, as is the case for enzymes of the tryptophan operon in Actinomyces,{{cite journal | vauthors = Juárez-Vázquez AL, Edirisinghe JN, Verduzco-Castro EA, Michalska K, Wu C, Noda-García L, Babnigg G, Endres M, Medina-Ruíz S, Santoyo-Flores J, Carrillo-Tripp M, Ton-That H, Joachimiak A, Henry CS, Barona-Gómez F | display-authors = 6 | title = Evolution of substrate specificity in a retained enzyme driven by gene loss | journal = eLife | volume = 6 | issue = 6 | pages = e22679 | date = March 2017 | pmid = 28362260 | pmc = 5404923 | doi = 10.7554/eLife.22679 | doi-access = free }} or by gain and fixation of a gene family that was previously part of the dispensable genome, as is the case for the trpF gene in several Corynebacterium species.{{cite journal | vauthors = Noda-García L, Camacho-Zarco AR, Medina-Ruíz S, Gaytán P, Carrillo-Tripp M, Fülöp V, Barona-Gómez F | title = Evolution of substrate specificity in a recipient's enzyme following horizontal gene transfer | journal = Molecular Biology and Evolution | volume = 30 | issue = 9 | pages = 2024–34 | date = September 2013 | pmid = 23800623 | doi = 10.1093/molbev/mst115 | doi-access = free }}

= Cloud =

The cloud genome consists of those gene families shared by a minimal subset of the genomes in the pangenome;{{cite book | vauthors = Vernikos GS | chapter = A Review of Pangenome Tools and Recent Studies | title = The Pangenome | pages = 89–112 | volume = | date = 2020 | pmid = 32633917 | pmc = | doi = 10.1007/978-3-030-38281-0_4 | isbn = 978-3-030-38280-3 | s2cid = 219011507 }} it includes singletons, that is, genes present in only one of the genomes. It is also known as the peripheral genome or accessory genome. Gene families in this category are often related to ecological adaptation.{{cn|date=April 2023}}
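The partition into core, shell and cloud can be sketched in a few lines of code. The example below is only an illustration: it assumes that each genome has already been reduced to a set of gene family identifiers, and it uses the <10%, 10–95% and >95% thresholds quoted above from the International Space Station study as defaults. The function name `classify_gene_families` and its parameters are hypothetical and do not come from any particular pangenome tool.

```python
from typing import Dict, Set


def classify_gene_families(
    presence: Dict[str, Set[str]],
    cloud_max: float = 0.10,  # families in fewer than 10% of genomes -> cloud
    core_min: float = 0.95,   # families in more than 95% of genomes -> core
) -> Dict[str, str]:
    """Label each gene family as 'core', 'shell' or 'cloud'.

    `presence` maps a genome name to the set of gene family ids it contains.
    The thresholds are illustrative defaults; studies use different cut-offs.
    """
    n_genomes = len(presence)
    if n_genomes == 0:
        return {}

    # Count in how many genomes each family occurs.
    counts: Dict[str, int] = {}
    for families in presence.values():
        for family in families:
            counts[family] = counts.get(family, 0) + 1

    labels: Dict[str, str] = {}
    for family, count in counts.items():
        fraction = count / n_genomes
        if fraction > core_min:
            labels[family] = "core"
        elif fraction < cloud_max:
            labels[family] = "cloud"
        else:
            labels[family] = "shell"
    return labels


# Toy data set: 20 genomes, one family in all, one in half, one singleton.
genomes = {f"g{i}": {"famA"} for i in range(20)}
for i in range(10):
    genomes[f"g{i}"].add("famB")
genomes["g0"].add("famC")

labels = classify_gene_families(genomes)
print(labels)
```

With small genome sets the cloud threshold can never be met (one genome out of four is already 25%), which is one reason why the choice of thresholds is study-specific.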

Classification

File: Characteristics of open and closed pangenomes.png

The pan-genome can be somewhat arbitrarily classified as open or closed based on the alpha value of Heaps' law: $N=kn^{-\alpha}$ {{cite journal | vauthors = Costa SS, Guimarães LC, Silva A, Soares SC, Baraúna RA | title = First Steps in the Analysis of Prokaryotic Pan-Genomes | journal = Bioinformatics and Biology Insights | volume = 14 | pages = 1177932220938064 | date = 2020 | pmid = 32843837 | pmc = 7418249 | doi = 10.1177/1177932220938064 }}

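$N$: number of gene families (with the negative exponent, this is best read as the new gene families found when the $n$-th genome is added, since a cumulative total cannot decrease).
$n$: number of genomes.
$k$: constant of proportionality.
$\alpha$: exponent fitted to the curve of the number of gene families against the number of genomes added.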

If $\alpha \le 1$, the pangenome is considered open.

If $\alpha > 1$, the pangenome is considered closed.

Pangenome software can usually calculate the parameters of Heaps' law that best describe the behavior of the data.
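As a rough illustration of such a fit, the sketch below estimates $k$ and $\alpha$ by ordinary least squares on the log-transformed form $\log N = \log k - \alpha \log n$. It assumes that $N$ is read as the number of new gene families observed when the $n$-th genome is added, and that all counts are positive; the function names `fit_heaps_law` and `classify_pangenome` and the synthetic data are hypothetical, and real tools may use other fitting procedures.

```python
import math
from typing import Sequence, Tuple


def fit_heaps_law(
    n_genomes: Sequence[int], new_families: Sequence[float]
) -> Tuple[float, float]:
    """Fit N = k * n**(-alpha) by least squares in log-log space.

    Returns (k, alpha). All inputs must be strictly positive, because the
    fit takes logarithms of both variables.
    """
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(v) for v in new_families]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log N = log k - alpha * log n, so alpha is minus the slope.
    return math.exp(intercept), -slope


def classify_pangenome(alpha: float) -> str:
    """Apply the rule stated above: alpha <= 1 is open, alpha > 1 is closed."""
    return "open" if alpha <= 1 else "closed"


# Synthetic, noise-free data generated with k = 500 and alpha = 0.6.
ns = list(range(2, 21))
new_counts = [500 * n ** -0.6 for n in ns]

k, alpha = fit_heaps_law(ns, new_counts)
print(round(k, 3), round(alpha, 3), classify_pangenome(alpha))
```

The threshold at $\alpha = 1$ reflects the behavior of the sum of $n^{-\alpha}$ over all genomes: it grows without bound when $\alpha \le 1$ and converges when $\alpha > 1$.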

= Open pangenome =

An open pangenome occurs when the number of new gene families in a taxonomic lineage keeps increasing, without appearing to approach an asymptote, regardless of how many new genomes are added to the pangenome. Escherichia coli is an example of a species with an open pangenome. Any single E. coli genome contains in the range of 4000–5000 genes, while the pangenome estimated for this species from approximately 2000 genomes is composed of 89,000 different gene families.{{cite journal | vauthors = Land M, Hauser L, Jun SR, Nookaew I, Leuze MR, Ahn TH, Karpinets T, Lund O, Kora G, Wassenaar T, Poudel S, Ussery DW | display-authors = 6 | title = Insights from 20 years of bacterial genome sequencing | journal = Functional & Integrative Genomics | volume = 15 | issue = 2 | pages = 141–61 | date = March 2015 | pmid = 25722247 | pmc = 4361730 | doi = 10.1007/s10142-015-0433-4 }} The pangenome of the domain Bacteria is also considered to be open.

= Closed pangenome =

A closed pangenome occurs in a lineage when only a few gene families are added as new genomes are incorporated into the pangenome analysis, and the total number of gene families in the pangenome appears to approach an asymptote. Parasitic species and species that are specialists in a particular ecological niche are believed to tend to have closed pangenomes. Staphylococcus lugdunensis is an example of a commensal bacterium with a closed pan-genome.{{cite journal | vauthors = Argemi X, Matelska D, Ginalski K, Riegel P, Hansmann Y, Bloom J, Pestel-Caron M, Dahyot S, Lebeurre J, Prévost G | display-authors = 6 | title = Comparative genomic analysis of Staphylococcus lugdunensis shows a closed pan-genome and multiple barriers to horizontal gene transfer | journal = BMC Genomics | volume = 19 | issue = 1 | pages = 621 | date = August 2018 | pmid = 30126366 | pmc = 6102843 | doi = 10.1186/s12864-018-4978-1 | doi-access = free }}
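The difference between open and closed behavior is usually seen in accumulation curves. The following sketch, which assumes each genome is given as a set of gene family identifiers, tracks the pangenome size (union), the core size (intersection) and the number of new families as genomes are added one at a time. The name `accumulation_curve` and the toy data are hypothetical.

```python
from typing import List, Set, Tuple


def accumulation_curve(genomes: List[Set[str]]) -> List[Tuple[int, int, int, int]]:
    """Return (genomes added, pangenome size, core size, new families) per step.

    The pangenome is the running union of gene families and the core is the
    running intersection, in the order the genomes are supplied.
    """
    pan: Set[str] = set()
    core: Set[str] = set()
    curve: List[Tuple[int, int, int, int]] = []
    for step, families in enumerate(genomes, start=1):
        new_families = len(families - pan)  # families not seen in earlier genomes
        pan |= families
        core = set(families) if step == 1 else core & families
        curve.append((step, len(pan), len(core), new_families))
    return curve


toy_genomes = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c", "d"},
    {"a", "b", "c"},
]

curve = accumulation_curve(toy_genomes)
for row in curve:
    print(row)
```

In this toy case no new families appear after the second genome, which is the pattern expected of a closed pangenome. The curve depends on the order in which genomes are added, so analyses often average over many random orderings.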

History

= Pangenome =

The original pangenome concept was developed by Tettelin et al. when they analyzed the genomes of eight isolates of Streptococcus agalactiae, where they described a core genome shared by all isolates, accounting for approximately 80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Extrapolation suggested that the gene reservoir in the S. agalactiae pan-genome is vast and that new unique genes would continue to be identified even after sequencing hundreds of genomes. The pangenome comprises the entirety of the genes discovered in the sequenced genomes of a given microbial species and it can change when new genomes are sequenced and incorporated into the analysis.{{cn|date=April 2023}}

File: Supergenome and metapangenome.png

The pangenome of a genomic lineage accounts for the intra-lineage variability in gene content. Pangenomes evolve through gene duplication, gene gain and loss dynamics, and interaction of the genome with mobile elements, processes that are shaped by selection and drift.{{cite journal | vauthors = Brockhurst MA, Harrison E, Hall JP, Richards T, McNally A, MacLean C | title = The Ecology and Evolution of Pangenomes | journal = Current Biology | volume = 29 | issue = 20 | pages = R1094–R1103 | date = October 2019 | pmid = 31639358 | doi = 10.1016/j.cub.2019.08.012 | s2cid = 204823648 | doi-access = free | bibcode = 2019CBio...29R1094B }} Some studies suggest that prokaryote pangenomes are the result of adaptive, not neutral, evolution that confers on species the ability to migrate to new niches.{{cite journal | vauthors = McInerney JO, McNally A, O'Connell MJ | title = Why prokaryotes have pangenomes | journal = Nature Microbiology | volume = 2 | pages = 17040 | date = March 2017 | issue = 4 | pmid = 28350002 | doi = 10.1038/nmicrobiol.2017.40 | s2cid = 19612970 | url = http://eprints.whiterose.ac.uk/113972/39/McInerney_McNally_O%27Connell_NatMicro.pdf }}

= Supergenome =

The supergenome can be thought of as the real pangenome size if all genomes from a species were sequenced.{{cite journal | vauthors = Koonin EV | title = The Turbulent Network Dynamics of Microbial Evolution and the Statistical Tree of Life | journal = Journal of Molecular Evolution | volume = 80 | issue = 5–6 | pages = 244–50 | date = June 2015 | pmid = 25894542 | pmc = 4472940 | doi = 10.1007/s00239-015-9679-7 | bibcode = 2015JMolE..80..244K }} It is defined as all genes accessible for being gained by a certain species. It cannot be calculated directly but its size can be estimated by the pangenome size calculated from the available genome data. Estimating the size of the cloud genome can be troubling because of its dependence on the occurrence of rare genes and genomes. In 2011 genomic fluidity was proposed as a measure to categorize the gene-level similarity among groups of sequenced isolates.{{cite journal | vauthors = Kislyuk AO, Haegeman B, Bergman NH, Weitz JS | title = Genomic fluidity: an integrative view of gene diversity within microbial populations | journal = BMC Genomics | volume = 12 | issue = 12 | pages = 32 | date = January 2011 | pmid = 21232151 | pmc = 3030549 | doi = 10.1186/1471-2164-12-32 | doi-access = free }} In some lineages the supergenomes did appear infinite,{{cite journal | vauthors = Puigbò P, Lobkovsky AE, Kristensen DM, Wolf YI, Koonin EV | title = Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes | journal = BMC Biology | volume = 12 | issue = 66 | pages = 66 | date = August 2014 | pmid = 25141959 | pmc = 4166000 | doi = 10.1186/s12915-014-0066-4 | doi-access = free }} as is the case of the Bacteria domain.{{cite journal | vauthors = Lapierre P, Gogarten JP | title = Estimating the size of the bacterial pan-genome | journal = Trends in Genetics | volume = 25 | issue = 3 | pages = 107–10 | date = March 2009 | pmid = 19168257 | pmc = | doi = 10.1016/j.tig.2008.12.004 }}
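Genomic fluidity is commonly described as the ratio of gene families unique to each genome of a pair to the total number of gene families in that pair, averaged over pairs of genomes. The sketch below follows that description, assuming non-empty genomes represented as sets of gene family identifiers and averaging over all pairs (not a random sample of pairs); the function name `genomic_fluidity` is hypothetical.

```python
from itertools import combinations
from typing import List, Set


def genomic_fluidity(genomes: List[Set[str]]) -> float:
    """Average, over all genome pairs, of unique families / total families.

    For a pair (a, b): unique = |a - b| + |b - a| and total = |a| + |b|.
    Assumes at least two genomes and that no genome is empty.
    """
    if len(genomes) < 2:
        raise ValueError("at least two genomes are required")
    ratios = []
    for a, b in combinations(genomes, 2):
        unique = len(a - b) + len(b - a)
        total = len(a) + len(b)
        ratios.append(unique / total)
    return sum(ratios) / len(ratios)


toy_genomes = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f", "g"},
]

phi = genomic_fluidity(toy_genomes)
print(round(phi, 4))
```

A value of 0 means that all genomes share exactly the same gene families, and values near 1 mean that pairs of genomes share almost nothing.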

= Metapangenome =

'Metapangenome' has been defined as the outcome of the analysis of pangenomes in conjunction with the environment where the abundance and prevalence of gene clusters and genomes are recovered through shotgun metagenomes.{{cite journal |last1=Delmont |first1=TO |last2=Eren |first2=AM |title=Linking pangenomes and metagenomes: the Prochlorococcus metapangenome. |journal=PeerJ |date=2018 |volume=6 |pages=e4320 |doi=10.7717/peerj.4320 |pmid=29423345|pmc=5804319 |doi-access=free }} The combination of metagenomes with pangenomes, also referred to as "metapangenomics", reveals the population-level results of habitat-specific filtering of the pangenomic gene pool.{{cite journal |last1=Utter |first1=Daniel R. |last2=Borisy |first2=Gary G. |last3=Eren |first3=A. Murat |last4=Cavanaugh |first4=Colleen M. |last5=Mark Welch |first5=Jessica L. |title=Metapangenomics of the oral microbiome provides insights into habitat adaptation and cultivar diversity |journal=Genome Biology |date=2020-12-16 |volume=21 |issue=1 |pages=293 |doi=10.1186/s13059-020-02200-2|pmid=33323129 |pmc=7739467 |doi-access=free }}

Other authors consider that metapangenomics expands the concept of the pangenome by incorporating gene sequences obtained from uncultivated microorganisms through a metagenomics approach. A metapangenome comprises both sequences from metagenome-assembled genomes (MAGs) and from genomes obtained from cultivated microorganisms.{{cite book |vauthors= Ma B, France M, Ravel J |veditors= Tettelin H, Medini D |title= The Pangenome: Diversity, Dynamics and Evolution of Genomes |publisher=Springer |date=2020 |chapter= Meta-Pangenome: At the Crossroad of Pangenomics and Metagenomics|pages= 205–218 |url=https://www.ncbi.nlm.nih.gov/books/NBK558817/ |doi=10.1007/978-3-030-38281-0_9|pmid= 32633911 |isbn=978-3-030-38281-0|s2cid= 219067583 }} Metapangenomics has been applied to assess the diversity of a community, microbial niche adaptation, microbial evolution, functional activities, and interaction networks of the community.{{cite journal | vauthors = Zhong C, Chen C, Wang L, Ning K | title = Integrating pan-genome with metagenome for microbial community profiling | journal = Computational and Structural Biotechnology Journal | volume = 19 | issue = | pages = 1458–1466 | date = 2021 | pmid = 33841754 | pmc = 8010324 | doi = 10.1016/j.csbj.2021.02.021 }} The Anvi'o platform offers a workflow that integrates the analysis and visualization of metapangenomes by generating pangenomes and studying them in conjunction with metagenomes.

Examples

= Prokaryote pangenome =

File:Streptococcus pneumoniae pan-genome Donati 2011.jpg drops sharply to zero when the number of genomes exceeds 50. (b) Number of core genes as a function of the number of sequenced genomes. The number of core genes converges to 1,647 for number of genomes n→∞. From Donati et al.{{cite journal | vauthors = Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, Angiuoli SV, Oggioni M, Dunning Hotopp JC, Hu FZ, Riley DR, Covacci A, Mitchell TJ, Bentley SD, Kilian M, Ehrlich GD, Rappuoli R, Moxon ER, Masignani V | display-authors = 6 | title = Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species | journal = Genome Biology | volume = 11 | issue = 10 | pages = R107 | year = 2010 | pmid = 21034474 | pmc = 3218663 | doi = 10.1186/gb-2010-11-10-r107 | doi-access = free }}

In 2018, 87% of the available whole-genome sequences were from bacteria, fueling researchers' interest in calculating prokaryote pangenomes at different taxonomic levels. In 2015, the pangenome of 44 strains of Streptococcus pneumoniae showed few new genes discovered with each new genome sequenced (see figure). In fact, the predicted number of new genes dropped to zero when the number of genomes exceeded 50 (note, however, that this is not a pattern found in all species). This would mean that S. pneumoniae has a 'closed pangenome'.{{cite journal | vauthors = Rouli L, Merhej V, Fournier PE, Raoult D | title = The bacterial pangenome as a new tool for analysing pathogenic bacteria | journal = New Microbes and New Infections | volume = 7 | pages = 72–85 | date = September 2015 | pmid = 26442149 | pmc = 4552756 | doi = 10.1016/j.nmni.2015.06.005 }} The main source of new genes in S. pneumoniae was Streptococcus mitis, from which genes were transferred horizontally. The pan-genome size of S. pneumoniae increased logarithmically with the number of strains and linearly with the number of polymorphic sites of the sampled genomes, suggesting that acquired genes accumulate proportionately to the age of clones. Another example of a prokaryote pan-genome is that of Prochlorococcus, in which the core genome is much smaller than the pangenome, which is used by different ecotypes of Prochlorococcus.{{cite journal | vauthors = Kettler GC, Martiny AC, Huang K, Zucker J, Coleman ML, Rodrigue S, Chen F, Lapidus A, Ferriera S, Johnson J, Steglich C, Church GM, Richardson P, Chisholm SW | display-authors = 6 | title = Patterns and implications of gene gain and loss in the evolution of Prochlorococcus | journal = PLOS Genetics | volume = 3 | issue = 12 | pages = e231 | date = December 2007 | pmid = 18159947 | pmc = 2151091 | doi = 10.1371/journal.pgen.0030231 | doi-access = free }} Open pan-genomes have been observed in environmental isolates such as Alcaligenes sp.{{cite journal | vauthors = Basharat Z, Yasmin A, He T, Tong Y | title = Genome sequencing and analysis of Alcaligenes faecalis subsp. phenolicus MB207 | journal = Scientific Reports | volume = 8 |pages = 3616 | date = 2018 | issue = 1 | pmid = 29483539 | pmc = 5827749 | doi = 10.1038/s41598-018-21919-4 | bibcode = 2018NatSR...8.3616B }} and Serratia sp.,{{cite arXiv | vauthors = Basharat Z, Yasmin A | title = Pan-genome Analysis of the Genus Serratia | year = 2016 | class = q-bio.GN | eprint = 1610.04160 }} showing a sympatric lifestyle. Nevertheless, an open pangenome is not exclusive to free-living microorganisms: a 2015 study on Prevotella bacteria isolated from humans compared the gene repertoires of its species derived from different human body sites. The study also reported an open pan-genome, showing a vast diversity of the gene pool.{{cite journal | vauthors = Gupta VK, Chaudhari NM, Iskepalli S, Dutta C | title = Divergences in gene repertoire among the reference Prevotella genomes derived from distinct body sites of human | journal = BMC Genomics | volume = 16 | issue = 153 | pages = 153 | date = March 2015 | pmid = 25887946 | pmc = 4359502 | doi = 10.1186/s12864-015-1350-6 | doi-access = free }}

Pangenome studies also exist for Archaea. The Halobacteria pangenome shows the following numbers of gene families in the pangenome subsets: core (300), variable components (softcore: 998, cloud: 36,531, shell: 11,784).{{cite journal | vauthors = Gaba S, Kumari A, Medema M, Kaushik R | title = Pan-genome analysis and ancestral state reconstruction of class halobacteria: probability of a new super-order | journal = Scientific Reports | volume = 10 | issue = 1 | pages = 21205 | date = December 2020 | pmid = 33273480 | pmc = 7713125 | doi = 10.1038/s41598-020-77723-6 | bibcode = 2020NatSR..1021205G }}

= Eukaryote pangenome =

Eukaryotic organisms such as fungi, animals and plants have also shown evidence of pangenomes. In four fungal species whose pangenomes have been studied, between 80 and 90% of gene models were found to be core genes. The remaining accessory genes were mainly involved in pathogenesis and antimicrobial resistance.{{cite journal | vauthors = McCarthy CG, Fitzpatrick DA | title = Pan-genome analyses of model fungal species | journal = Microbial Genomics | volume = 5 | issue = 2 | date = February 2019 | pmid = 30714895 | pmc = 6421352 | doi = 10.1099/mgen.0.000243 | doi-access = free }}

In animals, the human pangenome is being studied. In 2010 a study estimated that a complete human pan-genome would contain ~19–40 Megabases of novel sequence not present in the extant reference human genome.{{cite journal |vauthors=Li R, Li Y, Zheng H, Luo R, Zhu H, Li Q, Qian W, Ren Y, Tian G, Li J, Zhou G, Zhu X, Wu H, Qin J, Jin X, Li D, Cao H, Hu X, Blanche H, Cann H, Zhang X, Li S, Bolund L, Kristiansen K, Yang H, Wang J, Wang J |date=2010|title=Building the sequence map of the human pan-genome |url=https://www.nature.com/articles/nbt.1596 |journal=Nat Biotechnol|volume=28 |issue=1|pages=57–63 |doi=10.1038/nbt.1596 |pmid=19997067|s2cid=205274447}} The [https://humanpangenome.org/about-us/consortium-organization/ Human Pangenome consortium] has the goal to acknowledge the human genome diversity. In 2023, a draft human pangenome reference was published.{{cite journal |last1=Liao |first1=Wen-Wei |last2=Asri |first2=Mobin |last3=Ebler |first3=Jana |last4=Doerr |first4=Daniel |last5=Haukness |first5=Marina |last6=Hickey |first6=Glenn | display-authors = et al |title=A draft human pangenome reference |journal=Nature |date=May 2023 |volume=617 |issue=7960 |pages=312–324 |doi=10.1038/s41586-023-05896-x|doi-access=free |pmid=37165242 |pmc=10172123 |bibcode=2023Natur.617..312L }} It is based on 47 diploid genomes from persons of varied ethnicity. Plans are underway for an improved reference capturing still more biodiversity from a still wider sample.

Among plants, there are examples of pangenome studies in model species, both diploid and polyploid, and a growing list of crops.{{cite journal | vauthors = Gao L, Gonda I, Sun H et al | title = The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor | journal = Nature Genetics | volume = 51 | pages = 1044–51 | date = May 2019 | issue = 6 | pmid = 31086351 | doi = 10.1038/s41588-019-0410-2 | s2cid = 152283283 }}{{cite journal | vauthors = Jayakodi M, Padmarasu S, Haberer G et al | title = The barley pan-genome reveals the hidden legacy of mutation breeding | journal = Nature | volume = 588 | pages = 284–9 | date = Nov 2020 | issue = 7837 | pmid = 33239781 | doi = 10.1038/s41586-020-2947-8 | pmc = 7759462 | bibcode = 2020Natur.588..284J | doi-access = free }}

Pangenomes have shown promise as a tool in plant breeding by accounting for structural variants and SNPs in non-reference genomes, which helps to solve the problem of missing heritability that persists in genome wide association studies.{{Cite journal |last1=Zhou |first1=Yao |last2=Zhang |first2=Zhiyang |last3=Bao |first3=Zhigui |last4=Li |first4=Hongbo |last5=Lyu |first5=Yaqing |last6=Zan |first6=Yanjun |last7=Wu |first7=Yaoyao |last8=Cheng |first8=Lin |last9=Fang |first9=Yuhan |last10=Wu |first10=Kun |last11=Zhang |first11=Jinzhe |last12=Lyu |first12=Hongjun |last13=Lin |first13=Tao |last14=Gao |first14=Qiang |last15=Saha |first15=Surya |date=8 July 2022 |title=Graph pangenome captures missing heritability and empowers tomato breeding |journal=Nature |language=en |volume=606 |issue=7914 |pages=527–534 |doi=10.1038/s41586-022-04808-9 |issn=1476-4687|doi-access=free |pmid=35676474 |pmc=9200638 |bibcode=2022Natur.606..527Z |hdl=20.500.11850/553553 |hdl-access=free }} An emerging plant-based concept is that of pan-NLRome, which is the repertoire of nucleotide-binding leucine-rich repeat (NLR) proteins, intracellular immune receptors that recognize pathogen proteins and confer disease resistance.{{cite journal | vauthors = Van de Weyer AL, Monteiro F, Furzer OJ, Nishimura MT, Cevik V, Witek K, Jones JD, Dangl JL, Weigel D, Bemm F | title = A Species-Wide Inventory of NLR Genes and Alleles in Arabidopsis thaliana | journal = Cell | volume = 178 | issue = 5 | pages = 1260–72 | date = August 2019 | pmid = 31442410 | doi = 10.1016/j.cell.2019.07.038 | pmc = 6709784 | doi-access = free }}

= Virus pangenome =

Viruses do not necessarily share genes across clades in the way that bacteria share the 16S rRNA gene, and therefore the core genome of viruses as a whole is empty. Nevertheless, several studies have calculated the pangenome of individual viral lineages. The core genome of six species of pandoraviruses comprises 352 gene families, only 4.7% of the pangenome, which indicates an open pangenome.{{cite journal | vauthors = Aherfi S, Andreani J, Baptiste E, Oumessoum A, Dornas FP, Andrade AC, Chabriere E, Abrahao J, Levasseur A, Raoult D, La Scola B, Colson P | display-authors = 6 | title = A Large Open Pangenome and a Small Core Genome for Giant Pandoraviruses | journal = Frontiers in Microbiology | volume = 9 | issue = 9 | pages = 1486 | date = 2018 | pmid = 30042742 | pmc = 6048876 | doi = 10.3389/fmicb.2018.01486 | doi-access = free }}
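As an illustration of how such a core fraction can be derived, the following minimal sketch classifies gene families as core, shell or cloud from a presence table and reports the share of core families. The family names, genome names and function names are hypothetical and the data are a toy example, not taken from the pandoravirus study. The last line only restates the arithmetic implied by the figures above (352 families being 4.7% of the total).

```python
from collections import Counter


def classify_families(presence, n_genomes):
    """Label each gene family as core, shell or cloud.

    presence maps a family name to the set of genomes that carry it.
    core  = present in every genome
    cloud = present in exactly one genome
    shell = everything in between
    """
    classes = {}
    for family, genomes in presence.items():
        count = len(genomes)
        if count == n_genomes:
            classes[family] = "core"
        elif count == 1:
            classes[family] = "cloud"
        else:
            classes[family] = "shell"
    return classes


def core_fraction(classes):
    """Share of all gene families that belong to the core genome."""
    counts = Counter(classes.values())
    return counts["core"] / len(classes)


# Hypothetical toy data: four gene families in three genomes.
toy_presence = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2"},
    "famC": {"g3"},
    "famD": {"g1", "g2", "g3"},
}
toy_classes = classify_families(toy_presence, n_genomes=3)
toy_core_fraction = core_fraction(toy_classes)

# Pangenome size implied by "352 families are 4.7% of the pangenome".
implied_pangenome_size = round(352 / 0.047)
```

In the toy data half of the families are core. A core fraction as low as 4.7% means that the great majority of gene families are missing from at least one genome.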

Data structures

The number of sequenced genomes is continuously growing, and "simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets".{{cite journal | vauthors=((The Computational Pan-Genomics Consortium))| title = Computational pan-genomics: status, promises and challenges | journal = Briefings in Bioinformatics | volume = 19 | issue = 1 | pages = 118–135 | date = January 2018 | pmid = 27769991 | pmc = 5862344 | doi = 10.1093/bib/bbw089 }} Pan-genome graphs are an emerging class of data structures designed to represent pangenomes and to map reads to them efficiently. They have been reviewed by Eizenga et al.{{cite journal | vauthors = Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E | display-authors = 6 | title = Pangenome Graphs | journal = Annual Review of Genomics and Human Genetics | volume = 21 | pages = 139–162 | date = August 2020 | pmid = 32453966 | pmc = 8006571 | doi = 10.1146/annurev-genom-120219-080406 }}
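The general idea of a pangenome graph can be sketched as a sequence graph in which nodes hold stretches of sequence, edges connect nodes that are adjacent in at least one genome, and each genome is stored as a path through the nodes. The class below is a simplified illustration under those assumptions. Its names and the toy sequences are hypothetical, and it omits features that real implementations need, such as strand orientation and read-mapping indexes.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph: shared sequence is stored once, genomes are paths."""
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # (from node id, to node id)
    paths: dict = field(default_factory=dict)   # genome name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_path(self, name, node_ids):
        # Every consecutive pair of nodes on the path becomes an edge.
        for left, right in zip(node_ids, node_ids[1:]):
            self.edges.add((left, right))
        self.paths[name] = list(node_ids)

    def spell(self, name):
        """Reconstruct a genome by concatenating the nodes on its path."""
        return "".join(self.nodes[n] for n in self.paths[name])

    def shared_nodes(self):
        """Nodes visited by every stored genome."""
        node_sets = [set(path) for path in self.paths.values()]
        return set.intersection(*node_sets) if node_sets else set()


# Two hypothetical genomes that differ by a single base (T versus G).
graph = SequenceGraph()
graph.add_node(1, "ACG")
graph.add_node(2, "T")
graph.add_node(3, "G")
graph.add_node(4, "CCA")
graph.add_path("genome_a", [1, 2, 4])
graph.add_path("genome_b", [1, 3, 4])

spelled_a = graph.spell("genome_a")
spelled_b = graph.spell("genome_b")
shared = graph.shared_nodes()
```

The variant site appears as a "bubble" (nodes 2 and 3), while the flanking sequence shared by both genomes is stored only once.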

Software tools

File:BPGA analysis of Streptococcus agalactiae.png. Example of phylogenies made with BPGA software. The software can generate phylogenies based on the clustering of the core genome or of the pangenome. Core and pan phylogenetic reconstructions do not necessarily match.]]

As interest in pangenomes has increased, several software tools have been developed to help analyze this kind of data.

The first step in a pangenomic analysis is to homogenize the genome annotation: all genomes should be annotated with the same software, such as GeneMark{{cite journal | vauthors = Besemer J, Lomsadze A, Borodovsky M | title = GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions | journal = Nucleic Acids Research | volume = 29 | issue = 12 | pages = 2607–18 | date = June 2001 | pmid = 11410670 | pmc = 55746 | doi = 10.1093/nar/29.12.2607 }} or RAST.{{cite journal | vauthors = Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O | display-authors = 6 | title = The RAST Server: rapid annotations using subsystems technology | journal = BMC Genomics | volume = 9 | issue = 9 | pages = 75 | date = February 2008 | pmid = 18261238 | pmc = 2265698 | doi = 10.1186/1471-2164-9-75 | doi-access = free }} In 2015, a group reviewed the different kinds of analyses and tools available to researchers.{{cite journal | vauthors = Xiao J, Zhang Z, Wu J, Yu J | title = A brief review of software tools for pangenomics | journal = Genomics, Proteomics & Bioinformatics | volume = 13 | issue = 1 | pages = 73–6 | date = February 2015 | pmid = 25721608 | pmc = 4411478 | doi = 10.1016/j.gpb.2015.01.007 }} Software developed to analyze pangenomes falls into seven categories: tools that cluster homologous genes; identify SNPs; plot pangenomic profiles; build phylogenetic relationships of orthologous genes or families of strains or isolates; perform function-based searches; annotate or curate; and visualize.

The two most cited software tools for pangenomic analysis at the end of 2014 were Panseq{{cite journal | vauthors = Laing C, Buchanan C, Taboada EN, Zhang Y, Kropinski A, Villegas A, Thomas JE, Gannon VP | display-authors = 6 | title = Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions | journal = BMC Bioinformatics | volume = 11 | issue = 1 | pages = 461 | date = September 2010 | pmid = 20843356 | pmc = 2949892 | doi = 10.1186/1471-2105-11-461 | doi-access = free }} and the pan-genomes analysis pipeline (PGAP).{{cite journal | vauthors = Zhao Y, Wu J, Yang J, Sun S, Xiao J, Yu J | title = PGAP: pan-genomes analysis pipeline | journal = Bioinformatics | volume = 28 | issue = 3 | pages = 416–8 | date = February 2012 | pmid = 22130594 | pmc = 3268234 | doi = 10.1093/bioinformatics/btr655 }} Other options include BPGA (a pan-genome analysis pipeline for prokaryotic genomes),{{cite journal | vauthors = Chaudhari NM, Gupta VK, Dutta C | title = BPGA- an ultra-fast pan-genome analysis pipeline | journal = Scientific Reports | volume = 6 | issue = 24373 | pages = 24373 | date = April 2016 | pmid = 27071527 | pmc = 4829868 | doi = 10.1038/srep24373 | bibcode = 2016NatSR...624373C }} GET_HOMOLOGUES,{{cite journal | vauthors = Contreras-Moreira B, Vinuesa P | title = GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis | journal = Applied and Environmental Microbiology | volume = 79 | issue = 24 | pages = 7696–701 | date = December 2013 | pmid = 24096415 | pmc = 3837814 | doi = 10.1128/AEM.02411-13 | bibcode = 2013ApEnM..79.7696C }} Roary,{{cite journal | vauthors = Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MT, Fookes M, Falush D, Keane JA, Parkhill J | display-authors = 6 | title = Roary: rapid large-scale prokaryote pan genome analysis | journal = Bioinformatics | volume = 31 | issue = 22 | pages = 3691–3 | date = November 2015 | pmid = 26198102 | pmc = 4817141 | doi = 10.1093/bioinformatics/btv421 }} and PanDelos.{{cite journal | vauthors = Bonnici V, Giugno R, Manca V | title = PanDelos: a dictionary-based method for pan-genome content discovery | journal = BMC Bioinformatics | volume = 19 | issue = Suppl 15 | pages = 437 | date = November 2018 | pmid = 30497358 | pmc = 6266927 | doi = 10.1186/s12859-018-2417-6 | doi-access = free }} In 2015, a review focused on prokaryote pangenomes{{cite journal | vauthors = Guimarães LC, Florczak-Wyspianska J, de Jesus LB, Viana MV, Silva A, Ramos RT, Soares S | display-authors = 6 | title = Inside the Pan-genome - Methods and Software Overview | journal = Current Genomics | volume = 16 | issue = 4 | pages = 245–52 | date = August 2015 | pmid = 27006628 | pmc = 4765519 | doi = 10.2174/1389202916666150423002311 }} and another on plant pangenomes were published.{{cite journal | vauthors = Golicz AA, Batley J, Edwards D | title = Towards plant pangenomics | journal = Plant Biotechnology Journal | volume = 14 | issue = 4 | pages = 1099–105 | date = April 2016 | pmid = 26593040 | doi = 10.1111/pbi.12499 | pmc = 11388911 | url = http://espace.library.uq.edu.au/view/UQ:383261/UQ383261_OA.pdf }} Among the first software packages designed for plant pangenomes were PanTools{{cite journal | vauthors = Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S | title = PanTools: Representation, Storage and Exploration of Pan-Genomic Data | journal = Bioinformatics | volume = 32 | issue = 17 | pages = i487–i493 | date = September 2016 | pmid = 27587666 | doi = 10.1093/bioinformatics/btw455 | doi-access = free }} and GET_HOMOLOGUES-EST.
In 2018, panX was released, an interactive web tool for inspecting the evolutionary history of gene families.{{cite journal | vauthors = Ding W, Baumdicker F, Neher RA | title = panX: pan-genome analysis and exploration | journal = Nucleic Acids Research | volume = 46 | issue = 1 | pages = e5 | date = January 2018 | pmid = 29077859 | pmc = 5758898 | doi = 10.1093/nar/gkx977 }} panX can display an alignment of genomes, a phylogenetic tree, the mapping of mutations, and inference about gain and loss of the family on the core-genome phylogeny. In 2019, OrthoVenn 2.0{{cite journal | vauthors = Xu L, Dong Z, Fang L, Luo Y, Wei Z, Guo H, Zhang G, Gu YQ, Coleman-Derr D, Xia Q, Wang Y | display-authors = 6 | title = OrthoVenn2: a web server for whole-genome comparison and annotation of orthologous clusters across multiple species | journal = Nucleic Acids Research | volume = 47 | issue = W1 | pages = W52–W58 | date = July 2019 | pmid = 31053848 | pmc = 6602458 | doi = 10.1093/nar/gkz333 }} allowed comparative visualization of families of homologous genes in Venn diagrams for up to 12 genomes. In 2023, [https://bridgecereal.scinet.usda.gov/ BRIDGEcereal] was developed to survey and graph indel-based haplotypes from a pan-genome through a gene model ID.{{Cite journal |last1=Zhang |first1=Bosen |last2=Huang |first2=Haiyan |last3=Tibbs-Cortes |first3=Laura E. |last4=Vanous |first4=Adam |last5=Zhang |first5=Zhiwu |last6=Sanguinet |first6=Karen |last7=Garland-Campbell |first7=Kimberly A. |last8=Yu |first8=Jianming |last9=Li |first9=Xianran |title=Streamline unsupervised machine learning to survey and graph indel-based haplotypes from pan-genomes |journal=Molecular Plant |date=2023 |volume=16 |issue=6 |pages=975–978 |language=en |doi=10.1016/j.molp.2023.05.005|doi-access=free |pmid=37202927 }}

File:Pcbi.1007732.g002.png genomes generated with PPanGGOLiN software. Edges correspond to genomic colocalization and nodes correspond to genes. The thickness of the edges is proportional to the number of genomes sharing that link. The edges between persistent (similar to core genes), shell and cloud nodes are colored in orange, green and blue, respectively.]]

In 2020, Anvi'o was available as a multi-omics platform that includes pangenomic and metapangenomic analyses as well as visualization workflows. In Anvi'o, genomes are displayed in concentric circles and each radius represents a gene family, which allows more than 100 genomes to be compared in its interactive visualization. Also in 2020, a computational comparison of tools for extracting gene-based pangenomic content (such as GET_HOMOLOGUES, PanDelos, Roary, and others) was published.{{cite journal | vauthors = Bonnici V, Maresi E, Giugno R | title = Challenges in gene-oriented approaches for pangenome content discovery | journal = Briefings in Bioinformatics | year = 2020 | volume = 22 | issue = 3 | issn = 1477-4054 | doi = 10.1093/bib/bbaa198 | pmid = 32893299 }} The tools were compared from a methodological perspective, with an analysis of the causes that lead a given methodology to outperform the others. The analysis took into account different bacterial populations, which were generated synthetically by varying evolutionary parameters. The results show that the performance of each tool depends on the composition of the input genomes. In the same year, several tools (PPanGGOLiN, Panaroo) introduced graphical representations of pangenomes that show the contiguity of genes.
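The idea behind such gene-contiguity graphs can be sketched as follows: nodes are gene families, an edge links two families that are adjacent in at least one genome, and the edge weight counts the genomes that share the link. This is a simplified illustration and not the algorithm of PPanGGOLiN or Panaroo. The function name and the toy gene orders are hypothetical, and each genome is assumed to be a single linear contig.

```python
from collections import Counter


def colocalization_graph(genomes):
    """Build a weighted gene-neighbourhood graph.

    genomes maps a genome name to its ordered list of gene family ids.
    Returns a Counter whose keys are unordered family pairs (frozensets)
    and whose values are the number of genomes in which the two families
    are adjacent.
    """
    edge_weights = Counter()
    for gene_order in genomes.values():
        # A set ensures each genome contributes at most once per link.
        links = {
            frozenset((left, right))
            for left, right in zip(gene_order, gene_order[1:])
            if left != right  # skip tandem copies of the same family
        }
        edge_weights.update(links)
    return edge_weights


# Hypothetical gene orders for three genomes.
toy_genomes = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["A", "B", "D"],
    "g3": ["A", "B", "C", "E"],
}
weights = colocalization_graph(toy_genomes)
```

In this toy example the link between families A and B has weight 3 because all three genomes share it, whereas links that occur in a single genome have weight 1. In a drawing of the graph, heavier links would be shown as thicker edges.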

Other software tools for pangenomics include Prodigal, Prokka, PanVis, PanTools, Pangenome Graph Builder (PGGB), PanX, Pagoo, and pgr-tk.{{cite journal | last1=Bonnici | first1=Vincenzo | last2=Chicco | first2=Davide | title=Seven quick tips for gene-focused computational pangenomic analysis | journal=BioData Mining | volume=17 | issue=28 | date=2024-09-03 | page=28 | issn=1756-0381 | doi=10.1186/s13040-024-00380-2 | doi-access=free | pmid=39227987 | pmc=11370085 }}

File:Pangenome analysis with BPGA software.png. At the left, the distribution of GO terms by core, dispensable, and unique genome is shown. In this example, the category "replication, recombination, and repair" is enriched in unique gene families. On the right, a typical pan/core plot is shown: as more genomes are added, the size of the core genome decreases while the size of the pangenome increases.]]
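The behaviour of a pan/core plot can be reproduced with a short sketch that adds genomes one at a time and records the size of the union (pangenome) and of the intersection (core genome) of their gene families. The function name and the toy data are hypothetical. The sketch uses a single fixed order of genomes, whereas an analysis would normally combine several orderings; that is an assumption about common practice and not a description of BPGA.

```python
def pan_core_curve(genome_families, order):
    """Return (pangenome size, core size) after each genome is added.

    genome_families maps a genome name to the set of gene families it
    carries; order is the sequence in which the genomes are added.
    """
    pan = set()
    core = None
    curve = []
    for name in order:
        families = genome_families[name]
        pan |= families                      # union can only grow
        if core is None:
            core = set(families)             # first genome: everything is core
        else:
            core &= families                 # intersection can only shrink
        curve.append((len(pan), len(core)))
    return curve


# Hypothetical gene family content of three genomes.
toy_families = {
    "g1": {"A", "B", "C"},
    "g2": {"A", "B", "D"},
    "g3": {"A", "C", "E"},
}
curve = pan_core_curve(toy_families, ["g1", "g2", "g3"])
```

Here the pangenome grows from 3 to 5 families while the core shrinks from 3 families to 1, which is the opposing trend shown in the plot.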
