Population structure (genetics)
{{Short description|Stratification of a genetic population based on allele frequencies}}
{{cs1 config|name-list-style=vanc}}
Population structure (also called genetic structure and population stratification) is the presence of a systematic difference in allele frequencies between subpopulations. In a randomly mating (or panmictic) population, allele frequencies are expected to be roughly similar between groups. However, mating tends to be non-random to some degree, causing structure to arise. For example, a barrier like a river can separate two groups of the same species and make it difficult for potential mates to cross; if a mutation occurs, over many generations it can spread and become common in one subpopulation while being completely absent in the other.
Genetic variants do not necessarily cause observable changes in organisms, but can be correlated with such changes by coincidence because of population structure: a variant that is common in a population that has a high rate of disease may erroneously be thought to cause the disease. For this reason, population structure is a common confounding variable in medical genetics studies, and accounting for and controlling its effect is important in genome-wide association studies (GWAS). By tracing the origins of structure, it is also possible to study the genetic ancestry of groups and individuals.
Description
The basic cause of population structure in sexually reproducing species is non-random mating between groups: if all individuals within a population mate randomly, then the allele frequencies should be similar between groups. Population structure commonly arises from physical separation by distance or barriers, like mountains and rivers, followed by genetic drift. Other causes include gene flow from migrations, population bottlenecks and expansions, founder effects, evolutionary pressure, random chance, and (in humans) cultural factors. Even in the absence of these factors, individuals tend to stay close to where they were born, which means that alleles will not be distributed at random with respect to the full range of the species.{{cite journal | vauthors = Cardon LR, Palmer LJ | title = Population stratification and spurious allelic association | journal = Lancet | volume = 361 | issue = 9357 | pages = 598–604 | date = February 2003 | pmid = 12598158 | doi = 10.1016/S0140-6736(03)12520-2 | s2cid = 14255234 }}{{cite web | vauthors = McVean G | author-link1 = Gil McVean | url = http://www.stats.ox.ac.uk/~mcvean/notes7.pdf | archive-url = https://web.archive.org/web/20181123141715/http://www.stats.ox.ac.uk/~mcvean/notes7.pdf | archive-date = 2018-11-23| title = Population Structure | year = 2001 | access-date = 2020-11-14}}
Measures
Population structure is a complex phenomenon and no single measure captures it entirely. Understanding a population's structure requires a combination of methods and measures. Many statistical methods rely on simple population models in order to infer historical demographic changes, such as the presence of population bottlenecks, admixture events or population divergence times. Often these methods rely on the assumption of panmixia, or homogeneity in an ancestral population. Misspecification of such models, for instance by not taking into account the existence of structure in an ancestral population, can give rise to heavily biased parameter estimates.{{cite journal | vauthors = Scerri EM, Thomas MG, Manica A, Gunz P, Stock JT, Stringer C, Grove M, Groucutt HS, Timmermann A, Rightmire GP, d'Errico F, Tryon CA, Drake NA, Brooks AS, Dennell RW, Durbin R, Henn BM, Lee-Thorp J, deMenocal P, Petraglia MD, Thompson JC, Scally A, Chikhi L | display-authors = 6 | title = Did Our Species Evolve in Subdivided Populations across Africa, and Why Does It Matter? | journal = Trends in Ecology & Evolution | volume = 33 | issue = 8 | pages = 582–594 | date = August 2018 | pmid = 30007846 | pmc = 6092560 | doi = 10.1016/j.tree.2018.05.005 | bibcode = 2018TEcoE..33..582S | author-link9 = Axel Timmermann }} Simulation studies show that historical population structure can even have genetic effects that are easily misinterpreted as historical changes in population size, or as admixture events, even when no such events occurred.{{cite journal | vauthors = Rodríguez W, Mazet O, Grusea S, Arredondo A, Corujo JM, Boitard S, Chikhi L | title = The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure | journal = Heredity | volume = 121 | issue = 6 | pages = 663–678 | date = December 2018 | pmid = 30293985 | pmc = 6221895 | doi = 10.1038/s41437-018-0148-0 }}
= Heterozygosity =
[[File:Loss of heterozygosity over time in a bottlenecking population with label.png|thumb|Population bottlenecks can result in a loss of heterozygosity. In this hypothetical population, an allele has become fixed after the population repeatedly dropped from 10 to 3.]]
One of the results of population structure is a reduction in heterozygosity. When populations split, alleles have a higher chance of reaching fixation within subpopulations, especially if the subpopulations are small or have been isolated for long periods. This reduction in heterozygosity can be thought of as an extension of inbreeding, with individuals in subpopulations being more likely to share a recent common ancestor.{{Cite book|last1=Hartl|first1=Daniel L.|last2=Clark|first2=Andrew G.|url=https://www.worldcat.org/oclc/37481398|title=Principles of population genetics|date=1997|publisher=Sinauer Associates|isbn=0-87893-306-9|edition=3rd|location=Sunderland, MA|oclc=37481398|pages=111–163}} The scale is important: an individual with both parents born in the United Kingdom is not inbred relative to that country's population, but is more inbred than two humans selected from the entire world. This motivates the derivation of Wright's F-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity.{{cite journal|last1=Wright|first1=Sewall|title=The Genetical Structure of Populations|journal=Annals of Eugenics|volume=15|issue=1|year=1949|pages=323–354|issn=2050-1420|doi=10.1111/j.1469-1809.1949.tb02451.x|pmid=24540312}} For example, <math>F_{IS}</math> measures the inbreeding coefficient at a single locus for an individual <math>I</math> relative to some subpopulation <math>S</math>:{{cite book | title = Population and Quantitative Genetics | last = Coop | first = Graham | year = 2019 | pages=22–44}}
:<math>F_{IS} = 1 - \frac{H_I}{H_S}</math>
Here, <math>H_I</math> is the fraction of individuals in subpopulation <math>S</math> that are heterozygous. Assuming there are two alleles, <math>A_1</math> and <math>A_2</math>, that occur at respective frequencies <math>p_S</math> and <math>q_S</math>, it is expected that under random mating the subpopulation <math>S</math> will have a heterozygosity rate of <math>H_S = 2 p_S q_S</math>. Then:
:<math>F_{IS} = 1 - \frac{H_I}{2 p_S q_S}</math>
Similarly, for the total population <math>T</math>, we can define <math>H_T = 2 p_T q_T</math>, allowing us to compute the expected heterozygosity of subpopulation <math>S</math> and the value <math>F_{ST}</math> as:
:<math>F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{2 p_S q_S}{2 p_T q_T}</math>
If <math>F_{ST}</math> is 0, then the allele frequencies between populations are identical, suggesting no structure. The theoretical maximum value of 1 is attained when an allele reaches total fixation, but most observed maximum values are far lower. <math>F_{ST}</math> is one of the most common measures of population structure and there are several different formulations depending on the number of populations and the alleles of interest. Although it is sometimes used as a genetic distance between populations, it does not always satisfy the triangle inequality and thus is not a metric.{{cite journal|last1=Arbisser|first1=Ilana M.|last2=Rosenberg|first2=Noah A.|title=FST and the triangle inequality for biallelic markers|journal=Theoretical Population Biology|volume=133|year=2020|pages=117–129|issn=0040-5809|doi=10.1016/j.tpb.2019.05.003|pmid=31132375|pmc=8448291}} It also depends on within-population diversity, which makes interpretation and comparison difficult.{{cite journal|last1=Meirmans|first1=Patrick G.|last2=Hedrick|first2=Philip W.|title=Assessing population structure:FST and related measures|journal=Molecular Ecology Resources|volume=11|issue=1|year=2010|pages=5–18|issn=1755-098X|doi=10.1111/j.1755-0998.2010.02927.x|pmid=21429096|s2cid=24403040|doi-access=free}}
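The heterozygosity-based definitions above can be illustrated with a short Python sketch. It is not taken from any cited source: the function names are hypothetical, the locus is assumed to be biallelic and polymorphic in the total population, and averaging the expected heterozygosity over several subpopulations (with equal weights by default) is an added assumption that corresponds to only one of the several formulations in use.

```python
def expected_heterozygosity(p):
    """Expected heterozygote fraction 2pq at a biallelic locus under random mating."""
    return 2.0 * p * (1.0 - p)


def f_is(observed_het, p_sub):
    """F_IS = 1 - H_I / H_S for one subpopulation with allele frequency p_sub."""
    return 1.0 - observed_het / expected_heterozygosity(p_sub)


def f_st(p_subs, weights=None):
    """F_ST = 1 - H_S / H_T, with H_S averaged over subpopulations (an assumption)."""
    if weights is None:
        weights = [1.0 / len(p_subs)] * len(p_subs)
    # Allele frequency in the total population is the weighted mean frequency.
    p_total = sum(w * p for w, p in zip(weights, p_subs))
    h_s = sum(w * expected_heterozygosity(p) for w, p in zip(weights, p_subs))
    h_t = expected_heterozygosity(p_total)  # must be non-zero (polymorphic locus)
    return 1.0 - h_s / h_t


# Two equally sized subpopulations with allele frequencies 0.2 and 0.8.
example_fst = f_st([0.2, 0.8])
# Half as many heterozygotes as expected in a subpopulation with p = 0.2.
example_fis = f_is(0.16, 0.2)
```

Identical frequencies in the two subpopulations give a value of 0, and fixation of different alleles in each gives 1, matching the limits described in the text.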
= Admixture inference =
An individual's genotype can be modelled as an admixture between K discrete clusters of populations. Each cluster is defined by the frequencies of its genotypes, and the contribution of a cluster to an individual's genotypes is measured via an estimator. In 2000, Jonathan K. Pritchard introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo, modelling allele frequencies at each locus with a Dirichlet distribution.{{cite journal|last1=Pritchard|first1=Jonathan K|last2=Stephens|first2=Matthew|last3=Donnelly|first3=Peter|title=Inference of Population Structure Using Multilocus Genotype Data|journal=Genetics|volume=155|issue=2|year=2000|pages=945–959|issn=1943-2631|doi=10.1093/genetics/155.2.945|pmid=10835412|pmc=1461096 |doi-access=free}} Since then, algorithms (such as ADMIXTURE) have been developed using other estimation techniques.{{cite journal|last1=Alexander|first1=D. H.|last2=Novembre|first2=J.|last3=Lange|first3=K.|title=Fast model-based estimation of ancestry in unrelated individuals|journal=Genome Research|volume=19|issue=9|year=2009|pages=1655–1664|issn=1088-9051|doi=10.1101/gr.094052.109|pmid=19648217|pmc=2752134}}{{cite journal |vauthors=Novembre J, Ramachandran S |title=Perspectives on human population structure at the cusp of the sequencing era |journal=Annu Rev Genomics Hum Genet |volume=12 |issue= 1|pages=245–74 |date=2011 |pmid=21801023 |doi=10.1146/annurev-genom-090810-183123 |url=}} Estimated proportions can be visualized using bar plots — each bar represents an individual, and is subdivided to represent the proportion of an individual's genetic ancestry from one of the K populations.
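As a rough illustration of the model described above, the following Python sketch evaluates a binomial likelihood for allele counts given ancestry proportions and cluster allele frequencies. It is a simplified stand-in and not the STRUCTURE or ADMIXTURE implementation; the function names, the matrix layout, and the restriction to biallelic loci with cluster frequencies strictly between 0 and 1 are all assumptions made for the example.

```python
import math


def allele_probability(q_i, freqs_at_locus):
    """Mixture allele frequency for one individual at one locus: sum over k of q_ik * f_kj."""
    return sum(q * f for q, f in zip(q_i, freqs_at_locus))


def genotype_log_likelihood(genotypes, q, f):
    """Log-likelihood of allele counts (0, 1 or 2) under a binomial admixture model.

    genotypes[i][j]: copies of the counted allele in individual i at locus j
    q[i][k]: ancestry proportion of individual i from cluster k (each row sums to 1)
    f[k][j]: frequency of the counted allele in cluster k at locus j, in (0, 1)
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            p = allele_probability(q[i], [f_k[j] for f_k in f])
            # Binomial(2, p) log-probability of observing g copies.
            total += (math.log(math.comb(2, g))
                      + g * math.log(p)
                      + (2 - g) * math.log(1.0 - p))
    return total


# One individual carrying two copies of an allele that is common in cluster 0.
cluster_freqs = [[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]]
observed = [[2, 2, 2]]
ll_cluster0 = genotype_log_likelihood(observed, [[1.0, 0.0]], cluster_freqs)
ll_cluster1 = genotype_log_likelihood(observed, [[0.0, 1.0]], cluster_freqs)
```

In this toy case the likelihood is higher when the individual is assigned entirely to cluster 0. Inference methods search over the proportions and frequencies jointly rather than comparing two fixed assignments as done here.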
Varying K can illustrate different scales of population structure; using a small K for the entire human population will subdivide people roughly by continent, while using large K will partition populations into finer subgroups. Though clustering methods are popular, they are open to misinterpretation: for non-simulated data, there is never a "true" value of K, but rather an approximation considered useful for a given question.{{cite journal|last1=Lawson|first1=Daniel J.|last2=van Dorp|first2=Lucy|last3=Falush|first3=Daniel|title=A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots|journal=Nature Communications|volume=9|issue=1|year=2018|page=3258|issn=2041-1723|doi=10.1038/s41467-018-05257-7|pmid=30108219|pmc=6092366|bibcode=2018NatCo...9.3258L}} They are sensitive to sampling strategies, sample size, and close relatives in data sets; there may be no discrete populations at all; and there may be hierarchical structure where subpopulations are nested. Clusters may be admixed themselves, and may not have a useful interpretation as source populations.{{cite journal|last1=Novembre|first1=John|title=Pritchard, Stephens, and Donnelly on Population Structure|journal=Genetics|volume=204|issue=2|year=2016|pages=391–393|issn=1943-2631|doi=10.1534/genetics.116.195164|pmid=27729489|pmc=5068833}}
{{wide image|1=Map_of_samples_and_population_structure_of_North_Africa_and_neighboring_populations.png|2=800px|3=A study of population structure of humans in Northern Africa and neighboring populations modelled using ADMIXTURE and assuming K=2,4,6,8 populations (Figure B, top to bottom). Varying K changes the scale of clustering. At K=2, 80% of the inferred ancestry for most North Africans is assigned to a cluster that is common to Basque, Tuscan, and Qatari Arab individuals (in purple). At K=4, clines of North African ancestry appear (in light blue). At K=6, opposite clines of Near Eastern (Qatari) ancestry appear (in green). At K=8, Tunisian Berbers appear as a cluster (in dark blue).{{cite journal |vauthors=Henn BM, Botigué LR, Gravel S, Wang W, Brisbin A, Byrnes JK, Fadhlaoui-Zid K, Zalloua PA, Moreno-Estrada A, Bertranpetit J, Bustamante CD, Comas D |title=Genomic ancestry of North Africans supports back-to-Africa migrations |journal=PLOS Genet |volume=8 |issue=1 |pages=e1002397 |date=January 2012 |pmid=22253600 |pmc=3257290 |doi=10.1371/journal.pgen.1002397 |url= |doi-access=free }}}}
= Dimensionality reduction =
[[File:Procrustes-transformed PCA plot of genetic variation of Sub-Saharan African populations.png|thumb]]
Genetic data are high dimensional and dimensionality reduction techniques can capture population structure. Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues and resurged with high-throughput sequencing.{{cite journal|last1=Menozzi|first1=P|last2=Piazza|first2=A|last3=Cavalli-Sforza|first3=L|title=Synthetic maps of human gene frequencies in Europeans|journal=Science|volume=201|issue=4358|year=1978|pages=786–792|issn=0036-8075|doi=10.1126/science.356262|pmid=356262|bibcode=1978Sci...201..786M}} Initially PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals.{{cite journal | vauthors = Patterson N, Price AL, Reich D | title = Population structure and eigenanalysis | journal = PLOS Genetics | volume = 2 | issue = 12 | pages = e190 | date = December 2006 | pmid = 17194218 | pmc = 1713260 | doi = 10.1371/journal.pgen.0020190 | doi-access = free }} One formulation considers <math>N</math> individuals and <math>M</math> bi-allelic SNPs. For each individual <math>i</math>, the value <math>g_{ij}</math> at locus <math>j</math> is the number of non-reference alleles (one of <math>0, 1, 2</math>). If the allele frequency at <math>j</math> is <math>p_j</math>, then the resulting <math>N \times M</math> matrix of normalized genotypes has entries:
:<math>\frac{g_{ij} - 2 p_j}{\sqrt{2 p_j (1 - p_j)}}</math>
PCA transforms data to maximize variance; given enough data, when each individual is visualized as a point on a plot, discrete clusters can form. Individuals with admixed ancestries will tend to fall between clusters, and when there is homogeneous isolation by distance in the data, the top PC vectors will reflect geographic variation.{{cite journal|last1=Novembre|first1=John|last2=Johnson|first2=Toby|last3=Bryc|first3=Katarzyna|last4=Kutalik|first4=Zoltán|last5=Boyko|first5=Adam R.|last6=Auton|first6=Adam|last7=Indap|first7=Amit|last8=King|first8=Karen S.|last9=Bergmann|first9=Sven|last10=Nelson|first10=Matthew R.|last11=Stephens|first11=Matthew|last12=Bustamante|first12=Carlos D.|title=Genes mirror geography within Europe|journal=Nature|volume=456|issue=7218|year=2008|pages=98–101|issn=0028-0836|doi=10.1038/nature07331|pmid=18758442|pmc=2735096|bibcode=2008Natur.456...98N}} The eigenvectors generated by PCA can be explicitly written in terms of the mean coalescent times for pairs of individuals, making PCA useful for inference about the population histories of groups in a given sample. PCA cannot, however, distinguish between different processes that lead to the same mean coalescent times.{{cite journal|last1=McVean|first1=Gil|title=A Genealogical Interpretation of Principal Components Analysis|journal=PLOS Genetics|volume=5|issue=10|year=2009|pages=e1000686|issn=1553-7404|doi=10.1371/journal.pgen.1000686|pmid=19834557|pmc=2757795 |doi-access=free }}
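The normalization and projection steps described above can be sketched in plain Python. This is a toy illustration only: the function names are hypothetical, allele frequencies are estimated from the sample itself, and the first principal component is found by power iteration on the individual-by-individual matrix, which is an assumption made to keep the example dependency-free rather than the approach of any particular software package.

```python
import math


def normalize_genotypes(genotypes):
    """Centre and scale allele counts: (g - 2p) / sqrt(2p(1 - p)) for each SNP."""
    n = len(genotypes)
    m = len(genotypes[0])
    normalized = [[0.0] * m for _ in range(n)]
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2.0 * n)  # sample allele frequency
        if p == 0.0 or p == 1.0:
            continue  # monomorphic SNP: no variance, column stays at zero
        scale = math.sqrt(2.0 * p * (1.0 - p))
        for i in range(n):
            normalized[i][j] = (genotypes[i][j] - 2.0 * p) / scale
    return normalized


def first_pc_scores(x, iterations=200):
    """Leading eigenvector of X X^T by power iteration (one value per individual).

    Assumes x is not all zeros, otherwise the normalization step divides by zero.
    """
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Four individuals, four SNPs: the first two and last two individuals
# carry largely different alleles.
toy_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 1],
]
toy_scores = first_pc_scores(normalize_genotypes(toy_genotypes))
```

In this toy data set the first two individuals receive scores of one sign and the last two receive scores of the opposite sign, which is the kind of separation that appears as clusters on a PCA plot. The overall sign of a principal component is arbitrary.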
Multidimensional scaling and discriminant analysis have been used to study differentiation, population assignment, and to analyze genetic distances.{{cite journal |vauthors=Jombart T, Pontier D, Dufour AB |title=Genetic markers in the playground of multivariate analysis |journal=Heredity (Edinb) |volume=102 |issue=4 |pages=330–41 |date=April 2009 |pmid=19156164 |doi=10.1038/hdy.2008.130 |s2cid=10739417 |url=|doi-access=free }} Neighborhood graph approaches like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) can visualize continental and subcontinental structure in human data.{{cite journal |vauthors=Li W, Cerise JE, Yang Y, Han H |title=Application of t-SNE to human genetic data |journal=J Bioinform Comput Biol |volume=15 |issue=4 |pages=1750017 |date=August 2017 |pmid=28718343 |doi=10.1142/S0219720017500172 |url=}}{{cite journal |vauthors=Diaz-Papkovich A, Anderson-Trocmé L, Ben-Eghan C, Gravel S |title=UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts |journal=PLOS Genet |volume=15 |issue=11 |pages=e1008432 |date=November 2019 |pmid=31675358 |pmc=6853336 |doi=10.1371/journal.pgen.1008432 |doi-access=free }} With larger datasets, UMAP better captures multiple scales of population structure; fine-scale patterns can be hidden or split with other methods, and these are of interest when the range of populations is diverse, when there are admixed populations, or when examining relationships between genotypes, phenotypes, and/or geography.{{cite journal |vauthors=Sakaue S, Hirata J, Kanai M, Suzuki K, Akiyama M, Lai Too C, Arayssi T, Hammoudeh M, Al Emadi S, Masri BK, Halabi H, Badsha H, Uthman IW, Saxena R, Padyukov L, Hirata M, Matsuda K, Murakami Y, Kamatani Y, Okada Y |title=Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction |journal=Nat Commun |volume=11 |issue=1 |pages=1569 
|date=March 2020 |pmid=32218440 |doi=10.1038/s41467-020-15194-z|pmc=7099015 |bibcode=2020NatCo..11.1569S }} Variational autoencoders can generate artificial genotypes with structure representative of the input data, though they do not recreate linkage disequilibrium patterns.{{cite journal |vauthors=Battey CJ, Coffing GC, Kern AD |title=Visualizing population structure with variational autoencoders |journal=G3 (Bethesda) |volume=11 |issue=1 |pages= |date=January 2021 |pmid=33561250 |pmc=8022710 |doi=10.1093/g3journal/jkaa036 |url=}}
Demographic inference
Population structure is an important aspect of evolutionary and population genetics. Events like migrations and interactions between groups leave a genetic imprint on populations. Admixed populations will have haplotype chunks from their ancestral groups, which gradually shrink over time because of recombination. By exploiting this fact and matching shared haplotype chunks from individuals within a genetic dataset, researchers may trace and date the origins of population admixture and reconstruct historic events such as the rise and fall of empires, slave trades, colonialism, and population expansions.{{cite journal | vauthors = Hellenthal G, Busby GB, Band G, Wilson JF, Capelli C, Falush D, Myers S | title = A genetic atlas of human admixture history | journal = Science | volume = 343 | issue = 6172 | pages = 747–751 | date = February 2014 | pmid = 24531965 | pmc = 4209567 | doi = 10.1126/science.1243518 | bibcode = 2014Sci...343..747H }}
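The dating idea can be illustrated with a deliberately simplified Python sketch. It assumes a single pulse of admixture, under which a measure of shared ancestry between pairs of sites is commonly modelled as decaying exponentially with genetic distance at a rate equal to the number of generations since admixture. The curve below is synthetic and noise-free, the function name is hypothetical, and the log-linear fit is a simplification rather than the procedure used in the cited study.

```python
import math


def estimate_generations(distances_morgans, curve_values):
    """Fit y = A * exp(-g * d) by least squares on log(y) and return g.

    distances_morgans: genetic distances between pairs of sites, in Morgans
    curve_values: positive values of the decay curve at those distances
    """
    logs = [math.log(y) for y in curve_values]
    n = len(distances_morgans)
    mean_d = sum(distances_morgans) / n
    mean_l = sum(logs) / n
    cov = sum((d - mean_d) * (l - mean_l)
              for d, l in zip(distances_morgans, logs))
    var = sum((d - mean_d) ** 2 for d in distances_morgans)
    return -cov / var  # slope of log(y) against d is -g


# Idealized curve for an admixture pulse 30 generations ago (made-up values).
distances = [0.01 * k for k in range(1, 21)]  # 1 to 20 centimorgans
curve = [0.8 * math.exp(-30 * d) for d in distances]
generations = estimate_generations(distances, curve)
```

With real data the curve is noisy, may reflect several admixture events, and depends on the accuracy of the genetic map, so estimates are normally reported with uncertainty.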
Role in genetic epidemiology
Population structure can be a problem for association studies, such as case-control studies, because an apparent association between the trait of interest and a locus may be spurious. As an example, in a study population of Europeans and East Asians, an association study of chopstick usage may "discover" a gene in the East Asian individuals that leads to chopstick use. However, this is a spurious relationship, as the genetic variant is simply more common in East Asians than in Europeans.{{cite journal | vauthors = Hamer D, Sirota L | title = Beware the chopsticks gene | journal = Molecular Psychiatry | volume = 5 | issue = 1 | pages = 11–3 | date = January 2000 | pmid = 10673763 | doi = 10.1038/sj.mp.4000662 | s2cid = 9760182 }} Also, actual genetic findings may be overlooked if the locus is less prevalent in the population where the case subjects are chosen. For this reason, it was common in the 1990s to use family-based data where the effect of population structure can easily be controlled for using methods such as the transmission disequilibrium test (TDT).{{cite journal | vauthors = Pritchard JK, Rosenberg NA | title = Use of unlinked genetic markers to detect population stratification in association studies | journal = American Journal of Human Genetics | volume = 65 | issue = 1 | pages = 220–8 | date = July 1999 | pmid = 10364535 | pmc = 1378093 | doi = 10.1086/302449 }}
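The spurious association in the example above can be reproduced with invented counts. In the Python sketch below, the numbers and function names are purely illustrative: within each of two hypothetical populations the variant and the trait are independent, yet pooling the populations produces a strong apparent association.

```python
def odds_ratio(table):
    """Odds ratio (a * d) / (b * c) for a 2x2 table ((a, b), (c, d)).

    Rows: variant carriers, non-carriers. Columns: trait present, trait absent.
    """
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def pool(tables):
    """Add 2x2 tables cell by cell, ignoring which population each came from."""
    a = sum(t[0][0] for t in tables)
    b = sum(t[0][1] for t in tables)
    c = sum(t[1][0] for t in tables)
    d = sum(t[1][1] for t in tables)
    return ((a, b), (c, d))


# Population 1: 80% carriers, 90% with the trait, independent of each other.
population_1 = ((720, 80), (180, 20))
# Population 2: 20% carriers, 10% with the trait, independent of each other.
population_2 = ((20, 180), (80, 720))

or_pop1 = odds_ratio(population_1)    # 1.0: no association within population 1
or_pop2 = odds_ratio(population_2)    # 1.0: no association within population 2
or_pooled = odds_ratio(pool([population_1, population_2]))  # about 8.1
```

Analysing each population separately, or otherwise adjusting for ancestry, removes the apparent effect in this constructed case.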
Phenotypes (measurable traits), such as height or risk for heart disease, are the product of some combination of genes and environment. These traits can be estimated using polygenic scores, which seek to isolate and estimate the contribution of genetics to a trait by summing the effects of many individual genetic variants. To construct a score, researchers first enroll participants in an association study to estimate the contribution of each genetic variant. Then, they can use the estimated contributions of each genetic variant to calculate a score for the trait for an individual who was not in the original association study. If structure in the study population is correlated with environmental variation, then the polygenic score is no longer measuring the genetic component alone.{{cite journal | vauthors = Blanc J, Berg JJ | title = How well can we separate genetics from the environment? | journal = eLife | volume = 9 | pages = e64948 | date = December 2020 | pmid = 33355092 | doi = 10.7554/eLife.64948 | pmc = 7758058 | doi-access = free }}
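The scoring step described above amounts to a weighted sum, as in this minimal Python sketch. The variant identifiers and effect sizes are made up for illustration, the function name is hypothetical, and practical pipelines involve further steps (such as variant selection and handling of correlated variants) that are not shown.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Sum of per-variant effect estimates weighted by the individual's allele counts.

    allele_counts: variant identifier -> copies of the effect allele (0, 1 or 2)
    effect_sizes: variant identifier -> effect estimated in the association study
    Variants without an estimated effect are ignored.
    """
    return sum(effect_sizes[variant] * count
               for variant, count in allele_counts.items()
               if variant in effect_sizes)


# Hypothetical effect estimates from an association study.
estimated_effects = {"variant_1": 0.10, "variant_2": -0.05, "variant_3": 0.20}
# Genotypes of a new individual who was not part of that study.
new_individual = {"variant_1": 2, "variant_2": 0, "variant_3": 1, "variant_4": 1}
score = polygenic_score(new_individual, estimated_effects)
```

If the effect estimates absorbed environmental differences that track population structure in the original study, the same arithmetic carries that bias into every score computed from them.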
Several methods can at least partially control for this confounding effect. The genomic control method was introduced in 1999 and is a relatively nonparametric method for controlling the inflation of test statistics.{{cite journal | vauthors = Devlin B, Roeder K | title = Genomic control for association studies | journal = Biometrics | volume = 55 | issue = 4 | pages = 997–1004 | date = December 1999 | pmid = 11315092 | doi = 10.1111/j.0006-341X.1999.00997.x | s2cid = 6297807 }} It is also possible to use unlinked genetic markers to estimate each individual's ancestry proportions from some K subpopulations, which are assumed to be unstructured.{{cite journal | vauthors = Pritchard JK, Stephens M, Rosenberg NA, Donnelly P | title = Association mapping in structured populations | journal = American Journal of Human Genetics | volume = 67 | issue = 1 | pages = 170–81 | date = July 2000 | pmid = 10827107 | pmc = 1287075 | doi = 10.1086/302959 }} More recent approaches make use of principal component analysis (PCA), as demonstrated by Alkes Price and colleagues,{{cite journal | vauthors = Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D | title = Principal components analysis corrects for stratification in genome-wide association studies | journal = Nature Genetics | volume = 38 | issue = 8 | pages = 904–9 | date = August 2006 | pmid = 16862161 | doi = 10.1038/ng1847 | s2cid = 8127858 }} or by deriving a genetic relationship matrix (also called a kinship matrix) and including it in a linear mixed model (LMM).{{cite journal | vauthors = Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES | display-authors = 6 | title = A unified mixed-model method for association mapping that accounts for multiple levels of relatedness | journal = Nature Genetics | volume = 38 | issue = 2 | pages = 203–8 | date = February 2006 | pmid = 16380716 | doi = 10.1038/ng1702 | s2cid = 8507433 
}}{{cite journal | vauthors = Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, Chasman DI, Ridker PM, Neale BM, Berger B, Patterson N, Price AL | display-authors = 6 | title = Efficient Bayesian mixed-model analysis increases association power in large cohorts | journal = Nature Genetics | volume = 47 | issue = 3 | pages = 284–90 | date = March 2015 | pmid = 25642633 | pmc = 4342297 | doi = 10.1038/ng.3190 | author-link5 = Hilary Finucane }}
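As an illustration of the genomic control idea, the following Python sketch estimates an inflation factor as the ratio of the observed median test statistic to the median expected under the null hypothesis, then rescales the statistics. It assumes 1-degree-of-freedom chi-square association statistics; the function names are hypothetical, and the lower bound of 1 on the factor is a common convention adopted here rather than something stated in the text.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to its expected value under the null."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by the inflation factor, never inflating them."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return [s / lam for s in chi2_stats]


# Made-up statistics whose median is twice the null expectation.
observed_stats = [0.2, 0.9098728, 3.0]
lam = inflation_factor(observed_stats)        # approximately 2.0
corrected_stats = genomic_control(observed_stats)
```

Because a single factor is applied to all markers, this correction treats inflation as uniform across the genome, which is one reason the PCA and mixed-model approaches mentioned above are often preferred.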
PCA and LMMs have become the most common methods to control for confounding from population structure. Though they are likely sufficient for avoiding false positives in association studies, they are still vulnerable to overestimating effect sizes of marginally associated variants and can substantially bias estimates of polygenic scores and trait heritability.{{cite journal | vauthors = Zaidi AA, Mathieson I | title = Demographic history mediates the effect of stratification on polygenic scores | journal = eLife | volume = 9 | pages = e61548 | date = November 2020 | pmid = 33200985 | doi = 10.7554/eLife.61548 | pmc = 7758063 | veditors = Perry GH, Turchin MC, Martin P | doi-access = free }}{{cite journal | vauthors = Sohail M, Maier RM, Ganna A, Bloemendal A, Martin AR, Turchin MC, Chiang CW, Hirschhorn J, Daly MJ, Patterson N, Neale B, Mathieson I, Reich D, Sunyaev SR | display-authors = 6 | title = Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies | journal = eLife | volume = 8 | pages = e39702 | date = March 2019 | pmid = 30895926 | doi = 10.7554/eLife.39702 | pmc = 6428571 | veditors = Nordborg M, McCarthy MI, Barton NH, Hermisson J | doi-access = free }} If environmental effects are related to a variant that exists in only one specific region (for example, a pollutant is found in only one city), it may not be possible to correct for this population structure effect at all. For many traits, the role of structure is complex and not fully understood, and incorporating it into genetic studies remains a challenge and is an active area of research.{{cite journal | vauthors = Lawson DJ, Davies NM, Haworth S, Ashraf B, Howe L, Crawford A, Hemani G, Davey Smith G, Timpson NJ | display-authors = 6 | title = Is population structure in the genetic biobank era irrelevant, a challenge, or an opportunity? 
| journal = Human Genetics | volume = 139 | issue = 1 | pages = 23–41 | date = January 2020 | pmid = 31030318 | pmc = 6942007 | doi = 10.1007/s00439-019-02014-8 }}
References
{{Reflist|30em}}
{{Population genetics}}
{{Genetics-footer}}