Note on f3 statistic calculation
Published:
f3 statistic was developed by Patterson and Reich to measure admixture between populations. It is widely used in ancient DNA studies and there are many packages out there which are able to estimate it for you.
How does it work?
Imagine a star phylogeny ((A,B)C). C is an outgroup. A,B and C have some allele frequencies a, b, and c. We can call variants and measure a,b,c. Before A and B diverged, they shared variance in allele frequencies (drift) with the outgroup C. f3 statistics is
<(c-a)*(c-b)>
i.e. overlap of drift shared between AB and C. If A and B are admixed, allele frequencies are affected (the drift has changed). Thus, we will see f3 stat to be affected as well (it will be negative). Genetic drift can be visualised in a form of an admixture graph (see Patterson’s 2012 paper appendix B for an easy explanation).
How to calculate it?
f3 can be calculated using allele counts from samples representing populations, however it can also be estimated genome-wide. Here is an example for dummies.
import allele
g = allel.GenotypeArray([[[0, 0]], [[1, 1]], [[0, 1]]], dtype='i1') # one sample genotyped at three loci
>>> g
<GenotypeArray shape=(3, 1, 2) dtype=int8>
0/0
1/1
0/1
Let’s count alleles. 1st snp has 2 zeros, 2nd snp has two alt alleles (hom alt), 3rd snp has one alttt allele. So allele counts will be: 0, 2, 1. Let’s check it:
ac = g.count_alleles()
>>> ac
<AlleleCountsArray shape=(3, 2) dtype=int32>
2 0
0 2
1 1
Here first column is the number of ref, second column - the number of alt alleles. If we would have three-allelic sites, there would be three columns.
Now let’s calculate vanilla f3 stat.
aca = [[0, 2], [2, 0], [0, 2], [0, 2], [0, 0]] # allele counts for five snips in genome A.
# [0 ref, 2 alt (hom alt)], [hom ref], etc...
acb = [[2, 0], [0, 2], [0, 2], [0, 2], [0, 2]] # allele counts for five snips in genome B
acc = [[1, 1], [1, 1], [0, 2], [2, 0], [1, 1]] # allele counts for five snips in genome C
f3_snp1 = ((acc[0][1] - aca[0][1]) * (acc[0][1] - acb[0][1]))/2 # we divide by 2 because there are max 2 alleles, genome is diploid
f3_snp1 = (1-2) * (1-0) / 2
>>> f3_snp1
-0.5
Now to compute genome-wide estimate just calculate average f3 across the genomes.
NB!
In real life f3 estimate should be corrected for heterozygosity (see Patterson’s 2012 paper appendix A).
References
For this code and more tests see Alistair Miles’s test code for his package scikit-allele.
For a more clear explanation of this and other admixture statistics see Nathan’s review paper