First, let's convert our pruned plink files to eigenstrat format. Use a new parameter file (plink2geno.par), with the following information:
genotypename: Data/HumanOriginsPublic2068_reduced_pruned.bed
snpname: Data/HumanOriginsPublic2068_reduced_pruned.bim
indivname: Data/HumanOriginsPublic2068_reduced_pruned.fam
outputformat: EIGENSTRAT
genotypeoutname: Data/HumanOriginsPublic2068_reduced_pruned.geno
snpoutname: Data/HumanOriginsPublic2068_reduced_pruned.snp
indivoutname: Data/HumanOriginsPublic2068_reduced_pruned.ind
Run convertf:
convertf -p plink2geno.par
Before we proceed, we need to clean up our ind file:
paste <(cut -d ":" -f 2 Data/HumanOriginsPublic2068_reduced_pruned.ind | cut -d " " -f 1) <(cut -d ":" -f 2 Data/HumanOriginsPublic2068_reduced_pruned.ind | cut -d " " -f 2) <(cut -d ":" -f 1 Data/HumanOriginsPublic2068_reduced_pruned.ind | sed 's/ //g') > Data/temp.txt
mv Data/temp.txt Data/HumanOriginsPublic2068_reduced_pruned.ind
We'll begin by computing outgroup F3 statistics to determine which are the populations that share the most drift with French. We'll use Ju_hoan_North - a population from Namibia - as the outgroup. A requirement when computing F3 statistics in AdmixTools is to have a list of population triplets. Copy the following list of triplets to a file and name it OutgroupF3_French.txt. Note that the first column corresponds to our population of interest (French), the second column iterates over a series of populations we'll compare the French to, and the third column is a fixed outgroup, in this case Ju_hoan_North.
French Papuan Ju_hoan_North
French Sardinian Ju_hoan_North
French Orcadian Ju_hoan_North
French Mayan Ju_hoan_North
French Yoruba Ju_hoan_North
French Mbuti Ju_hoan_North
French Karitiana Ju_hoan_North
French Italian_North Ju_hoan_North
French Ami Ju_hoan_North
We also need to create a parameter file (OutgroupF3.par) to specify the location of our input files:
genotypename: Data/HumanOriginsPublic2068_reduced_pruned.geno
snpname: Data/HumanOriginsPublic2068_reduced_pruned.snp
indivname: Data/HumanOriginsPublic2068_reduced_pruned.ind
popfilename: OutgroupF3_French.txt
Now run the program qp3Pop, which comes in the AdmixTools package:
qp3Pop -p OutgroupF3.par > OutgroupF3_French.log
Look at the output log file. Which population shares the largest amount of drift with French? Which shares the least amount of drift?
We can also plot the results. We need to extract the lines from the log file that are of interest to us. We'll also select the unnecessary columns at the beginning of each line:
grep 'result:' OutgroupF3_French.log | awk '{print $2, $3, $4, $5, $6, $7, $8, $9}' > OutgroupF3_French.tab
Let's load it into R and create a plot with error bars for each of the F3 values:
R
library("Hmisc")
f3tab = read.table("OutgroupF3_French.tab", col.names=c("PopA", "PopB", "PopC", "F3", "SE", "Z", "SNPs"))
f3ordered = f3tab[order(-f3tab$F3),]
numrows = dim(f3ordered)[1]
errbar(1:numrows, f3ordered$F3,(f3ordered$F3+f3ordered$SE),(f3ordered$F3-f3ordered$SE), pch=20, las=2, xaxt='n',xlab="", ylab="F3")
axis(1, at=1:numrows, labels=f3ordered$PopB, las=2)
Now, we'll hypothesize that French are an admixed group resulting from the mixture of two populations. We'll try to find the pair of populations in our panel that can best stand in for those two populations. For this, we'll resort to using Admixture F3 statistics, in which the population of interest (French) is now placed in the third position of our population file (where Ju_hoan_North was previously when computing Outgroup F3 statistics). The first and second column will be a candidate pair of populations. The more negative the Admixture F3 statistic, the stronger the violation of "tree-ness" relating the 3 populations - an indication of admixture or population structure.
We'll use a pre-made R script (scripts/BuildF3List.R) to iterate over all possible pairs of populations that do not include French:
cut -f 3 Data/HumanOriginsPublic2068_reduced_pruned.ind | sort | uniq > allpops.txt
Rscript scripts/BuildF3List.R allpops.txt French
We now have a new output file called "f3target_French.popfile", which always contains the French in the 3rd column and all other possible pairs of populations in the first and second columns.
Let's prepare a parameter file, this time called AdmxitureF3.par:
genotypename: Data/HumanOriginsPublic2068_reduced_pruned.geno
snpname: Data/HumanOriginsPublic2068_reduced_pruned.snp
indivname: Data/HumanOriginsPublic2068_reduced_pruned.ind
popfilename: f3target_French.popfile
Finally, run qp3Pop:
qp3Pop -p AdmixtureF3.par > AdmixtureF3_French.log
As before, let's extract the interesting rows and plot them in R. This time, we'll use pair lables rather than the labels of individual popualtions in the x-axis. We'll also order the populations in increasing order:
grep 'result:' AdmixtureF3_French.log | awk '{print $2, $3, $4, $5, $6, $7, $8, $9}' > AdmixtureF3_French.tab
R
library("Hmisc")
f3tab = read.table("AdmixtureF3_French.tab", col.names=c("PopA", "PopB", "Target", "F3", "SE", "Z", "SNPs"))
f3ordered = f3tab[order(f3tab$F3),]
pairs = paste(f3ordered$PopA,f3ordered$PopB,sep=" + ")
numrows = dim(f3ordered)[1]
errbar(1:numrows, f3ordered$F3,(f3ordered$F3+f3ordered$SE),(f3ordered$F3-f3ordered$SE), pch=20, las=2, xaxt='n',xlab="", ylab="F3")
axis(1, at=1:numrows, labels=pairs, las=2,cex.axis=0.4)
abline(h=0)
Are any F3 statistics negative? Are they significantly so? (|K| > 3?) Which pairs of populations do these correspond to? What could this mean?
Finally, we can further confirm that there's been an admixture event in the past history of the French population by lookin for an excess of ABBA or an excess of BABA sites in a 4-population tree. Let's assume that Ju_hoan_North is a distant outgroup and that the French are a sister clade to Sardinians, possibly with some additional admixture from another group. Let's test two groups as potential sources of this admixture event: Karitiana (a Native South American group) and Yoruba (a Western African group).
As before, we need to create a population file (Dstats_French.txt), but this time it will have to have quadruplets instead of triplets:
French Sardinian Karitiana Ju_hoan_North
French Sardinian Yoruba Ju_hoan_North
The parameter file (Dstats.par) should include the following information:
genotypename: Data/HumanOriginsPublic2068_reduced_pruned.geno
snpname: Data/HumanOriginsPublic2068_reduced_pruned.snp
indivname: Data/HumanOriginsPublic2068_reduced_pruned.ind
popfilename: Dstats_French.txt
Now, run the qpDstat program in AdmixTools:
qpDstat -p Dstats.par > Dstats_French.log
Look at the output log file. The first numerical column corresponds to the value of the D-statistic (defined here as (BABA-ABBA)/(BABA+ABBA)). The second numerical column is the Z-score corresponding to this D-statistic. The 3rd and 4th numerical columns are the BABA and ABBA counts, respectively, and the last column is the total number of SNPs available for analysis.
Are any of the two statistics significant? In what direction? (i.e. is there an excess of ABBA or BABA patterns, relative to what you would expect under a 4-population tree?). What could this mean?
To read more about this signal, you can check out the following papers, which suggest that a Northern Eurasian population that contributed ancestry to Native Americans also admixed with the ancestors of certain European populations, including French. Sardinians have almost none of this admixture, which is why F3(Sardinian,Karitiana,French) is so negative:
http://www.genetics.org/content/192/3/1065