Supplemental Code for:

Pleistocene climate fluctuations drove demographic history of African golden wolves (Canis lupaster)


Carlos Sarabia1�, Juan C. Larrasoa�a2, Vicente Ur�os3, Bridgett vonHoldt4, Jennifer A. Leonard1�

1 Conservation and Evolutionary Genetics Group, Estaci�n Biol�gica de Do�ana (EBD-CSIC)
2 Instituto Geol�gico y Minero de Espa�a (IGME)
3 Vertebrate Zoology Research Group, University of Alicante
4 Faculty of Ecology and Evolutionary Biology, University of Princeton 
� Corresponding authors: Carlos Sarabia (cdomsar@gmail.com); Jennifer A. Leonard (jleonard@ebd.csic.es)


#######################################################################################################
#######################################################################################################
00.cutadapt.sh: Adapter trimming for raw data genomes

###########################################
##### Adapter trimming for raw data genomes #####
###########################################

# This script will remove adapters and low quality sequences from our raw data genomes, sequenced in Illumina 1.3+ platforms. The script must take 2.5-4 hours to run depending on library size.

###################################
##### Variable and PATHs definition #####
###################################
# We will have five PATHs:

# PATH to all intermediate (.bam) files of this pipeline
BAMFOLD='/path/to/bamfolder'
# PATH to all sanity checks (flagstat files)
FLAGFOLD='/path/to/flagstat/output'
# PATH for temporary files generated by bwa mem
TEMP='/path/to/store/temporary/files'
# PATH where our program executables are stored (can also be defined as variables)
PROG='/path/to/programs/folder'
# PATH with the folder where we downloaded our raw genome data.
GENFOLD='/path/to/genomes'

#################################
#### Sequence quality assessment ####
#################################

# First we want to estimate and visualize per base sequence quality, k-mer and GC content, sequence length distribution...
# We will do this with FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
# This example was done with a library of the Algerian African golden wolf - Liu et al., (2018)

$PROG/FastQC/fastqc $GENFOLD/C4PCGANXX_6.R1.fastq.gz -o $BAMFOLD
$PROG/FastQC/fastqc $GENFOLD/C4PCGANXX_6.R2.fastq.gz -o $BAMFOLD

######################
#### Adapter trimming ###
######################

# We will remove adapters commonly used in Illumina 1.3+ platforms. 

python $PROG/cutadapt/bin/cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT --minimum-length 20 -q 20 -o $BAMFOLD/C4PCGANXX_6.trimmed.R1.fastq.gz -p $BAMFOLD/C4PCGANXX_6.trimmed.R2.fastq.gz $GENFOLD/C4PCGANXX_6.R1.fastq.gz $GENFOLD/C4PCGANXX_6.R2.fastq.gz 2> $FLAGFOLD/reportcut.afr_wolf.algeria.txt

#################################
#### Sequence quality assessment ####
#################################

# We want to check how if our adapters have been correctly trimmed. We run FastQC again. 
$PROG/FastQC/fastqc $BAMFOLD/C4PCGANXX_6.trimmed.R1.fastq.gz -o $BAMFOLD
$PROG/FastQC/fastqc $BAMFOLD/C4PCGANXX_6.trimmed.R2.fastq.gz -o $BAMFOLD


#######################################################################################################
#######################################################################################################
00.reference.genomes.index.sh: Indexing our reference genomes

######################################################
## Indexing and creating a dictionary for our reference genomes ##
######################################################

# Our reference genomes are in .fasta format and require being indexed for some programs of our pipeline to run. Also, we will create a dictionary using Picard. This step must not take longer than a couple of hours for a 3.4-Gbases genome.

#############################
## Variable and PATHs definition ##
#############################
# We will have two PATHs:

PROG='/path/to/programs/folder'
# PATH with the fasta reference genome (in this case domestic dog CanFam3.1 - Lindblad-Toh et al., (2005), but can be also Lycaon pictus - Campana et al., (2016)
REF='/path/to/ref/CanFam3.1'

###################################################
## Indexing and creating dictionary of the reference genome ##
###################################################

$PROG/bwa/bwa index $REF

$PROG/samtools/samtools faidx $REF

java -jar $PROG/picard-tools/picard.jar CreateSequenceDictionary R=/path/to/ref/CanFam3.1.fa O=/path/to/ref/CanFam3.1.dict

################
## REFERENCES #
################

# Lindblad-Toh, K., et al. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069), 803�819. https://doi.org/10.1038/nature04338
# Campana, M. G., et al. (2016). Genome sequence, population history, and pelage genetics of the endangered African wild dog (Lycaon pictus). BMC Genomics, 17(1), 1�10. https://doi.org/10.1186/s12864-016-3368-9


#######################################################################################################
#######################################################################################################
01.preprocessing.autoXMTY.sh: Pre-processing pipeline for wild canids genome mapping

##################################################
## Pre-processing pipeline for wild canids genome mapping ##
##################################################

# This is a generic script to follow a pre-processing pipeline for any wild canid after adapter trimming. Steps will be: mapping piped with sorting and read groups addition, marking duplicates and indel realignment.Depending on how many libraries we have for the same specimen, we will need to repeat the mapping step or not. 
# However, the high efficiency of the mapping step allows a 30-40 Gb raw data library to be mapped in around 5 hours. 
# This whole pipeline should last less than 16 hours to complete. 

#############################
## Variable and PATHs definition ##
#############################
# We will have five PATHs:

# PATH to all intermediate (.bam) files of this pipeline
BAMFOLD='/path/to/bamfolder'
# PATH to all sanity checks (flagstat files)
FLAGFOLD='/path/to/flagstat/output'
# PATH for temporary files generated by bwa mem
TEMP='/path/to/store/temporary/files'

PROG='/path/to/programs/folder'
REF='/path/to/ref/CanFam3.1'

########################################################
## Mapping genomes with bwa mem and sorting with samtools sort ##
########################################################

#########################
# As an example, we are mapping one of the libraries of the Algerian African golden wolf - Liu et al., (2018) after adapter trimming. This step is composed of two piped scripts: mapping with bwa mem and directly sorting the output as a compressed (.bam) file. This step needs to be repeated for every library of the same specimen. 
#########################
# Parameters of bwa mem: multithreaded (-t 18), incorporate read groups (-R), with tID (name of group of libraries, given by us),accession name (tSM), sequencing platform name (tPL), library name (tLB), reads length (tPI). 
# For further description of tag names in -bam files, see http://samtools.github.io/hts-specs/
# For further description of bwa mem features, see http://bio-bwa.sourceforge.net/bwa.shtml
#########################
# Parameters of samtools sort: divide in chunks of 500Mb (-m), use $TEMP as temporary folder (-T), use 18 threads (-@18), 
# export as compressed file (-O bam), use output of previously piped program (-).
# For further description of samtools sort features, see http://www.htslib.org/doc/samtools-sort.html
#########################

$PROG/bwa/bwa mem -M -t 18 -R "@RG\tID:awolf_algeria\tSM:SAMC009005\tPL:ILLUMINA\tLB:C4PCGANXX_6\tPU:awolf_algeria1\tPI:100" $REF $BAMFOLD/C4PCGANXX_6.trimmed.R1.fastq.gz $BAMFOLD/C4PCGANXX_6.trimmed.R2.fastq.gz | $PROG/samtools/samtools sort -m 500M -T $TEMP -@18 -O bam -o $BAMFOLD/C4PCGANXX_6.spr.bam -

# To see if this step worked well, we apply samtools flagstat to check out reads. 
$PROG/samtools/samtools flagstat $BAMFOLD/C4PCGANXX_6.spr.bam > $FLAGFOLD/fstat.C4PCGANXX_6.spr.txt

# This step shall be repeated for every library (C4PU7ANXX_4, C4PU7ANXX_5, C4PU7ANXX_4) of the same individual changing tLB

## Duplicate removal ##
#######################
# Duplicates need to be removed. Also, we will remove unmapped and reads who do not belong to primary alignments, singletons and reads mapped to a different chromosome.
# The FLAG field of .bam and .sam files is an indication of a list of features for mapped reads, for a further explanation see
# https://samtools.github.io/hts-specs/SAMv1.pdf and https://broadinstitute.github.io/picard/explain-flags.html
#######################

$PROG/samtools/samtools view -b -f2 -F260 -@16 $BAMFOLD/C4PCGANXX_6.spr.bam > $BAMFOLD/$BAMFOLD/C4PCGANXX_6.sprw.bam

# To see if this step worked well, we apply samtools flagstat to check out reads. 
$PROG/samtools/samtools flagstat $BAMFOLD/C4PCGANXX_6.sprw.bam > $FLAGFOLD/fstat.C4PCGANXX_6.sprw.txt

# Time to remove duplicates with Picard Mark Duplicates. MAX_FILE_HANDLES_FOR_READ_ENDS_MAP option is based in the capacity of our server. We have set up REMOVE_DUPLICATES as true to remove duplicates and indexed the file with CREATE_INDEX

java -Xmx30G -jar $PROG/picard/picard.jar MarkDuplicates INPUT=$BAMFOLD/C4PCGANXX_6.sprw.bam \
OUTPUT=$BAMFOLD/C4PCGANXX_6.sprwr.bam METRICS_FILE=$BAMFOLD/C4PCGANXX_6.sprw.rmdupmetrics.txt \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=$TEMP CREATE_INDEX=true

# To see if this step worked well, we apply samtools flagstat to check out reads. 
$PROG/samtools/samtools flagstat $BAMFOLD/C4PCGANXX_6.sprwr.bam > $FLAGFOLD/fstat.C4PCGANXX_6.sprwr.txt
# This step shall be repeated for every library (C4PU7ANXX_4, C4PU7ANXX_5, C4PU7ANXX_4) of the same individual

## Merging libraries ##

# We are merging libraries AFTER removing all duplicates from each separate library. If we merge libraries before removing
# duplicates, the software might erroneously interpret the same read from two separate libraries as a duplicate. 
# In this case, we have four libraries for the same individual: C4PCGANXX_6, C4PU7ANXX_4, C4PU7ANXX_5, C4PU7ANXX_4. Also, we want the output of out merging process being sorted for the next step. 

$PROG/samtools/samtools merge - $BAMFOLD/C4PCGANXX_6.sprwr.bam $BAMFOLD/C4PU7ANXX_4.sprwr.bam $BAMFOLD/C4PU7ANXX_5.sprwr.bam $BAMFOLD/C4PU7ANXX_6.sprwr.bam | $PROG/samtools/samtools sort -m 500M -T $TEMP -@16 -O bam -o $BAMFOLD/SAMC009005.sprwr.bam -

# To see if this step worked well, we apply samtools flagstat to check out reads. 
$PROG/samtools/samtools flagstat $BAMFOLD/SAMC009005.sprwr.bam > $FLAGFOLD/fstat.SAMC009005.sprwr.txt

## Indel realignment ##

# Last step of the pre-processing pipeline. It is done using GATK. It involves two steps: 
# 1. Create a list of indels self assigned from local data (RealignerTargetCreator)
# 2. Use the list of indels from previous step to realign indels (IndelRealigner)
####
# Indel realignment 1st step. We create a self-generated indel database. 
java -Xmx48G -jar $PROG/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -nct 1 -nt 24 -R $REF -I $BAMFOLD/SAMC009005.sprwr.bam -o $BAMFOLD/SAMC009005.rtc.intervals

# second step indel realignment. The list of indels has been self assigned from local data..
java -Xmx48G -jar $PROG/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R $REF -targetIntervals $BAMFOLD/SAMC009005.rtc.intervals -I $BAMFOLD/SAMC009005.sprwr.bam -o $BAMFOLD/SAMC009005.sprwr_realign.bam

# To see if this step worked well, we can apply samtools flagstat to check out reads. 
$PROG/samtools/samtools flagstat $BAMFOLD/SAMC009005.sprwr_realign.bam > $FLAGFOLD/fstat.SAMC009005.sprwr_realign.txt
## Divide per chromosome, extract autosomes, calculate genomewide depth ##
# The software that we will use for generating genotype likelihoods, ANGSD, had different ways to work with haploid or diploid data.
# We will work with autosomal data. We will extract chromosomes 1-38 from the genomewide .bam and calculate read depth.

#########################
# First, we separate them per chromosome using samtools view.

mkdir $BAMFOLD/autosomes
for i in $(seq -w 38); do $PROG/samtools/samtools view -hb -@16 $BAMFOLD/SAMC009005.sprwr_realigned.bam chr${i} > $BAMFOLD/autosomes/afr_wolf.algeria.chr${i}.bam; done
for I in chrX chrY chrMT; do $PROG/samtools/samtools view -hb -@16 SAMC009005.sprwr_realigned.bam $i > $BAMFOLD/autosomes/afr_wolf.algeria.$i.bam; done

#########################
# Second, we estimate average depth and standard deviation of depth per chromosome
COUNTBAM=($(ls $BAMFOLD/autosomes/*.bam | uniq)) # list of bams in /autosomes/

for sample in �${COUNTBAM[@]}�; do $PROG/samtools/samtools depth -a $sample | awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print sqrt(sumsq/NR - (sum/NR)**2)}' >> $BAMFOLD/autosomes/depthdistchr.afr_wolf.algeria.txt; done

#########################
# Third, we merge autosomes.

for i in X Y MT; do mv $BAMFOLD/autosomes/afr_wolf.algeria.chr$i.bam $BAMFOLD/; done

$PROG/samtools/samtools merge - $BAMFOLD/autosomes/*bam | $PROG/samtools/samtools sort -m 500M -T $TEMP -@16 -O bam -o $BAMFOLD/autosomes/afr_wolf.algeria.autosomes.bam -

#########################
#Finally: we index the autosome to call genotype likelihoods later.
$PROG/samtools/samtools index $BAMFOLD/autosomes/afr_wolf.algeria.autosomes.bam

#####################################
## END OF PREPROCESSING PIPELINE ##
#####################################

#################
## REFERENCES ##
#################

# Lindblad-Toh, K., et al. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069), 803�819. https://doi.org/10.1038/nature04338
# Campana, M. G., et al. (2016). Genome sequence, population history, and pelage genetics of the endangered African wild dog (Lycaon pictus). BMC Genomics, 17(1), 1�10. https://doi.org/10.1186/s12864-016-3368-9
# Liu, Y. H., et al. (2018). Whole-Genome sequencing of African dogs provides insights into adaptations against tropical parasites. Molecular Biology and Evolution, 35(2), 287�298. https://doi.org/10.1093/molbev/msx258


#######################################################################################################
#######################################################################################################
02.a.ANGSD.distribution.qscores.sh: Generating a distribution of quality scores

############################
## Distribution of quality scores ##
###########################

# This script will be used to print a distribution of quality scores along our 23 genomes of wild and domestic canids following 
# Matteo Fumagalli's tutorial: https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md#basic-filtering-using-angsd
# Printing such a distribution is important to define whether the quality of our reads is uniformly distributed and if
# we will lose much information at setting different upper and lower filters for quality. 

##############################
## Variable and PATHs definition ##
#############################

# PATH to a list of the 23 autosomal genomes in .bam resulting preprocessing pipeline
BAM='/path/to/bamfolder/list'
PROG='/path/to/programs/folder'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where genotype likelihoods will be stored:
OUT='/path/to/ANGSD/output/test.qc.distribution'

#######################################
## Generating a distribution of quality scores ##
######################################

# In a previous step of the preprocessing pipeline, we have calculated both the mean average and standard deviation of read depths. We can calculate an upper limit of the cumulative standard deviation. We have set up an upper limit of maxDepth 800 to see where is the main portion of the area under the curve. 

$PROG/angsd/angsd -P 16 -bam $BAMFOLD/23genomes.filelist -ref $REF -out $OUT/genoplot.qc \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -doQsDist 1 -doDepth 1 -doCounts 1 -maxDepth 800

# To look at the files generated, move to the $OUT folder (cd $OUT).
# To see counts of quality scores:
# less -S $OUT/genoplot.qc.qs
# To see counts of per-sample depth:
# less -S $OUT/genoplot.qc.depthSample
# wc -l $OUT/genoplot.qc.depthSample
# To see counts of global depth:
# less -S $OUT/genoplot.qc.depthGlobal


#######################################################################################################
#######################################################################################################
02.b.ANGSD.genotype.likelihoods.sh: Estimating genotype likelihoods with ANGSD

#We are using here similar paths to script 02.a.

##############################
## Estimating genotype likelihoods ##
##############################

# We estimated a lower limit of coverage per genome of 5X (minIndDepth 5). 
# Max depth of all genomes combined was an estimation of the cumulative mean coverage of all 23 genomes multiplied by two. For some analyses, we are interested in having the neutral regions trimmed; for some analyses not (-sites option).
# Calling genotype likelihoods at the whole autosomal genome would require 5 days. We have subdivided the task and parallelized it across the 38 chromosomes. The longest time to run a script was around 12 hours for chromosome 1. 

$PROG/angsd/angsd -P 16 -bam $BAMFOLD/23genomes.chr01.filelist -ref $REF/chr01.fasta -out $OUT/chr01.canfam -r chr01 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 23 -setMinDepth 46 -setMaxDepth 742 -doCounts 1 \
-GL 1 -doGlf 1 -minIndDepth 5

# For those analyses that require neutral sites, we added -sites $SITES at the end of this command.

##############################
## SNP discovery (PLINK format) ##
##############################

# ANGSD can also call SNPs in PLINK format for programs like Admixture or flashPCA to work. 
# We called SNPs separately for each chromosome. Each PLINK format file was simply concatenated later with the cat command. 

$PROG/angsd/angsd -P 16 -bam $BAMFOLD/23genomes.chr01.filelist -ref $REF/chr01.fasta -out $OUT/chr01.canfam.plink -r chr01 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 23 -setMinDepth 46 -setMaxDepth 742 -doCounts 1 \
-GL 1 -minIndDepth 5 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -SNP_pval 0.00001 -doPlink 2 -doGeno -4 -doPost 1


#######################################################################################################
#######################################################################################################
02.c.ANGSD.PCA.sh: PCA using genotype likelihoods with ANGSD and ngsTools

#We are using here similar paths to script 02.a.
#####################################
## Estimating genotype likelihoods for PCA ##
#####################################

# We estimated a lower limit of coverage per genome of 5X (minIndDepth 5). 
# Max depth of all genomes combined was an estimation of the cumulative mean coverage of all 23 genomes multiplied by two. For some analyses, we are interested in having the neutral regions trimmed; for some analyses not. We observed that running the whole autosomal genome would require more than 5 days (real time), so instead we subdivided the program in 38 instances -for 38 chromosomes- and ran them in parallel. The program took max. 9 hours (for chromosome 1).

##########################
# 1st step. We estimate genotype likelihoods (repeat for every autosome).

$PROG/angsd/angsd -P 16 -bam $BAM/23genomes.chr01.filelist -ref $REF -out $OUT/chr01.canfam -r chr01 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 23 -setMinDepth 46 -setMaxDepth 742 -doCounts 1 \
-GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -minIndDepth 5 -sites $SITES -SNP_pval 1e-5 -doGeno 32 -doPost 1

#########################
# 2nd step. Run ngsCovar to perform a PCA

# First we need to unzip all .geno.gz and .mafs.gz files and concatenate them (simply with command cat)
# Then we need to estimate the number of sites: 
NSITES = `cat $OUT/autosomes.canfam.mafs | tail -n+2 | wc -l`

# Then we can run ngsCovar. As in Fumagalli's tutorial, we do not want to perform genotype calling (-call 0). 
# We also do not want to normalise by allele frequency (-norm 0)
$PROG/ngsTools/ngsPopGen/ngsCovar -probfile $OUT/autosomes.canfam.geno -outfile $OUT/autosomes.canfam.covar \
-nind 23 -nsites $NSITES -call 0 -norm 0 

########################
# 3rd. Plot the PCR. 

# We first create a plink cluster-like file defining the labelling for each sample.
Rscript -e 'write.table(cbind(seq(1,23), rep(1,23), c(rep("afr_wolf",7), rep("gr_wolf",6), rep("dog",7), rep("eth_wolf",1), rep("eu_gjackal",2))), row.names=F, sep="\t", col.names=c("FID","IID","CLUSTER"), file="autosomes.canfam.clst", quote=F)'

# Then, we run and plot the PCA. 
Rscript /home/carlos/Desktop/programs/ngsTools/Scripts/plotPCA.R -i $OUT/autosomes.canfam.covar -c 1-2 -a $OUT/autosomes.canfam.clst -o $OUT/autosomes.canfam.pca.pdf


#######################################################################################################
#######################################################################################################
02.d.ANGSD.ngsAdmix.sh: Admixture proportions using ANGSD and ngsAdmix

################################
## Estimating Admixture proportions ##
###############################

# We estimated a lower limit of coverage per genome of 5X (minIndDepth 5). 
# Max depth of all genomes combined was an estimation of the cumulative mean coverage of all 23 genomes multiplied by two. For some analyses, we are interested in having the neutral regions trimmed; for some analyses not. 
# We observed that running the whole autosomal genome would require more than 5 days (real time), so instead we subdivided the program in 38 instances -for 38 chromosomes- and ran them in parallel. The program took max. 16 hours (for chromosome 1).

##########################
# 1st step. We estimate genotype likelihoods in BEAGLE format (repeat for every autosome).

$PROG/angsd/angsd -P 16 -bam $BAM/23genomes.chr01.filelist -ref $REF/chr01.fasta -out $OUT/chr01.canfam -r chr01 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 23 -setMinDepth 115 -setMaxDepth 742 -doCounts 1 \
-GL 1 -doMajorMinor 4 -doMaf 1 -skipTriallelic 1 -doGlf 2 -sites $SITES -SNP_pval 1e-6 

#########################
# 2nd step. Run ngsAdmix in beagle data

# First we need to unzip all .beagle.gz files and concatenate them. 
# We simply take all data except for the first row (which is descriptive of the columns) and concatenate them with cat
# Then we can run ngsAdmix. 

for i in {2..23}; do $PROG/ngsTools/ngsAdmix/NGSadmix -likes $OUT/autosomes.filtered.beagle -K $i -minMaf 0.05 -seed 1 \
-o $OUT/ngsAdmix.autosomes.K$i.run 2>$OUT/run.$i.K.log; done

# The final step is plotting. We have modified the R script provided by the developers so that we could have more color categories.

for i in {2..23}; do Rscript $PROG/ngsTools/ngsAdmix/Scripts/plotAdmix.R -i $OUT/ngsAdmix.autosomes.K$i.run.qopt \
-o $OUT/ngsAdmix.autosomes.$i.run.pdf; done


#######################################################################################################
#######################################################################################################
02.e.ANGSD.plink.merge.sh: SNP merging and filtering

##########################
## SNP merging and filtering ##
#########################

# The output of step 02.b will come in PLINK transposed file format (tfiles). So, we need to convert them to bfile format.
# Since PLINK can also work with lists of data, we will store the file names in several lists before the next step.

for i in {01..38}; do
$PROG/plink/plink --tfile $IN/chr$i.canfam --allow-no-sex --dog --make-bed --out $OUT/chr$i.canfam
echo $OUT/chr$i.canfam.bed >> $OUT/bed.canfam.list
echo $OUT/chr$i.canfam.bim >> $OUT/bim.canfam.list
echo $OUT/chr$i.canfam.fam >> $OUT/fam.canfam.list; done

paste $OUT/bed.canfam.list $OUT/bim.canfam.list $OUT/fam.canfam.list > $OUT/dataset.canfam.txt

$PROG/plink/plink --bfile $OUT/chr01.canfam --merge-list $OUT/dataset.canfam.txt --dog --make-bed --out $OUT/autosomes.canfam

# We now filter SNPs in LD, HWE, MAF and missing genotypes. We handle canid data, so we set up option --dog.
# We filter SNPs in linkage disequilibrium in windows of 50 SNPs with a step size of 5 SNPs and a R2 of 0.5.

$PROG/plink/plink --bfile $OUT/autosomes.canfam --allow-no-sex --dog --make-bed --missing-genotype N --indep-pairwise 50 5 0.5 --hwe 0.001 --maf 0.05 --out $OUT/autosomes.canfam.pruned

# More details on this data type and how to work with them in https://zzz.bwh.harvard.edu/plink/data.shtml


#######################################################################################################
#######################################################################################################
02.f.ANGSD.plink.flashPCA.sh: SNP-based PCA

###################
## SNP-based PCA ##
###################

# We are using the output of step 02.d in PLINK-format bfile. We have a set of neutral SNPs in Hardy-Weinberg equilibrium.
# To run flashPCA, we need to run this program at the flashPCA folder.
$PROG/flashPCA/flashpca --bfile $IN/autosomes.canfam.pruned --numthreads 8

# We send files to storage folder
cp p*txt $OUT && cp e*txt $OUT

# We change names so that they do not overlap
mv $OUT/pve.txt $OUT/pve.canfam.autosomes.pruned.txt && mv $OUT/pcs.txt $OUT/pcs.canfam.autosomes.pruned.txt
mv $OUT/eigenvalues.txt $OUT/eigenvalues.canfam.autosomes.pruned.txt
mv $OUT/eigenvectors.txt $OUT/eigenvectors.canfam.autosomes.pruned.txt

# Once there, we need to check the pcs file and edit it:

# First, we paste 1st two columns (with species names and individual names) to the table.
paste cols.txt pcs.canfam.autosomes.pruned.txt > pasted.pcs.canfam.autosomes.pruned.txt

# Second, we set all multiple spaces as one
sed -n 's/ \+/ /gp' pasted.pcs.canfam.autosomes.pruned.txt > sspaced.pcs.canfam.autosomes.pruned.txt

# Third, we concatenate the full header to the file.
cat header.txt sspaced.pcs.canfam.autosomes.pruned.txt > complete.pcs.canfam.autosomes.pruned.txt

# Fourth, we change all spaces for tabs
sed -i 's/\s/\t/g' complete.pcs.canfam.autosomes.pruned.txt 
# Finally, we plot the PCA in R. 


#######################################################################################################
#######################################################################################################
02.g.ANGSD.plink.Admixture.sh: SNP-based Admixture


# We are using the output of step 02.d in PLINK-format bfile. We have a set of neutral SNPs in Hardy-Weinberg equilibrium.

# First we run Admixture
for K in {05..23};do 
$PROG/Admixture/admixture --cv $IN/autosomes.canfam.pruned.bed $K -j20 | tee $OUT/admix_results/log${K}.autosomes.canfam.pruned.txt; done

# Then, we gather the log-likelihood results to draw a best-fit K graph (in R).
grep 'CV' $OUT/admix_results/log${K}.autosomes.canfam.pruned.txt >> $OUT/admix_results/admixCVerror.autosomes.canfam.pruned.txt
#################################################################

03.SFS.Het.Fst.thetas.sh: SFS calculation, heterozygosity, genomewide Fst and thetas

########################################
## Site Frequency Spectrum (SFS) calculation ##
#######################################

# Since we have only few genomes with low coverage, we are not able to run some powerful analyses - like IBD-based analyses. So we will use thetas and neutrality tests as an indirect estimation of recent population changes.
# The Site Frequency Spectrum (SFS) is the distribution of allele frequencies of a number of loci in a population or in an 
individual. It is very useful to calculate genomewide heterozygosity, Fst and thetas. 
# Methods can be found in https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md , http://popgen.dk/angsd/index.php/SFS_Estimation and the theory is presented in https://pubmed.ncbi.nlm.nih.gov/22911679/

#############################
## Variable and PATHs definition ##
#############################
# We will run an unfolded SFS calculation using the Lycaon pictus (Campana et al., 2016) genome as ancestral reference. 
# The domestic dog (CanFam3.1) genome will be the reference "modern" reference. 

PROG='/path/to/programs/folder'
# PATH to the .bam files. We can run SFS per individual (useful to calculate Fst and individual heterozygosity), or population.
IND='/path/to/bamfolder/individual'
POP='/path/to/bamfolder/list'
# Ancestral and reference genomes:
REF='/path/to/ref/CanFam3.1'
ANC='/path/to/ref/L.pictus'
# PATH to output folder where our SFS output will be stored:
OUT='/path/to/ANGSD/output/7.SFS/'
# PATH with a set of genomic regions at 10-kb distances upwards and downwards from genes. 
SITES='/path/to/neutral/regions/'

#######################################
## Site Frequency Spectrum (SFS) calculation ##
########################################

# The first step consists in computing posterior probabilities of Sample Allele Frequency. Both for small populations and individuals, we can run SFS throughout the whole autosomal genome. It takes less than 15 hours (for a population of 3 individuals)
# Example of a population SAF file being called. For Thetas and Fst calculations we filter out genic regions. Minimum depth is defined by sites represented at least 5 times per genome (or 15 in total) and maximum depth is 2.5X the cumulative mean genome depth. The list here points to whole autosomes of the three coyotes (Mexico, Midwest and California - see paper).

$PROG/angsd/angsd -P 8 -bam $IN/coyote.autosomes.bamlist -anc $ANC -ref $REF -out $OUT/coyote.autosomes.wogenic \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 3 -setMinDepth 15 -setMaxDepth 119 -doCounts 1 \
-GL 1 -doSaf 1 -sites $SITES

# Example of SAF file being called for an individual (Algerian African golden wolf).
$PROG/angsd/angsd -P 8 -bam $IN/afr_wolf.algeria.autosomes.list -anc $ANC -ref $REF -out $OUT/afr_wolf.algeria.autosomes.wogenic \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 1 -setMinDepth 5 -setMaxDepth 71 -doCounts 1 \
-GL 1 -doSaf 1 -sites $SITES

# Second step: We use the .saf files from previous step to call individual SFS with the program realSFS from the ANGSD package. It can be found here: http://www.popgen.dk/angsd/index.php/RealSFS
# Attention! realSFS consumes a lot of RAM memory. Depending on the size of your genome and to avoid crashes, you might want to increase the memory settings first. realSFS takes a rather short time for a genome (15 minutes).

$PROG/angsd/misc/realSFS $OUT/coyote.autosomes.wogenic.saf.idx -P 16 > $OUT/coyote.autosomes.wogenic.sfs

$PROG/realSFS $OUT/afr_wolf.algeria.autosomes.wogenic.saf.idx -P 16 > $OUT/afr_wolf.algeria.autosomes.wogenic.sfs

############################
## Genomewide heterozygosity ##
############################

# Genomewide heterozygosity can be simply calculated dividing the number of singletons by the total number of SFS values. # For example, if our individual SFS file is:

echo $OUT/afr_wolf.algeria.autosomes.wogenic.sfs
350688535 994011 998982946

# Then the genomewide heterozygosity is: 994011/(350688535+994011+998982946), calculated as in Gopalakrishnan et al., (2018). Since we have some medium and low genome coverages, we want to know if there is a bias at the calculation of SFS. We downsampled the Kenyan genome (24X) to coverages pf 7X, 9X, 11.2X and 15X with samtools view -bs and repeated the analysis, correcting the genomewide heterozygosity wherever needed. 

##############################
## Fst between pairs of individuals ##
##############################

# For Fst calculations, we will compute the 2DSFS between pairs of .saf files. 
# Example: calculating Fst between afr_wolf.algeria and other African golden wolves:

for name in afr_wolf.egypt afr_wolf.ethiopia afr_wolf.kenya afr_wolf.moroccopub afr_wolf.moroccounpub afr_wolf.senegal; do
$PROG/angsd/misc/realSFS -P 16 $OUT/afr_wolf.algeria.autosomes.wogenic.saf.idx $OUT/$name.autosomes.wogenic.saf.idx > $OUT/afr_wolf.algeria.$name.2d.sfs; done

# This 2DSFS file will be used to estimate Fst in a 50kb sliding windows scan with realSFS as in:
# https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md 

for name in afr_wolf.egypt afr_wolf.ethiopia afr_wolf.kenya afr_wolf.moroccopub afr_wolf.moroccounpub afr_wolf.senegal; do
$PROG/angsd/misc/realSFS fst index -P 16 $OUT/afr_wolf.algeria.autosomes.wogenic.saf.idx $OUT/$name.autosomes.wogenic.saf.idx -sfs $OUT/afr_wolf.algeria.$name.2d.sfs -fstout $OUT/afr_wolf.algeria.$name.wogenic -whichFST 1
$PROG/angsd/misc/realSFS fst print $OUT/afr_wolf.algeria.$name.wogenic.fst.idx | awk -v OFS='\t' '{sum3+=$3; sum4+=$4} END {print sum3/sum4}' > $OUT/afr_wolf.algeria.$name.gnome.fst; done

###########################
## Thetas and neutrality tests ##
##########################

# Thetas are caltulated using ANGSD and realSFS as in https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md
# The way to do it has been described here: http://www.popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests
# First: we compute the allele frequency posterior probabilities and statistics (-doThetas) using SFS as prior information (-pest). Since we want to know population changes, we will calculate thetas for populations, not individuals. 

#1. Estimate pestPG
$PROG/angsd/angsd -P 10 -bam $IN/AFR_EAST.autosomes.bamlist -anc $ANC -ref $REF -out $OUT/AFR_EAST.autosomes.wogenic \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 4 -setMinDepth 5 -setMaxDepth 106 -doCounts 1 \
-GL 1 -doSaf 1 -doThetas 1 -pest $OUT/afr_east.autosomes.wogenic.sfs

#2. Print thetas
$PROG/angsd/misc/thetaStat print $OUT/AFR_EAST.autosomes.wogenic.thetas.idx > $OUT/AFR_EAST.autosomes.wogenic.thetas.print

#3. Check window-based thetas
$PROG/angsd/misc/thetaStat do_stat $OUT/AFR_EAST.autosomes.wogenic.thetas.idx
$PROG/angsd/misc/thetaStat do_stat $OUT/AFR_EAST.autosomes.wogenic.thetas.idx -win 50000 -step 50000 -outnames $OUT/AFR_EAST.autosomes.wogenic.thetas.out

# We then estimate mean and standard deviation of the genomic windows.
awk '{for(i=1;i<=NF;i++) {sum[i] += 10**$i; sumsq[i] += (10**$i)^2}} END {for (i=1;i<=NF;i++) {printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}}' $OUT/AFR_EAST.autosomes.wogenic.thetas.print >> $OUT/AFR_EAST.autosomes.wogenic.thetas.meanstd.logscaled

# After knowing the mean and standard deviation of the number of sites per genome window, we can estimate a confidence interval of # 95% (mean +/- 2 standard deviations). In AFR_EAST example, it is:
# For thetas
cat AFR_EAST.autosomes.wogenic.thetas.out | awk '{for(i=4;i<=8;i++) {if ($14 > 30297) {theta[i]=($i/$14); sum[i]+=theta[i]; sumsq[i]+=(theta[i]^2)}}} END {for(i=4;i<=8;i++) {printf "theta/site= mean %f stdev %f \n", sum[i]/44010, sqrt((sumsq[i]-sum[i]^2/44010)/44010)}}'

# For neutrality tests
cat AFR_EAST.autosomes.wogenic.thetas.out | awk '{for(i=9;i<=13;i++) {if ($14 > 30297) {sum[i]+=$i; sumsq[i]+=($i^2)}}} END {for(i=9;i<=13;i++) {printf "neut.test_wgs= mean %f stdev %f \n", sum[i]/44010, sqrt((sumsq[i]-sum[i]^2/44010)/44010)}}'

#################
## REFERENCES ##
#################

# Gopalakrishnan, S., et al. (2018). Interspecific Gene Flow Shaped the Evolution of the Genus Canis. Current Biology, 28(21), 3441-3449.e5. https://doi.org/10.1016/j.cub.2018.08.041


#######################################################################################################
#######################################################################################################
04.PSMC.sh: Pairwise Sequentially Markovian Coalescent (PSMC) 

###############################################
## Pairwise Sequentially Markovian Coalescent (PSMC) ##
###############################################

# We are running PSMC to estimate deep demographic history of African golden wolves - divided per lineage. 
# PSMC is a software developed by Li and Durbin, (2011), which is based in the calculation of coalescence events between two alleles at every locus. To achieve this, we must call genotypes with the bcftools software: 
# http://samtools.github.io/bcftools/bcftools.html

#############################
## Variable and PATHs definition ##
#############################

PROG='/path/to/programs/folder'
IN='/path/to/bamfolder/individual'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where our PSMC output will be stored:
OUT='/path/to/ANGSD/output/8.PSMC/'
# PATH with a set of genomic regions at 10-kb distances upwards and downwards from genes. 
SITES='/path/to/neutral/regions/'

#################
## PSMC analysis ##
#################

# To run a PSMC analysis, we will follow the tutorial in https://github.com/lh3/psmc
# Calling genotypes in low coverage data might be tricky for low coverage samples as tested in Nadachowska-Brzyska et al., (2013). We will correct our PSMC plots using the False Negative Rate (FNR) option of PSMC. 

# First. We call a consensus sequence considering alleles with a minimum of 5X and a maximum of 100. 
# Using the example of the Algerian African golden wolf:

$PROG/bcftools mpileup -C50 -f $REF -Ou $IN/afr_wolf.algeria.autosomes.bam | $PROG/bcftools call -c | $PROG/bcftools/bin/vcfutils.pl vcf2fq -d 5 -D 100 | gzip > $OUT/afr_wolf.algeria.fq.gz

# We have now a whole genome diploid consensus sequence of the African golden wolf. 
# We use this consensus sequence to run PSMC. We have used the settings of Freedman et al., (2014) with wild canids:
# 64 time intervals divided in 1 time segments of 6 intervals and 58 individual time intervals - "1*6+58*1"
# Also, we previously downsampled the Kenyan (24X) genome to the same coverage of the Algerian individual and visually corrected the plot to calculate the FNR.

$PROG/psmc/utils/fq2psmcfa -q20 $OUT/afr_wolf.algeria.fq.gz > $OUT/afr_wolf.algeria.psmcfa

# We bootstrap 50 times the PSMC analysis. 
for i in {01..50}; do $PROG/psmc/psmc -N20 -t10 -r6.3291 -b -p "1*6+58*1" -o $OUT/afr_wolf.algeria.round.$i.psmc $OUT/afr_wolf.algeria.split.psmcfa; done

$PROG/psmc/psmc -N20 -t10 -r6.3291 -p "1*6+58*1" -o $OUT/afr_wolf.algeria.unsplit.psmc $OUT/afr_wolf.algeria.psmcfa

# All psmc files are concatenated for the final plot.
cat $OUT/afr_wolf.algeria.unsplit.psmc $OUT/afr_wolf.algeria.round.*.psmc > $OUT/afr_wolf.algeria.combined.psmc

# Finally, we generate the plot. We used a generation time of 3 years (Freedman et al., 2014) and a mutation rate of 4.5*10-9 (Koch et al., 2019). -N is the FNR correction. 

$PROG/psmc/psmc_plot.pl -P "right bottom" -Y16 -f Helvetica,12 -u 4.5e-09 -g 3 -w 2 -N 0.21 "algeria.corrected" algeria $OUT/afr_wolf.algeria.combined.50.psmc


#######################################################################################################
#######################################################################################################
05.ngsPSMC.sh: Genotype likelihood-based PSMC (ngsPSMC)

########################################################################
## Genotype likelihood-based Pairwise Sequentially Markovian Coalescent (ngsPSMC) ##
########################################################################

# We are running ngsPSMC to estimate more recent demographic history of African golden wolves - divided per lineage. 
# ngsPSMC is a software developed by Shchur, Korneliussen and Nielsen (2017), that uses the same principles as PSMC, but using inferred genotype likelihoods as input, which decreases the uncertainty of called genotypes in low coverage genomes. The program can be found as a part of the ANGSD package in: https://github.com/ANGSD/ngsPSMC

#############################
## Variable and PATHs definition ##
#############################

PROG='/path/to/programs/folder'
IN='/path/to/bamfolder/individual'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where our ngsPSMC output will be stored:
OUT='/path/to/ANGSD/output/9.ngsPSMC/'

####################
## ngsPSMC analysis ##
####################

# To run a ngsPSMC analysis, we will follow the tutorial in https://github.com/ANGSD/ngsPSMC
# The first thing to take into consideration is that ngsPSMC is still in beta and some parameters cannot be optimized.
# Therefore, we have worked with it using parameters as close as possible to the tutorial provided, but adapting several values to our model animal (African golden wolves). 
# First step. We estimate genotype likelihoods with ANGSD.

$PROG/angsd/angsd -P 14 -i $IN/afr_wolf.algeria.autosomes.bam -ref $REF -dopsmc 1 -out $OUT/afr_wolf.algeria.psmc -gl 1 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minInd 1 -minIndDepth 5 -minq 20 -minmapq 20

# The output of previous step will serve as input for ngsPSMC. Parameters changed from the tutorial:
# We have estimated individual theta as in script 0.3.SFS.Het.Fst.thetas.sh
# Genomewide rho was estimated from a recombination map from dogs (Auton et al., 2013)
# Time segment parameters were estimated as in PSMC (see script 04.PSMC.sh).
# -init parameters represents initial population effective size, relative to Ne=10000. 
# After some attempts, we observed that the ngsPSMC plot started from 70K years ago. We used the calculated population size 70kyr ago from the PSMC plot (see script 04.PSMC.sh), which was Ne=8800

$PROG/ngspsmc/ngspsmc $OUT/afr_wolf.algeria.autosomes.psmc.psmc.idx -p "1*6+58*1" -dospline 0 -nthreads 38 \
-nIter 50 -init 0.88 -theta 0.000665 -rho 0.00218063 > $OUT/afr_wolf.algeria.autosomes.thetarho.ngspsmc

# Third: plot the ngsPSMC graph.
# The program requires a .txt file with setting units to be defined. Original parameters were: 
# Initial Ne=8800; mutation rate=4.5e-09 (Koch et al., 2019); binsize=100 (default); generation time=3 years (Freedman et al., 2014)

python3.5 $PROG/ngspsmc/utils/psmc_plot.py -psmc algeria $OUT/afr_wolf.algeria.autosomes.thetarho.ngspsmc 0 0 \
--funits $PROG/ngspsmc/utils/setunits.algeria.txt --fout afr_wolf.algeria.autosomes.ci.thetarho.ngspsmc.pdf \
--maxY 1.7

# This command was repeated by changing --maxY value from 1.7 to 0.2 (x10000) to increase the zoom over changes in Ne between 100 and 10000 years ago


#######################################################################################################
#######################################################################################################
06.ROHs.Fi.sh: Runs of Homozygosity (ROHs) and Inbreeding coefficient (Fi)
#!/bin/sh

#################################################################
## Runs of Homozygosity (ROHs) and genomewide Inbreeding coefficient (Fi) ##
#################################################################

# We are calculating heterozygosity through measure of genomewide homozygosity. This can be done through two ways: 
# 1. Calculating directly genomewide heterozygosity (genotype likelihood-based or SNP-based)
# 2. Calculating homozygosity through Runs of Homozygosity (ROHs - genotype likelihood-based or SNP-based)

#############################
## Variable and PATHs definition ##
#############################

PROG='/path/to/programs/folder'
IN='/path/to/bamfolder'
# PATH to the PLINK-format files (output of 02.e) 
SNP='/path/to/bamfolder/individual/SNP'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where our ngsPSMC output will be stored:
OUT='/path/to/ANGSD/output/10.Fi/'

############################################
## 1. Calculating directly genomewide homozygosity ## 
############################################

## a. ngsF analysis ##

# The first method requires the use of ngsF from the ngsTools software (https://github.com/mfumagalli/ngsTools). 
# A tutorial on how to use ngsF is in: https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md#inbreeding

# First of all, we call genotype likelihoods in binary format (-doGlf 3). This format is accepted by ngsF. 
# We are using two populations: afr_north (composed of Senegal, west Morocco, east Morocco, Algeria and Egypt)
# and afr_east (Ethiopia and Kenya). Maximum depth will be 2X the cumulative mean depth of every genome involved. 

$PROG/angsd/angsd -P 10 -bam $IN/afr_north.filelist -ref $REF -out $OUT/afr_north.glf3 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 5 -setMinDepth 5 -setMaxDepth 143 -doCounts 1 \
-GL 1 -doGlf 3 -minIndDepth 5 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -SNP_pval 1e-3

# With ngsF, we estimate individual genomewide inbreeding coefficients (Fi) with 20 iterations. afr_north.indF is the output.

nsites=`zcat $OUT/afr_north.glf3.mafs.gz | tail -n+2 | wc -l` 

zcat $OUT/afr_north.glf3.glf.gz | $PROG/ngsTools/ngsF/ngsF --n_ind 5 --n_sites $nsites --glf - --out $OUT/afr_north.approx_indF --approx_EM --init_values u --n_threads 4

zcat $OUT/afr_north.glf3.glf.gz | $PROG/ngsTools/ngsF/ngsF.sh --n_ind 5 --n_sites $nsites --glf - --out $OUT/afr_north.20it.indF --init_values $OUT/afr_north.approx_indF.pars --n_threads 4
## b. PLINK analysis ##

# With PLINK, we make use of PLINK-format SNP files generated in script 02.e. We directly called SNPs in two populations, 
# afr_north and afr_east. Output will be afr_north.tfam and afr_north.tped.

$PROG/plink/plink --tfile $SNP/afr_north --het --dog --allow-no-sex --out $OUT/afr_north.het
cat afr_north.het.het | awk '{print $6}'

#######################################
## 2. Calculating homozygosity through ROHs ## 
#######################################

## a. Genotype-likelihood based ROH calculation (ROHan) ##

# We used a novel program called ROHan (Renaud et al., 2019), specifically designed to detect ROHs in low coverage data. 
# The program (and a tutorial) can be found here: https://github.com/grenaud/rohan
# ROHan requires a file with the sequencing error rates in an Illumina platform ($error) and an estimation of rates of local heterozygosity (rohmu). Also, defining a desired window size (5 Mb, for example) is desirable. ROHan will estimate: local rates of heterozygosity, local MCMC and HMM posterior per window according to minimum, mid and maximum estimates of heterozygosity, average coverage and estimation of fraction of the genome in ROH and heterozygosity outside ROHs.

$PROG/rohan/rohan --auto $OUT/afr_wolf.algeria.rohan --err $error -t 10 --rohmu 5e-5 -o $OUT/afr_wolf.algeria.autosomes --size 500000 $ref $bam
## b. SNP-based ROH calculation with PLINK ##

# First, PLINK-format SNP files were generated per population as in step 02.e. 
# SNPs were pruned for linkage disequilibrium as in Sams & Boyko, (2019). 

# We generate a bfile and filter for SNPs in close LD in 200-bp windows with a step size of 100 bp and R2=0.9.
$PROG/plink/plink --tfile $SNP/afr_east.autosomes --allow-no-sex --dog --make-bed --recode --out $OUT/afr_east.autosomes

$PROG/plink/plink --bfile $OUT/afr_east.autosomes --allow-no-sex --dog --make-bed --missing-genotype N --indep-pairwise 200 100 0.90 --maf 0.05 --out $OUT/afr_east.autosomes.prune

# After having pruned, we use the list of SNPs pruned to filter the original PLINK format file. 
$PROG/plink/plink --file $SNP/afr_east.autosomes --allow-no-sex --dog --extract $OUT/afr_east.autosomes.prune.prune.in --make-bed --out $OUT/afr_east.autosomes.pruned

# Finally, we estimate Runs of Homozygosity as defined in Sams & Boyko, (2019). 
$PROG/plink/plink --bfile $OUT/afr_east.autosomes.pruned --homozyg --homozyg-snp 41 --homozyg-kb 500 --homozyg-window-snp 41 --homozyg-window-missing 0 --homozyg-window-threshold 0.05 --homozyg-window-het 0 --homozyg-density 5000 --homozyg-gap 1000 --dog --out $OUT/afr_east.autosomes.pruned.samsboyko

#################
## REFERENCES ##
#################

# Renaud, G., et al. (2019). Joint Estimates of Heterozygosity and Runs of Homozygosity for Modern and Ancient Samples. Genetics, 212(July), 587�614. https://doi.org/10.1534/genetics.119.302057

# Sams, A. J., & Boyko, A. R. (2019). Fine-scale resolution of runs of homozygosity reveal patterns of inbreeding and substantial overlap with recessive disease genotypes in domestic dogs. G3: Genes, Genomes, Genetics, 9(1), 117�123. https://doi.org/10.1534/g3.118.200836


#######################################################################################################
#######################################################################################################
07.a.MiSTI.sh: Estimating divergence with PSMC-based Migration and Split Time Inference (MiSTI)

########################################################################
## Estimating divergence with PSMC-based Migration and Split Time Inference (MiSTI) ##
########################################################################

# MiSTI is a novel program developed by Vladimir Shchur (https://github.com/vlshchur/MiSTI) that is able to estimate migration rates over time and split time between two populations. To this end, the program requires:
# A PSMC file of each genome to be compared. This was done in step 04.PSMC.sh
# A joint site frequency spectrum (2DSFS) file calculated from both genomes (script 03.SFS.Het.Fst.thetas.sh). 
# (Optional) Migration bands: defined segments of time where we estimate different rates of migration. It can be optimized. 

#############################
## Variable and PATHs definition ##
#############################

PROG='/path/to/programs/folder'
IN='/path/to/bamfolder/individual'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where our PSMC input is stored:
psmc='/path/to/ANGSD/output/8.PSMC/PSMC'
# PATH to output folder where our 2DSFS input is stored:
dsfs='/path/to/ANGSD/output/7.SFS/2DSFS'

#################
## MiSTI analysis ##
#################

# We convert the 2DSFS file from step 03.SFS.Het.Fst.thetas.sh to MiSTI format. Attention! This 2DSFS MUST include genic regions. With this example, we will calculate the split time and migration rates between the Algerian and Kenyan African golden wolves. Most of these calculations would normally take minutes or seconds and can be run in any home PC.

$PROG/misti/utils/ANGSDSFS.py $dsfs/afr_wolf.algeria.afr_wolf.kenya.wgenic.sfs afr_wolf.algeria afr_wolf.kenya > algeria.kenya.mi.sfs

# We calculate the list of time segments between the Algerian and Kenyan AGW PSMC files. 
# We define these units for canids: mutation rate 4.5e-09 (Koch et al., 2019), generation time 3 years (Freedman et al., 2014)

$PROG/misti/utils/calc_time.py --funits $PROG/misti/misti/setunits.canid.txt $psmc/afr_wolf.algeria.psmc $psmc/afr_wolf.kenya.psmc > timescale.algeria.kenya.txt

# We have calculated the heterozygosity loss per genome due to coverage in step 03.SFS.Het.Fst.thetas.sh.
# Also, the file timescale.algeria.kenya.txt will have a number of defined time segments with certain time points in the past. 
# We have based our knowledge in humid/dry periods in Sahara from the literature (see paper) to define a number of time segments under which there could be more or less migration. To avoid biases, we let the program optimize the migration rates by itself. These time segments could have (or not) affected our lineages since they might not have been in the same place. For this reason, we repeated the analyses three times being dynamic with time segments and received the same results for time split. We will be using GNU Parallel (Tange, 2018), which can paralellize running softwares in home PCs. 
# An example of one of the runs: 

time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.0 --funits $PROG/misti/setunits.canid.txt $psmc/afr_wolf.algeria.psmc $psmc/afr_wolf.kenya.psmc algeria.kenya.mi.sfs {per} -o algeria.kenya.opt.mi -mi 1 0 3 00.0 1 -mi 2 0 3 00.0 1 -mi 1 4 9 00.0 1 -mi 2 4 9 00.0 1 -mi 1 10 35 00.0 1 -mi 2 10 35 00.0 1 -mi 1 36 37 00.0 1 -mi 2 36 37 00.0 1 -mi 1 38 39 00.0 1 -mi 2 38 39 00.0 1 -mi 1 40 41 00.0 1 -mi 2 40 41 00.0 1 -mi 1 42 43 00.0 1 -mi 2 42 43 00.0 1 -mi 1 44 45 00.0 1 -mi 2 44 45 00.0 1 -mi 1 46 50 00.0 1 -mi 2 46 50 00.0 1 >> output.modelized/algeria.kenya.opt.out ::: per 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

# With this command, we are defining several time segments: 0-3, 4-9, 10-35, 36-37, 38-39, 40-41, 42-43, 44-45, 46-50. 
# The values in real time of these segments will be given by the file timescale.algeria.kenya.txt
########################################
## Calculating the split time and migration rates ##
########################################

# The output algeria.kenya.opt.out shows the different paralellized runs of MiSTI and the log-likelihood values. 
# This command will show the different log likelihoods. We will choose the one closest to 0. 
grep "llh =" algeria.kenya.opt.out

# Calculating the split time is not straightforward. Normally, a list of split times will look like this:
algeria.kenya 
splitT time Time/10000 Llh /100000 llh
10 14014.906 1.4014906 -2.4498167494272 -244981.67494272
11 16454.32606 1.6454326 -2.4141074259614 -241410.74259614
12 17095.6004 1.70956004 -2.40589688625513 -240589.688625513
13 20439.20546 2.043920546 -2.36995939498768 -236995.939498768
14 20568.79202 2.056879202 -2.36876460523784 -236876.460523784
15 24068.03546 2.406803546 -2.35175114554301 -235175.114554301
16 25009.44351 2.500944351 -2.34804439531577 -234804.439531577
17 28006.614 2.8006614 -2.34162141334184 -234162.141334184
18 29802.81405 2.9802814 -2.33895852365319 -233895.852365319
19 32281.0112 3.22810112 -2.3387775032911 -233877.75032911
20 34976.85229 3.49768522 -2.34017491043334 -234017.491043334
21 36920.16933 3.69201693 -2.34303327439572 -234303.327439572
22 40561.62956 4.05616295 -2.35082403745713 -235082.403745713

# The split time must be between 29802-32281 years ago, but it is impossible to calculate it like this. However, we can plot a scatterplot and adjust a polynomial curve between time and llh (we have included time/10000 and llh/100000 here for the sake of simplification). To do so, we have plotted llh/100000 vs time/10000 in an Excel table and have adjusted a polynomial curve of degree 5. R2 was 0.9945 and the equation 9.556E-04x? � 2.763E-02x? + 2.98E-01x? � 1.483x? + 3.319x -4.942.
# Finally, to calculate the split time, we have used the Newton-Raphson approach with the help of a website to calculate the points in which the derivative of the polynomial reaches 0 (https:www.symbolab.com). We have taken the 0.1%, 1% and 5% of upper points between 0 and 100000 years ago to estimate a confidence interval. 
# These time points and intervals were plotted with data of d??O from the NGRIP with glacial/interglacial times (see paper).
# To estimate migration rates, the lowest value of llh was taken and migration rates were observed per time segment. 

#################
## REFERENCES ##
#################

# Freedman, A. H., et al.(2014). Genome Sequencing Highlights the Dynamic Early History of Dogs. PLoS Genetics, 10(1). https://doi.org/10.1371/journal.pgen.1004016

# Koch, E. M., et al. (2019). De Novo Mutation Rate Estimation in Wolves of Known Pedigree. Molecular Biology and Evolution, 36(11), 2536�2547. https://doi.org/10.1093/molbev/msz159


#######################################################################################################
#######################################################################################################
07.b.MiSTI.replicability.cov.sh: Replicability test: a. High and low genomic coverages

#########################################################################
## Estimating divergence with PSMC-based Migration and Split Time Inference (MiSTI) ##
########################################################################

## Replicability test: a. High and low genomic coverages ##

# MiSTI is a novel program developed by Vladimir Shchur (https://github.com/vlshchur/MiSTI) that is able to estimate migration rates over time and split time between two populations. To this end, the program requires:
# A PSMC file of each genome to be compared. This was done in step 04.PSMC.sh
# A joint site frequency spectrum (2DSFS) file calculated from both genomes (script 03.SFS.Het.Fst.thetas.sh). 
# (Optional) Migration bands: defined segments of time where we estimate different rates of migration. It can be optimized. 

# In this script, we are running the same command between the Algerian and Kenyan African golden wolf genomes. 
# Kenyan AGW (24X) has been downsampled to 15X, 11.2X, 9X and 7X.
# Results are in Table S7.

#############################
## Variable and PATHs definition ##
#############################

PROG='/path/to/programs/folder'
IN='/path/to/bamfolder/individual'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where our PSMC input is stored:
psmc='/path/to/ANGSD/output/8.PSMC/PSMC'
# PATH to output folder where our 2DSFS input is stored:
dsfs='/path/to/ANGSD/output/7.SFS/2DSFS'

#################
## MiSTI analysis ##
#################

###
# 1st step: convert ANGSD 2DSFS file -> MiSTI 2DSFS file
# For each coverage, we convert the 2DSFS file from step 03.SFS.Het.Fst.thetas.sh to MiSTI format.
# Most of these calculations would normally take minutes or seconds and can be run in any home PC.
for cov in 24X 15X 11.2X 9X 7X; do $PROG/misti/utils/ANGSDSFS.py $dsfs/afr_wolf.algeria.afr_wolf.kenya.$cov.sfs afr_wolf.algeria afr_wolf.kenya.$cov > algeria.kenya.mi.$cov.sfs; done

###
# 2nd step: calculate the list of time segments between the Algerian and Kenyan AGW PSMC files
# We define a set of units for canids: mutation rate 4.5e-09 (Koch et al., 2019), generation time 3 years (Freedman et al., 2014)

for cov in 24X 15X 11.2X 9X 7X; do $PROG/misti/utils/calc_time.py --funits $PROG/misti/misti/setunits.canid.txt $psmc/afr_wolf.algeria.psmc $psmc/afr_wolf.kenya.$cov.psmc > timescale.algeria.kenya.$cov.txt; done

###
# 3rd step: Define a number of common time steps to run between the different coverages. 
# We have observed that definition of time segments does not affect the divergence time estimation (see Supplementary File: "Replicability tests: results b. Definition of time segments"). # However, we want our time segments to be as close to each other as possible. Our time steps are like this:

time step algeria.kenya.7X algeria.kenya.9X algeria.kenya.11.2X algeria.kenya.15X algeria.kenya.24X
0 0 0 0 0 0
1 1237 1487 1551 1776 1919
2 2205 2205 2205 2205 2205
3 2579 3100 3234 3703 4003
4 4037 4585 4585 4585 4585
5 4585 4852 5060 5793 6263
6 5618 6752 7042 7154 7154
7 7154 7154 7154 8060 8715
8 7334 8814 9192 9928 9928
9 9197 9928 9928 10519 11375
10 9928 11052 11525 12921 12921
...
# In this table we see how several time steps coincide (0, 2205, 4585, 7154, 9928), which are the time steps extracted from the Algerian .psmc file. We will use these steps as beginning of every time segment. 

###
# 4th step: Run MiSTI. 

# We have calculated the heterozygosity loss per genome due to coverage in step 03.SFS.Het.Fst.thetas.sh. 
# We are taking into account genomic heterozygosity loss using the --hetloss command.
# We will be using GNU Parallel (Tange, 2018), which can paralellize running softwares in home PCs. Our runs:

# Algeria.Kenya.24X
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.0 --funits $MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.kenya.24X.wgenic.psmc $in/misfs/algeria.kenya.24X.mi.sfs {per} -o $in/mi/algeria.kenya.24X.opt.extended.mi -mi 1 0 3 00.0 1 -mi 2 0 3 00.0 1 -mi 1 4 9 00.0 1 -mi 2 4 9 00.0 1 -mi 1 10 35 00.0 1 -mi 2 10 35 00.0 1 -mi 1 36 37 00.0 1 -mi 2 36 37 00.0 1 -mi 1 38 39 00.0 1 -mi 2 38 39 00.0 1 -mi 1 40 41 00.0 1 -mi 2 40 41 00.0 1 -mi 1 42 43 00.0 1 -mi 2 42 43 00.0 1 -mi 1 44 45 00.0 1 -mi 2 44 45 00.0 1 -mi 1 46 50 00.0 1 -mi 2 46 50 00.0 1 ::: per 15 16 17 18 19 20 21 22 23 24 25 >> output.test.a/algeria.kenya.24X.extended.out

# Algeria.Kenya.15X
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.09886 --funits $MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.kenya.15X.wgenic.psmc $in/misfs/algeria.kenya.15X.mi.sfs {per} -o $in/mi/algeria.kenya.15X.opt.extended.mi -mi 1 0 3 00.0 1 -mi 2 0 3 00.0 1 -mi 1 4 9 00.0 1 -mi 2 4 9 00.0 1 -mi 1 10 36 00.0 1 -mi 2 10 36 00.0 1 -mi 1 37 38 00.0 1 -mi 2 37 38 00.0 1 -mi 1 39 40 00.0 1 -mi 2 39 40 00.0 1 -mi 1 41 42 00.0 1 -mi 2 41 42 00.0 1 -mi 1 43 44 00.0 1 -mi 2 43 44 00.0 1 -mi 1 45 46 00.0 1 -mi 2 45 46 00.0 1 -mi 1 47 51 00.0 1 -mi 2 47 51 00.0 1 ::: per 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 >> output.test.a/algeria.kenya.15X.extended.out

# Algeria.Kenya.11.2X
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.18911 --funits $MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.kenya.11.2X.wgenic.psmc $in/misfs/algeria.kenya.11.2X.mi.sfs {per} -o $in/mi/algeria.kenya.11.2X.opt.extended.mi -mi 1 0 3 00.0 1 -mi 2 0 3 00.0 1 -mi 1 4 10 00.0 1 -mi 2 4 10 00.0 1 -mi 1 11 37 00.0 1 -mi 2 11 37 00.0 1 -mi 1 38 39 00.0 1 -mi 2 38 39 00.0 1 -mi 1 40 41 00.0 1 -mi 2 40 41 00.0 1 -mi 1 42 43 00.0 1 -mi 2 42 43 00.0 1 -mi 1 44 45 00.0 1 -mi 2 44 45 00.0 1 -mi 1 46 47 00.0 1 -mi 2 46 47 00.0 1 -mi 1 48 52 00.0 1 -mi 2 48 52 00.0 1 ::: per 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 >> output.test.a/algeria.kenya.11.2X.extended.out

# Algeria.Kenya.9X
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.21472 --funits $MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.kenya.9X.wgenic.psmc $in/misfs/algeria.kenya.9X.mi.sfs {per} -o $in/mi/algeria.kenya.9X.opt.extended.mi -mi 1 0 3 00.0 1 -mi 2 0 3 00.0 1 -mi 1 4 10 00.0 1 -mi 2 4 10 00.0 1 -mi 1 11 38 00.0 1 -mi 2 11 38 00.0 1 -mi 1 39 40 00.0 1 -mi 2 39 40 00.0 1 -mi 1 41 42 00.0 1 -mi 2 41 42 00.0 1 -mi 1 43 44 00.0 1 -mi 2 43 44 00.0 1 -mi 1 45 46 00.0 1 -mi 2 45 46 00.0 1 -mi 1 47 48 00.0 1 -mi 2 47 48 00.0 1 -mi 1 49 53 00.0 1 -mi 2 49 53 00.0 1 ::: per 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 >> output.test.a/algeria.kenya.9X.extended.out

# Algeria.Kenya.7X
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.23025 --funits $MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.kenya.7X.wgenic.psmc $in/misfs/algeria.kenya.7X.mi.sfs {per} -o $in/mi/algeria.kenya.7X.opt.extended.mi -mi 1 0 4 00.0 1 -mi 2 0 4 00.0 1 -mi 1 5 11 00.0 1 -mi 2 5 11 00.0 1 -mi 1 12 40 00.0 1 -mi 2 12 40 00.0 1 -mi 1 41 42 00.0 1 -mi 2 41 42 00.0 1 -mi 1 43 44 00.0 1 -mi 2 43 44 00.0 1 -mi 1 45 46 00.0 1 -mi 2 45 46 00.0 1 -mi 1 47 48 00.0 1 -mi 2 47 48 00.0 1 -mi 1 49 50 00.0 1 -mi 2 49 50 00.0 1 -mi 1 51 55 00.0 1 -mi 2 51 53 00.0 1 ::: per 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 >> output.test.a/algeria.kenya.7X.extended.out

# Every time segment defined starts in a common time step found across all five time scale files. 
# The values in real time of these segments will be given by files timescale.algeria.kenya.$cov.txt; with cov being 24X, 15X, 11.2X, 9X and 7X.

########################################
## Calculating the split time and migration rates ##
########################################

# The output algeria.kenya.$cov.extended.out shows the different paralellized runs of MiSTI and the log-likelihood values. 
# This command will show the different log likelihoods. We will choose the one closest to 0. 
grep "llh =" algeria.kenya.opt.out

# Calculating the split time is not straightforward. Normally, a list of split times will look like this:
algeria.kenya.24X 
splitT time Time/10000 Llh /100000 llh
10 14014.906 1.4014906 -2.4498167494272 -244981.67494272
11 16454.32606 1.6454326 -2.4141074259614 -241410.74259614
12 17095.6004 1.70956004 -2.40589688625513 -240589.688625513
13 20439.20546 2.043920546 -2.36995939498768 -236995.939498768
14 20568.79202 2.056879202 -2.36876460523784 -236876.460523784
15 24068.03546 2.406803546 -2.35175114554301 -235175.114554301
16 25009.44351 2.500944351 -2.34804439531577 -234804.439531577
17 28006.614 2.8006614 -2.34162141334184 -234162.141334184
18 29802.81405 2.9802814 -2.33895852365319 -233895.852365319
19 32281.0112 3.22810112 -2.3387775032911 -233877.75032911
20 34976.85229 3.49768522 -2.34017491043334 -234017.491043334
21 36920.16933 3.69201693 -2.34303327439572 -234303.327439572
22 40561.62956 4.05616295 -2.35082403745713 -235082.403745713

# The split time must be between 29802-32281 years ago, but it is impossible to calculate it like this. 
# We plot a scatterplot and adjust a polynomial curve between time and llh (we have included time/10000 and llh/100000 here for the sake of simplification). It is the same step done as in script 07.a. We extract the polynomial curve and R2 
# We also compare the upper 95%, 99% and 99.9% of the curve maximum between the five estimates.


#######################################################################################################
#######################################################################################################
07.c.MiSTI.replicability.times.sh: Replicability test: b. Definition of time segments

## Replicability test: b. Definition of time segments ##

# MiSTI is a novel program developed by Vladimir Shchur (https://github.com/vlshchur/MiSTI) that is able to estimate migration rates over time and split time between two populations. To this end, the program requires:
# A PSMC file of each genome to be compared. This was done in step 04.PSMC.sh
# A joint site frequency spectrum (2DSFS) file calculated from both genomes (script 03.SFS.Het.Fst.thetas.sh). 
# (Optional) Migration bands: defined segments of time where we estimate different rates of migration. It can be optimized. 
# In this script, we are running the same command between the Algerian and Egyptian golden wolf genomes. 
# We defined different time segments, using both humid/dry time segments and paired (even and uneven) time segments. 
# Results are in Table S8.

#############################
## Variable and PATHs definition ##
#############################

PROG='/path/to/programs/folder'
IN='/path/to/bamfolder/individual'
REF='/path/to/ref/CanFam3.1'
# PATH to output folder where our PSMC input is stored:
psmc='/path/to/ANGSD/output/8.PSMC/PSMC'
# PATH to output folder where our 2DSFS input is stored:
dsfs='/path/to/ANGSD/output/7.SFS/2DSFS'

#################
## MiSTI analysis ##
#################

# We used the same files at every time (afr_wolf.algeria and afr_wolf.egypt) and repeated the previous steps from scripts 7.a and 7.b. Heterozygosity loss is taken into account with command --hetloss

# humid/dry
time parallel --header : -j 10 $misti -uf --bsSize 10 --hetloss 0.21472 0.09886 --funits $PROG/MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.egypt.normalX.wgenic.psmc $in/misfs/algeria.egypt.mi.sfs {per} -o $in/mi/algeria.egypt.opt.mi -mi 1 0 2 00.0 1 -mi 2 0 2 00.0 1 -mi 1 3 6 00.0 1 -mi 2 3 6 00.0 1 -mi 1 7 13 00.0 1 -mi 2 7 13 00.0 1 -mi 1 14 17 00.0 1 -mi 2 14 17 00.0 1 -mi 1 18 19 00.0 1 -mi 2 18 19 00.0 1 -mi 1 20 22 00.0 1 -mi 2 20 22 00.0 1 -mi 1 23 29 00.0 1 -mi 2 23 29 00.0 1 -mi 1 30 32 00.0 1 -mi 2 30 32 00.0 1 -mi 1 33 34 00.0 1 -mi 2 33 34 00.0 1 -mi 1 35 36 00.0 1 -mi 2 35 36 00.0 1 -mi 1 37 38 00.0 1 -mi 2 37 38 00.0 1 -mi 1 39 40 00.0 1 -mi 2 39 40 00.0 1 -mi 1 41 44 00.0 1 -mi 2 41 44 00.0 1 >> algeria.egypt.opt.out ::: per 3 4 5 6 7 8 9 10 11

# paired even
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.09886 --funits $PROG/MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.egypt.normalX.wgenic.psmc $in/misfs/algeria.egypt.mi.sfs {per} -o $in/mi/algeria.egypt.opt.mi -mi 1 0 3 00.0 1 -mi 2 0 3 00.0 1 -mi 1 4 5 00.0 1 -mi 2 4 5 00.0 1 -mi 1 6 7 00.0 1 -mi 2 6 7 00.0 1 -mi 1 8 9 00.0 1 -mi 2 8 9 00.0 1 -mi 1 10 11 00.0 1 -mi 2 10 11 00.0 1 -mi 1 12 13 00.0 1 -mi 2 12 13 00.0 1 -mi 1 14 15 00.0 1 -mi 2 14 15 00.0 1 -mi 1 16 17 00.0 1 -mi 2 16 17 00.0 1 -mi 1 18 19 00.0 1 -mi 2 18 19 00.0 1 -mi 1 20 21 00.0 1 -mi 2 20 21 00.0 1 -mi 1 22 23 00.0 1 -mi 2 22 23 00.0 1 -mi 1 24 25 00.0 1 -mi 2 24 25 00.0 1 -mi 1 26 27 00.0 1 -mi 2 26 27 00.0 1 -mi 1 28 29 00.0 1 -mi 2 28 29 00.0 1 -mi 1 30 31 00.0 1 -mi 2 30 31 00.0 1 -mi 1 32 33 00.0 1 -mi 2 32 33 00.0 1 -mi 1 34 35 00.0 1 -mi 2 34 35 00.0 1 -mi 1 36 37 00.0 1 -mi 2 36 37 00.0 1 -mi 1 38 39 00.0 1 -mi 2 38 39 00.0 1 -mi 1 40 41 00.0 1 -mi 2 40 41 00.0 1 -mi 1 42 43 00.0 1 -mi 2 42 43 00.0 1 -mi 1 44 45 00.0 1 -mi 2 44 45 00.0 1 >> algeria.egypt.paired.even.out ::: per 3 4 5 6 7 8 9 10 11

# paired uneven
time parallel --header : -j 20 $misti -uf --bsSize 10 --hetloss 0.21472 0.09886 --funits $PROG/MiSTI/setunits.canid.txt $in/psmc/afr_wolf.algeria.normalX.wgenic.psmc $in/psmc/afr_wolf.egypt.normalX.wgenic.psmc $in/misfs/algeria.egypt.mi.sfs {per} -o $in/mi/algeria.egypt.opt.paired.uneven.mi -mi 1 0 2 00.0 1 -mi 2 0 2 00.0 1 -mi 1 3 4 00.0 1 -mi 2 3 4 00.0 1 -mi 1 5 6 00.0 1 -mi 2 5 6 00.0 1 -mi 1 7 8 00.0 1 -mi 2 7 8 00.0 1 -mi 1 9 10 00.0 1 -mi 2 9 10 00.0 1 -mi 1 11 12 00.0 1 -mi 2 11 12 00.0 1 -mi 1 13 14 00.0 1 -mi 2 13 14 00.0 1 -mi 1 15 16 00.0 1 -mi 2 15 16 00.0 1 -mi 1 17 18 00.0 1 -mi 2 17 18 00.0 1 -mi 1 19 20 00.0 1 -mi 2 19 20 00.0 1 -mi 1 21 22 00.0 1 -mi 2 21 22 00.0 1 -mi 1 23 24 00.0 1 -mi 2 23 24 00.0 1 -mi 1 25 26 00.0 1 -mi 2 25 26 00.0 1 -mi 1 27 28 00.0 1 -mi 2 27 28 00.0 1 -mi 1 29 30 00.0 1 -mi 2 29 30 00.0 1 -mi 1 31 32 00.0 1 -mi 2 31 32 00.0 1 -mi 1 33 34 00.0 1 -mi 2 33 34 00.0 1 -mi 1 35 36 00.0 1 -mi 2 35 36 00.0 1 -mi 1 37 38 00.0 1 -mi 2 37 38 00.0 1 -mi 1 39 40 00.0 1 -mi 2 39 40 00.0 1 -mi 1 42 43 00.0 1 -mi 2 42 43 00.0 1 -mi 1 44 45 00.0 1 -mi 2 44 45 00.0 1 >> algeria.egypt.paired.uneven.out ::: per 3 4 5 6 7 8 9 10 11

########################################
## Calculating the split time and migration rates ##
########################################

# The output algeria.egypt.opt.out shows the different paralellized runs of MiSTI and the log-likelihood values. This command will show the different log likelihoods. We will choose the one closest to 0. 
grep "llh =" algeria.kenya.opt.out

# Calculating the split time is not straightforward. Normally, a list of split times will look like this:
algeria.kenya 
splitT time Time/10000 Llh /100000 llh
10 14014.906 1.4014906 -2.4498167494272 -244981.67494272
11 16454.32606 1.6454326 -2.4141074259614 -241410.74259614
12 17095.6004 1.70956004 -2.40589688625513 -240589.688625513
13 20439.20546 2.043920546 -2.36995939498768 -236995.939498768
14 20568.79202 2.056879202 -2.36876460523784 -236876.460523784
15 24068.03546 2.406803546 -2.35175114554301 -235175.114554301
16 25009.44351 2.500944351 -2.34804439531577 -234804.439531577
17 28006.614 2.8006614 -2.34162141334184 -234162.141334184
18 29802.81405 2.9802814 -2.33895852365319 -233895.852365319
19 32281.0112 3.22810112 -2.3387775032911 -233877.75032911
20 34976.85229 3.49768522 -2.34017491043334 -234017.491043334
21 36920.16933 3.69201693 -2.34303327439572 -234303.327439572
22 40561.62956 4.05616295 -2.35082403745713 -235082.403745713

# We repeat the steps of scripts 7.a and 7.b to estimate divergence time, adjusting a polynomial curve between time and llh (we have included time/10000 and llh/100000 here for the sake of simplification) and comparing 95%, 99% and 99.9% of the curve maximum between the three estimates.
# We also added an estimate of curve maximum with humid/dry periods and all time steps between 1 and 44 (150kyr ago). 


#######################################################################################################
#######################################################################################################
07.d.MiSTI.replicability.Cavalli.sh: Replicability test: c. Cavalli-Sforza (1969) equation

## Replicability test: c. Cavalli-Sforza (1969) equation ##

# In this script, we are estimating Watterson's theta and genomewide Fst to apply at the Cavalli-Sforza's equation (1969) and compare with MiSTI results. 
# The equation is (equation 4 from Supplementary File: Replicability of results in divergence time estimation. C. Comparison to Cavalli-Sforza estimation.") t = (- log (1-?ST))*2*(?W/(4*?)); where ?ST is a mean estimate of Fst (with genic regions), ?W is genomewide Watterson's theta (with genic regions), ? is mutation rate per site and generation (we will use ?=4.0*10^-9 as in Skoglund et al., (2015) and ?=4.5*10^-9 as in Frantz et al., (2016) and Koch et al., (2019))
# t will be the estimation of divergence time. 
# We will take into consideration the wide standard deviation of this estimate (Nielsen et al., 1998)

###########################
## Estimate Watterson's theta ##
###########################

# 1st step: we will estimate genomewide ?W
# To this end, we will follow the same steps as in script 03.SFS.Het.Fst.thetas.sh (section: thetas). 
# Thetas will be computed per each pair of genomes. We will treat each pair as a separate population to estimate pseudo- population effective sizes. The example will be "algeria � egypt". Genic regions will be included.

#1. Estimate pestPG
$PROG/angsd/angsd -P 10 -bam $IN/algeria.egypt.autosomes.bamlist -anc $ANC -ref $REF -out $OUT/algeria.egypt.autosomes \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd 2 -setMinDepth 5 -setMaxDepth 56 -doCounts 1 \
-GL 1 -doSaf 1 -doThetas 1 -pest $OUT/algeria.egypt.autosomes.sfs

#2. Print thetas
$PROG/angsd/misc/thetaStat print $OUT/algeria.egypt.autosomes.wgenic.thetas.idx > $OUT/algeria.egypt.autosomes.wgenic.thetas.print

#3. Check window-based thetas
$PROG/angsd/misc/thetaStat do_stat $OUT/algeria.egypt.autosomes.wogenic.thetas.idx
$PROG/angsd/misc/thetaStat do_stat $OUT/algeria.egypt.autosomes.wogenic.thetas.idx -win 50000 -step 50000 -outnames $OUT/algeria.egypt.autosomes.wogenic.thetas.out

# We will then plot the distribution of sites per window and eliminate those windows where we have less than 5% of the curve distribution. In bash:
cut -f2-14 afr_wolf.algeria.afr_wolf.egypt.wgenic.thetas.ww.out | tail -n+2 > algeria.egypt.thetas

head algeria.egypt.thetas
chr01 225000 38.040209 40.647157 11.641463 60.203793 50.425475 0.715373 1.933220 1.853722 -1.556514 1.146807 40910

# The desired column is 13. In R:

>col = read.table("algeria.egypt.test", as.is=T)
>hist(col$V13, breaks=100)
>quantile(col$V13, probs = c(0.01,0.05,0.99, 0.995)); 
1% 5% 99% 99.5% 
17300.10 38125.60 48447.88 48627.97 

# We repeat this command for every pair of genomes and extracted those windows who had more than the lowest 5% of data. After this step, we estimate mean and standard deviation in R using commands: mean(col$V13); stdev (col$V13)

###########################
## Estimate genomewide Fst ###
###########################

# 2nd step: we will estimate genomewide Fst
# To this end, we will follow the same steps as in script 03.SFS.Het.Fst.thetas.sh (section: Fst). 
# Fst will be computed per each pair of genomes. The example will be "algeria - egypt"
# Genic regions will be included.

# We first compute 2DSFS files using SAF files.

$PROG/angsd/misc/realSFS -P 16 $OUT/afr_wolf.algeria.autosomes.wgenic.saf.idx $OUT/afr_wolf.egypt.autosomes.wgenic.saf.idx > $OUT/afr_wolf.algeria.afr_wolf.egypt.2d.sfs

# This 2DSFS file will be used to estimate Fst in a 50kb sliding windows scan with realSFS as in: https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md 

$PROG/angsd/misc/realSFS fst index -P 16 $OUT/afr_wolf.algeria.autosomes.wgenic.saf.idx $OUT/afr_wolf.egypt.autosomes.wgenic.saf.idx -sfs $OUT/afr_wolf.algeria.afr_wolf.egypt.2d.sfs -fstout $OUT/afr_wolf.algeria.afr_wolf.egypt.wgenic -whichFST 1

$PROG/angsd/misc/realSFS fst stats2 afr_wolf.algeria.afr_wolf.egypt.wgenic.fst.idx -win 50000 -step 50000 -whichFST 1 > afr_wolf.algeria.afr_wolf.egypt.wgenic.fst.ww.out

# .ww.out files will be extracted and the same process as in previous step will be done:
cut -f2-5 afr_wolf.algeria.afr_wolf.egypt.wgenic.fst.ww.out | tail -n+2 > algeria.egypt.fst.file

# Wanted column is #3. As in previous step, we plot the curve and estimate lowest 5% in R. 
>col = read.table("algeria.egypt.fst.file", as.is=T)
>hist(col$V3, breaks=100)
>quantile(col$V3, probs = c(0.01,0.05,0.99, 0.995)); mean (col$V3); stdev (pbs$V3); 
1% 5% 99% 99.5% 
15224.32 20375.10 45778.67 46414.33

# We repeat this command for every pair of genomes and extracted those windows who had more than the lowest 5% of data. 


##########################################################
## Estimating divergence times with Cavalli-Sforza's (1969) equation ##
#########################################################

# We used mean estimations of Fst (using 95% top significant data under the curve), and mean Watterson's theta with its standard deviation in an Excel file. Cavalli-Sforza (1969) and Nielsen et al. (1998) use ?ST without a standard deviation. This is because Fst can take extremely low and high values at a genomic scale. This happens with genes under selection, telomeric and centromeric regions, regions of low mappability and high repeatibility (SINEs, LINEs) and regions with high homozygosity (ROHs). Therefore, we just use the mean value of Fst (estimated) and take into account stdev of Watterson's theta for the table. ? is mutation rate per site and generation (we will use ?=4.0*10^-9 as in Skoglund et al., (2015) and ?=4.5*10^-9 as in Frantz et al., (2016) and Koch et al., (2019))

# Results are in Table S9. 

#################
## REFERENCES ##
#################

# Cavalli-Sforza, L. L. (1969). Human diversity. Proc. 12th Int.Congr. Genet. 2:405-416

# Frantz, L.A.F., Mullin, V.E., Pionnier-Capitan, M., Lebrasseur, O., Ollivier, M., Perri, A., Linderholm, A., Mattiangeli, V., Teasdale, M.D., Dimopoulos, E.A. (2016). Genomic and archaeological evidence suggest a dual origin of domestic dogs. Science 352(6290):1228�1231.

# Freedman, A. H., et al.(2014). Genome Sequencing Highlights the Dynamic Early History of Dogs. PLoS Genetics, 10(1). https://doi.org/10.1371/journal.pgen.1004016

# Koch, E. M., et al. (2019). De Novo Mutation Rate Estimation in Wolves of Known Pedigree. Molecular Biology and Evolution, 36(11), 2536�2547. https://doi.org/10.1093/molbev/msz159

# Nielsen, R., Mountain, J.L., Huelsenbeck, J.P., Slatkin, M. (1998). Maximum-Likelihood estimation of population divergence times and population phylogeny in models without mutation. Evolution, 52(3). 1998. pp. 669-677

# Skoglund P, Ersmark E, Palkopoulou E, Dalen L. 2015. Ancient wolf genomereveals an early divergence of domestic dog ancestors and admixture into high-latitude breeds. Curr Biol. 25(11):1515�1519.