Population Genetics and GWAS


Prepared by: Emile Chimusa Rugamika, Segun Fatumo
Module Name: Population genetics and Genome-wide Association Studies (GWAS)
Contact hours:
Total (40 hours), Theory (50%), Practical (50%)


Recent developments in molecular biotechnologies and instruments have made it possible to acquire genetic and genomic data from almost any organism, with few practical limits. Population genetics and disease scoring statistics (genome-wide association studies), which aim to provide effective and efficient analysis and use of these data, have thus become one of the most active and promising areas of statistics (i.e. statistical genetics and disease scoring statistics).

This course is devoted to computational problems and methods in the emerging field of medical population genetics and disease scoring statistics (Genome-wide Association Studies), where genomics, computational biology, biostatistics and bioinformatics all impact medical research. In addition, the course provides a forum for students of statistics and biology to exchange ideas, problems, and thoughts in a free, stimulating atmosphere.


On completion of this module, students should:

1. Understand the interplay between disease scoring statistics and medical genetic discoveries.
2. Be able to solve genetic problems using the quantitative skills acquired.
3. Have the statistical and computational knowledge needed to carry out genetic data analyses and interpret the results.


H3ABioNet bioinformatics modules as pre-requisites: Biostatistics I, Sequence Analysis, Ethics in Research, Programming I

Additional: The module is designed for graduate students with some background in calculus, regression analysis, mixed models, maximum likelihood, Bayesian statistics (desirable), and programming skills (C++, Python, R). Familiarity with basic genetics is desirable.


1. Principles of Population Genetics (Fourth Edition, 2007) Daniel L. Hartl and Andrew G. Clark.

2. Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL by Rongling Wu, Chang-Xing Ma, and George Casella. Springer-Verlag, New York (2007).

1. Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R (2010) Dindel: accurate indel calls from short-read data. Genome Res. 21:961-73.

2. Browning SR, Browning BL (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 81:1084-97. PMID: 17924348.

3. Coventry A, Bull-Otterson LM, Liu X, Clark AG, Maxwell TJ, Crosby J, Hixson JE, Rea TJ, Muzny DM, Lewis LR, Wheeler DA, Sabo A, Lusk C, Weiss KG, Akbar H, Cree A, Hawes AC, Newsham I, Varghese RT, Villasana D, Gross S, Joshi V, Santibanez J, Morgan M, Chang K, Iv WH, Templeton AR, Boerwinkle E, Gibbs R, Sing CF (2010) Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat Commun. 1:131. PMID: 21119644.

4. Delaneau O, Zagury JF, Marchini J (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods. 10:5-6. PMID: 23269371.

5. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 44:955-9. PMID: 22820512.

6. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 44:226-32. PMID: 22231483.

7. Jun G, Flickinger M, Hetrick KN, Romm JM, Doheny KF, Abecasis GR, Boehnke M, Kang HM (2012) Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am J Hum Genet. 91:839-48. PMID: 23103226.

8. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-8. PMID: 18714091.

9. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25:1754-60. PMID: 19451168.

10. Li H, Durbin R (2011) Inference of human population history from individual whole-genome sequences. Nature. 475:493-6. PMID: 21753753.

11. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 34:816-34. PMID: 21058334.

12. Lin DY, Zeng D (2010) Meta-analysis of genome-wide association studies: no efficiency gain in using individual participant data. Genet Epidemiol. 34:60-6. PMID: 19847795.

13. Liu et al (2013) http://arxiv.org/abs/1305.1318.

14. Wen X, Stephens M (2010) Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann Appl Stat. 4:1158-1182. PMID: 21479081.

15. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 89:82-93.

16. Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:821-9. PMID: 18349386.


A) Theory lectures

1. Medical Population Genetics.
a. Basic concepts in genetics: models, linkage disequilibrium, Hardy-Weinberg equilibrium, identity by descent (IBD), pedigrees and trios.
b. Selection and Drift: Diffusion models.
c. Mutation and Random genetic drift.
d. Demography and population structure.
e. Model for population admixture.
f. Haplotype Phasing by deCODE, Expectation Maximization and Identity by Descent.
g. Informativeness SNP tagging.
h. Coalescent and statistics: the Minichiello-Durbin ancestral recombination graph reconstruction.
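
As an illustration of the basic concepts in topic 1a, the following is a minimal Python sketch of two standard calculations: a chi-square statistic for departure from Hardy-Weinberg proportions at a biallelic locus, and the linkage disequilibrium measures D and r^2 for a pair of loci. The function names `hwe_chi_square` and `ld_stats` are hypothetical and are not part of the module material; the sketch assumes polymorphic loci (no allele frequency equal to 0 or 1).

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 d.f.) for departure from Hardy-Weinberg
    proportions, given observed genotype counts at a biallelic locus.
    Assumes both alleles are present (otherwise an expected count is zero)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def ld_stats(p_ab, p_a, p_b):
    """Linkage disequilibrium between two loci.
    p_ab: frequency of the haplotype carrying allele A and allele B;
    p_a, p_b: frequencies of allele A and allele B.
    Returns (D, r^2)."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2
```

For example, genotype counts of 30, 40 and 30 give an allele frequency of 0.5, expected counts of 25, 50 and 25, and a statistic of 4.0.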

2. Statistical analysis of quantitative genetics.
a. Mendelian segregation.
b. Linkage analysis and map construction.
c. Quantitative genetics.
d. QTL mapping: regression analysis.
e. QTL mapping: Maximum likelihood and Bayesian approach.
f. Linkage disequilibrium analysis in natural populations.
g. Joint linkage and linkage disequilibrium analysis of QTLs.
h. Functional mapping.
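
To make topic 2d concrete, here is a minimal sketch of single-marker regression, the simplest form of regression-based QTL mapping: the phenotype is regressed on the genotype at one marker, coded as the number of copies of a reference allele (0, 1 or 2). The function name `marker_regression` and the additive coding are assumptions made for illustration; a real analysis would also report a test statistic and scan many markers.

```python
def marker_regression(genotypes, phenotypes):
    """Ordinary least-squares fit of phenotype on marker genotype.
    genotypes: allele counts (0, 1 or 2) per individual;
    phenotypes: quantitative trait values in the same order.
    Returns (intercept, slope), where the slope estimates the additive
    effect of one extra copy of the counted allele.
    Assumes the marker is not monomorphic in the sample."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    return intercept, slope
```

With genotypes `[0, 1, 2, 0, 1, 2]` and phenotypes `[1.0, 2.0, 3.0, 1.0, 2.0, 3.0]`, the fit is exact, with intercept 1.0 and slope 1.0.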

3. Microarray data analysis.

4. Genome-wide association studies.
a. Disease models: common-disease common-variant, Zollner-Pritchard, McClellan-King genetic heterogeneity in human disease.
b. Genome-wide association studies in a case-control design (tests of association, hypothesis testing, and multiple comparisons corrections).
c. Population substructure.
d. The missing heritability problem: Common variants vs. rare variants.
e. Genome-wide association studies and Next Generation Sequencing.
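
The case-control testing and multiple-comparison correction in topic 4b can be sketched as follows. This is a minimal illustration, not a prescribed method for the module: it uses the allelic chi-square test on a 2x2 table of allele counts and a Bonferroni threshold, and the function names `allelic_test` and `bonferroni_significant` are hypothetical. It assumes every row and column of the table has a non-zero total.

```python
import math


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Allelic chi-square test (1 d.f.) on a 2x2 table of allele counts.
    case_a, case_b: counts of alleles A and B among cases;
    ctrl_a, ctrl_b: counts of alleles A and B among controls.
    Returns (chi2, p_value)."""
    n = case_a + case_b + ctrl_a + ctrl_b
    num = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    den = ((case_a + case_b) * (ctrl_a + ctrl_b)
           * (case_a + ctrl_a) * (case_b + ctrl_b))
    chi2 = num / den
    # Upper tail of a chi-square with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value falls below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

For instance, allele counts of 60 and 40 in cases against 40 and 60 in controls give a chi-square statistic of 8.0. With a million SNPs tested at alpha = 0.05, the Bonferroni threshold is 5e-8, the conventional genome-wide significance level.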

5. Marriage of statistical genetics and systems biology.

B) Practical Component

1. Homework will be assigned every week. Homework problems will consist of a mix of general problems, programming assignments, problems related to the class project, and critical readings of research articles. As the module will include students from diverse academic backgrounds, homework will involve general questions for all students as well as more in-depth questions, from which students will be able to choose in accordance with their particular academic background.
Collaboration policy: Students may discuss the homework problems with other students or use other resources such as textbooks or the Internet. However, students must not obtain answers directly from anyone else. All homework will be submitted individually.

2. Projects and presentations. Each student is required to complete a final research project as part of the module requirements. Students should work with the lecturer or assistant lecturer to frame the project. There will be two presentations for the class projects, one in the middle of the term and one at the end of the term. The final hand-in will include a ~20 minute presentation of the project’s results and the GWAS approach adopted, a paper describing the project’s results, and any accompanying source code and documentation. Students not implementing any code will deliver a more in-depth presentation, paper, and/or analysis. Below is a set of suggested research projects aligned with the goals of the class; the lecturer will provide a sample of related papers for each. Only one student may work on each project. This list is only a sample, however; students are encouraged to define a project aligned with their own interests if none of these projects fit.

Suggested research projects:

a. Ancestral Recombination Graphs and Ordered Marginal Trees.
b. Minimum informative subset selection algorithm and transferability.
c. Random forests, decision trees, and GWAS.
d. Gene Sets and GWAS: From SNPs to Genes: Post Genome-wide Association Studies.
e. Epistatic Interactions in GWAS.
f. Identity by Descent in GWAS and/or computing identity-by-descent tracts in genotypes.
g. Tag SNPs-unifying LD-select/Tagger and Informativeness-Dominating Set optimization.
h. Rigorous algorithms for Global Maximum Likelihood Phasing, EM and generalized likelihood.
i. Long-range haplotype phasing – “the power of amnesia”, variable-length Markov chains, and the Browning and Browning Beagle method.
j. Long-range haplotype phasing – the deCODE algorithm; haplotype sharing in closely related populations.
k. Hidden Markov models for ancestry inference: local versus global ancestry. Admixture mapping and association mapping studies in admixed populations.
l. Parent-of-origin genetic variation.
m. Generalized family and pedigree based statistical tests for association.


Homework (35% weight)
Projects (50% weight)
Presentations (15% weight)
