Prepared by: Victor Jongeneel
Revised by: Amel Ghouila
Module name: High-Throughput Sequencing
Contact hours (to be used as a guide): Total (40 hours), Theory (50%), Practical (50%)
- Students will learn about the widely used high-throughput sequencing technologies and the differences between them.
- Students will become familiar with the major applications of high-throughput sequencing and their advantages.
- Students will learn about the different file formats used to handle NGS data.
- Students will be introduced to the standard analysis workflows and a wide range of tools for analysing NGS data in the various applications (de novo assembly, RNA-seq, ChIP-seq, 16S rDNA sequence analysis for metagenomics, DNA-seq and variant calling).
- Students will learn about web-based cyber-environments and tools for managing, analysing and visualising NGS data (e.g., Galaxy, IGV).
SPECIFIC OUTCOMES ADDRESSED
On completion of this module, students should be able to:
- Explain the main differences between the sequencing platforms and their utility in solving different kinds of biological problems.
- Demonstrate a good understanding of the different file formats used in NGS data analysis.
- Demonstrate a good ability to distinguish between the various applications of NGS data.
- Explain the basic steps of the analysis workflows used in the various NGS applications.
- Demonstrate the ability to run the basic steps of NGS analysis pipelines and to understand the importance of the different parameters when analysing a given dataset.
- Apply the skills learned to run the major steps towards the analysis of their own datasets either via the command line or Galaxy.
BACKGROUND KNOWLEDGE REQUIRED
- Good knowledge of molecular biology and genomics
- Good command of the Unix environment and basic programming skills (Perl/Python)
- Familiarity with cluster and HPC environments is a plus
- Biochemistry, chemistry and physics skills are a plus to better understand the operating principles of high-throughput sequencing machines
BOOKS & OTHER SOURCES USED
- EBI Next-Generation Sequencing online course
- Web resources: SEQanswers, BioStars
- Original publications describing sequencing and analysis technologies
- Review papers on the different high-throughput sequencing applications
- Introduction to Computational Biology and Bioinformatics course, Harvard School of Public Health
A) Theory Lectures
1. Introduction to high-throughput sequencing
1.1. General overview of past, current and near-future sequencing technologies and the main differences between them. Details of current technologies and instruments: Illumina, Ion Torrent, PacBio, Nanopore (updated yearly to reflect technology trends).
1.2. Applications of high-throughput sequencing: introduction to the most frequent applications: de novo assembly, RNA-seq, ChIP-seq, 16S rDNA sequence analysis for metagenomics, DNA-seq and variant calling.
2. Introduction to file formats and Quality checks
3. Mapping and assembly
3.1. Mapping to a reference genome: different algorithms, most commonly used tools and the differences between them.
3.2. Mapping problems caused by repeats, duplications, spliced alignment, realignment, recalibration.
3.3. De novo assembly: principles of genome assembly, from reads to contigs to scaffolds. Basic graph theory, Eulerian and Hamiltonian paths. Overlap-layout-consensus approach vs de Bruijn graph approach. Error correction approaches. Dealing with repeats, incorporation of paired-end and mate-pair libraries. Mixed assemblies with long reads (PacBio, Moleculo, fosmids). Assessing the quality of assemblies: N50, CEGMA, scaffolding errors.
4. Processing of aligned reads for RNA-seq, normalization approaches, determination of gene expression levels. Processing of aligned reads for ChIP-seq, cross-correlation analysis, peak calling.
5. Variant analysis
5.1. Re-sequencing and variant analysis (mapping strategies, calibration, variant calling, etc).
5.2. Re-sequencing and variant analysis
6. Metagenomics: Evolutionary classifications based on highly conserved sequences: 16S/18S rRNA, fungal ITS. Generation and processing of amplicons, multiple alignment, derivation of OTUs, measures of population diversity. Introduction to full metagenomic assemblies, techniques for read binning, contig assemblies, assessment of metabolic capabilities.
7. Design and deployment of high-performance / high-throughput computational infrastructures for the analysis of NGS data: Cluster architectures, strategy for data parallelism, file systems, workflow management, job scheduling.
8. Cyberenvironments for the processing and analysis of NGS data. Web-based environments: Galaxy, KBase, commercial systems. Genome browsers. Workflow managers, canned workflows. Cloud computing to support NGS analysis. Integrating infrastructures for sequencing and analysis.
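As an illustration of the assembly quality metric named in lecture 3.3, the following is a minimal Python sketch of how N50 can be computed from a list of contig lengths. The function name `n50` and the example lengths are assumptions made for illustration; they are not taken from the course materials or from any specific assembler.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop as soon as the longest contigs cover half of the total.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (in bp), for illustration only.
    example = [100, 200, 300, 400, 500]
    print(n50(example))
```

N50 on its own says nothing about correctness: a misassembled genome can have a high N50, which is why the lecture pairs it with CEGMA and an assessment of scaffolding errors.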
B) Practical component
1. Managing data and simple analyses in the Galaxy cyber-environment. Importing datasets, reformatting, filtering, QC. Finding and using reference genomes. Building a workflow for RNA-seq, re-using the workflow. (follows lecture 3.2)
2. Genome assembly. Assemble the genomes of two strains of the same bacterial genus from scratch using Velvet. Tune the parameters of the assembly program. Compare the two assemblies. (follows lecture 3.3)
3. Using the same datasets as in practical 1, build a set of scripts for RNA-seq analysis to reproduce features of the Galaxy workflow. Compare performance and scalability. (follows lecture 5.2)
4. Human variant calling. Using one of the low-pass 1000 Genomes datasets as input, call variants on one genome. Incorporating pre-computed alignments for a larger set of genomes, build a list of variants found in common between the genomes, and compare to dbSNP. (follows lecture 5.2)
5. Metagenomics. Using a small set of 16S rDNA sequences, determine the list of OTUs and perform basic diversity analysis using QIIME or Mothur. (follows lecture 6)
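The basic diversity analysis in practical 5 can be illustrated with a short Python sketch that computes two common alpha-diversity measures, observed richness and the Shannon index, from a vector of OTU counts. This is an assumption-laden illustration rather than what QIIME or Mothur do internally: the function names and the example counts are hypothetical, and the natural logarithm is used here although some tools report the index in base 2.

```python
import math


def observed_richness(otu_counts):
    """Number of OTUs observed at least once in the sample."""
    return sum(1 for count in otu_counts if count > 0)


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over observed OTUs."""
    total = sum(otu_counts)
    if total <= 0:
        raise ValueError("sample contains no reads")
    h = 0.0
    for count in otu_counts:
        if count > 0:
            p = count / total  # relative abundance of this OTU
            h -= p * math.log(p)
    return h


if __name__ == "__main__":
    # Hypothetical read counts per OTU for one sample.
    sample = [10, 10, 10, 10, 0]
    print(observed_richness(sample))
    print(shannon_index(sample))
```

For a sample in which all observed OTUs are equally abundant, the Shannon index equals the natural logarithm of the richness, which makes a convenient sanity check when comparing against tool output.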
ASSESSMENT ACTIVITIES AND THEIR WEIGHTS
Class exercises (no weight, not for marks)
Seminar (5% weight)
Comprehension paper (20% weight)
Test (theory test on all lectures, 75% weight)