Prepared by: Scott Hazelhurst
Module name: High Performance Computing (HPC)
Contact hours (to be used as a guide): Total (40 hours), Theory (40%), Practical (60%)
COURSE BACKGROUND AND PURPOSE
Modern bioinformatics requires high performance computing, and in many cases highly parallel computing. The usability of many programs has reached the point that in many cases, the technical details of how the HPC is done can be hidden from the user, and good quality libraries can assist developers and hide the complexity of parallelisation. However, the inherent complexity of parallelisation means that performance is often architecture dependent and insight into the theory of HPC can help users plan and run workloads. Moreover, knowledge and application of this theory is essential for bioinformaticists developing application code which exploits parallelism
SPECIFIC OUTCOMES ADDRESSED
On successful completion of this course, the student is able to:
1. Describe, compare and contrast different parallel architectures and programming models
2. Explain the impact of latency and synchronisation on parallel programming performance in general; and critically evaluate their effect for specific programs
3. Critically assess appropriate parallelisation methods and architectures for given problems
4. Predict and measuring speedup and of a given program
5. Apply Amdahl’s and Gustafson’s laws
6. Implement and analyse simple parallel programs in different programming models
BACKGROUND KNOWLEDGE REQUIRED
H3ABioNet bioinformatics modules as pre-requisites: Programming I
Additional: Programming II. Students are expected to be highly competent in use of the Linux command line, programming in Python, and preferably have some competence in C.
BOOKS & OTHER SOURCES USED
1. Norm Matloff, Programming on Parallel Machines.
2. David Silliicorn, Domenico Talia. Models and Languages for Parallel Computation. ACM Computing Surveys, 30(2), June 1998. A classic but still useful.
A) Theory Lectures
1. Introduction: Why parallelisation is difficult, overview of topic. Latency and shared state. The Dining Philosophers’ Problem.
2. Computer architecture design principles
Review of the von Neumann architecture; memory hierarchy – registers, cache, virtual memory, resident set size. Sequential I/O.
3. Parallel architectures ×2.
Shared Memory versus Distributed Memory. Multiprocessor, multi-core. Cluster com- puting. Grid computing. GPUs. FPGAs. Flynn’s taxonomy.
4. Shared and Parallel I/O
5. Parallel Programming Models Why parallelism can’t save bad programming.
6. Shared memory, Multi-core programming. ×2
Processes versus threads. Communication, synchronisation, and mutual exclusion. Ex- ample technologies (e.g. pthreads, OpenMP)
7. Distributed memory programming
Communication, synchronisation, impact of latency. Example technologies (e.g. MPI)
8. Embarrassingly Parallel Programming
And why it’s so not so easy and embarrassing when you get it wrong. Example technologies (e.g., Torque/Maui)
9. The map-reduce model
Introduction to Hadoop ideas. Motivations and critique
10. Introduction to virtualisation and cloud computing
B) Practical component (some hours can be assigned as homework)
1. Measurement exercise I. I/O performance (3 hours)
2. Measurement exercise II
Choice of parallel models. (6 hours)
3. Torque/Maui (3 hours)
4. OpenMP (3 hours)
5. Pthreads (4 hours)
6. MPI (4 hours)
7. Monitoring performance and debugging (3 hours)
8. Parallel computing libraries (2 hours)
9. Cloud Computing exercises (3 hours)
ASSESSMENT ACTIVITIES AND THEIR WEIGHTS
Short research paper (15%)
Three of the labs (practical assignments) will be handed in for marks (45%)
Written exam (40%)
1. Example measurement exercise: The ADMIXTURE program can be used to estimate ancestry proportions in population data. It is typically run many times on the same data: First, we may need to test for different estimates of the number of ancestral groups (K = 2, . . . , 12). Second, it is stochastic so it good practice to run it many times (say 50) for each value of K and to average over runs. ADMIXTURE supports multicore execution with its -j flag. Assume we have cluster of 10 computers, each with 12 cores. How should we parallelise – at one extreme we could run ADMIXTURE 11 times50 times; at the other extreme we could run it 50 times each using 11 cores. What is the best strategy? The issues you should consider are:
• The efficiency of ADMIXTURE’s parallelisation on your multicore system. • The data set sizes.
• The memory size of your computers.
Write a 4-5 page report, proposing an appropriate parallelisation structure for your cluster. The report should contain appropriate experimentation to justify your answer. Assume PLINK data set sizes of 1G, 10G, and 50G (combined bed and bim file size)
2.Sample MPI project. Imputation is a very expensive computaional exercise. In principle it is highly parallelisable because you can parallelise across chromosomes and even within chromosomes. However, if you do this crudely on a cluster that has a simple I/O systems (e.g. NFS over gigabit ethernet), performance is likely to be very poor because the same reference data set may be read many many times. Write a simple MPI program in which the (a) master process streams the data to all worker at once; (b) the worker nodes bid for work to do; (c) and copy data back to the the master node.
3.Sample paper exercise. Hadoop is now a popular method for parallelisation. Identify three different Hadoop technologies. For each, find a bioinformatics paper which used that technology. Critically evaluate the success – don’t just take the authors view. Write a 6 page report which describes the technologies and intended applications, and critically evaluates the paper that you found.