**Prepared by:** Amel Ghouila

**Module Name:** Machine Learning

**Contact hours (to be used as a guide):** Total (40 hours), Theory (75%), Practical (25%)

**SPECIFIC OUTCOMES ADDRESSED**

On completion of this module, students should:

1. Have an understanding of the basic concepts of machine learning and the different types of basic data mining and bioinformatics problems that can be solved with different types of classical machine learning algorithms.

2. Have an understanding of the most widely used algorithms for each kind of problem and how these algorithms work.

3. Know how to select the appropriate method and algorithms to solve a given problem.

4. Know how to apply and adapt them effectively to study a given problem.

5. Know how to select parameters and understand that an algorithm's behaviour, and the results it provides, may change significantly as a function of its parameters.

6. Be able to define and describe basic data mining problems and classical machine learning algorithms and apply and adapt them to solve classical bioinformatics problems.

**BACKGROUND KNOWLEDGE REQUIRED**

**H3ABioNet bioinformatics modules as pre-requisites:** Programming I, Biostatistics I

**Additional:** Algorithms and programming, Basics in statistics (probability theory, stochastic processes, etc.), Common bioinformatics problems

**BOOKS & OTHER SOURCES USED**

1. Bioinformatics: The Machine Learning Approach

2. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

3. Pattern Recognition and Machine Learning

**COURSE CONTENT**

**A) Theory lectures**

1. Introduction.

a. Data mining and machine learning algorithms basics.

b. Opportunities and applications of machine learning techniques in different areas of bioinformatics.

2. Biological sequence mining with Hidden Markov Models.

a. Pattern matching and pattern recognition methods.

b. Finding patterns in biological sequences (motif prediction, exon and intron boundary prediction).

c. HMM: principle and most widely used algorithms (forward, backward, Viterbi).

3. Clustering methods: hierarchical and partitional clustering.

a. Presentation of the most widely used algorithms of each type.

b. Determining the number of clusters in a data set.

c. Measuring clusters’ quality.

4. Supervised classification methods: learning from examples.

a. Step 1: Inferring rules from training sets.

b. Step 2: Classification process.

c. Presentation of the most widely used algorithms in both step 1 and step 2.

d. Evaluation of classifiers.

5. Data pre-processing.

a. Filtering, discretization and normalization methods.

b. Handling outliers, noisy data and missing values.

6. Estimating validity of the results.

a. Bootstrap methods, cross-validation and ROC plots.

b. Identification of the most suitable algorithm to solve a problem.

7. Methods for building biological networks.

a. Boolean networks.

b. Bayesian networks.
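
As a small illustration of topic 7a, the sketch below simulates a Boolean network with synchronous updates and follows one trajectory until it falls into an attractor. The three genes and their update rules are hypothetical, invented purely for illustration, and the helper names `step` and `trajectory` are assumptions rather than part of any course material.

```python
# Hypothetical three-gene network; the rules are invented for illustration.
RULES = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B jointly activate C
}


def step(state):
    """Synchronous update: every gene reads the same previous state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def trajectory(state, max_steps=20):
    """Iterate until a state repeats; return the visited states and the cycle."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state)
    # If a repeat was found, the attractor starts at its first occurrence.
    start = seen.index(state) if state in seen else len(seen)
    return seen, seen[start:]


if __name__ == "__main__":
    path, cycle = trajectory({"A": True, "B": False, "C": False})
    for s in path:
        print("".join("1" if s[g] else "0" for g in "ABC"))
    print("attractor length:", len(cycle))
```

With only three genes the full state space (eight states) can be enumerated by hand, which makes this kind of toy network convenient for checking a simulation against a pencil-and-paper state diagram.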

**B) Practical component**

1. Practical sessions using available R data mining packages to analyse various types of biological data.

2. Use of Weka tools.

3. Use of HMMER (an HMM-based tool) to detect features in biological sequences.

**ASSESSMENT ACTIVITIES AND THEIR WEIGHTS**

End of semester examination (50% weight)

Small projects with 2 or 3 students working on each (50% weight)

Project examples:

1. Implementation of selected algorithms: backpropagation, Viterbi, etc. (mainly for students with a background in computer science)

2. Microarray data analysis (benchmarked data set): Pre-processing, clustering, etc.

3. Motif and domain prediction from protein sequences using HMMs.
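
To give a sense of the scale of project example 1, the following is a minimal sketch of Viterbi decoding in log space. It is one possible starting point, not a reference implementation. The two-state model (`GC` and `AT` regions) and all of its probabilities are hypothetical values chosen for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its log-probability."""

    def log(p):
        # Log space avoids numerical underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # scores[t][s]: best log-probability of any path ending in state s at time t.
    scores = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    backptr = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        backptr.append({})
        for s in states:
            # Pick the best predecessor state r for arriving in s at time t.
            prev, best = max(
                ((r, scores[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            scores[t][s] = best + log(emit_p[s][observations[t]])
            backptr[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, scores[-1][last]


# Hypothetical toy model: GC-rich versus AT-rich regions of a DNA sequence.
STATES = ("GC", "AT")
START_P = {"GC": 0.5, "AT": 0.5}
TRANS_P = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
EMIT_P = {
    "GC": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

if __name__ == "__main__":
    path, logp = viterbi("GGCCAATT", STATES, START_P, TRANS_P, EMIT_P)
    print(path, logp)
```

A project could extend this sketch with the forward and backward algorithms and compare the decoded paths with the output of an existing tool on the same sequences.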