**Prepared by:** Jean-Baka Domelevo Entfellner

**Module Name:** Biostatistics I

**Contact hours (to be used as a guide):** Total (40 hours), Theory (45%), Practical (55%)

**LEARNING OBJECTIVES**

1. Understand the fundamental difference between population and sample

2. Know the basic descriptive statistics: sample mean, sample variance, sample median, quantiles, etc.

3. Understand the rationale for inferential statistics: observing a sample in order to draw conclusions about the underlying population.

4. Perform statistical tests

**SPECIFIC OUTCOMES ADDRESSED**

On completion of this module, students should be able to:

1. Open a dataset with R and extract some basic descriptive statistics

2. Identify the correct statistical test for a given problem.

3. Carry out that statistical hypothesis test.

4. Run a PCA and interpret its results.

**BACKGROUND KNOWLEDGE REQUIRED**

**H3ABioNet bioinformatics modules as pre-requisites:** none

**Additional:** Basic general-purpose scientific knowledge, basic arithmetic skills, and some familiarity with basic linear algebra.

**BOOKS & OTHER SOURCES USED**

1. Fundamentals of Biostatistics, 7th edition, by Bernard Rosner (Cengage Learning, 2011)

2. Biostatistics with R: An Introduction to Statistics Through Biological Data, by Babak Shahbaba (Springer, 2012)

**COURSE CONTENT**

**A) Theory lectures**

**1. Probability theory**

a. Atomic and complex events, probabilities as a measure on sets. Probabilistic experiments, concept of expectation.

b. Conditional probabilities, Bayes’ law.

c. Enumerative combinatorics: counting permutations, combinations and partitions. Binomial coefficients.

d. Some common discrete probability distributions: Bernoulli, binomial, Poisson. Behaviour of a binomial when the number of trials tends to infinity. Concepts: probability mass, expectation of a discrete distribution.

e. First continuous probability distributions: uniform, exponential.

f. Central limit theorem and the normal distributions.

g. Other continuous distributions: Student’s t and chi-square distributions.
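Two of the limit results listed above can be checked numerically in R. The sketch below is illustrative only: it reads "behaviour of a binomial when the number of trials tends to infinity" as the Poisson limit (with n × p held fixed), and the values of `lambda`, the sample size 50 and the exponential rate 2 are arbitrary assumptions, not part of the module.

```r
# Poisson limit of the binomial: n grows while n * p stays equal to lambda.
lambda <- 3                      # illustrative expected count
k <- 0:10                        # outcomes at which the masses are compared
n_values <- c(10, 100, 10000)
gaps <- sapply(n_values, function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
})
print(data.frame(n = n_values, largest_gap = signif(gaps, 3)))

# Central limit theorem: means of exponential samples are approximately normal.
set.seed(1)                      # arbitrary seed, for reproducibility only
sample_means <- replicate(5000, mean(rexp(50, rate = 2)))
mean(sample_means)               # close to the expectation 1 / rate = 0.5
sd(sample_means)                 # close to (1 / rate) / sqrt(50)
```

The largest gap between the two probability mass functions shrinks as n grows, and a call such as `hist(sample_means)` shows the bell shape predicted by the theorem.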

**2. Statistical hypothesis testing**

a. Framework of a hypothesis test: sample to be used, tested hypothesis/hypotheses, test statistic, its law under the null hypothesis, acceptance and rejection regions.

b. The difference between parametric and non-parametric tests: assumptions made.

c. A few basic tests: t-tests for the comparison of two samples (parametric), and their non-parametric equivalents (Wilcoxon signed-rank and rank-sum tests).

d. Testing normality on a sample: Shapiro-Wilk and Kolmogorov-Smirnov.

e. Chi-square tests on contingency tables.
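The tests listed above map onto functions in base R's `stats` package. The following is a minimal sketch on simulated data; the group names, sample sizes, means and the 2 × 2 counts are all hypothetical and chosen only to make the example run.

```r
set.seed(42)                                   # arbitrary seed
control <- rnorm(30, mean = 10, sd = 2)        # hypothetical control group
treated <- rnorm(30, mean = 14, sd = 2)        # hypothetical treated group

# Normality check before choosing a parametric test.
shapiro.test(control)

# Parametric comparison of two independent samples (Welch t-test by default).
t_res <- t.test(treated, control)

# Non-parametric counterpart: Wilcoxon rank-sum test.
w_res <- wilcox.test(treated, control)

# Chi-square test on a hypothetical 2 x 2 contingency table.
tab <- matrix(c(30, 10,
                15, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "non-case")))
chi_res <- chisq.test(tab)

c(t = t_res$p.value, wilcoxon = w_res$p.value, chi_square = chi_res$p.value)
```

For paired measurements, `wilcox.test(x, y, paired = TRUE)` gives the signed-rank test and `t.test(x, y, paired = TRUE)` the paired t-test.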

**3. Analysis of variance and regression models**

a. One-way ANOVA, within-group sum of squares, between-groups sum of squares, F statistic.

b. Univariate linear regression.

c. Multivariate linear regression.

d. Logistic regression for a binary outcome.
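Each model in this part has a direct counterpart in base R. The sketch below uses the built-in `iris` dataset purely as a convenient example (it is not one of the module's reference datasets), and it assumes that "multivariate" regression here means a single response with several predictors.

```r
# One-way ANOVA: does mean sepal length differ between species?
fit_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(fit_aov)          # within/between sums of squares and the F statistic

# Linear regression with one predictor.
fit_lm <- lm(Petal.Length ~ Petal.Width, data = iris)

# Linear regression with several predictors.
fit_mlm <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = iris)

# Logistic regression for a binary outcome: keep two species only.
# With a factor response, the first level (versicolor) is the reference.
two_species <- droplevels(subset(iris, Species != "setosa"))
fit_glm <- glm(Species ~ Sepal.Length, data = two_species, family = binomial)

coef(fit_lm)
coef(fit_mlm)
coef(fit_glm)             # log-odds of virginica per unit of sepal length
```

Calling `summary()` on any of the fitted objects prints standard errors and p-values for the coefficients.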

**4. Multidimensional dataset analysis: Principal Component Analysis**
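A principal component analysis can be run with `prcomp()` from base R. The sketch below again borrows the built-in `iris` measurements as an assumed example dataset; scaling the variables to unit variance is a choice made here, not a requirement stated by the module.

```r
# PCA on the four numeric columns, centred and scaled to unit variance.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: how each original variable contributes to each component.
pca$rotation

# Coordinates of the first observations on the first two components.
head(pca$x[, 1:2])
```

Interpretation usually starts from `var_explained` (how many components are worth keeping) and from the loadings; `biplot(pca)` draws observations and variables on the first two components.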

**B) Practical component**

The practical component follows the same structure as the theory lectures: each practical encourages the students to manipulate the concepts seen in the lectures right after they are introduced. The example datasets from the first book listed above (Rosner) can be used. Some time should be spent on data wrangling, exposing the students to potential issues with opening files for read/write operations, data formats, etc.
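A first practical on file handling and descriptive statistics could look like the sketch below. It writes the built-in `iris` data to a temporary CSV file and reads it back, so that it runs anywhere; in class, `csv_path` would instead point to one of the course datasets (the path and the dataset are assumptions of this example).

```r
# Write a small dataset to disk, then read it back as students would.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)
measurements <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect the structure: column types are a common source of surprises.
str(measurements)
colSums(is.na(measurements))     # missing values per column

# Basic descriptive statistics on one numeric variable.
x <- measurements$Sepal.Length
mean(x)
var(x)
median(x)
quantile(x, probs = c(0.25, 0.5, 0.75))

# One-line overview of every column.
summary(measurements)
```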

We suggest using RStudio throughout the course as an integrated development environment for working with R. Because R is a standard statistical tool across many research areas, it is essential that the students become proficient in it during this course.

Alternatively, if computing resources are extremely scarce, use an interactive R interpreter to demonstrate the concepts, plus a simple text editor later on, once the students start writing functions.

**ASSESSMENT ACTIVITIES AND THEIR WEIGHTS**

We suggest two written exams during the module (total weight = 50%) and a final practical exam in R (weight = 50%). Practical assignments administered throughout the module can also count toward the module grade, but our advice is not to grade every practical, so as to avoid putting counter-productive stress on the students. Practicals are privileged moments when students have the opportunity to understand the concepts as they put them into play.