# Exam

Participants who need credits must answer the following questions and send their results, as a commented R script, to rachel.marcone@sib.swiss no later than February 2023.

Data: A data set collected by Heinz et al. (Heinz G, Peterson LJ, Johnson RW, Kerk CJ. Journal of Statistics Education, Volume 11, Number 2 (2003), jse.amstat.org/v11n2/datasets.heinz.html; by Grete Heinz, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk, all rights reserved) is available in the file IS_23_exam.csv.

Goals: Get to know the overall structure of the data. Summarize variables numerically and graphically. Model relationships between variables.

## Observations

- Have a look at the file in a text editor to get familiar with it.
- Open a new script file in RStudio, comment it and save it.
- Read the file and assign it to the object “IS_23_exam”. Examine “IS_23_exam”. a) How many observations and variables does the dataset have? b) What are the names and types of the variables? c) Get the summary statistics of “IS_23_exam”.
- Make a scatter plot of all pairs of variables in the dataset.
- Calculate the BMI of each person and add it as an extra variable “bmi” to your dataframe (Google the BMI formula).
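The inspection and BMI steps above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `toy`, its column names `height` and `weight`, and the units (cm and kg) are hypothetical and may not match the columns in IS_23_exam.csv. The helper `bmi()` is likewise made up for this sketch.

```r
# Simulated stand-in for the exam data; column names and units are assumptions.
set.seed(1)
toy <- data.frame(
  height = rnorm(20, mean = 170, sd = 10),  # assumed to be in cm
  weight = rnorm(20, mean = 70, sd = 12)    # assumed to be in kg
)

# With the real file, read.csv("IS_23_exam.csv") would take the place of the simulation.
dim(toy)      # number of observations (rows) and variables (columns)
str(toy)      # names and types of the variables
summary(toy)  # summary statistics per variable

pairs(toy)    # scatter plots of all pairs of variables

# Hypothetical helper: BMI = weight in kg divided by squared height in metres.
bmi <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Add the result as a new column of the data frame.
toy$bmi <- bmi(toy$weight, toy$height)
head(toy)
```

If the real data store height in metres or weight in another unit, the conversion inside `bmi()` would need to change accordingly.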

## Modelling

- Is there a significant difference in mean BMI between males and females?
- How strong is the linear (Pearson) correlation between chest girth and height? Is it significant?
- If you model a linear relationship, how much does the chest girth increase per added cm of height? Is the change significant? What if you do this for males and females separately?
- Come up with a question for hypothesis testing of your own that includes one or more variable(s) of your choosing from the data set.
- Make plots as seen in the course to try to give visualization-based answers to this question.
- Test your hypothesis using the tests and modeling techniques from the course, based on the type of variables you have. Include tests of the assumptions where appropriate.
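As a rough illustration of the kind of tests and models asked for above, here is a sketch on simulated data. The data frame `toy` and the column names `sex`, `height` and `chest_girth` are assumptions made for this example, and the simulated effect sizes are arbitrary. The choice of a Welch t-test and a Shapiro-Wilk check is one reasonable option, not necessarily the one used in the course.

```r
# Simulated stand-in data; names and effect sizes are hypothetical.
set.seed(42)
n <- 100
toy <- data.frame(
  sex    = factor(rep(c("female", "male"), each = n / 2)),
  height = c(rnorm(n / 2, mean = 165, sd = 6), rnorm(n / 2, mean = 178, sd = 7))
)
toy$chest_girth <- 20 + 0.45 * toy$height + rnorm(n, sd = 5)

# Difference in means between two groups (Welch two-sample t-test by default).
tt <- t.test(chest_girth ~ sex, data = toy)
tt$p.value

# Pearson correlation with a significance test.
ct <- cor.test(toy$chest_girth, toy$height, method = "pearson")
ct$estimate
ct$p.value

# Simple linear regression: the 'height' coefficient is the change in
# chest girth per additional unit of height.
fit <- lm(chest_girth ~ height, data = toy)
summary(fit)$coefficients

# The same model fitted separately for each sex.
fits_by_sex <- lapply(split(toy, toy$sex), function(d) {
  lm(chest_girth ~ height, data = d)
})
sapply(fits_by_sex, function(m) coef(m)["height"])

# One possible assumption check: normality of the residuals.
shapiro.test(residuals(fit))
```

For a simple regression with one predictor, the squared Pearson correlation equals the model's R-squared, which gives a quick consistency check between `cor.test()` and `lm()`.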

## PCA and clustering

- Perform a PCA using all the variables in the dataset except age and gender.
- Make a PCA plot, using different colors for the data points of males and females.
- How much variance is explained by each principal component?
- Which variables have the strongest influence on each of the first two principal components?
- Create a new dataframe called PCA_coord with the coordinates of the data points on PC1 and PC2.
- Compute the Euclidean distance between the data points.
- Generate a heatmap of the distance matrix.
- Identify clusters of the data points using a method of your choice that was shown during the course.
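
The PCA and clustering workflow listed above can be sketched as follows on simulated data. Everything here is illustrative: the variables `m1` to `m4`, the factor `sex`, the decision to scale the variables, the use of `PCA_coord` as input for the distances, and hierarchical clustering with two clusters are all assumptions, not choices prescribed by the exam.

```r
# Simulated measurements; variable names are hypothetical.
set.seed(7)
n <- 40
sex <- factor(rep(c("female", "male"), each = n / 2))
shift <- ifelse(sex == "male", 1.5, 0)
measurements <- data.frame(
  m1 = rnorm(n) + shift,
  m2 = rnorm(n) + shift,
  m3 = rnorm(n),
  m4 = rnorm(n)
)

# PCA on centred and scaled variables.
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained

# PCA plot coloured by sex.
plot(pca$x[, 1], pca$x[, 2],
     col = c("red", "blue")[as.integer(sex)], pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(sex), col = c("red", "blue"), pch = 19)

# Loadings: large absolute values mark influential variables.
pca$rotation[, 1:2]

# Coordinates of the data points on PC1 and PC2.
PCA_coord <- as.data.frame(pca$x[, 1:2])

# Euclidean distance matrix and its heatmap.
d <- dist(PCA_coord, method = "euclidean")
heatmap(as.matrix(d), symm = TRUE)

# Hierarchical clustering, cut into two clusters (k = 2 is an arbitrary choice).
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 2)
table(clusters, sex)
```

The same distance matrix could instead be computed on the original (scaled) variables, and `kmeans()` would be an alternative to `hclust()` if that was the method covered in the course.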