Exercise 1
Suppose we have a list of amino acids, their masses and a protein sequence:
aminoacids <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V" )
masses <- c(71.0788, 156.1875, 114.1038, 115.0886, 103.1388, 129.1155, 128.1307, 57.0519, 137.1411, 113.1594, 113.1594, 128.1741, 131.1926, 147.1766, 97.1167, 87.0782, 101.1051, 186.2132, 163.1760, 99.1326)
protein <- "MEFARILOP"
How can we use R to efficiently calculate the mass of the protein ?
We first create a named vector for the masses, and split the protein into amino acids:
names(masses) <- aminoacids
aa <- strsplit(protein, "")[[1]]
(strsplit() returns a list with a single element, so we need to
extract the vector of amino acids using the bracket notation).
We can then use the lookup table to calculate the total, making sure to remove the missing values (non-existent amino acids and separately list the ones that were missing:
sum(masses[ aa ], na.rm=TRUE)
## [1] 958.1865
# List of missing amino acids
aa[ is.na( masses[aa] ) ]
## [1] "O"
Exercise 2
Suppose that you are given information about the amount of fertilizer used on several parts of a field, in the form of a factor:
fert <- factor( c(10, 20, 50, 30, 10, 20, 10, 45) )
Starting from the factor, how would you calculate the mean amount of fertilizer used ?
We can not convert to numeric values directly, because we would only get the indices (the link to the levels); we can first convert to character and then to numeric:
mean(as.numeric(as.character(fert)))
## [1] 24.375
mean(as.numeric(as.vector(fert))) # equivalent
## [1] 24.375
Exercise 3A
The table() function allows you to count the number of occurrences of each value in a vector; for example :
set.seed(5)
data <- sample(1:10, 10, replace=TRUE)
data
## [1] 2 9 9 9 5 7 7 3 3 6
table(data)
## data
## 2 3 5 6 7 9
## 1 2 1 1 2 3
However, the table has ‘gaps’ in it. How could we obtain the same result, but including all numbers from 1 to 10, with a count of ‘0’ when the value does not appear in the vector ?
We can make the vector into a factor, and explicitely specify the possible levels (1 to 10); table will then do the right thing:
data <- factor( data, levels=1:10 )
table(data)
## data
## 1 2 3 4 5 6 7 8 9 10
## 0 1 2 0 1 1 2 0 3 0
Side question: the tabulate() function solves part of our question here, but it is not entirely satisfactory. Why ?
tabulate(data)
## [1] 0 1 2 0 1 1 2 0 3
Tabulate will fill in the gaps, but cannot know that there are larger values (10) missing. This can be specified in its options, however.
Exercise 3B
When plotting a boxplot, the order of the groups is generally alphabetical; for example, the boxplot created by the following commands has groups A, AB, B, BA, C.
a <- runif(100)
groups <- sample( c("A", "AB", "ABC", "B", "BC", "C", "CA"), 100, replace=TRUE)
boxplot(a ~ groups)

How you can easily force a different order (e.g. A, B, C, AB, BC, CA, ABC) ?
We can make the vector into a factor, and specify the levels in the order we want them:
groups <- factor(groups, levels=c("A", "B", "C", "AB", "BC", "CA", "ABC"))
boxplot(a ~ groups)

Exercice 3C
When conducting the linear regression described below, all groups are compared to the “A” group (as the intercept, chosen because it is the first group in alphabetical order). How can we force the lm function to choose group “ref” as the reference ?
set.seed(1)
data <- runif(100)
groups <- rep( c("ref", "a", "b", "c"), each=25)
summary(lm(data~groups))
##
## Call:
## lm(formula = data ~ groups)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52001 -0.17433 0.00602 0.23320 0.46012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.533400 0.053933 9.890 2.56e-16 ***
## groupsb -0.071713 0.076273 -0.940 0.349
## groupsc 0.011115 0.076273 0.146 0.884
## groupsref -0.001615 0.076273 -0.021 0.983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 96 degrees of freedom
## Multiple R-squared: 0.01517, Adjusted R-squared: -0.01561
## F-statistic: 0.4929 on 3 and 96 DF, p-value: 0.688
groups <- factor(groups)
groups <- relevel(groups, "ref")
groups
## [1] ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref
## [19] ref ref ref ref ref ref ref a a a a a a a a a a a
## [37] a a a a a a a a a a a a a a b b b b
## [55] b b b b b b b b b b b b b b b b b b
## [73] b b b c c c c c c c c c c c c c c c
## [91] c c c c c c c c c c
## Levels: ref a b c
summary(lm(data~groups))
##
## Call:
## lm(formula = data ~ groups)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52001 -0.17433 0.00602 0.23320 0.46012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.531786 0.053933 9.860 2.97e-16 ***
## groupsa 0.001615 0.076273 0.021 0.983
## groupsb -0.070098 0.076273 -0.919 0.360
## groupsc 0.012730 0.076273 0.167 0.868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 96 degrees of freedom
## Multiple R-squared: 0.01517, Adjusted R-squared: -0.01561
## F-statistic: 0.4929 on 3 and 96 DF, p-value: 0.688
groups <- factor(groups, levels=c("ref", "a", "b", "c"), ordered=TRUE)
groups
## [1] ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref ref
## [19] ref ref ref ref ref ref ref a a a a a a a a a a a
## [37] a a a a a a a a a a a a a a b b b b
## [55] b b b b b b b b b b b b b b b b b b
## [73] b b b c c c c c c c c c c c c c c c
## [91] c c c c c c c c c c
## Levels: ref < a < b < c
summary(lm(data~groups))
##
## Call:
## lm(formula = data ~ groups)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52001 -0.17433 0.00602 0.23320 0.46012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.517847 0.026966 19.203 <2e-16 ***
## groups.L -0.007496 0.053933 -0.139 0.890
## groups.Q 0.040607 0.053933 0.753 0.453
## groups.C 0.050953 0.053933 0.945 0.347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 96 degrees of freedom
## Multiple R-squared: 0.01517, Adjusted R-squared: -0.01561
## F-statistic: 0.4929 on 3 and 96 DF, p-value: 0.688
Exercise 4A
The following R code generates random data and fits a linear model:
set.seed(1)
x <- runif(10)
y <- 2*x + rnorm(10)
model <- lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3327 -0.6267 0.2447 0.6006 1.2851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1361 0.7648 -0.178 0.863
## x 2.4039 1.2186 1.973 0.084 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.154 on 8 degrees of freedom
## Multiple R-squared: 0.3273, Adjusted R-squared: 0.2432
## F-statistic: 3.891 on 1 and 8 DF, p-value: 0.08399
How would you copy into a variable “slope” the slope of the regression line (the second regression coefficient) ? And how would you extract in a variable “pvalue” the p-value associated with it ?
We can look at the structure of the linear model:
names(model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
str(model)
## List of 12
## $ coefficients : Named num [1:2] -0.136 2.404
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ residuals : Named num [1:10] -0.792 0.473 0.643 0.345 -0.251 ...
## ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
## $ effects : Named num [1:10] -3.762 2.276 0.89 0.802 -0.237 ...
## ..- attr(*, "names")= chr [1:10] "(Intercept)" "x" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:10] 0.502 0.758 1.241 2.047 0.349 ...
## ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:10, 1:2] -3.162 0.316 0.316 0.316 0.316 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:10] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.32 1.12
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 8
## $ xlevels : Named list()
## $ call : language lm(formula = y ~ x)
## $ terms :Classes 'terms', 'formula' language y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ model :'data.frame': 10 obs. of 2 variables:
## ..$ y: num [1:10] -0.289 1.232 1.884 2.392 0.098 ...
## ..$ x: num [1:10] 0.266 0.372 0.573 0.908 0.202 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
## .. .. ..- attr(*, "variables")= language list(y, x)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. .. ..$ : chr "x"
## .. .. ..- attr(*, "term.labels")= chr "x"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(y, x)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## - attr(*, "class")= chr "lm"
model$coefficients
## (Intercept) x
## -0.1361272 2.4039002
model$coefficients[2]
## x
## 2.4039
coef(model)[2]
## x
## 2.4039
coef(model)["x"]
## x
## 2.4039
getAnywhere(coef.default)
## A single object matching 'coef.default' was found
## It was found in the following places
## registered S3 method for coef from namespace stats
## namespace:stats
## with value
##
## function (object, complete = TRUE, ...)
## {
## cf <- object$coefficients
## if (complete)
## cf
## else cf[!is.na(cf)]
## }
## <bytecode: 0x5628eb6e0068>
## <environment: namespace:stats>
Using coef provides the same result as direct access, but will continue to work even if the structure of the object changes.
The p-value, however, is not provided in this table. This is because it is actually calculated only when we ask for the summary:
modelsummary <- summary(model)
class(modelsummary)
## [1] "summary.lm"
names(modelsummary)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
modelsummary$coefficients # or, equivalently:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1361272 0.764750 -0.1780023 0.86314593
## x 2.4039002 1.218592 1.9726860 0.08399393
coef(modelsummary)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1361272 0.764750 -0.1780023 0.86314593
## x 2.4039002 1.218592 1.9726860 0.08399393
coef(modelsummary)[2,4]
## [1] 0.08399393
coef(modelsummary)["x", "Pr(>|t|)"]
## [1] 0.08399393
Exercise 4B
The following code fits an ANOVA model:
set.seed(1)
groups <- rep(LETTERS[1:3], each=5 )
data <- rnorm( length(groups) )
model_anova <- aov( data ~ groups )
summary(model_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## groups 2 0.03 0.0148 0.012 0.988
## Residuals 12 14.47 1.2059
How would you extract the p-value provided by this model into a “pvalue” variable ?
As before, we need to look into the object to understand its structure, using str() for example. We find that the object contains a list of length 1, which contains a data.frame with the information we require:
mysummary <- summary(model_anova)
str(mysummary)
## List of 1
## $ :Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
## ..$ Df : num [1:2] 2 12
## ..$ Sum Sq : num [1:2] 0.0296 14.4702
## ..$ Mean Sq: num [1:2] 0.0148 1.2059
## ..$ F value: num [1:2] 0.0123 NA
## ..$ Pr(>F) : num [1:2] 0.988 NA
## - attr(*, "class")= chr [1:2] "summary.aov" "listof"
mysummary[[1]]["groups", "Pr(>F)"]
## [1] 0.9878183
Exercise 5
Check how R implements “factors” in principle. For example, where are the labels and the fact that the factor is ordered stored ? Where is the function that prints a factor, and the function that makes a summary out of the factor ?
Looking at the summary or the names of the object does not tell us
anything. We can use the str() command to show us some information
about the structure:
fert <- factor( c(10, 20, 50, 30, 10, 20, 10, 45) )
str(fert)
## Factor w/ 5 levels "10","20","30",..: 1 2 5 3 1 2 1 4
It does not tell us much about how the data is stored; one other way is to remove the “factor” class, and see what default object is printed:
class(fert) <- ""
fert
## [1] 1 2 5 3 1 2 1 4
## attr(,"levels")
## [1] "10" "20" "30" "45" "50"
## attr(,"class")
## [1] ""
typeof(fert)
## [1] "integer"
We see that the data is stored as a vector of integer, and additional information is stored in the attributes of the object. In details:
fert <- factor( c(10, 20, 50, 30, 10, 20, 10, 45) )
attributes(fert)
## $levels
## [1] "10" "20" "30" "45" "50"
##
## $class
## [1] "factor"
methods(class="factor")
## [1] [ [[ [[<- [<- all.equal
## [6] as.character as.data.frame as.Date as.list as.logical
## [11] as.POSIXlt as.vector c coerce droplevels
## [16] format initialize is.na<- length<- levels<-
## [21] Math Ops plot print relevel
## [26] relist rep show slotsFromS3 summary
## [31] Summary xtfrm
## see '?methods' for accessing help and source code
# print.factor
# summary.factor
We can see in the list the methods print and summary, so that the two functions are print.factor and summary.factor (which you can print and look at if you want)
fert <- factor( c(10, 20, 50, 30, 10, 20, 10, 45), ordered=TRUE )
attributes(fert)
## $levels
## [1] "10" "20" "30" "45" "50"
##
## $class
## [1] "ordered" "factor"
In the case of an ordered factor, there is a second class, “ordered”; otherwise, the data is stored in the same way.