Solutions Step 1: Taxonomic profiling of metagenomic samples with mOTUs

General note: this guide has been written assuming you use a Mac or Linux Command Line.

Download example sequencing data

Explore the files, in particular you can check:

How many reads there are per sample?
Solution 1

Knowing that each read takes up four lines in the fastq file, we can simply count the number of lines with wc -l and divide the result by 4. The following command does it all in one line.
```
echo $(cat sampleA_1.fastq|wc -l)/4|bc
```
Solution 2

We can count the number of lines with @read:
```
grep -c "@read" sampleA_1.fastq
```
What is the average length of the reads? Is there a difference between the read lengths in the forward and reverse files?
Solution 1

We can first extract only the sequences:
```
grep -A 1 "@read" sampleA_1.fastq  | grep -v "\-\-" | grep -v "read" > sequences_sampleA_1
```
With -A 1 we select also 1 row after the match. With grep -v we remove what is not needed. We can now check the length with:
```
cat sequences_sampleA_1 | awk '{print length}'
```
This will print a big list, we can count how many times each length appear:
```
cat sequences_sampleA_1 | awk '{print length}' | sort | uniq -c | sort -n | tail
```
Which produces:
```
  238 96
  242 92
  258 93
  341 97
  344 94
  346 95   
  428 98
  849 20
 1194 99
61306 100
```
So the majority of the reads have length 100 (61,306 out of 67,926, 90%)
Solution 2

To quickly check the average length of the reads in the terminal, do:
```
awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' sampleA_1.fastq
```
The average read length in the reverse reads seem to be slightly lower in all the samples.

Do you have the same read IDs in the forward and reverse file?

Solution

Since these are paired reads, the read ids should be identical and in the same order. You can check this in the terminal like so:

#get list of read ids from the forward and reverse files
grep '@read' sampleA_1.fastq > sampleA_ids_1.txt
grep '@read' sampleA_2.fastq > sampleA_ids_2.txt
#check if they are identical 
diff -s sampleA_ids_1.txt  sampleA_ids_2.txt

There are many different ways of performing the same task. If you have done something different and accomplished the same thing, awesome!

Check the quality of the sequencing data

Which part of the reads is of lower quality?

Solution

The ends of the reads are typically of lower quality. This is to be expected. The quality of calls typically degrades as the run progresses due to problems in the sequencing chemistry.

- Is there any difference between the quality of the forward and reverse reads?

Solution

Reverse reads are usually of lower quality than forward reads, particularly at the read ends. Again this is due to the way paired end sequencing is performed with the forward orientiation is sequenced first followed by the reverse orientation.

Filter and trim reads

Try to run trimmomatic (you can use different parameters).
Solution

Here is an example command:
```
trimmomatic PE sampleA_1.fastq sampleA_2.fastq sampleA_filtered_1P.fastq sampleA_filtered_1U.fastq sampleA_filtered_2P.fastq sampleA_filtered_2U.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```
Here is a description of the parameters used in this specific command. You may have explored other parameters too.
- ILLUMINACLIP: specifies the file containing the adapter sequences to trim and additional parameters on how to perform the adapter trimming. TruSeq3-PE.fa (provided by trimmomatic) contains the Illumina adapter sequences used by HiSeq and MiSeq machines.
- LEADING: Remove low quality bases (lower than 3) from the beginning of the reads.
- TRAILING: Remove low quality bases (lower than 3) from the ends of the reads.
- SLIDINGWINDOW: consider a window of bases (here 4 at once) and trim once the average quality within the window falls below a threshold quality (here 15).
- MINLEN: remove reads lower than the specified min length (here 36)

How many files did trimmomatic generate? What do they contain?

Solution

4 files are produced

sampleA_filtered_1P, containing the forward reads that pass the filter and have a mate (in filtered_2P);
sampleA_filtered_1U, containing the forward reads that pass the filter and do not have a mate (the paired reverse read didn’t pass the filter)
sampleA_filtered_2P, containing the reverse reads that pass the filter and have a mate (in filtered_1P);
sampleA_filtered_2U, containing the reverse reads that pass the filter and do not have a mate (the paired forward read didn’t pass the filter)

- How many reads have been filtered out?

Solution

866 reads (1.27%) of all reads were filtered out from sampleA using the above parameters.

Check the quality of the filtered reads. Did the quality improve?

Solution

You can check the quality of the filterd reads again with fastqc.

fastqc sampleA_filtered_1P.fastq
fastqc sampleA_filtered_2P.fastq

The quality of reads (particularly of the reverse reads) has improved!

Taxonomic profiling with mOTUs

Use motus (manual: link) to create a profile from the files created by trimmomatic.
Solution
Here is the mOTU command to generate a taxonomic profile using default parameters.
```
motus profile -f sampleA_filterd_1P.fastq -r sampleA_filtered_2P.fastq -s sampleA_filtered_1U.fastq,sampleA_filtered_2U.fastq -n sampleA -o sampleA_profile.txt
```
- How many species are detected? How many are reference species and how many are unknown species?
Solution
You can quickly check how many species were detected with:
```
#this also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_profile.txt
```
97 species were dectected. Around 3.4 % were unassigned .

You can check how many ref-mOTUs were detected using these command:
```
grep -v '0.0000000000\|#' sampleA_profile.txt > sampleA_profile_detected.txt
grep -c 'ref_mOTU_v3_' sampleA_profile_detected.txt
```
39 ref-mOTUs were detected in sampleA. Note that this number is also reported as stdout when you run motus profile
Can you change some parameters in motus to profile more or less species? (Hint, look here)
Solution
Precision is the number of TP out of the total number of detected species.

Recall is the number of detected species out of all the species actually present in the sample.

To increase precision at the cost of recall you can increase parameters -g (default: 3) and -l (default: 75).
```
motus profile -f sampleA_filterd_1P.fastq -r sampleA_filtered_2P.fastq -s sampleA_filtered_1U.fastq,sampleA_filtered_2U.fastq -n sampleA -o sampleA_profile_high_p.txt -g 8 -l 90
```
We have detected just 37 species.

To increase recall at the cost of having more false positives you can do:
```
motus profile -f sampleA_filterd_1P.fastq -r sampleA_filtered_2P.fastq -s sampleA_filtered_1U.fastq,sampleA_filtered_2U.fastq -n sampleA -o sampleA_profile_high_r.txt -g 1 -l 45
```
We have detected 331 species.
How can you merge different motus profiles into one file? Try to profile and then merge three profiles (Sample A, B and C).
Solution

After creating the individual tax profiles for all the samples, do:
```
motus merge - i sampleA_profile.txt,sampleB_profile.txt,sampleC_profile.txt -o merged_profiles.txt
```
This results in a tab-separated file containing the tax profiles.

Taxonomic profiling with MAPseq

Similar as with mOTUs, first create a profile for each sample (A,B, and C) and then merge them into one (Check the github page for the command).
Solution

In order to create a taxonomic profile using MAPseq for sampleA do:
```
mapseq sampleA_filtered_1P.fasta > sampleA.mseq
```
MAPseq seems to be a bit faster than mOTUs (took ~2 min to run)

sampleA.mseq contains the results from mapping reads to the reference database of OTUs provided by MAPseq (alignment score, database hit, etc) and the taxnomic classifications along with associated confidences.

After generating the .mseq files for all the samples, you can merge them into one OTU table like so:
```
mapseq -otutable sampleA.mseq sampleB.mseq sampleC.mseq -ti 1 -tl 3 > mapseq_otutable_otu97.tsv
```
This creates an OTU table containing reads mapped to 97% OTUs for sampleA, sampleB and sampleC.

Note that depending on whether -ti is 0 or 1, what -tl means changes.

If you have -ti 0, then -tl indicates the taxonomic level (0 (domain), 1 (phylum), 2 (class), 3 (order), 4 (family), 5 (genus), 6 (species)) . So if -ti 0 -tl 3 means that the OTU table will report only read counts mapping to order-level NCBI taxonomies.

If you have -ti 1, then -tl indicates the OTU clustering level (1 (90% OTU), 2 (96% OTU), 3 (97% OTU), 4 (98%), 5 (99%)) . So if -ti 1 -tl 3 means that the OTU table will report only read counts mapping to 97% OTUs.

To obtain reads mapping to 99% OTUs :
```
mapseq -otutable sampleA.mseq sampleB.mseq sampleC.mseq -ti 0 -tl 5 > mapseq_otutable_otu99.tsv
```
If we increase the clustering level to 99%, we observe fewer species detected for all the samples. This might be because at a finer resolution, we might not be able to assign taxonomies too well resulting in a smaller number of species being profiled.

To obtain reads mapping to 96% OTUs :
```
mapseq -otutable sampleA.mseq sampleB.mseq sampleC.mseq -ti 0 -tl 2 > mapseq_otutable_otu96.tsv
```
If we decrease the clustering level to 96%, we observe more species detected for all the samples.
Can you compare mOTUs and MAPseq profiles?

Solution

Profiler species detected in sampleA

mOTUs 97

MAPseq 97% 173

MAPseq 99% 121

MAPseq 96% 179

It looks like mOTUs is a bit more conservative at detecting species. Note that since mOTUs and OTUs are defined differently, it might not be straightforward to make a direct comparison.

Explore taxonomic profiles

Explore the taxonomic profiles (tax_profile).

Solution

Load packages and taxonomic profile into your R environment

#packages
library(tidyverse)

#tax profiles
load(url("https://zenodo.org/record/6517497/files/human_microbiome_dataset.Rdata"))

Let's have a quick peek at our data.

dim(tax_profile)

There are 246 rows and 496 columns. Let's check the content:

tax_profile[1:3,1:3]

Which results in:

             700002_T0 700002_T1 700002_T2
Blautia            229      1196      1758
Bacteroides       4018      1705      1660
Agathobacter        80      1580       126

The rows are genera and the columns are samples.

head(metadata)

Which is:

          Subject Timepoint Sex
700002_T0  700002         0   M
700002_T1  700002        12   M
700002_T2  700002        24   M
700002_T3  700002        50   M
700004_T0  700004         0   F
700004_T1  700004        12   F

The 496 samples are from 124 patients (sampled 4 times over the course of the year). `tax_profile` contains the read counts of of species (rows) across all samples (columns). For each sample, we know the corresponding subject id, timepoint of sampling and the sex of the subject.

Here are some hints of what you can check:

How many reads there are per sample?
Solution

Calculate total number of reads in every sample (also called library size of sample)
```
sample_read_counts <- data.frame(total_read_counts = colSums(tax_profile))
```
How are the sample read counts distributed?
```
ggplot(data = sample_read_counts) + geom_histogram(mapping = aes(x = total_read_counts), bins = 60) + ylab('Number of samples') + xlab('Total number of reads')
```
Which results in:

Note how variable the total read counts are across all samples. This is a problem because this variation is most likely a result of the sequencing process and not any meaningful biological variation.
If you want to compare different samples, is it a problem that there are different read counts? Try to divide each value within a sample by the sum of the reads in that sample to normalise the data (also called relative abundance).
Solution

We need to remove this technical variation caused by differing total read counts to meaningfully compare samples. We can do this by relative abundance normalization where we divide each value within a sample by the total read counts in that sample.
```
rel_ab = prop.table(tax_profile,2)
```
Now the abundances of each sample should sum to 1.
```
all(colSums(rel_ab) == 1)
```

Which genera are the most and least prevalent?

Solution

Prevalence of a species refers to the the proportion of samples in which that species is detected.

First we calculate prevalence for all the genera.

number_of_samples = dim(tax_profile)[2]
prevalence_df <- data.frame(Prevalence = rowSums(rel_ab>0)/number_of_samples, genus = rownames(tax_profile))

You can plot an histogram of the prevalences:

ggplot(data = prevalence_df) + geom_histogram(mapping = aes(x = Prevalence), bins = 60)

You can see that there are many species that apper only few times (on the left), and there are only few species that are present in all samples (on the right). We can also check which genera are the most prevalent:

head(prevalence_df[order(prevalence_df$Prevalence,decreasing = T),])

Result:

                 Prevalence            genus
Blautia           1.0000000          Blautia
Bacteroides       1.0000000      Bacteroides
-1                1.0000000               -1
Faecalibacterium  0.9899194 Faecalibacterium
Anaerostipes      0.9879032     Anaerostipes
Fusicatenibacter  0.9778226 Fusicatenibacter

It means that the genus Blautia and Bacteroides are present in all species. The -1 represents unassigned reads (i.e. that they do not map to any known genus).

The least prevalent genera are:

                               Prevalence                         genus
Rosenbergiella                0.002016129                Rosenbergiella
28-4                          0.002016129                          28-4
Gallicola                     0.002016129                     Gallicola
Sarcina                       0.002016129                       Sarcina
Harryflintia                  0.002016129                  Harryflintia
Paeniclostridium              0.002016129              Paeniclostridium

Which are present in only one sample.

You can see that the most prevalent species are typically genera that that should be present in all human guts. This type of quick exploration can also serve as a sanity check (is there something we should not be seeing at all?)

Is the relative abundance of the different genera normally distributed?
Solution

If we look at the distribution of all relative abundances with a simple histogram:
```
abundances_df = data.frame(rel_abundances=as.vector(rel_ab))
ggplot(data = abundances_df) + geom_histogram(aes(rel_abundances),bins=100) + xlab("All relative abundances")
```
We can clearly see that the relative abundances are not normally distributed. Maybe if we log transform the data, the result improve. Note that in order to log transform the data we need to add a small value (in this case 10^-4) so that we don’t have the problem of calculating the log of zero. Code:
```
log_rel_ab = data.frame(rel_abundances=log10(as.vector(rel_ab) + 10^-4))
ggplot(data = log_rel_ab) + geom_histogram(aes(rel_abundances),bins=100) + xlab("All relative abundances (log10)") + scale_y_log10()
```
As you can see even when we log transform, the high number of zero makes the distribution not normal.

We can check also single genera. Here I select 3 genera: Bacteroides with prevalence of 1, Akkermansia with a prevalence of ~0.5 and Harryflintia with the lowest relative abundance. We can plot them with the following code:
```
df_genera = data.frame(
    genus = c(rep("Bacteroides",ncol(rel_ab)),rep("Akkermansia",ncol(rel_ab)),rep("Harryflintia",ncol(rel_ab))),
    rel_ab = c(rel_ab["Bacteroides",],rel_ab["Akkermansia",],rel_ab["Harryflintia",])
)

ggplot(data = df_genera) + geom_histogram(mapping = aes(x = rel_ab), bins = 60) + 
facet_grid(. ~ genus)
```
As you can see for Harryflintia there are almost only zeros (only one sample contain this genus). On the other hand Bacteroides can almost have a normal distribution (with a long tail on the right). While Akkermansia shows a tipical microbiome distribution plot with many samples where the measure is zero and then a tail with few samples where the relative abundance is higher.

How many zeros there are per sample and per genus?

Solution

Taxonomic profiles are typically sparse because most species occur in frequently. Do we see this in our data as well?

How many 0s are present overall in the data?

# we transform the relative abundance in a vector
temp = as.vector(rel_ab)
# and check the length (how many values there are)
length(temp) # 122016
# and how many values are zero:
sum(temp == 0) # 94133

77% of the data are 0s (94133/122016)!

If we look at the percentage of 0s per sample:

head(data.frame(zeros_per_sample =colSums(rel_ab == 0)/(dim(rel_ab)[1])))

Result:

          zeros_per_sample
700002_T0        0.8536585
700002_T1        0.7398374
700002_T2        0.7520325
700002_T3        0.7723577
700004_T0        0.7723577
700004_T1        0.7723577

We can see that the sparsity is similar across all samples. Do you think that if we had samples from a different environment (like Soil for instance), we might see something different?

Percentage of 0s per genus is the same as 1 - prevalence:

head(data.frame(zeros_per_sample =rowSums(rel_ab == 0)/(dim(rel_ab)[2])))

Result:

                 zeros_per_sample
Blautia                0.00000000
Bacteroides            0.00000000
Agathobacter           0.04435484
Faecalibacterium       0.01008065
Bifidobacterium        0.09072581
Fusicatenibacter       0.02217742

Again, we note that some genera are more prevalent than others.

How much variability there is within Subject (check the metadata table), compare to between subjects? Or from another perspective, how stable it is the human gut microbiome?

Solution 1

There are different ways to explore this problem. We can try to calculate the distance between all possible samples and then compare the distances of samples that come from the same subject and the distance that come from different subjects. For example, you take sample 1 (700002_T0) and sample 2 (700002_T1), and since they are from the same subject, we will use the distance between these two samples as an example of within-subject distance.

Code:

rel_ab = prop.table(tax_profile,2)
log_rel_ab = log10(rel_ab+ 10^-4)

n_samples = ncol(log_rel_ab)

# where to save the result
distance = c()
type = c()

# we go through all the pair of samples and we calculate the distance
for(i in c(1: (n_samples-1) )){
  for(j in c( (i+1) : n_samples)){
    # name of the samples
    s_i = colnames(log_rel_ab)[i]
    s_j = colnames(log_rel_ab)[j]
    # profiles
    profile_i = log_rel_ab[,i]
    profile_j = log_rel_ab[,j]
    # calculate the distance as sum of the absolute distance
    d = sum(abs(profile_i - profile_j))
    # add the values
    distance = c(distance,d)
    # add correct type
    if(metadata[s_i,"Subject"] == metadata[s_j,"Subject"]){
      type = c(type,"Within distance")
    }else{
      type = c(type,"Between distance")
    }
  }
}

df = data.frame(dist = distance,type = type)

ggplot(df,aes(type,dist)) + geom_boxplot()

It is quite clear that they are different, but we can also test it:

wilcox.test(df[df$type == "Within distance",]$dist,
            df[df$type == "Between distance",]$dist)

Result:

Wilcoxon rank sum test with continuity correction

data:  df[df$type == "Within distance", ]$dist and df[df$type == "Between distance", ]$dist
W = 2844084, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

This means that samples from the same subject have a low distance, or in other words they are similar to each other. And they are much more similar compared to other subjects, even after 1 year (the last time point is 50 weeks). From this we understand two things: - First, the human microbiome is stable over time - Second, there is a great variability between subjects

Solution 2

We can do a PCA plot:

rel_ab = prop.table(tax_profile,2)
log_rel_ab = log10(rel_ab+ 10^-4)

# remove zero rows
log_rel_ab = log_rel_ab[rowSums(rel_ab) > 0,]

pc <- prcomp(t(log_rel_ab),
             center = TRUE,
             scale. = TRUE)

df = data.frame(
  pc1 = pc$x[,1],
  pc2 = pc$x[,2],
  Subject = as.factor(metadata[rownames(pc$x),"Subject"]),
  Timepoint = metadata[rownames(pc$x),"Timepoint"],
  Sex = metadata[rownames(pc$x),"Sex"]
)

And plot the result:

ggplot(df,aes(x = pc1,y = pc2, col = Sex)) + geom_point()

ggplot(df,aes(x = pc1,y = pc2, col = Timepoint)) + geom_point()

ggplot(df,aes(x = pc1,y = pc2, col = Subject)) + geom_point() +
theme(legend.position = "none")

There is no particular grouping by sex or by timepoint (as expected). It also seems that the plot based on subject (third plot where each color is a subject) is random. But, we can actually see some structure, in particular outside of the conglomerate of points at the center.

Profiler	species detected in sampleA
mOTUs	97
MAPseq 97%	173
MAPseq 99%	121
MAPseq 96%	179