Project 3: Assembly and annotation of bacterial genomes
You will be working with PacBio sequencing data of five different bacterial strains. Divide the strains over the members of the group and generate an assembly and annotation.
Project aim
Generate and evaluate an assembly of a bacterial genome from PacBio reads.
There are five different strains:
- LMB2
- LWH12
- LWH7
- LWO12
- LWO14
Each strain has its own tar file. Download only the data for the strain(s) you will work on:
mkdir -p ~/workdir/groupwork_assembly
cd ~/workdir/groupwork_assembly
# change this to your strain (LWX12 is only a placeholder, not one of the five strains):
STRAIN="LWX12"
wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$STRAIN".tar.gz
tar -xvf "$STRAIN".tar.gz
rm "$STRAIN".tar.gz
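The value LWX12 in the commands above is not one of the five strains listed, so it has to be replaced before downloading. A minimal sketch of a guard that checks the name first; the names VALID_STRAINS and is_valid_strain are made up for this illustration and are not part of the course material:

```shell
# The five strains listed in this project.
VALID_STRAINS="LMB2 LWH12 LWH7 LWO12 LWO14"

# Return 0 if the given name is one of the five strains, 1 otherwise.
# (is_valid_strain is a hypothetical helper, not an existing tool.)
is_valid_strain() {
    for s in $VALID_STRAINS; do
        if [ "$s" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Example: check the value before running wget.
STRAIN="LWH7"
if is_valid_strain "$STRAIN"; then
    echo "OK: $STRAIN"
else
    echo "Unknown strain: $STRAIN" >&2
fi
```

Running the check before wget avoids a failed download caused by a typo in the strain name.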
The downloaded directory has the following structure (here’s an example for LWH7):
LWH7
├── CLR
│   └── reads
│       └── LWH7_CLR.fasta.gz
├── HiFi
│   └── reads
│       ├── LWH7_HiFi.fasta.gz
│       └── LWH7_HiFi.fastq.gz
└── Illumina
    ├── assembly
    │   └── scaffolds_200.fasta
    └── reads
        ├── LWH7_1.fastq.gz
        └── LWH7_2.fastq.gz

7 directories, 6 files
The directories CLR and HiFi contain the CLR and HiFi reads respectively. The directories in Illumina contain the Illumina reads and an assembly as produced by SPAdes.
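Before running any tool, it can help to check how many reads each file contains. A small sketch using only standard utilities; the function names are invented for this example, and the FASTQ count assumes four lines per record (no wrapped sequences):

```shell
# Count sequences in a gzipped FASTA file: one header line starting with ">" per read.
count_fasta_reads() {
    gzip -dc "$1" | grep -c '^>'
}

# Count sequences in a gzipped FASTQ file, assuming exactly four lines per read.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# Usage, following the LWH7 layout shown above:
#   count_fasta_reads LWH7/CLR/reads/LWH7_CLR.fasta.gz
#   count_fastq_reads LWH7/HiFi/reads/LWH7_HiFi.fastq.gz
```

Comparing these counts between CLR and HiFi gives a first impression of the data before the full quality control.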
Before you start
You can start this project by dividing the strains over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
- Quality control with NanoPlot
- Assembly with flye
- Assembly QC with BUSCO
- Annotation with prokka
You can do this for both the CLR reads and HiFi reads and compare the results.
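The four steps listed above can be sketched as thin wrapper functions, one per tool. This is an illustration under assumptions, not the course's reference solution: the function names, the output names and THREADS=4 are choices made for this example, and the flye read-type option is deliberately left as an argument so that you still pick it yourself from flye --help. It also assumes NanoPlot, flye, busco and prokka are available on your PATH.

```shell
THREADS=4   # assumption: a machine with four cores

# Read QC. NanoPlot takes --fasta or --fastq; the CLR reads ship as FASTA only.
# usage: run_nanoplot <reads.fasta.gz> <outdir>
run_nanoplot() {
    NanoPlot --fasta "$1" --outdir "$2" --threads "$THREADS"
}

# Assembly. The first argument is the flye read-type option, which you choose.
# usage: run_flye <--pacbio-...> <reads> <outdir>
run_flye() {
    case "$1" in
        --pacbio-raw|--pacbio-corr|--pacbio-hifi) ;;
        *) echo "unexpected flye mode: $1" >&2; return 1 ;;
    esac
    flye "$1" "$2" --out-dir "$3" --threads "$THREADS"
}

# Completeness check in genome mode with automated prokaryotic lineage selection.
# usage: run_busco <assembly.fasta> <run_name>
run_busco() {
    busco -i "$1" -o "$2" -m genome --auto-lineage-prok -c "$THREADS"
}

# Annotation. The prefix replaces the default PROKKA_[date] in the output file names.
# usage: run_prokka <assembly.fasta> <outdir> <prefix>
run_prokka() {
    prokka --outdir "$2" --prefix "$3" --cpus "$THREADS" "$1"
}

# Example for the quality control step (path follows the LWH7 layout shown earlier):
#   run_nanoplot LWH7/HiFi/reads/LWH7_HiFi.fasta.gz nanoplot_LWH7_HiFi
```

Keeping each step in its own function makes it easy to repeat the same chain for the CLR and the HiFi reads and to compare the outputs side by side.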
Tasks and questions
Note
You have four cores available. Use them! For most tools you can specify the number of cores/CPUs as an argument.
Note
All required software can be found in the conda environment assembly. Load it like this:
conda activate assembly
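After activating the environment, a quick check that the tools are actually reachable can save time later. A minimal sketch; missing_tools is a helper name invented for this example:

```shell
# Print the name of every listed tool that is NOT found on PATH.
# (missing_tools is a hypothetical helper, not part of conda or the course setup.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# The four tools used in this project; prints nothing if all are available.
missing_tools NanoPlot flye busco prokka
```

If any name is printed, the environment is probably not active in the current shell.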
Before you run prokka
The conda installation is missing a Perl module. Install it in the assembly environment like this:
cpanm Bio::SearchIO::hmmer --force
- Perform a quality control with NanoPlot.
  - How is the read quality? Is this quality expected?
  - How is the read length?
- Perform an assembly with flye.
  - Have a look at the help first with flye --help. Make sure you pick the correct mode (i.e. --pacbio-??).
  - Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt.
- Check the completeness with BUSCO. Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok. After you have run BUSCO, you can generate a nice completeness plot with generate_plot.py. You can check its usage with generate_plot.py --help.
  - How is the completeness? Is this expected?
- Perform an annotation with prokka. Again, check the manual first. After the run, have a look at, for example, the statistics in PROKKA_[date].txt. For a nice table of annotated genes, have a look in PROKKA_[date].tsv.
- Compare the assembly and annotation between the Illumina, CLR and HiFi reads. Do you see any differences?
- Compare the assemblies of the different strains. Are assembly qualities similar? Can you think of reasons why?
- BONUS: Polish the CLR assembly with the Illumina reads by using pilon. For this you will need to align the Illumina reads to the assembly first. Use minimap2 for that while setting -x to sr. For pilon, specify the resulting bam file by using the option --frags.
  - Does the polishing improve the assembly? Why (not)?
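The bonus step above (align the Illumina reads, then polish) can be sketched as a single function. Several things here are assumptions rather than part of the exercise text: samtools is used to sort and index the alignment because pilon expects a sorted, indexed BAM file, and the function name, the file name illumina.sorted.bam, the output prefix polished and THREADS=4 are all choices made for this illustration.

```shell
THREADS=4   # assumption: a machine with four cores

# usage: polish_with_pilon <assembly.fasta> <reads_1.fastq.gz> <reads_2.fastq.gz> <outdir>
polish_with_pilon() {
    mkdir -p "$4"
    # Align the paired-end reads with the short-read preset (-x sr).
    # -a makes minimap2 write SAM, which samtools sort converts to a sorted BAM.
    minimap2 -a -x sr -t "$THREADS" "$1" "$2" "$3" \
        | samtools sort -@ "$THREADS" -o "$4/illumina.sorted.bam" -
    # Index the BAM so that pilon can access it by position.
    samtools index "$4/illumina.sorted.bam"
    # --frags takes the paired-end alignments; --changes also writes a list of edits.
    pilon --genome "$1" --frags "$4/illumina.sorted.bam" \
        --output polished --outdir "$4" --changes
}

# Usage (paths are placeholders; point the first argument at your CLR assembly):
#   polish_with_pilon <CLR_assembly.fasta> LWH7/Illumina/reads/LWH7_1.fastq.gz \
#       LWH7/Illumina/reads/LWH7_2.fastq.gz pilon_LWH7
```

The list of changes written by pilon, together with a second BUSCO run on the polished assembly, is one way to approach the question of whether polishing helped.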