Project 3
Project 3: Assembly and annotation of bacterial genomes
You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species.
Project aim
Generate and evaluate an assembly of a bacterial genome out of PacBio reads.
There are eight different species: sample_[1-8].fastq.gz
Each species has a fastq file available. You can download all fastq files like this:
wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz
tar -xvf project3.tar.gz
rm project3.tar.gz
Note
Download the data file package in your shared working directory, i.e. : /group_work/<group name>
or ~/<group name>
. Only one group member has to do this.
This will create a directory project3
with the following structure:
project3
|-- sample_1.fastq.gz
|-- sample_2.fastq.gz
|-- sample_3.fastq.gz
|-- sample_4.fastq.gz
|-- sample_5.fastq.gz
|-- sample_6.fastq.gz
|-- sample_7.fastq.gz
`-- sample_8.fastq.gz
0 directories, 8 files
Before you start
You can start this project with dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
- Quality control with
NanoPlot
- Assembly with
flye
- Assembly QC with
BUSCO
- Annotation with
prokka
Tasks and questions
Note
You have four cores available. Use them! For most tools you can specificy the number of cores/cpus as an argument.
Note
All require software can be found in the conda environment assembly
. Load it like this:
conda activate assembly
- Perform a quality control with
NanoPlot
.- How is the read quality? Is this quality expected?
- How is the read length?
- Perform an assembly with
flye
.- Have a look at the helper first with
flye --help
. Make sure you pick the correct mode (i.e.--pacbio-??
). - Check out the output. Where is the assembly? How is the quality? For that, check out
assembly_info.txt
. - What species did you assemble? Choose from this list:
Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus
- Did flye assemble any plasmid sequences?
- Have a look at the helper first with
- Check the completeness with
BUSCO
. Have a good look at the manual first. You can use automated lineage selecton by specifying--auto-lineage-prok
. After you have runBUSCO
, you can generate a nice completeness plot withgenerate_plot.py
. You can check its usage withgenerate_plot.py --help
.- How is the completeness? Is this expected?
- Perform an annotation with
prokka
. Again, check the manual first. After the run, have a look at for example the statistics inPROKKA_[date].txt
. For a nice table of annotated genes have a look inPROKKA_[date].tsv
. - Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why?
This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/