Overview
The module enables users to assemble Oxford Nanopore Technologies (ONT) FASTQ files. It can also directly monitor and process MinKNOW run data.
The module consists of the following tools:
- Trimming
- Chopper: Applies a headcrop (trims the start of a read) and a tailcrop (trims the end of a read). Filters reads by average read quality and by minimum or maximum read length (turned on by default with quality 10 and minimum length 500).
- Subsampling (and filtering)
- Rasusa: Unlike Filtlong, randomly subsamples reads of different lengths to a specified coverage (turned on by default with coverage 100).
- Filtlong: Filters long reads by quality, preferring longer reads, and subsamples them (by default with coverage 100). May be beneficial when subsampling is applied to RBK and especially RPBK data.
- De novo assembly
- Flye (default ONT assembler): Uses a repeat graph as the core data structure. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs are built from approximate sequence matches and can therefore tolerate noisier reads. Does not correct the raw reads (in contrast to the Canu assembler). Reports circularity and assembled coverage of contigs. Runs with the --nano-hq option by default. Is not fully deterministic, i.e., re-analyzing the same dataset does not always yield exactly the same results.
- Raven: Overlap-layout-consensus assembler that accelerates the overlap step, builds an assembly graph from reads pre-processed with pile-o-grams, and polishes the unambiguous graph paths with Racon. Does not correct the raw reads. Reports circularity and assembled coverage of contigs and is deterministic.
- Polishing
- Medaka 2.0: Creates consensus sequences from nanopore sequencing data using neural networks applied to a pileup of individual sequencing reads against a draft assembly. Corrects only the FASTA consensus, not the raw FASTQ reads. If Rasusa or Filtlong was applied, Medaka uses only the subsampled reads (turned on by default with model r1041_e82_400bps_bacterial_methylation).
- ONT-cgMLST-Polisher: The proprietary ONT-cgMLST-Polisher is part of the ONT Data Assembly module. First, it uses minimap2 to map the FASTQ reads (basecalled with Dorado, SUP model 4.2 or later) to the assembly consensus FASTA sequence produced by Medaka 2.0. Next, it scans the alignment for positions in the core and accessory genome MLST genes that might indicate methylation-related sequencing errors, e.g., differing strand-specific majority consensus calls. Those ‘ambiguous’ positions are then compared against a sequence with a closely related cgMLST allelic profile. Finally, based on this comparison, the consensus call at each ambiguous position is either confirmed or masked with an ‘N’ (turned on by default).
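The idea of detecting differing strand-specific majority consensus calls and then confirming or masking them can be illustrated with a minimal Python sketch. This is not the proprietary ONT-cgMLST-Polisher implementation. The function names, the pileup representation (a dict from position to observed bases per strand), and the confirmation rule (a call is confirmed only if it equals the base of the closely related sequence at the same position) are all assumptions made for illustration, and indels are ignored.

```python
from collections import Counter


def majority_base(bases):
    """Return the most frequent base, or None if no bases were observed."""
    if not bases:
        return None
    return Counter(bases).most_common(1)[0][0]


def find_ambiguous_positions(forward_pileup, reverse_pileup, length):
    """List positions where forward and reverse majority calls disagree.

    Both pileups are hypothetical dicts: position -> list of observed bases.
    Positions lacking observations on either strand are skipped.
    """
    ambiguous = []
    for pos in range(length):
        fwd = majority_base(forward_pileup.get(pos, []))
        rev = majority_base(reverse_pileup.get(pos, []))
        if fwd is not None and rev is not None and fwd != rev:
            ambiguous.append(pos)
    return ambiguous


def confirm_or_mask(consensus, ambiguous, related_sequence):
    """Confirm or mask each ambiguous consensus call.

    Assumed rule: keep the call if it matches the closely related sequence
    at the same position, otherwise replace it with 'N'.
    Assumes both sequences share the same coordinates (no indels).
    """
    result = list(consensus)
    for pos in ambiguous:
        if consensus[pos] != related_sequence[pos]:
            result[pos] = "N"
    return "".join(result)


# Toy usage: position 1 has conflicting strand-specific majority calls.
forward = {1: ["C", "C", "T"], 2: ["G", "G"]}
reverse = {1: ["T", "T", "C"], 2: ["G"]}
ambiguous = find_ambiguous_positions(forward, reverse, length=4)
confirmed = confirm_or_mask("ACGT", ambiguous, "ACGT")
masked = confirm_or_mask("ACGT", ambiguous, "ATGT")
```

In the toy usage, the conflicting position is kept when the related sequence agrees with the consensus and is replaced by N when it does not.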
Hybrid assemblies are not supported. For further information see our long-read de novo assembler evaluation.
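The default read filtering (quality 10, minimum length 500) and coverage-based random subsampling (coverage 100) described in the tool list above can be sketched in Python as follows. This is a simplified illustration, not the code of Chopper or Rasusa. It assumes that average read quality is derived from the mean per-base error probability, that reads are given as (sequence, Phred scores) tuples, and that subsampling shuffles the reads and keeps them until genome size times coverage bases are collected; all function and parameter names are hypothetical.

```python
import math
import random


def mean_quality(phred_scores):
    """Average quality via the mean error probability (assumed method)."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(sequence, phred_scores, min_quality=10, min_length=500):
    """Keep a read if it is long enough and its average quality suffices.

    The length check runs first, so empty reads never reach mean_quality
    as long as min_length is positive.
    """
    return (len(sequence) >= min_length
            and mean_quality(phred_scores) >= min_quality)


def subsample_to_coverage(reads, genome_size, coverage=100, seed=None):
    """Randomly keep reads until about genome_size * coverage bases are kept.

    reads: list of (sequence, phred_scores) tuples (hypothetical layout).
    """
    target_bases = genome_size * coverage
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    kept, total_bases = [], 0
    for read in shuffled:
        if total_bases >= target_bases:
            break
        kept.append(read)
        total_bases += len(read[0])
    return kept


# Toy usage: filter first, then subsample the surviving reads.
reads = [("A" * 600, [20] * 600) for _ in range(10)]
reads.append(("A" * 100, [20] * 100))   # too short
reads.append(("A" * 600, [5] * 600))    # quality too low
filtered = [r for r in reads if passes_filter(*r)]
subsampled = subsample_to_coverage(filtered, genome_size=30, coverage=100, seed=1)
```

Because reads are drawn in random order regardless of their length, the kept set contains reads of different lengths, whereas a Filtlong-style selection would prefer longer, higher-quality reads.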
Accuracy and Contiguity
For accuracy and contiguity evaluations, see the links.
Furthermore, the ONT-cgMLST-Polisher was tested (including Dorado model 5.0) in a recent ring trial involving six different laboratories (Prior et al. (2025)).
Requirements
The ONT Data Assembly Module is part of the Long-read Data Analysis Bundle [LDAB], which is charged separately.
Important: