Overview

Mash Screen (citation) is a software tool for fast contamination screening that uses the MinHash algorithm.

For this purpose, SeqSphere+ comes with a Mash reference database (sketch size s=1,000 and k-mer size k=21) that contains all prokaryotic NCBI Genome entries with the status complete or chromosome, filtered for taxonomically reliable genus and species information.

Important: Mash Screen requires Linux, or Windows with the Windows Subsystem for Linux installed.

Pipeline Contamination Check (Mash Screen)

If the option Perform contamination check (Mash Screen) is enabled in the pipeline script, Mash Screen is run to find the closest matches in the Mash reference database for the FASTA assembly contigs of a Sample. From FASTA files, contamination with a second species can only be reliably detected above a ratio of 9 percent.

If multiple different species are found above the predefined thresholds (Identity >=0.95, Shared-hashes >=100), a potential contamination is reported. If a potential contamination is detected, a warning is logged in the pipeline log.

Up to seven fields of the procedure statistics are filled with the results of the contamination check:

  • Top Species Match
  • Top Species Match Identity
  • Top Species Match Shared-Hashes
  • Contamination Check Result
  • Potential Contaminating Species*
  • Potential Contaminating Species Identity*
  • Potential Contaminating Species Shared-Hashes*

 * only filled if potential contamination found

The Top Species Match is always filled even if the top match does not reach the thresholds. All fields can be exported, shown in a comparison table, and viewed in the procedure details panel of a sample. If a contamination is found, it is highlighted as a warning in this panel. The first two fields are default fields when creating a comparison table.

Closely related species that are defined in Mash equivalency groups are not treated as contamination.
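
The decision logic described in this section can be sketched in a few lines of Python. This is only an illustration and not the SeqSphere+ implementation: the names Match, group_of, and check_contamination are hypothetical, the ranking by identity is an assumption, and the equivalency group in the example is made up for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One reference match (hypothetical structure for this sketch)."""
    species: str
    identity: float
    shared_hashes: int


def group_of(species, equivalency_groups):
    """Return the equivalency group containing the species, or the species alone."""
    for group in equivalency_groups:
        if species in group:
            return frozenset(group)
    return frozenset({species})


def check_contamination(matches, equivalency_groups=(),
                        min_identity=0.95, min_shared_hashes=100):
    """Return (top_match, contaminants).

    The top match is reported even if it misses the thresholds.
    Contaminants are further species above both thresholds that are
    not in the same equivalency group as the best passing match.
    """
    # Assumption: matches are ranked by identity, then by shared hashes.
    ranked = sorted(matches, key=lambda m: (m.identity, m.shared_hashes),
                    reverse=True)
    top = ranked[0] if ranked else None
    passing = [m for m in ranked
               if m.identity >= min_identity
               and m.shared_hashes >= min_shared_hashes]
    contaminants = []
    if passing:
        primary_group = group_of(passing[0].species, equivalency_groups)
        seen = set()
        for m in passing[1:]:
            if m.species in primary_group or m.species in seen:
                continue
            seen.add(m.species)
            contaminants.append(m)
    return top, contaminants


# Illustrative values only; the equivalency group below is an assumption.
example = [
    Match("Escherichia coli", 0.99, 900),
    Match("Shigella flexneri", 0.98, 800),
    Match("Staphylococcus aureus", 0.96, 150),
    Match("Bacillus subtilis", 0.90, 50),
]
groups = [{"Escherichia coli", "Shigella flexneri"}]
top, contaminants = check_contamination(example, groups)
for c in contaminants:
    print("Potential contaminating species:", c.species)
```

In this sketch, a second species inside the same equivalency group as the best match is ignored, while any other species above both thresholds is reported as a potential contaminant.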

Tools Menu Contamination Check (Mash Screen)

Figure: Contamination Check by using Mash Screen

The menu function Tools | Genome Utilities | Contamination Check (Mash Screen) can be used to screen a read file (FASTQ) or an assembly contigs file (FASTA/GB/BAM/ACE) for contaminants. For read data, using the forward reads file is recommended. From FASTQ files, contamination with a second species can be reliably detected above a ratio of 1 percent.

When the dialog is confirmed with Start, Mash Screen is started to find matches for the query in the Mash reference database. The resulting matches are filtered by thresholds for Identity and Shared-hashes. The default thresholds (Identity >=0.95, Shared-hashes >=100) can be changed.

By default, the option 'Winner takes all' is enabled to remove redundancy in the result. If this option is disabled, every matching strain of the same species in the reference database is reported in the result.
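
For readers who want to reproduce the threshold filtering outside SeqSphere+, the following Python sketch parses tab-separated output in the column order used by the standalone mash screen command (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment) and applies the default thresholds. How SeqSphere+ invokes Mash internally is not described here, so this is an assumption; the function names and the sample rows (REF_A, REF_B, REF_C) are hypothetical. In standalone Mash, winner-takes-all corresponds to the -w option of mash screen.

```python
from typing import NamedTuple


class ScreenHit(NamedTuple):
    identity: float
    shared: int          # matching hashes
    total: int           # sketch size of the reference
    median_multiplicity: float
    p_value: float
    query_id: str
    comment: str


def parse_screen_output(text):
    """Parse tab-separated lines in the mash screen column order."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        shared, total = fields[1].split("/")
        hits.append(ScreenHit(float(fields[0]), int(shared), int(total),
                              float(fields[2]), float(fields[3]),
                              fields[4], fields[5]))
    return hits


def filter_hits(hits, min_identity=0.95, min_shared=100):
    """Keep hits above both thresholds, best identity first."""
    kept = [h for h in hits
            if h.identity >= min_identity and h.shared >= min_shared]
    return sorted(kept, key=lambda h: h.identity, reverse=True)


# Made-up sample rows for illustration, not real screening results.
SAMPLE = (
    "0.96\t424/1000\t3\t0\tREF_B\tSpecies B strain 2\n"
    "0.99\t810/1000\t25\t0\tREF_A\tSpecies A strain 1\n"
    "0.85\t33/1000\t1\t1e-20\tREF_C\tSpecies C strain 3\n"
)

for hit in filter_hits(parse_screen_output(SAMPLE)):
    print(hit.query_id, hit.identity, f"{hit.shared}/{hit.total}")
```

Changing min_identity and min_shared in filter_hits mirrors changing the default thresholds in the dialog.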

The result is shown in a dialog window containing a table with all matches above the defined thresholds. The table can be exported and has the following columns:

Taxonomic Info
Contains the genus, species, subspecies, and strain names of the genomes found in the Mash reference database.
Reference Sketch-ID
The accession number of the genome found in the Mash reference database. It can be used to search the NCBI Genomes database.
Identity
The identity score is not the true identity between a reference genome and the query sequence. It is an estimate of the fraction of bases shared between the genome and the sequencing reads, derived from the fraction of shared k-mers. Sequencing errors and gaps in coverage reduce the identity estimate.
Shared-hashes
The more similar a genome is to the query, the more MinHashes are likely to match.
p-Value
Lower p-values correspond to more confident estimates. Very low values are often rounded down to 0.
Median Multiplicity
Computed for shared hashes based on the number of observations of those hashes within the query sequence.
Est. Abundance (%)
An estimate of the abundance of each found genome, based on the Median Multiplicity.
This value is only available if the 'Winner takes all' option is selected and the input file is a FASTQ file.
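
To make the relationship between the Identity and Shared-hashes columns concrete, the following Python sketch converts a shared-hash count into an identity estimate. It assumes the relation used by standalone Mash Screen, identity = (shared / s)^(1/k), with the sketch size s=1,000 and k=21 of the reference database mentioned above. Whether SeqSphere+ reports exactly this value is an assumption, and both function names are hypothetical.

```python
import math


def identity_from_shared(shared, sketch_size=1000, k=21):
    """Estimate identity from the fraction of shared hashes.

    Assumes identity = (shared / sketch_size) ** (1 / k).
    """
    if shared <= 0:
        return 0.0
    if shared >= sketch_size:
        return 1.0
    return (shared / sketch_size) ** (1.0 / k)


def min_shared_for_identity(identity, sketch_size=1000, k=21):
    """Smallest shared-hash count that reaches the given identity."""
    return math.ceil(identity ** k * sketch_size)


# Example: identity estimate for 991 of 1,000 shared hashes.
print(round(identity_from_shared(991), 5))
# Shared hashes needed to reach the default identity threshold of 0.95.
print(min_shared_for_identity(0.95))
```

Under this assumption, an identity of 0.95 corresponds to roughly 341 of 1,000 shared hashes, which shows how quickly the identity estimate drops as fewer hashes match.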