Overview

For Linux users, and for Windows 10/Windows Server 2019 users with Windows Subsystem for Linux (WSL) installed, SeqSphere+ supports the analysis of SARS-CoV-2 tiled amplicon data (e.g., ARTIC or AmpliSeq) starting from Illumina raw reads. BWA-MEM maps the FASTQ files to the seed/reference genome MN908947.3 (NC_045512). iVar then trims the primers from the BAM file using a supplied BED file that contains the primer positions. Finally, Samtools mpileup and iVar generate a whole-genome consensus sequence file in FASTA format.
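The stages named above can be sketched as a sequence of command lines. This is only an illustration: the exact commands SeqSphere+ runs internally are not documented here, the file names and the helper `build_pipeline_commands` are hypothetical, the indexing and sorting steps are assumptions, and the option values are the defaults described in Step 7 below. The helper only builds the command strings and does not execute anything.

```python
import shlex


def build_pipeline_commands(ref, fq1, fq2, bed, prefix):
    """Return shell command strings for each stage (hypothetical helper; nothing is run)."""
    q = shlex.quote
    sorted_bam = q(prefix + ".sorted.bam")
    trimmed_prefix = q(prefix + ".trimmed")        # ivar trim appends ".bam" to this prefix
    trimmed_bam = q(prefix + ".trimmed.bam")
    trimmed_sorted = q(prefix + ".trimmed.sorted.bam")
    consensus_prefix = q(prefix + ".consensus")    # ivar consensus appends ".fa"
    return [
        # 0. Index the reference once so that BWA-MEM can use it (assumed step).
        f"bwa index {q(ref)}",
        # 1. Map the paired-end FASTQ files and sort the alignments.
        f"bwa mem {q(ref)} {q(fq1)} {q(fq2)} | samtools sort -o {sorted_bam} -",
        # 2. Trim primers using the BED file; -e keeps reads without primers.
        f"ivar trim -e -i {sorted_bam} -b {q(bed)} -p {trimmed_prefix}",
        # 3. Sort again after trimming (assumed step).
        f"samtools sort -o {trimmed_sorted} {trimmed_bam}",
        # 4. Pile up the reads and call the consensus sequence.
        f"samtools mpileup -A -B -Q 0 -d 8000 {trimmed_sorted}"
        f" | ivar consensus -p {consensus_prefix} -q 20 -m 10 -t 0",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline_commands(
        "MN908947.3.fasta", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "primers.bed", "sample",
    ):
        print(cmd)
```

Printing the commands rather than running them keeps the sketch safe to try on a machine where the tools are not installed.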

These genomic FASTA files can be analyzed with two task templates. The SARS-CoV-2 all ORFs task template separates the genomic sequence into subsequences matching the different open reading frames (ORFs). In addition, the task template queries the ORFs for notable mutations. Currently, the following 9 notable mutations are defined:

| Notable Mutation | Example(s) | Gene | Start Genome Position | Start Gene Position | AA Number | Ref NT | Mutation NT | Ref AA | Mutation AA |
|---|---|---|---|---|---|---|---|---|---|
| S:69–70del | B.1.1.7, Cluster 5 | S | 21765 | 203 | 69-70 | TACATG | ------ | HV | -- |
| S:K417N/T | B.1.351/P.1 | S | 22811 | 1249 | 417 | AAG | AAT/AAC/ACT/ACC/ACA/ACG | K | N/T |
| S:Y453F | Cluster 5 | S | 22919 | 1357 | 453 | TAT | TTT/TTC | Y | F |
| S:477G/N | EPI_ISL_1061213 | S | 22991 | 1429 | 477 | AGC | AAC/AAT/GGT/GGC/GGA/GGG | S | G/N |
| S:E484K | B.1.351, P.1, P.3, B.1.525 | S | 23012 | 1450 | 484 | GAA | AAA/AAG | E | K |
| S:N501Y | B.1.1.7, B.1.351, P.1, P.3 | S | 23063 | 1501 | 501 | AAT | TAT/TAC | N | Y |
| S:D614G | found in B.1 clade | S | 23402 | 1840 | 614 | GAT | GGT/GGC/GGA/GGG | D | G |
| S:P681H | B.1.1.7, B.1.1.207 | S | 23603 | 2041 | 681 | CCT | CAT/CAC | P | H |
| S:F888L | B.1.525 | S | 24224 | 2662 | 888 | TTT | CTT/CTC/CTA/CTG | F | L |
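
The relationship between the position columns of the table can be checked with a little arithmetic. The following is a minimal sketch, not the task template's implementation: the function names are hypothetical, positions are assumed to be 1-based, and the S gene start of 21563 is an assumption derived from the table itself (21765 - 203 + 1). For the substitution rows, the codon of amino acid n starts at gene position 3(n - 1) + 1; the S:69–70del row is listed from gene position 203 and does not follow this codon formula.

```python
S_GENE_START = 21563  # assumption: derived from the table rows, 1-based genome coordinate


def codon_start_in_gene(aa_number):
    """1-based gene position of the first nucleotide of the given codon."""
    return 3 * (aa_number - 1) + 1


def gene_to_genome(gene_pos, gene_start=S_GENE_START):
    """Convert a 1-based gene position to a 1-based genome position."""
    return gene_start + gene_pos - 1


def classify_codon(codon, ref_codon, mutation_codons):
    """Label an observed codon as 'reference', 'mutation', or 'other'."""
    codon = codon.upper()
    if codon == ref_codon:
        return "reference"
    if codon in mutation_codons:
        return "mutation"
    return "other"


if __name__ == "__main__":
    # S:N501Y row: AA number 501, Ref NT AAT, Mutation NT TAT/TAC
    gene_pos = codon_start_in_gene(501)
    print(gene_pos, gene_to_genome(gene_pos))           # 1501 23063
    print(classify_codon("TAT", "AAT", {"TAT", "TAC"}))  # mutation
```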


For further details regarding notable mutations, see, for example, Wikipedia. Once new notable mutations arise, the task template will be updated.

The SARS-CoV-2 Pangolin task template determines a PANGO lineage (Phylogenetic Assignment of Named Global Outbreak Lineages; e.g., B.1.1.7) from the genomic FASTA file. This task template requires an Internet connection because the FASTA file is uploaded for analysis to a pangolin server run by Ridom. This server does not store any result or sequence data and checks daily for pangoLEARN lineage assignment updates.

Consensus FASTA sequence files from Illumina, other sequencing platforms, or non-tiled amplicon approaches (e.g., metagenomic) can be analyzed by all Windows users. Because the SARS-CoV-2 database scheme is used automatically, the database fields comply with the GISAID SARS-CoV-2 standards and the ENA virus pathogen reporting standard checklist. Because there are rather few targets and alleles are quite frequently missing, it is recommended to perform phylogenetic analysis in a comparison table based on SNPs rather than on allele haplotypes.
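
To illustrate what an SNP-based comparison measures, the following minimal sketch counts the positions at which two consensus sequences differ. It is not the algorithm used by SeqSphere+: the function name `snv_distance` is hypothetical, the two sequences are assumed to be already aligned to the same reference coordinates (equal length), and any position where either sequence has something other than A, C, G, or T (for example N or a gap) is simply ignored.

```python
def snv_distance(seq_a, seq_b):
    """Count positions where both sequences have an unambiguous base and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in unambiguous and b in unambiguous and a != b
    )


if __name__ == "__main__":
    # Position 3 differs (G vs C); position 5 is ignored because of the N.
    print(snv_distance("ACGTN", "ACCTA"))  # 1
```

Ignoring positions with missing data in a pairwise fashion is one common design choice; it avoids counting low-coverage regions as differences.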

Step by Step Instruction

Installation

  • Step 1: Update SeqSphere+: The SARS-CoV-2 analysis requires the SeqSphere+ Server and all SeqSphere+ Clients to be updated to version 7.5 or higher. If this has not been done yet, please follow the update instructions described at https://www.ridom.de/seqsphere/update/
  • Step 2: Windows Subsystem for Linux (WSL) must be available on the Windows 10/Windows Server 2019 computers whose SeqSphere+ Clients run the pipeline for processing Illumina tiled amplicon raw read data. If it is not installed yet, please follow the installation instructions described here: Windows Subsystem For Linux (WSL). Once processed, the SARS-CoV-2 samples can also be opened from a SeqSphere+ Client (version 7.5) running on Windows without WSL.
  • Step 3: The iVar software must be available on the computers whose SeqSphere+ Clients run the pipeline for processing Illumina tiled amplicon raw read data:
      • Windows users should invoke the function Help | Windows Subsystem for Linux | Install iVar on WSL in the SeqSphere+ menu to start the installation of iVar with all dependencies (including Samtools with mpileup). The installation may take a few minutes and is finished once a dialog pops up confirming that iVar was installed.
      • Linux users should follow the conda installation as described by the iVar developers. Alternatively, if conda is not installed yet, the bash script RidomSeqSphere/ext/install/installCondaAndIVar.sh contained in the Ridom SeqSphere+ folder might work too. When called, the script first downloads and installs miniconda and then installs the iVar package (including Samtools) via conda.
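
After the installation, a quick way to confirm that the tools are reachable is to look them up on the PATH. This is a minimal sketch under the assumption that the executables are named `ivar` and `samtools` and that the check is run in the same environment that SeqSphere+ uses (on Windows, inside WSL; with conda, in the environment where iVar was installed). The function name `find_missing_tools` is hypothetical.

```python
import shutil


def find_missing_tools(tools=("ivar", "samtools")):
    """Return the names of the given executables that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("iVar and Samtools were found.")
```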

Processing and Analyzing

  • Step 4: Create a new project, press the button Download & Add, select Virus in the drop-down box at the top of the window, download the two task templates SARS-CoV-2 Pangolin and SARS-CoV-2 all ORFs, and save the project by pressing the OK button. If you do not see the Virus box and/or the SARS-CoV-2 task templates, please make sure that your SeqSphere+ Client and SeqSphere+ Server were both updated to version 7.5.
    Hint: New SeqSphere+ users can find more details on how to set up a project and pipeline script in the Tutorial for SeqSphere+ Assembly and cgMLST Analysis Pipeline.
  • Step 5: Create a new pipeline script.
  • Step 6: In the panel Define Input Sources of the script, choose the FASTQs (e.g., the evaluation data below) and select "Amplicon, viral RNA, Illumina, paired-end reads" as Procedure Details, or create your own procedure details.
  • Step 7: In the panel Define Projects, choose the project and choose Tiled amplicon BWA+iVar (reference mapping) as assembler. Press the Settings... button to the right of the assembler box. A primer BED file must be selected here. Primer BED files for ARTIC v3 and Illumina AmpliSeq are provided. Either select one of the two provided files or add a BED file for another protocol via the + button. Below the BED file dialog, the Advanced Settings... can be used to change the iVar and Samtools mpileup command line parameters. For primer trimming, SeqSphere+ applies the -e option of iVar trim by default, i.e., reads with no primers are not excluded from further analysis. This option should be turned off only if the amplicons are not fragmented during library preparation. For Samtools mpileup, mostly default parameters are used; in particular, the default maximum per-file coverage (-d 8000) is applied to avoid excessive memory usage. The only non-default parameters used are: do not discard anomalous read pairs (-A), disable per-base alignment quality (-B), and skip bases with base quality smaller than 0 (-Q 0), which means that no bases are skipped because of their quality at this stage. For consensus calling, the defaults for the minimum quality score threshold (-q 20) and the minimum coverage to call consensus (-m 10) are used. Furthermore, the minimum frequency threshold -t 0 is applied by default during consensus calling, i.e., the majority base is called. For stricter consensus calling, e.g., requiring that a called base makes up at least 90% of the bases at a position, the parameter must be changed to -t 0.9. For further details and other parameters, see the iVar manual. All iVar and Samtools mpileup parameters employed by SeqSphere+ are logged and can be found in the Assembly Post-processing row of the sample's Procedure tab.
  • Step 8: When pressing Next in the Define Projects panel, a warning will appear that the Mash Screen contamination check will be disabled in this pipeline as it is not supported for viruses. Continue with the rest of the pipeline as usual, save the script, and start the script.
  • Step 9: After the pipeline has finished, exit the pipeline mode and open the samples in the interactive mode. The sample's Results tab will show the Pangolin lineage and the notable mutations found. For more details (e.g., the Pangolin probability or how many defined variant positions have good quality), open the task templates on the left.
  • Step 10: When creating a comparison table for a project with the two task templates, it will contain as task result fields the Pangolin lineage (used for coloring by default), the Notable Mutations, and the allele types for the 11 ORFs. For phylogenetic comparisons, it is recommended to use the command Tools | Find SNVs in Distance Columns.... In the SNV Positions table that then opens, push the Open in Variant Comparison Table icon. Once the comparison table has opened, push either the Minimum Spanning Tree or the Neighbor Joining Tree icon to draw a tree.
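
The effect of the consensus-calling parameters from Step 7 (minimum coverage -m and minimum frequency threshold -t) can be illustrated with a simplified sketch. This is not iVar's implementation: the function name `call_consensus_base` is hypothetical, the base quality filter (-q) is not modeled, and returning N whenever the most frequent base stays below the threshold is a simplifying assumption; iVar itself may report such positions differently.

```python
def call_consensus_base(counts, min_depth=10, min_freq=0.0):
    """Simplified consensus call for one position.

    counts    -- dict mapping a base (e.g. "A") to its read count at the position
    min_depth -- minimum total depth required to call a base (cf. -m 10)
    min_freq  -- minimum fraction the most frequent base must reach (cf. -t)
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too little coverage to call anything
    base, count = max(counts.items(), key=lambda item: item[1])
    if count / depth < min_freq:
        return "N"  # simplification: no base reaches the threshold
    return base


if __name__ == "__main__":
    position = {"A": 85, "G": 15}
    print(call_consensus_base(position, min_freq=0.0))  # A (majority base)
    print(call_consensus_base(position, min_freq=0.9))  # N (85% is below 90%)
```

With the default threshold of 0 the majority base is always called once the depth requirement is met, whereas a threshold of 0.9 leaves mixed positions uncalled in this sketch.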

Evaluation Data

The example data archive SeqSphere_Examples_SARS-CoV-2_ARTICV3.zip (4.3 MB) can be downloaded to evaluate the installation. Extract the ZIP file on your computer. The evaluation data folder contains downsampled SRA Illumina paired-end FASTQ files of three SARS-CoV-2 samples. The data were produced using the ARTIC v3 tiled amplicon protocol. The PANGO lineage assignment should result in B.1.356, B.1.167, and B.1.1.7 for the three samples, respectively.