Overview

This tutorial describes how to use the Ridom SeqSphere+ software to assembly and analyze bacterial genomic data using the SeqSphere+ Pipeline Mode.

Listeria monocytogenes is used exemplary for this demonstration. However, by reading this tutorial you should be able to define your own pipelines and projects for all species with a public available cgMLST scheme in the Task Template Sphere and analyze your own raw read data (FASTQ files) or preassembled data (FASTA files). If you intend to analyze in addition or exclusively public genomic data from either NCBI genomes or NCBI Sequence Read Archive (SRA) you can use SeqSphere+ to download and process those data, too. Further details can be found on the pages Download from NCBI Genomes and Download FASTQ from SRA. If you are analyzing a species for which no public cgMLST is available yet please take a look at Core Genome MLST Schemes help.

Preliminaries

Installation of SeqSphere+: If SeqSphere+ is not available yet, a one-month trial version can be requested. The SeqSphere+ Client and Server software can be installed on the same computer for this tutorial.

System Requirements: This tutorial requires at least 8 GB RAM. It is recommended to use the tutorial on a Windows 10 system with installed Windows Subsystem for Linux (WSL) or on a Linux system. With 8 GB RAM and Core i3 CPU the pipeline takes are around 20 minutes.

Hint: If a Windows system without WSL is used for the tutorial, the Velvet Assembler must be used as an alternative de novo assembler for SKESA which increases the run time of the pipeline fourfold, 16 GB RAM are recommended, and species identification, contamination check, and run details import are not available.

Tutorial Data: Download the example data archive SeqSphere_Examples_Pipeline_L_monocytogenes_ACCO.zip (~270 MB) for this tutorial, and extract the zip-file on your computer. This example data folder contains Illumina MiSeq 250bp paired-end FASTQ files for 4 isolates of Listeria monocytogenes. The FASTQ files were downsampled to 30x coverage to decrease the assembling time for this tutorial. The original whole genome shotgun (WGS) data was published by Ruppitsch et. al. (JCM 53: 2869–2876, 2015). To demonstrate the import of Illumina run details some artificial run info files were added to the example data folder.

Define Pipeline Script

Step 1: Start the Ridom SeqSphere+ Client without logging in, and press the button Start Pipeline Mode on the bottom of the login panel or use the identical menu function in the File menu.

Seqsphere pipeline wiz pipelinemodelink.png

Step 2: The pipeline mode window starts up. The pipeline mode is designed to run SeqSphere+ in a non-interactive way to assemble, process, and analyze WGS data automatically defined by a pipeline script. Press Create New Script to start open a dialog for creating a pipeline script. In the first step the Server Host and the User Login must be defined. Just use localhost for your local computer and the same SeqSphere+ user account that you are normally using for the SeqSphere+ login. The option to store user login in the pipeline script is enabled by default. Below enter the User Password of this user account. If wanted, the password can also be stored (encrypted) in the pipeline script. However, it should be taken into account that if the password is stored in the pipeline script, anyone with access to the computer can run the pipeline. For this tutorial check the Store password encrypted in pipeline script file checkbox.

Press Next to move on.

Step 3: In the Define General Settings panel enter a Pipeline Name (e.g., 'Pipeline Tutorial'). The comment can be left empty and the access control for viewing the reports generated by this pipeline can be left to the user's group as default. If the SeqSphere+ Client is running on Linux or on Windows 10 with installed WSL four checkbox are shown. The first two checkboxes are preselected and allow to perform automatically a FASTQ file quality control with FastQC and contamination check with Mash. The third options allows to store read data in the database for problematic targets. The fourth option, "Continous Mode", can be used to monitor a directory and automatically process newly appearing sequence data files. For this demonstration leave the latter two options unchecked.

Press Next to move on.

Step 4: In the next panel the Input Sources for the WGS sequence data are selected. The Input Source Type Directory is predefined. Press the button and select the directory SeqSphere_Examples_Pipeline_L_monocytogenes_ACCO that was unpacked from the downloaded tutorial data file (see Preliminaries). The File Preview on the lower right shows the 8 fastq.gz files that are currently in this directory. When the pipeline is started all files in this directory will be processed. Files can be excluded from processing by selecting them via the right-click menu in the File Preview.

Important: FASTQ files with adapters (and multiplex indices) trimmed-off are required here for optimal de novo assembly results.

Step 5: The Procedure Details must be selected for the sequence data files to document the laboratory procedure details for all Samples. For this tutorial data select the predefined Illumina, paired-end reads procedure details.

Hint: Alternatively, for documentary purpose more details can be defined by pushing the

New Procedure Details button. A New Procedure Details window pops-up. Enter at least Library Source: genomic, Library Strategy: WGS, Library Selection: random, Sequencing Protocol: paired-end reads, Library Insert Size: 400bp, Sequencing Length: 250bp, Sequencing Vendor: Illumina, and Sequencing Platform: MiSeq. Those details are required if later on a submission with SeqSphere+ of the raw read data to the EBI European Nucleotide Archive (ENA) is intended (e.g., for publication purposes). The assembly procedure details can be left empty as they are filled-in when using the assembly pipeline. Press the OK button of the New Procedure Details window.

Step 6: Each input source must also have a File Naming Definition that describes at least how to find the Sample ID in the file names of your sequence data. The Field Terminator is automatically filled with the underline (_) symbol. You can leave this to default for the tutorial data. If your own sequence file names do not fit this definition further details can be found in the file naming documentation. Press Next to move on.

Step 7: In the next Define Projects panel the Project(s) are selected into which processed Sample data should be imported. For this tutorial the Project does not exist yet and must be created. Therefore, press Manage Projects in Database in the bottom right of the window. Press next the Create new Project icon to start defining a new project.

Step 8: Enter a name for the new Project (e.g., Pipeline Tutorial). Then press Download & Add in Task Templates section to browse the Task Template Sphere.

Step 9: The Task Template Sphere provides all predefined public Task Templates. Choose as organism Listeria monocytogenes. There are four Task Templates available for L. monocytogenes, i.e., cgMLST, Accessory, MLST and 5-plex PCR Serorgroup. The cgMLST Task Template defines the 1,701 genes of the reference strain EDG-e that are used for the public nomenclature and for the definition of the complex type (CT). The Accessory Task Templates defines in addition 1,158 genes that do not belong to the core genome. However, they can be used to increase the discriminatory power if the resolution of cgMLST is not high enough.

Step 10: Select all four Task Templates and press OK to download and to add them to the Project. Finally confirm with Save & Close to save the new Project.

Step 11: Select the just created project in the Project Name section of the Define Projects panel.

Step 12: The seed genome for L. monocytogenes that is used as genome size reference for downsampling is automatically loaded from the cgMLST task template. Check the box Perform Assembling/Mapping for read files. If on Linux or Windows 10 with installed WSL the de novo assemlber SKESA is preselected and should be used for this tutorial else Velvet can be used for de novo assembling.

Press Next to move on.

Step 13: In the upcoming Define Submission panel it can be defined if the pipeline should automatically submit the samples and alleles to the public cgMLST Nomenclature Server (cgMLST.org). The submission of new alleles is enabled by default and can only be disabled globally in the Client. The allelic profile is not stored at cgMLST.org during allele submission. The submission of Samples can be used to submit and store the allelic profile and optional metadata on cgMLST.org. In a new pipeline script this is enabled by default, however, it requires a registration of the user at cgMLST.org. For this tutorial the option Automatically submit samples to cgMLST.org Nomenclature Server should be unchecked.

Press Next to move on.

Step 14: Finally in the Define File Management panel it can be defined what the pipeline should do with the created assembly files and raw reads. Leave all to default and press Test Pipeline Script to validate your pipeline.

Step 15: The test should finish successfully. Press the Close button of this dialog. Now press Finish to store the new pipeline script.

Run the Pipeline

Step 1: Be sure that the just created pipeline script is selected and press the button Start Script to run the pipeline.

Step 2: A blue colored progress window is opened showing the current progress and messages of the pipeline.

If SKESA is used the pipeline may take around 20 minutes (8 GB RAM; Core i3 CPU) to be finished. If Velvet is used the runtime is quadrupled.

Step 3: When the pipeline has finished the background color turns to white. Press the Show Report button to see a quick overview for the statistics of the processed Samples.

Step 4: Close the report window, press Close in the pipeline progress window, and exit the pipeline mode with the button Exit and Restart SeqSphere+.

Open the Processed Samples

Step 1: The SeqSphere+ Client login window appears again. To see further details about the pipeline run and the imported Samples you can now switch back to the normal interactive login session mode. Enter the user name and password and press Login.

Step 2: On the right of the home screen in the section Recent Pipeline Reports an item for the new report is shown. Click it to open the report or use the menu function Options | Browse Pipeline Reports to choose it from a list of all reports.

Step 3: The pipeline report is the same as it was shown just before in the pipeline mode. As SeqSphere+ is now in interactive mode Samples can be directly loaded into the workspace. Go to the Processed Samples section and double-click on the first Sample to load it in the background. Close the pipeline report window. The Sample is shown in the workspace.

Step 4: The left panel of the main window shows a navigation tree with the loaded Sample. Each Sample node in the navigation has four sub nodes: The 5-plex PCR Serogroup task, the MLST task, the cgMLST task, and the Accessory task. Below the task nodes there are the target nodes. Each target node represents one sequence (here a gene) that was extracted from the input data (here the de novo assembled WGS data). The targets can have different target QC states:
- Good Targets (green) were extracted and fulfilled all requirements that are defined in the Target QC Procedure of the Task Template.
- Failed Targets (red) were extracted but failed at least in one of the requirements that are defined in those parameters. For example, they may have frame shifts and incorrect lengths compared to the allele of the seed genome.
- Not Found Targets (gray) were not extracted (because the match did not reached the thresholds or the target is not present at all).

Seqsphere pipeline browse attachment.png

Step 5: Click on the Procedure tab in the right panel of the window to see the details about the sequence data and processing. Some fields are important for quality control of the sequence data. Again traffic light colors are used to highlight their QC result status for various aspects as succeed, warning, or failed.

In the Procedure Details section the values for Sequencing Run ID and/or Sequencing Run QC can be right-clicked to show the Sequencing Run Details if they were imported by the pipeline.

In the Reads Statistics section below the QC results of FastQC Per Base Sequencing Quality and FastQC Adapter Content can be right-clicked to show the detailed FastQC results if FastQC was executed by the pipeline.

Step 6: Close the Sample by pressing the in the toolbar above the panel.

Import Epidemiological Metadata

Step 1: Invoke in the menu File | Import Epi Metadata

Step 2: In the upcoming Choose File to Import dialog select the file Lm_metadata.xls from the directory SeqSphere_Examples_Pipeline_L_monocytogenes_ACCO.

Seqsphere pipeline tutorial import metadata0.png

Step 3: A preview dialog with the content of this Excel file is shown. It contains epidemiological metadata of the four Samples for which the sequence data was already processed. Press Continue.

Step 4: The next dialog defines the import settings. First select on the project that was created and processed by the pipeline before.

Seqsphere pipeline tutorial import metadata1.png

Step 5: The table on the bottom shows the mapping between the table columns and the SeqSphere+ database fields. By default all columns are unmapped and are therefore highlighted in red. As the column headers have the same (or similar) name as the database fields, a mapping can be done automatically. Press the button Auto-Detect Mapping to recognize and map all known column names.

Seqsphere pipeline tutorial import metadata2.png

Step 6: All but two columns were mapped to SeqSphere+ database fields. For the two columns that are still red (CFU/g and Outcome) no fields exist with that name in the SeqSphere+ database. If they should be imported in the database they could be manually mapped to existing fields by clicking on the red header and selecting a field (those mappings can be stored for later re-usage for files with identical headings). If they should be imported into a new field this field must be created before invoking the metadata import. For this tutorial we leave the two fields unmapped. Press OK to start the import.

Step 7: After the import is finished a dialog is shown. Press Open Samples to open the Samples and take a short look at the imported database fields (e.g., Collection Date and City of Isolation).

Seqsphere pipeline tutorial import metadata3.png

Step 8: Close all Samples by choosing in the menu File | Close All.

Analyze Samples with Comparison Table and Draw a Tree and Epi Curve

Step 1: Choose from the menu Tools | Comparison Table to draw a tree (phylogenetic analysis), a epi curve and/or map.

Seqsphere pipeline tutorial comptable0.png

Step 2: In the Comparison Table dialog go to the first tab "Create New". In the Choose Samples section choose the first option Select all Samples of Project and the new previously created project (e.g., Pipeline Tutorial). Below the default procedure and epidemiological metadata fields for the comparison table are listed. On the bottom in the section Choose Genotypings Schemes the checkbox for L. monocytogenes cgMLST is preselected. Press the Create Comparison Table button to confirm the dialog and create the comparison table.

Seqsphere pipeline tutorial comptable1.png

Step 3: The comparison table is opened and shows the data for the four Samples. The column with the red header (Epi Info) is used by default for coloring the Sample rows. The columns with a dark green header are used for distance calculation. Those columns are the allele types of the cgMLST task. Some of those contain missing values if a cgMLST target was not found at all in an input sequence ("? (not found)") or if the Target QC Procedure for this target has failed, e.g., because of a frame shift error ("? (failed)"). The first column in the table shows number of missing values per row.

Step 4: Right-click on the column Complex Type (ninth column) and choose from the menu Set Color Groups by Column Values. Leave the upcoming dialog to defaults and confirm with OK. The Sample rows are now colored by the different cgMLST Complex Types. Further description of functionality (e.g., sorting, filtering, and exporting of data) can be found in the Comparison Table documentation.

Step 5: Press the Minimum Spanning Tree button in the toolbar to calculate the distances between the Samples and draw a minimum spanning tree for them. Because the table contains missing data it must be confirmed that the missing values are ignored pairwise or another of the available options must be chosen. Confirm with OK.

Step 6: The minimum spanning tree is calculated for the allelic profiles of the 1,701 cgMLST targets (pairwise ignored missing values) and is shown in a new window. By default the nodes are again colored by the Complex Type (CT) and it can be easily seen that 3 of the 4 isolates have the same Complex Type (CT 39). Just Sample L38-11 belongs to a different Complex Type.

Two conclusions can be drawn from this tree.

The three CT 39 isolates (with epi info ACCO II) have the same Complex Type and a close distance. Therefore, this cluster indicates epidemiological relationship.
The fourth isolate (L38-11; with epi info ACCO II outgroup) has a different Complex Type and a distance of 32 alleles to the three other isolates and does not belong to the outbreak (in fact this isolate from the original publication was epidemically unrelated but shared with the three other isolates an identical pulsotype [by PFGE with two different enzymes]).

Hint: The Minimum Spanning Tree (MST) can be exported by clicking the

Export MST icon of the toolbar. In the upcoming Export MST file dialog choose as file type the Scalable Vector Graphics (*.svg) or Windows Enhanced Metafile (*.emf) format. Note that EMF or SVG are vector graphics formats and are therefore suited for finishing publication ready figures. EMF files can imported and scaled by MS PowerPoint. SVG files can be edited, e.g., with Adobe Illustrator or the open-source InkScape tool (once the file is loaded first ungroup all objects). Further description of MST functionality (e.g., cluster shading, figure legend, tree appearance, etc.) can be found in the Minimum Spanning Tree documentation.

Step 7: Press the button to store the comparison table and the MST as Comparison Table Snapshot in the database for potential later reuse.

Step 8: Press the button to create an Epi Curve for the comparison tables. As the collection dates were imported from the metadata file the Samples are shown in the graph ordered by date. The time scale of the epi curve can be changed. Samples with incomplete time data are shown with a white rectangle border (e.g., L38-10).

Seqsphere pipeline tutorial epicurve.png

Step 9: Press the button to create an Geographical Map for the comparison table data. The country and city information was imported from the metadata file but no latitude/longitude information exists yet. Therefore, a dialog is shown that offers to perform a geocoding (Internet connection is required for this functionality).

Seqsphere pipeline tutorial geocoding.png

Step 10: Leave the dialog to default settings and confirm it by pressing the OK button, to retrieve the geographical latitude and longitude for the city and country information that was imported as metadata. After this information was retrieved the window with the geographical map opens automatically. The four Samples are drawn on their geographical location. Clicking on a Sample in the map marks the Sample also in the epi curve, the minimum spanning tree, and in the comparison table (all those windows and the comparison table are interactively linked).

Find Group Specific SNVs for Screening Assay

Step 1: The function Find Group Specific SNVs in the comparison table allows to find SNVs that are unique to a group of Samples (target-genomes) and do not appear in a second group of Samples (nontarget-genomes). Those group specific SNVs can be used to develop highly specific screening assays (e.g., real-time PCR TaqMan™ or High-Resolution Melting Curve [HRMC] assays).

Right-click on the column Epi Info and choose from the menu

Set Color Groups by Column Values to set the coloring back to the epidemiological information. Now press the button

in the toolbar of the comparison table window to invoke the SNV search function. A dialog for the handling of missing values is shown, confirm it with the default settings. In the upcoming dialog choose the ACCO II isolates (with CT 39) as target-genomes and ACCO II Outgroup with L38-11 (CT 41) as nontarget-genome. Confirm the dialog by pressing the OK button.

Seqsphere pipeline tutorial snvfinder1.png

Step 2: In the following dialog the Samples selection and some options are shown. If corrections are needed (or if no groups exist in the comparison table) the Samples can be removed or moved here between the target- and nontarget-genomes. The export button allows to write the FASTA files of all chosen Samples into two different directories (target-genomes and nontarget-genomes) to use them for a signature search with an external software (e.g., PanSeq Novel Region Finder). Such an export is not done in this tutorial.

Confirm the dialog with default settings to start the search for group specific SNVs.

Seqsphere pipeline tutorial snvfinder2.png

Step 3: After the search has finished the found 33 SNVs are shown in a table. Each row of the table represents a group specific SNV that was found in all target-genomes but not in the nontarget-genomes. The columns Target-Genomes and Nontarget-Genomes summarize the nucleotide present in the according set of Samples together with their frequencies.

Seqsphere pipeline tutorial snvfinder3.png

Step 3: The found SNVs can be filtered by pressing the button in the upper right corner of the window. Choose in the upcoming filter dialog the option to show only SNV positions where the majority of target-genome SNV is synonymous in comparison to Ref.-Seq. Press the OK button to confirm this selection and apply the filtering. The table is now reduced to the 22 SNVs that passed the filter.

Seqsphere pipeline tutorial snvfinder4.png

Contents

Overview

Preliminaries

Define Pipeline Script

Run the Pipeline

Open the Processed Samples

Import Epidemiological Metadata

Analyze Samples with Comparison Table and Draw a Tree and Epi Curve

Find Group Specific SNVs for Screening Assay