The workflow of a pipeline is defined by a pipeline script. A pipeline script is always stored in the user's profile directory on the computer running the SeqSphere+ Client that was used to define it. To share a script among multiple Client computers, it can either be exported and imported or distributed via pipeline reports. A pipeline script can easily be created with the wizard dialog, which has the following six sections.
User Account
To set up a pipeline script, a user account on a specific server is required. By default, the User Login but not the User Password is stored in a pipeline script. The script can then only be edited and run when the password of this user is entered. This means that the pipeline runs on behalf of this user, and all Samples created by the pipeline script are owned by this user. If the User Login is not stored in the script, a login dialog is shown when the pipeline starts, and the pipeline runs with the supplied login information. In addition to the User Login, the User Password can also be stored in a pipeline script. The password is always stored encrypted. Every user can run such a script, so it can be exported to share it between different users. However, editing a script always requires the User Password.
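The three cases for running a script can be summarized in a small decision function. This is only an illustrative sketch of the rules described above, not SeqSphere+ code: the function name, its parameters, and the returned strings are hypothetical, and rejecting a stored password without a stored login is an assumption. Editing is not modeled, because it always requires the User Password.

```python
def run_requirement(login_stored=True, password_stored=False):
    """Return what must be supplied before a pipeline script can run.

    The defaults mirror the default setting described in the text
    (User Login stored, User Password not stored).
    """
    if password_stored and not login_stored:
        # Assumption: a password can only be stored together with its login.
        raise ValueError("a password cannot be stored without a login")
    if not login_stored:
        # A login dialog is shown; the pipeline runs with the supplied login.
        return "login dialog"
    if not password_stored:
        # The pipeline runs on behalf of the stored user once the password is entered.
        return "password"
    # Login and (encrypted) password are stored: every user can run the script.
    return "none"


print(run_requirement())                       # default setting
print(run_requirement(login_stored=False))     # no login stored
print(run_requirement(password_stored=True))   # shareable script
```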
General Settings
This section defines general settings of the pipeline. Each pipeline must have a unique name. Optionally, a comment can be entered. In addition, the access rights for the reports of the pipeline can be specified here (the default is the primary group). By pressing the Advanced General Settings button, more detailed general setting options can be selected:
Input Sources
An input source defines where the pipeline should look for sequencing data (FASTQ/ACE/BAM/GB/FASTA files). For optimal de novo assembly results, FASTQ files with adapters (and multiplex indices) trimmed off are required. Multiple input sources can be defined for different sequencing machine vendors (e.g., Illumina or Ion Torrent) by pushing the + Add button in the upper-left corner. The Input Source Type
The Input Source Directory must always define a path to the sequencing data. In the Procedure Details, the sequencing vendor (e.g., Illumina or Ion Torrent) and the sequencing protocol (e.g., single-end or paired-end reads) must always be defined if FASTQ files are to be assembled and no other laboratory procedure details are available (e.g., from procedure details in the file name or from SPEC files). All other procedure details are optional. However, if FASTQ files are to be submitted to ENA, note that ENA requires some laboratory procedure details. If assembled data (ACE/BAM/GB/FASTA files) is processed with the pipeline, procedure details are always optional. In general, this feature is a very convenient way to attach detailed, frequently identical laboratory procedure details to a large number of Samples for documentation purposes.
File Naming
Each input source must have a file naming convention defined for its sequence data, which enables an automated workflow that requires no user intervention. At a minimum, the Sample ID must be conveyed by the sequence data file name. Additional information can be conveyed in further file name fields to streamline the automated analysis. The file names of sequence data can usually be controlled to a certain extent by supplying a filled-in sample sheet (e.g., created with the Illumina Experiment Manager tool) to the sequencing machine before starting a run. First, the file name can be split into fields by a Field Delimiter until a defined Field Terminator appears. If no terminator is defined, the whole file name (without file extension) is used. Alternatively, a regular expression can be used to split a file name into fields. Once this option is chosen and a regular expression is entered, the delimiter and terminator entered above are ignored.
Next, the Field Positions in File Name of the various fields can be defined by entering a number (beginning with 1) into
The coloring of the sequence data files in the File preview box shows how the current delimiter, terminator, and field settings apply to those files.
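The file naming rules described above (delimiter, terminator, optional regular expression, 1-based field positions) can be sketched in a few lines of Python. This is a minimal illustration, not the SeqSphere+ implementation. The function names, the example file name, and the extension list are hypothetical, and it is an assumption that each capture group of the regular expression yields one field.

```python
import re

# File types named in the text; the exact extension spellings are assumptions.
KNOWN_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq",
                    ".fasta", ".ace", ".bam", ".gb")


def strip_extension(file_name):
    """Remove a known sequence data extension, if present."""
    lower = file_name.lower()
    for ext in KNOWN_EXTENSIONS:
        if lower.endswith(ext):
            return file_name[:-len(ext)]
    return file_name


def split_fields(file_name, delimiter="_", terminator=None, pattern=None):
    """Split a file name into fields.

    If a regular expression is given, delimiter and terminator are ignored.
    """
    stem = strip_extension(file_name)
    if pattern is not None:
        match = re.match(pattern, stem)
        # Assumption: each capture group corresponds to one field.
        return list(match.groups()) if match else []
    if terminator:
        # Only the part before the first terminator is split into fields.
        stem = stem.split(terminator, 1)[0]
    return stem.split(delimiter)


def field_at(fields, position):
    """Return the field at a 1-based position, or None if it does not exist."""
    if 1 <= position <= len(fields):
        return fields[position - 1]
    return None


# Hypothetical file name: project acronym, Sample ID, and vendor before the first "_".
fields = split_fields("ECOLI-S12-MiSeq_S1_L001_R1_001.fastq.gz",
                      delimiter="-", terminator="_")
sample_id = field_at(fields, 2)
print(fields, sample_id)
```

With these settings, everything after the first `_` is ignored, and the Sample ID is read from field position 2.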
Advanced Input Source Options
Projects
The Projects section defines into which projects Sample data will be imported. At least one project must be defined here, even if a project acronym is used in the file names and its position is defined in the previous section. If more than one project is defined here, each project must have an acronym that is used in the file names, and the position of the acronym must be defined in the previous section. In this case, the project for each Sample is selected by the acronym found in the file name of the input data. The Projects in Database button in the upper-right corner of this section can be used to modify existing projects (e.g., to add a project acronym) or to create new projects. The projects used by the pipeline script can be managed with the + Add and - Remove buttons in the upper-left corner. On each project tab, the following information needs to be selected or is shown:
The checkbox Perform Assembling/Mapping for read files can also be checked and configured per project. If it is selected, a pipeline using this script will assemble FASTQ files (the checkbox does not apply to ACE/BAM/GB/FASTA files). For each sequencing vendor (defined in the Input Sources section via multiple input sources and/or via a procedure details field in the FASTQ file names), an assembler or mapper can be selected. The following assemblers/mappers are available:
To the right of the assembler/mapper drop-down list, an Advanced Settings button can be used to specify the preprocessing (trimming, downsampling) and assembler/mapper parameters. Downsampling is selected by default for Velvet, SPAdes, and BWA. Trimming is turned on by default only for Velvet; this can also be changed in the Advanced Settings. The project Reference Genome and Expected Genome Size are shown below the selected assembler/mapper. A reference genome is required if reference mapping (BWA) is chosen or if the downsampling option is selected in the Advanced Settings. The reference genome is also used to calculate the unassembled coverage for the Procedure Statistics. If the project contains a cgMLST Task Template, the reference genome is taken automatically from that Task Template (if a single project contains multiple cgMLST Task Templates, the reference genome of the first Task Template in the ordering is taken). The reference genome can also be set manually with the Alternative Reference Genome button. If no cgMLST Task Template is present in the project, this button must always be used to define a reference genome.
Submission
For local Task Templates, the default setting is to assign new types automatically (if this option is enabled in the Task Template). For public cgMLST Task Templates, automatic submission to cgMLST.org can be enabled here. If this option is enabled, the Submission Anonymization Filter for the pipeline script is shown and can be adjusted to control which data fields, and at what level of detail, are submitted to cgMLST.org. In addition, the checkbox shown below (a declaration that the data is not sensitive, etc.) must then be checked. In contrast to manual submission, the pipeline's cgMLST.org submission process does not show a preview dialog of the metadata before the samples are submitted, because pipelines run in an automated, unattended mode.
The Preview Epi Data button at the bottom of this section can only show some of the metadata that may be submitted. When the script is set up, only epidemiological metadata can be previewed (no other metadata are available at that time), and only for sequence data files in the input source directory (the script might be configured to run in continuous mode) that already have epidemiological data stored in the SeqSphere+ server database. Submitting samples via a pipeline requires that the user account of the pipeline is already registered at the cgMLST.org Nomenclature Server. If the user account is not registered, an error message dialog is shown when the Submission section is confirmed by pressing the Next button. If necessary, the registration can be done by pressing the corresponding button in the error message dialog or, when SeqSphere+ is run in interactive mode, with the command Options | User Settings in the main menu.
File Management
This last section defines how the pipeline process manages the assembled/mapped files (ACE/BAM) and the read files (FASTQ). All files are referenced in the process tab of a Sample. There are three options for handling Assembled/Mapped Files (ACE/BAM):
The very large Raw Read Files (FASTQ) cannot be uploaded to the SeqSphere+ Server. By default, the option Store links to the original FASTQ files is turned on, which stores the path to the FASTQ file(s) in the procedure tab of the Sample entry. Alternatively, the option Copy FASTQ files to folder copies the files to a selected folder; the Sample entry then stores only a link to the copied file. When the Create sub-folder for each Project option is chosen, subdirectories named after the project acronym (if available) are created automatically and receive the corresponding FASTQ files.
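The effect of the copy option and the per-project sub-folder option on the stored link can be illustrated with a short Python sketch. It is not part of SeqSphere+; the function name and its parameters are hypothetical and only mirror the behavior described above.

```python
import shutil
from pathlib import Path


def copy_fastq(fastq_path, target_folder, project_acronym=None,
               per_project_subfolder=False):
    """Copy a FASTQ file and return the path that would be stored as the link."""
    destination_dir = Path(target_folder)
    # A sub-folder is only used if the option is set and an acronym is available.
    if per_project_subfolder and project_acronym:
        destination_dir = destination_dir / project_acronym
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / Path(fastq_path).name
    shutil.copy2(fastq_path, destination)
    return destination
```

For example, copying a hypothetical `S12_R1.fastq` into `archive` for a project with the acronym `ECOLI` and the sub-folder option enabled would yield the link `archive/ECOLI/S12_R1.fastq`; without an acronym, the file is placed directly in `archive`.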