Configuration File Documentation

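To set up the sRNAnalyzer pipeline, two configuration files are required: a database configuration file and a pipeline configuration file. The database configuration file tells sRNAnalyzer where the alignment databases are located, and the pipeline configuration file defines the preprocessing and alignment settings for the pipeline.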

Database Configuration File

The database configuration file defines the names of the alignment databases and tells sRNAnalyzer where these databases are located. The base attribute must be an absolute path; all other paths are relative to the base path. Each of the other paths should point to a bowtie index and include the prefix of the index files. An example configuration file is shown below.

base: /DBs/bowtie/indexes/
    human_miRNA: miRBase/hairpin_hsa_anno
    human_piRNA: piRBase/piR_human_v1.0

From this configuration file, we can now use the names human_miRNA and human_piRNA in the pipeline configuration file defined below, since sRNAnalyzer can find the bowtie indexes corresponding to these database names.
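The way a database name maps to a full index prefix can be sketched in Python. This is only an illustration of the rule described above (base path plus relative prefix), not sRNAnalyzer's own code; the function name resolve_index_prefix and the dictionary layout are assumptions made for the example.

```python
import posixpath


def resolve_index_prefix(base, relative_prefix):
    """Join the absolute base path with a database's relative index prefix."""
    if not posixpath.isabs(base):
        raise ValueError("base must be an absolute path: %r" % base)
    return posixpath.join(base, relative_prefix)


# Values copied from the example database configuration file above.
base = "/DBs/bowtie/indexes/"
databases = {
    "human_miRNA": "miRBase/hairpin_hsa_anno",
    "human_piRNA": "piRBase/piR_human_v1.0",
}

# Map each database name to the full prefix of its bowtie index files.
resolved = {name: resolve_index_prefix(base, rel) for name, rel in databases.items()}
for name, prefix in resolved.items():
    print(name, "->", prefix)
```

With the example values, human_miRNA resolves to /DBs/bowtie/indexes/miRBase/hairpin_hsa_anno.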

Pipeline Configuration File

The pipeline configuration file specifies the settings for the preprocessing and alignment modules of the pipeline. This file is in the YAML format, which makes it easy to read.

An example config.yaml file is shown below:

preprocess:
    kit:        NEB
    gzip:       true
    stop-oligo: false
    barcode:    sampleBarcode
        
alignment:
    type: single
    human_miRNA:     2
    human_miRNA_sub: 2
    human_piRNA:     2
    human_snoRNA:    2

Preprocess Options

kit - specifies which sRNA library construction kit was used, so that the adapters can be trimmed correctly. Options are "NEB", "Illumina", and "Bioo". Required if adapter-3p and adapter-5p are not provided. The adapter sequences for these kits are as follows:

Illumina - 3' TGGAATTCTCGGGTGCCAAG, 5' GTTCAGAGTTCTACAGTCCGACGATC
NEB - 3' AGATCGGAAGAGCACACGTCT, 5' GTTCAGAGTTCTACAGTCCGACGATC
Bioo - 3' NNNNTGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC, 5' GTTCAGAGTTCTACAGTCCGACGATC

adapter-3p - specifies the 3 prime adapter sequence to be trimmed. Required if the kit option is not provided.

adapter-5p - specifies the 5 prime adapter sequence to be trimmed. Required if the kit option is not provided.
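The relationship between kit, adapter-3p, and adapter-5p can be sketched as a small Python check. This is a hypothetical helper, not part of sRNAnalyzer: the name select_adapters is invented for the example, and the choice to prefer explicit adapters when a kit is also given is an assumption, since the behaviour in that case is not specified here. The sequences are copied from the kit list above.

```python
# Adapter sequences (3', 5') copied from the kit list above.
KIT_ADAPTERS = {
    "Illumina": ("TGGAATTCTCGGGTGCCAAG", "GTTCAGAGTTCTACAGTCCGACGATC"),
    "NEB": ("AGATCGGAAGAGCACACGTCT", "GTTCAGAGTTCTACAGTCCGACGATC"),
    "Bioo": ("NNNNTGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC", "GTTCAGAGTTCTACAGTCCGACGATC"),
}


def select_adapters(preprocess):
    """Return (adapter_3p, adapter_5p) for a preprocess section given as a dict."""
    adapter_3p = preprocess.get("adapter-3p")
    adapter_5p = preprocess.get("adapter-5p")
    # Assumption: explicit adapters take precedence when both are present.
    if adapter_3p and adapter_5p:
        return adapter_3p, adapter_5p
    kit = preprocess.get("kit")
    if kit is None:
        raise ValueError("provide either 'kit' or both 'adapter-3p' and 'adapter-5p'")
    if kit not in KIT_ADAPTERS:
        raise ValueError("unknown kit %r; expected one of %s" % (kit, sorted(KIT_ADAPTERS)))
    return KIT_ADAPTERS[kit]


print(select_adapters({"kit": "NEB"}))
```

A section that provides neither a kit nor both adapter sequences is rejected with an error, which mirrors the "required if" wording of the three options.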

gzip - if this option is set to "true", the pipeline will read gzipped .fastq.gz files instead of plain .fastq files. Optional (default is false).

stop-oligo - if this option is set to "true", stop-oligo sequences will be trimmed. Optional (default is false).

barcode - specifies the sample barcode file to use when reading barcodes. Optional.

min-length - the minimum length of reads to keep. Optional. Default is 15.
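The documented defaults for the optional preprocess settings can be summarised in a short Python sketch. The helper name with_defaults is hypothetical, and the preprocess section is represented as a plain dictionary for illustration; how sRNAnalyzer applies its defaults internally is not described here.

```python
# Defaults stated in the option descriptions above.
PREPROCESS_DEFAULTS = {
    "gzip": False,
    "stop-oligo": False,
    "min-length": 15,
}


def with_defaults(preprocess):
    """Return a copy of the preprocess section with unset optional values filled in."""
    merged = dict(PREPROCESS_DEFAULTS)
    merged.update(preprocess)  # values given in the config override the defaults
    return merged


# The preprocess section from the example config.yaml above.
settings = with_defaults({
    "kit": "NEB",
    "gzip": True,
    "stop-oligo": False,
    "barcode": "sampleBarcode",
})
print(settings["gzip"], settings["min-length"])
```

Here gzip keeps the value from the config (true), while min-length, which the example config omits, falls back to 15. The barcode option has no default and is simply absent when not provided.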

Alignment Options

Each database row in the alignment section should be formatted as follows:

DATABASE_NAME: MAX_MISMATCH

For example:

human_miRNA: 2

The pipeline aligns reads against the databases in the order in which they are listed in the config file. The database names are the names defined in the database configuration file, as described above.

type - specifies whether to use single assignment or multiple assignment for read mapping. Can be "single" or "multiple". Optional (default is single assignment). It is recommended that multiple assignment be used only for small RNA mapping. With the pre-built sRNA indexes, use the human_miRNA_mult database for multiple assignment, and the human_miRNA and human_miRNA_sub databases for single assignment.

cores - the number of cores that bowtie should use for alignment. Default is 15.
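How the alignment section separates its options from its ordered list of databases can be sketched in Python. This is an illustration under stated assumptions rather than the pipeline's actual parser: the function name parse_alignment is invented, and the section is modelled as a dictionary whose keys appear in file order (Python dictionaries preserve insertion order from version 3.7 onward).

```python
# Keys in the alignment section that are options rather than database names.
ALIGNMENT_OPTION_KEYS = {"type", "cores"}


def parse_alignment(alignment):
    """Split an alignment section into (assignment type, cores, ordered databases)."""
    assignment = alignment.get("type", "single")  # documented default
    if assignment not in ("single", "multiple"):
        raise ValueError("type must be 'single' or 'multiple', got %r" % assignment)
    cores = int(alignment.get("cores", 15))  # documented default
    # Every remaining row is DATABASE_NAME: MAX_MISMATCH, kept in file order.
    databases = [
        (name, int(max_mismatch))
        for name, max_mismatch in alignment.items()
        if name not in ALIGNMENT_OPTION_KEYS
    ]
    return assignment, cores, databases


# The alignment section from the example config.yaml above.
alignment = {
    "type": "single",
    "human_miRNA": 2,
    "human_miRNA_sub": 2,
    "human_piRNA": 2,
    "human_snoRNA": 2,
}
assignment, cores, databases = parse_alignment(alignment)
for name, max_mismatch in databases:
    print(name, max_mismatch)
```

For the example config, the databases come out in the order human_miRNA, human_miRNA_sub, human_piRNA, human_snoRNA, each with a maximum of 2 mismatches, and cores falls back to 15 because it is not set.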