Configuration File Documentation
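To set up the sRNAnalyzer pipeline, two configuration files are required: a database configuration file and a pipeline configuration file. The database configuration file tells sRNAnalyzer where the alignment databases are located, and the pipeline configuration file defines the preprocessing and alignment settings for the pipeline.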
Database Configuration File
The database configuration file defines the names of the alignment databases and tells sRNAnalyzer where these databases are located. The base attribute must be an absolute path; all other paths are relative to the base path. Each of the other paths should point to a bowtie index and include the prefix of the index files. An example configuration file is shown below.
base: /DBs/bowtie/indexes/
human_miRNA: miRBase/hairpin_hsa_anno
human_piRNA: piRBase/piR_human_v1.0
With this configuration file in place, the names human_miRNA and human_piRNA can be used in the pipeline configuration file defined below, since sRNAnalyzer can find the bowtie indexes corresponding to these database names.
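The way database names map to index prefixes can be sketched as follows. This is a minimal illustration, not sRNAnalyzer's actual code: the helper names parse_simple_mapping and resolve_index_prefixes are hypothetical, and the parser only handles flat "key: value" lines like those in the example above.

```python
import posixpath


def parse_simple_mapping(text):
    """Parse flat 'key: value' lines (hypothetical helper, not a full YAML parser)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; keys and values are stripped of spaces.
        key, _, value = line.partition(":")
        mapping[key.strip()] = value.strip()
    return mapping


def resolve_index_prefixes(db_config):
    """Join every non-base entry onto the absolute base path."""
    base = db_config["base"]
    if not posixpath.isabs(base):
        raise ValueError("base must be an absolute path")
    return {
        name: posixpath.join(base, relative)
        for name, relative in db_config.items()
        if name != "base"
    }


example = """\
base: /DBs/bowtie/indexes/
human_miRNA: miRBase/hairpin_hsa_anno
human_piRNA: piRBase/piR_human_v1.0
"""

prefixes = resolve_index_prefixes(parse_simple_mapping(example))
for name, prefix in prefixes.items():
    print(name, "->", prefix)
```

Each resulting value is an index prefix (for example /DBs/bowtie/indexes/miRBase/hairpin_hsa_anno), which is the form bowtie expects when it is given an index.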
Pipeline Configuration File
The pipeline configuration file specifies the settings for the preprocessing and alignment modules of the pipeline. This file is written in YAML, which keeps it easy to read.
An example config.yaml file is shown below.
preprocess:
  kit: NEB
  gzip: true
  stop-oligo: false
  barcode: sampleBarcode
alignment:
  type: single
  human_miRNA: 2
  human_miRNA_sub: 2
  human_piRNA: 2
  human_snoRNA: 2
Preprocess Options
kit - specifies which sRNA library construction kit was used, so that the adapters can be properly trimmed. Options are "NEB", "Illumina", and "Bioo". Required if adapter-3p and adapter-5p are not provided. The adapter sequences for these kits are as follows:
- Illumina - 3' TGGAATTCTCGGGTGCCAAG, 5' GTTCAGAGTTCTACAGTCCGACGATC
- NEB - 3' AGATCGGAAGAGCACACGTCT, 5' GTTCAGAGTTCTACAGTCCGACGATC
- Bioo - 3' NNNNTGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC, 5' GTTCAGAGTTCTACAGTCCGACGATC
adapter-3p - specifies the 3 prime adapter sequence to be trimmed. Required if the kit option is not provided.
adapter-5p - specifies the 5 prime adapter sequence to be trimmed. Required if the kit option is not provided.
gzip - if this option is set to "true", the pipeline will read gzipped .fastq.gz files instead of plain .fastq files. Optional (default is false).
stop-oligo - if this option is set to "true", stop-oligo sequences will be trimmed. Optional (default is false).
barcode - specifies the sample barcode file to use when reading barcodes. Optional.
min-length - the minimum length of reads to keep. Optional. Default is 15.
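The interaction between kit, adapter-3p, and adapter-5p, along with the documented defaults, can be sketched as below. This is only an illustration under stated assumptions: resolve_preprocess is a hypothetical helper, and the rule that an explicitly given adapter takes precedence over the kit's sequence is an assumption, since the documentation does not say what happens when both are supplied.

```python
# Adapter sequences (3', 5') for each supported kit, as listed above.
KIT_ADAPTERS = {
    "Illumina": ("TGGAATTCTCGGGTGCCAAG", "GTTCAGAGTTCTACAGTCCGACGATC"),
    "NEB": ("AGATCGGAAGAGCACACGTCT", "GTTCAGAGTTCTACAGTCCGACGATC"),
    "Bioo": ("NNNNTGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC", "GTTCAGAGTTCTACAGTCCGACGATC"),
}


def resolve_preprocess(options):
    """Fill in adapters and defaults for a preprocess section (hypothetical helper)."""
    adapter_3p = options.get("adapter-3p")
    adapter_5p = options.get("adapter-5p")

    if adapter_3p is None or adapter_5p is None:
        # At least one adapter is missing, so a kit is required.
        kit = options.get("kit")
        if kit is None:
            raise ValueError("provide either kit or both adapter-3p and adapter-5p")
        if kit not in KIT_ADAPTERS:
            raise ValueError(f"unknown kit: {kit!r}")
        kit_3p, kit_5p = KIT_ADAPTERS[kit]
        # Assumption: an explicitly given adapter wins over the kit's sequence.
        adapter_3p = adapter_3p or kit_3p
        adapter_5p = adapter_5p or kit_5p

    return {
        "adapter-3p": adapter_3p,
        "adapter-5p": adapter_5p,
        "gzip": options.get("gzip", False),              # default is false
        "stop-oligo": options.get("stop-oligo", False),  # default is false
        "barcode": options.get("barcode"),               # optional, no default
        "min-length": options.get("min-length", 15),     # default is 15
    }


settings = resolve_preprocess({"kit": "NEB", "gzip": True, "barcode": "sampleBarcode"})
print(settings)
```

A section that gives neither a kit nor both adapters raises a ValueError in this sketch, which mirrors the "required if" wording of the option descriptions.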
Alignment Options
Each row in the alignment section should be formatted as follows:
DATABASE_NAME: MAX_MISMATCH
For example,
human_miRNA: 2
The order of the databases in the config file will be the order the databases are aligned to in the pipeline. The database names are the names defined in the database configuration file, as described above.
type - specifies whether to use single assignment or multiple assignment for read mapping. Can be "single" or "multiple". It is recommended that multiple assignment only be used for small RNA mapping. Optional (default is single assignment). Note that with the pre-built sRNA indexes, the human_miRNA_mult database should be used for multiple assignment, and the human_miRNA and human_miRNA_sub databases should be used for single assignment.
cores - the number of cores for bowtie to use during alignment. Default is 15.
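How an alignment section turns into an ordered list of alignment steps can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: alignment_plan is a hypothetical helper, and the section is assumed to have already been loaded into a Python dict that preserves the order of the file (as dicts do in Python 3.7 and later).

```python
# Keys in the alignment section that are options rather than database names.
RESERVED_KEYS = {"type", "cores"}


def alignment_plan(alignment):
    """Return (assignment type, ordered [(database, max_mismatch), ...])."""
    assign_type = alignment.get("type", "single")  # default is single assignment
    if assign_type not in ("single", "multiple"):
        raise ValueError('type must be "single" or "multiple"')

    # Every remaining row is DATABASE_NAME: MAX_MISMATCH, kept in file order.
    steps = [
        (name, int(max_mismatch))
        for name, max_mismatch in alignment.items()
        if name not in RESERVED_KEYS
    ]
    return assign_type, steps


example_alignment = {
    "type": "single",
    "human_miRNA": 2,
    "human_miRNA_sub": 2,
    "human_piRNA": 2,
    "human_snoRNA": 2,
}

plan = alignment_plan(example_alignment)
for database, max_mismatch in plan[1]:
    print(f"align to {database} allowing up to {max_mismatch} mismatches")
```

Because the steps keep the order of the file, reordering the rows in the config changes which database reads are aligned to first.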