Configuration file
The configuration file tells the pipeline where to find the input files it needs to run. A ready-to-use example config.yml is presented in the pipeline tutorial. The table below lists the currently implemented options, with an exemplary value and a description for each field.
| Parameter | Exemplary value | Description |
|---|---|---|
| experiment | tutorial | Name of the experiment. |
| samples | S1: /path/to/S1.bam S2: /path/to/S2.bam | Paths to the BAM files of the samples. Each sample must have its own file path defined. |
| processing_dir | /path/to/outdir | Directory where output files will be saved. |
| threads | 10 | Number of threads used for the execution. Increasing this value mainly speeds up minimap2 and BLASTN, as these are the computationally heaviest steps. If the number of insertion sites is low, the pipeline will already be very fast. |
| insertion_fasta | /path/to/construct.fa | Path to the construct sequence file in FASTA format. |
| blastn_db | /path/to/blastNdb | BLASTN database of reference nucleotides used to check for matches with the insertion sequence. If the insertion sequence contains parts of the reference genome, these matches are needed to evaluate true positive detections. |
| splitmode | Buffer, Split, or Join | Mode used for processing reads based on the insertion (one of Buffer, Split, or Join). |
| fragment_size | 100 | Size of fragments (in bp) for splitting sequences. Fragments of this size are used to construct the BLASTN database of the insertion sequence. |
| bridging_size | 300 | Acceptable gap size (in bp) before splitting the longest consecutive interval. It is recommended to adjust this parameter to the underlying read quality and insertion length. |
| MinLength | 1 | Minimum read length (in bp) for processing BLASTN matches. |
| MAPQ | 10 | Minimum mapping quality score for reads. Filtering is applied after the reads with insertions have been modified. |
| MinInsertionLength | 500 | Minimum length (in bp) of insertions to be detected. This depends on the respective insertion and potentially on its matches with the reference genome. |
| ref_genome_ctrl | /path/to/ref.fa | Reference genome file in FASTA format. |
| annotate_{key} | /path/to/annotate_{key}.bed | Only required for functional_genomics. Sorted annotation file in BED6 format. Four columns (chr, start, stop, id) must be provided. If fewer than six columns are provided, the empty columns are filled with `.`. If more than six columns are provided, the extra columns are discarded. Several annotation files can be provided via separate lines in the config. |
| detection | rules/detection.smk | Snakemake rule file for the detection and localization of insertions. |
| quality_control | rules/qc.smk | Snakemake rule file for the quality control of reads and insertions. |
| functional_genomics | rules/functional_genomics.smk | Optional Snakemake rule file for the functional annotation of insertions on the genome level. |
| base_modifications | rules/base_modifications.smk | Optional Snakemake rule file for the generation of base modification tables from modBAM files. Requires MM/ML tags. |
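
As a rough illustration, the fields above could be combined into a config.yml along the following lines. This is only a sketch built from the exemplary values in the table: the nesting of samples as a mapping, the choice of Buffer for splitmode, and the annotation key genes are assumptions made for illustration. The config.yml shown in the tutorial remains the authoritative layout.

```yaml
# Sketch of a config.yml assembled from the exemplary values in the table.
# The exact key layout is an assumption; compare with the tutorial config.
experiment: tutorial

# One entry per sample (nesting as a mapping is assumed here)
samples:
  S1: /path/to/S1.bam
  S2: /path/to/S2.bam

processing_dir: /path/to/outdir
threads: 10

insertion_fasta: /path/to/construct.fa
blastn_db: /path/to/blastNdb        # optional but recommended
ref_genome_ctrl: /path/to/ref.fa

splitmode: Buffer                   # one of Buffer, Split, or Join
fragment_size: 100                  # bp
bridging_size: 300                  # bp
MinLength: 1                        # bp
MAPQ: 10
MinInsertionLength: 500             # bp

# Only required for functional_genomics; "genes" is a hypothetical key
annotate_genes: /path/to/annotate_genes.bed

# Rule files
detection: rules/detection.smk
quality_control: rules/qc.smk
functional_genomics: rules/functional_genomics.smk    # optional
base_modifications: rules/base_modifications.smk      # optional, needs MM/ML tags
```

Further annotation files would each get their own annotate_{key} line with a distinct key.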
Info
The parameter blastn_db is optional but recommended for the comparison of insertion_fasta to a healthy reference database.
Note
Custom rules can be added to the config in the same way as the standard rules listed above. Their output, however, must additionally be defined in the Snakefile.
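
A custom rule entry might look like the following sketch. The key my_custom_rule and the file rules/my_custom_rule.smk are hypothetical placeholders, not part of the pipeline.

```yaml
# Hypothetical custom rule, registered like the standard rule files.
# Both the key and the .smk path are placeholders for illustration.
my_custom_rule: rules/my_custom_rule.smk
```

The config entry alone is not enough: the output of the custom rule still has to be declared in the Snakefile.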