Configuration file

The configuration file tells the pipeline where to find the input files it needs to run. A ready-to-use example config.yml is presented in the pipeline tutorial. The table below lists the currently implemented options, with an example value and a description for each field.

Parameter	Example value	Description
`experiment`	`tutorial`	Name of the experiment.
`samples`	`S1: /path/to/S1.bam` `S2: /path/to/S2.bam`	Paths to BAM files for the samples. Each sample should have its file path defined.
`processing_dir`	`/path/to/outdir`	Directory where output files will be saved.
`threads`	`10`	Number of threads to be used for the execution. Increasing this value will mainly speed up `minimap2` and `BLASTN`, as these are the computationally heaviest steps. If your number of insertion sites is low, the pipeline will already be very fast.
`insertion_fasta`	`/path/to/construct.fa`	Path to the construct sequence file in FASTA format.
`blastn_db`	`/path/to/blastNdb`	`BLASTN` database for reference nucleotides to check matches with the insertion sequence. If the insertion sequence contains reference genome parts, the matches here are needed to evaluate the true positive detections.
`splitmode`	`Buffer`, `Split`, or `Join`	Mode used for processing reads that contain the insertion. `Buffer`: replaces the insertion with "N" and retains the full length of each read; recommended for the most accurate insertion location (CIGAR-based). `Split`: cuts the insertion from the read and creates individual reads from the remaining sequence; the locations of the insertions are reported as the locations of each of the individual reads; recommended for exploratory searches for insertions that might fuse otherwise non-neighboring parts of the genome together. `Join`: cuts the insertion from the read and joins the remaining sequence together; the locations of the insertions are reported as the location of the joined read; recommended for debugging otherwise unmappable reads.
`fragment_size`	`100`	Size of fragments (in bp) for splitting sequences. Fragments of this size will be used to construct the `BLASTN` database of the insertion sequence.
`bridging_size`	`300`	Acceptable gap size (in bp) before splitting the longest consecutive interval. It is recommended to customize this parameter according to the underlying read quality and insertion length.
`MinLength`	`1`	Minimum read length (in bp) for processing `BLASTN` matches.
`MAPQ`	`10`	Minimum mapping quality score for reads. Filtering is applied after the modification of the reads with insertions.
`MinInsertionLength`	`500`	Minimum length (in bp) of insertions to be detected. This is dependent on the respective insertion and potentially its matches with the reference genome.
`ref_genome_ctrl`	`/path/to/ref.fa`	Reference genome file in FASTA format.
`annotate_{key}`	`/path/to/annotate_{key}.bed`	Only required for `functional_genomics`. Sorted annotation file in BED6 format. At least the first four columns (chr, start, stop, id) must be provided. If fewer than six columns are provided, the missing columns are filled with `.`. If more than six columns are provided, the extra columns are discarded. Several annotation files can be provided via separate lines in the config.
`detection`	`rules/detection.smk`	`Snakemake` rule file for the detection and localization of insertions.
`quality_control`	`rules/qc.smk`	`Snakemake` rule file for the quality control of reads and insertions.
`functional_genomics`	`rules/functional_genomics.smk`	Optional `Snakemake` rule file for the functional annotation of insertions on the genome level.
`base_modifications`	`rules/base_modifications.smk`	Optional `Snakemake` rule file for the generation of base modification tables from the modbam (MM/ML) tag. Requires MM/ML tags.
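
The three `splitmode` options described in the table can be illustrated on a toy read. The following is a minimal sketch and not the pipeline's actual implementation: the function name `apply_splitmode` and the assumption that the insertion occupies the half-open interval `read[start:end]` are hypothetical and chosen only for illustration.

```python
from typing import List


def apply_splitmode(read: str, start: int, end: int, mode: str) -> List[str]:
    """Return the read(s) that remain after handling the insertion.

    Hypothetical helper: the insertion is assumed to span read[start:end].
    """
    left, right = read[:start], read[end:]
    if mode == "Buffer":
        # Mask the insertion with "N" so the read keeps its full length.
        return [left + "N" * (end - start) + right]
    if mode == "Split":
        # Cut out the insertion and keep each non-empty flank as its own read.
        return [part for part in (left, right) if part]
    if mode == "Join":
        # Cut out the insertion and join the flanks into one shorter read.
        return [left + right]
    raise ValueError(f"unknown splitmode: {mode!r}")


if __name__ == "__main__":
    toy_read = "AAAATTTTCCCC"  # "TTTT" plays the role of the insertion
    for mode in ("Buffer", "Split", "Join"):
        print(mode, apply_splitmode(toy_read, 4, 8, mode))
```

In this sketch, `Buffer` preserves read coordinates, `Split` yields one read per flank, and `Join` yields a single read without the insertion, which mirrors how the table describes where insertion locations are reported in each mode.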

Info

The parameter `blastn_db` is optional but recommended for the comparison of `insertion_fasta` to a healthy reference database.

Note

Custom rules can be added to the config in the same way as the standard rules listed above. Their output, however, must additionally be defined in the Snakefile.
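
Before starting a run, it can help to check the loaded configuration against the table above. The following is a minimal sketch under stated assumptions: the function `check_config` and the tuple `REQUIRED_KEYS` are hypothetical, and which keys are treated as mandatory is an assumption rather than something the pipeline documents. Only the optional status of `blastn_db`, `functional_genomics`, and `base_modifications`, and the link between `functional_genomics` and `annotate_{key}`, are taken from the text above. Unknown keys are tolerated so that custom rules do not trigger errors.

```python
from typing import Dict, List

# Assumption: these keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = (
    "experiment", "samples", "processing_dir", "insertion_fasta",
    "ref_genome_ctrl", "splitmode", "detection", "quality_control",
)
VALID_SPLITMODES = ("Buffer", "Split", "Join")


def check_config(config: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
    mode = config.get("splitmode")
    if mode is not None and mode not in VALID_SPLITMODES:
        problems.append(f"splitmode must be one of {VALID_SPLITMODES}, got {mode!r}")
    # annotate_{key} entries are only required when functional_genomics is set.
    has_annotation = any(str(key).startswith("annotate_") for key in config)
    if "functional_genomics" in config and not has_annotation:
        problems.append("functional_genomics needs at least one annotate_{key} entry")
    return problems


if __name__ == "__main__":
    example = {
        "experiment": "tutorial",
        "samples": {"S1": "/path/to/S1.bam", "S2": "/path/to/S2.bam"},
        "processing_dir": "/path/to/outdir",
        "insertion_fasta": "/path/to/construct.fa",
        "ref_genome_ctrl": "/path/to/ref.fa",
        "splitmode": "Buffer",
        "detection": "rules/detection.smk",
        "quality_control": "rules/qc.smk",
    }
    print(check_config(example))
```

The check works on a plain dictionary, so it could be applied to whatever mapping results from loading config.yml. It does not verify that the listed paths exist.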