Configuration file

The configuration file tells the pipeline where to find the input files it needs to run. A ready-to-use example config.yml is presented in the pipeline tutorial. The table below lists the currently implemented options, with an example value and a description for each field.

Parameter Example value Description
experiment tutorial Name of the experiment.
samples S1: /path/to/S1.bam
S2: /path/to/S2.bam
Paths to the BAM files for the samples, given as one name: path entry per sample.
processing_dir /path/to/outdir Directory where output files will be saved.
threads 10 Number of threads used for the execution. Increasing this value mainly speeds up minimap2 and BLASTN, which are the computationally heaviest steps. If the number of insertion sites is low, the pipeline is already very fast.
insertion_fasta /path/to/construct.fa Path to the construct sequence file in FASTA format.
blastn_db /path/to/blastNdb BLASTN database of reference nucleotide sequences, used to check for matches with the insertion sequence. If the insertion sequence contains parts of the reference genome, these matches are needed to evaluate the true positive detections.
splitmode Buffer, Split, or Join Mode used for processing reads based on the insertion:
  • Buffer: Replaces the insertion with "N" and retains the full length of each read. Recommended for the most accurate insertion location (CIGAR-based).
  • Split: Cuts the insertion from the read and creates individual reads from the remaining sequence. The locations of the insertions are reported as the locations of each of the individual reads. Recommended for exploratory search for insertions that might fuse otherwise non-neighboring parts of the genome together.
  • Join: Cuts the insertion from the read and joins the remaining sequence together. The locations of the insertions are reported as the location of the joined read. Recommended for debugging of otherwise unmappable reads.
fragment_size 100 Size of fragments (in bp) for splitting sequences. Fragments of this size will be used to construct the BLASTN database of the insertion sequence.
bridging_size 300 Acceptable gap size (in bp) before splitting the longest consecutive interval. It is recommended to customize this parameter according to the underlying read quality and insertion length.
MinLength 1 Minimum read length (in bp) for processing BLASTN matches.
MAPQ 10 Minimum mapping quality score for reads. Filtering is applied after the modification of the reads with insertions.
MinInsertionLength 500 Minimum length (in bp) of insertions to be detected. The appropriate value depends on the respective insertion and, potentially, on its matches with the reference genome.
ref_genome_ctrl /path/to/ref.fa Reference genome file in FASTA format.
annotate_{key} /path/to/annotate_{key}.bed Only required for functional_genomics. Sorted annotation file in BED6 format. At least the first four columns (chr, start, stop, id) must be provided. If fewer than six columns are provided, the empty columns are filled with ".". If more than six columns are provided, the extra columns are discarded. Several annotation files can be provided via separate lines in the config.
detection rules/detection.smk Snakemake rule file for the detection and localization of insertions.
quality_control rules/qc.smk Snakemake rule file for the quality control of reads and insertions.
functional_genomics rules/functional_genomics.smk Optional Snakemake rule file for the functional annotation of insertions on the genome level.
base_modifications rules/base_modifications.smk Optional Snakemake rule file for the generation of base modification tables. Requires BAM files that carry the modBAM MM/ML tags.
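
To show how the fields in the table fit together, here is a minimal sketch of a config.yml that uses the example values listed above. It is not the tutorial file itself: the flat key layout, the nesting of samples as a mapping, and the annotation key "genes" in annotate_genes are assumptions made for illustration, and all paths are placeholders.

```yaml
# Sketch of a config.yml assembled from the example values in the table.
# Layout and the annotation key "genes" are assumptions; paths are placeholders.
experiment: tutorial

samples:
  S1: /path/to/S1.bam
  S2: /path/to/S2.bam

processing_dir: /path/to/outdir
threads: 10

insertion_fasta: /path/to/construct.fa
blastn_db: /path/to/blastNdb        # optional but recommended
ref_genome_ctrl: /path/to/ref.fa

splitmode: Buffer                   # one of Buffer, Split, Join
fragment_size: 100                  # bp
bridging_size: 300                  # bp
MinLength: 1                        # bp
MAPQ: 10
MinInsertionLength: 500             # bp

# Only required for functional_genomics; one line per annotation file.
annotate_genes: /path/to/annotate_genes.bed

# Rule files
detection: rules/detection.smk
quality_control: rules/qc.smk
functional_genomics: rules/functional_genomics.smk   # optional
base_modifications: rules/base_modifications.smk     # optional, needs MM/ML tags
```

Buffer is chosen here because the table recommends it for the most accurate insertion location; Split or Join can be substituted for the use cases described above.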

Info

The parameter blastn_db is optional but recommended for the comparison of insertion_fasta to a healthy reference database.

Note

Custom rules can be added to the config in the same way as the standard rules listed above. Their output, however, must additionally be defined in the Snakefile.
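
As a sketch of what such an entry might look like, the fragment below adds a custom rule file next to a standard one. The key my_custom_rule and the file rules/my_custom_rule.smk are hypothetical names chosen for illustration, and the key: path layout is an assumption based on the table above.

```yaml
# Standard rule file, as listed in the table
detection: rules/detection.smk

# Hypothetical custom rule file; key and path are illustrative only.
# Its output still has to be defined in the Snakefile.
my_custom_rule: rules/my_custom_rule.smk
```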