Configuration File

All data files and parameter settings needed to train a model and to apply a trained model are specified in a configuration file. As a starting point, an example config file is provided with the DeFCoM source code in the package directory at defcom/data/example.cfg. The config file can alternatively be found here. Use this file as a template to construct your own config file.

Structure

The config file contains only three types of lines:

  1. Lines beginning with # denote comments that are ignored by DeFCoM.
  2. Lines following the format variable_name = value specify settings that need to be set in order to run DeFCoM.
  3. Lines with [...] signify the start of a configuration file section. For DeFCoM, the two config file sections required are [data] and [options].
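
The three line types above match the INI dialect read by Python's standard configparser module. The following is a minimal sketch that parses a small in-memory sample and checks for the two required sections. It assumes DeFCoM config files are compatible with configparser, which this page does not state; the function name load_config and the sample path and values are placeholders for illustration.

```python
import configparser

# Hypothetical sample; the path and values are placeholders, not defaults.
SAMPLE_CFG = """
# Lines beginning with '#' are comments
[data]
training_bam_file = /path/to/training.bam

[options]
flank_size = 100
combine_strands = False
"""


def load_config(text):
    """Parse config text and check that both required sections exist."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    for section in ("data", "options"):
        if not parser.has_section(section):
            raise ValueError("missing required section [%s]" % section)
    return parser


cfg = load_config(SAMPLE_CFG)
print(cfg.get("data", "training_bam_file"))
print(cfg.getint("options", "flank_size"))
print(cfg.getboolean("options", "combine_strands"))
```

Values are stored as strings, so numeric and True/False settings are converted with getint and getboolean when read.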

Below we describe the value required by each of the config variables.

[data] Variables

genome_file (optional)

A FASTA file for the reference genome corresponding to the motif site files and the DNase-seq/ATAC-seq BAM alignment files.

Note: This variable only needs to be specified for use with some of the data pre-processing scripts. train.py and predict.py will still run without this variable.

active_sites_file

A file in pseudo-BED format containing predicted motif sites annotated as active/bound by a transcription factor. In the same directory as this file, a gzipped (.gz) file and a tabix index (.tbi) file must be present with the same file name prefix.

inactive_sites_file

A file in pseudo-BED format containing predicted motif sites annotated as inactive/unbound by a transcription factor. In the same directory as this file, a gzipped (.gz) file and a tabix index (.tbi) file must be present with the same file name prefix.

candidate_sites_file

A file in pseudo-BED format containing motif sites to be classified. In the same directory as this file, a gzipped (.gz) file and a tabix index (.tbi) file must be present with the same file name prefix.
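
Before running DeFCoM, it can help to confirm that each motif site file has its companion files. The sketch below is one way to do that. It assumes the companions are named by appending .gz and .gz.tbi to the site file name, which is the usual bgzip/tabix naming but is not confirmed by this page; missing_companions is a hypothetical helper, not part of DeFCoM.

```python
import os


def missing_companions(sites_path):
    """Return the expected companion files that are absent.

    Assumption: companions are named <sites_path>.gz and
    <sites_path>.gz.tbi (the usual bgzip/tabix naming).
    """
    expected = [sites_path + ".gz", sites_path + ".gz.tbi"]
    return [path for path in expected if not os.path.isfile(path)]


# Example with a placeholder path; prints whichever companions are missing.
print(missing_companions("/path/to/active_sites.bed"))
```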

training_bam_file

A BAM file for DNase-seq/ATAC-seq read alignments to be used for model training. The file must have an index file with the same name but with the ‘.bai’ extension appended.

candidate_bam_file

A BAM file for DNase-seq/ATAC-seq read alignments to be used for the classification phase of DeFCoM. The file must have an index file with the same name but with the ‘.bai’ extension appended.
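
Because the index name is formed by appending the extension, the index for sample.bam is sample.bam.bai, not sample.bai. A small sketch of that check follows; bam_index_path and has_bam_index are hypothetical helper names, not DeFCoM functions.

```python
import os


def bam_index_path(bam_path):
    """Return the index path: the BAM file name with '.bai' appended."""
    return bam_path + ".bai"


def has_bam_index(bam_path):
    """Return True if the appended-extension index file exists."""
    return os.path.isfile(bam_index_path(bam_path))


print(bam_index_path("sample.bam"))
```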

[options] Variables

f_offset

A number denoting the offset in bases downstream from the 5’ end of a forward (+) strand read. The recommended values are 0 for DNase-seq reads and 4 for ATAC-seq. Equivalent to read shifting.

r_offset

A number denoting the offset in bases upstream from the 5’ end of a reverse (-) strand read. The recommended values are 0 for DNase-seq reads and 5 for ATAC-seq. Equivalent to read shifting.
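
The two offsets can be illustrated with a short sketch that computes a shifted cut site from a read's aligned coordinates. This is an interpretation, not DeFCoM's code: it assumes 0-based coordinates, that the forward offset moves toward higher genomic coordinates, and that the reverse offset moves toward lower genomic coordinates (the common +4/-5 ATAC-seq convention). The function name cut_site and its parameters are hypothetical.

```python
def cut_site(read_start, read_end, is_reverse, f_offset=0, r_offset=0):
    """Return the 0-based genomic position of the shifted cut site.

    read_start: 0-based leftmost aligned position.
    read_end:   0-based position one past the rightmost aligned base.
    """
    if is_reverse:
        # The 5' end of a reverse-strand read is its rightmost aligned base.
        # Assumption: the shift is toward lower genomic coordinates.
        return (read_end - 1) - r_offset
    # The 5' end of a forward-strand read is its leftmost aligned base.
    return read_start + f_offset


# ATAC-seq style offsets on a read aligned to [100, 150).
print(cut_site(100, 150, is_reverse=False, f_offset=4, r_offset=5))
print(cut_site(100, 150, is_reverse=True, f_offset=4, r_offset=5))
```

With both offsets set to 0, as recommended for DNase-seq, the cut site is simply the 5’ end of the read.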

flank_size

A number specifying how many bases upstream and downstream of the motif site center to include. The value specifies one flank, so the total number of included bases is flank_size*2. The size of this region determines the size of the footprint profile from which features are extracted for model training and classification.
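
As a worked illustration of the arithmetic, the sketch below computes the region for one motif site. It assumes the center is the integer midpoint of the site's start and end coordinates and that the region is a half-open interval; neither detail is stated on this page, and flank_window is a hypothetical name.

```python
def flank_window(site_start, site_end, flank_size):
    """Return (start, end) of the half-open region around the site center."""
    # Assumption: the center is the integer midpoint of the motif site.
    center = (site_start + site_end) // 2
    return center - flank_size, center + flank_size


start, end = flank_window(1000, 1020, flank_size=100)
print(start, end, end - start)
```

For any motif site, the region width is flank_size*2 regardless of the motif length.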

combine_strands

Must be either True or False. Indicates whether DNaseI digestion site counts (or Tn5 tagmentation site counts) should be aggregated across strands. If False, the digestion counts for the forward and reverse strands are considered separately.
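
The effect of this setting can be sketched with per-position count lists. This is an illustration only: it assumes aggregation means a position-wise sum, and build_profiles is a hypothetical function whose return format is not taken from DeFCoM.

```python
def build_profiles(forward_counts, reverse_counts, combine_strands):
    """Return a list of count profiles for one motif site.

    One summed profile when combine_strands is True, otherwise the
    forward and reverse profiles kept apart.
    """
    if len(forward_counts) != len(reverse_counts):
        raise ValueError("strand profiles must have the same length")
    if combine_strands:
        # Assumption: aggregation is a position-wise sum of the two strands.
        return [[f + r for f, r in zip(forward_counts, reverse_counts)]]
    return [list(forward_counts), list(reverse_counts)]


print(build_profiles([1, 0, 2], [0, 3, 1], combine_strands=True))
print(build_profiles([1, 0, 2], [0, 3, 1], combine_strands=False))
```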

bias_correction

Must be either True or False. Indicates whether bias correction should be applied. Currently, DeFCoM only supports bias correction in the form of read weights defined in a BAM file by an XW tag field for each aligned read.

bootstrap_iterations

The number of bootstrap iterations to perform for the model selection component of the training phase.

bootstrap_active_set_size

The number of active set motif sites to sample per bootstrap iteration. If the number exceeds the total active set size, DeFCoM will default to using the whole active set.

bootstrap_inactive_set_size

The number of inactive set motif sites to sample per bootstrap iteration. If the number exceeds the total inactive set size, DeFCoM will default to using the whole inactive set.

training_active_set_size

The number of active set motif sites to use for final model training. If the number exceeds the total active set size, DeFCoM will default to using the whole active set.

training_inactive_set_size

The number of inactive set motif sites to use for final model training. If the number exceeds the total inactive set size, DeFCoM will default to using the whole inactive set.
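
The four set size variables share the same fallback rule: a request larger than the available set uses the whole set. A minimal sketch of that rule follows. It assumes sites are drawn without replacement, which this page does not specify, and sample_sites is a hypothetical helper rather than a DeFCoM function.

```python
import random


def sample_sites(sites, requested_size, rng=None):
    """Draw requested_size sites, or return all sites if too few exist."""
    rng = rng or random.Random()
    if requested_size >= len(sites):
        # Fallback described above: use the whole set.
        return list(sites)
    # Assumption: sampling is done without replacement.
    return rng.sample(sites, requested_size)


sites = list(range(10))
print(len(sample_sites(sites, 3)))
print(len(sample_sites(sites, 50)))
```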

memory_limit

The size (MB) of the cache memory allowed for storing the SVM kernel matrix.

model_data_file

The name of the file that holds trained model data. Should have a .pkl file extension. train.py will create and store the trained model in this file and predict.py will use the trained model from this file.

results_file

The name of the file where the results will be output.