Configuration File
All the data files and parameter settings required to train a model and to apply a
trained model are specified in a configuration file. As a starting point, an example
config file is provided with the DeFCoM source code in the package directory at
defcom/data/example.cfg. The config file can alternatively be found
here. This file can be used
as a template to construct your own config file.
Structure
The config file has a simple structure with only three types of lines:
- Lines beginning with # denote comments that are ignored by DeFCoM.
- Lines following the format variable_name = value specify settings that need to be set in order to run DeFCoM.
- Lines with [...] signify the start of a configuration file section. For DeFCoM, the two required config file sections are [data] and [options].
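The three line types above follow the familiar INI layout. As a minimal sketch, assuming the file is compatible with Python's standard configparser module (this page does not say how DeFCoM itself reads the file), a config with the two required sections could be loaded and checked as follows. The values in EXAMPLE_CFG and the helper name load_config_text are placeholders for illustration only.

```python
import configparser

# Hypothetical config contents; the values are placeholders, not recommendations.
EXAMPLE_CFG = """
# Lines beginning with '#' are comments.
[data]
training_bam_file = /path/to/training.bam

[options]
flank_size = 100
combine_strands = False
"""

# The two sections the documentation lists as required.
REQUIRED_SECTIONS = ("data", "options")


def load_config_text(text):
    """Parse config text and check that the required sections are present."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [s for s in REQUIRED_SECTIONS if not parser.has_section(s)]
    if missing:
        raise ValueError("missing config section(s): " + ", ".join(missing))
    return parser


config = load_config_text(EXAMPLE_CFG)
print(config.sections())
print(config.get("options", "flank_size"))
```

All values come back as strings from get; typed accessors such as getint and getboolean can be used where a number or True/False is expected.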
Below we describe the value required by each of the config variables.
[data] Variables
- genome_file (optional): A FASTA file for the reference genome corresponding to the motif site files and the DNase-seq/ATAC-seq BAM alignment files. Note: This variable only needs to be specified for use with some of the data pre-processing scripts. train.py and predict.py will still run without this variable.
- active_sites_file: A file in pseudo-BED format containing predicted motif sites annotated as active/bound by a transcription factor. A gzipped (.gz) file and a tabix index (.tbi) file with the same file name prefix must be present in the same directory as this file.
- inactive_sites_file: A file in pseudo-BED format containing predicted motif sites annotated as inactive/unbound by a transcription factor. A gzipped (.gz) file and a tabix index (.tbi) file with the same file name prefix must be present in the same directory as this file.
- candidate_sites_file: A file in pseudo-BED format containing motif sites to be classified. A gzipped (.gz) file and a tabix index (.tbi) file with the same file name prefix must be present in the same directory as this file.
- training_bam_file: A BAM file of DNase-seq/ATAC-seq read alignments to be used for model training. The file must have an index file with the same name but with the ‘.bai’ extension appended.
- candidate_bam_file: A BAM file of DNase-seq/ATAC-seq read alignments to be used for the classification phase of DeFCoM. The file must have an index file with the same name but with the ‘.bai’ extension appended.
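Missing companion files are an easy mistake to make with the [data] variables, so a quick pre-flight check can help. The sketch below is not part of DeFCoM. The function names are hypothetical, and the naming of the site-file companions is an assumption: "same file name prefix" is read here as the original file name with .gz and .gz.tbi appended. Only the ‘.bai’ rule for BAM files is taken directly from the descriptions above.

```python
import os


def companion_files(path, kind):
    """Return the companion file paths expected next to a [data] input file.

    kind is "sites" for the pseudo-BED motif site files or "bam" for the
    DNase-seq/ATAC-seq alignment files.
    """
    if kind == "bam":
        # Stated in the documentation: same name with '.bai' appended.
        return [path + ".bai"]
    if kind == "sites":
        # ASSUMPTION: "same file name prefix" is interpreted as <file>.gz and
        # <file>.gz.tbi. Adjust if your files are named differently.
        return [path + ".gz", path + ".gz.tbi"]
    raise ValueError("unknown kind: %r" % (kind,))


def missing_companions(path, kind):
    """List the expected companion files that do not exist on disk."""
    return [p for p in companion_files(path, kind) if not os.path.isfile(p)]


print(companion_files("sites/active.bed", "sites"))
print(companion_files("reads/train.bam", "bam"))
```

Calling missing_companions on each configured path before running train.py or predict.py would report absent index files up front, under the naming assumption stated above.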
[options] Variables
- f_offset: A number denoting the offset in bases downstream from the 5’ end of a forward (+) strand read. The recommended values are 0 for DNase-seq reads and 4 for ATAC-seq reads. Equivalent to read shifting.
- r_offset: A number denoting the offset in bases upstream from the 5’ end of a reverse (-) strand read. The recommended values are 0 for DNase-seq reads and 5 for ATAC-seq reads. Equivalent to read shifting.
- flank_size: A number specifying how many bases upstream and downstream of the motif site center to include. The value specifies one flank, so the total number of included bases is flank_size*2. The size of this region determines the size of the footprint profile from which features are extracted for model training and classification.
- combine_strands: Must be either True or False. Indicates whether DNaseI digestion site counts (or Tn5 tagmentation site counts) should be aggregated across strands. If False, the digestion count data for the forward and reverse strands are considered separately.
- bias_correction: Must be either True or False. Indicates whether bias correction should be applied. Currently, DeFCoM only supports bias correction in the form of read weights defined in a BAM file by an XW tag field for each aligned read.
- bootstrap_iterations: The number of bootstrap iterations to perform for the model selection component of the training phase.
- bootstrap_active_set_size: The number of active set motif sites to sample per bootstrap iteration. If the number exceeds the total active set size, DeFCoM will default to using the whole active set.
- bootstrap_inactive_set_size: The number of inactive set motif sites to sample per bootstrap iteration. If the number exceeds the total inactive set size, DeFCoM will default to using the whole inactive set.
- training_active_set_size: The number of active set motif sites to use for final model training. If the number exceeds the total active set size, DeFCoM will default to using the whole active set.
- training_inactive_set_size: The number of inactive set motif sites to use for final model training. If the number exceeds the total inactive set size, DeFCoM will default to using the whole inactive set.
- memory_limit: The size (in MB) of the cache memory allowed for storing the SVM kernel matrix.
- model_data_file: The name of the file that holds trained model data. Should have a .pkl file extension. train.py will create and store the trained model in this file, and predict.py will use the trained model from this file.
- results_file: The name of the file where the results will be output.
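To make the arithmetic behind a few of the [options] variables concrete, here is a small illustrative sketch. It is not DeFCoM's implementation, and all function names are hypothetical. Two assumptions are made: "upstream" for reverse-strand reads is interpreted in reference coordinates (toward lower positions, as in the conventional ATAC-seq +4/-5 shift), and the flank_size*2 region is taken as a half-open interval around the motif site center.

```python
def shifted_cut_site(five_prime_pos, strand, f_offset, r_offset):
    """Apply the read-shift offsets to the 5' end position of a read.

    ASSUMPTION: the reverse-strand offset moves the position toward lower
    reference coordinates.
    """
    if strand == "+":
        return five_prime_pos + f_offset
    if strand == "-":
        return five_prime_pos - r_offset
    raise ValueError("strand must be '+' or '-'")


def footprint_window(center, flank_size):
    """Return (start, end) covering flank_size bases on each side of center.

    ASSUMPTION: half-open interval, so end - start == flank_size * 2.
    """
    return (center - flank_size, center + flank_size)


def effective_sample_size(requested, available):
    """Cap a requested set size at the number of sites actually available."""
    return min(requested, available)


# Recommended ATAC-seq offsets from the text: f_offset = 4, r_offset = 5.
print(shifted_cut_site(1000, "+", 4, 5))
print(shifted_cut_site(1000, "-", 4, 5))

start, end = footprint_window(500, 100)
print(end - start)

# Requesting more sites than exist falls back to the whole set.
print(effective_sample_size(5000, 1200))
```

With the recommended DNase-seq offsets of 0, shifted_cut_site returns the 5’ end position unchanged for both strands.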