API
Overview
The Application Programming Interface (API) concisely organizes the class, method, and function definitions for the various components of DeFCoM. It is intended to be beneficial mainly to those interested in extending or customizing DeFCoM and learning the details of the source code.
Scripts
- train.py: Train a DeFCoM model
- predict.py: Perform footprint classification with a DeFCoM model
Modules
- alignment: Manage BAM files
- genome: Manage FASTA files
- motif_sites: Manage BED-like files for motif predictions
- features: Tools for feature extraction
- performance_stats: Performance statistics for assessing classification accuracy
defcom.alignment module
-
class defcom.alignment.Alignments(aln_file)
Bases: object
Access and process sequence alignment data.
The Alignments class acts as a wrapper class for pysam to provide additional footprinting-specific data processing functionality. Information about pysam can be found at http://pysam.readthedocs.org/.
- Attributes:
- _alignment: A pysam AlignmentFile object.
- Args:
- aln_file: A string containing the path and filename for a sorted BAM file. The file must have an index file with the same name but with the ‘.bai’ extension appended. For example, the index file for ‘/path/file.bam’ should be ‘/path/file.bam.bai’.
- Raises:
- IOError: BAM file could not be found.
- AssertionError: File is not a BAM file.
- IOError: BAM file index could not be found.
-
get_cut_sites(chrom, start, end, strand=None, f_offset=0, r_offset=0, use_weights=False, multi_iter=False)
Retrieve DNaseI digestion sites for a genomic region.
Makes a list of counts relative to the start coordinate of the genomic region, specified using BED format chromosome coordinates, i.e., 0-based indexing [start, end). Cuts are assumed to be 5’ read ends unless offsets are specified.
- Args:
- chrom: A string for the name of the chromosome for the desired genomic region. Names are derived from the BAM header.
- start: An int for the 0-based start position of the genomic region.
- end: An int for the 0-based end position of the genomic region. The position ‘end’ denotes one position past the region of interest.
- strand: The strand from which to obtain the cut counts. Must be one of ‘+’, ‘-’, ‘combined’, or ‘None’ (default). If ‘combined’ is specified, reads from both strands are aggregated. If ‘None’ is specified, the read vectors for each strand are concatenated, with the ‘+’ strand vector first.
- f_offset: An int denoting the offset downstream from the 5’ end of a forward (+) strand read. Equivalent to read shifting.
- r_offset: An int denoting the offset upstream from the 5’ end of a reverse (-) strand read. Equivalent to read shifting.
- use_weights: A boolean indicating whether bias correction weights should be used for each read. The weights are expected to be stored as an optional tag ‘XW’ for each read.
- multi_iter: A boolean indicating whether to enable multiple iterators. This should be ‘True’ if multiple calls are made to this method concurrently for the same Alignments object.
- Returns:
- A list of counts as floats with positions corresponding to the specified genomic interval.
- Raises:
- KeyError - The tag field ‘XW’ cannot be found for a read.
- ValueError - The genomic coordinates are out of range, are invalid, or the file does not permit random access.
- ValueError - ‘strand’ is not specified correctly.
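The counting described above can be sketched without pysam. This is a minimal illustration, not DeFCoM's implementation: `count_cut_sites` and the `(ref_start, ref_end, is_reverse)` read tuples are hypothetical, and the direction of each offset (forward reads shift toward higher coordinates, reverse reads toward lower coordinates) is an assumption about how "downstream" and "upstream" are meant.

```python
def count_cut_sites(reads, start, end, strand=None, f_offset=0, r_offset=0):
    """Count 5' read ends per position in the half-open interval [start, end).

    `reads` is a hypothetical list of (ref_start, ref_end, is_reverse) tuples
    in 0-based, half-open coordinates; it stands in for BAM records.
    """
    if strand not in ("+", "-", "combined", None):
        raise ValueError("'strand' must be '+', '-', 'combined' or None")
    width = end - start
    forward = [0.0] * width
    reverse = [0.0] * width
    for ref_start, ref_end, is_reverse in reads:
        if is_reverse:
            # The 5' end of a reverse read is its last aligned base.
            # Assumption: r_offset moves the cut toward lower coordinates.
            cut = ref_end - 1 - r_offset
            target = reverse
        else:
            # Assumption: f_offset moves the cut toward higher coordinates.
            cut = ref_start + f_offset
            target = forward
        if start <= cut < end:
            target[cut - start] += 1.0
    if strand == "+":
        return forward
    if strand == "-":
        return reverse
    if strand == "combined":
        return [f + r for f, r in zip(forward, reverse)]
    # strand is None: concatenate, with the '+' strand vector first.
    return forward + reverse
```

With `strand=None` the returned list is twice the width of the interval, which matters when the vector is later used as a feature row.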
-
get_mapped_read_count()
Get the number of mapped reads in the AlignmentFile object.
- Returns: A long int corresponding to the number of mapped reads in the ‘_alignment’ object.
-
get_read_density(chrom, start, end, strand=None, use_weights=False, multi_iter=False)
Compute the total number of reads overlapping a genomic region.
Calculates the read density within a genomic region specified using BED format chromosome coordinates, i.e., 0-based indexing [start, end). Strand-specific read density is computed if a strand is specified. Note that a read is counted if it merely overlaps the specified region; it does not have to be fully contained in the region.
- Args:
- chrom: A string for the name of the chromosome for the desired genomic region. Names are derived from the BAM header.
- start: An int for the 0-based start position of the genomic region.
- end: An int for the 0-based end position of the genomic region. The position ‘end’ denotes one position past the region of interest.
- strand: The strand from which to obtain the read density. Must be one of ‘+’, ‘-’, or ‘None’ (default). If ‘None’ is specified, reads from both strands are included.
- use_weights: A boolean indicating whether bias correction weights should be used for each read. The weights are expected to be stored as an optional tag ‘XW’ for each read.
- multi_iter: A boolean indicating whether to enable multiple iterators. This should be ‘True’ if multiple calls are made to this method concurrently for the same Alignments object.
- Returns:
- A float value corresponding to the total read count observed in the genomic interval.
- Raises:
- KeyError - The tag field ‘XW’ cannot be found for a read.
- ValueError - The genomic coordinates are out of range, are invalid, or the file does not permit random access.
- ValueError - ‘strand’ is not specified correctly.
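The overlap rule can be made concrete with a small sketch. `read_density` below is a hypothetical stand-in that works on plain `(ref_start, ref_end, is_reverse)` tuples instead of BAM records and ignores weights; it only illustrates the half-open overlap test.

```python
def read_density(reads, start, end, strand=None):
    """Count reads that overlap the half-open interval [start, end).

    `reads` is a hypothetical list of (ref_start, ref_end, is_reverse) tuples
    in 0-based, half-open coordinates.
    """
    if strand not in ("+", "-", None):
        raise ValueError("'strand' must be '+', '-' or None")
    total = 0.0
    for ref_start, ref_end, is_reverse in reads:
        if strand == "+" and is_reverse:
            continue
        if strand == "-" and not is_reverse:
            continue
        # Two half-open intervals overlap when each starts before the other ends.
        if ref_start < end and ref_end > start:
            total += 1.0
    return total
```

A read ending exactly at `start`, or starting exactly at `end`, does not overlap under half-open coordinates.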
-
get_reads(chrom, start, end, multi_iter=False)
Get reads overlapping a genomic region.
Retrieves the set of reads overlapping a specified genomic region using BED format chromosome coordinates, i.e., 0-based indexing [start, end).
- Args:
- chrom: A string for the name of the chromosome for the desired genomic region. Names are derived from the BAM header.
- start: An int for the 0-based start position of the genomic region.
- end: An int for the 0-based end position of the genomic region. The position ‘end’ denotes one position past the region of interest.
- multi_iter: A boolean indicating whether to enable multiple iterators. This should be ‘True’ if multiple calls are made to this method for the same Alignments object and the iterators are used concurrently.
- Returns:
- An iterator of AlignedSegment objects.
- Raises:
- ValueError - The genomic coordinates are out of range, are invalid, or the file does not permit random access.
-
get_total_read_count()
Get the total number of reads in the AlignmentFile object.
- Returns: A long int corresponding to the number of total reads in the ‘_alignment’ object.
-
get_unmapped_read_count()
Get the number of unmapped reads in the AlignmentFile object.
- Returns: A long int corresponding to the number of unmapped reads in the ‘_alignment’ object.
-
get_weights(chrom, start, end, strand=None, f_offset=0, r_offset=0, multi_iter=False)
Retrieve bias correction weights for a given region.
Retrieves the per-position bias correction weights within the specified genomic interval. The per-read weights for a position are totaled to get the per-position weight. The BAM file must contain bias correction information; otherwise this method raises an error.
- Args:
- chrom: A string for the name of the chromosome for the desired genomic region. Names are derived from the BAM header.
- start: An int for the 0-based start position of the genomic region.
- end: An int for the 0-based end position of the genomic region. The position ‘end’ denotes one position past the region of interest.
- strand: The strand from which to obtain the weights. Must be one of ‘+’, ‘-’, or ‘None’ (default). If ‘None’ is specified, reads from both strands are included.
- f_offset: An int denoting the offset downstream from the 5’ end of a forward (+) strand read. Equivalent to read shifting.
- r_offset: An int denoting the offset upstream from the 5’ end of a reverse (-) strand read. Equivalent to read shifting.
- multi_iter: A boolean indicating whether to enable multiple iterators. This should be ‘True’ if multiple calls are made to this method concurrently for the same Alignments object.
- Returns:
- A 2-D list of bias correction weights where the first dimension represents a genomic position. Index 0 corresponds to ‘start’. The second dimension is a list of all the read weights at a position.
- Raises:
- KeyError - The tag field ‘XW’ cannot be found for a read.
- ValueError - The genomic coordinates are out of range, are invalid, or the file does not permit random access.
- ValueError - ‘strand’ is not specified correctly.
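The shape of the returned 2-D list, and how it relates to a per-position total, can be sketched as follows. `collect_weights` is hypothetical: reads are plain tuples carrying a `tags` dict in place of BAM optional fields, strand filtering is omitted, and the offset directions are the same assumption as for cut sites (forward toward higher coordinates, reverse toward lower).

```python
def collect_weights(reads, start, end, f_offset=0, r_offset=0):
    """Group per-read 'XW' weights by cut position within [start, end).

    `reads` is a hypothetical list of (ref_start, ref_end, is_reverse, tags)
    tuples, where `tags` is a dict standing in for BAM optional fields.
    """
    weights = [[] for _ in range(end - start)]
    for ref_start, ref_end, is_reverse, tags in reads:
        if is_reverse:
            cut = ref_end - 1 - r_offset  # assumed shift toward lower coordinates
        else:
            cut = ref_start + f_offset    # assumed shift toward higher coordinates
        if start <= cut < end:
            # Raises KeyError when a read carries no 'XW' tag.
            weights[cut - start].append(tags["XW"])
    return weights


def total_per_position(weights):
    """Sum the per-read weights at each position."""
    return [sum(position) for position in weights]
```

Index 0 of the outer list corresponds to `start`; positions with no cuts hold an empty inner list and total to 0.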
defcom.features module
-
class defcom.features.Features
Bases: object
Functions for applying feature extraction.
-
classmethod convert_to_cuts(motif_sites, aln_data, flank_size, combine_strands=True, f_offset=0, r_offset=0, use_weights=False)
Convert motif sites into a matrix of DNaseI digestion site counts.
Retrieves a vector of DNaseI digestion sites for each motif site in the ‘motif_sites’ list. Puts all the vectors into a 2D list.
- Args:
- motif_sites: A 2D list or iterator of motif sites in pseudo-BED format.
- aln_data: An Alignments object.
- flank_size: An int specifying how many bases upstream and downstream of the motif site center to include. The value specifies one flank. The total number of included bases is flank_size*2.
- combine_strands: A boolean indicating if DNaseI digestion site counts should be aggregated across strands. If ‘False’, the forward and reverse strand digestion count vectors are concatenated, with the forward strand first.
- f_offset: An int denoting the offset downstream from the 5’ end of a forward (+) strand read. Equivalent to read shifting.
- r_offset: An int denoting the offset upstream from the 5’ end of a reverse (-) strand read. Equivalent to read shifting.
- use_weights: A boolean indicating whether bias correction weights should be used for each read.
- Returns:
- A 2D list where each inner list contains DNaseI digestion counts per base within and/or around a motif site. Each index of the first dimension corresponds to a motif site.
- Raises:
- ValueError - The genomic coordinates are out of range or invalid.
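The window arithmetic implied by ‘flank_size’ can be sketched in a few lines. Both helpers are hypothetical, and taking the site center as the integer midpoint of the site's start and end is an assumption; the source does not state how the center is rounded.

```python
def flank_window(site_start, site_end, flank_size):
    """Return the [start, end) window of 2 * flank_size bases around a site center."""
    center = (site_start + site_end) // 2  # assumed: integer midpoint
    window_start = center - flank_size
    window_end = center + flank_size
    if window_start < 0:
        raise ValueError("window extends past the start of the chromosome")
    return window_start, window_end


def row_length(flank_size, combine_strands=True):
    """Length of one row of the cut matrix for a given flank size."""
    per_strand = 2 * flank_size
    # Without strand aggregation the two strand vectors are concatenated.
    return per_strand if combine_strands else 2 * per_strand
```

For a 20 bp site at [1000, 1020) and a flank of 50, the window is [960, 1060); the row has 100 entries with strands combined and 200 otherwise.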
-
classmethod get_pwm_scores(motif_sites, score_column=4)
Retrieve PWM scores given a list of motif sites.
Gets PWM scores from a 2D list of motif site data in pseudo-BED format and stores them in a list.
- Args:
- motif_sites: A 2D list or iterator of motif sites in pseudo-BED format.
- score_column: The 2nd dimension index in ‘motif_sites’ that specifies the position in the list that contains the PWM score. By default it is assumed that this is position 4.
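As a rough illustration of the column lookup, the sketch below pulls index 4 from each row. The function name, the example rows, and the conversion to float are assumptions made for this example; the source does not say whether scores are returned as strings or numbers.

```python
def pwm_scores(motif_sites, score_column=4):
    """Collect the PWM score from each pseudo-BED row."""
    # Assumption: scores are converted to float; rows may hold strings.
    return [float(site[score_column]) for site in motif_sites]


# Hypothetical pseudo-BED rows: chrom, start, end, name, score, strand.
example_sites = [
    ["chr1", "100", "120", "CTCF", "12.7", "+"],
    ["chr1", "500", "520", "CTCF", "9.1", "-"],
]
scores = pwm_scores(example_sites)
```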
-
classmethod get_slopes(vector)
Extract slope values for segments of a given vector.
Computes the slope assuming that the segments of the given vector represent y-coordinate values and x-coordinate values are spaced 1 unit apart.
- Args:
- vector: An array-like of numpy arrays with numeric values.
- Returns:
- A numpy array with slope values for each segment in the given vector.
- Raises:
- AttributeError: Segments of vector are not numpy arrays.
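One way to read "slope of a segment with x values spaced 1 unit apart" is an ordinary least-squares fit of each segment against x = 0, 1, 2, and so on. The sketch below uses that reading in plain Python lists rather than numpy arrays; whether DeFCoM uses a least-squares fit or another slope estimate is not stated in the source, and `segment_slopes` is a hypothetical name.

```python
def segment_slopes(segments):
    """Least-squares slope of each segment, with x = 0, 1, 2, ... per segment."""
    slopes = []
    for ys in segments:
        n = len(ys)
        if n < 2:
            # A single point has no defined slope; 0.0 is an arbitrary choice here.
            slopes.append(0.0)
            continue
        x_mean = (n - 1) / 2.0
        y_mean = sum(ys) / float(n)
        numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
        denominator = sum((x - x_mean) ** 2 for x in range(n))
        slopes.append(numerator / denominator)
    return slopes
```

Rising counts give a positive slope and flat counts give zero, so the slopes summarize the shape of the cut profile within each segment.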
-
classmethod get_splits(length, partition_fn)
Get indices for dividing a vector into multiple parts.
Gets the indices for partitioning a vector into multiple subvectors based on a supplied partition function. If the partition function does not evenly divide the vector, the segments furthest from the vector center will include the remainder. This function is designed to easily create partitions whose sizes are approximately symmetric about the vector center but may vary in length otherwise. The list of indices matches the format required as an argument for numpy.split().
- Args:
- length: An int specifying the size of the vector to partition.
- partition_fn: A vectorized callable that accepts and returns int types. A passed int value denotes the distance from the center of the vector in partition function distance. A returned int should specify the size of the partition at the provided partition distance. We define a partition distance unit to be the number of partitions away from a specified point, where 0 denotes an initial partition that borders a specified point.
- Returns:
- An array containing the indices at which a vector of the size specified would be split based on the given partition function.
- Raises:
- AssertionError: The length supplied is not positive.
- TypeError: Partition function is not numpy vectorized.
- TypeError: The ‘length’ argument is not the correct type.
- ValueError: Non-positive value returned by partition function.
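The partitioning scheme is easier to follow with a worked sketch. The code below is one possible reading of the description, not the library's implementation: `symmetric_splits` and `_half_offsets` are hypothetical names, the partition function is called with plain ints rather than as a numpy-vectorized callable, the center is assumed to be `length // 2`, and errors are raised as ValueError or TypeError only.

```python
def _half_offsets(half_len, partition_fn):
    """Boundary offsets, measured outward from the center, for one half."""
    offsets, pos, distance = [], 0, 0
    while True:
        size = int(partition_fn(distance))
        if size <= 0:
            raise ValueError("partition function returned a non-positive size")
        if pos + size > half_len:
            break
        pos += size
        offsets.append(pos)
        distance += 1
    # Drop the outermost boundary: it is either the vector edge itself or the
    # start of a short remainder that is merged into the outermost segment.
    if offsets:
        offsets.pop()
    return offsets


def symmetric_splits(length, partition_fn):
    """Split indices for a vector of `length`, growing outward from the center."""
    if not isinstance(length, int):
        raise TypeError("'length' must be an int")
    if length <= 0:
        raise ValueError("'length' must be positive")
    center = length // 2  # assumed definition of the vector center
    left = [center - o for o in reversed(_half_offsets(center, partition_fn))]
    right = [center + o for o in _half_offsets(length - center, partition_fn)]
    middle = [center] if 0 < center < length else []
    return left + middle + right
```

For a vector of length 10 and a constant partition size of 2, this yields split points `[3, 5, 7]`, i.e. segments of sizes 3, 2, 2, 3, with the leftover base on each side absorbed by the outermost segment. A list in this form can be passed as the indices argument of numpy.split().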
defcom.genome module
-
class defcom.genome.Genome(fasta_file)
Bases: object
Access and process genomic sequence information.
Provides functionality to retrieve and process genomic sequence data. The class acts as a wrapper around the pysam FastaFile class.
- Attributes:
- _genome: A pysam FastaFile object.
- Args:
- fasta_file: A string containing the path and filename for a genome FASTA file. The file must have an index file with the same name but with the ‘.fai’ extension appended. For example, the index file for ‘/path/file.fa’ should be ‘/path/file.fa.fai’.
- Raises:
- IOError: FASTA file could not be found or opened.
- IOError: FASTA file index could not be found.
-
classmethod complement(sequence)
Get the complement of the given DNA sequence.
- Args:
- sequence: A string from alphabet {A,C,G,T,N,a,c,g,t,n}.
- Returns:
- A string with the complementary DNA sequence of the sequence provided.
- Raises:
- KeyError: The string provided contains characters other than ‘A’, ‘C’, ‘G’, ‘T’, ‘N’, ‘a’, ‘c’, ‘g’, ‘t’, and ‘n’.
-
get_chr_length(chrom)
Get the length of the specified chromosome.
- Args:
- chrom: A string specifying the chromosome of interest. Must follow the naming convention specified by the FASTA file.
- Returns:
- An int value representing the length of the chromosome.
- Raises:
- KeyError: Chromosome name could not be found.
-
get_chr_names()
Return the chromosome names associated with the genome.
- Returns:
- A tuple containing chromosome names as strings.
-
get_genome_filename()
Get the associated FASTA file name.
- Returns:
- A string for the FASTA file name.
-
get_sequence(chrom, start, end, strand='+')
Get nucleotide sequence from genomic coordinates.
Retrieves the nucleotide sequence for a specified genomic region using BED format chromosome coordinates, i.e., 0-based indexing [start, end).
- Args:
- chrom: A string specifying the chromosome of interest. Must follow the naming convention specified by the FASTA file.
- start: An int for the 0-based start position of the genomic region.
- end: An int for the 0-based end position of the genomic region. The position ‘end’ denotes one position past the region of interest.
- strand: The strand from which to obtain the sequence. Must be either ‘+’ or ‘-’.
- Returns:
- A string of uppercase nucleotides corresponding to the specified strand of the genomic region supplied.
- Raises:
- KeyError - The chromosome name is specified incorrectly.
- ValueError - The genomic coordinates are out of range, are invalid, or the file does not permit random access.
- ValueError - ‘strand’ is not specified correctly.
-
classmethod reverse_complement(sequence)
Reverse complement the given DNA sequence.
- Args:
- sequence: A string from alphabet {A,C,G,T,N,a,c,g,t,n}.
- Returns:
- A string with the reverse complement DNA sequence of the sequence provided.
- Raises:
- KeyError: The string provided contains characters other than ‘A’, ‘C’, ‘G’, ‘T’, ‘N’, ‘a’, ‘c’, ‘g’, ‘t’, and ‘n’.
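The complement and reverse complement operations can be sketched with a lookup table. This standalone version is illustrative and is not taken from the DeFCoM source; it follows the documented alphabet and lets the dict lookup raise KeyError for any other character.

```python
# Lookup table for the documented alphabet {A,C,G,T,N,a,c,g,t,n}.
_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "N": "N",
    "a": "t", "c": "g", "g": "c", "t": "a", "n": "n",
}


def complement(sequence):
    """Complement each base; unknown characters raise KeyError."""
    return "".join(_COMPLEMENT[base] for base in sequence)


def reverse_complement(sequence):
    """Complement the sequence and reverse it."""
    return complement(sequence)[::-1]
```

Case is preserved base by base, so mixed-case (soft-masked) input stays soft-masked in the output.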
defcom.motif_sites module
-
class defcom.motif_sites.MotifSites(motif_file)
Bases: object
Access and process pseudo-BED annotations for motif sites.
Provides functionality for working with motif predictions. Assumes that coordinates are 0-based [start, end).
- Attributes:
- _motif_sites: A pysam Tabixfile object.
- filename: The filename associated with the Tabixfile object.
- num_sites: Number of motif site annotations in the given file.
- Args:
- motif_file: A string specifying the path and filename of the pseudo-BED file containing motif sites. The file must be gzipped and have the ‘.gz’ file extension. It should also have an index file with the same name but with the ‘.tbi’ extension appended. For example, ‘/path/file.bed’ should be gzipped to ‘/path/file.bed.gz’ and have an index file called ‘/path/file.bed.gz.tbi’.
- Raises:
- IOError: File could not be found or opened.
- IOError: File index could not be found.
- IOError: Gzipped version of file could not be found.
-
get_all_sites()
Retrieve all motif sites in the MotifSites object.
Retrieves all the motif sites for the MotifSites object and stores them in an iterator.
- Returns:
- A pysam iterator containing motif site data.
-
get_chr_sites(chrom, multi_iter=False)
Retrieve all motif sites on a specified chromosome.
Retrieves all the motif sites in the MotifSites object from a specified chromosome and stores them in an iterator.
- Args:
- chrom: A string specifying the chromosome of interest. Must follow the naming convention specified by the pseudo-BED file.
- multi_iter: A boolean indicating whether to enable multiple iterators. This should be ‘True’ if multiple calls are made to this method concurrently for the same MotifSites object.
- Returns:
- A pysam iterator that contains motif site data.
- Raises:
- ValueError - The genomic coordinates are out of range or invalid.
-
get_sites(chrom, start, end, multi_iter=False)
Retrieve motif sites overlapping a genomic interval.
Retrieves the motif sites in the MotifSites object that overlap with a specified genomic interval. Does not require complete overlap.
- Args:
- chrom: A string specifying the chromosome of interest. Must follow the naming convention specified by the pseudo-BED file.
- start: An int for the 0-based start position of the genomic region.
- end: An int for the 0-based end position of the genomic region. The position ‘end’ denotes one position past the region of interest.
- multi_iter: A boolean indicating whether to enable multiple iterators. This should be ‘True’ if multiple calls are made to this method concurrently for the same MotifSites object.
- Returns:
- A pysam iterator that contains motif site data.
- Raises:
- ValueError - The genomic coordinates are out of range or invalid.
-
subsample_sites(n_samples, iterations=1)
Choose a random subset of motif sites.
Chooses a random subset of motif sites, assuming that n_samples is less than the total number of sites. If n_samples is greater than or equal to the total number of sites, the whole set is returned.
- Args:
- n_samples: The number of sites to subsample.
- Returns:
- A 2D list containing pseudo-BED format data for the sites selected.
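The subsampling rule can be sketched with the standard library. `subsample` is a hypothetical helper operating on an in-memory list; the `seed` parameter is added here only to make the example reproducible, and the ‘iterations’ argument of the real method is not modeled because the source does not describe it.

```python
import random


def subsample(sites, n_samples, seed=None):
    """Return a random subset of `sites`, or all of them if n_samples is too large."""
    sites = list(sites)
    if n_samples >= len(sites):
        # Nothing to subsample: return the whole set.
        return sites
    rng = random.Random(seed)
    # Sampling without replacement, so no site is chosen twice.
    return rng.sample(sites, n_samples)
```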
defcom.performance_stats module
-
defcom.performance_stats.pAUC(y_true, y_score, fpr_cutoff)
Calculate partial Area Under ROC (pAUC).
Computes a pAUC value given a specified false positive rate (FPR) cutoff. It is important to note that the exact pAUC cannot be computed. The accuracy of the calculation depends on the resolution of data points produced by an intermediary ROC curve. The FPR data point closest to and greater than the cutoff specified will be used for interpolation to determine the pAUC at the specified FPR cutoff. For these FPR values, the highest associated TPR values are used.
- Args:
- y_true: Array-like of true binary class labels in range {0, 1} or {-1, 1} corresponding to y_score. The larger value represents the positive class.
- y_score: Array-like of target scores with higher scores indicating more confidence in the positive class.
- fpr_cutoff: A float specifying the FPR cutoff to use in computing the pAUC. Must be in the interval (0,1).
- Returns:
- A float representing the pAUC value.
- Raises:
- AssertionError: The FPR cutoff is not in the interval (0,1).
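To make the interpolation step concrete, the sketch below builds ROC points in plain Python and integrates them with the trapezoid rule up to the cutoff, interpolating linearly between the two ROC points that bracket it. This is an illustrative reading, not DeFCoM's code: the function names are hypothetical, the linear interpolation may differ in detail from the library's handling of tied FPR values, and ValueError is raised here where the documented function raises AssertionError.

```python
def roc_points(y_true, y_score):
    """Return ROC points (fpr, tpr) from high to low score threshold."""
    positive = max(y_true)  # the larger label is the positive class
    n_pos = sum(1 for y in y_true if y == positive)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Process tied scores together so they yield a single ROC point.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / float(n_neg), tp / float(n_pos)))
    return points


def partial_auc(y_true, y_score, fpr_cutoff):
    """Trapezoidal area under the ROC curve for FPR in [0, fpr_cutoff]."""
    if not 0.0 < fpr_cutoff < 1.0:
        raise ValueError("fpr_cutoff must be in the interval (0, 1)")
    points = roc_points(y_true, y_score)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 <= fpr_cutoff:
            area += (x1 - x0) * (y0 + y1) / 2.0
        else:
            # Interpolate the TPR at the cutoff, add the partial trapezoid, stop.
            y_cut = y0 + (y1 - y0) * (fpr_cutoff - x0) / (x1 - x0)
            area += (fpr_cutoff - x0) * (y0 + y_cut) / 2.0
            break
    return area
```

Because the area is measured only up to the cutoff, a perfect classifier scores `fpr_cutoff` rather than 1, so values are comparable only at the same cutoff.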