API documentation¶
Major modules¶
aldy.gene module¶
-
class
aldy.gene.
CNConfig
(cn: List[Dict[str, int]], kind: aldy.gene.CNConfigType, alleles: Set[str], description: str = '')[source]¶ Bases:
object
Copy-number (CN) configuration description. Immutable.
Parameters: - cn – Value of the expected region copy number in each gene. For example, cn[0][“e1”] == 1 means that the exon 1 of the main gene (ID 0) has one copy (and thus should be present) in the configuration cn.
- kind – Type of the copy-number configuration.
- alleles – Allele IDs that have this CN configuration.
- description – Human-readable description (e.g. “deletion”).
-
description
= ''¶
-
class
aldy.gene.
CNConfigType
[source]¶ Bases:
enum.Enum
Enumeration describing the type of a copy-number configuration
-
DEFAULT
= 0¶
-
DELETION
= 3¶
-
LEFT_FUSION
= 1¶
-
RIGHT_FUSION
= 2¶
-
-
class
aldy.gene.
Gene
(path: Optional[str], name: Optional[str] = None, yml: Optional[str] = None, genome: Optional[str] = None)[source]¶ Bases:
object
Gene (and associated pseudogenes) description relative to a reference genome.
-
alleles
= None¶ Major star-alleles in the gene.
-
aminoacid
= None¶ Aminoacid sequence of the main gene.
-
chr
= None¶ Reference genome chromosome.
-
chr_to_ref
= None¶ Position mapping from the reference to the RefSeq sequence.
-
cn_configs
= None¶ Copy-number configurations associated with the gene. 1 (akin to *1) is the default CN configuration (no structural variations present).
-
common_tandems
= None¶ List of common allele tandems. Used in diplotype assignment heuristics. For example, the fact that *13 is always followed by *1 (encoded as (‘13’, ‘1’)) will be used to group *13 and *1 together within the same haplotype (e.g. *13+*1).
-
deletion_allele
() → Optional[str][source]¶ Returns: The ID of the deletion allele (or None if gene has no such allele).
-
do_copy_number
= None¶ Set if the gene is subject to structural variations.
-
ensembl
= None¶ ENSEMBL gene ID.
-
exons
= None¶ List of RefSeq exon coordinates.
-
genome
= None¶ Reference genome version (e.g., hg19).
-
get_functional
(mut, infer=True) → Optional[str][source]¶ Returns: String describing the mutation effect if a mutation is functional; otherwise None.
-
name
= None¶ Gene name (e.g. _CYP2D6_).
-
pharmvar
= None¶ PharmVar ID.
-
pseudogenes
= None¶ Pseudogene names. A pseudogene has ID greater than zero.
-
random_mutations
= None¶ Set of mutations that can occur in any allele.
-
ref_to_chr
= None¶ Position mapping from the RefSeq sequence to the reference.
-
refseq
= None¶ RefSeq sequence ID.
-
region_at
(pos: int) → Optional[Tuple[int, str]][source]¶ Returns: Gene ID and a region that covers the position.
-
regions
= None¶ Collection of regions (names and ranges) for each gene in the database. Maps a gene ID to a dictionary that maps a gene region name (e.g. “e9” for exon 9) to its region in the reference genome. Gene 0 is the main gene.
-
seq
= None¶ RefSeq sequence (typically *1 allele).
-
strand
= None¶ RefSeq sequence strand within the reference genome.
-
unique_regions
= None¶ List of genic regions used for copy-number calling.
-
version
= None¶ Database version.
-
-
class
aldy.gene.
MajorAllele
(name: str, cn_config: str = '1', func_muts: Set[aldy.gene.Mutation] = <factory>, minors: Dict[str, aldy.gene.MinorAllele] = <factory>)[source]¶ Bases:
object
Major allele description.
Parameters: - name – Major allele name.
- cn_config – Copy-number configuration.
- func_muts – Functional mutations that describe the major allele.
- minors – Minor alleles (suballeles) that are derived from this major allele.
-
cn_config
= '1'¶
-
class
aldy.gene.
MinorAllele
(name: str, alt_name: Optional[str] = None, neutral_muts: Set[aldy.gene.Mutation] = <factory>, activity: Optional[str] = None, evidence: Optional[str] = None, pharmvar: Optional[str] = None)[source]¶ Bases:
object
Minor allele description.
Parameters: - name – Minor allale name.
- alt_name – List of alternative names.
- neutral_muts – Netural mutations that describe the minor allele.
- activity – Activity indicator (see PharmVar for details).
- evidence – Evidence indicator (see PharmVar for details).
- pharmvar – PharmVar ID.
-
activity
= None¶
-
alt_name
= None¶
-
evidence
= None¶
-
pharmvar
= None¶
-
class
aldy.gene.
Mutation
[source]¶ Bases:
tuple
Mutation description. Immutable.
Parameters: - pos – Reference genome position (0-based).
- op – Variant operation in HGVS-like format:
- X>Y: a SNP from X to Y
- insX: an insertion of X
- delX: a deletion of X (X, Y are both of format [ACGT]+).
-
op
¶ Alias for field number 1
-
pos
¶ Alias for field number 0
aldy.profile module¶
-
class
aldy.profile.
Profile
(name, cn_region=None, data=None, **kwargs)[source]¶ Bases:
object
Profile and model parameter information.
-
cn_diff
= None¶ The first CN objective term (coverage fit) coefficient. Default: 10.0
-
cn_fit
= None¶ The second CN objective term (gene fit) coefficient. Default: 1.0
-
cn_fusion_left
= None¶ Extra penalty for the left fusions. Default: 0.5.
-
cn_fusion_right
= None¶ Extra penalty for the right fusions. Default: 0.25.
-
cn_max
= None¶ Maximum possible copy number of a gene. Default: 20
-
cn_parsimony
= None¶ The third CN objective term (max. parsimony) coefficient. Default: 0.5
-
cn_pce_penalty
= None¶ Error penalty applied to the PCE region during CN calling (1 for no penalty). Default: 2.0
-
cn_region
= None¶ Location of the copy-number neutral region.
-
cn_solution
= None¶ User-specified copy-number configuration. Default: None (uses CN solver in
aldy.cn
for detection)
-
data
= None¶ Profile coverage data.
-
display_format
= None¶ New novel allele display format. Default: False
-
gap
= None¶ Optimality gap. Non-zero values enable non-optimal solutions. Default: 0 (only optimal solutions)
-
static
get_sam_profile_data
(sam_path: str, ref_path: Optional[str] = None, regions: Dict[Tuple[str, str, int], aldy.common.GRange] = {}, cn_region: Optional[aldy.common.GRange] = None, genome: Optional[str] = 'hg19', params: Dict[KT, VT] = {}) → Dict[str, Dict[str, List[float]]][source]¶ Load the profile information from a SAM/BAM/CRAM file.
Parameters: - regions – List of regions to be extracted.
- cn_region – Copy-number neutral region.
Returns: list of tuples (gene_name, chromosome, loci, coverage).
Note
Profile samples used in the original Aldy paper:
- PGRNseq-v1/v3: NA17642
- PGRNseq-v2: NA19789
- Illumina: by definition contains all ones (uniform coverage).
-
static
load
(gene, profile, cn_region=None, **params)[source]¶ Load the copy number profile and parameters from a profile file.
Parameters: - gene – Gene instance.
- profile – A profile YAML or a SAM/BAM/CRAM file that contains profile data.
- cn_region – Copy-number neutral region.
-
major_novel
= None¶ Penalty for novel functional mutation (0 for no penalty). Should be large enough to avoid calling novel mutations unless really necessary. Default: 21.0 (i.e., max_cn + 1)
-
male
= None¶ Set if the sample is male (i.e., has two X chromosomes). Used for calling sex chromosome genes (e.g., G6PD) when the CN calling is disabled. Default: False
-
max_minor_solutions
= None¶ Maximum number of minor solutions to report for each major solution. Default: 1
-
min_coverage
= None¶ Minimum coverage needed to call a variant. Default: 2 (5 for illumina/wgs)
-
min_mapq
= None¶ Minimum mapping quality for a read to be considered. Default: 10
-
min_quality
= None¶ Minimum base quality for a read base to be considered. Default: 10
-
minor_add
= None¶ Penalty for novel minor mutations (0 for no penalty). Zero penalty ensures that extra mutations are preferred over the coverage errors if the normalized variant slack coverage is >= 50%. Penalty of 1.0 prefers additions only if the variant slack coverage is >= 75%. Default: 1.0
-
minor_miss
= None¶ Penalty for missed minor mutations (0 for no penalty). Ideally larger than minor_add as additions should be cheaper. Default: 1.5
-
minor_phase
= None¶ The minor star-allele calling model’s phasing term coefficient. Default: 0.4
-
minor_phase_vars
= None¶ Number of variables to use during the phasing. Use lower number if the model takes too long to complete. Default: 3,000
-
name
= None¶ Name of the profile.
-
neutral_value
= None¶ Joint coverage of the copy-number neutral region. Default: 0 (typically specified in the profile’s YAML file)
-
phase
= None¶ Set if the phasing model in
aldy.minor
is to be used. Default: True
-
sam_long_reads
= None¶ Set if long reads should be split-mapped. Should be used when dealing with long PacBio or Nanopore reads Default: False (typically specified in the profile’s YAML file)
-
sam_mappy_preset
= None¶ Mappy preset to use for split-mapping. Default: map-hifi
-
threshold
= None¶ Single-copy variant threshold. Its value indicate the fraction of total reads that contain the given variant in a single gene copy. For example, if two copies are given (maternal and paternal), and if the single copy coverage is 10, threshold of 0.5 will ensure that all variants with coverage less than 5 (i.e., 0.5 * 10) are filtered out. Default: 0.5
-
aldy.sam module¶
aldy.coverage module¶
-
class
aldy.coverage.
Coverage
(gene: aldy.gene.Gene, profile: aldy.profile.Profile, sam, coverage: Dict[int, Dict[str, List[T]]], indel_coverage: Optional[Dict[KT, VT]], cnv_coverage: Dict[int, int])[source]¶ Bases:
object
Data structure that maintains the coverage information for a given sample.
-
filtered
(filter_fn: Callable[[Any, aldy.gene.Mutation], List[T]])[source]¶ Parameters: filter_fn – Function that performs mutation filtering with the following arguments:
- mut (
aldy.gene.Mutation
): mutation to be filtered - cov (float): coverage of the mutation
- total (float): total coverage of the mutation locus
- thres (float): filtering threshold
filter_fn returns False if a mutation should be filtered out.
Returns: Filtered coverage. - mut (
-
percentage
(m: aldy.gene.Mutation) → float[source]¶ Returns: Mutation coverage expressed as percentage (0-100%).
-
Genotyping modules¶
aldy.genotype module¶
aldy.cn module¶
-
aldy.cn.
estimate_cn
(gene: aldy.gene.Gene, profile: aldy.profile.Profile, coverage: Optional[aldy.coverage.Coverage], solver: str, debug: Optional[str] = None) → List[aldy.solutions.CNSolution][source]¶ Estimate the optimal copy number configuration for a sample given a gene and coverage information.
Parameters: - gene – Gene instance.
- profile – Profile instance.
- coverage – Read coverage instance.
- solver – ILP solver (see
aldy.lpinterface
for supported solvers). - debug – When set, create a {debug}.cn.lp model description file for debug purposes. Default: None (no debug dumps).
Returns: List of optimal copy number configurations.
-
aldy.cn.
solve_cn_model
(gene: aldy.gene.Gene, profile: aldy.profile.Profile, cn_configs: Dict[str, aldy.gene.CNConfig], max_cn: int, region_coverage: Dict[str, Tuple[float, float]], solver: str, debug: Optional[str] = None, fusion_support: Optional[Dict[str, float]] = None) → List[aldy.solutions.CNSolution][source]¶ Solve the copy number estimation problem (instance of the closest vector problem).
Parameters: - gene – Gene instance.
- profile – Profile instance.
- cn_configs – Available copy number configurations (vectors).
- max_cn – Maximum allowed copy number.
- region_coverage – Observed copy number of the main gene and the pseudogene for each genic region.
- solver – ILP solver (see
aldy.lpinterface
for supported solvers). - gap – Optimality gap. Non-zero values enable non-optimal solutions. Default: 0 (only optimal solutions).
- debug – When set, create a {debug}.cn.lp model description file for debug purposes. Default: None (no debug dumps).
- fusion_support – Dictionary that contains read support of each available fusion. Used only for the long-read fusion calling. Default is None (all fusions are treated equally).
Returns: List of optimal copy-number solutions.
Note
Please see Aldy paper (section Methods/Copy number and structural variation estimation) for the detailed description of ILP model.
aldy.major module¶
-
aldy.major.
estimate_major
(gene: aldy.gene.Gene, coverage: aldy.coverage.Coverage, cn_solution: aldy.solutions.CNSolution, solver: str, identifier: int = 0, debug: Optional[str] = None) → List[aldy.solutions.MajorSolution][source]¶ Estimate optimal major star-alleles.
Parameters: - gene – Gene instance.
- profile – Profile instance.
- coverage – Read coverage instance.
- cn_solution – Copy-number solution for major star-allele calling.
- solver – ILP solver (see
aldy.lpinterface
for supported solvers). - identifier – Unique solution identifier. Used for debug purposes. Default: 0.
- debug – When set, create a {debug}.major{identifier}.lp model description file for debug purposes. Default: None (no debug dumps).
-
aldy.major.
solve_major_model
(gene: aldy.gene.Gene, coverage: aldy.coverage.Coverage, cn_solution: aldy.solutions.CNSolution, allele_dict: Dict[str, aldy.gene.MajorAllele], solver: str, identifier: int = 0, debug: Optional[str] = None) → List[aldy.solutions.MajorSolution][source]¶ Solves the major star-allele detection problem via integer linear programming.
Parameters: - gene – Gene instance.
- coverage – Read coverage instance.
- cn_solution – Copy-number solution for major star-allele calling.
- allele_dict – Candidate major star-alleles.
- solver – ILP solver (see
aldy.lpinterface
for supported solvers). - identifier – Unique solution identifier. Used for debug purposes. Default: 0.
- debug – When set, create a {debug}.major{identifier}.lp model description file for debug purposes. Default: None (no debug dumps).
Note
Please see Aldy paper (section Methods/Major star-allele identification) for the model explanation.
aldy.minor module¶
-
aldy.minor.
estimate_minor
(gene: aldy.gene.Gene, coverage: aldy.coverage.Coverage, major_sols: List[aldy.solutions.MajorSolution], solver: str, max_solutions: int = 1, novel: bool = False) → List[aldy.solutions.MinorSolution][source]¶ Estimate the optimal minor star-allele.
Parameters: - gene – Gene instance.
- coverage – Read coverage instance.
- major_sol – Major allele solution for minor star-allele calling.
- solver – ILP solver (see
aldy.lpinterface
for supported solvers). - max_solutions – Maximum number of solutions to report. Default: 1.
- novel – Include non-database functional and silent mutations in the model. Default: False.
-
aldy.minor.
solve_minor_model
(gene: aldy.gene.Gene, coverage: aldy.coverage.Coverage, major_sol: aldy.solutions.MajorSolution, alleles_list: List[aldy.solutions.SolvedAllele], mutations: Set[aldy.gene.Mutation], solver: str, max_solutions: int = 1) → List[aldy.solutions.MinorSolution][source]¶ Solves the minor star-allele detection problem via integer linear programming.
Parameters: - gene – Gene instance.
- coverage – Read coverage instance.
- major_sol – Major allele solution for minor star-allele calling.
- alleles_list – Candidate minor star-alleles.
- mutations – Mutations to consider for model building (all other mutations are ignored).
- solver – ILP solver (see
aldy.lpinterface
for supported solvers). - max_solutions – Maximum number of solutions to report. Default: 1.
Note
Please see Aldy paper (section Methods/Genotype refining) for the model explanation. Currently returns only the first optimal solution.
aldy.diplotype module¶
-
aldy.diplotype.
OUTPUT_COLS
= ['Sample', 'Gene', 'SolutionID', 'Major', 'Minor', 'Copy', 'Allele', 'Location', 'Type', 'Coverage', 'Effect', 'dbSNP', 'Code', 'Status']¶ Output column descriptions
-
aldy.diplotype.
estimate_diplotype
(gene: aldy.gene.Gene, solution: aldy.solutions.MinorSolution) → str[source]¶ Calculate the diplotype assignment for a minor solution. Set the diplotype attribute of the
aldy.solutions.MinorSolution
.Uses the diplotype assignment heuristics to assign correct diplotypes. This heuristics has no biological validity whatsoever—it is purely used for pretty-printing the final solutions.
Returns: Diplotype assignment.
-
aldy.diplotype.
write_decomposition
(sample: str, gene: aldy.gene.Gene, coverage: aldy.coverage.Coverage, sol_id: int, minor: aldy.solutions.MinorSolution, f)[source]¶ Write an allelic decomposition to the given file.
Parameters: - sample – Sample name.
- gene – Gene instance.
- sol_id – Solution ID (each solution must have a different ID).
- minor – Minor star-allele solution to be written.
- f – Output file.
-
aldy.diplotype.
write_vcf
(sample: str, gene: aldy.gene.Gene, coverage: aldy.coverage.Coverage, minors: List[aldy.solutions.MinorSolution], f)[source]¶ Write an allelic decomposition in the VCF format to the given file.
Parameters: - sample – Sample name.
- gene – Gene instance.
- sol_id – Solution ID (each solution must have a different ID).
- minor – Minor star-allele solution to be written.
- f – Output file.
Auxiliary modules¶
aldy.common module¶
-
class
aldy.common.
GRange
[source]¶ Bases:
aldy.common.GRange
Reference genome range (e.g. chr22:10-20). Immutable.
-
class
aldy.common.
JsonDict
[source]¶ Bases:
dict
Dictionary that adds a dictionary for each missing key. Used to ease handling and populating JSON objects.
-
aldy.common.
PROTEINS
= {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': 'X', 'TAC': 'Y', 'TAG': 'X', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': 'X', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}¶ Codon table (stop codon is X).
-
aldy.common.
REV_COMPLEMENT
= {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}¶ Reverse-complement DNA table.
-
aldy.common.
SOLUTION_PRECISION
= 0.01¶ Solution precision (all values whose absolute difference falls below the specified precision are considered equal).
-
class
aldy.common.
Timing
(name='Block', fn=None)[source]¶ Bases:
object
Context manager for timing code blocks. Prints the time spent in the function after it is completed.
-
aldy.common.
allele_name
(x: str) → str[source]¶ Returns: Major allele number of the star-allele name (e.g. ‘12A’ -> 12).
-
aldy.common.
chr_prefix
(ch: str, chrs: List[str]) → str[source]¶ Check if a chromosome needs “chr” prefix given the available chromosomes. :returns: Chromosome prefix if the chromosome does not have it.
-
aldy.common.
colorize
(text: str, color: str = 'green') → str[source]¶ Returns: xterm-compatible colorized string with a given color.
-
aldy.common.
log
= <logbook.base.Logger object>¶ Default console logger.
-
aldy.common.
parse_cn_region
(cn_region)[source]¶ Returns: GRange
object that represents the user-provided CN region in Samtools format (i.e., chr1:100-200).Raises: aldy.common.AldyException
if the region is invalid.
-
aldy.common.
script_path
(key: str) → str[source]¶ Obtain the full path of a resource.
Parameters: - key – resource to be extracted.
- key – resource to be extracted in path/file format (e.g., aldy.resources/test.txt).
Returns: Full path of the resource.
Raises: aldy.common.AldyException
if the resource does not exist.
aldy.lpinterface module¶
-
class
aldy.lpinterface.
CBC
(name)[source]¶ Bases:
aldy.lpinterface.Gurobi
Wrapper around CBC’s Python interface (Google’s ortools).
-
addVar
(*_, **kwargs)[source]¶ Add a variable to the model.
vtype is the variable type:
- B for binary variable
- I for integer variable
- C or nothing for continuous variable.
-
getValue
(var)[source]¶ Get the value of the solved variable. Automatically adjusts the return type based on the variable type.
-
quicksum
(expr)[source]¶ Perform a quick summation of the iterable expr. Much faster than Python’s sum on large iterables.
-
solve
(init: Optional[Callable] = None) → Tuple[str, float][source]¶ Solve the model. Assumes that the objective is set.
Additional parameters of the solver can be set via init function that takes the model instance as the sole argument.
Returns: Status of the solution and the objective value. Raise: NoSolutionsError
if the model is infeasible.
-
-
class
aldy.lpinterface.
Gurobi
(name, prev_model=None)[source]¶ Bases:
object
Wrapper around Gurobi’s Python interface (
gurobipy
).-
abssum
(vars: Iterable[T_co], coeffs: Optional[Dict[str, float]] = None)[source]¶ - Return the absolute sum of vars: e.g.
- \(\sum_i |c_i x_i|\) for the set \({x_1,...}\).
where \(c_i\) is defined in the coeffs dictionary.
Key of the coeffs dictionary stands for the name of the variable (should be accessible via varName call; 1 if not defined).
-
addVar
(*args, **kwargs)[source]¶ Add a variable to the model.
vtype is the variable type:
- B for binary variable
- I for integer variable
- C or nothing for continuous variable.
-
getValue
(var)[source]¶ Get the value of the solved variable. Automatically adjusts the return type based on the variable type.
-
prod
(res, terms)[source]¶ Ensure that \(res = \prod terms\) (where terms is a sequence of binary variables) by adding the appropriate linear constraints. Returns res.
-
quicksum
(expr: Iterable[T_co])[source]¶ Perform a quick summation of the iterable expr. Much faster than Python’s sum on large iterables.
-
solutions
(gap: float = 0, best_obj: Optional[float] = None, limit=None, iteration=0, init: Optional[Callable] = None)[source]¶ Solve the model and returns the list of all optimal solutions. Assumes that the objective is set. Any solution whose score is less than (1 + gap) times the optimal solution score will be included.
A solution is defined as a dictionary of set binary variables within the solution that are accessed by their name.
Additional parameters of the solver can be set via init function that takes the model instance as the sole argument.
This is a generic version that supports any solver.
Yields: Status of the solution, the objective value and the solution itself.
-
solve
(init: Optional[Callable] = None) → Tuple[str, float][source]¶ Solve the model. Assumes that the objective is set.
Additional parameters of the solver can be set via init function that takes the model instance as the sole argument.
Returns: Status of the solution and the objective value. Raise: NoSolutionsError
if the model is infeasible.
-
-
exception
aldy.lpinterface.
NoSolutionsError
[source]¶ Bases:
Exception
Raised if a model is infeasible.
-
aldy.lpinterface.
SOLVER_PRECISON
= 1e-05¶ Default solver precision