API documentation
Major modules
aldy.gene module
- class aldy.gene.CNConfig(cn: List[Dict[str, int]], kind: CNConfigType, alleles: Set[str], description: str = '')[source]
Bases: object
Copy-number (CN) configuration description. Immutable.
- Parameters:
cn – Expected copy number of each region in each gene. For example, cn[0]["e1"] == 1 means that exon 1 of the main gene (ID 0) has one copy (and thus should be present) in the configuration cn.
kind – Type of the copy-number configuration.
alleles – Allele IDs that have this CN configuration.
description – Human-readable description (e.g. “deletion”).
- alleles: Set[str]
- cn: List[Dict[str, int]]
- description: str = ''
- kind: CNConfigType
- property vector
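The layout of the cn field can be illustrated with a small stand-in. This is a minimal sketch and not Aldy's own CNConfig class: the name CNConfigSketch, the helper present_regions, the region name "e2" and the allele IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


# Hypothetical stand-in that mirrors the documented fields; not Aldy's class.
@dataclass(frozen=True)
class CNConfigSketch:
    cn: Tuple[Dict[str, int], ...]  # one {region name: copy number} map per gene ID
    alleles: FrozenSet[str]         # allele IDs that use this configuration
    description: str = ""


# Gene 0 is the main gene and gene 1 a pseudogene (region names are made up).
default = CNConfigSketch(
    cn=({"e1": 1, "e2": 1}, {"e1": 1, "e2": 1}),
    alleles=frozenset({"1", "2"}),
    description="default",
)
deletion = CNConfigSketch(
    cn=({"e1": 0, "e2": 0}, {"e1": 1, "e2": 1}),
    alleles=frozenset({"5"}),
    description="deletion",
)


def present_regions(config: CNConfigSketch, gene_id: int = 0) -> List[str]:
    """Return the regions of the given gene that have at least one copy."""
    return sorted(r for r, copies in config.cn[gene_id].items() if copies > 0)
```

With these toy values, default.cn[0]["e1"] == 1 reads as "exon 1 of the main gene has one copy", while the deletion configuration leaves no main-gene region present.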
- class aldy.gene.CNConfigType(*values)[source]
Bases: Enum
Enumeration describing the type of a copy-number configuration.
- CUSTOM = 4
- DEFAULT = 0
- DELETION = 3
- LEFT_FUSION = 1
- RIGHT_FUSION = 2
- class aldy.gene.Gene(path: str | None, name: str | None = None, yml: str | None = None, genome: str | None = None)[source]
Bases: object
Gene (and associated pseudogenes) description relative to a reference genome.
- alleles: Dict[str, MajorAllele]
Major star-alleles in the gene.
- aminoacid: str
Aminoacid sequence of the main gene.
- chr: str
Reference genome chromosome.
- chr_to_ref: Dict[int, int]
Position mapping from the reference to the RefSeq sequence.
- cn_configs: Dict[str, CNConfig]
Copy-number configurations associated with the gene. 1 (akin to *1) is the default CN configuration (no structural variations present).
- common_tandems: List[Tuple]
List of common allele tandems. Used in diplotype assignment heuristics. For example, the fact that *13 is always followed by *1 (encoded as (‘13’, ‘1’)) will be used to group *13 and *1 together within the same haplotype (e.g. *13+*1).
- deletion_allele() str | None[source]
- Returns:
The ID of the deletion allele (or None if gene has no such allele).
- do_copy_number: bool
Set if the gene is subject to structural variations.
- ensembl: str | None
ENSEMBL gene ID.
- exons: List[Tuple[int, int]]
List of RefSeq exon coordinates.
- genome: str
Reference genome version (e.g., hg19).
- get_functional(mut, infer=True) str | None[source]
- Returns:
String describing the mutation effect if a mutation is functional; otherwise None.
- has_coverage(a: str, pos: int)[source]
- Returns:
True if a major allele contains the given mutation.
- mutations: Dict[Tuple[int, str], Tuple[str | None, str, int, int, str]]
Maps (position, mutation) to the corresponding Mutation.
- name: str
Gene name (e.g. _CYP2D6_).
- pharmvar: str | None
PharmVar ID.
- pseudogenes: List[str]
Pseudogene names. A pseudogene has ID greater than zero.
- ref_to_chr: Dict[int, int]
Position mapping from the RefSeq sequence to the reference.
- refseq: str
RefSeq sequence ID.
- region_at(pos: int) Tuple[int, str] | None[source]
- Returns:
Gene ID and a region that covers the position.
- regions: List[Dict[str, GRange]]
Collection of regions (names and ranges) for each gene in the database. Maps a gene ID to a dictionary that maps a gene region name (e.g. “e9” for exon 9) to its region in the reference genome. Gene 0 is the main gene.
- seq: str
RefSeq sequence (typically *1 allele).
- strand: int
RefSeq sequence strand within the reference genome.
- unique_regions: List[str]
List of genic regions used for copy-number calling.
- version: str
Database version.
- class aldy.gene.MajorAllele(name: str, cn_config: str = '1', func_muts: ~typing.Set[~aldy.gene.Mutation] = <factory>, minors: ~typing.Dict[str, ~aldy.gene.MinorAllele] = <factory>)[source]
Bases: object
Major allele description.
- Parameters:
name – Major allele name.
cn_config – Copy-number configuration.
func_muts – Functional mutations that describe the major allele.
minors – Minor alleles (suballeles) that are derived from this major allele.
- cn_config: str = '1'
- minors: Dict[str, MinorAllele]
- name: str
- class aldy.gene.MinorAllele(name: str, alt_name: str | None = None, neutral_muts: ~typing.Set[~aldy.gene.Mutation] = <factory>, _activity: str | None = 'unknown', evidence: str | None = None, pharmvar: str | None = None)[source]
Bases: object
Minor allele description.
- Parameters:
name – Minor allele name.
alt_name – List of alternative names.
neutral_muts – Neutral mutations that describe the minor allele.
activity – Activity indicator (see PharmVar for details).
evidence – Evidence indicator (see PharmVar for details).
pharmvar – PharmVar ID.
- property activity
- alt_name: str | None = None
- evidence: str | None = None
- name: str
- pharmvar: str | None = None
- class aldy.gene.Mutation(pos: int, op: str)[source]
Bases: NamedTuple
Mutation description. Immutable.
- Parameters:
pos – Reference genome position (0-based).
op – Variant operation in HGVS-like format, where X and Y both match [ACGT]+:
X>Y: a SNP from X to Y
insX: an insertion of X
delX: a deletion of X
- op: str
Alias for field number 1
- pos: int
Alias for field number 0
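The op format described above can be checked with a short parser. This is a sketch under stated assumptions: MutationSketch and classify are hypothetical names, and the regular expression encodes only the three forms listed in this documentation, not necessarily every form Aldy accepts.

```python
import re
from typing import NamedTuple, Tuple


class MutationSketch(NamedTuple):
    """Hypothetical stand-in with the same two fields as the documented tuple."""
    pos: int  # 0-based reference genome position
    op: str   # "X>Y", "insX" or "delX"


# One alternative per documented form; X and Y are runs of A, C, G, T.
_OP = re.compile(
    r"^(?:(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)"
    r"|ins(?P<inserted>[ACGT]+)"
    r"|del(?P<deleted>[ACGT]+))$"
)


def classify(mut: MutationSketch) -> Tuple[str, str]:
    """Return (kind, sequence) for a mutation, or raise ValueError."""
    m = _OP.match(mut.op)
    if m is None:
        raise ValueError(f"unrecognised operation: {mut.op!r}")
    if m.group("ref") is not None:
        return "snp", m.group("alt")
    if m.group("inserted") is not None:
        return "insertion", m.group("inserted")
    return "deletion", m.group("deleted")
```

Because the stand-in is a NamedTuple, mut[0] and mut.pos refer to the same field, which matches the "Alias for field number" notes above.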
aldy.profile module
- class aldy.profile.Profile(name, cn_region=None, data=None, **kwargs)[source]
Bases: object
Profile and model parameter information.
- cn_diff
The first CN objective term (coverage fit) coefficient. Default: 10.0
- cn_fit
The second CN objective term (gene fit) coefficient. Default: 1.0
- cn_fusion_left
Extra penalty for the left fusions. Default: 0.5.
- cn_fusion_right
Extra penalty for the right fusions. Default: 0.25.
- cn_max
Maximum possible copy number of a gene. Default: 20
- cn_parsimony
The third CN objective term (max. parsimony) coefficient. Default: 0.5
- cn_pce_penalty
Error penalty applied to the PCE region during CN calling (1 for no penalty). Default: 2.0
- cn_region
Location of the copy-number neutral region.
- cn_solution
User-specified copy-number configuration. Default: None (uses the CN solver in aldy.cn for detection)
- data
Profile coverage data.
- debug_novel
(Debug) Show potential novel functional mutations that are not in the database.
- debug_probe
(Debug) Show raw data for a given mutation (e.g., I223M)
- display_format
New novel allele display format. Default: False
- gap
Optimality gap. Non-zero values enable non-optimal solutions. Default: 0 (only optimal solutions)
- static get_sam_profile_data(sam_path: str, ref_path: str | None = None, regions: Dict[Tuple[str, str, int], GRange] = {}, cn_region: GRange | None = None, genome: str | None = 'hg19', params: Dict = {}) Dict[str, Dict[str, List[float]]][source]
Load the profile information from a SAM/BAM/CRAM file.
- Parameters:
regions – List of regions to be extracted.
cn_region – Copy-number neutral region.
- Returns:
list of tuples (gene_name, chromosome, loci, coverage).
Note
Profile samples used in the original Aldy paper:
PGRNseq-v1/v3: NA17642
PGRNseq-v2: NA19789
Illumina: by definition contains all ones (uniform coverage).
- indelpost
Use indelpost for indel realignment (unless sam_long_reads is set). Default: True
- static load(gene, profile, cn_region=None, **params)[source]
Load the copy number profile and parameters from a profile file.
- Parameters:
gene – Gene instance.
profile – A profile YAML or a SAM/BAM/CRAM file that contains profile data.
cn_region – Copy-number neutral region.
- major_novel
Penalty for a novel functional mutation (0 for no penalty). Should be large enough to avoid calling novel mutations unless really necessary. Default: 21.0 (i.e., cn_max + 1)
- male
Set if the sample is male (i.e., has one X chromosome). Used for calling sex chromosome genes (e.g., G6PD) when the CN calling is disabled. Default: False
- max_minor_solutions
Maximum number of minor solutions to report for each major solution. Default: 1
- min_avg_coverage
Minimum average gene coverage needed for Aldy. Default: 2
- min_coverage
Minimum coverage needed to call a variant. Default: 2 (5 for illumina/wgs)
- min_mapq
Minimum mapping quality for a read to be considered. Default: 10
- min_quality
Minimum base quality for a read base to be considered. Default: 10
- minor_add
Penalty for novel minor mutations (0 for no penalty). Zero penalty ensures that extra mutations are preferred over the coverage errors if the normalized variant slack coverage is >= 50%. Penalty of 1.0 prefers additions only if the variant slack coverage is >= 75%. Default: 1.0
- minor_miss
Penalty for missed minor mutations (0 for no penalty). Ideally larger than minor_add as additions should be cheaper. Default: 1.5
- minor_phase
The minor star-allele calling model’s phasing term coefficient. Default: 0.4
- minor_phase_vars
Number of variables to use during phasing. Use a lower number if the model takes too long to complete. Default: 3,000
- name
Name of the profile.
- neutral_value
Joint coverage of the copy-number neutral region. Default: 0 (typically specified in the profile’s YAML file)
- phase
Set if the phasing model in aldy.minor is to be used. Default: True
- sam_long_reads
Set if long reads should be split-mapped. Should be used when dealing with long PacBio or Nanopore reads. Default: False (typically specified in the profile’s YAML file)
- sam_mappy_preset
Mappy preset to use for split-mapping. Default: map-hifi
- threshold
Single-copy variant threshold. Its value indicates the fraction of total reads that contain the given variant in a single gene copy. For example, if two copies are given (maternal and paternal), and if the single-copy coverage is 10, a threshold of 0.5 will ensure that all variants with coverage less than 5 (i.e., 0.5 * 10) are filtered out. Default: 0.5
- vcf_sample_idx
VCF sample index. Default: 0
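The arithmetic in the threshold description can be written out directly. This is a minimal sketch of that one example, not Aldy's filtering code; the function name passes_threshold is an assumption.

```python
def passes_threshold(variant_cov: float, single_copy_cov: float,
                     threshold: float = 0.5) -> bool:
    """Keep a variant only if its coverage reaches threshold * single-copy coverage."""
    return variant_cov >= threshold * single_copy_cov


# Documented example: single-copy coverage 10 and threshold 0.5 give a cutoff of 5.
cutoff = 0.5 * 10
kept = [cov for cov in (2, 4.9, 5, 12) if passes_threshold(cov, 10)]
```

Here kept contains 5 and 12, since only variants with coverage below the cutoff of 5 are dropped.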
aldy.sam module
- class aldy.sam.Sample(gene: Gene, profile: Profile | None, path: str, reference: str | None = None, debug: str | None = None, store_reads: bool = False)[source]
Bases: object
Parse read alignments in a SAM/BAM/CRAM/VCF/dump format.
- gene
Gene instance.
- is_long_read
Set if long-read data is used.
- name
Sample name.
- path
File path.
- phaseable
Locations that should be phased.
- phases: Dict[str, Dict[int, str]]
Phasing information.
- profile
Profile instance.
aldy.coverage module
- class aldy.coverage.Coverage(gene: Gene, profile: Profile, sam, coverage: Dict[int, Dict[str, List]], indel_coverage: Dict | None, cnv_coverage: Dict[int, int])[source]
Bases: object
Data structure that maintains the coverage information for a given sample.
- filtered(filter_fn: Callable[[Any, Mutation], List])[source]
- Parameters:
filter_fn – Function that performs mutation filtering with the following arguments:
mut (aldy.gene.Mutation): mutation to be filtered
cov (float): coverage of the mutation
total (float): total coverage of the mutation locus
thres (float): filtering threshold
filter_fn returns False if a mutation should be filtered out.
- Returns:
Filtered coverage.
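A filter function with the four documented arguments might look like the sketch below. It is only an illustration: the Mutation tuple is a local stand-in, and treating thres as a minimum fraction of the locus coverage is an assumption, since the documentation above does not say how Aldy interprets the value.

```python
from typing import Dict, NamedTuple, Tuple


class Mutation(NamedTuple):
    """Local stand-in for aldy.gene.Mutation (same two fields)."""
    pos: int
    op: str


def min_fraction_filter(mut: Mutation, cov: float, total: float, thres: float) -> bool:
    """Return False (filter out) if too few reads at the locus support the mutation.

    Assumption: thres is read as a minimum fraction of the total locus coverage.
    """
    if total <= 0:
        return False
    return cov / total >= thres


# Toy data: mutation -> (coverage of the mutation, total coverage of its locus).
observed: Dict[Mutation, Tuple[float, float]] = {
    Mutation(100, "C>T"): (12.0, 30.0),
    Mutation(250, "delA"): (1.0, 28.0),
}
kept = {
    mut: cov
    for mut, (cov, total) in observed.items()
    if min_fraction_filter(mut, cov, total, thres=0.1)
}
```

Only the first mutation survives in this toy data set (12/30 is above 0.1, 1/28 is below it).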
Genotyping modules
aldy.genotype module
- aldy.genotype.genotype(gene_db: str, sam_path: str, profile_name: str | None, output_file: ~typing.Any | None = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, cn_region: ~aldy.common.GRange | None = None, cn_solution: ~typing.List[str] | None = None, solver: str = 'any', reference: str | None = None, debug: str | None = None, multiple_warn_level: int = 1, report: bool = False, genome=None, is_simple: bool = False, **params) Dict[str, List[MinorSolution]][source]
Genotype a sample.
- Parameters:
gene_db – Gene name (if it is shipped with Aldy) or the location of the gene database in YAML format.
sam_path – Location of SAM/BAM/CRAM file that is to be analyzed.
profile_name – Coverage profile (e.g. WGS). None if cn_solution is provided.
output_file – Location of the output file. Use None for no output. Default: sys.stdout.
cn_region – Copy-number neutral region. Default: None (uses the provided CYP2D8 region).
cn_solution – List of the copy number configurations. Copy-number detection will not run if this parameter is set. Default: None.
solver – ILP solver (see aldy.lpinterface for supported solvers).
reference – Reference genome (for reading CRAM files). Default: None.
debug – Prefix for debug and core dump files. Default: None (no debug information).
multiple_warn_level – Warning level (1 for optimal solutions, 2 for major solutions and 3 for CN solutions). Default: 1 (warn only on multiple optimal solutions).
report – If set, write the solution summary to the stderr. Default: False.
genome – Reference genome (e.g., hg19 or hg38). Default: None (auto-detect).
is_simple – Use simple output format. Default: False.
params – Model parameters. See aldy.profile for details.
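A call to genotype could be assembled as in the sketch below, which uses only keyword names from the signature above. The helper names genotype_arguments and run_genotype, the file name sample.bam and the profile name "wgs" are assumptions for illustration; whether a given gene or profile name is accepted depends on the installed Aldy version.

```python
from typing import Any, Dict, Optional


def genotype_arguments(sam_path: str, gene: str = "CYP2D6",
                       profile: Optional[str] = "wgs",
                       reference: Optional[str] = None) -> Dict[str, Any]:
    """Collect keyword arguments named after the documented signature."""
    return {
        "gene_db": gene,          # gene name or path to a YAML gene database
        "sam_path": sam_path,     # SAM/BAM/CRAM file to analyse
        "profile_name": profile,  # coverage profile
        "output_file": None,      # None suppresses the output file
        "solver": "any",          # documented default
        "reference": reference,   # only needed for CRAM input
    }


def run_genotype(sam_path: str):
    """Run the genotyper; requires Aldy and an ILP solver to be installed."""
    # Imported lazily so that this sketch can be loaded without Aldy installed.
    from aldy.genotype import genotype
    return genotype(**genotype_arguments(sam_path))


args = genotype_arguments("sample.bam")  # hypothetical input file
```

According to the signature, the call returns a dictionary that maps strings to lists of MinorSolution objects.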
aldy.cn module
- aldy.cn.estimate_cn(gene: Gene, profile: Profile, coverage: Coverage | None, solver: str, debug: str | None = None) List[CNSolution][source]
Estimate the optimal copy number configuration for a sample given a gene and coverage information.
- Parameters:
gene – Gene instance.
profile – Profile instance.
coverage – Read coverage instance.
solver – ILP solver (see aldy.lpinterface for supported solvers).
debug – When set, create a {debug}.cn.lp model description file for debug purposes. Default: None (no debug dumps).
- Returns:
List of optimal copy number configurations.
- aldy.cn.solve_cn_model(gene: Gene, profile: Profile, cn_configs: Dict[str, CNConfig], max_cn: int, region_coverage: Dict[str, Tuple[float, float]], solver: str, debug: str | None = None, fusion_support: Dict[str, float] | None = None) List[CNSolution][source]
Solve the copy number estimation problem (instance of the closest vector problem).
- Parameters:
gene – Gene instance.
profile – Profile instance.
cn_configs – Available copy number configurations (vectors).
max_cn – Maximum allowed copy number.
region_coverage – Observed copy number of the main gene and the pseudogene for each genic region.
solver – ILP solver (see aldy.lpinterface for supported solvers).
gap – Optimality gap. Non-zero values enable non-optimal solutions. Default: 0 (only optimal solutions).
debug – When set, create a {debug}.cn.lp model description file for debug purposes. Default: None (no debug dumps).
fusion_support – Dictionary that contains read support of each available fusion. Used only for the long-read fusion calling. Default is None (all fusions are treated equally).
- Returns:
List of optimal copy-number solutions.
Note
Please see the Aldy paper (section Methods/Copy number and structural variation estimation) for a detailed description of the ILP model.
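The "closest vector" view of copy-number estimation can be illustrated by brute force on a toy instance. This sketch is not the ILP model used by Aldy: it ignores the parsimony and fusion penalty terms, enumerates every combination instead of calling a solver, and uses made-up configuration names and vectors.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple


def closest_configs(configs: Dict[str, List[int]], observed: List[float],
                    max_cn: int) -> Tuple[Tuple[str, ...], float]:
    """Pick up to max_cn configurations whose summed vector is closest (L1) to observed."""
    best: Tuple[str, ...] = ()
    best_err = float("inf")
    names = sorted(configs)
    for k in range(1, max_cn + 1):
        for combo in combinations_with_replacement(names, k):
            # Expected copy number per region if these configurations are present.
            total = [sum(configs[n][i] for n in combo) for i in range(len(observed))]
            err = sum(abs(t - o) for t, o in zip(total, observed))
            if err < best_err:
                best, best_err = combo, err
    return best, best_err


# Three regions of the main gene; the toy "fusion" lacks the last region.
configs = {"1": [1, 1, 1], "fusion": [1, 1, 0]}
solution, error = closest_configs(configs, observed=[2.1, 1.9, 1.0], max_cn=3)
```

For this input the best fit is one default copy plus one fusion copy, whose summed vector [2, 2, 1] is within about 0.2 of the observed coverage.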
aldy.major module
- aldy.major.estimate_major(gene: Gene, coverage: Coverage, cn_solution: CNSolution, solver: str, identifier: int = 0, debug: str | None = None) List[MajorSolution][source]
Estimate optimal major star-alleles.
- Parameters:
gene – Gene instance.
profile – Profile instance.
coverage – Read coverage instance.
cn_solution – Copy-number solution for major star-allele calling.
solver – ILP solver (see aldy.lpinterface for supported solvers).
identifier – Unique solution identifier. Used for debug purposes. Default: 0.
debug – When set, create a {debug}.major{identifier}.lp model description file for debug purposes. Default: None (no debug dumps).
- aldy.major.solve_major_model(gene: Gene, coverage: Coverage, cn_solution: CNSolution, allele_dict: Dict[str, MajorAllele], solver: str, identifier: int = 0, debug: str | None = None) List[MajorSolution][source]
Solves the major star-allele detection problem via integer linear programming.
- Parameters:
gene – Gene instance.
coverage – Read coverage instance.
cn_solution – Copy-number solution for major star-allele calling.
allele_dict – Candidate major star-alleles.
solver – ILP solver (see aldy.lpinterface for supported solvers).
identifier – Unique solution identifier. Used for debug purposes. Default: 0.
debug – When set, create a {debug}.major{identifier}.lp model description file for debug purposes. Default: None (no debug dumps).
Note
Please see the Aldy paper (section Methods/Major star-allele identification) for the model explanation.
aldy.minor module
- aldy.minor.estimate_minor(gene: Gene, coverage: Coverage, major_sols: List[MajorSolution], solver: str, max_solutions: int = 1) List[MinorSolution][source]
Estimate the optimal minor star-allele.
- Parameters:
gene – Gene instance.
coverage – Read coverage instance.
major_sols – Major allele solutions for minor star-allele calling.
solver – ILP solver (see aldy.lpinterface for supported solvers).
max_solutions – Maximum number of solutions to report. Default: 1.
- aldy.minor.solve_minor_model(gene: Gene, coverage: Coverage, major_sol: MajorSolution, alleles_list: List[SolvedAllele], mutations: Set[Mutation], solver: str, max_solutions: int = 1) List[MinorSolution][source]
Solves the minor star-allele detection problem via integer linear programming.
- Parameters:
gene – Gene instance.
coverage – Read coverage instance.
major_sol – Major allele solution for minor star-allele calling.
alleles_list – Candidate minor star-alleles.
mutations – Mutations to consider for model building (all other mutations are ignored).
solver – ILP solver (see aldy.lpinterface for supported solvers).
max_solutions – Maximum number of solutions to report. Default: 1.
Note
Please see the Aldy paper (section Methods/Genotype refining) for the model explanation. Currently returns only the first optimal solution.
aldy.diplotype module
- aldy.diplotype.OUTPUT_COLS = ['Sample', 'Gene', 'SolutionID', 'Major', 'Minor', 'Copy', 'Allele', 'Location', 'Type', 'Coverage', 'Effect', 'dbSNP', 'Code', 'Status']
Output column descriptions
- aldy.diplotype.estimate_cpic(gene: Gene, solution: MinorSolution) str[source]
Calculate the CPIC functionality for a minor solution.
- Returns:
CPIC functionality.
- aldy.diplotype.estimate_diplotype(gene: Gene, solution: MinorSolution) str[source]
Calculate the diplotype assignment for a minor solution. Set the diplotype attribute of the aldy.solutions.MinorSolution.
Uses the diplotype assignment heuristics to assign correct diplotypes. These heuristics have no biological validity whatsoever; they are used purely for pretty-printing the final solutions.
- Returns:
Diplotype assignment.
- aldy.diplotype.write_decomposition(sample: str, gene: Gene, coverage: Coverage, sol_id: int, minor: MinorSolution, f)[source]
Write an allelic decomposition to the given file.
- Parameters:
sample – Sample name.
gene – Gene instance.
sol_id – Solution ID (each solution must have a different ID).
minor – Minor star-allele solution to be written.
f – Output file.
- aldy.diplotype.write_vcf(sample: str, gene: Gene, coverage: Coverage, minors: List[MinorSolution], f)[source]
Write an allelic decomposition in the VCF format to the given file.
- Parameters:
sample – Sample name.
gene – Gene instance.
minors – Minor star-allele solutions to be written.
f – Output file.
Auxiliary modules
aldy.common module
- class aldy.common.GRange(chr, start, end)[source]
Bases: GRange
Reference genome range (e.g. chr22:10-20). Immutable.
- class aldy.common.JsonDict[source]
Bases: dict
Dictionary that adds a dictionary for each missing key. Used to ease handling and populating JSON objects.
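The behaviour described for JsonDict (a missing key yields a new nested dictionary) can be reproduced with the standard dict.__missing__ hook. The class NestedDict below is a hypothetical sketch of that idea and is not taken from Aldy's source.

```python
import json


class NestedDict(dict):
    """Dictionary that creates and stores a nested NestedDict for each missing key."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        value = self[key] = type(self)()
        return value


report = NestedDict()
# Intermediate levels are created on first access, so no setup code is needed.
report["sample"]["gene"]["name"] = "CYP2D6"
report["sample"]["gene"]["copies"] = 2
text = json.dumps(report, sort_keys=True)
```

Since the class is a dict subclass, json.dumps serialises it like any other dictionary.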
- aldy.common.PROTEINS = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': 'X', 'TAC': 'Y', 'TAG': 'X', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': 'X', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}
Codon table (stop codon is X).
- aldy.common.REV_COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
Reverse-complement DNA table.
- aldy.common.SOLUTION_PRECISION = 0.01
Solution precision (all values whose absolute difference falls below the specified precision are considered equal).
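The three constants above lend themselves to a short worked example. The sketch below rebuilds the codon table from the standard genetic code so that it runs without Aldy (stop codons are X, as in the table shown above); the helper names translate, reverse_complement and same_score are assumptions and are not part of aldy.common.

```python
from itertools import product

# Standard genetic code in TCAG order, with stop codons written as X.
BASES = "TCAG"
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
PROTEINS = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
REV_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SOLUTION_PRECISION = 0.01


def translate(seq: str) -> str:
    """Translate complete codons of seq; a trailing partial codon is ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(PROTEINS[seq[i:i + 3]] for i in range(0, usable, 3))


def reverse_complement(seq: str) -> str:
    """Complement each base and reverse the sequence."""
    return "".join(REV_COMPLEMENT[base] for base in reversed(seq))


def same_score(a: float, b: float) -> bool:
    """Treat two values as equal if they differ by less than the precision."""
    return abs(a - b) < SOLUTION_PRECISION
```

For instance, translate("ATGGCCTAA") gives "MAX" (methionine, alanine, stop) and reverse_complement("ATGC") gives "GCAT".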
- class aldy.common.Timing(name='Block', fn=None)[source]
Bases: object
Context manager for timing code blocks. Prints the time spent in the function after it is completed.
- aldy.common.allele_name(x: str) str[source]
- Returns:
Major allele number of the star-allele name (e.g. ‘12A’ -> 12).
- aldy.common.chr_prefix(ch: str, chrs: List[str]) str[source]
Check if a chromosome needs “chr” prefix given the available chromosomes.
- Returns:
Chromosome prefix if the chromosome does not have it.
- aldy.common.colorize(text: str, color: str = 'green') str[source]
- Returns:
xterm-compatible colorized string with a given color.
- aldy.common.log = <logbook.base.Logger object>
Default console logger.
- aldy.common.parse_cn_region(cn_region)[source]
- Returns:
GRange object that represents the user-provided CN region in Samtools format (i.e., chr1:100-200).
- Raises:
aldy.common.AldyException if the region is invalid.
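Parsing a Samtools-style region string such as chr1:100-200 can be sketched as follows. This is an illustration only: Region and parse_region are hypothetical names, ValueError stands in for aldy.common.AldyException, and the start <= end check is an assumption about what counts as invalid.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical stand-in for a genomic range."""
    chr: str
    start: int
    end: int


_REGION = re.compile(r"([^:\s]+):(\d+)-(\d+)")


def parse_region(text: str) -> Region:
    """Parse 'chrom:start-end' or raise ValueError if the string is malformed."""
    match = _REGION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"invalid region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"region start exceeds end: {text!r}")
    return Region(chrom, start, end)


region = parse_region("chr1:100-200")
```

The chromosome name is kept exactly as given; whether a "chr" prefix should be added or removed is a separate decision (see chr_prefix above).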
- aldy.common.script_path(key: str) str[source]
Obtain the full path of a resource.
- Parameters:
key – resource to be extracted in path/file format (e.g., aldy.resources/test.txt).
- Returns:
Full path of the resource.
- Raises:
aldy.common.AldyException if the resource does not exist.
aldy.lpinterface module
- class aldy.lpinterface.CBC(name)[source]
Bases: Gurobi
Wrapper around CBC’s Python interface (Google’s ortools).
- addVar(*_, **kwargs)[source]
Add a variable to the model.
vtype is the variable type:
B for binary variable
I for integer variable
C or nothing for continuous variable.
- getValue(var)[source]
Get the value of the solved variable. Automatically adjusts the return type based on the variable type.
- quicksum(expr)[source]
Perform a quick summation of the iterable expr. Much faster than Python’s sum on large iterables.
- solve(init: Callable | None = None) Tuple[str, float][source]
Solve the model. Assumes that the objective is set.
Additional parameters of the solver can be set via init function that takes the model instance as the sole argument.
- Returns:
Status of the solution and the objective value.
- Raises:
NoSolutionsError if the model is infeasible.
- class aldy.lpinterface.Gurobi(name, prev_model=None)[source]
Bases: object
Wrapper around Gurobi’s Python interface (gurobipy).
- abssum(vars: Iterable, coeffs: Dict[str, float] | None = None)[source]
Return the absolute sum of vars, e.g. \(\sum_i |c_i x_i|\) for the set \(\{x_1, \ldots\}\), where \(c_i\) is defined in the coeffs dictionary.
Each key of the coeffs dictionary is the name of a variable (should be accessible via a varName call); \(c_i\) is 1 if not defined.
- addVar(*args, **kwargs)[source]
Add a variable to the model.
vtype is the variable type:
B for binary variable
I for integer variable
C or nothing for continuous variable.
- getValue(var)[source]
Get the value of the solved variable. Automatically adjusts the return type based on the variable type.
- prod(res, terms)[source]
Ensure that \(res = \prod terms\) (where terms is a sequence of binary variables) by adding the appropriate linear constraints. Returns res.
- quicksum(expr: Iterable)[source]
Perform a quick summation of the iterable expr. Much faster than Python’s sum on large iterables.
- solutions(gap: float = 0, best_obj: float | None = None, limit=None, iteration=0, init: Callable | None = None)[source]
Solve the model and return the list of all optimal solutions. Assumes that the objective is set. Any solution whose score is less than (1 + gap) times the optimal solution score will be included.
A solution is defined as a dictionary of set binary variables within the solution that are accessed by their name.
Additional parameters of the solver can be set via init function that takes the model instance as the sole argument.
This is a generic version that supports any solver.
- Yields:
Status of the solution, the objective value and the solution itself.
- solve(init: Callable | None = None) Tuple[str, float][source]
Solve the model. Assumes that the objective is set.
Additional parameters of the solver can be set via init function that takes the model instance as the sole argument.
- Returns:
Status of the solution and the objective value.
- Raises:
NoSolutionsError if the model is infeasible.
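The prod method above states that a product of binary variables is enforced through linear constraints. The sketch below checks the textbook linearization by enumeration: res <= x_i for every term, and res >= sum(x_i) - (n - 1). Whether Aldy adds exactly these constraints is an assumption; the documentation says only that "appropriate linear constraints" are used.

```python
from itertools import product
from typing import Sequence


def feasible(res: int, terms: Sequence[int]) -> bool:
    """Check the linear constraints that tie res to the product of binary terms."""
    upper_ok = all(res <= x for x in terms)           # res can be 1 only if every term is 1
    lower_ok = res >= sum(terms) - (len(terms) - 1)   # res must be 1 if every term is 1
    return upper_ok and lower_ok


def linearization_is_exact(n: int) -> bool:
    """Verify that, for every binary assignment, the only feasible res is the product."""
    for terms in product((0, 1), repeat=n):
        allowed = [res for res in (0, 1) if feasible(res, terms)]
        if allowed != [int(all(terms))]:
            return False
    return True


checked = all(linearization_is_exact(n) for n in range(1, 6))
```

The enumeration confirms that, for one to five binary terms, these constraints leave the product as the only feasible value of res, so no non-linear term is needed in the model.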
- exception aldy.lpinterface.NoSolutionsError[source]
Bases: Exception
Raised if a model is infeasible.
- aldy.lpinterface.SOLVER_PRECISON = 1e-05
Default solver precision
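The gap parameter of Gurobi.solutions, together with a solver precision, can be illustrated with a plain filter over scored candidates. This is a sketch under stated assumptions: it presumes a minimisation objective with non-negative scores, and it adds the precision as a tolerance so that the optimum itself is retained when gap is 0; the function name within_gap is hypothetical.

```python
from typing import Dict, List, Tuple

Scored = Tuple[float, Dict[str, int]]


def within_gap(scored: List[Scored], gap: float = 0.0,
               precision: float = 1e-5) -> List[Scored]:
    """Keep solutions whose score is at most (1 + gap) times the best score."""
    best = min(score for score, _ in scored)
    cutoff = (1 + gap) * best + precision
    return [(score, sol) for score, sol in scored if score <= cutoff]


candidates: List[Scored] = [
    (10.0, {"A": 1}),
    (10.0, {"B": 1}),
    (11.5, {"C": 1}),
    (13.0, {"D": 1}),
]
optimal_only = within_gap(candidates)            # gap 0: the two tied optima
near_optimal = within_gap(candidates, gap=0.2)   # cutoff is about 12.0
```

With gap=0.2 the third candidate (score 11.5) is also reported, while the fourth (13.0) stays above the cutoff.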