variant_extractor API

class variant_extractor.VariantExtractor(vcf_file: str, pass_only=False, ensure_pairs=True, fasta_ref: str | None = None)

Bases: object

Reads and extracts variants from VCF files. This class is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis.

__init__(vcf_file: str, pass_only=False, ensure_pairs=True, fasta_ref: str | None = None)
Parameters:
  • vcf_file (str) – A VCF-formatted file. The file is opened automatically.

  • pass_only (bool, optional) – If True, only records with PASS filter will be considered.

  • ensure_pairs (bool, optional) – If True, raises an exception if a breakend is missing its pair when all other breakends were paired successfully.

  • fasta_ref (str, optional) – A FASTA file with the reference genome. Must be indexed.

close()

Closes the VCF file.

static empty_dataframe(extra_fields=[])

Returns an empty pandas DataFrame with the columns used by this class.

to_dataframe(extra_fields=[])

Returns a pandas DataFrame with the variants extracted from the VCF file. The columns are:

  • start_chrom – chromosome of the start position

  • start – start position of the variant

  • end_chrom – chromosome of the end position

  • end – end position of the variant

  • ref – reference allele

  • alt – alternative allele

  • length – length of the variant (0 for insertions)

  • brackets – breakend brackets for breakend SVs (or equivalent for indels or shorthand SVs)

  • type_inferred – inferred type of the variant (see VariantType)

The DataFrame can be extended with extra fields from the VariantRecord by passing their names in the extra_fields parameter. For example, passing 'id' will add the id field to the DataFrame. If 'variant_record_obj' is passed in extra_fields, the original VariantRecord object will be added to the DataFrame in a column named 'variant_record_obj'.
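A minimal usage sketch of the constructor, to_dataframe() and close() as documented above. It assumes the package is installed and that vcf_path points to a readable VCF file; the helper name load_variants and its arguments are hypothetical and not part of the library.

```python
def load_variants(vcf_path, pass_only=True):
    """Return the variants in `vcf_path` as a pandas DataFrame.

    `load_variants` is a hypothetical helper, not part of variant_extractor.
    """
    # Imported inside the function so the helper can be defined even where
    # the package is not installed; the import runs on the first call.
    from variant_extractor import VariantExtractor

    extractor = VariantExtractor(vcf_path, pass_only=pass_only)
    try:
        # 'id' adds the record identifier as an extra column,
        # as described for the extra_fields parameter.
        return extractor.to_dataframe(extra_fields=["id"])
    finally:
        # Close the VCF file whether or not extraction succeeded.
        extractor.close()
```

Wrapping the call in try/finally is a design choice of this sketch, so that the file is closed even if extraction raises.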

class variant_extractor.variants.BreakendSVRecord(prefix: str | None, bracket: str, contig: str, pos: int, suffix: str | None)

Bases: NamedTuple

NamedTuple with the information of an SV record in breakend notation

bracket: str

Bracket of the SV record with breakend notation. For example, for G]17:198982] the bracket will be ]

contig: str

Contig of the SV record with breakend notation. For example, for G]17:198982] the contig will be 17

pos: int

Position of the SV record with breakend notation. For example, for G]17:198982] the position will be 198982

prefix: str | None

Prefix of the SV record with breakend notation. For example, for G]17:198982] the prefix will be G

suffix: str | None

Suffix of the SV record with breakend notation. For example, for G]17:198982] the suffix will be None
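To make the five fields concrete, the following sketch splits a breakend ALT string such as G]17:198982] into the same parts. It is an illustration only and not the library's parser: the names BreakendParts and parse_breakend and the regular expression are assumptions introduced here.

```python
import re
from typing import NamedTuple, Optional


class BreakendParts(NamedTuple):
    """Hypothetical stand-in with the same fields as BreakendSVRecord."""
    prefix: Optional[str]
    bracket: str
    contig: str
    pos: int
    suffix: Optional[str]


# prefix, opening bracket, contig:pos, the same bracket again, suffix
_BREAKEND_RE = re.compile(r"^([^\[\]]*)([\[\]])(.+):(\d+)\2([^\[\]]*)$")


def parse_breakend(alt: str) -> Optional[BreakendParts]:
    """Split a breakend ALT string, or return None if it does not match."""
    match = _BREAKEND_RE.match(alt)
    if match is None:
        return None
    prefix, bracket, contig, pos, suffix = match.groups()
    # Empty prefix/suffix strings become None, as in the documented example.
    return BreakendParts(prefix or None, bracket, contig, int(pos), suffix or None)
```

With this sketch, parse_breakend("G]17:198982]") yields prefix 'G', bracket ']', contig '17', pos 198982 and suffix None, matching the examples given for each field.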

class variant_extractor.variants.ShorthandSVRecord(type: str, extra: List[str] | None)

Bases: NamedTuple

NamedTuple with the information of a shorthand SV record

extra: List[str] | None

Extra information of the SV. For example, for <DUP:TANDEM:AA> the extra will be ['TANDEM', 'AA']

type: str

One of 'DEL', 'INS', 'DUP', 'INV' or 'CNV'
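The two fields can be illustrated with a short sketch that splits a shorthand ALT string such as <DUP:TANDEM:AA>. This is not the library's parser: ShorthandParts and parse_shorthand are hypothetical names, and returning None for extra when no subtypes follow the type is an assumption of this sketch.

```python
from typing import List, NamedTuple, Optional

# The SV types listed for the `type` field.
_SV_TYPES = {"DEL", "INS", "DUP", "INV", "CNV"}


class ShorthandParts(NamedTuple):
    """Hypothetical stand-in with the same fields as ShorthandSVRecord."""
    type: str
    extra: Optional[List[str]]


def parse_shorthand(alt: str) -> Optional[ShorthandParts]:
    """Split a shorthand ALT string, or return None if it does not match."""
    if not (alt.startswith("<") and alt.endswith(">")):
        return None
    # The first colon-separated token is the type; the rest are extras.
    sv_type, *extra = alt[1:-1].split(":")
    if sv_type not in _SV_TYPES:
        return None
    # Assumption: an SV with no subtypes has extra set to None.
    return ShorthandParts(sv_type, extra or None)
```

For <DUP:TANDEM:AA> the sketch returns type 'DUP' and extra ['TANDEM', 'AA'], as in the example above.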

class variant_extractor.variants.VariantRecord(rec: VariantRecord, contig: str, pos: int, end: int, length: int, id: str | None, ref: str, alt: str, variant_type: VariantType, alt_sv_breakend: BreakendSVRecord | None = None, alt_sv_shorthand: ShorthandSVRecord | None = None)

Bases: object

Class with the information of a variant record

__init__(rec: VariantRecord, contig: str, pos: int, end: int, length: int, id: str | None, ref: str, alt: str, variant_type: VariantType, alt_sv_breakend: BreakendSVRecord | None = None, alt_sv_shorthand: ShorthandSVRecord | None = None)
alt: str

Alternative sequence

alt_sv_breakend: BreakendSVRecord | None

Breakend SV info, present only for SVs with breakend notation. For example, G]17:198982]

alt_sv_shorthand: ShorthandSVRecord | None

Shorthand SV info, present only for SVs with shorthand notation. For example, <DUP:TANDEM>

contig: str

Contig name

end: int

End position of the variant in the contig (same as pos for TRA and SNV)

filter: List[str | int]

Filter status. PASS if this position has passed all filters. Otherwise, it contains the filters that failed

property format

Specifies data types and order of the genotype information

id: str | None

Record identifier

property info

Additional information

length: int

Length of the variant

pos: int

Position of the variant in the contig

qual: float | None

Quality score for the assertion made in ALT

ref: str

Reference sequence

property samples

Genotype information for each sample

variant_type: VariantType

Variant type

class variant_extractor.variants.VariantType(*values)

Bases: Enum

Enumeration with the different types of variations

CNV = 6
DEL = 2
DUP = 4
INS = 3
INV = 5
SGL = 8
SNV = 1
TRA = 7
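Because VariantType is a standard Enum, members can be looked up by value or by name. The sketch below uses a local mirror of the values listed above (the class name VariantTypeSketch is hypothetical) so that it runs without the package; the same lookups should apply to the real enumeration.

```python
from enum import Enum


class VariantTypeSketch(Enum):
    """Local mirror of the documented VariantType values, for illustration."""
    SNV = 1
    DEL = 2
    INS = 3
    DUP = 4
    INV = 5
    CNV = 6
    TRA = 7
    SGL = 8


# Lookup by numeric value and by member name.
by_value = VariantTypeSketch(7)
by_name = VariantTypeSketch["DEL"]

# Iteration follows definition order, which is handy for reports.
names_in_order = [member.name for member in VariantTypeSketch]
```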