VariantExtractor

Deterministic and standard extractor of indels, SNVs and structural variants (SVs) from VCF files built under the frame of EUCANCan’s second work package. VariantExtractor is a Python package (requires Python version 3.6 or higher) and provides a set of data structures and functions to extract variants from VCF files in a deterministic and standard way while adding information to facilitate afterwards processing. It homogenizes multiallelic variants, MNPs and SVs (including imprecise paired breakends and single breakends). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the available documentation for more information.

While there is somewhat of an agreement on how to label the SNVs and indels variants, this is not the case for the structural variants. In the current scenario, different labeling between variant callers makes comparisons between structural variants difficult. This package provides an unified interface to extract variants (included structural variants) from VCFs generated by different variant callers. Apart from reading the VCF file, VariantExtractor adds a preprocessing layer to homogenize the variants extracted from the file. This way, the variants can be used in downstream analysis in a consistent way. For more information about the homogenization process, check the homogenization rules section.

Table of contents

Getting started

Installation

VariantExtractor is available on PyPI and can be installed using pip:

pip install variant-extractor

Usage

# Import the package
from variant_extractor import VariantExtractor

# Create a new instance of the class
extractor = VariantExtractor('/path/to/file.vcf')
# Iterate through the variants
for variant_record in extractor:
    print(f'Found variant of type {variant_record.variant_type.name}: {variant_record.contig}:{variant_record.pos}')
# Import the package
from variant_extractor import VariantExtractor

# Create a new instance of the class
extractor = VariantExtractor('/path/to/file.vcf')

# Save variants to a CSV file
extractor.to_dataframe().drop(['variant_record_obj'], axis=1).to_csv('/path/to/output.csv', index=False)

For a more complete list of examples, check the examples directory. This folder also includes an example of a script for normalizing VCF files following the homogenization rules.

VariantRecord

The VariantExtractor constructor returns a generator of VariantRecord instances. The VariantRecord class is a container for the information contained in a VCF record plus some extra useful information.

Property

Type

Description

contig

str

Contig name

pos

int

Position on the contig

end

int

End position of the variant in the contig (same as pos for TRA and SNV)

length

int

Length of the variant

id

Optional[str]

Record identifier

ref

str

Reference sequence

alt

str

Alternative sequence

qual

Optional[float]

Quality score for the assertion made in ALT

filter

List[str]

Filter status. PASS if this position has passed all filters. Otherwise, it contains the filters that failed

info

Dict[str, Any]

Additional information

format

List[str]

Specifies data types and order of the genotype information

samples

Dict[str, Dict[str, Any]]

Genotype information for each sample

variant_type

VariantType

Variant type inferred

alt_sv_breakend

Optional[BreakendSVRecord]

Breakend SV info, present only for SVs with breakend notation. For example, G]17:198982]

alt_sv_shorthand

Optional[ShorthandSVRecord]

Shorthand SV info, present only for SVs with shorthand notation. For example, <DUP:TANDEM>

VariantType

The VariantType enum describes the type of the variant. For structural variants, it is inferred only from the breakend notation (or shorthand notation). It does not take into account any INFO field (SVTYPE nor EVENTYPE) that might be added by the variant caller afterwards.

REF

ALT

Variant name

Description

A

G

SNV

Single nucleotide variant

AGTG

A

DEL

Deletion

A

A[1:20[ or <DEL>

DEL

Deletion

A

ACCT or <INS>

INS

Insertion

A

]1:20]A or <DUP>

DUP

Duplication

A

A]1:20] or [1:20[A

INV

Inversion. <INV> is a special case

A

<CNV>

CNV

Copy number variation

A

A]X:20] or A[X:20[ or ]X:20]A or [X:20[A

TRA

Translocation

A

A. or .A

SGL

Single breakend

BreakendSVRecord

The BreakendSVRecord class is a container for the information contained in a VCF record for SVs with breakend notation.

Property

Type

Description

prefix

Optional[str]

Prefix of the SV record with breakend notation. For example, for G]17:198982] the prefix will be G

bracket

str

Bracket of the SV record with breakend notation. For example, for G]17:198982] the bracket will be ]

contig

str

Contig of the SV record with breakend notation. For example, for G]17:198982] the contig will be 17

pos

int

Position of the SV record with breakend notation. For example, for G]17:198982] the position will be 198982

suffix

Optional[str]

Suffix of the SV record with breakend notation. For example, for G]17:198982] the suffix will be None

ShorthandSVRecord

The ShorthandSVRecord class is a container for the information contained in a VCF record for SVs with shorthand notation.

Property

Type

Description

type

str

Type of the SV record with shorthand notation. One of the following, 'DEL', 'INS', 'DUP', 'INV' or 'CNV'. For example, for <DUP:TANDEM> the type will be DUP

extra

List[str]

Extra information of the SV. For example, for <DUP:TANDEM:AA> the extra will be ['TANDEM', 'AA']

Homogenization rules

VariantExtractor provides a unified interface to extract variants (included structural variants) from VCF files generated by different variant callers. The variants are homogenized and returned applying the following rules:

Multiallelic variants

An entry with multiple ALT sequences (multiallelic) is divided into multiple entries with a single ALT field. This entries with a single ALT field are then processed with the rest of the homogeneization rules. For example:

CHROM

POS

ID

REF

ALT

FILTER

2

1

multiallelic_1

G

C,T

PASS

is returned as:

CHROM

POS

ID

REF

ALT

FILTER

VariantType

2

1

multiallelic_1_0

G

C

PASS

SNV

2

1

multiallelic_1_1

G

T

PASS

SNV

SNVs

Entries with REF/ALT of the same lenghts are treated like SNVs. If the REF/ALT sequences are more than one nucleotide (MNPs), they are divided into multiple atomic SNVs. For example:

CHROM

POS

ID

REF

ALT

FILTER

2

1

snv_1

C

G

PASS

2

3

mnp_1

TAG

AGT

PASS

are returned as:

CHROM

POS

ID

REF

ALT

FILTER

VariantType

2

3

snv_1

C

G

PASS

SNV

2

3

mnp_1_0

T

A

PASS

SNV

2

4

mnp_1_1

A

G

PASS

SNV

2

5

mnp_1_2

G

T

PASS

SNV

|

Structural variants

VariantExtractor returns one entry per structural variant (one entry per breakend pair). This helps to avoid the ambiguity of the notation and keeps the process deterministic. For this reason, in case of paired breakends, the breakend with the lowest chromosome and/or position is returned. If a breakend is not the lowest chromosome and/or position and is missing its pair, its pair is inferred and returned.

Breakend vs shorthand notation

Entries with the same information, either described with shorthand or breakend notation, will be returned the same way. Here is an example for a DEL entry:

CHROM

POS

ID

REF

ALT

FILTER

INFO

1

3000

event_1_o

A

A[1:5000[

PASS

SVTYPE=BND

1

5000

event_1_h

A

]1:3000]A

PASS

SVTYPE=BND

1

3000

event_1

A

A[1:5000[

PASS

SVTYPE=DEL

1

3000

event_1

A

<DEL>

PASS

SVTYPE=DEL; END=5000

1

3000

event_1

AGTCACAA…

A

PASS

are returned as one entry (each one of them with their own ALT field), but with the same VariantRecord.end and VariantType:

CHROM

POS

ID

REF

ALT

FILTER

INFO

VariantType

VariantRecord.end

1

3000

event_1

A

PASS

DEL

5000

The special case of INV

<INV> is a special case of shorthand notation because it represents two paired breakends. For example, the following shorthand notation:

CHROM

POS

ID

REF

ALT

FILTER

INFO

2

321682

event_1

T

<INV>

PASS

SVTYPE=INV;END=421681

is equivalent to the following breakends:

CHROM

POS

ID

REF

ALT

FILTER

INFO

VariantType

2

321681

event_1_0

N

N]2:421681]

PASS

SVTYPE=INV

INV

2

321682

event_1_1

T

[2:421682[T

PASS

SVTYPE=INV

INV

In this case, VariantExtractor converts internally <INV> to two entries with breakend notation (one for each breakend pair). Note that the N will be replaced with the correct nucleotide if fasta_ref is provided to VariantExtractor.

Paired breakends

For paired breakends, breakends are paired using the INFO fields MATEID or PARID. If these fields are not available, they are paired using their coordinates (contig+position). The breakend with the lowest chromosome and/or position is returned. For example:

CHROM

POS

ID

REF

ALT

FILTER

INFO

2

3000

event_1_o

T

]3:5000]T

PASS

SVTYPE=BND

3

5000

event_1_h

G

G[2:3000[

PASS

SVTYPE=BND

1

3000

event_2_o

A

A[1:5000[

PASS

SVTYPE=BND

1

5000

event_2_h

A

]1:3000]A

PASS

SVTYPE=BND

are returned as one entry per variant:

CHROM

POS

ID

REF

ALT

FILTER

INFO

VariantType

2

3000

event_1_o

T

]3:5000]T

PASS

SVTYPE=BND

TRA

1

3000

event_2_o

A

A[1:5000[

PASS

SVTYPE=BND

DEL

Inferred breakend pairs

If all breakends are missing their pair, the breakends with the lowest chromosome and/or position are inferred and returned. For example:

CHROM

POS

ID

REF

ALT

FILTER

INFO

3

5000

event_1_h

G

G[2:3000[

PASS

SVTYPE=BND

1

5000

event_2_h

A

]1:3000]A

PASS

SVTYPE=BND

are returned as their inferred breakend pair with the lowest chromosome and/or position:

CHROM

POS

ID

REF

ALT

FILTER

INFO

VariantType

2

3000

event_1_h

N

]3:5000]N

PASS

SVTYPE=BND

TRA

1

3000

event_2_h

A

A[1:5000[

PASS

SVTYPE=BND

DEL

Note that the N will be replaced with the correct nucleotide if fasta_ref is provided to VariantExtractor. The following equivalencies are applied:

CHROM1

POS1

REF1

ALT1

CHROM2

POS2

REF2

ALT2

1

500

N

N[7:800[

7

800

N

]1:500]N

1

500

N

]7:800]N

7

800

N

N[1:500[

1

500

N

[7:800[N

7

800

N

[1:500[N

1

500

N

N]7:800]

7

800

N

N]1:500]

Imprecise paired breakends

Imprecise breakends do not match exactly with their pair in coordinates. In this case, they are paired using the INFO fields MATEID or PARID. As with the rest of variants, for each breakend pair, only the breakend with the lowest chromosome and/or position is returned. However, it is important to notice that the CIPOS field is lost for the other breakend. For example:

CHROM

POS

ID

REF

ALT

FILTER

INFO

2

3010

event_1_o

T

T[3:5000[

PASS

SVTYPE=BND;CIPOS=0,50;PARID=event_1_h

3

5050

event_1_h

A

]2:3050]A

PASS

SVTYPE=BND;CIPOS=0,100;PARID=event_1_o

are paired and the entry with the lowest chromosome and/or position is returned:

CHROM

POS

ID

REF

ALT

FILTER

INFO

VariantType

2

3010

event_1_o

T

T[3:5000[

PASS

SVTYPE=BND;CIPOS=0,50;PARID=a_h

TRA

Single breakends

Single breakends cannot be matched with other breakends because they lack a mate. They may be able to be matched later in downstream analysis. That is why each one is kept as a different variant. For example:

CHROM

POS

ID

REF

ALT

FILTER

INFO

2

3000

event_s

T

T.

PASS

SVTYPE=BND

3

5000

event_m

G

.G

PASS

SVTYPE=BND

are returned as two entries:

CHROM

POS

ID

REF

ALT

FILTER

INFO

VariantType

2

3000

event_s

T

T.

PASS

SVTYPE=BND

SGL

3

5000

event_m

G

.G

PASS

SVTYPE=BND

SGL

Dependencies

The dependencies are covered by their own respective licenses as follows:

Indices and tables