vcf

package
v1.0.1

This package is not in the latest version of its module.
Published: May 1, 2024 License: BSD-3-Clause Imports: 15 Imported by: 11

Documentation

Overview

Package vcf contains functions for reading, writing, and manipulating VCF format files. More information on the VCF file format can be found in its official documentation at https://samtools.github.io/hts-specs/VCFv4.3.pdf. A VCF file is parsed into a Header struct containing the header information and one Vcf struct per data line.

Constants

const Version = "VCFv4.3"

Functions

func AllEqual

func AllEqual(alpha []Vcf, beta []Vcf) bool

AllEqual returns true if the two slices of Vcf structs contain identical information.

func BuildGenotypeMap

func BuildGenotypeMap(v Vcf, names map[string]int16, mapToVcf map[uint64]Vcf) map[uint64]Vcf

BuildGenotypeMap is included for backwards compatibility, but is currently being deprecated.

func ChanToVariants

func ChanToVariants(c <-chan Vcf, sendChans SmallVariantChans)

ChanToVariants splits an incoming channel of Vcf records into channels of variant types. It must be run as a goroutine, otherwise the calling goroutine will deadlock. It skips any vcf records where the ref/alt fields contain any of the characters (:>[.), because these records cannot be parsed into simple variants.

See documentation for SmallVariantChans for more information on variant channels.

func ChromPosToUInt64

func ChromPosToUInt64(chrom int, start int) uint64

ChromPosToUInt64 takes a chromosome number and a start position and encodes them both as a single uint64.
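
The documentation does not state how the two values are packed. A minimal sketch of one plausible layout follows, assuming the chromosome number sits in the upper 32 bits and the start position in the lower 32 bits. The names packChromPos and unpackChromPos are hypothetical and the real package may use a different encoding.

```go
package main

import "fmt"

// packChromPos is a hypothetical stand-in for ChromPosToUInt64. It assumes
// the chromosome number occupies the upper 32 bits and the start position
// the lower 32 bits; the package's actual layout may differ.
func packChromPos(chrom int, start int) uint64 {
	return uint64(uint32(chrom))<<32 | uint64(uint32(start))
}

// unpackChromPos reverses packChromPos under the same layout assumption.
func unpackChromPos(code uint64) (chrom int, start int) {
	return int(code >> 32), int(code & 0xFFFFFFFF)
}

func main() {
	code := packChromPos(3, 1500)
	chrom, start := unpackChromPos(code)
	fmt.Println(code, chrom, start)
}
```

A packed key like this can serve as a map key, which matches the map[uint64]Vcf type seen in BuildGenotypeMap.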

func CompareAlt

func CompareAlt(alpha []string, beta []string) int

CompareAlt lexicographically compares two Alt fields from a VCF, each given as a slice of strings.
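
A lexicographic comparison of two string slices can be sketched as below. The function compareAlt is a hypothetical illustration, and the tie-breaking rule for slices of different length (shorter sorts first) is an assumption, not confirmed behavior of CompareAlt.

```go
package main

import (
	"fmt"
	"strings"
)

// compareAlt is a hypothetical sketch of a lexicographic comparison of two
// Alt fields. It returns -1, 0, or 1. When one slice is a prefix of the
// other, the shorter slice is assumed to sort first.
func compareAlt(alpha, beta []string) int {
	for i := 0; i < len(alpha) && i < len(beta); i++ {
		if c := strings.Compare(alpha[i], beta[i]); c != 0 {
			return c
		}
	}
	switch {
	case len(alpha) < len(beta):
		return -1
	case len(alpha) > len(beta):
		return 1
	}
	return 0
}

func main() {
	fmt.Println(compareAlt([]string{"A"}, []string{"A", "T"}))
	fmt.Println(compareAlt([]string{"T"}, []string{"A", "G"}))
}
```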

func CompareCoord

func CompareCoord(alpha Vcf, beta Vcf) int

CompareCoord compares two VCF structs by Pos for sorting or equality testing.

func CompareHeader

func CompareHeader(alpha Header, beta Header) int

CompareHeader compares two Header structs for sorting or equality testing.

func Del

func Del(v Vcf) bool

Del checks SVTYPE and returns true if the value is DEL.

func FixAllVcf

func FixAllVcf(query []Vcf, ref map[string][]dna.Base)

FixAllVcf runs FixVcf on each element in a slice of vcf structs. Along with the slice of vcfs, it also needs the reference genome in the form of chromosome names mapping to DNA sequences.

func FormatToString

func FormatToString(format []string) string

FormatToString converts the []string Format struct into a string by concatenating with a colon delimiter.
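
The colon-delimited join described above can be sketched with the standard library. The function formatToString is a hypothetical stand-in, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// formatToString is a hypothetical stand-in for FormatToString: it joins
// the Format keys with a colon, as they appear in the FORMAT column.
func formatToString(format []string) string {
	return strings.Join(format, ":")
}

func main() {
	fmt.Println(formatToString([]string{"GT", "DP", "GQ"})) // prints "GT:DP:GQ"
}
```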

func GetAltBases

func GetAltBases(words []string) [][]dna.Base

GetAltBases converts a slice of DNA sequences encoded as strings into a slice of DNA sequences encoded as slices of dna.Base.

func HasAncestor

func HasAncestor(g Vcf) bool

HasAncestor returns true if a VCF record is annotated with an ancestor allele in the Info column, false otherwise.

func HeaderGetSampleList

func HeaderGetSampleList(header Header) []string

HeaderGetSampleList returns an ordered list of the samples present in the header of a Vcf file. Useful when adding or removing samples from a VCF.

func Ins

func Ins(v Vcf) bool

Ins checks SVTYPE and returns true if the value is INS.

func IsAltAncestor

func IsAltAncestor(g Vcf) bool

IsAltAncestor returns true if the first alt allele in the record matches the ancestral allele in the Info annotation, false otherwise.

func IsBiallelic

func IsBiallelic(v Vcf) bool

IsBiallelic returns true if a vcf record has 1 alt variant, false otherwise.

func IsHeterozygous

func IsHeterozygous(s Sample) bool

IsHeterozygous returns true if more than 1 allele is present in the sample.

func IsHomozygous

func IsHomozygous(s Sample) bool

IsHomozygous returns true if only 1 allele is present in the sample. Note that IsHomozygous also returns true for hemizygous samples.
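
The distinction between the two zygosity checks can be sketched by counting distinct allele indexes. The sample type below is a hypothetical, trimmed-down version of Sample, and the handling of missing genotypes (-1) in the real functions is not covered here.

```go
package main

import "fmt"

// sample is a hypothetical, reduced version of the package's Sample struct.
type sample struct {
	Alleles []int16
}

// distinctAlleles counts how many different allele indexes the sample holds.
func distinctAlleles(s sample) int {
	seen := make(map[int16]bool)
	for _, a := range s.Alleles {
		seen[a] = true
	}
	return len(seen)
}

// isHomozygous is true when exactly one allele is present, which also
// covers hemizygous samples that carry a single allele call.
func isHomozygous(s sample) bool { return distinctAlleles(s) == 1 }

// isHeterozygous is true when more than one allele is present.
func isHeterozygous(s sample) bool { return distinctAlleles(s) > 1 }

func main() {
	fmt.Println(isHeterozygous(sample{Alleles: []int16{0, 1}})) // true
	fmt.Println(isHomozygous(sample{Alleles: []int16{1}}))      // true (hemizygous)
}
```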

func IsNotRefStrongAltWeak

func IsNotRefStrongAltWeak(v Vcf) bool

IsNotRefStrongAltWeak returns true if an input biallelic substitution variant is not a strong to weak variant, false otherwise.

func IsNotRefWeakAltStrong

func IsNotRefWeakAltStrong(v Vcf) bool

IsNotRefWeakAltStrong returns true if an input biallelic substitution variant does not have a weak Ref allele and a strong Alt allele, false otherwise.

func IsNotWeakToStrongOrStrongToWeak

func IsNotWeakToStrongOrStrongToWeak(v Vcf) bool

IsNotWeakToStrongOrStrongToWeak returns true if a variant is neither a weak to strong variant nor a strong to weak variant, false otherwise.

func IsPolarizable

func IsPolarizable(v Vcf) bool

IsPolarizable returns true if a variant can be "polarized" in a derived allele frequency spectrum, false otherwise.

func IsRefAncestor

func IsRefAncestor(g Vcf) bool

IsRefAncestor returns true if the reference allele in the record matches the ancestral allele in the Info annotation, false otherwise.

func IsRefStrongAltWeak

func IsRefStrongAltWeak(v Vcf) bool

IsRefStrongAltWeak returns true if an input biallelic substitution variant has a strong Ref allele and a weak Alt allele, false otherwise.

func IsRefWeakAltStrong

func IsRefWeakAltStrong(v Vcf) bool

IsRefWeakAltStrong returns true if an input biallelic substitution variant has a weak Ref allele and a strong Alt allele, false otherwise.
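
The weak/strong classification can be sketched as follows, using the usual convention that G and C are strong bases and A and T are weak bases. The function names are hypothetical, only uppercase bases are handled, and returning false for input that is not a biallelic substitution is an assumption; the package may treat such input differently.

```go
package main

import "fmt"

// isStrong reports whether a base is G or C (three hydrogen bonds).
func isStrong(b byte) bool { return b == 'G' || b == 'C' }

// isWeak reports whether a base is A or T (two hydrogen bonds).
func isWeak(b byte) bool { return b == 'A' || b == 'T' }

// isRefWeakAltStrong is a hypothetical sketch of the weak-to-strong check.
// It returns false for anything that is not a biallelic substitution.
func isRefWeakAltStrong(ref string, alt []string) bool {
	if len(ref) != 1 || len(alt) != 1 || len(alt[0]) != 1 {
		return false
	}
	return isWeak(ref[0]) && isStrong(alt[0][0])
}

func main() {
	fmt.Println(isRefWeakAltStrong("A", []string{"G"})) // true
	fmt.Println(isRefWeakAltStrong("G", []string{"A"})) // false
}
```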

func IsSegregating

func IsSegregating(v Vcf) bool

IsSegregating returns true if a Vcf record is a segregating site, i.e. if the samples of the record contain at least two allelic states (e.g. not all 0 or all 1).
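
The segregating-site test can be sketched by scanning the allele calls of every sample for a second allelic state. The function isSegregating below is hypothetical and works on bare allele slices; skipping missing genotypes (-1) is an assumption.

```go
package main

import "fmt"

// isSegregating is a hypothetical sketch: it returns true as soon as two
// different allele indexes are seen across all samples. Missing calls,
// encoded as -1, are skipped (an assumption about the real behavior).
func isSegregating(samples [][]int16) bool {
	first := int16(-1)
	for _, alleles := range samples {
		for _, a := range alleles {
			if a < 0 {
				continue
			}
			if first < 0 {
				first = a
				continue
			}
			if a != first {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(isSegregating([][]int16{{0, 0}, {0, 1}})) // true
	fmt.Println(isSegregating([][]int16{{1, 1}, {1, 1}})) // false
}
```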

func IsSubstitution

func IsSubstitution(v Vcf) bool

IsSubstitution returns true if all of the alt fields of a vcf records are of length 1, false otherwise.

func IsVcfFile

func IsVcfFile(filename string) bool

IsVcfFile checks suffix of filename to confirm if the file is a vcf formatted file.

func IsWeakToStrongOrStrongToWeak

func IsWeakToStrongOrStrongToWeak(v Vcf) bool

IsWeakToStrongOrStrongToWeak returns true if an input biallelic substitution variant is a strong to weak variant or a weak to strong variant, false otherwise.

func NewWriteHeader

func NewWriteHeader(file io.Writer, header Header)

NewWriteHeader writes the value of header.Text to the provided io.Writer

func PhasedToString

func PhasedToString(phased bool) string

PhasedToString returns "|" when true and "/" otherwise.

func PrintSampleNames

func PrintSampleNames(header Header) string

PrintSampleNames takes a vcf header and prints the sample names from the "#CHROM" line

func PrintSingleLine

func PrintSingleLine(data Vcf)

PrintSingleLine prints an individual Vcf line.

func PrintVcf

func PrintVcf(data []Vcf)

PrintVcf prints every line of a []Vcf.

func QueryAncestor

func QueryAncestor(g Vcf) []dna.Base

QueryAncestor finds the AA INFO from a VCF struct and returns the base of the ancestral allele.

func QueryFlag

func QueryFlag(v Vcf, k Key) bool

QueryFlag retrieves the boolean value stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id (e.g. header.Format["GT"].Key). QueryFlag cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.

Note that flags are not valid in the Format field, so this query is only for Info.

func QueryFloat

func QueryFloat(v Vcf, k Key) ([][]float64, bool)

QueryFloat retrieves float64 values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryFloat cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.

The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").

The second return is false if the requested value is not present in the input record.

func QueryInt

func QueryInt(v Vcf, k Key) ([][]int, bool)

QueryInt retrieves integer values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryInt cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.

The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").

The second return is false if the requested value is not present in the input record.

func QueryRune

func QueryRune(v Vcf, k Key) ([][]rune, bool)

QueryRune retrieves rune values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryRune cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.

The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").

The second return is false if the requested value is not present in the input record.

func QueryString

func QueryString(v Vcf, k Key) ([][]string, bool)

QueryString retrieves string values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryString cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.

The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").

The second return is false if the requested value is not present in the input record.
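
The slice-of-slices return shape shared by QueryFloat, QueryInt, QueryRune, and QueryString can be illustrated with a small parser. This is a sketch under assumptions: parseIntValues is a hypothetical helper that takes one raw comma-separated value per sample, and it does not reflect how the package stores parsed data internally.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIntValues is a hypothetical helper. The outer index of the result is
// the sample and the inner index is the position within a comma-separated
// value such as "9,1".
func parseIntValues(perSample []string) ([][]int, error) {
	out := make([][]int, len(perSample))
	for i, field := range perSample {
		for _, word := range strings.Split(field, ",") {
			n, err := strconv.Atoi(word)
			if err != nil {
				return nil, err
			}
			out[i] = append(out[i], n)
		}
	}
	return out, nil
}

func main() {
	vals, err := parseIntValues([]string{"9,1", "4,6"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// vals[0] belongs to the first sample, vals[1] to the second.
	fmt.Println(vals[0][0], vals[0][1], vals[1][0], vals[1][1])
}
```

For an Info query the outer slice would hold a single entry, since Info values are not per-sample.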

func Read

func Read(filename string) ([]Vcf, Header)

Read parses a slice of Vcf structs from an input filename and, per its signature, returns it together with the Header of the file.

func ReadToChan

func ReadToChan(file *fileio.EasyReader, data chan<- Vcf, wg *sync.WaitGroup)

ReadToChan is a helper function of GoReadToChan.

func SampleNamesInOrder

func SampleNamesInOrder(header Header) []string

SampleNamesInOrder takes in the header and gives back the sample names in the order in which they appear in the header.

func SampleVcf

func SampleVcf(records []Vcf, header Header, numVariants int, numSamples int) ([]Vcf, Header)

SampleVcf takes a set of Vcf records and returns a random subset of variants to an output VCF file. Can also retain a random subset of alleles from gVCF data (diploid, does not break allele pairs).

func SamplesToString

func SamplesToString(sample []Sample) string

SamplesToString has been deprecated

func Snp

func Snp(v Vcf) bool

Snp checks SVTYPE and returns true if the value is SNP.

func Sort

func Sort(vcfFile []Vcf)

Sort sorts a slice of Vcf structs in place.

func Write

func Write(filename string, data []Vcf)

Write writes a []Vcf to an output filename.

func WriteMultiSamplesHeader

func WriteMultiSamplesHeader(file io.Writer, header Header, listNames []string)

WriteMultiSamplesHeader will write the value of header.Text to an io.Writer, but will replace the line starting with "#CHROM\t" with a "#CHROM" line that contains the standard column headers and then the names of the samples passed with listNames.
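
The "#CHROM" line replacement described above can be sketched as follows. The functions writeHeaderWithSamples and renderHeader are hypothetical, and emitting the FORMAT column only when sample names are supplied is an assumption about the desired output.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// fixedColumns holds the eight mandatory VCF column names.
const fixedColumns = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"

// writeHeaderWithSamples is a hypothetical sketch: it writes every header
// line unchanged, except that the line starting with "#CHROM\t" is rebuilt
// from the standard columns plus the given sample names.
func writeHeaderWithSamples(w io.Writer, text []string, names []string) error {
	for _, line := range text {
		if strings.HasPrefix(line, "#CHROM\t") {
			line = fixedColumns
			if len(names) > 0 {
				line += "\tFORMAT\t" + strings.Join(names, "\t")
			}
		}
		if _, err := fmt.Fprintln(w, line); err != nil {
			return err
		}
	}
	return nil
}

// renderHeader collects the output in a string for easy inspection.
func renderHeader(text []string, names []string) string {
	var sb strings.Builder
	_ = writeHeaderWithSamples(&sb, text, names) // strings.Builder writes do not fail
	return sb.String()
}

func main() {
	text := []string{
		"##fileformat=VCFv4.3",
		"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\told",
	}
	fmt.Print(renderHeader(text, []string{"s1", "s2"}))
}
```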

func WriteVcf

func WriteVcf(file io.Writer, input Vcf)

WriteVcf writes an individual Vcf struct to an io.Writer.

func WriteVcfToFileHandle

func WriteVcfToFileHandle(file io.Writer, input []Vcf)

WriteVcfToFileHandle writes a []Vcf to an io.Writer.

TODO(craiglowe): Look into unifying WriteVcfToFileHandle and WriteVcf and benchmark speed. geno bool variable determines whether to print notes or genotypes.

Types

type FilterHeader

type FilterHeader struct {
	Id          string
	Description string
}

FilterHeader contains info encoded by header lines beginning in ##FILTER.

type FormatHeader

type FormatHeader struct {
	Key
	Description string
}

FormatHeader contains info encoded by header lines beginning in ##FORMAT.

type GVcf

type GVcf struct {
	Vcf
	Seq       [][]dna.Base
	Genotypes []Sample
}

GVcf stores genotype information, but is currently being deprecated.

type Header

type Header struct {
	FileFormat string                         // ##fileformat=VCFv4.3
	Info       map[string]InfoHeader          // key=ID
	Filter     map[string]FilterHeader        // key=ID
	Format     map[string]FormatHeader        // key=ID
	Chroms     map[string]chromInfo.ChromInfo // key=chrom name
	Samples    map[string]int                 // key=samplename val=index in Sample
	Text       []string                       // raw text
}

Header contains all information present in the header section of a VCF. Info, Filter, Format, and Contig lines are parsed into maps keyed by ID.

func AncestorFlagToHeader

func AncestorFlagToHeader(h Header) Header

AncestorFlagToHeader adds an ##INFO line to a vcfHeader to include information about the AA flag for ancestral alleles.

func GoReadToChan

func GoReadToChan(filename string) (<-chan Vcf, Header)

GoReadToChan parses VCF structs from an input filename and returns a chan of VCF structs along with the header of the VCF file.

func HeaderUpdateSampleList

func HeaderUpdateSampleList(header Header, newSamples []string) Header

HeaderUpdateSampleList can be provided with a new list of samples to update the sample list in a Header.

func NewHeader

func NewHeader() Header

NewHeader creates a new minimal header for a vcf file.

func ReadHeader

func ReadHeader(er *fileio.EasyReader) Header

ReadHeader reads and parses the header in a vcf file.

type InfoHeader

type InfoHeader struct {
	Key
	Description string
	Source      string
	Version     string
}

InfoHeader contains info encoded by header lines beginning in ##INFO.

type InfoType

type InfoType byte

InfoType stores the type of variable that a field in the Header holds.

const (
	Integer InfoType = iota
	Float
	Flag
	Character
	String
)

func (InfoType) String

func (t InfoType) String() string

String converts the InfoType value to a human-readable string.

type Key

type Key struct {
	Id       string
	Number   string // numeral or 'A', 'G', 'R', '.'
	DataType InfoType
	IsFormat bool // true if this key is for a Format field, false for Info
}

Key is the identifying information for a given info field. (e.g. the genotype field in format would be {"GT", "1", Integer, true}.) (e.g. a read counter may be {"ReadCount", "R", Integer, true}.)

type Sample

type Sample struct {
	Alleles    []int16  // Alleles present in genotype, 0 for reference, 1 for Alt[0], 2 for Alt[1], etc.
	Phase      []bool   // True for phased genotype, false for unphased. len(Phase) == len(Alleles). Phase[0] == true if and only if Phase[1:] == true
	FormatData []string // FormatData contains additional sample fields after the genotype, which are parsed into a slice delimited by colons.

}

Sample is a substruct of Vcf, and contains information about each sample represented in a VCF line. Indexes in Alleles are set to -1 if no genotype data is present.

func (Sample) String

func (s Sample) String() string

String implements the fmt.Stringer interface for easy printing of Sample with the fmt package.

type SampleHash

type SampleHash struct {
	Fa     map[string]int16
	GIndex map[string]int16
}

SampleHash stores index and position information, but is currently being deprecated.

func HeaderToMaps

func HeaderToMaps(header Header) *SampleHash

HeaderToMaps uses a Vcf header to create a pointer to a SampleHash

type SmallVariantChans

type SmallVariantChans struct {
	Substitutions chan variant.Substitution
	Insertions    chan variant.Insertion
	Deletions     chan variant.Deletion
	Delins        chan variant.Delins
	Records       chan Vcf
}

SmallVariantChans wraps channels for all of the small variants that are valid variant.Mutators and variant.Effectors. Every channel except Records sends a parsed version of a given Vcf record, stored as the corresponding variant type. Each send on one of the variant channels is paired with a send on the Records channel: for each parsed variant sent, the original vcf record is the following send. This is useful because Vcf records often hold metadata that is not stored in the variant struct. Even if the original Vcf record is not used, the Records channel must be read, otherwise the sending goroutine will block.
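
The paired-send protocol can be sketched with plain channels. Everything below is hypothetical: the record, substitution, insertion, and variantChans types are simplified stand-ins, only two variant channels are modeled, and closing the channels when the input is exhausted is an assumption about how a consumer learns that the stream has ended.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for Vcf and the variant types.
type record struct {
	Pos      int
	Ref, Alt string
}
type substitution struct {
	Pos      int
	Ref, Alt byte
}
type insertion struct {
	Pos int
	Seq string
}

// variantChans mirrors the idea of SmallVariantChans with two variant types.
type variantChans struct {
	Substitutions chan substitution
	Insertions    chan insertion
	Records       chan record
}

// send emits each parsed variant followed by its original record.
func send(recs []record, c variantChans) {
	for _, r := range recs {
		switch {
		case len(r.Ref) == 1 && len(r.Alt) == 1:
			c.Substitutions <- substitution{r.Pos, r.Ref[0], r.Alt[0]}
		case len(r.Ref) == 1 && len(r.Alt) > 1:
			c.Insertions <- insertion{r.Pos, r.Alt[1:]}
		default:
			continue // not representable in this sketch, so skipped
		}
		c.Records <- r // the paired send that the consumer must always read
	}
	close(c.Substitutions)
	close(c.Insertions)
	close(c.Records)
}

// collect reads a variant, then immediately reads its paired record.
func collect(recs []record) []string {
	c := variantChans{make(chan substitution), make(chan insertion), make(chan record)}
	go send(recs, c)
	var out []string
	subs, ins := c.Substitutions, c.Insertions
	for subs != nil || ins != nil {
		select {
		case s, ok := <-subs:
			if !ok {
				subs = nil // channel closed, stop selecting on it
				continue
			}
			rec := <-c.Records
			out = append(out, fmt.Sprintf("sub %c>%c at %d", s.Ref, s.Alt, rec.Pos))
		case i, ok := <-ins:
			if !ok {
				ins = nil
				continue
			}
			rec := <-c.Records
			out = append(out, fmt.Sprintf("ins %s at %d", i.Seq, rec.Pos))
		}
	}
	return out
}

func main() {
	for _, line := range collect([]record{{10, "A", "G"}, {20, "A", "ATT"}}) {
		fmt.Println(line)
	}
}
```

If the consumer dropped the read from the Records channel, the sender would block on its paired send and no further variants would arrive.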

func GoChanToVariants

func GoChanToVariants(c <-chan Vcf) SmallVariantChans

GoChanToVariants wraps ChanToVariants and handles variant channel creation and goroutine spawning.

func NewSmallVariantChans

func NewSmallVariantChans(buffSize int) SmallVariantChans

NewSmallVariantChans creates a series of channels to send the variant representation of a given vcf record. The input buffSize determines the channel buffer size for all of the variant channels.

type Vcf

type Vcf struct {
	Chr     string
	Pos     int
	Id      string
	Ref     string
	Alt     []string
	Qual    float64
	Filter  string
	Info    string
	Format  []string
	Samples []Sample
	// contains filtered or unexported fields
}

Vcf contains information for each line of a VCF format file, corresponding to variants at one position of a reference genome.

func AnnotateAncestorFromMultiFa

func AnnotateAncestorFromMultiFa(g Vcf, records []fasta.Fasta, RefStart int, AlnStart int) (Vcf, int, int)

AnnotateAncestorFromMultiFa adds the ancestral state to a VCF variant by inspecting a pairwise fasta of the reference genome and an ancestor sequence. records is a pairwise multiFa where the first entry is the reference genome and the second entry is the ancestor. Returns the refPos and alnPos of the current record.

func AppendAncestor

func AppendAncestor(g Vcf, b []dna.Base) Vcf

AppendAncestor adds the ancestral allele state (defined by input bases) to the INFO column of a vcf entry.

func FixVcf

func FixVcf(query Vcf, ref map[string][]dna.Base) Vcf

FixVcf "fixes" vcf records that have a dash for a deletion, which does not conform to the current VCF file specs, but is often seen in the output of different programs. The function takes the vcf record to be fixed and the reference genome as a map of chromosome name to DNA sequence.
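
One common way to repair a dash-style deletion is to anchor it on the preceding reference base, as sketched below. This is an assumption about the general technique rather than a description of what FixVcf does internally. The record type and fixDashDeletion are hypothetical, the reference is held as plain strings instead of []dna.Base, and only the case Alt == "-" is handled.

```go
package main

import "fmt"

// record is a hypothetical, reduced stand-in for Vcf with a single Alt.
type record struct {
	Chr string
	Pos int // 1-based position
	Ref string
	Alt string
}

// fixDashDeletion rewrites a deletion written as Alt == "-" so that both
// alleles start with the reference base just before the deleted sequence.
// It reports false when the record is left unchanged.
func fixDashDeletion(r record, ref map[string]string) (record, bool) {
	if r.Alt != "-" {
		return r, false
	}
	seq, ok := ref[r.Chr]
	if !ok || r.Pos < 2 || r.Pos-1 > len(seq) {
		return r, false // no anchor base available
	}
	anchor := string(seq[r.Pos-2]) // 1-based Pos-1 is 0-based Pos-2
	r.Pos--
	r.Ref = anchor + r.Ref
	r.Alt = anchor
	return r, true
}

func main() {
	ref := map[string]string{"chr1": "ACGTA"}
	fixed, changed := fixDashDeletion(record{"chr1", 3, "G", "-"}, ref)
	fmt.Println(fixed.Pos, fixed.Ref, fixed.Alt, changed) // 2 CG C true
}
```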

func InvertVcf

func InvertVcf(v Vcf) Vcf

InvertVcf inverts the reference and alt variants in a Vcf record. Currently does not update other fields, but this functionality may be added.

func NextVcf

func NextVcf(reader *fileio.EasyReader) (Vcf, bool)

NextVcf is a helper function of Read and GoReadToChan. Checks a reader for additional data lines and parses a Vcf line if more lines exist.

func ParseFormat

func ParseFormat(v Vcf, header Header) Vcf

ParseFormat parses the data stored in vcf.Format. Fills an unexported map in the Vcf struct that can be queried with Query(type) functions.

func ParseInfo

func ParseInfo(v Vcf, header Header) Vcf

ParseInfo parses the data stored in vcf.Info. Fills an unexported map in the Vcf struct that can be queried with Query(type) functions.

func ReorderSampleColumns

func ReorderSampleColumns(input Vcf, samples []int16) Vcf

ReorderSampleColumns reorganizes the Samples slice based on a samples []int16 specification list.
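
The documentation does not spell out how the specification list is interpreted. A minimal sketch under one plausible reading follows, where entry i of the list names the input index that should land in output column i. The function reorder is hypothetical and operates on sample names rather than Sample structs.

```go
package main

import "fmt"

// reorder is a hypothetical sketch: output column i takes the input column
// whose index is order[i]. An out-of-range index panics, as with any slice.
func reorder(samples []string, order []int16) []string {
	out := make([]string, len(order))
	for i, idx := range order {
		out[i] = samples[idx]
	}
	return out
}

func main() {
	fmt.Println(reorder([]string{"a", "b", "c"}, []int16{2, 0})) // [c a]
}
```

Under this reading the same list can both reorder and drop samples, since any index left out of the list does not appear in the output.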

func (Vcf) GetChrom

func (v Vcf) GetChrom() string

GetChrom returns the chromosome that the variant is on.

func (Vcf) GetChromEnd

func (v Vcf) GetChromEnd() int

GetChromEnd returns the end position of a variant. Since VCF is 1-based, GetChromEnd of a SNP variant will return v.Pos. For indels, GetChromEnd will return v.Pos + the length of the indel - 1.

func (Vcf) GetChromStart

func (v Vcf) GetChromStart() int

GetChromStart returns the start position of a variant. Since VCF is 1-based, GetChromStart of a SNP variant will return v.Pos - 1. VCF format defines the first base of an indel as the base prior to the change, so for indels, GetChromStart will simply return v.Pos.
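
For a SNP, the start and end values described for GetChromStart and GetChromEnd form a 0-based, half-open interval around the 1-based VCF position. The sketch below covers only that SNP case; snpInterval is a hypothetical helper and the indel rules are not modeled.

```go
package main

import "fmt"

// snpInterval is a hypothetical helper that converts a 1-based VCF position
// of a SNP into a 0-based, half-open interval [start, end).
func snpInterval(pos int) (start, end int) {
	return pos - 1, pos
}

func main() {
	seq := "ACGT"
	start, end := snpInterval(3)
	// Slicing with the interval recovers the base at 1-based position 3.
	fmt.Println(start, end, seq[start:end]) // 2 3 G
}
```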

func (Vcf) String

func (v Vcf) String() string

String implements the fmt.Stringer interface for easy printing of Vcf with the fmt package.

func (Vcf) UpdateCoord

func (v Vcf) UpdateCoord(c string, start int, end int) interface{}

UpdateCoord modifies the position data in a VCF struct based on input values.

func (Vcf) WriteToFileHandle

func (v Vcf) WriteToFileHandle(file io.Writer)

WriteToFileHandle writes a VCF struct to an io.Writer.
