Documentation ¶
Overview ¶
Package vcf contains functions for reading, writing, and manipulating VCF format files. More information on the VCF file format can be found in its official documentation at https://samtools.github.io/hts-specs/VCFv4.3.pdf. A VCF file is parsed into a Header struct containing the header information and one Vcf struct per data line containing the information from that line.
Index ¶
- Constants
- func AllEqual(alpha []Vcf, beta []Vcf) bool
- func BuildGenotypeMap(v Vcf, names map[string]int16, mapToVcf map[uint64]Vcf) map[uint64]Vcf
- func ChanToVariants(c <-chan Vcf, sendChans SmallVariantChans)
- func ChromPosToUInt64(chrom int, start int) uint64
- func CompareAlt(alpha []string, beta []string) int
- func CompareCoord(alpha Vcf, beta Vcf) int
- func CompareHeader(alpha Header, beta Header) int
- func Del(v Vcf) bool
- func FixAllVcf(query []Vcf, ref map[string][]dna.Base)
- func FormatToString(format []string) string
- func GetAltBases(words []string) [][]dna.Base
- func HasAncestor(g Vcf) bool
- func HeaderGetSampleList(header Header) []string
- func Ins(v Vcf) bool
- func IsAltAncestor(g Vcf) bool
- func IsBiallelic(v Vcf) bool
- func IsHeterozygous(s Sample) bool
- func IsHomozygous(s Sample) bool
- func IsNotRefStrongAltWeak(v Vcf) bool
- func IsNotRefWeakAltStrong(v Vcf) bool
- func IsNotWeakToStrongOrStrongToWeak(v Vcf) bool
- func IsPolarizable(v Vcf) bool
- func IsRefAncestor(g Vcf) bool
- func IsRefStrongAltWeak(v Vcf) bool
- func IsRefWeakAltStrong(v Vcf) bool
- func IsSegregating(v Vcf) bool
- func IsSubstitution(v Vcf) bool
- func IsVcfFile(filename string) bool
- func IsWeakToStrongOrStrongToWeak(v Vcf) bool
- func NewWriteHeader(file io.Writer, header Header)
- func PhasedToString(phased bool) string
- func PrintSampleNames(header Header) string
- func PrintSingleLine(data Vcf)
- func PrintVcf(data []Vcf)
- func QueryAncestor(g Vcf) []dna.Base
- func QueryFlag(v Vcf, k Key) bool
- func QueryFloat(v Vcf, k Key) ([][]float64, bool)
- func QueryInt(v Vcf, k Key) ([][]int, bool)
- func QueryRune(v Vcf, k Key) ([][]rune, bool)
- func QueryString(v Vcf, k Key) ([][]string, bool)
- func Read(filename string) ([]Vcf, Header)
- func ReadToChan(file *fileio.EasyReader, data chan<- Vcf, wg *sync.WaitGroup)
- func SampleNamesInOrder(header Header) []string
- func SampleVcf(records []Vcf, header Header, numVariants int, numSamples int) ([]Vcf, Header)
- func SamplesToString(sample []Sample) string
- func Snp(v Vcf) bool
- func Sort(vcfFile []Vcf)
- func Write(filename string, data []Vcf)
- func WriteMultiSamplesHeader(file io.Writer, header Header, listNames []string)
- func WriteVcf(file io.Writer, input Vcf)
- func WriteVcfToFileHandle(file io.Writer, input []Vcf)
- type FilterHeader
- type FormatHeader
- type GVcf
- type Header
- type InfoHeader
- type InfoType
- type Key
- type Sample
- type SampleHash
- type SmallVariantChans
- type Vcf
- func AnnotateAncestorFromMultiFa(g Vcf, records []fasta.Fasta, RefStart int, AlnStart int) (Vcf, int, int)
- func AppendAncestor(g Vcf, b []dna.Base) Vcf
- func FixVcf(query Vcf, ref map[string][]dna.Base) Vcf
- func InvertVcf(v Vcf) Vcf
- func NextVcf(reader *fileio.EasyReader) (Vcf, bool)
- func ParseFormat(v Vcf, header Header) Vcf
- func ParseInfo(v Vcf, header Header) Vcf
- func ReorderSampleColumns(input Vcf, samples []int16) Vcf
Constants ¶
const Version = "VCFv4.3"
Variables ¶
This section is empty.
Functions ¶
func AllEqual ¶
AllEqual returns true if the two slices of Vcf structs contain identical information.
func BuildGenotypeMap ¶
BuildGenotypeMap is included for backwards compatibility, but is currently being deprecated.
func ChanToVariants ¶
func ChanToVariants(c <-chan Vcf, sendChans SmallVariantChans)
ChanToVariants splits an incoming channel of Vcf records into channels of variant types. It must be run as a goroutine, otherwise the calling goroutine will deadlock. It skips any Vcf record whose Ref or Alt field contains any of (:>[.), because these records cannot be parsed into simple variants.
See documentation for SmallVariantChans for more information on variant channels.
func ChromPosToUInt64 ¶
ChromPosToUInt64 takes a chromosome number and a start position and encodes both in a single uint64.
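The bit layout is not stated above. The sketch below shows one plausible encoding, assuming the chromosome number occupies the upper 32 bits and the start position the lower 32 bits. The names chromPosToUint64 and decode are hypothetical stand-ins, not the package's functions.

```go
package main

import "fmt"

// chromPosToUint64 is a hypothetical stand-in for vcf.ChromPosToUInt64.
// Assumed layout: chromosome number in the upper 32 bits,
// start position in the lower 32 bits.
func chromPosToUint64(chrom int, start int) uint64 {
	return uint64(chrom)<<32 | uint64(uint32(start))
}

// decode reverses the assumed layout.
func decode(key uint64) (chrom int, start int) {
	return int(key >> 32), int(uint32(key))
}

func main() {
	key := chromPosToUint64(3, 12345)
	chrom, start := decode(key)
	fmt.Println(key, chrom, start)
}
```

Packing both values into one integer makes the pair usable as a map key, which matches the map[uint64]Vcf type that appears in BuildGenotypeMap.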
func CompareAlt ¶
CompareAlt lexicographically compares the Alt fields of two VCF records, each given as a slice of strings.
func CompareCoord ¶
CompareCoord compares two VCF structs by Pos for sorting or equality testing.
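A minimal sketch of how a comparison function of this shape can drive sorting. It assumes the usual convention of returning -1, 0, or 1, and uses a local record type as a stand-in for Vcf; neither detail is confirmed by the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a local stand-in for vcf.Vcf, reduced to the fields used here.
type record struct {
	Chr string
	Pos int
}

// compareCoord returns -1, 0, or 1 when a.Pos is less than, equal to,
// or greater than b.Pos. Only Pos is compared (assumed return convention).
func compareCoord(a, b record) int {
	switch {
	case a.Pos < b.Pos:
		return -1
	case a.Pos > b.Pos:
		return 1
	default:
		return 0
	}
}

func main() {
	recs := []record{{"chr1", 300}, {"chr1", 100}, {"chr1", 200}}
	// Sort ascending by position using the comparator.
	sort.Slice(recs, func(i, j int) bool { return compareCoord(recs[i], recs[j]) < 0 })
	fmt.Println(recs)
}
```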
func CompareHeader ¶
CompareHeader compares two Header structs for sorting or equality testing.
func FixAllVcf ¶
FixAllVcf runs FixVcf on each element in a slice of vcf structs. Along with the slice of vcfs, it also needs the reference genome in the form of chromosome names mapping to DNA sequences.
func FormatToString ¶
FormatToString converts the []string Format struct into a string by concatenating with a colon delimiter.
func GetAltBases ¶
GetAltBases converts a slice of DNA sequences encoded as strings into a slice of DNA sequences encoded as slices of dna.Base.
func HasAncestor ¶
HasAncestor returns true if a VCF record is annotated with an ancestor allele in the Info column, false otherwise.
func HeaderGetSampleList ¶
HeaderGetSampleList returns an ordered list of the samples present in the header of a Vcf file. Useful when adding or removing samples from a VCF.
func IsAltAncestor ¶
IsAltAncestor returns true if the first alt allele in the record matches the ancestral allele in the Info annotation, false otherwise.
func IsBiallelic ¶
IsBiallelic returns true if a vcf record has 1 alt variant, false otherwise.
func IsHeterozygous ¶
IsHeterozygous returns true if more than 1 allele is present in the sample.
func IsHomozygous ¶
IsHomozygous returns true if only 1 allele is present in the sample. Note that IsHomozygous also returns true for hemizygous samples.
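The two zygosity checks can be sketched as follows. This is an illustration on a local sample type that mirrors only the Alleles field; how the package treats empty or missing (-1) genotypes is not stated above, so the handling here is an assumption.

```go
package main

import "fmt"

// sample is a local stand-in for vcf.Sample, keeping only Alleles.
type sample struct {
	Alleles []int16
}

// isHeterozygous reports whether more than one distinct allele is present.
func isHeterozygous(s sample) bool {
	for _, a := range s.Alleles {
		if a != s.Alleles[0] {
			return true
		}
	}
	return false
}

// isHomozygous reports whether exactly one distinct allele is present.
// A single-allele (hemizygous) sample also returns true.
// Assumption: an empty Alleles slice is treated as not homozygous.
func isHomozygous(s sample) bool {
	return len(s.Alleles) > 0 && !isHeterozygous(s)
}

func main() {
	het := sample{Alleles: []int16{0, 1}}
	hom := sample{Alleles: []int16{1, 1}}
	hemi := sample{Alleles: []int16{1}}
	fmt.Println(isHeterozygous(het), isHomozygous(hom), isHomozygous(hemi))
}
```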
func IsNotRefStrongAltWeak ¶
IsNotRefStrongAltWeak returns true if an input biallelic substitution variant is not a strong to weak variant, false otherwise.
func IsNotRefWeakAltStrong ¶
IsNotRefWeakAltStrong returns true if an input biallelic substitution variant does not have a weak Ref allele and a strong Alt allele, false otherwise.
func IsNotWeakToStrongOrStrongToWeak ¶
IsNotWeakToStrongOrStrongToWeak returns true if a variant is neither a weak to strong variant nor a strong to weak variant, false otherwise.
func IsPolarizable ¶
IsPolarizable returns true if a variant can be "polarized" in a derived allele frequency spectrum, false otherwise.
func IsRefAncestor ¶
IsRefAncestor returns true if the reference allele in the record matches the ancestral allele in the Info annotation, false otherwise.
func IsRefStrongAltWeak ¶
IsRefStrongAltWeak returns true if an input biallelic substitution variant has a strong Ref allele and a weak Alt allele, false otherwise.
func IsRefWeakAltStrong ¶
IsRefWeakAltStrong returns true if an input biallelic substitution variant has a weak Ref allele and a strong Alt allele, false otherwise.
func IsSegregating ¶
IsSegregating returns true if a Vcf record is a segregating site, i.e. if the samples of the record contain at least two allelic states (e.g. not all 0 or all 1), false otherwise.
func IsSubstitution ¶
IsSubstitution returns true if all of the alt fields of a vcf record are of length 1, false otherwise.
func IsVcfFile ¶
IsVcfFile checks the suffix of filename to determine whether the file is a vcf formatted file.
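A suffix check of this kind can be sketched as below. The accepted suffixes (".vcf" and ".vcf.gz") are an assumption made for illustration; the description above does not list which suffixes the package accepts.

```go
package main

import (
	"fmt"
	"strings"
)

// isVcfFile is a hypothetical stand-in for vcf.IsVcfFile.
// Assumption: plain and gzipped VCF file names are both accepted.
func isVcfFile(filename string) bool {
	return strings.HasSuffix(filename, ".vcf") || strings.HasSuffix(filename, ".vcf.gz")
}

func main() {
	for _, name := range []string{"calls.vcf", "calls.vcf.gz", "calls.bed"} {
		fmt.Println(name, isVcfFile(name))
	}
}
```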
func IsWeakToStrongOrStrongToWeak ¶
IsWeakToStrongOrStrongToWeak returns true if an input biallelic substitution variant is a strong to weak variant or a weak to strong variant, false otherwise.
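The weak/strong predicates rely on the standard base classification: A and T are weak (two hydrogen bonds), C and G are strong (three). The sketch below applies that classification to plain strings rather than the package's Vcf type. Returning false for input that is not a biallelic substitution is a choice made for this sketch; the package may handle such input differently.

```go
package main

import "fmt"

// isWeak reports whether b is a weak base (A or T).
func isWeak(b byte) bool { return b == 'A' || b == 'T' }

// isStrong reports whether b is a strong base (C or G).
func isStrong(b byte) bool { return b == 'C' || b == 'G' }

// isBiallelicSubstitution reports whether ref and alt describe a single-base,
// single-alternate substitution.
func isBiallelicSubstitution(ref string, alt []string) bool {
	return len(ref) == 1 && len(alt) == 1 && len(alt[0]) == 1
}

// isRefWeakAltStrong mirrors the documented predicate on plain strings.
func isRefWeakAltStrong(ref string, alt []string) bool {
	return isBiallelicSubstitution(ref, alt) && isWeak(ref[0]) && isStrong(alt[0][0])
}

// isRefStrongAltWeak mirrors the documented predicate on plain strings.
func isRefStrongAltWeak(ref string, alt []string) bool {
	return isBiallelicSubstitution(ref, alt) && isStrong(ref[0]) && isWeak(alt[0][0])
}

func main() {
	fmt.Println(isRefWeakAltStrong("A", []string{"G"})) // weak to strong
	fmt.Println(isRefStrongAltWeak("C", []string{"T"})) // strong to weak
	fmt.Println(isRefWeakAltStrong("A", []string{"T"})) // weak to weak
}
```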
func NewWriteHeader ¶
NewWriteHeader writes the value of header.Text to the provided io.Writer
func PhasedToString ¶
PhasedToString returns "|" when true and "/" otherwise.
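A small sketch of how such a separator can be used to rebuild a genotype string such as "0|1" from the Alleles and Phase slices of a sample. It assumes Phase[i] describes the separator written before allele i and that a missing allele (-1) is written as "."; both are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// phasedToString returns "|" for phased and "/" for unphased genotypes.
func phasedToString(phased bool) string {
	if phased {
		return "|"
	}
	return "/"
}

// genotypeToString is a hypothetical helper that joins alleles with the
// separator chosen by the matching phase flag.
// It requires len(phase) == len(alleles).
func genotypeToString(alleles []int16, phase []bool) string {
	var sb strings.Builder
	for i, a := range alleles {
		if i > 0 {
			sb.WriteString(phasedToString(phase[i]))
		}
		if a < 0 {
			sb.WriteString(".") // missing genotype
		} else {
			sb.WriteString(strconv.Itoa(int(a)))
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(genotypeToString([]int16{0, 1}, []bool{true, true}))
	fmt.Println(genotypeToString([]int16{1, 1}, []bool{false, false}))
	fmt.Println(genotypeToString([]int16{-1, -1}, []bool{false, false}))
}
```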
func PrintSampleNames ¶
PrintSampleNames takes a vcf header and prints the sample names from the "#CHROM" line
func PrintSingleLine ¶
func PrintSingleLine(data Vcf)
PrintSingleLine prints an individual Vcf line.
func QueryAncestor ¶
QueryAncestor finds the AA INFO from a VCF struct and returns the base of the ancestral allele.
func QueryFlag ¶
QueryFlag retrieves the boolean value stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryFlag cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.
Note that flags are not valid in the Format field, so this query is only for Info.
func QueryFloat ¶
QueryFloat retrieves float64 values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryFloat cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.
The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").
The second return is false if the requested value is not present in the input record.
func QueryInt ¶
QueryInt retrieves integer values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryInt cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.
The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").
The second return is false if the requested value is not present in the input record.
func QueryRune ¶
QueryRune retrieves rune values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryRune cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.
The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").
The second return is false if the requested value is not present in the input record.
func QueryString ¶
QueryString retrieves string values stored in the Info or Format fields of a vcf record. The input is a Key struct which is retrieved from the header and is keyed by Id. (e.g. header.Format["GT"].Key). QueryString cannot be used until the requested field (Info or Format) has been parsed using the ParseInfo and ParseFormat functions.
The return is a slice of slices where the first slice corresponds to the sample (this is always len == 1 when querying the Info field) and the second slice corresponds to multiple values that may be present for the given tag (e.g. ref/alt read depth may be "9,1").
The second return is false if the requested value is not present in the input record.
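The slice-of-slices return shape shared by the Query functions can be illustrated without the package. The sketch below parses raw sample columns directly; it is not the package's parser, and the function queryInt and its string-based inputs are hypothetical. The outer index is the sample, the inner index is the value within the tag, and the boolean is false when the tag is absent.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// queryInt is a hypothetical illustration of the documented return shape.
// format holds the FORMAT keys, samples holds the raw per-sample columns.
func queryInt(format []string, samples []string, id string) ([][]int, bool) {
	idx := -1
	for i, f := range format {
		if f == id {
			idx = i
			break
		}
	}
	if idx == -1 {
		return nil, false // tag not present in this record
	}
	out := make([][]int, len(samples))
	for i, s := range samples {
		fields := strings.Split(s, ":")
		if idx >= len(fields) {
			return nil, false
		}
		for _, v := range strings.Split(fields[idx], ",") {
			n, err := strconv.Atoi(v)
			if err != nil {
				return nil, false
			}
			out[i] = append(out[i], n)
		}
	}
	return out, true
}

func main() {
	format := []string{"GT", "AD"}
	samples := []string{"0/1:9,1", "1/1:0,12"}
	depth, found := queryInt(format, samples, "AD")
	fmt.Println(depth, found)
	_, found = queryInt(format, samples, "DP")
	fmt.Println(found)
}
```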
func Read ¶
Read parses a slice of VCF structs from an input filename and returns it together with the parsed Header.
func ReadToChan ¶
func ReadToChan(file *fileio.EasyReader, data chan<- Vcf, wg *sync.WaitGroup)
ReadToChan is a helper function of GoReadToChan.
func SampleNamesInOrder ¶
SampleNamesInOrder takes in the header and gives back the sample names in the order in which they appear in the header.
func SampleVcf ¶
SampleVcf takes a set of Vcf records and returns a random subset of variants to an output VCF file. Can also retain a random subset of alleles from gVCF data (diploid, does not break allele pairs).
func SamplesToString ¶
SamplesToString has been deprecated
func WriteMultiSamplesHeader ¶
WriteMultiSamplesHeader will write the value of header.Text to an io.Writer, but will replace the line starting with "#CHROM\t" with a "#CHROM" line that contains the standard column headers and then the names of the samples passed with listNames.
func WriteVcfToFileHandle ¶
WriteVcfToFileHandle writes a slice of Vcf records to an io.Writer. TODO(craiglowe): Look into unifying WriteVcfToFileHandle and WriteVcf and benchmark speed.
Types ¶
type FilterHeader ¶
FilterHeader contains info encoded by header lines beginning in ##FILTER.
type FormatHeader ¶
FormatHeader contains info encoded by header lines beginning in ##FORMAT.
type Header ¶
type Header struct {
	FileFormat string                         // ##fileformat=VCFv4.3
	Info       map[string]InfoHeader          // key=ID
	Filter     map[string]FilterHeader        // key=ID
	Format     map[string]FormatHeader        // key=ID
	Chroms     map[string]chromInfo.ChromInfo // key=chrom name
	Samples    map[string]int                 // key=samplename val=index in Sample
	Text       []string                       // raw text
}
Header contains all information present in the header section of a VCF. Info, Filter, Format, and Contig lines are parsed into maps keyed by ID.
func AncestorFlagToHeader ¶
AncestorFlagToHeader adds an ##INFO line to a vcfHeader to include information about the AA flag for ancestral alleles.
func GoReadToChan ¶
GoReadToChan parses VCF structs from an input filename and returns a chan of VCF structs along with the header of the VCF file.
func HeaderUpdateSampleList ¶
HeaderUpdateSampleList can be provided with a new list of samples to update the sample list in a Header.
func ReadHeader ¶
func ReadHeader(er *fileio.EasyReader) Header
ReadHeader reads and parses the header in a vcf file.
type InfoHeader ¶
InfoHeader contains info encoded by header lines beginning in ##INFO.
type InfoType ¶
type InfoType byte
InfoType stores the type of variable that a field in the Header holds.
type Key ¶
type Key struct {
	Id       string
	Number   string // numeral or 'A', 'G', 'R', '.'
	DataType InfoType
	IsFormat bool // true if this key is for a Format field, false for Info
}
Key is the identifying information for a given info field. (e.g. the genotype field in format would be {"GT", "1", Integer, true}.) (e.g. a read counter may be {"ReadCount", "R", Integer, true}.)
type Sample ¶
type Sample struct {
	Alleles    []int16  // Alleles present in genotype, 0 for reference, 1 for Alt[0], 2 for Alt[1], etc.
	Phase      []bool   // True for phased genotype, false for unphased. len(Phase) == len(Alleles). Phase[0] == true if and only if Phase[1:] == true
	FormatData []string // FormatData contains additional sample fields after the genotype, which are parsed into a slice delimited by colons.
}
Sample is a substruct of Vcf, and contains information about each sample represented in a VCF line. Indexes in Alleles are set to -1 if no genotype data is present.
type SampleHash ¶
SampleHash stores index and position information, but is currently being deprecated.
func HeaderToMaps ¶
func HeaderToMaps(header Header) *SampleHash
HeaderToMaps uses a Vcf header to create a pointer to a SampleHash
type SmallVariantChans ¶
type SmallVariantChans struct {
	Substitutions chan variant.Substitution
	Insertions    chan variant.Insertion
	Deletions     chan variant.Deletion
	Delins        chan variant.Delins
	Records       chan Vcf
}
SmallVariantChans wraps channels for all of the small variants that are valid variant.Mutators and variant.Effectors. Every channel except Records sends a parsed version of a given Vcf record that stores the corresponding variant type. Each send on one of the variant channels must be paired with a send on the Records channel, i.e. for each parsed variant sent, the original vcf record is the following send. This is useful as the Vcf records often hold metadata that is not stored in the variant struct. Even if the original Vcf record is not used, the Records channel must be read, otherwise the sending goroutine will be blocked.
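The paired-send protocol can be sketched with local types. Everything below (record, substitution, chans, send, collect) is a hypothetical stand-in for the package's types and goroutines; it keeps a single variant channel and closes the channels when input ends, which the description above does not confirm.

```go
package main

import "fmt"

// record and substitution are local stand-ins for Vcf and variant.Substitution.
type record struct {
	Pos      int
	Ref, Alt string
}

type substitution struct {
	Pos int
	Alt byte
}

// chans mirrors the shape of SmallVariantChans with one variant channel.
type chans struct {
	Substitutions chan substitution
	Records       chan record
}

// send forwards each single-base substitution, then its original record.
func send(in <-chan record, out chans) {
	for r := range in {
		if len(r.Ref) == 1 && len(r.Alt) == 1 {
			out.Substitutions <- substitution{Pos: r.Pos, Alt: r.Alt[0]}
			out.Records <- r // paired send: blocks until the reader drains it
		}
	}
	close(out.Substitutions)
	close(out.Records)
}

// collect reads every variant and always drains the paired record.
func collect(recs []record) []string {
	in := make(chan record)
	out := chans{make(chan substitution), make(chan record)}
	go func() {
		for _, r := range recs {
			in <- r
		}
		close(in)
	}()
	go send(in, out)

	var got []string
	for s := range out.Substitutions {
		r := <-out.Records // skipping this read would block the sender
		got = append(got, fmt.Sprintf("%d:%s>%c", r.Pos, r.Ref, s.Alt))
	}
	return got
}

func main() {
	fmt.Println(collect([]record{{10, "A", "G"}, {20, "AT", "A"}, {30, "C", "T"}}))
}
```

With several variant channels, as in the real struct, the reader would typically select across them and read Records after each received variant.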
func GoChanToVariants ¶
func GoChanToVariants(c <-chan Vcf) SmallVariantChans
GoChanToVariants wraps ChanToVariants and handles variant channel creation and goroutine spawning.
func NewSmallVariantChans ¶
func NewSmallVariantChans(buffSize int) SmallVariantChans
NewSmallVariantChans creates a series of channels to send the variant representation of a given vcf record. The input buffSize determines the channel buffer size for all of the variant channels.
type Vcf ¶
type Vcf struct {
	Chr     string
	Pos     int
	Id      string
	Ref     string
	Alt     []string
	Qual    float64
	Filter  string
	Info    string
	Format  []string
	Samples []Sample
	// contains filtered or unexported fields
}
Vcf contains information for each line of a VCF format file, corresponding to variants at one position of a reference genome.
func AnnotateAncestorFromMultiFa ¶
func AnnotateAncestorFromMultiFa(g Vcf, records []fasta.Fasta, RefStart int, AlnStart int) (Vcf, int, int)
AnnotateAncestorFromMultiFa adds the ancestral state to a VCF variant by inspecting a pairwise fasta of the reference genome and an ancestor sequence. records is a pairwise multiFa where the first entry is the reference genome and the second entry is the ancestor. Returns the refPos and alnPos of the current record.
func AppendAncestor ¶
AppendAncestor adds the ancestral allele state (defined by input bases) to the INFO column of a vcf entry.
func FixVcf ¶
FixVcf "fixes" vcf records that have a dash for a deletion, which does not conform to the current VCF file specs, but is often seen in the output of different programs. The function takes the vcf record to be fixed and the reference genome as a map of chromosome name to DNA sequence.
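One common way to repair a dash-style deletion is to left-anchor it on the preceding reference base, as the VCF specification requires for deletions. The sketch below assumes that approach and uses plain strings in place of []dna.Base; the function fixDashDeletion and its error handling are hypothetical and may differ from what FixVcf does.

```go
package main

import "fmt"

// record is a local stand-in for vcf.Vcf, reduced to the fields used here.
type record struct {
	Chr string
	Pos int
	Ref string
	Alt []string
}

// fixDashDeletion rewrites a record whose only Alt is "-" so that Ref and Alt
// both start with the reference base preceding the deletion (assumed approach).
func fixDashDeletion(r record, ref map[string]string) (record, error) {
	if len(r.Alt) != 1 || r.Alt[0] != "-" {
		return r, nil // nothing to fix
	}
	seq, ok := ref[r.Chr]
	if !ok {
		return r, fmt.Errorf("chromosome %q not in reference", r.Chr)
	}
	// Pos is 1-based, so the preceding base sits at 0-based index Pos-2.
	if r.Pos < 2 || r.Pos-2 >= len(seq) {
		return r, fmt.Errorf("no base precedes position %d on %s", r.Pos, r.Chr)
	}
	anchor := string(seq[r.Pos-2])
	r.Pos--
	r.Ref = anchor + r.Ref
	r.Alt = []string{anchor}
	return r, nil
}

func main() {
	ref := map[string]string{"chr1": "ACGTA"}
	fixed, err := fixDashDeletion(record{Chr: "chr1", Pos: 3, Ref: "G", Alt: []string{"-"}}, ref)
	fmt.Println(fixed, err)
}
```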
func InvertVcf ¶
InvertVcf inverts the reference and alt variants in a Vcf record. Currently does not update other fields, but this functionality may be added.
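A minimal sketch of the inversion on a local record type, assuming the reference allele is swapped with the first alt allele. As in the description, no other fields are touched. The Alt slice is copied so the caller's record is not modified, which is a choice made for this sketch.

```go
package main

import "fmt"

// record is a local stand-in for vcf.Vcf, reduced to Ref and Alt.
type record struct {
	Ref string
	Alt []string
}

// invert swaps Ref with Alt[0] and returns the modified copy.
// Assumption: only the first alt allele takes part in the swap.
func invert(r record) record {
	if len(r.Alt) == 0 {
		return r
	}
	alt := make([]string, len(r.Alt))
	copy(alt, r.Alt) // avoid mutating the caller's backing array
	r.Ref, alt[0] = alt[0], r.Ref
	r.Alt = alt
	return r
}

func main() {
	in := record{Ref: "A", Alt: []string{"G"}}
	out := invert(in)
	fmt.Println(in, out)
}
```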
func NextVcf ¶
func NextVcf(reader *fileio.EasyReader) (Vcf, bool)
NextVcf is a helper function of Read and GoReadToChan. Checks a reader for additional data lines and parses a Vcf line if more lines exist.
func ParseFormat ¶
ParseFormat parses the data stored in vcf.Format. Fills an unexported map in the Vcf struct that can be queried with Query(type) functions.
func ParseInfo ¶
ParseInfo parses the data stored in vcf.Info. Fills an unexported map in the Vcf struct that can be queried with Query(type) functions.
func ReorderSampleColumns ¶
ReorderSampleColumns reorganizes the Samples slice based on a samples []int16 specification list.
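The meaning of the samples list is not spelled out above. The sketch below assumes that samples[i] is the index, in the original Samples slice, of the column that should appear at output position i, and uses strings in place of Sample values.

```go
package main

import "fmt"

// reorder is a hypothetical stand-in for the column reordering step.
// Assumption: order[i] is the original index of the i-th output column.
func reorder(samples []string, order []int16) []string {
	out := make([]string, len(order))
	for i, idx := range order {
		out[i] = samples[idx]
	}
	return out
}

func main() {
	cols := []string{"0/0", "0/1", "1/1"}
	fmt.Println(reorder(cols, []int16{2, 0}))
}
```

Under this reading, the list can also drop samples by leaving their indices out, as the usage above does with the middle column.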
func (Vcf) GetChromEnd ¶
GetChromEnd returns the end position of a variant. Since VCF is 1-based, GetChromEnd of a SNP variant will return v.Pos. For indels, GetChromEnd will return v.Pos + the length of the indel - 1.
func (Vcf) GetChromStart ¶
GetChromStart returns the start position of a variant. Since VCF is 1-based, GetChromStart of a SNP variant will return v.Pos - 1. VCF format defines the first base of an indel as the base prior to the change, so for indels, GetChromStart will simply return v.Pos.
func (Vcf) String ¶
String implements the fmt.Stringer interface for easy printing of Vcf with the fmt package.
func (Vcf) UpdateCoord ¶
UpdateCoord modifies the position data in a VCF struct based on input values.
func (Vcf) WriteToFileHandle ¶
WriteToFileHandle writes a VCF struct to an io.Writer.