gtf

package
v1.0.1 Latest
Warning

This package is not in the latest version of its module.

Published: May 1, 2024 License: BSD-3-Clause Imports: 16 Imported by: 1

Documentation

Overview

Package gtf contains functions for reading, writing, and manipulating GTF format files. More information on the GTF file format can be found at http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

Structs in the gtf package are organized hierarchically: the Gene struct contains the underlying transcripts, exons, and other gene features associated with that gene.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func AllAreEqual

func AllAreEqual(a map[string]*Gene, b map[string]*Gene) bool

AllAreEqual returns true if every entry in the two GTF maps contains the same information, false otherwise.

func CdnaLength

func CdnaLength(t *Transcript) int

CdnaLength returns the length of the cDNA in nucleotides.
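As an illustration of what a cDNA length calculation involves, the following is a minimal sketch and not the package's implementation. It uses trimmed-down local copies of Exon and Transcript, and it assumes exon coordinates are 1-based and inclusive, as in GTF files; the helper name cdnaLength is hypothetical.

```go
package main

import "fmt"

// Exon and Transcript are trimmed-down local stand-ins for the package types.
type Exon struct {
	Start int
	End   int
}

type Transcript struct {
	Exons []*Exon
}

// cdnaLength sums the lengths of all exons in a transcript.
// Assumption: Start and End are 1-based and inclusive.
func cdnaLength(t *Transcript) int {
	total := 0
	for _, e := range t.Exons {
		total += e.End - e.Start + 1
	}
	return total
}

func main() {
	tr := &Transcript{Exons: []*Exon{{Start: 101, End: 200}, {Start: 301, End: 350}}}
	fmt.Println(cdnaLength(tr)) // 100 + 50 = 150
}
```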

func CdsBoolArray

func CdsBoolArray(g map[string]*Gene, c map[string]*chromInfo.ChromInfo) map[string][]bool

CdsBoolArray returns a map of chromosome names to bool slices. The bool is true if that position lies in a CDS (protein-coding) region. A map of chromosome name to the information for that chromosome is needed to know the length of the returned bool slices.
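The per-base mask idea can be sketched with plain slices. This is an illustration under stated assumptions, not the package's code: the region type, the buildMask helper, and the use of a simple name-to-size map in place of chromInfo.ChromInfo are all hypothetical, and input coordinates are assumed to be 1-based and inclusive while the mask is indexed from 0.

```go
package main

import "fmt"

// region is a hypothetical stand-in for one feature interval on a chromosome.
type region struct {
	Chrom string
	Start int // 1-based, inclusive (assumed)
	End   int // 1-based, inclusive (assumed)
}

// buildMask allocates one bool per base of every chromosome in chromSizes
// and sets the positions covered by any region to true (0-based index).
func buildMask(regions []region, chromSizes map[string]int) map[string][]bool {
	mask := make(map[string][]bool, len(chromSizes))
	for name, size := range chromSizes {
		mask[name] = make([]bool, size)
	}
	for _, r := range regions {
		m, ok := mask[r.Chrom]
		if !ok {
			continue // chromosome not described in chromSizes
		}
		for i := r.Start - 1; i < r.End && i < len(m); i++ {
			if i >= 0 {
				m[i] = true
			}
		}
	}
	return mask
}

func main() {
	mask := buildMask([]region{{"chr1", 3, 5}}, map[string]int{"chr1": 8})
	fmt.Println(mask["chr1"]) // [false false true true true false false false]
}
```

The chromosome sizes are needed up front because each slice is allocated once at full chromosome length, which trades memory for constant-time lookups.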

func CdsLength

func CdsLength(t *Transcript) int

CdsLength returns the length of the Cds in nucleotides (after splicing).

func EqualCds

func EqualCds(a *Cds, b *Cds) bool

EqualCds returns true if two input Cds structs contain the same information, false otherwise.

func EqualExon

func EqualExon(a *Exon, b *Exon) bool

EqualExon returns true if two input Exon structs contain the same information, false otherwise.

func EqualFiveUtr

func EqualFiveUtr(a *FiveUtr, b *FiveUtr) bool

EqualFiveUtr returns true if two input FiveUtr structs contain the same information, false otherwise.

func EqualGene

func EqualGene(a *Gene, b *Gene) bool

EqualGene returns true if two input Gene structs contain the same information, false otherwise.

func EqualThreeUtr

func EqualThreeUtr(a *ThreeUtr, b *ThreeUtr) bool

EqualThreeUtr returns true if two input ThreeUtr structs contain the same information, false otherwise.

func EqualTranscript

func EqualTranscript(a *Transcript, b *Transcript) bool

EqualTranscript returns true if two input Transcript structs contain the same information, false otherwise.

func ExonBoolArray

func ExonBoolArray(g map[string]*Gene, c map[string]*chromInfo.ChromInfo) map[string][]bool

ExonBoolArray returns a map of chromosome names to bool slices. The bool is true if that position lies in an exon. A map of chromosome name to the information for that chromosome is needed to know the length of the returned bool slices.

func FilterVariantCds

func FilterVariantCds(v *vcf.Vcf, g map[string]*Gene, c map[string]*chromInfo.ChromInfo) bool

FilterVariantCds takes a vcf record, a gene list from a gtf, and ChromInfo to know the length of chromosomes. The function returns true if the vcf record overlaps a CDS (protein-coding sequence) in the gtf.

func FilterVariantExon

func FilterVariantExon(v *vcf.Vcf, g map[string]*Gene, c map[string]*chromInfo.ChromInfo) bool

FilterVariantExon takes a vcf record, a gene list from a gtf, and ChromInfo to know the length of chromosomes. The function returns true if the vcf record overlaps an exon in the gtf.

func FilterVariantFiveUtr

func FilterVariantFiveUtr(v *vcf.Vcf, g map[string]*Gene, c map[string]*chromInfo.ChromInfo) bool

FilterVariantFiveUtr takes a vcf record, a gene list from a gtf, and ChromInfo to know the length of chromosomes. The function returns true if the vcf record overlaps a 5' UTR in the gtf.

func FilterVariantGtf

func FilterVariantGtf(v *vcf.Vcf, g map[string]*Gene, c map[string]*chromInfo.ChromInfo, exon bool, code bool, five bool, three bool) bool

FilterVariantGtf takes a vcf record, a gene list from a gtf, ChromInfo to know the length of chromosomes, and flags selecting whether the function should search for exon, coding, 5' UTR, or 3' UTR overlaps. If more than one of exon, code, five, and three is set to true, the function returns the logical OR of the selected overlap tests.
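The OR-combination of selected overlap tests can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes precomputed per-base masks (one per feature class) and a 0-based position, and the featureMasks type and both helper names are hypothetical.

```go
package main

import "fmt"

// featureMasks is a hypothetical bundle of per-base masks, one per feature class.
type featureMasks struct {
	exon, cds, fiveUtr, threeUtr map[string][]bool
}

// overlaps reports whether the 0-based position pos is set in the mask for chrom.
// Missing chromosomes and out-of-range positions count as no overlap.
func overlaps(mask map[string][]bool, chrom string, pos int) bool {
	m := mask[chrom]
	return pos >= 0 && pos < len(m) && m[pos]
}

// filterVariant returns the logical OR of the overlap tests whose flag is true.
func filterVariant(chrom string, pos int, m featureMasks, exon, code, five, three bool) bool {
	return (exon && overlaps(m.exon, chrom, pos)) ||
		(code && overlaps(m.cds, chrom, pos)) ||
		(five && overlaps(m.fiveUtr, chrom, pos)) ||
		(three && overlaps(m.threeUtr, chrom, pos))
}

func main() {
	m := featureMasks{
		exon: map[string][]bool{"chr1": {false, true, true}},
		cds:  map[string][]bool{"chr1": {false, false, true}},
	}
	fmt.Println(filterVariant("chr1", 1, m, true, false, false, false)) // true: in an exon
	fmt.Println(filterVariant("chr1", 1, m, false, true, false, false)) // false: not in a CDS
	fmt.Println(filterVariant("chr1", 1, m, true, true, false, false))  // true: exon OR CDS
}
```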

func FilterVariantThreeUtr

func FilterVariantThreeUtr(v *vcf.Vcf, g map[string]*Gene, c map[string]*chromInfo.ChromInfo) bool

FilterVariantThreeUtr takes a vcf record, a gene list from a gtf, and ChromInfo to know the length of chromosomes. The function returns true if the vcf record overlaps a 3' UTR in the gtf.

func FindPromoter added in v1.0.1

func FindPromoter(genes []string, upstream int, downstream int, gtf map[string]*Gene, size map[string]chromInfo.ChromInfo) []bed.Bed

func FiveUtrBoolArray

func FiveUtrBoolArray(g map[string]*Gene, c map[string]*chromInfo.ChromInfo) map[string][]bool

FiveUtrBoolArray returns a map of chromosome names to bool slices. The bool is true if that position lies in a 5' UTR. A map of chromosome name to the information for that chromosome is needed to know the length of the returned bool slices.

func GeneToCanonicalBed

func GeneToCanonicalBed(g Gene, c map[string]chromInfo.ChromInfo, upstream int, downstream int) bed.Bed

GeneToCanonicalBed converts a Gene struct into a bed representing the promoter region of the canonical transcript. The user specifies the number of bases upstream and downstream of the TSS that define the promoter region.
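A strand-aware promoter window around a TSS can be sketched as below. This is an illustration only: the bedRegion type and promoter helper are hypothetical, the coordinate convention of the real bed.Bed output is not reproduced here, and the sketch assumes that a true strand value means the positive strand and that the window is clamped to the chromosome bounds.

```go
package main

import "fmt"

// bedRegion is a hypothetical, minimal stand-in for a bed entry.
type bedRegion struct {
	Chrom string
	Start int
	End   int
	Name  string
}

// promoter returns the window around tss, clamped to [0, chromSize].
// On the positive strand "upstream" means smaller coordinates;
// on the negative strand the directions are mirrored.
func promoter(chrom string, tss int, plusStrand bool, upstream, downstream, chromSize int, name string) bedRegion {
	start, end := tss-upstream, tss+downstream
	if !plusStrand {
		start, end = tss-downstream, tss+upstream
	}
	if start < 0 {
		start = 0
	}
	if end > chromSize {
		end = chromSize
	}
	return bedRegion{Chrom: chrom, Start: start, End: end, Name: name}
}

func main() {
	fmt.Printf("%+v\n", promoter("chr1", 1000, true, 500, 2000, 2500, "geneA"))
	fmt.Printf("%+v\n", promoter("chr1", 1000, false, 500, 2000, 5000, "geneB"))
}
```

Clamping is the reason a chromosome size lookup is required: a TSS close to either end of a chromosome would otherwise yield coordinates outside the sequence.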

func GeneToCanonicalTssBed

func GeneToCanonicalTssBed(g Gene, c map[string]chromInfo.ChromInfo) bed.Bed

GeneToCanonicalTssBed converts a single Gene struct into a Bed representing the TSS position of the canonical transcript.

func GeneToPromoterBed

func GeneToPromoterBed(g Gene, c map[string]chromInfo.ChromInfo, upstream int, downstream int) []bed.Bed

GeneToPromoterBed produces a slice of beds from a gene containing the positions of promoters (TSS-500bp -> TSS+2kb) for all transcripts of the gene with the geneName in the Name field of the output Bed entries.

func GeneToTssBed

func GeneToTssBed(g Gene, c map[string]chromInfo.ChromInfo) []bed.Bed

GeneToTssBed returns the positions of all TSSs from a Gene as a slice of single base-pair bed entries.

func GenesToBedFirstTwoCodonBases added in v1.0.1

func GenesToBedFirstTwoCodonBases(genes map[string]*Gene) []bed.Bed

GenesToBedFirstTwoCodonBases takes a map[string]*Gene and returns a []bed.Bed.

func GenesToCanonicalBeds

func GenesToCanonicalBeds(g map[string]*Gene, c map[string]chromInfo.ChromInfo, upstream int, downstream int) []bed.Bed

GenesToCanonicalBeds converts all genes in a map[string]*Gene to a []bed.Bed, where each bed represents the promoter region of the canonical transcript, defined by user-specified upstream and downstream distances from the TSS.

func GenesToCanonicalTranscriptsTssBed

func GenesToCanonicalTranscriptsTssBed(g map[string]*Gene, c map[string]chromInfo.ChromInfo) []bed.Bed

GenesToCanonicalTranscriptsTssBed takes an input map of [geneId]*Gene structs, finds the canonical transcript of each gene (defined as the one with the longest coding sequence), and converts the TSS of that transcript into a Bed struct.

func GenesToIntervalTree

func GenesToIntervalTree(genes map[string]*Gene) map[string]*interval.IntervalNode

GenesToIntervalTree builds a fractionally cascaded 2d interval tree for efficiently identifying genes that overlap a variant.

func GenesToPromoterBed

func GenesToPromoterBed(g map[string]*Gene, c map[string]chromInfo.ChromInfo, upstream int, downstream int) []bed.Bed

GenesToPromoterBed produces a slice of beds from a set of genes containing the positions of all promoters for all transcripts for all genes with the geneID in the Name field of the output Bed entries.

func GenesToTssBed

func GenesToTssBed(g map[string]*Gene, c map[string]chromInfo.ChromInfo, merge bool) []bed.Bed

GenesToTssBed returns the position of all TSSs from a Gene map as a slice of single base-pair bed entries.

func MoveAllCanonicalToZero

func MoveAllCanonicalToZero(m map[string]*Gene)

MoveAllCanonicalToZero applies MoveCanonicalToZero to every value in the map.

func MoveCanonicalToZero

func MoveCanonicalToZero(g *Gene)

MoveCanonicalToZero does a single iteration of bubble sort to move the longest/canonical transcript to the first position in the slice. This is faster than SortTranscripts.
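A single bubble-sort pass that brings the longest element to index 0 can be sketched like this. It is not the package's code: the transcript type with a precomputed CdsLen field and the moveLongestToFront helper are hypothetical, and "longest" is assumed to mean the longest coding sequence.

```go
package main

import "fmt"

// transcript is a hypothetical stand-in carrying a precomputed CDS length.
type transcript struct {
	ID     string
	CdsLen int
}

// moveLongestToFront makes one pass from the back of the slice to the front,
// swapping neighbours whenever the later one is longer. After the pass the
// longest transcript is at index 0; the rest is not fully sorted.
func moveLongestToFront(ts []transcript) {
	for i := len(ts) - 1; i > 0; i-- {
		if ts[i].CdsLen > ts[i-1].CdsLen {
			ts[i], ts[i-1] = ts[i-1], ts[i]
		}
	}
}

func main() {
	ts := []transcript{{"tx1", 10}, {"tx2", 20}, {"tx3", 30}}
	moveLongestToFront(ts)
	fmt.Println(ts[0].ID) // tx3
}
```

One pass costs O(n) comparisons, whereas a full sort costs O(n log n), which is why a single pass is cheaper when only the first position matters.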

func Read

func Read(filename string) map[string]*Gene

Read generates a map[geneID]*Gene of GTF information from an input GTF format file.
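To illustrate the grouping that a map keyed by gene ID implies, here is a minimal sketch that parses GTF lines (nine tab-separated columns, with gene_id stored in the attribute column) and groups feature types by gene. It is not how Read is implemented; the helper names geneID and groupByGene are hypothetical and the sketch ignores everything except the feature type and gene_id.

```go
package main

import (
	"fmt"
	"strings"
)

// geneID extracts the gene_id value from a GTF attribute column,
// e.g. `gene_id "G1"; transcript_id "T1";` yields G1.
func geneID(attrs string) string {
	for _, field := range strings.Split(attrs, ";") {
		parts := strings.Fields(field)
		if len(parts) == 2 && parts[0] == "gene_id" {
			return strings.Trim(parts[1], `"`)
		}
	}
	return ""
}

// groupByGene maps each gene_id to the feature types (column 3) seen for it.
// Comment lines, blank lines, and lines without nine columns are skipped.
func groupByGene(lines []string) map[string][]string {
	out := make(map[string][]string)
	for _, line := range lines {
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		cols := strings.Split(line, "\t")
		if len(cols) != 9 {
			continue
		}
		id := geneID(cols[8])
		out[id] = append(out[id], cols[2])
	}
	return out
}

func main() {
	lines := []string{
		"# example input",
		"chr1\tdemo\ttranscript\t1\t100\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\";",
		"chr1\tdemo\texon\t1\t50\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\";",
	}
	fmt.Println(groupByGene(lines)["G1"]) // [transcript exon]
}
```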

func SortAllTranscripts

func SortAllTranscripts(m map[string]*Gene)

SortAllTranscripts applies SortTranscripts to every value in the map.

func SortTranscripts

func SortTranscripts(g *Gene)

SortTranscripts sorts the longest transcript to the front so that the canonical/longest transcript is always g.Transcripts[0].
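A full descending sort by length can be sketched with the standard library. This is an assumption-laden illustration rather than the package's code: the transcript type, its CdsLen field, and the sortByLength helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// transcript is a hypothetical stand-in carrying a precomputed CDS length.
type transcript struct {
	ID     string
	CdsLen int
}

// sortByLength orders transcripts from longest to shortest.
// SliceStable keeps the input order for transcripts of equal length.
func sortByLength(ts []transcript) {
	sort.SliceStable(ts, func(i, j int) bool {
		return ts[i].CdsLen > ts[j].CdsLen
	})
}

func main() {
	ts := []transcript{{"tx1", 10}, {"tx2", 30}, {"tx3", 20}}
	sortByLength(ts)
	fmt.Println(ts) // [{tx2 30} {tx3 20} {tx1 10}]
}
```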

func ThreeUtrBoolArray

func ThreeUtrBoolArray(g map[string]*Gene, c map[string]*chromInfo.ChromInfo) map[string][]bool

ThreeUtrBoolArray returns a map of chromosome names to bool slices. The bool is true if that position lies in a 3' UTR. A map of chromosome name to the information for that chromosome is needed to know the length of the returned bool slices.

func VariantArrayOverlap

func VariantArrayOverlap(v *vcf.Vcf, a map[string][]bool) bool

VariantArrayOverlap takes a vcf record and a map of bool slices (chrom name maps to a bool for each base in that chrom). The bool slice encodes the presence/absence of some genomic feature, and true is returned if the vcf position overlaps that feature.

func VariantToAnnotation

func VariantToAnnotation(variant *vcfEffectPrediction, seq map[string][]dna.Base) string

VariantToAnnotation generates an annotation which can be appended to the INFO field of a VCF.

Annotation format is: GoEP= Genomic | Gene | cDNA | Protein | VariantType

Genomic, cDNA, and Protein annotations are in HGVS variant nomenclature format (https://varnomen.hgvs.org/). The sequence of the reference genome needs to be supplied as a map from chromosome name to chromosome sequence.

TODO: Not sensitive to UTR splice junctions.
TODO: Remove reports of splice variants for terminal exons; NMD prediction?
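The five-field layout of the annotation can be sketched with a small formatter. This is only an illustration of the field order: the formatAnnotation helper is hypothetical, the exact delimiter and spacing emitted by the package are an assumption, and the example values are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// formatAnnotation joins the five annotation fields in the documented order:
// Genomic, Gene, cDNA, Protein, VariantType.
// Assumption: fields are separated by a bare "|" with no surrounding spaces.
func formatAnnotation(genomic, gene, cdna, protein, variantType string) string {
	fields := []string{genomic, gene, cdna, protein, variantType}
	return "GoEP=" + strings.Join(fields, "|")
}

func main() {
	// All values below are invented for illustration.
	fmt.Println(formatAnnotation("g.chr1:12345A>G", "GENE1", "c.76A>G", "p.Thr26Ala", "Missense"))
}
```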

func VcfToVariant

func VcfToVariant(v vcf.Vcf, tree map[string]*interval.IntervalNode, seq map[string][]dna.Base, allTranscripts bool) (*vcfEffectPrediction, error)

VcfToVariant determines the effects of a variant on the cDNA and amino acid sequence by querying genes in the tree made by GenesToIntervalTree. Note that if multiple genes are found to overlap a variant, this function returns the variant based on the first queried gene and returns an error. All bases in the fasta record must be uppercase.

func Write

func Write(filename string, records map[string]*Gene)

Write writes information contained in a GTF data structure to an output file.

func WriteToFileHandle

func WriteToFileHandle(file io.Writer, gene *Gene)

WriteToFileHandle is a helper function of Write that writes a single gene to an output file.

Types

type Cds

type Cds struct {
	Start int
	End   int
	Score float64
	Frame int
	Prev  *Cds
	Next  *Cds
}

Cds contains the location and score information for Cds lines of a GTF file. Cds structs also point to the next and previous Cds in the transcript.
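Because each Cds points to its neighbours, the coding segments of a transcript can be walked as a linked list. The sketch below uses a trimmed-down local copy of the struct and a hypothetical cdsLength helper, and it assumes 1-based inclusive coordinates; it is not the package's implementation.

```go
package main

import "fmt"

// Cds is a trimmed-down local copy of the documented struct.
type Cds struct {
	Start int
	End   int
	Prev  *Cds
	Next  *Cds
}

// cdsLength follows Next pointers from the first Cds and sums segment lengths.
// Assumption: Start and End are 1-based and inclusive.
func cdsLength(first *Cds) int {
	total := 0
	for c := first; c != nil; c = c.Next {
		total += c.End - c.Start + 1
	}
	return total
}

func main() {
	first := &Cds{Start: 1, End: 9}
	second := &Cds{Start: 20, End: 30, Prev: first}
	first.Next = second
	fmt.Println(cdsLength(first)) // 9 + 11 = 20
}
```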

type Exon

type Exon struct {
	Start      int
	End        int
	Score      float64
	ExonNumber string
	ExonID     string
	Cds        *Cds
	FiveUtr    *FiveUtr
	ThreeUtr   *ThreeUtr
}

Exon contains information on the location, score, and relative order of exons in a GTF file.

type FiveUtr

type FiveUtr struct {
	Start int
	End   int
	Score float64
}

FiveUtr contains the location and score information for FiveUtr lines of a GTF file.

type Gene

type Gene struct {
	GeneID      string
	GeneName    string
	Transcripts []*Transcript
}

Gene organizes all underlying data on a gene feature in a GTF file.

func (*Gene) GetChrom

func (g *Gene) GetChrom() string

GetChrom returns the name of the chromosome where the gene is located.

func (*Gene) GetChromEnd

func (g *Gene) GetChromEnd() int

GetChromEnd returns the genomic coordinate where the gene ends.

func (*Gene) GetChromStart

func (g *Gene) GetChromStart() int

GetChromStart returns the genomic coordinate where the gene starts.

func (*Gene) WriteToFileHandle

func (g *Gene) WriteToFileHandle(file io.Writer)

WriteToFileHandle writes the gene to an io.Writer.

type ThreeUtr

type ThreeUtr struct {
	Start int
	End   int
	Score float64
}

ThreeUtr contains the location and score information for ThreeUtr lines of a GTF file.

type Transcript

type Transcript struct {
	Chr          string
	Source       string
	Start        int
	End          int
	Score        float64
	Strand       bool
	TranscriptID string
	Exons        []*Exon
}

Transcript contains information on the location, score and strand of a transcript, along with the underlying exons.
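The Gene, Transcript, and Exon nesting can be illustrated by building a small hierarchy and walking it. The types below are trimmed-down local copies of the documented structs, and the exonCounts helper and all values are hypothetical.

```go
package main

import "fmt"

// Trimmed-down local copies of the documented structs.
type Exon struct {
	Start, End int
	ExonNumber string
}

type Transcript struct {
	Chr          string
	Strand       bool
	TranscriptID string
	Exons        []*Exon
}

type Gene struct {
	GeneID      string
	GeneName    string
	Transcripts []*Transcript
}

// exonCounts walks gene -> transcripts -> exons and reports
// the number of exons for each transcript ID.
func exonCounts(g *Gene) map[string]int {
	counts := make(map[string]int, len(g.Transcripts))
	for _, t := range g.Transcripts {
		counts[t.TranscriptID] = len(t.Exons)
	}
	return counts
}

func main() {
	g := &Gene{
		GeneID:   "gene1",
		GeneName: "demo",
		Transcripts: []*Transcript{
			{Chr: "chr1", Strand: true, TranscriptID: "tx1", Exons: []*Exon{
				{Start: 1, End: 50, ExonNumber: "1"},
				{Start: 101, End: 150, ExonNumber: "2"},
			}},
			{Chr: "chr1", Strand: true, TranscriptID: "tx2", Exons: []*Exon{
				{Start: 1, End: 50, ExonNumber: "1"},
			}},
		},
	}
	for id, n := range exonCounts(g) {
		fmt.Printf("%s has %d exon(s)\n", id, n)
	}
}
```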
