seq

package module
v0.0.0-...-72e89e6 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 3, 2014 License: MIT Imports: 7 Imported by: 49

README

This package provides common operations for dealing with biological sequence data. The package is not exhaustive and sometimes only provides representations for essential data without operations. (For example, a representation of a multiple-sequence alignment without a multiple-sequence alignment algorithm.)

The sequence type defined here is used throughout other Go packages in the TuftsBCB group.

Documentation for this package can be found here:

http://godoc.org/github.com/TuftsBCB/seq

Installation

Once you have Go installed, just run

go get github.com/TuftsBCB/seq

to install the packages.

Documentation

Overview

Package seq provides common types and operations for dealing with biological sequence data, with a bias toward amino acid sequences. Types includes sequences, profiles, multiple sequence alignments and HMMs. Operations include sequence alignment (currently only Needleman-Wunsch global alignment), building frequency profiles with background probabilities and an implementation of the Viterbi algorithm to find the probability of the most likely alignment of a sequence to an HMM.

This package is currently a "kitchen sink" of operations on biological sequences. It isn't yet clear (to me) whether it should remain a kitchen sink.

Index

Examples

Constants

This section is empty.

Variables

View Source
var (

	// A standard BLOSUM62 substitution matrix for amino acid sequences.
	SubstBlosum62 = SubstMatrix{AlphaBlosum62, blosum62}

	// An identity substitution matrix for DNA sequences.
	SubstDNA = SubstMatrix{AlphaDNA, identity}

	// An identity substitution matrix for DNA sequences.
	SubstRNA = SubstMatrix{AlphaRNA, identity}
)
View Source
var AlphaBlosum62 = NewAlphabet(
	'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
	'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z', '-',
)

The default alphabet that corresponds to the BLOSUM62 matrix included in this package.

View Source
var AlphaDNA = NewAlphabet(
	'A', 'C', 'G', 'T', 'N', '-',
)

The default alphabet for DNA sequences.

View Source
var AlphaRNA = NewAlphabet(
	'A', 'C', 'G', 'U', 'N', '-',
)

The default alphabet for RNA sequences.

View Source
var MinProb = Prob(math.MaxFloat64)

The value representing a minimum emission/transition probability. Remember, max in negative log space is minimum probability.

Functions

This section is empty.

Types

type Alignment

type Alignment struct {
	A []Residue // Reference
	B []Residue // Query
}

Alignment represents the result of aligning two sequences.

func NeedlemanWunsch

func NeedlemanWunsch(A, B []Residue, subst SubstMatrix) Alignment

Performs the Needleman-Wunsch sequence alignment algorithm on a pair of sequences. A function `subst` should return alignment scores for pairs of residues. This package provides some functions suitable for this purpose, e.g., MatBlosum62, MatDNA, MatRNA, etc.

Example
s1 := NewSequenceString("seq1", "GHIKLMNPQR")
s2 := NewSequenceString("seq1", "GAAAHIKLMN")
aligned := NeedlemanWunsch(s1.Residues, s2.Residues, SubstBlosum62)

fmt.Printf("%s\n", aligned.A)
fmt.Printf("%s\n", aligned.B)
Output:

G---HIKLMNPQR
GAAAHIKLMN---

type Alphabet

type Alphabet []Residue

Alphabet corresponds to a set of residues, in a particular order, that capture all possible residues of a particular set of sequences. For example, this is used in frequency profiles and HMMs to specify which amino acids are represented in the probabilistic model.

In most cases, the ordering of the alphabet is significant. For example, the indices of an alphabet may be in correspondence with the indices of a column in a frequency profile.

func NewAlphabet

func NewAlphabet(residues ...Residue) Alphabet

NewAlphabet creates an alphabet from the residues given.

func (Alphabet) Equals

func (a1 Alphabet) Equals(a2 Alphabet) bool

Equals returns true if and only if a1 == a2.

func (Alphabet) Index

func (a Alphabet) Index() [256]int

Index returns a constant-time mapping from ASCII to residue index in the alphabet. This depends on all residues in the alphabet being ASCII characters.

func (Alphabet) Len

func (a Alphabet) Len() int

func (*Alphabet) MarshalJSON

func (a *Alphabet) MarshalJSON() ([]byte, error)

func (Alphabet) String

func (a Alphabet) String() string

func (*Alphabet) UnmarshalJSON

func (a *Alphabet) UnmarshalJSON(bs []byte) error

type DynamicTable

type DynamicTable struct {
	// contains filtered or unexported fields
}

DynamicTable represents a dynamic programming table used in sequence alignment algorithms like Viterbi.

func AllocTable

func AllocTable(numNodes int, seqLen int) *DynamicTable

AllocTable returns a freshly allocated dynamic programming table for use in HMM alogirthms like Viterbi. It is indexed by HMM state, node index and sequence length, in that order. The total size of the table is equal to (#states * (#nodes + 1) * (seqLen + 1)).

The only states used are Match, Deletion and Insertion.

Each value is initialized to a minimum probability.

type EProbs

type EProbs struct {
	Offset Residue
	Probs  []Prob
}

EProbs represents emission probabilities, as log-odds scores. The representation of EProbs should not be used by clients; it is exported only for convenience with marshalling APIs.

func NewEProbs

func NewEProbs(alphabet Alphabet) EProbs

NewEProbs creates a new EProbs map from the given alphabet. Keys of the map are residues defined in the alphabet, and values are defaulted to the minimum probability.

func (EProbs) Lookup

func (ep EProbs) Lookup(r Residue) Prob

Returns the emission probability for a particular residue.

func (*EProbs) Set

func (ep *EProbs) Set(r Residue, p Prob)

Set sets the emission probability of the given residue.

type FrequencyProfile

type FrequencyProfile struct {
	// The columns of a frequency profile.
	Freqs []map[Residue]int

	// The alphabet of the profile. The length of the alphabet should be
	// equal to the number of rows in the frequency profile.
	// There are no restrictions on the alphabet. (i.e., Gap characters are
	// allowed but they are not treated specially.)
	Alphabet Alphabet
}

FrequencyProfile represents a sequence profile in terms of raw frequencies. A FrequencyProfile is useful as an intermediate representation. It can be used to incrementally build a Profile.

func NewFrequencyProfile

func NewFrequencyProfile(columns int) *FrequencyProfile

NewFrequencyProfile initializes a frequency profile with a default alphabet that is compatible with this package's BLOSUM62 matrix.

func NewFrequencyProfileAlphabet

func NewFrequencyProfileAlphabet(
	columns int,
	alphabet Alphabet,
) *FrequencyProfile

NewFrequencyProfileAlphabet initializes a frequency profile with the given alphabet.

func NewNullProfile

func NewNullProfile() *FrequencyProfile

NewNullProfile initializes a frequency profile that can be used to tabulate a null model. This is equivalent to calling NewFrequencyProfile with the number of columns set to 1.

func (*FrequencyProfile) Add

func (fp *FrequencyProfile) Add(s Sequence)

Add adds the sequence to the given profile. The sequence must have length equivalent to the number of columns in the profile. The sequence must also only contain residues that are in the alphabet for the profile.

As a special case, if the alphabet contains the 'X' residue, then any unrecognized residues in the sequence with respect to the profile's alphabet will be considered as an 'X' residue.

func (*FrequencyProfile) Combine

func (fp1 *FrequencyProfile) Combine(fp2 *FrequencyProfile)

Combine adds the given frequency profile to the current one. Both profiles must have the same number of columns.

func (*FrequencyProfile) Len

func (fp *FrequencyProfile) Len() int

Len returns the number of columns in the frequency profile.

func (*FrequencyProfile) Profile

func (fp *FrequencyProfile) Profile(null *FrequencyProfile) *Profile

Profile converts a raw frequency profile to a profile that uses a log-odds representation. The log-odds scores are computed with the given null model, which is itself just a raw frequency profile with a single column.

func (*FrequencyProfile) String

func (fp *FrequencyProfile) String() string

type HMM

type HMM struct {
	// An ordered list of HMM nodes.
	Nodes []HMMNode

	// The alphabet as defined by an ordering of residues.
	// Indices in this slice correspond to indices in match/insertion emissions.
	Alphabet Alphabet

	// NULL model. (Amino acid background frequencies.)
	// HMMER hmm files don't have this, but HHsuite hhm files do.
	// In the case of HHsuite, the NULL model is used for insertion emissions
	// in every node.
	Null EProbs
}

func HMMCat

func HMMCat(h1, h2 *HMM) *HMM

HMMCat joins two HMMs together. The HMMs given are not modified. Both HMMs must have the same alphabet. The null emissions for the first HMM are used.

func NewHMM

func NewHMM(nodes []HMMNode, alphabet []Residue, null EProbs) *HMM

NewHMM creates a new HMM from a list of nodes, an ordered alphabet and a set of null probabilities (which may be nil).

func (*HMM) Slice

func (hmm *HMM) Slice(start, end int) *HMM

Slice returns a slice of the HMM given. A slice of an HMM returns only the HMM nodes (i.e., columns or match/delete states) in the region specified by the slice. Also, the transition probabilities of the last state are specially set: M->M = 0, M->I = *, M->D = *, I->M = 0, I->I = *, D->M = 0, and D->D = *. No other modifications are made.

func (*HMM) ViterbiScore

func (hmm *HMM) ViterbiScore(seq Sequence) Prob

ViterbiScore returns the probability of the likeliest path through the HMM for the given sequence.

If you're running Viterbi in a performance critical section, ViterbiScoreMem may be appropriate.

Note that the state path is not computed.

func (*HMM) ViterbiScoreMem

func (hmm *HMM) ViterbiScoreMem(seq Sequence, table *DynamicTable) Prob

ViterbiScoreMem is the same as ViterbiScore, except it does not allocate, which makes it faster in performance critical sections of code. This is done by passing a pre-allocated dynamic programming table created by AllocTable function.

Note that the caller must ensure that only one goroutine is calling ViterbiScoreMem with the same dynamic programming table.

type HMMNode

type HMMNode struct {
	Residue             Residue
	NodeNum             int
	InsEmit             EProbs
	MatEmit             EProbs
	Transitions         TProbs
	NeffM, NeffI, NeffD Prob
}

HMMNode represents a single node in an HMM, including the reference residue, the node index, insertion emissions, match emissions, transition probabilities.

The NeffM, NeffI and NeffD aren't used, but are included since they exist in common HMM file formats.

type HMMState

type HMMState int
const (
	Match HMMState = iota
	Deletion
	Insertion
	Begin
	End
)

HMM states in the Plan7 architecture.

type MSA

type MSA struct {
	Entries []Sequence
	// contains filtered or unexported fields
}

func NewMSA

func NewMSA() MSA

func (*MSA) Add

func (m *MSA) Add(adds Sequence)

Add adds a new entry to the multiple sequence alignment. Sequences must be in A2M or A3M format. If your sequence is from a FASTA aligned format, use AddFasta.

Empty sequences are ignored.

func (*MSA) AddFasta

func (m *MSA) AddFasta(adds Sequence)

AddFasta adds a FASTA formatted sequence to the MSA. If your sequences are in A2M or A3M format, use Add.

Empty sequences are ignored.

func (*MSA) AddFastaSlice

func (m *MSA) AddFastaSlice(seqs []Sequence)

AddFastaSlice calls "AddFasta" for each sequence in the slice.

func (*MSA) AddSlice

func (m *MSA) AddSlice(seqs []Sequence)

AddSlice calls "Add" for each sequence in the slice.

func (MSA) Get

func (m MSA) Get(row int) Sequence

Get gets a copy of the sequence at the provided row in the MSA. The sequence is in the default format of the MSA representation. Currently, this is A2M. ('-' for deletes, '.' and a-z for inserts, and A-Z for matches.)

func (MSA) GetA2M

func (m MSA) GetA2M(row int) Sequence

GetA2M gets a copy of the sequence at the provided row in the MSA in A2M format.

func (MSA) GetA3M

func (m MSA) GetA3M(row int) Sequence

GetA3M gets a copy of the sequence at the provided row in the MSA in A3M format. This is the same as A2M format, except all '.' (period) are omitted.

func (MSA) GetFasta

func (m MSA) GetFasta(row int) Sequence

GetFasta gets a copy of the sequence at the provided row in the MSA in aligned FASTA format. This is the same as A2M format, except all '.' (period) are changed to '-' residues.

func (MSA) Len

func (m MSA) Len() int

Len returns the length of the alignment. (All entries in an MSA are guaranteed to have the same length.)

func (*MSA) SetLen

func (m *MSA) SetLen(n int)

SetLen allows you to override the length of the MSA. Only use this if you know what you're doing. (The Add and AddFasta methods will update this for you automatically.)

func (MSA) Slice

func (m MSA) Slice(start, end int) MSA

Slice returns a slice of the given MSA by slicing each sequence in the MSA. Slicing an empty MSA will panic.

Note that slicing works in terms of *match states* of the first sequence in the MSA.

func (MSA) String

func (m MSA) String() string

type Prob

type Prob float64

Prob represents a transition or emission probability.

func NewProb

func NewProb(fstr string) (Prob, error)

NewProb creates a new probability value from a string (usually read from an hmm or hhm file). If the string is equivalent to the special value "*", then the probability returned is guaranteed to be minimal. Otherwise, the string is parsed as a float, and an error returned if parsing fails.

func (Prob) Distance

func (p1 Prob) Distance(p2 Prob) float64

Distance returns the distance between `p1` and `p2`.

func (Prob) IsMin

func (p Prob) IsMin() bool

IsMin returns true if the probability is minimal.

func (Prob) Less

func (p1 Prob) Less(p2 Prob) bool

Less returns true if `p1` represents a smaller probability than `p2`.

func (Prob) MarshalJSON

func (p Prob) MarshalJSON() ([]byte, error)

func (Prob) Ratio

func (p Prob) Ratio() float64

Ratio returns the log probability as a ratio in the range [0, 1]. (This assumes that `p` is a log-odds score.)

func (Prob) String

func (p Prob) String() string

String returns a string representation of the probability. When `p` is the minimum probability, then "*" is used. Otherwise, the full number is written.

func (*Prob) UnmarshalJSON

func (p *Prob) UnmarshalJSON(bs []byte) error

type Profile

type Profile struct {
	// The columns of a profile.
	Emissions []EProbs

	// The alphabet of the profile. The length of the alphabet should be
	// equal to the number of rows in the profile.
	// There are no restrictions on the alphabet. (i.e., Gap characters are
	// allowed but they are not treated specially.)
	Alphabet Alphabet
}

Profile represents a sequence profile in terms of log-odds scores.

func NewProfile

func NewProfile(columns int) *Profile

NewProfile initializes a profile with a default alphabet that is compatible with this package's BLOSUM62 matrix. Emission probabilities are set to the minimum log-odds probability.

func NewProfileAlphabet

func NewProfileAlphabet(columns int, alphabet Alphabet) *Profile

NewProfileAlphabet initializes a profile with the given alphabet. Emission probabilities are set to the minimum log-odds probability.

func (*Profile) Len

func (p *Profile) Len() int

func (*Profile) String

func (p *Profile) String() string

type Residue

type Residue byte

A Residue corresponds to a single entry in a sequence.

func (Residue) HMMState

func (r Residue) HMMState() HMMState

HMMState returns the HMMState of a particular residue in a sequence. Residues in [A-Z] are match states. Residues matching '-' are deletion states. Residues equal to '.' or in [a-z] are insertion states.

A residue corresponding to any other value will panic.

The pre-condition here is that 'r' is a residue from a sequence from an A2M format. (N.B. MSA's formed from A3M and FASTA formatted files are repsented as A2M format, so MSA's read from A3M/FASTA files are OK.)

type Sequence

type Sequence struct {
	Name     string
	Residues []Residue
}

A Sequence corresponds to any kind of biological sequence: DNA, RNA, amino acid, secondary structure, etc.

func NewSequenceString

func NewSequenceString(name, srs string) Sequence

NewSequenceString is a convenience function for constructing a sequence from a string. It is otherwise appropriate to create new Sequence values directly.

func (Sequence) Bytes

func (s Sequence) Bytes() []byte

Bytes returns the sequence of residues as a byte slice.

func (Sequence) Copy

func (s Sequence) Copy() Sequence

Copy returns a deep copy of the sequence.

func (Sequence) IsNull

func (s Sequence) IsNull() bool

IsNull returns true if the name has zero length and the residues are nil.

func (Sequence) Len

func (s Sequence) Len() int

Len returns the number of residues in the sequence.

func (Sequence) Slice

func (s Sequence) Slice(start, end int) Sequence

Slice returns a slice of the sequence. The name stays the same, and the sequence of residues corresponds to a Go slice of the original. (This does not copy data, so that if the original or sliced sequence is changed, the other one will too. Use Sequence.Copy first if you need copy semantics.)

type SubstMatrix

type SubstMatrix struct {
	Alphabet Alphabet
	Scores   [][]int
}

SubstMatrix corresponds to a substitution matrix and an alphabet that is used in sequence alignment algorithms. The matrix given should be square, with rows and columns equivalent to the length of the alphabet. (This means that if your substitution matrix includes gap penalties, then the alphabet should also have a gap character.)

type TProbs

type TProbs struct {
	MM, MI, MD, IM, II, DM, DD Prob
}

TProbs represents transition probabilities, as log-odds scores. Note that ID and DI are omitted (Plan7).

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL