lexichash

Version: v0.3.0
Published: Mar 21, 2024 | License: MIT | Imports: 13 | Imported by: 1

README

lexichash

This project implements LexicHash in Golang, with high performance and a low memory footprint.

  • This package is used in LexicMap.
  • Bit-packed k-mer operations are provided by kmers.

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Documentation

Variables

var ErrBrokenFile = errors.New("lexichash: broken file")

ErrBrokenFile means the file is not complete.

var ErrInsufficientMasks = errors.New("lexichash: insufficient masks (should be >=64)")

ErrInsufficientMasks means the number of masks is too small.

var ErrInvalidFileFormat = errors.New("lexichash: invalid binary format")

ErrInvalidFileFormat means the file is not in the expected binary format.

var ErrKOverflow = errors.New("lexichash: k-mer size overflow, valid range is [5-32]")

ErrKOverflow means the k-mer size is outside the valid range; the error message gives the range as [5-32].

var ErrVersionMismatch = errors.New("lexichash: version mismatch")

ErrVersionMismatch means the version of the file does not match the version supported by the program.

var Magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}

var MainVersion uint8 = 0

var MinorVersion uint8 = 1

var Strands = [2]byte{'+', '-'}

Strands can be used to print the strand for a reverse-complement flag: index 0 is '+', index 1 is '-'.

Functions

func IsLowComplexity

func IsLowComplexity(code uint64, k int) bool

IsLowComplexity checks if a k-mer is of low-complexity.

func MustDecode

func MustDecode(code uint64, k uint8) []byte

MustDecode returns the k-mer string.

func MustDecoder

func MustDecoder() func(code uint64, k uint8) []byte

MustDecoder returns a decode function that reuses its byte slice between calls.
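The difference between the two decoding functions can be illustrated with a minimal, self-contained sketch. It does not call the package: decodeKmer and newDecoder are hypothetical stand-ins, and the 2-bit encoding (A=00, C=01, G=10, T=11) is an assumption that should be checked against the kmers package.

```go
package main

import "fmt"

// bit2base maps a 2-bit code to a nucleotide.
// Assumption: A=00, C=01, G=10, T=11 (not confirmed by this page).
var bit2base = [4]byte{'A', 'C', 'G', 'T'}

// decodeKmer is a hypothetical stand-in for MustDecode. It unpacks the k
// lowest 2-bit groups of code into a newly allocated byte slice.
func decodeKmer(code uint64, k uint8) []byte {
	kmer := make([]byte, k)
	for i := int(k) - 1; i >= 0; i-- {
		kmer[i] = bit2base[code&3] // lowest two bits hold the last base
		code >>= 2
	}
	return kmer
}

// newDecoder is a hypothetical stand-in for MustDecoder. The returned
// closure writes into one shared buffer, so a result is only valid until
// the next call. k must not exceed 32.
func newDecoder() func(code uint64, k uint8) []byte {
	buf := make([]byte, 32)
	return func(code uint64, k uint8) []byte {
		kmer := buf[:k]
		for i := int(k) - 1; i >= 0; i-- {
			kmer[i] = bit2base[code&3]
			code >>= 2
		}
		return kmer
	}
}

func main() {
	// Under the assumed encoding, ACGT packs to 0b00011011.
	fmt.Printf("%s\n", decodeKmer(0b00011011, 4))

	decode := newDecoder()
	first := decode(0, 3)
	_ = decode(63, 3)           // overwrites the shared buffer
	fmt.Printf("%s\n", first) // first now shows the second result
}
```

If a decoded k-mer has to outlive the next call of a reusing decoder, copy it first.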

Types

type LexicHash

type LexicHash struct {
	K int // max length of shared substrings, should be in range of [4, 31]

	Seed  int64    // seed for generating masks
	Masks []uint64 // masks/k-mers
	// contains filtered or unexported fields
}

LexicHash is for finding shared substrings between nucleotide sequences.

func New

func New(k int, nMasks int, p int) (*LexicHash, error)

New returns a new LexicHash object. nMasks must be >= 64, preferably >= 1024, and preferably a power of 4 (64, 256, 1024, 4096, ...). p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.
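The constraints on nMasks can be checked before calling a constructor. The following is a minimal sketch, not part of the package: isPowerOf4, checkNMasks and errTooFewMasks are hypothetical names, and the assumption that the constructors reject nMasks < 64 is taken from the ErrInsufficientMasks message only.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooFewMasks is a local stand-in for lexichash.ErrInsufficientMasks.
var errTooFewMasks = errors.New("insufficient masks (should be >=64)")

// isPowerOf4 reports whether n is one of 1, 4, 16, 64, 256, ...
func isPowerOf4(n int) bool {
	if n <= 0 {
		return false
	}
	for n%4 == 0 {
		n /= 4
	}
	return n == 1
}

// checkNMasks is a hypothetical pre-flight check. It returns an error for
// values below the documented minimum of 64, and reports whether the value
// also follows the recommendation (>= 1024 and a power of 4).
func checkNMasks(nMasks int) (recommended bool, err error) {
	if nMasks < 64 {
		return false, errTooFewMasks
	}
	return nMasks >= 1024 && isPowerOf4(nMasks), nil
}

func main() {
	for _, n := range []int{32, 64, 1000, 1024, 4096} {
		ok, err := checkNMasks(n)
		fmt.Printf("nMasks=%d recommended=%v err=%v\n", n, ok, err)
	}
}
```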

func NewFromFile

func NewFromFile(file string) (*LexicHash, error)

NewFromFile creates a LexicHash from a binary file.

func NewFromTextFile added in v0.3.0

func NewFromTextFile(file string) (*LexicHash, error)

NewFromTextFile creates a new LexicHash object with custom k-mers read from a text file.

func NewWithMasks added in v0.3.0

func NewWithMasks(k int, masks []uint64) (*LexicHash, error)

NewWithMasks creates a new LexicHash object with custom k-mers. The number of masks must be >= 64, preferably >= 1024, and preferably a power of 4 (64, 256, 1024, 4096, ...).

func NewWithSeed

func NewWithSeed(k int, nMasks int, randSeed int64, p int) (*LexicHash, error)

NewWithSeed creates a new LexicHash object with the given seed. nMasks must be >= 64, preferably >= 1024, and preferably a power of 4 (64, 256, 1024, 4096, ...). p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.

func Read

func Read(r io.Reader) (*LexicHash, error)

Read reads a LexicHash from an io.Reader.

func (*LexicHash) Mask

func (lh *LexicHash) Mask(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

Mask computes the most similar substrings for each mask in sequence s. It returns

  1. the list of the most similar k-mers for each mask.
  2. the 0-based start positions of all k-mers, with the lowest bit as the strand flag (1 for the negative strand).
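A location value returned in the second list therefore carries two pieces of information. A minimal sketch of unpacking it follows; decodeLoc and encodeLoc are hypothetical helpers, strands is a local copy of the package-level Strands variable, and the layout (position in the upper bits, strand flag in the lowest bit) is an assumption based on the description above.

```go
package main

import "fmt"

// strands mirrors lexichash.Strands so the sketch runs without the package.
var strands = [2]byte{'+', '-'}

// decodeLoc splits a location value into a 0-based start position and a
// strand character. Assumed layout: loc = position<<1 | strandFlag.
func decodeLoc(loc int) (pos int, strand byte) {
	return loc >> 1, strands[loc&1]
}

// encodeLoc is the inverse of decodeLoc under the same assumed layout.
func encodeLoc(pos int, negativeStrand bool) int {
	loc := pos << 1
	if negativeStrand {
		loc |= 1
	}
	return loc
}

func main() {
	for _, loc := range []int{200, 201} {
		pos, strand := decodeLoc(loc)
		fmt.Printf("loc=%d pos=%d strand=%c\n", loc, pos, strand)
	}
}
```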

skipRegions is optional and is used to skip masked regions. For example, in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these N regions need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
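One way to build such a region list while concatenating contigs is sketched below. concatContigs is a hypothetical helper, not part of the package, and it assumes that each region is a [start, end] pair with an inclusive end; this page does not state whether the end coordinate is inclusive, so check the convention before relying on it.

```go
package main

import (
	"bytes"
	"fmt"
)

// concatContigs joins contigs with k-1 N's between them and records each
// run of N's as a [start, end] pair. k is expected to be at least 2.
// Assumption: coordinates are 0-based and the end is inclusive.
func concatContigs(contigs [][]byte, k int) (seq []byte, skipRegions [][2]int) {
	gap := bytes.Repeat([]byte{'N'}, k-1)
	for i, contig := range contigs {
		if i > 0 {
			start := len(seq)
			skipRegions = append(skipRegions, [2]int{start, start + len(gap) - 1})
			seq = append(seq, gap...)
		}
		seq = append(seq, contig...)
	}
	// Regions are appended left to right, so they are already sorted.
	return seq, skipRegions
}

func main() {
	contigs := [][]byte{[]byte("ACGT"), []byte("GG"), []byte("T")}
	seq, regions := concatContigs(contigs, 4)
	fmt.Printf("%s %v\n", seq, regions)
}
```

The resulting sequence and region list have the shapes expected by the s and skipRegions parameters.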

func (*LexicHash) MaskLongSeqs added in v0.2.0

func (lh *LexicHash) MaskLongSeqs(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

MaskLongSeqs is faster than Mask() for longer sequences. It requires nMasks >= 1024.

func (*LexicHash) RecycleMaskResult

func (lh *LexicHash) RecycleMaskResult(kmers *[]uint64, locses *[][]int)

RecycleMaskResult recycles the results of Mask(). Please do not forget to call this method after using the mask results.

func (*LexicHash) Write

func (lh *LexicHash) Write(w io.Writer) (int, error)

Write writes a LexicHash.

Header (32 bytes):

Magic number, 8 bytes, kmermask
Main and minor versions, 2 bytes
K, 1 byte
Blank, 5 bytes
Seed: 8 bytes
Number of masks: 8 bytes

Data: k-mers.

K-mers in uint64, 8 * (the number of masks) bytes
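The 32-byte header layout can be sketched as a pair of pack and parse functions. This is an illustration only, not the package's code: header, packHeader and parseHeader are hypothetical names, the fields are assumed to appear in the order listed above, and the byte order (big-endian here) is an assumption that this page does not confirm.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// magic mirrors lexichash.Magic.
var magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}

// order is the assumed byte order; the documentation does not state it.
var order = binary.BigEndian

// header holds the fields of the 32-byte header.
type header struct {
	Main, Minor, K uint8
	Seed           int64
	NMasks         uint64
}

// packHeader lays the fields out as 8 + 2 + 1 + 5 + 8 + 8 = 32 bytes.
func packHeader(h header) []byte {
	buf := make([]byte, 32)
	copy(buf[0:8], magic[:])
	buf[8], buf[9], buf[10] = h.Main, h.Minor, h.K
	// buf[11:16] stays zero: the 5 blank bytes.
	order.PutUint64(buf[16:24], uint64(h.Seed))
	order.PutUint64(buf[24:32], h.NMasks)
	return buf
}

// parseHeader reverses packHeader and checks the length and magic number.
func parseHeader(buf []byte) (header, error) {
	if len(buf) < 32 {
		return header{}, errors.New("broken file")
	}
	if !bytes.Equal(buf[:8], magic[:]) {
		return header{}, errors.New("invalid binary format")
	}
	return header{
		Main: buf[8], Minor: buf[9], K: buf[10],
		Seed:   int64(order.Uint64(buf[16:24])),
		NMasks: order.Uint64(buf[24:32]),
	}, nil
}

func main() {
	h := header{Main: 0, Minor: 1, K: 31, Seed: 1, NMasks: 1024}
	got, err := parseHeader(packHeader(h))
	fmt.Println(got == h, err)
	// The data section that follows holds 8 bytes per mask.
	fmt.Println("data bytes:", 8*got.NMasks)
}
```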

func (*LexicHash) WriteToFile

func (lh *LexicHash) WriteToFile(file string) (int, error)

WriteToFile writes a LexicHash to a file, optionally with a file extension of .gz, .xz, .zst, or .bz2.
