lexichash

Version: v0.3.0 (latest). Published: Mar 21, 2024. License: MIT. Imports: 13. Imported by: 1.

Repository

github.com/shenwei356/lexichash

README ¶

lexichash

This project implements LexicHash in Golang, with high performance and a low memory footprint.

This package is used in LexicMap.
Bit-packed k-mer operations are provided by kmers.

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Documentation ¶

Index ¶

Variables
func Hash64(key uint64) uint64
func IsLowComplexity(code uint64, k int) bool
func MustDecode(code uint64, k uint8) []byte
func MustDecoder() func(code uint64, k uint8) []byte
type LexicHash

Constants ¶

This section is empty.

Variables ¶

var ErrBrokenFile = errors.New("lexichash: broken file")

ErrBrokenFile means the file is not complete.

var ErrInsufficientMasks = errors.New("lexichash: insufficient masks (should be >=64)")

ErrInsufficientMasks means the number of masks is too small.

var ErrInvalidFileFormat = errors.New("lexichash: invalid binary format")

ErrInvalidFileFormat means invalid file format.

var ErrKOverflow = errors.New("lexichash: k-mer size overflow, valid range is [5-32]")

ErrKOverflow means K is outside the valid range stated in the error message.

var ErrVersionMismatch = errors.New("lexichash: version mismatch")

ErrVersionMismatch means version mismatch between files and program.

var Magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}

var MainVersion uint8 = 0

var MinorVersion uint8 = 1

var Strands = [2]byte{'+', '-'}

Strands can be used to convert a reverse-complement flag into a strand character for output.

Functions ¶

func Hash64 ¶ added in v0.2.0

func Hash64(key uint64) uint64

Hash64 is an integer hash function for uint64 keys. Reference: https://gist.github.com/badboy/6267743. A version with a mask: https://gist.github.com/lh3/974ced188be2f90422cc.

func IsLowComplexity ¶

func IsLowComplexity(code uint64, k int) bool

IsLowComplexity checks if a k-mer is of low-complexity.

func MustDecoder ¶

func MustDecoder() func(code uint64, k uint8) []byte

MustDecoder returns a decode function that reuses its byte slice across calls, so the returned slice is overwritten by the next call.
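To illustrate what decoding a bit-packed k-mer with a reused buffer can look like, here is a minimal, self-contained sketch. It does not use this package: encode and newDecoder are hypothetical helpers, and the 2-bit mapping A=0, C=1, G=2, T=3 is an assumption rather than something this page confirms.

```go
package main

import "fmt"

// bit2base maps a 2-bit code to a base.
// Assumption: A=0, C=1, G=2, T=3 (not confirmed by this documentation).
var bit2base = [4]byte{'A', 'C', 'G', 'T'}

// encode packs up to 32 bases into a uint64, 2 bits per base (hypothetical helper).
func encode(s string) (uint64, error) {
	if len(s) > 32 {
		return 0, fmt.Errorf("k-mer too long: %d", len(s))
	}
	var code uint64
	for i := 0; i < len(s); i++ {
		var v uint64
		switch s[i] {
		case 'A', 'a':
			v = 0
		case 'C', 'c':
			v = 1
		case 'G', 'g':
			v = 2
		case 'T', 't':
			v = 3
		default:
			return 0, fmt.Errorf("invalid base %q", s[i])
		}
		code = code<<2 | v
	}
	return code, nil
}

// newDecoder returns a decode function that reuses one internal buffer.
// The slice it returns is only valid until the next call.
func newDecoder() func(code uint64, k uint8) []byte {
	buf := make([]byte, 32)
	return func(code uint64, k uint8) []byte {
		if k > 32 {
			panic("k out of range")
		}
		out := buf[:k]
		// Fill from the last base backwards, consuming 2 bits at a time.
		for i := int(k) - 1; i >= 0; i-- {
			out[i] = bit2base[code&3]
			code >>= 2
		}
		return out
	}
}

func main() {
	code, err := encode("ACGTA")
	if err != nil {
		panic(err)
	}
	decode := newDecoder()
	fmt.Printf("%d %s\n", code, decode(code, 5)) // 108 ACGTA
}
```

Because the buffer is shared, callers that need to keep a decoded k-mer should copy it before decoding the next one.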

Types ¶

type LexicHash ¶

type LexicHash struct {
	K int // max length of shared substrings, should be in range of [4, 31]

	Seed  int64    // seed for generating masks
	Masks []uint64 // masks/k-mers
	// contains filtered or unexported fields
}

LexicHash is for finding shared substrings between nucleotide sequences.
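As a rough illustration of the idea, the sketch below picks, for a single mask, the k-mer of a sequence that shares the longest prefix with that mask by minimizing kmer XOR mask. It is a simplified, self-contained sketch under stated assumptions (2-bit encoding with A=0, C=1, G=2, T=3; forward strand only; hypothetical helper names) and is not the package's implementation.

```go
package main

import "fmt"

// encodeBase returns the 2-bit code of a base.
// Assumption: A=0, C=1, G=2, T=3.
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// bestKmer scans the forward strand of s and returns the k-mer that
// minimizes (kmer XOR mask), with its 0-based start position.
// A smaller XOR value means more leading bits, and therefore more
// leading bases, are shared with the mask.
func bestKmer(s []byte, k int, mask uint64) (kmer uint64, pos int, ok bool) {
	if k < 1 || k > 32 || len(s) < k {
		return
	}
	kmask := (uint64(1) << uint(2*k)) - 1 // keeps the lowest 2*k bits
	var code, best uint64
	n := 0 // number of consecutive valid bases ending at the current position
	for i, b := range s {
		v, valid := encodeBase(b)
		if !valid { // e.g. N: restart the rolling k-mer
			n, code = 0, 0
			continue
		}
		code = (code<<2 | v) & kmask
		n++
		if n < k {
			continue
		}
		if x := code ^ mask; !ok || x < best {
			best, kmer, pos, ok = x, code, i-k+1, true
		}
	}
	return
}

func main() {
	// 248 is TTGA under the assumed encoding (11 11 10 00).
	kmer, pos, ok := bestKmer([]byte("ACGTTGCA"), 4, 248)
	fmt.Println(kmer, pos, ok) // 249 3 true: TTGC at position 3 shares the prefix TTG
}
```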

func New ¶

func New(k int, nMasks int, p int) (*LexicHash, error)

New returns a new LexicHash object. nMasks should be >= 64, preferably >= 1024, and ideally a power of 4, i.e., 64, 256, 1024, 4096 ... p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.
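A caller that wants to validate nMasks against this guidance before constructing an object could use a check like the following. This is a minimal sketch; isPowerOf4 and checkNMasks are hypothetical helpers and are not part of the package.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOf4 reports whether n equals 4^m for some m >= 0.
func isPowerOf4(n int) bool {
	if n <= 0 {
		return false
	}
	u := uint(n)
	// A power of two has exactly one bit set; for a power of four
	// that bit sits at an even position.
	return u&(u-1) == 0 && bits.TrailingZeros(u)%2 == 0
}

// checkNMasks applies the documented guidance: at least 64 masks is
// required, and a power of 4 is recommended (hypothetical helper).
func checkNMasks(n int) (valid bool, recommended bool) {
	valid = n >= 64
	recommended = valid && n >= 1024 && isPowerOf4(n)
	return
}

func main() {
	for _, n := range []int{32, 64, 1000, 1024, 2048, 4096} {
		valid, recommended := checkNMasks(n)
		fmt.Printf("nMasks=%d valid=%v recommended=%v\n", n, valid, recommended)
	}
}
```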

func NewFromFile ¶

func NewFromFile(file string) (*LexicHash, error)

NewFromFile creates a LexicHash from a binary file.

func NewFromTextFile ¶ added in v0.3.0

func NewFromTextFile(file string) (*LexicHash, error)

NewFromTextFile creates a new LexicHash object with custom kmers in a txt file.

func NewWithMasks ¶ added in v0.3.0

func NewWithMasks(k int, masks []uint64) (*LexicHash, error)

NewWithMasks creates a new LexicHash object with custom k-mers. The number of masks should be >= 64, preferably >= 1024, and ideally a power of 4, i.e., 64, 256, 1024, 4096 ...

func NewWithSeed ¶

func NewWithSeed(k int, nMasks int, randSeed int64, p int) (*LexicHash, error)

NewWithSeed creates a new LexicHash object with the given seed. nMasks should be >= 64, preferably >= 1024, and ideally a power of 4, i.e., 64, 256, 1024, 4096 ... p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.

func Read ¶

func Read(r io.Reader) (*LexicHash, error)

Read reads a LexicHash from an io.Reader.

func (*LexicHash) Mask ¶

func (lh *LexicHash) Mask(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

Mask computes the most similar substrings for each mask in sequence s. It returns:

the list of the most similar k-mers, one for each mask.
the 0-based start positions of all k-mers, with the lowest bit as the strand flag (1 for the negative strand).

skipRegions is optional and is used to skip masked regions. For example, in the reference indexing step, contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
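The returned locations pack a position and a strand flag into one integer. A minimal sketch of unpacking them follows, assuming the position occupies the bits above the flag (that is, loc = pos<<1 | flag); encodeLoc and decodeLoc are hypothetical helpers, and strands mirrors the package-level Strands variable.

```go
package main

import "fmt"

// strands mirrors the package-level Strands variable: index 0 is '+', 1 is '-'.
var strands = [2]byte{'+', '-'}

// encodeLoc packs a 0-based position and a strand flag into one int.
// Assumption: the lowest bit is the flag and the position is shifted left by one.
func encodeLoc(pos int, negative bool) int {
	loc := pos << 1
	if negative {
		loc |= 1
	}
	return loc
}

// decodeLoc splits a packed location into its position and strand character.
func decodeLoc(loc int) (pos int, strand byte) {
	return loc >> 1, strands[loc&1]
}

func main() {
	for _, loc := range []int{200, 201, 7} {
		pos, strand := decodeLoc(loc)
		fmt.Printf("loc=%d pos=%d strand=%c\n", loc, pos, strand)
	}
}
```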

func (*LexicHash) MaskLongSeqs ¶ added in v0.2.0

func (lh *LexicHash) MaskLongSeqs(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

MaskLongSeqs is faster than Mask() for longer sequences; it requires nMasks >= 1024.

func (*LexicHash) RecycleMaskResult ¶

func (lh *LexicHash) RecycleMaskResult(kmers *[]uint64, locses *[][]int)

RecycleMaskResult recycles the results of Mask(). Please do not forget to call this method after using the mask results.

func (*LexicHash) Write ¶

func (lh *LexicHash) Write(w io.Writer) (int, error)

Write writes a LexicHash.

Header (32 bytes):

Magic number, 8 bytes, kmermask
Main and minor versions, 2 bytes
K, 1 byte
Blank, 5 bytes
Seed: 8 bytes
Number of masks: 8 bytes

Data: k-mers.

K-mers as uint64 values, 8 * (the number of masks) bytes.

func (*LexicHash) WriteToFile ¶

func (lh *LexicHash) WriteToFile(file string) (int, error)

WriteToFile writes a LexicHash to a file, optionally with a file extension of .gz, .xz, .zst, or .bz2.

Directories ¶

Path	Synopsis
iterator