unikmer

package module
Version: v0.18.8 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 17, 2021 License: MIT Imports: 15 Imported by: 3

README

unikmer

unikmer is a golang package and a toolkit for nucleic acid k-mer analysis, providing functions including set operation k-mers (sketch) optional with TaxIds but without count information.

K-mers are either encoded (k<=32) or hashed (arbitrary k) into uint64, and serialized in binary file with extension .unik.

TaxIds can be assigned when counting k-mers from genome sequences, and LCA (Lowest Common Ancestor) is computed during set opertions including computing union, intersecton, set difference, unique and repeated k-mers.

Table of Contents

The package

GoDoc Go Report Card

The unikmer package provides basic manipulations of K-mers (sketch) optional with TaxIds but without frequency information, and also provides serialization methods.

Installation
go get -u github.com/shenwei356/unikmer
Benchmark

CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7 GHz

$ go test . -bench=Bench* -benchmem \
    | grep Bench \
    | perl -pe 's/\s\s+/\t/g' \
    | csvtk cut -Ht -f 1,3-5 \
    | csvtk add-header -t -n test,time,memory,allocs \
    | csvtk pretty -t -r

                                      test           time      memory        allocs
------------------------------------------   ------------   ---------   -----------
                     BenchmarkEncodeK32-16    18.66 ns/op      0 B/op   0 allocs/op
       BenchmarkEncodeFromFormerKmerK32-16    8.030 ns/op      0 B/op   0 allocs/op
   BenchmarkMustEncodeFromFormerKmerK32-16    1.702 ns/op      0 B/op   0 allocs/op
                     BenchmarkDecodeK32-16    78.95 ns/op     32 B/op   1 allocs/op
                 BenchmarkMustDecodeK32-16    76.86 ns/op     32 B/op   1 allocs/op
                        BenchmarkRevK32-16    3.639 ns/op      0 B/op   0 allocs/op
                       BenchmarkCompK32-16   0.7971 ns/op      0 B/op   0 allocs/op
                    BenchmarkRevCompK32-16    3.831 ns/op      0 B/op   0 allocs/op
                   BenchmarkCannonalK32-16    4.210 ns/op      0 B/op   0 allocs/op

          BenchmarkKmerIterator/1.00_KB-16    12625 ns/op    160 B/op   1 allocs/op
          BenchmarkHashIterator/1.00_KB-16     8118 ns/op    232 B/op   3 allocs/op
       BenchmarkProteinIterator/1.00_KB-16    14324 ns/op    480 B/op   3 allocs/op

       BenchmarkMinimizerSketch/1.00_KB-16    62497 ns/op    688 B/op   6 allocs/op
         BenchmarkSyncmerSketch/1.00_KB-16    99390 ns/op   1456 B/op   8 allocs/op
BenchmarkProteinMinimizerSketch/1.00_KB-16    24888 ns/op    728 B/op   5 allocs/op

The toolkit

Installation
  1. Downloading executable binary files (Latest version).

  2. Via Bioconda (not available now)

     conda install unikmer
    
  3. Via Homebrew (not lastest version)

     brew install brewsci/bio/unikmer
    
Commands
  1. Counting

     count           Generate k-mers (sketch) from FASTA/Q sequences
    
  2. Information

     stats           Statistics of binary files
     num             Quickly inspect number of k-mers in binary files
    
  3. Format conversion

     encode          Encode plain k-mer text to integer
     decode          Decode encoded integer to k-mer text
    
     view            Read and output binary format to plain text
     dump            Convert plain k-mer text to binary format
    
  4. Set operations

     head            Extract the first N k-mers
     concat          Concatenate multiple binary files without removing duplicates
     inter           Intersection of multiple binary files
     common          Find k-mers shared by most of multiple binary files
     union           Union of multiple binary files
     diff            Set difference of multiple binary files
     grep            Search k-mers from binary files
    
     sort            Sort k-mers in binary files to reduce file size
     split           Split k-mers into sorted chunk files
     tsplit          Split k-mers according to TaxId
     merge           Merge k-mers from sorted chunk files
    
     sample          Sample k-mers from binary files
     filter          Filter low-complexity k-mers
     rfilter         Filter k-mers by taxonomic rank
    
  5. Searching on genomes

     locate          Locate k-mers in genome
     uniqs           Mapping k-mers back to genome and find unique subsequences
    
  6. Misc

     genautocomplete Generate shell autocompletion script
     help            Help about any command
     version         Print version information and check for update
    
Binary file (.unik)

K-mers (represented in uint64 in RAM ) are serialized in 8-Byte (or less Bytes for shorter k-mers in compact format, or much less Bytes for sorted k-mers) arrays and optionally compressed in gzip format with extension of .unik. TaxIds are optionally stored next to k-mers with 4 or less bytes.

Compression rate comparison

No TaxIds stored in this test.

cr.jpg

label encoded-kmera gzip-compressedb compact-formatc sortedd comment
plain plain text
gzip gzipped plain text
unik.default gzipped encoded k-mers in fixed-length byte array
unik.compat gzipped encoded k-mers in shorter fixed-length byte array
unik.sorted gzipped sorted encoded k-mers
  • a One k-mer is encoded as uint64 and serialized in 8 Bytes.
  • b K-mers file is compressed in gzip format by default, users can switch on global option -C/--no-compress to output non-compressed file.
  • c One k-mer is encoded as uint64 and serialized in 8 Bytes by default. However few Bytes are needed for short k-mers, e.g., 4 Bytes are enough for 15-mers (30 bits). This makes the file more compact with smaller file size, controled by global option -c/--compact .
  • d One k-mer is encoded as uint64, all k-mers are sorted and compressed using varint-GB algorithm.
  • In all test, flag --canonical is ON when running unikmer count.
Quick Start
# memusg is for compute time and RAM usage: https://github.com/shenwei356/memusg


# counting (only keep the canonical k-mers and compact output)
# memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23 --canonical --compact
$ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23 --canonical --compact
elapsed time: 0.897s
peak rss: 192.41 MB


# counting (only keep the canonical k-mers and sort k-mers)
# memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted --canonical --sort
$ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted --canonical --sort
elapsed time: 1.136s
peak rss: 227.28 MB


# counting and assigning global TaxIds
$ unikmer count -k 23 -K -s Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted   -t 585057
$ unikmer count -k 23 -K -s Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted -t 511145
$ unikmer count -k 23 -K -s A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.sorted -t 349741

# counting minimizer and ouputting in linear order
$ unikmer count -k 23 -W 5 -H -K -l A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.m

# view
$ unikmer view Ecoli-MG1655.fasta.gz.k23.sorted.unik --show-taxid | head -n 3
AAAAAAAAACCATCCAAATCTGG 511145
AAAAAAAAACCGCTAGTATATTC 511145
AAAAAAAAACCTGAAAAAAACGG 511145

# view (hashed k-mers needs original FASTA/Q file)
$ unikmer view --show-code --genome A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 3
CATCCGCCATCTTTGGGGTGTCG 1210726578792
AGCGCAAAATCCCCAAACATGTA 2286899379883
AACTGATTTTTGATGATGACTCC 3542156397282

# find the positions of k-mers
$ seqkit locate -M A.muciniphila-ATCC_BAA-835.fasta.gz \
    -f <(unikmer view -a -g A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | seqkit head -n 5 ) \
    | csvtk sort -t -k start:n | head -n 6 | csvtk pretty -t
seqID         patternName           pattern                   strand   start   end
-----------   -------------------   -----------------------   ------   -----   ---
NC_010655.1   2090893901864583115   ATCTTATAAAATAACCACATAAC   +        3       25
NC_010655.1   696051979077366638    TTATAAAATAACCACATAACTTA   +        6       28
NC_010655.1   390297872016815006    TATAAAATAACCACATAACTTAA   +        7       29
NC_010655.1   2582400417208090837   AAAATAACCACATAACTTAAAAA   +        10      32
NC_010655.1   3048591415312050785   TAACCACATAACTTAAAAAGAAT   +        14      36

# stats
$ unikmer stats *.unik -a -j 10
file                                              k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description
A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23          ✓       ✓       ✕              ✕                     ✕        ✕        ✓     v5.0    860,900  
A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23          ✓       ✕       ✕              ✕        349741       ✓        ✕        ✓     v5.0  2,630,905  
Ecoli-IAI39.fasta.gz.k23.sorted.unik             23          ✓       ✕       ✕              ✕        585057       ✓        ✕        ✓     v5.0  4,902,266  
Ecoli-IAI39.fasta.gz.k23.unik                    23          ✓       ✕       ✕              ✕                     ✕        ✓        ✓     v5.0  4,902,266  
Ecoli-MG1655.fasta.gz.k23.sorted.unik            23          ✓       ✕       ✕              ✕        511145       ✓        ✕        ✓     v5.0  4,546,632  
Ecoli-MG1655.fasta.gz.k23.unik                   23          ✓       ✕       ✕              ✕                     ✕        ✓        ✓     v5.0  4,546,632 


# concat
$ memusg -t unikmer concat *.k23.sorted.unik -o concat.k23 -c
elapsed time: 1.020s
peak rss: 25.86 MB



# union
$ memusg -t unikmer union *.k23.sorted.unik -o union.k23 -s
elapsed time: 3.991s
peak rss: 590.92 MB


# or sorting with limited memory.
# note that taxonomy database need some memory.
$ memusg -t unikmer sort *.k23.sorted.unik -o union2.k23 -u -m 1M
elapsed time: 3.538s
peak rss: 324.2 MB

$ unikmer view -t union.k23.unik | md5sum 
4c038832209278840d4d75944b29219c  -
$ unikmer view -t union2.k23.unik | md5sum 
4c038832209278840d4d75944b29219c  -


# duplicated k-mers
$ memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d -m 1M
elapsed time: 1.143s
peak rss: 240.18 MB


# intersection
$ memusg -t unikmer inter *.k23.sorted.unik -o inter.k23
elapsed time: 1.481s
peak rss: 399.94 MB


# difference
$ memusg -t unikmer diff -j 10 *.k23.sorted.unik -o diff.k23 -s
elapsed time: 0.793s
peak rss: 338.06 MB


$ ls -lh *.unik
-rw-r--r-- 1 shenwei shenwei 9.5M  2月 13 00:55 A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik
-rw-r--r-- 1 shenwei shenwei  46M  2月 13 00:59 concat.k23.unik
-rw-r--r-- 1 shenwei shenwei 8.7M  2月 13 01:00 diff.k23.unik
-rw-r--r-- 1 shenwei shenwei  11M  2月 13 01:04 dup.k23.unik
-rw-r--r-- 1 shenwei shenwei  18M  2月 13 00:55 Ecoli-IAI39.fasta.gz.k23.sorted.unik
-rw-r--r-- 1 shenwei shenwei  21M  2月 13 00:48 Ecoli-IAI39.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei  17M  2月 13 00:55 Ecoli-MG1655.fasta.gz.k23.sorted.unik
-rw-r--r-- 1 shenwei shenwei  19M  2月 13 00:48 Ecoli-MG1655.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei 9.5M  2月 13 00:59 inter.k23.unik
-rw-r--r-- 1 shenwei shenwei  27M  2月 13 01:04 union2.k23.unik
-rw-r--r-- 1 shenwei shenwei  27M  2月 13 00:58 union.k23.unik


$ unikmer stats *.unik -a -j 10
file                                              k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description
A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23          ✓       ✓       ✕              ✕                     ✕        ✕        ✓     v5.0    860,900  
A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23          ✓       ✕       ✕              ✕        349741       ✓        ✕        ✓     v5.0  2,630,905  
concat.k23.unik                                  23          ✓       ✕       ✕              ✓                     ✕        ✓        ✓     v5.0         -1  
diff.k23.unik                                    23          ✓       ✕       ✕              ✓                     ✕        ✕        ✓     v5.0  2,326,096  
dup.k23.unik                                     23          ✓       ✕       ✕              ✓                     ✓        ✕        ✓     v5.0          0  
Ecoli-IAI39.fasta.gz.k23.sorted.unik             23          ✓       ✕       ✕              ✕        585057       ✓        ✕        ✓     v5.0  4,902,266  
Ecoli-IAI39.fasta.gz.k23.unik                    23          ✓       ✕       ✕              ✕                     ✕        ✓        ✓     v5.0  4,902,266  
Ecoli-MG1655.fasta.gz.k23.sorted.unik            23          ✓       ✕       ✕              ✕        511145       ✓        ✕        ✓     v5.0  4,546,632  
Ecoli-MG1655.fasta.gz.k23.unik                   23          ✓       ✕       ✕              ✕                     ✕        ✓        ✓     v5.0  4,546,632  
inter.k23.unik                                   23          ✓       ✕       ✕              ✓                     ✓        ✕        ✓     v5.0  2,576,170  
union2.k23.unik                                  23          ✓       ✕       ✕              ✓                     ✓        ✕        ✓     v5.0  6,872,728  
union.k23.unik                                   23          ✓       ✕       ✕              ✓                     ✓        ✕        ✓     v5.0  6,872,728


# -----------------------------------------------------------------------------------------

# mapping k-mers to genome
g=Ecoli-IAI39.fasta
f=inter.k23.unik

# to fasta
unikmer view $f -a -o $f.fa.gz

# make index
bwa index $g; samtools faidx $g

ncpu=12
ls $f.fa.gz | rush -j 1 -v ref=$g -v j=$ncpu \
' bwa aln -o 0 -l 17 -k 0 -t {j} {ref} {} \
    | bwa samse {ref} - {} \
    | samtools view -bS > {}.bam; \
    samtools sort -T {}.tmp -@ {j} {}.bam -o {}.sorted.bam; \
    samtools index {}.sorted.bam; \
    samtools flagstat {}.sorted.bam > {}.sorted.bam.flagstat; \
    /bin/rm {}.bam '  

Contributing

We welcome pull requests, bug fixes and issue reports.

License

MIT License

Documentation

Index

Constants

View Source
const (
	// UnikCompact means k-mers are serialized in fix-length (n = int((K + 3) / 4) ) of byte array.
	UnikCompact = 1 << iota
	// UnikCanonical means only canonical k-mers kept.
	UnikCanonical
	// UnikSorted means k-mers are sorted
	UnikSorted // when sorted, the serialization structure is very different
	// UnikIncludeTaxID means a k-mer is followed its LCA taxid
	UnikIncludeTaxID

	// UnikHashed means ntHash value are saved as code.
	UnikHashed
	// UnikScaled means only hashes smaller than or equal to max_hash are saved.
	UnikScaled
)
View Source
const MainVersion uint8 = 5

MainVersion is the main version number.

View Source
const MinorVersion uint8 = 0

MinorVersion is the minor version number.

Variables

View Source
var ErrBrokenFile = errors.New("unikmer: broken file")

ErrBrokenFile means the file is not complete.

View Source
var ErrBufNil = fmt.Errorf("unikmer: buffer slice is nil")

ErrBufNil means the buffer is nil

View Source
var ErrBufNotEmpty = fmt.Errorf("unikmer: buffer has elements")

ErrBufNotEmpty means the buffer has some elements

View Source
var ErrCallLate = errors.New("unikmer: SetMaxTaxid/SetGlobalTaxid should be called before writing KmerCode/code/taxid")

ErrCallLate means SetMaxTaxid/SetGlobalTaxid should be called before writing KmerCode/code/taxid

View Source
var ErrCallOrder = errors.New("unikmer: WriteTaxid/ReadTaxid should be called after WriteCode/ReadCode")

ErrCallOrder means WriteTaxid/ReadTaxid should be called after WriteCode/ReadCode

View Source
var ErrCallReadWriteTaxid = errors.New("unikmer: can not call ReadTaxid/WriteTaxid when flag UnikIncludeTaxID is off")

ErrCallReadWriteTaxid means flag UnikIncludeTaxID is off, but you call ReadTaxid/WriteTaxid

View Source
var ErrCodeOverflow = errors.New("unikmer: code value overflow")

ErrCodeOverflow means the encode interger is bigger than 4^k.

View Source
var ErrDescTooLong = errors.New("unikmer: description too long, 128 bytes at most")

ErrDescTooLong means length of description two long

View Source
var ErrEmptySeq = fmt.Errorf("unikmer: empty sequence")

ErrEmptySeq sequence is empty.

View Source
var ErrIllegalBase = errors.New("unikmer: illegal base")

ErrIllegalBase means that base beyond IUPAC symbols are detected.

View Source
var ErrIllegalColumnIndex = errors.New("unikmer: illegal column index, positive integer needed")

ErrIllegalColumnIndex means column index is 0 or negative.

View Source
var ErrInvalidFileFormat = errors.New("unikmer: invalid binary format")

ErrInvalidFileFormat means invalid file format.

View Source
var ErrInvalidK = fmt.Errorf("unikmer: invalid k-mer size")

ErrInvalidK means k < 1.

View Source
var ErrInvalidS = fmt.Errorf("unikmer: invalid s-mer size")

ErrInvalidS means s >= k.

View Source
var ErrInvalidTaxid = errors.New("unikmer: invalid taxid, 0 not allowed")

ErrInvalidTaxid means zero given for a taxid.

View Source
var ErrInvalidW = fmt.Errorf("unikmer: invalid minimimzer window")

ErrInvalidW means w < 2 or w > (1<<32)-1

View Source
var ErrKMismatch = errors.New("unikmer: K mismatch")

ErrKMismatch means K size mismatch.

View Source
var ErrKOverflow = errors.New("unikmer: k-mer size (1-32) overflow")

ErrKOverflow means K > 32.

View Source
var ErrNamesNotLoaded = errors.New("unikmer: taxonomy names not loaded, please call: LoadNames")

ErrNamesNotLoaded means you should call LoadNames before using taxonomy names.

View Source
var ErrNotConsecutiveKmers = errors.New("unikmer: not consecutive k-mers")

ErrNotConsecutiveKmers means the two k-mers are not consecutive.

View Source
var ErrRankNotLoaded = errors.New("unikmer: taxonomic ranks not loaded, please call: NewTaxonomyWithRank")

ErrRankNotLoaded means you should reate load Taxonomy with NewTaxonomyWithRank before calling some methods.

View Source
var ErrShortSeq = fmt.Errorf("unikmer: sequence too short")

ErrShortSeq means the sequence is shorter than k

View Source
var ErrTooManyRanks = errors.New("unikmer: number of ranks exceed limit of 255")

ErrTooManyRanks means number of ranks exceed limit of 255

View Source
var ErrUnkownRank = errors.New("unikmer: unknown rank")

ErrUnkownRank indicate an unknown rank

View Source
var ErrVersionMismatch = errors.New("unikmer: version mismatch")

ErrVersionMismatch means version mismatch between files and program

View Source
var Magic = [8]byte{'.', 'u', 'n', 'i', 'k', 'm', 'e', 'r'}

Magic number of binary file.

View Source
var MaxCode []uint64

MaxCode is the maxinum interger for all Ks.

Functions

func Canonical added in v0.12.0

func Canonical(code uint64, k int) uint64

Canonical returns code of its canonical kmer.

func Complement

func Complement(code uint64, k int) uint64

Complement returns code of complement sequence.

func Decode

func Decode(code uint64, k int) []byte

Decode converts the code to original seq

func Encode

func Encode(kmer []byte) (code uint64, err error)

Encode converts byte slice to bits.

Codes:

A    0b00
C    0b01
G    0b10
T    0b11

For degenerate bases, only the first base is kept.

M       AC     A
V       ACG    A
H       ACT    A
R       AG     A
D       AGT    A
W       AT     A
S       CG     C
B       CGT    C
Y       CT     C
K       GT     G
N       ACGT   A

func EncodeFromFormerKmer added in v0.2.1

func EncodeFromFormerKmer(kmer []byte, leftKmer []byte, leftCode uint64) (uint64, error)

EncodeFromFormerKmer encodes from the former k-mer, inspired by ntHash

func EncodeFromLatterKmer added in v0.2.1

func EncodeFromLatterKmer(kmer []byte, rightKmer []byte, rightCode uint64) (uint64, error)

EncodeFromLatterKmer encodes from the former k-mer.

func MustCanonical added in v0.14.0

func MustCanonical(code uint64, k int) uint64

MustCanonical is similar to Canonical, but does not check k.

func MustComplement added in v0.14.0

func MustComplement(code uint64, k int) uint64

MustComplement is similar to Complement, but does not check k.

func MustDecode added in v0.14.0

func MustDecode(code uint64, k int) []byte

MustDecode is similar to Decode, but does not check k and code.

func MustEncodeFromFormerKmer added in v0.2.1

func MustEncodeFromFormerKmer(kmer []byte, leftKmer []byte, leftCode uint64) (uint64, error)

MustEncodeFromFormerKmer encodes from former the k-mer, assuming the k-mer and leftKmer are both OK.

func MustEncodeFromLatterKmer added in v0.2.1

func MustEncodeFromLatterKmer(kmer []byte, rightKmer []byte, rightCode uint64) (uint64, error)

MustEncodeFromLatterKmer encodes from the latter k-mer, assuming the k-mer and rightKmer are both OK.

func MustRevComp added in v0.14.0

func MustRevComp(code uint64, k int) (c uint64)

MustRevComp is similar to RevComp, but does not check k.

func MustReverse added in v0.14.0

func MustReverse(code uint64, k int) (c uint64)

MustReverse is similar to Reverse, but does not check k.

func PutUint64s added in v0.4.0

func PutUint64s(buf []byte, v1, v2 uint64) (ctrl byte, n int)

PutUint64s endcodes two uint64s into 2-16 bytes, and returns control byte and encoded byte length.

func RevComp added in v0.2.1

func RevComp(code uint64, k int) (c uint64)

RevComp returns code of reverse complement sequence.

func Reverse

func Reverse(code uint64, k int) (c uint64)

Reverse returns code of the reversed sequence.

func Uint64s added in v0.4.0

func Uint64s(ctrl byte, buf []byte) (values [2]uint64, n int)

Uint64s decode from encoded bytes

Types

type CodeSlice added in v0.4.0

type CodeSlice []uint64

CodeSlice is a slice of Kmer code (uint64), for sorting

func (CodeSlice) Len added in v0.4.0

func (codes CodeSlice) Len() int

Len return length of the slice

func (CodeSlice) Less added in v0.4.0

func (codes CodeSlice) Less(i, j int) bool

Less simply compare two KmerCode

func (CodeSlice) Swap added in v0.4.0

func (codes CodeSlice) Swap(i, j int)

Swap swaps two elements

type CodeTaxid added in v0.9.0

type CodeTaxid struct {
	Code uint64
	// _     uint32 // needed? to test
	Taxid uint32
}

CodeTaxid is the code-taxid pair

type CodeTaxidSlice added in v0.9.0

type CodeTaxidSlice []CodeTaxid

CodeTaxidSlice is a list of CodeTaxid, just for sorting

func (CodeTaxidSlice) Len added in v0.9.0

func (pairs CodeTaxidSlice) Len() int

Len return length of the slice

func (CodeTaxidSlice) Less added in v0.9.0

func (pairs CodeTaxidSlice) Less(i, j int) bool

Less simply compare two KmerCode

func (CodeTaxidSlice) Swap added in v0.9.0

func (pairs CodeTaxidSlice) Swap(i, j int)

Swap swaps two elements

type Header struct {
	MainVersion  uint8
	MinorVersion uint8
	K            int
	Flag         uint32
	Number       uint64 // Number of Kmers, may not be accurate

	Description []byte // let's limit it to 128 Bytes
	Scale       uint32 // scale of down-sampling
	MaxHash     uint64 // max hash for scaling/down-sampling
	// contains filtered or unexported fields
}

Header contains metadata

func (Header) String

func (h Header) String() string

type IdxValue added in v0.15.0

type IdxValue struct {
	Idx int    // index
	Val uint64 // hash
}

IdxValue is for storing k-mer hash and it's location when computing k-mer sketches.

type Iterator added in v0.12.0

type Iterator struct {
	// contains filtered or unexported fields
}

Iterator is a kmer code (k<=32) or hash iterator.

func NewHashIterator added in v0.12.0

func NewHashIterator(s *seq.Seq, k int, canonical bool, circular bool) (*Iterator, error)

NewHashIterator returns ntHash Iterator.

func NewKmerIterator added in v0.12.0

func NewKmerIterator(s *seq.Seq, k int, canonical bool, circular bool) (*Iterator, error)

NewKmerIterator returns k-mer code iterator.

func (*Iterator) Index added in v0.14.0

func (iter *Iterator) Index() int

Index returns current 0-baesd index.

func (*Iterator) Next added in v0.12.0

func (iter *Iterator) Next() (code uint64, ok bool, err error)

Next is a wrapter for NextHash and NextKmer.

func (*Iterator) NextHash added in v0.12.0

func (iter *Iterator) NextHash() (code uint64, ok bool)

NextHash returns next ntHash.

func (*Iterator) NextKmer added in v0.12.0

func (iter *Iterator) NextKmer() (code uint64, ok bool, err error)

NextKmer returns next k-mer code.

type KmerCode

type KmerCode struct {
	Code uint64
	K    int
}

KmerCode is a struct representing a k-mer in 64-bits.

func NewKmerCode

func NewKmerCode(kmer []byte) (KmerCode, error)

NewKmerCode returns a new KmerCode struct from byte slice.

func NewKmerCodeFromFormerOne added in v0.2.1

func NewKmerCodeFromFormerOne(kmer []byte, leftKmer []byte, preKcode KmerCode) (KmerCode, error)

NewKmerCodeFromFormerOne computes KmerCode from the Former consecutive k-mer.

func NewKmerCodeMustFromFormerOne added in v0.2.1

func NewKmerCodeMustFromFormerOne(kmer []byte, leftKmer []byte, preKcode KmerCode) (KmerCode, error)

NewKmerCodeMustFromFormerOne computes KmerCode from the Former consecutive k-mer, assuming the k-mer and leftKmer are both OK.

func (KmerCode) BitsString added in v0.6.2

func (kcode KmerCode) BitsString() string

BitsString returns code to string

func (KmerCode) Bytes

func (kcode KmerCode) Bytes() []byte

Bytes returns k-mer in []byte.

func (KmerCode) Canonical added in v0.2.1

func (kcode KmerCode) Canonical() KmerCode

Canonical returns its canonical kmer

func (KmerCode) Comp

func (kcode KmerCode) Comp() KmerCode

Comp returns KmerCode of the complement sequence.

func (KmerCode) Equal

func (kcode KmerCode) Equal(kcode2 KmerCode) bool

Equal checks wether two KmerCodes are the same.

func (KmerCode) Rev

func (kcode KmerCode) Rev() KmerCode

Rev returns KmerCode of the reverse sequence.

func (KmerCode) RevComp

func (kcode KmerCode) RevComp() KmerCode

RevComp returns KmerCode of the reverse complement sequence.

func (KmerCode) String

func (kcode KmerCode) String() string

String returns k-mer in string

type KmerCodeSlice added in v0.4.0

type KmerCodeSlice []KmerCode

KmerCodeSlice is a slice of KmerCode, for sorting

func (KmerCodeSlice) Len added in v0.4.0

func (codes KmerCodeSlice) Len() int

Len return length of the slice

func (KmerCodeSlice) Less added in v0.4.0

func (codes KmerCodeSlice) Less(i, j int) bool

Less simply compare two KmerCode

func (KmerCodeSlice) Swap added in v0.4.0

func (codes KmerCodeSlice) Swap(i, j int)

Swap swaps two elements

type ProteinIterator added in v0.18.0

type ProteinIterator struct {
	// contains filtered or unexported fields
}

ProteinIterator is a iterator for protein sequence.

func NewProteinIterator added in v0.18.0

func NewProteinIterator(s *seq.Seq, k int, codonTable int, frame int) (*ProteinIterator, error)

NewProteinIterator returns an iterator for hash of amino acids

func (*ProteinIterator) Index added in v0.18.0

func (iter *ProteinIterator) Index() int

Index returns current 0-baesd index.

func (*ProteinIterator) Next added in v0.18.0

func (iter *ProteinIterator) Next() (code uint64, ok bool)

Next return's a hash

type ProteinMinimizerSketch added in v0.18.0

type ProteinMinimizerSketch struct {
	// contains filtered or unexported fields
}

ProteinMinimizerSketch is a protein k-mer minimizer iterator

func NewProteinMinimizerSketch added in v0.18.0

func NewProteinMinimizerSketch(S *seq.Seq, k int, codonTable int, frame int, w int) (*ProteinMinimizerSketch, error)

NewProteinMinimizerSketch returns a ProteinMinimizerSketch

func (*ProteinMinimizerSketch) Index added in v0.18.0

func (s *ProteinMinimizerSketch) Index() int

Index returns current 0-baesd index.

func (*ProteinMinimizerSketch) Next added in v0.18.0

func (s *ProteinMinimizerSketch) Next() (code uint64, ok bool)

Next returns next hash value

type Reader

type Reader struct {
	Header
	// contains filtered or unexported fields
}

Reader is for reading KmerCode.

func NewReader

func NewReader(r io.Reader) (reader *Reader, err error)

NewReader returns a Reader.

func (*Reader) GetGlobalTaxid added in v0.9.0

func (reader *Reader) GetGlobalTaxid() uint32

GetGlobalTaxid returns the global taxid

func (*Reader) GetMaxHash added in v0.14.0

func (reader *Reader) GetMaxHash() uint64

GetMaxHash returns the max hash for scaling.

func (*Reader) GetScale added in v0.14.0

func (reader *Reader) GetScale() uint32

GetScale returns the scale of down-sampling

func (*Reader) GetTaxidBytesLength added in v0.9.0

func (reader *Reader) GetTaxidBytesLength() int

GetTaxidBytesLength returns number of byte to store a taxid

func (*Reader) HasGlobalTaxid added in v0.9.0

func (reader *Reader) HasGlobalTaxid() bool

HasGlobalTaxid means the file has a global taxid

func (*Reader) HasTaxidInfo added in v0.9.0

func (reader *Reader) HasTaxidInfo() bool

HasTaxidInfo means the binary file contains global taxid or taxids for all k-mers

func (*Reader) IsCanonical added in v0.9.0

func (reader *Reader) IsCanonical() bool

IsCanonical tells if the only canonical k-mers stored

func (*Reader) IsCompact added in v0.9.0

func (reader *Reader) IsCompact() bool

IsCompact tells if the k-mers are stored in a compact format

func (*Reader) IsHashed added in v0.12.0

func (reader *Reader) IsHashed() bool

IsHashed tells if ntHash values are saved.

func (*Reader) IsIncludeTaxid added in v0.9.0

func (reader *Reader) IsIncludeTaxid() bool

IsIncludeTaxid tells if every k-mer is followed by its taxid

func (*Reader) IsScaled added in v0.14.0

func (reader *Reader) IsScaled() bool

IsScaled tells if hashes is scaled

func (*Reader) IsSorted added in v0.9.0

func (reader *Reader) IsSorted() bool

IsSorted tells if the k-mers in file sorted

func (*Reader) Read

func (reader *Reader) Read() (KmerCode, error)

Read reads one KmerCode.

func (*Reader) ReadCode added in v0.8.0

func (reader *Reader) ReadCode() (uint64, error)

ReadCode reads one code.

func (*Reader) ReadCodeWithTaxid added in v0.9.0

func (reader *Reader) ReadCodeWithTaxid() (code uint64, taxid uint32, err error)

ReadCodeWithTaxid reads a code, also return taxid if having.

func (*Reader) ReadTaxid added in v0.9.0

func (reader *Reader) ReadTaxid() (taxid uint32, err error)

ReadTaxid reads on taxid

func (*Reader) ReadWithTaxid added in v0.9.0

func (reader *Reader) ReadWithTaxid() (KmerCode, uint32, error)

ReadWithTaxid reads a KmerCode, also return taxid if having.

type Sketch added in v0.14.0

type Sketch struct {
	S []byte
	// contains filtered or unexported fields
}

Sketch is a k-mer sketch iterator

func NewMinimizerSketch added in v0.14.0

func NewMinimizerSketch(S *seq.Seq, k int, w int, circular bool) (*Sketch, error)

NewMinimizerSketch returns a SyncmerSketch Iterator. It returns the minHashes in all windows of w (w>=1) bp.

func NewSyncmerSketch added in v0.14.0

func NewSyncmerSketch(S *seq.Seq, k int, s int, circular bool) (*Sketch, error)

NewSyncmerSketch returns a SyncmerSketch Iterator. 1<=s<=k.

func (*Sketch) Index added in v0.14.0

func (s *Sketch) Index() int

Index returns current 0-baesd index

func (*Sketch) Next added in v0.14.0

func (s *Sketch) Next() (uint64, bool)

Next returns next sketch

func (*Sketch) NextMinimizer added in v0.14.0

func (s *Sketch) NextMinimizer() (code uint64, ok bool)

NextMinimizer returns next minimizer.

func (*Sketch) NextSyncmer added in v0.14.0

func (s *Sketch) NextSyncmer() (code uint64, ok bool)

NextSyncmer returns next syncmer.

type Taxonomy added in v0.9.0

type Taxonomy struct {
	Nodes      map[uint32]uint32 // child -> parent
	DelNodes   map[uint32]struct{}
	MergeNodes map[uint32]uint32 // from -> to
	Names      map[uint32]string

	Ranks map[string]interface{}
	// contains filtered or unexported fields
}

Taxonomy holds relationship of taxon in a taxonomy.

func NewTaxonomy added in v0.9.0

func NewTaxonomy(file string, childColumn int, parentColumn int) (*Taxonomy, error)

NewTaxonomy only loads nodes from nodes.dmp file.

func NewTaxonomyFromNCBI added in v0.9.0

func NewTaxonomyFromNCBI(file string) (*Taxonomy, error)

NewTaxonomyFromNCBI parses nodes relationship from nodes.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz .

func NewTaxonomyWithRank added in v0.11.0

func NewTaxonomyWithRank(file string, childColumn int, parentColumn int, rankColumn int) (*Taxonomy, error)

NewTaxonomyWithRank loads nodes and ranks from nodes.dmp file.

func NewTaxonomyWithRankFromNCBI added in v0.11.0

func NewTaxonomyWithRankFromNCBI(file string) (*Taxonomy, error)

NewTaxonomyWithRankFromNCBI parses Taxonomy from nodes.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz .

func (*Taxonomy) AtOrBelowRank added in v0.18.2

func (t *Taxonomy) AtOrBelowRank(taxid uint32, rank string) bool

AtOrBelowRank returns whether a taxid is at or below one rank.

func (*Taxonomy) CacheLCA added in v0.9.0

func (t *Taxonomy) CacheLCA()

CacheLCA tells to cache every LCA query result

func (*Taxonomy) LCA added in v0.9.0

func (t *Taxonomy) LCA(a uint32, b uint32) uint32

LCA returns the Lowest Common Ancestor of two nodes, 0 for unknown taxid.

func (*Taxonomy) LineageNames added in v0.17.4

func (t *Taxonomy) LineageNames(taxid uint32) []string

LineageNames returns nodes' names of the the complete lineage.

func (*Taxonomy) LineageTaxIds added in v0.17.4

func (t *Taxonomy) LineageTaxIds(taxid uint32) []uint32

LineageTaxIds returns nodes' taxid of the the complete lineage.

func (*Taxonomy) LoadDeletedNodes added in v0.9.0

func (t *Taxonomy) LoadDeletedNodes(file string, column int) error

LoadDeletedNodes loads deleted nodes.

func (*Taxonomy) LoadDeletedNodesFromNCBI added in v0.9.0

func (t *Taxonomy) LoadDeletedNodesFromNCBI(file string) error

LoadDeletedNodesFromNCBI loads deleted nodes from NCBI delnodes.dmp.

func (*Taxonomy) LoadMergedNodes added in v0.9.0

func (t *Taxonomy) LoadMergedNodes(file string, oldColumn int, newColumn int) error

LoadMergedNodes loads merged nodes.

func (*Taxonomy) LoadMergedNodesFromNCBI added in v0.9.0

func (t *Taxonomy) LoadMergedNodesFromNCBI(file string) error

LoadMergedNodesFromNCBI loads merged nodes from NCBI merged.dmp.

func (*Taxonomy) LoadNames added in v0.17.4

func (t *Taxonomy) LoadNames(file string, taxidColumn int, nameColumn int, typeColumn int, _type string) error

LoadNames loads names.

func (*Taxonomy) LoadNamesFromNCBI added in v0.17.4

func (t *Taxonomy) LoadNamesFromNCBI(file string) error

LoadNamesFromNCBI loads scientific names from NCBI names.dmp

func (*Taxonomy) MaxTaxid added in v0.9.0

func (t *Taxonomy) MaxTaxid() uint32

MaxTaxid returns maximum taxid

func (*Taxonomy) Rank added in v0.11.0

func (t *Taxonomy) Rank(taxid uint32) string

Rank returns rank of a taxid.

type Writer

type Writer struct {
	Header
	// contains filtered or unexported fields
}

Writer writes KmerCode.

func NewWriter

func NewWriter(w io.Writer, k int, flag uint32) (*Writer, error)

NewWriter creates a Writer.

func (*Writer) Flush

func (writer *Writer) Flush() (err error)

Flush write the last k-mer

func (*Writer) SetGlobalTaxid added in v0.9.0

func (writer *Writer) SetGlobalTaxid(taxid uint32) error

SetGlobalTaxid sets the global taxid

func (*Writer) SetMaxHash added in v0.14.0

func (writer *Writer) SetMaxHash(maxHash uint64) error

SetMaxHash set the max hash

func (*Writer) SetMaxTaxid added in v0.9.0

func (writer *Writer) SetMaxTaxid(taxid uint32) error

SetMaxTaxid set the maxtaxid

func (*Writer) SetScale added in v0.14.0

func (writer *Writer) SetScale(scale uint32) error

SetScale set the scale

func (*Writer) Write

func (writer *Writer) Write(kcode KmerCode) (err error)

Write writes one KmerCode.

func (*Writer) WriteCode added in v0.8.0

func (writer *Writer) WriteCode(code uint64) (err error)

WriteCode writes one code

func (*Writer) WriteCodeWithTaxid added in v0.9.0

func (writer *Writer) WriteCodeWithTaxid(code uint64, taxid uint32) (err error)

WriteCodeWithTaxid writes a code and its taxid. If UnikIncludeTaxID is off, taxid will not be written.

func (*Writer) WriteHeader added in v0.4.0

func (writer *Writer) WriteHeader() (err error)

WriteHeader writes file header

func (*Writer) WriteKmer

func (writer *Writer) WriteKmer(mer []byte) error

WriteKmer writes one k-mer.

func (*Writer) WriteKmerWithTaxid added in v0.9.0

func (writer *Writer) WriteKmerWithTaxid(mer []byte, taxid uint32) error

WriteKmerWithTaxid writes one k-mer and its taxid

func (*Writer) WriteTaxid added in v0.9.0

func (writer *Writer) WriteTaxid(taxid uint32) (err error)

WriteTaxid appends taxid to the code

func (*Writer) WriteWithTaxid added in v0.9.0

func (writer *Writer) WriteWithTaxid(kcode KmerCode, taxid uint32) (err error)

WriteWithTaxid writes one KmerCode and its taxid. If UnikIncludeTaxID is off, taxid will not be written.

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL