unikmer

v0.6.1
Published: Jan 21, 2019 License: MIT Imports: 5 Imported by: 0

README

unikmer

unikmer (unique-kmer) is a Go package and a command-line toolkit for manipulating unique small k-mers (k <= 32) without frequency information.

K-mers (k <= 32) are encoded as uint64 values, stored in Go's built-in map in RAM, and serialized in a binary format.
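
To make the storage model concrete, here is a minimal sketch of collecting unique k-mers as uint64 keys of a Go map. The helpers encode2bit and uniqueKmers are hypothetical names used only for illustration and are not part of the package; the sketch accepts only uppercase A, C, G and T and skips any window containing another character.

```go
package main

import "fmt"

// encode2bit packs an uppercase ACGT k-mer (1 <= k <= 32) into a uint64,
// two bits per base. Hypothetical helper, not the package's Encode.
func encode2bit(kmer string) (uint64, bool) {
	if len(kmer) == 0 || len(kmer) > 32 {
		return 0, false
	}
	var code uint64
	for i := 0; i < len(kmer); i++ {
		var b uint64
		switch kmer[i] {
		case 'A':
			b = 0
		case 'C':
			b = 1
		case 'G':
			b = 2
		case 'T':
			b = 3
		default:
			return 0, false
		}
		code = code<<2 | b
	}
	return code, true
}

// uniqueKmers collects the distinct k-mers of seq as a set of codes.
// No frequency is kept: the map value is an empty struct.
func uniqueKmers(seq string, k int) map[uint64]struct{} {
	set := make(map[uint64]struct{})
	for i := 0; i+k <= len(seq); i++ {
		if code, ok := encode2bit(seq[i : i+k]); ok {
			set[code] = struct{}{}
		}
	}
	return set
}

func main() {
	// Windows: ACGT, CGTA, GTAC, TACG, ACGT, CGTA, GTAC.
	set := uniqueKmers("ACGTACGTAC", 4)
	fmt.Println(len(set)) // prints 4
}
```

Using struct{} as the map value keeps the set free of per-key payload, which matches the "without frequency information" design stated above.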

The package

The unikmer package provides basic operations on unique small k-mers (without frequency information), along with serialization methods.

Installation
go get -u github.com/shenwei356/unikmer
Benchmark

CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7 GHz

$ go test . -bench=Bench* -benchmem
goos: linux
goarch: amd64
pkg: github.com/shenwei356/unikmer
BenchmarkEncodeK32-16                           50000000                25.8 ns/op             0 B/op          0 allocs/op
BenchmarkEncodeFromFormerKmerK32-16             200000000               9.42 ns/op             0 B/op          0 allocs/op
BenchmarkMustEncodeFromFormerKmerK32-16         1000000000              1.95 ns/op             0 B/op          0 allocs/op
BenchmarkDecodeK32-16                           20000000                82.2 ns/op            32 B/op          1 allocs/op
BenchmarkRevK32-16                              50000000                20.2 ns/op             0 B/op          0 allocs/op
BenchmarkCompK32-16                             50000000                27.8 ns/op             0 B/op          0 allocs/op
BenchmarkRevCompK32-16                          100000000               21.9 ns/op             0 B/op          0 allocs/op

The toolkit

unikmer is a command-line toolkit providing functions including counting, format conversion, set operations and searching on unique small k-mers (k <= 32) without frequency information.

Installation
  1. Download executable binary files (latest version).

  2. Via Bioconda (not available yet)

     conda install unikmer
    
  3. Via Homebrew

     brew install brewsci/bio/unikmer
    
Commands
  1. Counting

     count           count k-mers from FASTA/Q sequences
     subset          extract smaller k-mers from binary file
    
  2. Format conversion

     encode          encode plain k-mer text to integer
     decode          decode encoded integer to k-mer text
     view            read and output binary format to plain text
     dump            convert plain k-mer text to binary format
    
  3. Set operations

     inter           intersection of multiple binary files
     union           union of multiple binary files
     concat          concatenate multiple binary files without removing duplicates
     diff            set difference of multiple binary files
     sample          sample k-mers from binary files
     sort            sort k-mers in binary files to reduce file size
    
  4. Searching

     grep            search k-mers from binary files
     locate          locate k-mers in genome
     uniqs           map k-mers back to genome and find unique subsequences
    
  5. Misc

     stats           statistics of binary files
     num             quickly inspect number of k-mers in binary files
     genautocomplete generate shell autocompletion script
     help            Help about any command
     version         print version information and check for update
    
Binary file (.unik)

K-mers (represented as uint64 in RAM) are serialized as 8-byte arrays (fewer bytes for shorter k-mers in compact format, and far fewer for sorted k-mers), optionally compressed with gzip, and stored in files with the extension .unik.

Compression rate comparison

[Figures: compression rate comparison for Ecoli-MG1655.fasta.gz and A.muciniphila-ATCC_BAA-835.fasta.gz]

label           encoded-kmer[a]  gzip-compressed[b]  compact-format[c]  sorted[d]  comment
plain           -                -                   -                  -          plain text
plain.gz        -                yes                 -                  -          gzipped plain text
.unik           yes              yes                 -                  -          gzipped encoded k-mers in fixed-length byte array
.unik.cpt       yes              yes                 yes                -          gzipped encoded k-mers in shorter fixed-length byte array
.unik.sort      yes              yes                 -                  yes        gzipped sorted encoded k-mers
.unik.ungz      yes              -                   -                  -          encoded k-mers in fixed-length byte array
.unik.cpt.ungz  yes              -                   yes                -          encoded k-mers in shorter fixed-length byte array
  • [a] One k-mer is encoded as a uint64 and serialized in 8 bytes.
  • [b] K-mer files are compressed in gzip format by default; the global option -C/--no-compress switches compression off.
  • [c] One k-mer is encoded as a uint64 and serialized in 8 bytes by default. However, fewer bytes are needed for short k-mers, e.g., 4 bytes are enough for 15-mers (30 bits). The global option -c/--compact enables this more compact format, which yields smaller files.
  • [d] One k-mer is encoded as a uint64; all k-mers are sorted and compressed using the varint-GB algorithm.
  • In all tests, the flag --canonical is ON when running unikmer count.
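
The byte count for the compact format follows from two bits per base. The sketch below computes it as (k + 3) / 4, which agrees with the 15-mer example above, and trims an 8-byte encoding down to that length. The names compactLen and putCompact are hypothetical, and the big-endian byte order is an assumption of this sketch rather than a statement about the .unik layout.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compactLen returns the number of bytes needed for a k-mer
// at 2 bits per base, rounded up to whole bytes.
func compactLen(k int) int {
	return (k + 3) / 4
}

// putCompact returns the low compactLen(k) bytes of code.
// Big-endian order is an assumption made for this sketch.
func putCompact(code uint64, k int) []byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], code)
	return buf[8-compactLen(k):]
}

func main() {
	for _, k := range []int{15, 23, 32} {
		fmt.Printf("k=%d needs %d bytes\n", k, compactLen(k))
	}
	fmt.Println(putCompact(27, 4)) // prints [27]
}
```

For k = 23, as used in the Quick Start, this gives 6 bytes per k-mer instead of 8.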
Quick Start
# memusg measures elapsed time and peak RAM usage: https://github.com/shenwei356/memusg


# counting
$ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23
elapsed time: 2.633s
peak rss: 425.98 MB

$ ls -lh Ecoli-MG1655.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei 30M Sep 23 14:13 Ecoli-MG1655.fasta.gz.k23.unik



# counting (only keep the canonical k-mers)
$ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23 --canonical
elapsed time: 1.536s
peak rss: 236.05 MB

$ ls -lh Ecoli-MG1655.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei 22M Sep 23 14:14 Ecoli-MG1655.fasta.gz.k23.unik



# counting (only keep the canonical k-mers and compact output)
# memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23 --canonical --compact
$ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23 --canonical --compact
elapsed time: 1.540s
peak rss: 238.54 MB

$ ls -lh Ecoli-MG1655.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei 19M Sep 23 14:15 Ecoli-MG1655.fasta.gz.k23.unik


# counting (only keep the canonical k-mers and sort k-mers)
# memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted --canonical --compact --sort
$ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted --canonical --compact --sort
elapsed time: 2.847s
peak rss: 337.11 MB

$ ls -lh Ecoli-MG1655.fasta.gz.k23.sorted.unik
-rw-r--r-- 1 shenwei shenwei 16M Oct  6 23:23 Ecoli-MG1655.fasta.gz.k23.sorted.unik


# view
$ unikmer view Ecoli-MG1655.fasta.gz.k23.unik | head -n 3
AGCTTTTCATTCTGACTGCAACG
CCGTTGCAGTCAGAATGAAAAGC
CCCGTTGCAGTCAGAATGAAAAG



# stats
$ unikmer stats Ecoli-MG1655.fasta.gz.k23.*unik -a
file                                    k  gzipped  compact  canonical  sorted     number
Ecoli-MG1655.fasta.gz.k23.sorted.unik  23  ✓        ✓        ✓          ✓       4,546,632
Ecoli-MG1655.fasta.gz.k23.unik         23  ✓        ✓        ✓          ✕       4,546,632


# union
$ time unikmer union Ecoli-MG1655.fasta.gz.k23.sorted.unik Ecoli-IAI39.fasta.gz.k23.sorted.unik -o union.k23 -c -s
real    0m4.880s
user    0m5.741s
sys     0m0.140s

# concat
$ time unikmer concat Ecoli-MG1655.fasta.gz.k23.sorted.unik Ecoli-IAI39.fasta.gz.k23.sorted.unik -o concat.k23 -c
real    0m1.620s
user    0m2.820s
sys     0m0.030s

# intersection
$ time unikmer inter Ecoli-MG1655.fasta.gz.k23.sorted.unik Ecoli-IAI39.fasta.gz.k23.sorted.unik -o inter.k23 -c -s
real    0m2.881s
user    0m3.517s
sys     0m0.106s

# difference
$ time unikmer diff -j 1 Ecoli-MG1655.fasta.gz.k23.sorted.unik Ecoli-IAI39.fasta.gz.k23.sorted.unik -o diff.k23 -c -s
real    0m2.872s
user    0m2.790s
sys     0m0.080s


$ ls -lh *.unik
-rw-r--r-- 1 shenwei shenwei  47M Oct  9 22:57 concat.k23.unik
-rw-r--r-- 1 shenwei shenwei 7.1M Oct  9 22:54 diff.k23.unik
-rw-r--r-- 1 shenwei shenwei  17M Oct  9 22:51 Ecoli-IAI39.fasta.gz.k23.sorted.unik
-rw-r--r-- 1 shenwei shenwei  21M Oct  9 22:55 Ecoli-IAI39.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei  16M Oct  9 22:55 Ecoli-MG1655.fasta.gz.k23.sorted.unik
-rw-r--r-- 1 shenwei shenwei  19M Oct  9 22:56 Ecoli-MG1655.fasta.gz.k23.unik
-rw-r--r-- 1 shenwei shenwei 9.1M Oct  9 22:53 inter.k23.unik
-rw-r--r-- 1 shenwei shenwei  22M Oct  9 22:53 union.k23.unik


$ unikmer stats *.unik -a -j 10
file                                    k  gzipped  compact  canonical  sorted     number
concat.k23.unik                        23  ✓        ✓        ✓          ✕       9,448,898
diff.k23.unik                          23  ✓        ✓        ✓          ✓       1,970,462
Ecoli-IAI39.fasta.gz.k23.sorted.unik   23  ✓        ✓        ✓          ✓       4,902,266
Ecoli-IAI39.fasta.gz.k23.unik          23  ✓        ✓        ✓          ✕       4,902,266
Ecoli-MG1655.fasta.gz.k23.sorted.unik  23  ✓        ✓        ✓          ✓       4,546,632
Ecoli-MG1655.fasta.gz.k23.unik         23  ✓        ✓        ✓          ✕       4,546,632
inter.k23.unik                         23  ✓        ✓        ✓          ✓       2,576,170
union.k23.unik                         23  ✓        ✓        ✓          ✓       6,872,728
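
The counts in the stats table above are mutually consistent under the usual set identities, which is a quick sanity check on the set commands. The sketch below only replays the arithmetic on the numbers printed above; the function name consistent is hypothetical.

```go
package main

import "fmt"

// consistent checks the set identities for two k-mer sets A and B:
//   |A ∪ B| = |A| + |B| - |A ∩ B|
//   |A \ B| = |A| - |A ∩ B|
//   concat keeps duplicates, so its size is |A| + |B|.
func consistent(a, b, inter, union, diff, concat int) bool {
	return a+b-inter == union && a-inter == diff && a+b == concat
}

func main() {
	// A = Ecoli-MG1655, B = Ecoli-IAI39, numbers taken from the stats output.
	fmt.Println(consistent(4546632, 4902266, 2576170, 6872728, 1970462, 9448898)) // prints true
}
```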

Contributing

We welcome pull requests, bug fixes and issue reports.

License

MIT License

Documentation

Constants

const (
	// UNIK_COMPACT means k-mers are serialized in fixed-length byte arrays of n = int((K + 3) / 4) bytes.
	UNIK_COMPACT = 1 << iota
	// UNIK_CANONICAL means only canonical Kmers kept.
	UNIK_CANONICAL
	// UNIK_SORTED means Kmers are sorted
	UNIK_SORTED // when sorted, the serialization structure is very different
)
const MainVersion uint8 = 2

MainVersion is the main version number.

const MinorVersion uint8 = 0

MinorVersion is the minor version number.

Variables

var ErrBrokenFile = errors.New("unikmer: broken file")

ErrBrokenFile means the file is not complete.

var ErrCodeOverflow = errors.New("unikmer: code value overflow")

ErrCodeOverflow means the encoded integer is bigger than 4^k.

var ErrIllegalBase = errors.New("unikmer: illegal base")

ErrIllegalBase means that a base outside the IUPAC symbols was detected.

var ErrInvalidFileFormat = errors.New("unikmer: invalid binary format")

ErrInvalidFileFormat means invalid file format.

var ErrKMismatch = errors.New("unikmer: K mismatch")

ErrKMismatch means K size mismatch.

var ErrKOverflow = errors.New("unikmer: K-mer size (1-32) overflow")

ErrKOverflow means K > 32.

var ErrNotConsecutiveKmers = errors.New("unikmer: not consecutive k-mers")

ErrNotConsecutiveKmers means the two k-mers are not consecutive

var Magic = [8]byte{'.', 'u', 'n', 'i', 'k', 'm', 'e', 'r'}

Magic number of binary file.

var MaxCode []uint64

MaxCode is the maximum integer code for each K.

Functions

func Complement

func Complement(code uint64, k int) (c uint64)

Complement returns code of complement sequence.

func Decode

func Decode(code uint64, k int) []byte

Decode converts the code back to the original sequence.
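
As an illustration of what decoding involves, the sketch below unpacks a 2-bit-per-base code into bases, reading from the lowest two bits upward. decode2bit is a hypothetical stand-in, not the package's implementation, and it assumes the A=00, C=01, G=10, T=11 mapping listed under Encode.

```go
package main

import "fmt"

// decode2bit converts a 2-bit-per-base code back into a k-mer.
// The last base of the k-mer lives in the lowest two bits.
func decode2bit(code uint64, k int) []byte {
	const bases = "ACGT"
	kmer := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		kmer[i] = bases[code&3]
		code >>= 2
	}
	return kmer
}

func main() {
	fmt.Println(string(decode2bit(27, 4))) // prints ACGT
}
```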

func Encode

func Encode(kmer []byte) (code uint64, err error)

Encode converts byte slice to bits.

Codes:

A    00
C    01
G    10
T    11

For degenerate bases, only the first base is kept.

M       AC     A
V       ACG    A
H       ACT    A
R       AG     A
D       AGT    A
W       AT     A
S       CG     C
B       CGT    C
Y       CT     C
K       GT     G
N       ACGT   A
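
The two tables above can be folded into a single lookup, as in the following sketch. encodeIUPAC and baseCode are hypothetical names; the sketch handles uppercase symbols only, and whether the package also accepts lowercase letters or U is not covered here.

```go
package main

import (
	"errors"
	"fmt"
)

var errIllegalBase = errors.New("illegal base")
var errBadLength = errors.New("k-mer length must be 1-32")

// baseCode maps each IUPAC symbol to the 2-bit code of its first base,
// following the tables above.
var baseCode = map[byte]uint64{
	'A': 0, 'C': 1, 'G': 2, 'T': 3,
	'M': 0, 'V': 0, 'H': 0, 'R': 0, 'D': 0, 'W': 0, 'N': 0,
	'S': 1, 'B': 1, 'Y': 1,
	'K': 2,
}

// encodeIUPAC packs a k-mer into a uint64, two bits per base,
// with the first base in the highest used bits.
func encodeIUPAC(kmer []byte) (uint64, error) {
	if len(kmer) == 0 || len(kmer) > 32 {
		return 0, errBadLength
	}
	var code uint64
	for _, b := range kmer {
		v, ok := baseCode[b]
		if !ok {
			return 0, errIllegalBase
		}
		code = code<<2 | v
	}
	return code, nil
}

func main() {
	code, _ := encodeIUPAC([]byte("ACGT"))
	fmt.Println(code) // prints 27 (binary 00 01 10 11)
	code, _ = encodeIUPAC([]byte("NCGT"))
	fmt.Println(code) // prints 27, because N is kept as A
}
```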

func EncodeFromFormerKmer added in v0.2.1

func EncodeFromFormerKmer(kmer []byte, leftKmer []byte, leftCode uint64) (uint64, error)

EncodeFromFormerKmer encodes from the former k-mer, inspired by ntHash
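
The idea behind encoding from a neighbouring k-mer is that two consecutive k-mers share k-1 bases, so the new code can be derived with a shift and a mask instead of a full pass. The sketch below shows that step under the 2-bit mapping listed under Encode; nextCode and baseBits are hypothetical helpers with a simplified signature, not the package's API.

```go
package main

import "fmt"

// baseBits returns the 2-bit code of an uppercase base.
func baseBits(b byte) (uint64, bool) {
	switch b {
	case 'A':
		return 0, true
	case 'C':
		return 1, true
	case 'G':
		return 2, true
	case 'T':
		return 3, true
	}
	return 0, false
}

// nextCode takes the code of a k-mer and the base that follows it in the
// sequence, and returns the code of the next overlapping k-mer.
func nextCode(leftCode uint64, k int, next byte) (uint64, bool) {
	v, ok := baseBits(next)
	if !ok || k < 1 || k > 32 {
		return 0, false
	}
	// For k = 32 the shift yields 0 and the subtraction wraps to all ones,
	// which is the intended 64-bit mask.
	mask := uint64(1)<<(2*uint(k)) - 1
	return (leftCode<<2 | v) & mask, true
}

func main() {
	// ACGT has code 27; appending A and dropping the leading A gives CGTA.
	code, _ := nextCode(27, 4, 'A')
	fmt.Println(code) // prints 108 (binary 01 10 11 00)
}
```

This constant-time update is consistent with the benchmark above, where encoding from the former k-mer is several times faster than encoding from scratch.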

func EncodeFromLatterKmer added in v0.2.1

func EncodeFromLatterKmer(kmer []byte, rightKmer []byte, rightCode uint64) (uint64, error)

EncodeFromLatterKmer encodes from the latter k-mer.

func MustEncodeFromFormerKmer added in v0.2.1

func MustEncodeFromFormerKmer(kmer []byte, leftKmer []byte, leftCode uint64) (uint64, error)

MustEncodeFromFormerKmer encodes from the former k-mer, assuming the k-mer and leftKmer are both OK.

func MustEncodeFromLatterKmer added in v0.2.1

func MustEncodeFromLatterKmer(kmer []byte, rightKmer []byte, rightCode uint64) (uint64, error)

MustEncodeFromLatterKmer encodes from the latter k-mer, assuming the k-mer and rightKmer are both OK.

func PutUint64s added in v0.4.0

func PutUint64s(buf []byte, v1, v2 uint64) (ctrl byte, n int)

PutUint64s encodes two uint64s into 2-16 bytes, and returns the control byte and the encoded byte length.

func RevComp added in v0.2.1

func RevComp(code uint64, k int) (c uint64)

RevComp returns code of reverse complement sequence.

func Reverse

func Reverse(code uint64, k int) (c uint64)

Reverse returns code of the reversed sequence.
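
With the A=00, C=01, G=10, T=11 mapping, complement, reverse and reverse complement reduce to simple bit operations. The following is a sketch of one way to implement them, using hypothetical lowercase names; it is not taken from the package source.

```go
package main

import "fmt"

// complement flips every base: A<->T and C<->G are bitwise inverses
// under the 2-bit mapping, so a masked NOT is enough.
func complement(code uint64, k int) uint64 {
	mask := uint64(1)<<(2*uint(k)) - 1 // all ones for k = 32
	return ^code & mask
}

// reverse reverses the order of the k 2-bit groups.
func reverse(code uint64, k int) uint64 {
	var r uint64
	for i := 0; i < k; i++ {
		r = r<<2 | code&3
		code >>= 2
	}
	return r
}

// revComp composes the two operations.
func revComp(code uint64, k int) uint64 {
	return reverse(complement(code, k), k)
}

func main() {
	// AAAC has code 1; its reverse complement is GTTT.
	fmt.Println(complement(1, 4)) // prints 254 (TTTG)
	fmt.Println(reverse(1, 4))    // prints 64  (CAAA)
	fmt.Println(revComp(1, 4))    // prints 191 (GTTT)
}
```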

func Uint64s added in v0.4.0

func Uint64s(ctrl byte, buf []byte) (values [2]uint64, n int)

Uint64s decodes two uint64s from encoded bytes.
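
To show how two uint64s can fit in 2-16 bytes plus a control byte, here is a sketch of a group-varint style pair encoding. The layout is an assumption made for illustration (three control bits per value holding its byte length minus one, bytes stored big-endian) and may differ from what PutUint64s and Uint64s actually do; putPair and getPair are hypothetical names.

```go
package main

import (
	"fmt"
	"math/bits"
)

// byteLen returns how many bytes are needed to hold v (at least 1).
func byteLen(v uint64) int {
	n := (bits.Len64(v) + 7) / 8
	if n == 0 {
		return 1
	}
	return n
}

// putPair writes v1 then v2 into buf using the fewest bytes for each and
// returns the control byte and the number of bytes written (2 to 16).
func putPair(buf []byte, v1, v2 uint64) (ctrl byte, n int) {
	n1, n2 := byteLen(v1), byteLen(v2)
	for i := 0; i < n1; i++ {
		buf[i] = byte(v1 >> (8 * uint(n1-1-i)))
	}
	for i := 0; i < n2; i++ {
		buf[n1+i] = byte(v2 >> (8 * uint(n2-1-i)))
	}
	ctrl = byte((n1-1)<<3 | (n2 - 1))
	return ctrl, n1 + n2
}

// getPair reverses putPair using the lengths stored in ctrl.
func getPair(ctrl byte, buf []byte) (values [2]uint64, n int) {
	n1 := int(ctrl>>3&7) + 1
	n2 := int(ctrl&7) + 1
	for i := 0; i < n1; i++ {
		values[0] = values[0]<<8 | uint64(buf[i])
	}
	for i := 0; i < n2; i++ {
		values[1] = values[1]<<8 | uint64(buf[n1+i])
	}
	return values, n1 + n2
}

func main() {
	buf := make([]byte, 16)
	ctrl, n := putPair(buf, 300, 5)
	fmt.Println(ctrl, n) // prints 8 3
	values, _ := getPair(ctrl, buf)
	fmt.Println(values) // prints [300 5]
}
```

Small values take few bytes under such a scheme, which is why sorting helps: differences between neighbouring sorted codes are typically much smaller than the codes themselves.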

Types

type CodeSlice added in v0.4.0

type CodeSlice []uint64

CodeSlice is a slice of Kmer code (uint64), for sorting

func (CodeSlice) Len added in v0.4.0

func (codes CodeSlice) Len() int

Len returns the length of the slice.

func (CodeSlice) Less added in v0.4.0

func (codes CodeSlice) Less(i, j int) bool

Less simply compares two codes.

func (CodeSlice) Swap added in v0.4.0

func (codes CodeSlice) Swap(i, j int)

Swap swaps two elements
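
Len, Less and Swap together satisfy the standard library's sort.Interface, so a slice of codes can be passed to sort.Sort. The sketch below uses a local codeSlice type that mirrors the documented methods so that it runs on its own; sortedCodes is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// codeSlice mirrors the documented CodeSlice: a []uint64 with the three
// methods required by sort.Interface.
type codeSlice []uint64

func (c codeSlice) Len() int           { return len(c) }
func (c codeSlice) Less(i, j int) bool { return c[i] < c[j] }
func (c codeSlice) Swap(i, j int)      { c[i], c[j] = c[j], c[i] }

// sortedCodes returns a sorted copy and leaves the input untouched.
func sortedCodes(codes []uint64) []uint64 {
	out := make(codeSlice, len(codes))
	copy(out, codes)
	sort.Sort(out)
	return out
}

func main() {
	fmt.Println(sortedCodes([]uint64{228, 27, 108})) // prints [27 108 228]
}
```

Because A < C < G < T in the 2-bit mapping, numeric order of the codes is the same as lexicographic order of k-mers of equal length.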

type Header struct {
	MainVersion  uint8
	MinorVersion uint8
	K            int
	Flag         uint32
	Number       int64 // -1 for unknown
}

Header contains metadata

func (Header) String

func (h Header) String() string

type KmerCode

type KmerCode struct {
	Code uint64
	K    int
}

KmerCode is a struct representing a k-mer in 64 bits.

func NewKmerCode

func NewKmerCode(kmer []byte) (KmerCode, error)

NewKmerCode returns a new KmerCode struct from byte slice.

func NewKmerCodeFromFormerOne added in v0.2.1

func NewKmerCodeFromFormerOne(kmer []byte, leftKmer []byte, preKcode KmerCode) (KmerCode, error)

NewKmerCodeFromFormerOne computes KmerCode from the former consecutive k-mer.

func NewKmerCodeMustFromFormerOne added in v0.2.1

func NewKmerCodeMustFromFormerOne(kmer []byte, leftKmer []byte, preKcode KmerCode) (KmerCode, error)

NewKmerCodeMustFromFormerOne computes KmerCode from the former consecutive k-mer, assuming the k-mer and leftKmer are both OK.

func (KmerCode) Bytes

func (kcode KmerCode) Bytes() []byte

Bytes returns k-mer in []byte.

func (KmerCode) Canonical added in v0.2.1

func (kcode KmerCode) Canonical() KmerCode

Canonical returns the canonical k-mer.
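
A common definition of the canonical k-mer is the smaller of a k-mer and its reverse complement, so that both strands map to the same key. The sketch below implements that definition on 2-bit codes; that this package uses exactly the same rule is an assumption, and revComp and canonical are hypothetical helpers.

```go
package main

import "fmt"

// revComp builds the reverse complement in one pass: each base is read
// from the low end, complemented (3 - bits), and appended to the result.
func revComp(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - code&3)
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of code and its reverse complement.
func canonical(code uint64, k int) uint64 {
	if rc := revComp(code, k); rc < code {
		return rc
	}
	return code
}

func main() {
	// GTTT (191) and AAAC (1) are reverse complements of each other.
	fmt.Println(canonical(191, 4), canonical(1, 4)) // prints 1 1
}
```

Keeping only canonical k-mers roughly halves the number of distinct keys for double-stranded input, which is in line with the lower memory use and smaller file reported for --canonical in the Quick Start.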

func (KmerCode) Comp

func (kcode KmerCode) Comp() KmerCode

Comp returns KmerCode of the complement sequence.

func (KmerCode) Equal

func (kcode KmerCode) Equal(kcode2 KmerCode) bool

Equal checks whether two KmerCodes are the same.

func (KmerCode) Rev

func (kcode KmerCode) Rev() KmerCode

Rev returns KmerCode of the reverse sequence.

func (KmerCode) RevComp

func (kcode KmerCode) RevComp() KmerCode

RevComp returns KmerCode of the reverse complement sequence.

func (KmerCode) String

func (kcode KmerCode) String() string

String returns the k-mer as a string.

type KmerCodeSlice added in v0.4.0

type KmerCodeSlice []KmerCode

KmerCodeSlice is a slice of KmerCode, for sorting

func (KmerCodeSlice) Len added in v0.4.0

func (codes KmerCodeSlice) Len() int

Len returns the length of the slice.

func (KmerCodeSlice) Less added in v0.4.0

func (codes KmerCodeSlice) Less(i, j int) bool

Less simply compares two KmerCodes.

func (KmerCodeSlice) Swap added in v0.4.0

func (codes KmerCodeSlice) Swap(i, j int)

Swap swaps two elements

type Reader

type Reader struct {
	Header
	// contains filtered or unexported fields
}

Reader is for reading KmerCode.

func NewReader

func NewReader(r io.Reader) (reader *Reader, err error)

NewReader returns a Reader.

func (*Reader) Read

func (reader *Reader) Read() (KmerCode, error)

Read reads one KmerCode.

type Writer

type Writer struct {
	Header
	// contains filtered or unexported fields
}

Writer writes KmerCode.

func NewWriter

func NewWriter(w io.Writer, k int, flag uint32) (*Writer, error)

NewWriter creates a Writer.

func (*Writer) Flush

func (writer *Writer) Flush() (err error)

Flush writes the last k-mer.

func (*Writer) Write

func (writer *Writer) Write(kcode KmerCode) (err error)

Write writes one KmerCode.

func (*Writer) WriteHeader added in v0.4.0

func (writer *Writer) WriteHeader() (err error)

WriteHeader writes the file header.

func (*Writer) WriteKmer

func (writer *Writer) WriteKmer(mer []byte) error

WriteKmer writes one k-mer.

Directories

Path Synopsis
cmd
