package cmd

v0.9.4
Published: Sep 18, 2023 License: MIT Imports: 57 Imported by: 0

Documentation

Constants

const PosPopCountBufSize = 64 // 64 is the cache line size for most 64-bit machines.

PosPopCountBufSize defines the size of the byte slice fed to pospopcount (github.com/clausecker/pospop).

Theoretically, a size above 240 is better, but in this scenario the signature matrix must first be transposed, which is the performance bottleneck. The column size of the matrix is fixed, so the row size must be chosen to balance the time spent on matrix transposition against that of pospopcount.

64 is the best value for my machine (AMD Ryzen 2700X).
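To make the term concrete, here is a minimal sketch of what a positional population count computes over one 64-byte buffer. It is a naive scalar reference, not the SIMD routine from github.com/clausecker/pospop, and the name `posPopCount8` is hypothetical.

```go
package main

import "fmt"

// posPopCount8 adds, for each bit position 0..7, the number of bytes in buf
// that have that bit set. Naive reference only; the real library uses SIMD.
func posPopCount8(counts *[8]int, buf []byte) {
	for _, b := range buf {
		for i := 0; i < 8; i++ {
			counts[i] += int(b>>uint(i)) & 1
		}
	}
}

func main() {
	const bufSize = 64 // mirrors PosPopCountBufSize
	buf := make([]byte, bufSize)
	for i := range buf {
		buf[i] = 0b00000101 // bits 0 and 2 are set in every byte
	}
	var counts [8]int
	posPopCount8(&counts, buf)
	fmt.Println(counts) // prints [64 0 64 0 0 0 0 0]
}
```

Each row of the buffer contributes one bit per column, which is why the signature matrix has to be transposed before the counts line up with targets.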

const UnikIndexDBVersion uint8 = 4

UnikIndexDBVersion is the version of the database.

Variables

var BufferSize = 65536 // os.Getpagesize()

BufferSize is the size of the buffer.

var ErrVersionMismatch = errors.New("kmcp/index: version mismatch")

ErrVersionMismatch indicates a version mismatch.

var HicUreadsMinProp0 float64 = 0.1

var RootCmd = &cobra.Command{
	Use:   "kmcp",
	Short: "K-mer-based Metagenomic Classification and Profilling",
	Long: fmt.Sprintf(`
    Program: kmcp (K-mer-based Metagenomic Classification and Profiling)
    Version: v%s
  Documents: https://bioinf.shenwei.me/kmcp
Source code: https://github.com/shenwei356/kmcp

KMCP is a tool for metagenomic classification and profiling.

KMCP can also be used for:
  1. Fast sequence search against large scales of genomic datasets
     as BIGSI and COBS do.
  2. Fast assembly/genome similarity estimation as Mash and sourmash do,
     by utilizing Minimizer, FracMinHash (Scaled MinHash), or Closed Syncmers.

`, VERSION),
}

RootCmd represents the base command when called without any subcommands

var VERSION = "0.9.4"

VERSION is the version

Functions

func BinomialCoeff added in v0.9.0

func BinomialCoeff(n int, k int) float64

BinomialCoeff computes the binomial coefficient C(n, k), following https://www.geeksforgeeks.org/space-and-time-efficient-binomial-coefficient/. It is slow; call BinomialCoeffWithCache to create an equivalent function backed by a cache.

func BinomialCoeffWithCache added in v0.9.0

func BinomialCoeffWithCache(bufSize int) func(n int, k int) float64

BinomialCoeffWithCache returns a BinomialCoeff function with cache. When n < bufSize, it returns cached values.
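The caching idea can be sketched as follows. This is an illustration of the pattern, not the package's implementation: the lowercase names are hypothetical, and the assumption is that values for every n < bufSize are precomputed while larger n fall back to direct computation.

```go
package main

import "fmt"

// binomialCoeff computes C(n, k) with the multiplicative formula.
func binomialCoeff(n, k int) float64 {
	if k < 0 || k > n {
		return 0
	}
	if k > n-k {
		k = n - k // C(n, k) == C(n, n-k)
	}
	res := 1.0
	for i := 0; i < k; i++ {
		res *= float64(n - i)
		res /= float64(i + 1)
	}
	return res
}

// binomialCoeffWithCache precomputes C(n, k) for all n < bufSize and
// falls back to direct computation for anything outside the table.
func binomialCoeffWithCache(bufSize int) func(n, k int) float64 {
	cache := make([][]float64, bufSize)
	for n := range cache {
		cache[n] = make([]float64, n+1)
		for k := 0; k <= n; k++ {
			cache[n][k] = binomialCoeff(n, k)
		}
	}
	return func(n, k int) float64 {
		if n >= 0 && n < bufSize && k >= 0 && k <= n {
			return cache[n][k]
		}
		return binomialCoeff(n, k)
	}
}

func main() {
	bc := binomialCoeffWithCache(16)
	fmt.Println(bc(5, 2))   // served from the table
	fmt.Println(bc(20, 10)) // n >= bufSize, computed directly
}
```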

func CalcFPR added in v0.9.0

func CalcFPR(numElements uint64, numHashes int, signatureSize uint64) float64

CalcFPR computes the actual FPR of a Bloom filter that holds fewer k-mers than the largest one in its block. In COBS, the size of all Bloom filters in a block is determined by the genome with the most k-mers, so Bloom filters with fewer k-mers have a smaller FPR.
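The effect can be illustrated with the standard Bloom filter approximation, FPR ≈ (1 - e^(-k·n/m))^k. Whether CalcFPR uses exactly this form is an assumption; `bloomFPR` is a hypothetical stand-in with the same parameter order.

```go
package main

import (
	"fmt"
	"math"
)

// bloomFPR estimates the false positive rate of a Bloom filter holding
// numElements items in signatureSize bits with numHashes hash functions,
// using the standard approximation (1 - exp(-k*n/m))^k.
func bloomFPR(numElements uint64, numHashes int, signatureSize uint64) float64 {
	k := float64(numHashes)
	load := k * float64(numElements) / float64(signatureSize)
	return math.Pow(1-math.Exp(-load), k)
}

func main() {
	// The block's filter size is set by its largest genome (here 1,000,000
	// k-mers at a target FPR of roughly 0.25 with one hash function).
	const size = 3476060
	fmt.Printf("largest genome: %.4f\n", bloomFPR(1000000, 1, size))
	fmt.Printf("smaller genome: %.4f\n", bloomFPR(500000, 1, size))
}
```

The smaller genome shares the same filter size, so its filter is sparser and its FPR is lower.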

func CalcSignatureSize

func CalcSignatureSize(numElements uint64, numHashes int, falsePositiveRate float64) uint64

CalcSignatureSize is from https://github.com/bingmann/cobs/blob/master/cobs/util/calc_signature_size.cpp, but the result can optionally be rounded up to 2^n.

    import math

    def roundup(x):
        x -= 1
        x |= x >> 1
        x |= x >> 2
        x |= x >> 4
        x |= x >> 8
        x |= x >> 16
        x |= x >> 32
        return (x | x >> 64) + 1

    f = lambda ne, nh, fpr: math.ceil(-nh / (math.log(1 - math.pow(fpr, 1 / nh))) * ne)

    roundup(f(300000, 1, 0.25))
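The same arithmetic, translated to Go as a sketch. The function names `signatureSize` and `roundup` are hypothetical and only mirror the Python snippet above; they are not the package's own code.

```go
package main

import (
	"fmt"
	"math"
)

// roundup rounds x up to the next power of two (x must be >= 1).
func roundup(x uint64) uint64 {
	x--
	x |= x >> 1
	x |= x >> 2
	x |= x >> 4
	x |= x >> 8
	x |= x >> 16
	x |= x >> 32
	return x + 1
}

// signatureSize returns the number of bits needed to hold numElements items
// with numHashes hash functions at the given false positive rate.
func signatureSize(numElements uint64, numHashes int, fpr float64) uint64 {
	nh := float64(numHashes)
	ratio := -nh / math.Log(1-math.Pow(fpr, 1/nh))
	return uint64(math.Ceil(ratio * float64(numElements)))
}

func main() {
	size := signatureSize(300000, 1, 0.25)
	fmt.Println(size, roundup(size)) // prints 1042818 1048576
}
```

Rounding up to a power of two costs some space but lets a hash be mapped to a bit position with a mask instead of a modulo.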

func Combinations added in v0.4.0

func Combinations(set []uint64, n int) (subsets [][]uint64)

Combinations is modified from https://github.com/mxschmitt/golang-combinations/blob/master/combinations.go. It is too slow for large n.

func Combinations2 added in v0.4.0

func Combinations2(set []uint64) [][2]uint64

Combinations2 returns the 2-element combinations of set. Note: set should not contain duplicates.
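A minimal sketch of enumerating all pairs, assuming the result holds each unordered pair once. The name `pairs` and the output order are assumptions, not the package's behavior.

```go
package main

import "fmt"

// pairs returns every unordered pair of elements from set.
// Duplicates in set would produce duplicate pairs.
func pairs(set []uint64) [][2]uint64 {
	n := len(set)
	if n < 2 {
		return nil
	}
	out := make([][2]uint64, 0, n*(n-1)/2)
	for i := 0; i < n-1; i++ {
		for j := i + 1; j < n; j++ {
			out = append(out, [2]uint64{set[i], set[j]})
		}
	}
	return out
}

func main() {
	fmt.Println(pairs([]uint64{1, 2, 3})) // prints [[1 2] [1 3] [2 3]]
}
```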

func Execute

func Execute()

Execute adds all child commands to the root command and sets flags appropriately. This is called by main.main(). It only needs to happen once to the rootCmd.

func IntSlice2StringSlice added in v0.5.0

func IntSlice2StringSlice(vals []int) []string

func MeanStdev added in v0.7.0

func MeanStdev(values []float64) (float64, float64)

func NewSearchResultParser added in v0.6.0

func NewSearchResultParser(file string, poolStrings *sync.Pool, poolMatches *sync.Pool, scoreField int, numFields int, bufferSize int) (chan *SearchResult, chan [2]string, error)

func QueryFPR added in v0.9.0

func QueryFPR(n int, k int, fpr float64) float64

QueryFPR follows Theorem 2 in doi:10.1038/nbt.3442.

n: the number of query k-mers.
k: the number of matched k-mers.
fpr: the false positive rate of a single k-mer.
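As a rough sketch, the false positive rate of a query can be modeled as a binomial tail: the chance that at least k of n independent k-mers are false positives. Treating the boundary as "at least k" is an assumption here, and `queryFPR` and `binom` are hypothetical names, not the package's code.

```go
package main

import (
	"fmt"
	"math"
)

// binom computes C(n, k) with the multiplicative formula.
func binom(n, k int) float64 {
	if k < 0 || k > n {
		return 0
	}
	res := 1.0
	for i := 0; i < k; i++ {
		res = res * float64(n-i) / float64(i+1)
	}
	return res
}

// queryFPR returns P(X >= k) for X ~ Binomial(n, fpr), i.e. one minus the
// probability that fewer than k of the n query k-mers are false positives.
func queryFPR(n, k int, fpr float64) float64 {
	if k <= 0 {
		return 1
	}
	if k > n {
		return 0
	}
	var below float64 // P(X <= k-1)
	for i := 0; i < k; i++ {
		below += binom(n, i) * math.Pow(fpr, float64(i)) * math.Pow(1-fpr, float64(n-i))
	}
	return 1 - below
}

func main() {
	// With 70 query k-mers and a per-k-mer FPR of 0.3, requiring more
	// matched k-mers makes a spurious hit much less likely.
	for _, k := range []int{20, 40, 60} {
		fmt.Printf("k=%d fpr=%.3g\n", k, queryFPR(70, k, 0.3))
	}
}
```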

func QueryFPRWithCache added in v0.9.0

func QueryFPRWithCache(bufSize int) func(n int, k int, fpr float64) float64

QueryFPRWithCache returns a QueryFPR function with the BinomialCoeff cached.

func QueryFPRWithCacheWithConstantFPR added in v0.9.0

func QueryFPRWithCacheWithConstantFPR(bufSize int, fpr float64) func(n int, k int) float64

QueryFPRWithCacheWithConstantFPR returns a QueryFPR function with cache. Here the fpr is constant, so the results can easily be buffered.

func QueryFPRWithCacheWithFPRBins added in v0.9.0

func QueryFPRWithCacheWithFPRBins(bufSize int, maxFPR float64, bins int) func(n int, k int, fpr float64) float64

QueryFPRWithCacheWithFPRBins returns a QueryFPR function with cache. Because fpr is a float, it cannot directly index a slice of cached values; instead, the range up to maxFPR is split into bins. When n < bufSize and fpr <= maxFPR, it returns cached values.
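One way to realize the binning idea is sketched below. It assumes the table is filled eagerly and that an fpr is snapped to the nearest bin boundary; both choices, and all lowercase names, are illustrative assumptions rather than the package's actual scheme.

```go
package main

import (
	"fmt"
	"math"
)

func binom(n, k int) float64 {
	res := 1.0
	for i := 0; i < k; i++ {
		res = res * float64(n-i) / float64(i+1)
	}
	return res
}

// queryFPR returns P(X >= k) for X ~ Binomial(n, fpr).
func queryFPR(n, k int, fpr float64) float64 {
	if k <= 0 {
		return 1
	}
	if k > n {
		return 0
	}
	var below float64
	for i := 0; i < k; i++ {
		below += binom(n, i) * math.Pow(fpr, float64(i)) * math.Pow(1-fpr, float64(n-i))
	}
	return 1 - below
}

// queryFPRWithBins caches queryFPR for n < bufSize at bins+1 evenly spaced
// fpr values in [0, maxFPR]. A lookup snaps fpr to the nearest cached value,
// trading a small approximation error for a slice lookup.
func queryFPRWithBins(bufSize int, maxFPR float64, bins int) func(n, k int, fpr float64) float64 {
	step := maxFPR / float64(bins)
	cache := make([][][]float64, bins+1)
	for b := range cache {
		cache[b] = make([][]float64, bufSize)
		for n := range cache[b] {
			cache[b][n] = make([]float64, n+1)
			for k := 0; k <= n; k++ {
				cache[b][n][k] = queryFPR(n, k, float64(b)*step)
			}
		}
	}
	return func(n, k int, fpr float64) float64 {
		if n < 0 || n >= bufSize || k < 0 || k > n || fpr < 0 || fpr > maxFPR {
			return queryFPR(n, k, fpr) // outside the table: compute directly
		}
		b := int(math.Round(fpr / step))
		if b > bins {
			b = bins
		}
		return cache[b][n][k]
	}
}

func main() {
	f := queryFPRWithBins(32, 0.5, 10)
	fmt.Printf("%.4f\n", f(20, 10, 0.30)) // cached, on a bin boundary
	fmt.Printf("%.4f\n", f(20, 10, 0.31)) // cached, snapped to the 0.30 bin
	fmt.Printf("%.4f\n", f(64, 10, 0.30)) // n >= bufSize, computed directly
}
```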

Types

type GenomeInfo added in v0.9.0

type GenomeInfo struct {
	File    string
	Contigs int
	Size    int
}

type IndexQuery

type IndexQuery struct {
	// Kmers  []uint64
	Hashes  *[][]uint64 // related to database
	Hashes1 *[]uint64

	Ch chan *[]*Match // result channel
}

IndexQuery is a query sent to multiple indices of a database.

type Match

type Match struct {
	Target     []string // target name
	TargetIdx  []uint32
	GenomeSize []uint64
	NumKmers   int // matched k-mers
	FPR        float64

	QCov         float64 // |A∩B|/|A|, coverage of query. i.e., Containment Index
	TCov         float64 // |A∩B|/|B|, coverage of target
	JaccardIndex float64 // |A∩B|/|A∪B|, i.e., JaccardIndex
}

Match is the struct of matching detail.

type MatchResult added in v0.4.0

type MatchResult struct {
	Query   string
	QLen    int
	QKmers  int
	FPR     float64
	Hits    int
	Target  string
	FragIdx int
	IdxNum  int
	GSize   uint64
	K       int
	MKmers  int
	QCov    float64
}

type MatchResult2 added in v0.7.0

type MatchResult2 struct {
	Query string
	// QLen    int
	// QKmers  int
	FPR float64
	// Hits    int
	Target string
	// FragIdx int
	// IdxNum  int
	// GSize   uint64
	// K       int
	// MKmers  int
	QCov float64

	Line *string
}

type MatchResult3 added in v0.7.0

type MatchResult3 struct {
	Query string
	// QLen    int
	// QKmers  int
	FPR float64
	// Hits    int
	Target string
	// FragIdx int
	// IdxNum  int
	// GSize   uint64
	// K       int
	// MKmers  int
	QCov float64

	Ref   string
	Begin int
	End   int
}

type Matches

type Matches []*Match

Matches is a list of Match pointers, for sorting.

func (Matches) Len

func (ms Matches) Len() int

Len returns length of Matches.

func (Matches) Less

func (ms Matches) Less(i int, j int) bool

Less reports whether element i is less than element j.

func (Matches) Swap

func (ms Matches) Swap(i int, j int)

Swap swaps two elements.

type Meta

type Meta struct {
	SeqID      string `json:"id"`   // sequence ID
	FragIdx    uint32 `json:"idx"`  // sequence location index
	GenomeSize uint64 `json:"gn-s"` // genome length

	Ks []int `json:"ks"` // ks

	Syncmer  bool `json:"sm"` // syncmer
	SyncmerS int  `json:"sm-s"`

	Minimizer  bool `json:"mm"` // minimizer
	MinimizerW int  `json:"mm-w"`

	SplitSeq     bool `json:"sp"` // split sequence
	SplitSize    int  `json:"sp-s"`
	SplitNum     int  `json:"sp-n"`
	SplitOverlap int  `json:"sp-o"`
}

Meta contains some meta information
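The struct tags indicate that Meta is meant to be serialized as JSON. The sketch below round-trips a local copy of the exported fields through encoding/json; `metaSketch`, `encode` and `decode` are hypothetical names, and where the package actually stores this JSON is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metaSketch copies the exported fields and JSON tags of Meta shown above.
type metaSketch struct {
	SeqID        string `json:"id"`
	FragIdx      uint32 `json:"idx"`
	GenomeSize   uint64 `json:"gn-s"`
	Ks           []int  `json:"ks"`
	Syncmer      bool   `json:"sm"`
	SyncmerS     int    `json:"sm-s"`
	Minimizer    bool   `json:"mm"`
	MinimizerW   int    `json:"mm-w"`
	SplitSeq     bool   `json:"sp"`
	SplitSize    int    `json:"sp-s"`
	SplitNum     int    `json:"sp-n"`
	SplitOverlap int    `json:"sp-o"`
}

// encode serializes m using the short keys from the struct tags.
func encode(m metaSketch) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

// decode parses a JSON string produced by encode.
func decode(s string) (metaSketch, error) {
	var m metaSketch
	err := json.Unmarshal([]byte(s), &m)
	return m, err
}

func main() {
	s, err := encode(metaSketch{SeqID: "chr1", GenomeSize: 1000, Ks: []int{21, 31}, Minimizer: true, MinimizerW: 5})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	m, err := decode(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.SeqID, m.Ks, m.MinimizerW)
}
```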

func (Meta) String

func (m Meta) String() string

type Name2Idx

type Name2Idx struct {
	Name  string
	Index uint32
}

Name2Idx is a struct of name and index

type Options

type Options struct {
	NumCPUs int
	Verbose bool

	LogFile  string
	Log2File bool

	Compress         bool
	CompressionLevel int
}

Options contains the global flags

type ProfileNode added in v0.4.0

type ProfileNode struct {
	Taxid         uint32
	Rank          string
	TaxonName     string
	LineageNames  []string // complete lineage
	LineageTaxids []uint32
	Percentage    float64
}

type Query

type Query struct {
	Idx  uint64 // id for keeping output in order
	ID   []byte
	Seq  *seq.Seq
	Seq2 *seq.Seq

	Ch chan *QueryResult // result channel
}

Query stands for a query sequence.

type QueryResult

type QueryResult struct {
	QueryIdx uint64 // id for keeping output in order
	QueryID  []byte
	QueryLen int

	DBId int // id of database, for getting database name with few space

	FPR float64 // fpr, p is related to database

	K        int
	NumKmers int // number of k-mers

	Matches *[]*Match // all matches
}

QueryResult is the search result of a query sequence.

type SearchOptions

type SearchOptions struct {
	LoadWholeFile bool

	UseMMap bool
	Threads int
	Verbose bool

	DeduplicateThreshold int // deduplicate k-mers only when the number of k-mers > this threshold

	KeepUnmatched bool
	TopN          int
	TopNScores    int
	SortBy        string
	DoNotSort     bool

	MinQLen      int
	MinMatched   int
	MinQueryCov  float64
	MinTargetCov float64
	MaxFPR       float64

	FPRBufSize int

	LoadDefaultNameMap bool
	NameMap            map[string]string

	TrySingleEnd bool // when no target found for paired end reads, retry searching with Single Ends.
}

SearchOptions defines options for searching

type SearchResult added in v0.6.0

type SearchResult struct {
	QueryIdx   uint64 // sequence idx
	QuerySeqId string // sequence id, for double checking
	Matches    *[]*_Match
}

type SortByJacc added in v0.3.0

type SortByJacc struct{ Matches }

SortByJacc is used to sort matches by jaccard index.

func (SortByJacc) Less added in v0.3.0

func (ms SortByJacc) Less(i int, j int) bool

Less reports whether element i is less than element j.

type SortByQCov

type SortByQCov struct{ Matches }

SortByQCov is used to sort matches by qcov.

type SortByTCov

type SortByTCov struct{ Matches }

SortByTCov is used to sort matches by tcov.

func (SortByTCov) Less

func (ms SortByTCov) Less(i int, j int) bool

Less reports whether element i is less than element j.
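The SortBy* types follow a common Go pattern: embed a slice type that already implements Len and Swap, then override only Less. A minimal sketch with local stand-in types follows; the descending order used here is an assumption, not confirmed for the package.

```go
package main

import (
	"fmt"
	"sort"
)

// match and matches are small stand-ins for Match and Matches.
type match struct {
	Name       string
	QCov, TCov float64
}

type matches []*match

func (ms matches) Len() int           { return len(ms) }
func (ms matches) Swap(i, j int)      { ms[i], ms[j] = ms[j], ms[i] }
func (ms matches) Less(i, j int) bool { return ms[i].QCov > ms[j].QCov }

// sortByTCov embeds matches, inheriting Len and Swap, and overrides Less.
type sortByTCov struct{ matches }

func (ms sortByTCov) Less(i, j int) bool {
	return ms.matches[i].TCov > ms.matches[j].TCov
}

// sortByT sorts ms in place by target coverage, highest first.
func sortByT(ms matches) { sort.Sort(sortByTCov{ms}) }

func main() {
	ms := matches{
		{Name: "a", QCov: 0.9, TCov: 0.1},
		{Name: "b", QCov: 0.5, TCov: 0.8},
	}
	sort.Sort(ms) // by QCov
	fmt.Println(ms[0].Name)
	sortByT(ms) // by TCov
	fmt.Println(ms[0].Name)
}
```

Because the wrapper shares the underlying slice, sorting through it reorders the original Matches value.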

type Target added in v0.4.0

type Target struct {
	Name string

	GenomeSize uint64

	// Counting matches in all chunks
	// some reads match multiple sites in the same genome,
	// the count should be divided by number of sites.
	Match []float64

	// sum of read (query) length
	QLen []float64

	// unique match
	UniqMatch []float64

	// unique match with high confidence
	UniqMatchHic []float64

	SumMatch        float64 // depth
	SumUniqMatch    float64
	SumUniqMatchHic float64

	FragsProp   float64 // coverage
	Coverage    float64
	Qlens       float64
	RelDepth    []float64
	RelDepthStd float64

	//
	RefName string

	// Taxonomy information
	Taxid         uint32
	Rank          string
	TaxonName     string
	LineageNames  []string
	LineageTaxids []string

	CompleteLineageNames  []string
	CompleteLineageTaxids []uint32

	Percentage float64 // relative abundance

	Stats  *stats.Quantiler // for computing percentile of qcov of unique matches
	StatsA *stats.Quantiler // for computing percentile of qcov of all matches

	Score float64
}

func (*Target) AddTaxonomy added in v0.4.0

func (t *Target) AddTaxonomy(taxdb *taxdump.Taxonomy, showRanksMap map[string]interface{}, taxid uint32)

func (Target) String added in v0.4.0

func (t Target) String() string

type Targets added in v0.4.0

type Targets []*Target

func (Targets) Len added in v0.4.0

func (t Targets) Len() int

func (Targets) Less added in v0.4.0

func (t Targets) Less(i, j int) bool

func (Targets) Swap added in v0.4.0

func (t Targets) Swap(i, j int)

type Uint64Slice added in v0.4.0

type Uint64Slice []uint64

func (Uint64Slice) Len added in v0.4.0

func (s Uint64Slice) Len() int

func (Uint64Slice) Less added in v0.4.0

func (s Uint64Slice) Less(i, j int) bool

func (*Uint64Slice) Pop added in v0.6.0

func (s *Uint64Slice) Pop() interface{}

func (*Uint64Slice) Push added in v0.6.0

func (s *Uint64Slice) Push(x interface{})

func (Uint64Slice) Swap added in v0.4.0

func (s Uint64Slice) Swap(i, j int)
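Len, Less, Swap, Push and Pop together match the container/heap interface, so a type like this can serve as a priority queue. The sketch below uses a local copy named `uint64Heap` and assumes ascending order (a min-heap); the ordering used by Uint64Slice itself is not shown in this listing.

```go
package main

import (
	"container/heap"
	"fmt"
)

// uint64Heap is a local stand-in with the same method set as Uint64Slice.
type uint64Heap []uint64

func (s uint64Heap) Len() int           { return len(s) }
func (s uint64Heap) Less(i, j int) bool { return s[i] < s[j] }
func (s uint64Heap) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// Push appends x; container/heap restores the heap order afterwards.
func (s *uint64Heap) Push(x interface{}) { *s = append(*s, x.(uint64)) }

// Pop removes and returns the last element, as container/heap expects.
func (s *uint64Heap) Pop() interface{} {
	old := *s
	n := len(old)
	x := old[n-1]
	*s = old[:n-1]
	return x
}

// sortedViaHeap pushes all values onto a heap and pops them in order.
func sortedViaHeap(vals []uint64) []uint64 {
	h := &uint64Heap{}
	for _, v := range vals {
		heap.Push(h, v)
	}
	out := make([]uint64, 0, len(vals))
	for h.Len() > 0 {
		out = append(out, heap.Pop(h).(uint64))
	}
	return out
}

func main() {
	fmt.Println(sortedViaHeap([]uint64{5, 1, 3})) // prints [1 3 5]
}
```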

type UnikFileInfo

type UnikFileInfo struct {
	Path       string
	Name       string
	GenomeSize uint64
	Index      uint32
	Indexes    uint32
	Kmers      uint64
}

UnikFileInfo stores basic info of a .unik file.

func (UnikFileInfo) String

func (i UnikFileInfo) String() string

type UnikFileInfoGroup

type UnikFileInfoGroup struct {
	Infos []UnikFileInfo
	Kmers uint64
}

UnikFileInfoGroup represents a slice of UnikFileInfos

func (UnikFileInfoGroup) String

func (i UnikFileInfoGroup) String() string

type UnikFileInfoGroups

type UnikFileInfoGroups []UnikFileInfoGroup

UnikFileInfoGroups is just a slice of UnikFileInfoGroup

func (UnikFileInfoGroups) Len

func (l UnikFileInfoGroups) Len() int

func (UnikFileInfoGroups) Less

func (l UnikFileInfoGroups) Less(i int, j int) bool

func (UnikFileInfoGroups) Swap

func (l UnikFileInfoGroups) Swap(i int, j int)

type UnikFileInfos

type UnikFileInfos []UnikFileInfo

UnikFileInfos is a list of UnikFileInfo.

func (UnikFileInfos) Len

func (l UnikFileInfos) Len() int

func (UnikFileInfos) Less

func (l UnikFileInfos) Less(i int, j int) bool

func (UnikFileInfos) Swap

func (l UnikFileInfos) Swap(i int, j int)

type UnikFileInfosByName

type UnikFileInfosByName []UnikFileInfo

UnikFileInfosByName is used to sort infos by name and indices

func (UnikFileInfosByName) Len

func (l UnikFileInfosByName) Len() int

func (UnikFileInfosByName) Less

func (l UnikFileInfosByName) Less(i int, j int) bool

func (UnikFileInfosByName) Swap

func (l UnikFileInfosByName) Swap(i int, j int)

type UnikIndex

type UnikIndex struct {
	Options SearchOptions

	InCh chan *IndexQuery

	Path   string
	Header index.Header

	ExtraWorkers int // when #threads > 1.5 * #index files
	// contains filtered or unexported fields
}

UnikIndex defines a unik index struct.

func NewUnikIndex added in v0.8.1

func NewUnikIndex(file string, opt SearchOptions, fpr float64, nextraWorkers int, queryFPR func(n int, k int) float64) (*UnikIndex, error)

NewUnikIndex creates an index from a file.

func (*UnikIndex) Close

func (idx *UnikIndex) Close() error

Close closes the index.

func (*UnikIndex) String

func (idx *UnikIndex) String() string

type UnikIndexDB

type UnikIndexDB struct {
	Options SearchOptions

	DBId int // id for current database

	InCh chan *Query

	Info   UnikIndexDBInfo
	Header index.Header

	Indices []*UnikIndex

	ExtraWorkers int
	// contains filtered or unexported fields
}

UnikIndexDB is database for multiple .unik indices.

func NewUnikIndexDB

func NewUnikIndexDB(path string, opt SearchOptions, dbID int) (*UnikIndexDB, error)

NewUnikIndexDB opens and reads from a database directory.

func (*UnikIndexDB) Close

func (db *UnikIndexDB) Close() error

Close closes database.

func (*UnikIndexDB) CompatibleWith

func (db *UnikIndexDB) CompatibleWith(db2 *UnikIndexDB) bool

CompatibleWith applies loose restrictions, enabling searches across databases built with different parameters.

func (*UnikIndexDB) String

func (db *UnikIndexDB) String() string

type UnikIndexDBInfo

type UnikIndexDBInfo struct {
	Version      uint8  `yaml:"version"`
	IndexVersion uint8  `yaml:"unikiVersion"`
	Alias        string `yaml:"alias"`
	K            int    `yaml:"k"`
	Ks           []int  `yaml:"ks"`
	Hashed       bool   `yaml:"hashed"`
	Canonical    bool   `yaml:"canonical"`

	Scaled     bool   `yaml:"scaled"`
	Scale      uint32 `yaml:"scale"`
	Minimizer  bool   `yaml:"minimizer"`
	MinimizerW uint32 `yaml:"minimizer-w"`
	Syncmer    bool   `yaml:"syncmer"`
	SyncmerS   uint32 `yaml:"syncmer-s"`

	SplitSeq     bool `yaml:"split-seq"`
	SplitSize    int  `yaml:"split-size"`
	SplitNum     int  `yaml:"split-num"`
	SplitOverlap int  `yaml:"split-overlap"`

	CompactSize bool `yaml:"compact-size"`

	NumHashes int      `yaml:"hashes"`
	FPR       float64  `yaml:"fpr"`
	NumNames  int      `yaml:"numNameGroups"`
	BlockSize int      `yaml:"blocksize"`
	Kmers     uint64   `yaml:"totalKmers"`
	Files     []string `yaml:"files"`

	NameMapping  map[string]string `yaml:"name-mapping,omitempty"`
	MappingNames bool              `yaml:"mapping-names,omitempty"`
	// contains filtered or unexported fields
}

UnikIndexDBInfo is the meta data of a database.

func NewUnikIndexDBInfo

func NewUnikIndexDBInfo(files []string) UnikIndexDBInfo

NewUnikIndexDBInfo creates UnikIndexDBInfo from index files, but you have to manually assign other values.

func UnikIndexDBInfoFromFile

func UnikIndexDBInfoFromFile(file string) (UnikIndexDBInfo, error)

UnikIndexDBInfoFromFile creates UnikIndexDBInfo from files.

func (UnikIndexDBInfo) Check

func (i UnikIndexDBInfo) Check() error

Check checks whether all index files exist.

func (UnikIndexDBInfo) CompatibleWith

func (i UnikIndexDBInfo) CompatibleWith(j UnikIndexDBInfo) bool

CompatibleWith checks whether two databases have the same parameters.

func (UnikIndexDBInfo) String

func (i UnikIndexDBInfo) String() string

func (UnikIndexDBInfo) WriteTo

func (i UnikIndexDBInfo) WriteTo(file string) (int, error)

WriteTo dumps UnikIndexDBInfo to file.

type UnikIndexDBSearchEngine

type UnikIndexDBSearchEngine struct {
	Options SearchOptions

	DBs     []*UnikIndexDB
	DBNames []string

	InCh  chan *Query // queries
	OutCh chan *QueryResult
	// contains filtered or unexported fields
}

UnikIndexDBSearchEngine searches sequences against multiple databases.

func NewUnikIndexDBSearchEngine

func NewUnikIndexDBSearchEngine(opt SearchOptions, dbPaths ...string) (*UnikIndexDBSearchEngine, error)

NewUnikIndexDBSearchEngine returns a search engine based on multiple databases.

func (*UnikIndexDBSearchEngine) Close

func (sg *UnikIndexDBSearchEngine) Close() error

Close closes the search engine.

func (*UnikIndexDBSearchEngine) Wait

func (sg *UnikIndexDBSearchEngine) Wait()

Wait waits
