caddcode

package
v0.0.7 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 31, 2015 License: MIT Imports: 10 Imported by: 0

README

caddcode

CADD is a useful annotation for genomic variants. It provides a score for every base-change in the genome (totalling 8.6 billion bases). The download from the cadd website is 79GB and moderately difficult to use to annotate a VCF. It is free for academic use, please contact the CADD creators for commercial use.

caddcode encodes the CADD phred scores into an 12GB binary file with O(1) access (and provides a means to annotate a VCF). It does this by encoding the reference base using 2 bits (0: A, 1:C, 2:G, 3:T) and the CADD phred score of a change to each of the 3 alternate alleles using 10 bits each (2^10 == 1024) for a total of 32 bits per site.

We use a memory-mapped view of the binary file to provide very fast access. This takes advantage of the OS cache especially when querying variants in genome order.

Since the max phred score is 99 and we can store up to 1024 for each base, the loss of precision is bound to under 0.05 (because we multiply phred * 10.23 on input and divide on output). So, if the real phred-score is 12.21, the recovered phred-score is guaranteed to be between 12.16 and 12.26.

download

The index and the binary file are here:

The vcfanno conf file should point to the .idx file as in the example The conf file should be updated as:

[caddidx]
file="/path/to/cadd_v1.3a.idx"
names=["cadd_phred_score"]
ops=["concat"]

annotation

The user can define a reduction function (mean, max, concat, etc) to apply to the resulting values...

  • For single-nucleotide variants, we report the single cadd score.

  • For all other events, we report the maximum CADD score at each position over the length of the event. Users can then perform operations on the resulting list of values.

We are open to suggestions on how to better handle MNPs.

testing

We have tested this extensively. In generating 20 million random phred quartets (actually triplets), the maximum difference seen between the real an decoded(encoded(data)) was 0.0488758515995. This matches well with the theoretical maximum: 1 / (2 * 10.23) == 0.04887585532746823

There are 3 positions on chromosome 3 that use an ambiguous reference and so store 4 base changes. In the file, we handle these by storing a change to C, T, G. In the code, we handle this by storing the values in the actual code so that the guarantee of a precision to within 0.05 is maintained.

standalone use

It's possible to get the cadd scores for arbitrary positions in the genome independent from vcfanno.

From this directory:

$ go build main/cadd.go

Then use E.g:

IDX=$CADD_PATH/cadd_v1.3a.idx

$ ./cadd $IDX 1 10618 C
1.075268817204301 <nil>

$ ./cadd $IDX 22 100020618 T
2015/06/25 10:47:37 22 100020618 50944546
0 requested position out of range


#check:
$ tabix -s 1 -b 2 whole_genome_SNVs.tsv.gz 21:9411195-9411195
21	9411195	A	C	-0.006942	3.231
21	9411195	A	G	-0.005108	3.257
21	9411195	A	T	0.045036	3.997

$ ./cadd $IDX 21 9411195 C
3.225806451612903 <nil>
$ ./cadd $IDX 21 9411195 G
3.225806451612903 <nil>
$ ./cadd $IDX 21 9411195 T
4.007820136852395 <nil>
$ ./cadd $IDX 21 9411195 A
0 <nil>

Note that an error is return when a position is requested beyond the end of the chromosome. The <nil> indicates that the query was successful. A query for a change the the reference results in a score of 0 with no error.

This mode is very fast. When running:

./cadd test $IDX > /dev/null

We see something like: tested 5585322 sites (641787/second)

Documentation

Index

Constants

This section is empty.

Variables

View Source
var ErrorOutofRange = errors.New("requested position out of range")

Functions

This section is empty.

Types

type Index

type Index struct {
	Chroms  []string
	Lengths []int

	MapLengths map[string]int
	// contains filtered or unexported fields
}

func Reader

func Reader(f string) Index

func (Index) At

func (i Index) At(chrom string, pos int, alt string) (float64, error)

func (Index) Get

func (i Index) Get(chrom string, pos int) (uint32, error)

func (Index) Offset

func (i Index) Offset(chrom string) (int, error)

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL