indexsplit

package
v0.2.4
Published: Oct 27, 2020 License: MIT Imports: 9 Imported by: 0

README

indexsplit

indexsplit quickly generates evenly sized (by amount of data) regions across a cohort. It does this by reading the bam (or cram) index and using the file offsets as proxies for the amount of data. It sums the values in these bins across all samples. This gives a good estimate of the actual reads in a region without having to parse the bam file.

The result is a bed file with an additional column indicating the (scaled) size of data in each region across all samples. The numbers in that column will be fairly even except at the ends of chromosomes or for small chromosomes.

A common use of this will be to generate regions for parallelizing variant calling fairly, by splitting into N regions with approximately equal amounts of data across the cohort.
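As a rough illustration of consuming the output for parallel work, the sketch below parses BED text and turns each row into a 1-based inclusive region string of the form many tools accept. It assumes the first four tab-separated columns are chrom, start, end and the scaled size; any further columns are ignored. The names `bedRegion` and `parseBED` are hypothetical and not part of this package.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// bedRegion is a hypothetical holder for one row of regions.bed.
type bedRegion struct {
	Chrom      string
	Start, End int     // BED coordinates: 0-based start, exclusive end
	Size       float64 // assumed to be the fourth column (scaled data size)
}

// parseBED reads tab-separated BED text and keeps the first four columns.
func parseBED(text string) ([]bedRegion, error) {
	var out []bedRegion
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and comments
		}
		f := strings.Split(line, "\t")
		if len(f) < 4 {
			return nil, fmt.Errorf("expected at least 4 columns, got %d: %q", len(f), line)
		}
		start, err := strconv.Atoi(f[1])
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(f[2])
		if err != nil {
			return nil, err
		}
		size, err := strconv.ParseFloat(f[3], 64)
		if err != nil {
			return nil, err
		}
		out = append(out, bedRegion{Chrom: f[0], Start: start, End: end, Size: size})
	}
	return out, sc.Err()
}

// Region converts the 0-based half-open BED interval into a
// 1-based inclusive "chrom:start-end" string.
func (r bedRegion) Region() string {
	return fmt.Sprintf("%s:%d-%d", r.Chrom, r.Start+1, r.End)
}

func main() {
	// Made-up rows for illustration only.
	text := "chr1\t0\t16384\t1.5\nchr1\t16384\t49152\t1.4\n"
	regions, err := parseBED(text)
	if err != nil {
		panic(err)
	}
	for _, r := range regions {
		fmt.Println(r.Region(), r.Size)
	}
}
```

Each region string can then be handed to one worker or one job in a scheduler, so that every job sees a similar amount of data.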

On a modest laptop with an SSD, indexsplit can generate even-coverage regions in ~4 seconds for 45 bams. The time is independent of the number of regions.

When a single 16KB chunk has more data than the determined chunk size, indexsplit will output sub-regions of that chunk even though it doesn't know the exact placement of the data within it.
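The idea can be sketched as a greedy pass over per-window sizes. This is a simplified illustration, not the package's actual implementation: it assumes one chromosome, a slice of already-summed sizes per 16384-base index window, and a uniform spread of data inside any window that has to be cut. The names `region` and `splitTiles` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// tileSize is the width in bases of one window of the index.
const tileSize = 16384

// region is a hypothetical stand-in for one output row.
type region struct {
	Start, End int
	Sum        float64
}

// splitTiles groups consecutive windows until each group holds about
// target units of data. A window that alone exceeds target is cut into
// equal-width pieces, assuming its data is spread uniformly.
func splitTiles(sizes []float64, target float64) []region {
	var out []region
	cur := region{}
	for i, s := range sizes {
		tileStart := i * tileSize
		if s > target {
			// Flush whatever was accumulated before the heavy window.
			if cur.End > cur.Start {
				out = append(out, cur)
			}
			n := int(math.Ceil(s / target))
			if n > tileSize {
				n = tileSize // never make pieces narrower than one base
			}
			width := tileSize / n
			for k := 0; k < n; k++ {
				st := tileStart + k*width
				en := st + width
				if k == n-1 {
					en = tileStart + tileSize // last piece absorbs the remainder
				}
				out = append(out, region{st, en, s / float64(n)})
			}
			cur = region{Start: tileStart + tileSize, End: tileStart + tileSize}
			continue
		}
		cur.End = tileStart + tileSize
		cur.Sum += s
		if cur.Sum >= target {
			out = append(out, cur)
			cur = region{Start: cur.End, End: cur.End}
		}
	}
	if cur.End > cur.Start {
		out = append(out, cur) // trailing partial region at the chromosome end
	}
	return out
}

func main() {
	// Made-up sizes: the second window holds far more data than the target.
	for _, r := range splitTiles([]float64{1, 6, 1}, 2) {
		fmt.Printf("%d\t%d\t%.1f\n", r.Start, r.End, r.Sum)
	}
}
```

The trailing partial region also shows why sizes are less even at chromosome ends: whatever is left over is emitted as is.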

Usage

goleft indexsplit -N 5000 /path/to/*.bam > regions.bed

If you want to do this from CRAM files, send the .crai files and a fasta index via the --fai flag:

goleft indexsplit -N 8000 --fai reference.fa.fai /path/to/*.crai > regions.bed

The user is responsible for ensuring that the crai chromosome order matches the .fai order (this will be the case if the same fasta was used for alignment).
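One way to sanity-check the ordering is to compare the sequence names in the .fai with the names from the alignment header, however those are obtained (for example from its @SQ lines). The sketch below assumes the header names are already available as a slice; `faiNames` and `firstMismatch` are hypothetical helpers and not part of goleft. It relies only on the .fai layout, where the first tab-separated column of each line is the sequence name.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// faiNames returns the sequence names from .fai text in file order.
func faiNames(text string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "" {
			continue
		}
		names = append(names, strings.SplitN(line, "\t", 2)[0])
	}
	return names
}

// firstMismatch returns the first index at which the two name lists
// differ (including a length difference), or -1 if they are identical.
func firstMismatch(a, b []string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	// Made-up .fai content and header names for illustration only.
	fai := "chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61\n"
	header := []string{"chr1", "chr2"}
	if i := firstMismatch(faiNames(fai), header); i >= 0 {
		fmt.Println("order differs at index", i)
	} else {
		fmt.Println("order matches")
	}
}
```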

Documentation

Overview

Package indexsplit is used to quickly generate evenly sized (by amount of data) regions across a cohort. It does this by reading the bam (or cram) index and using the file offsets as proxies for the amount of data. It sums the values in these bins across all samples. This gives a good estimate of the actual reads in a region without having to parse the bam file.

A common use of this will be to generate regions for parallelizing variant calling fairly, by splitting into `N` regions with approximately equal amounts of data **across the cohort**.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Main

func Main()

Main is called from the goleft dispatcher.

func Split

func Split(paths []string, refs []*sam.Reference, N int, probs map[string]*interval.IntTree) chan Chunk

Split takes paths of bams or crais and generates `N` `Chunks`, which it sends on the returned channel.

Types

type Chunk

type Chunk struct {
	Chrom  string
	Start  int
	End    int
	Sum    float64 // amount of data in this Chunk
	Splits int     // number of splits
}

Chunk is a region of the genome created by `Split`.

func (Chunk) String

func (c Chunk) String() string
