indexsplit

package

Version: v0.2.4 (latest) | Published: Oct 27, 2020 | License: MIT | Imports: 9 | Imported by: 0

Repository

github.com/brentp/goleft

README ¶

indexsplit

indexsplit quickly generates evenly sized (by amount of data) regions across a cohort. It does this by reading the bam (or cram) index and using the file offsets as proxies for the amount of data. It sums the values in these bins across all samples. This gives a good estimate for actual reads in the region but without having to parse the bam file.
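The summing step described above can be sketched as follows. This is a minimal illustration, not the package's actual code: it assumes each sample has already been reduced to one number per 16KB tile, and the `sumTiles` helper and the sample values are hypothetical.

```go
package main

import "fmt"

// tileSize is the 16KB tile width assumed in this sketch.
const tileSize = 16384

// sumTiles adds per-sample tile sizes element-wise. Samples may cover
// different numbers of tiles; the result is as long as the longest sample.
func sumTiles(samples [][]float64) []float64 {
	var total []float64
	for _, s := range samples {
		for i, v := range s {
			if i >= len(total) {
				total = append(total, 0)
			}
			total[i] += v
		}
	}
	return total
}

func main() {
	// Hypothetical per-tile data sizes for three samples on one chromosome.
	samples := [][]float64{
		{10, 20, 30},
		{5, 5},
		{1, 2, 3, 4},
	}
	for i, v := range sumTiles(samples) {
		fmt.Printf("chr1\t%d\t%d\t%.0f\n", i*tileSize, (i+1)*tileSize, v)
	}
}
```

Because only these per-tile totals are needed, the alignment records themselves never have to be read.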

The result is a bed file with an additional column indicating the (scaled) size of the data in each region across all samples. The numbers in that column will be fairly even except at the ends of chromosomes or for small chromosomes.
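A downstream tool might read that output along these lines. This sketch assumes whitespace-separated lines with chromosome, start, end, and the size value in the fourth column; the `bedRegion` type, the `parseRegions` helper, and the sample lines are hypothetical and not taken from real indexsplit output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// bedRegion holds one output line: a BED interval plus the size column.
type bedRegion struct {
	chrom      string
	start, end int
	size       float64
}

// parseRegions reads BED-like text, skipping blank lines and treating the
// fourth column as the data-size value. Extra columns are ignored.
func parseRegions(text string) ([]bedRegion, error) {
	var out []bedRegion
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 0 {
			continue
		}
		if len(f) < 4 {
			return nil, fmt.Errorf("expected at least 4 columns, got %d", len(f))
		}
		start, err := strconv.Atoi(f[1])
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(f[2])
		if err != nil {
			return nil, err
		}
		size, err := strconv.ParseFloat(f[3], 64)
		if err != nil {
			return nil, err
		}
		out = append(out, bedRegion{f[0], start, end, size})
	}
	return out, sc.Err()
}

func main() {
	// Made-up lines for illustration only.
	text := "chr1\t0\t1200000\t98.7\nchr1\t1200000\t2650000\t101.2\n"
	regions, err := parseRegions(text)
	if err != nil {
		panic(err)
	}
	for _, r := range regions {
		// BED starts are 0-based, so add 1 for a 1-based region string.
		fmt.Printf("%s:%d-%d size=%.1f\n", r.chrom, r.start+1, r.end, r.size)
	}
}
```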

A common use is to generate regions for parallelizing variant calling fairly, by splitting into N regions with approximately equal amounts of data across the cohort.
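One simple way to cut a chromosome into regions of roughly equal data is a greedy pass over the summed tile sizes. The sketch below assumes that approach for illustration; it is not necessarily how indexsplit chooses its boundaries, and the `region` type and `splitEven` helper are hypothetical.

```go
package main

import "fmt"

// region is a half-open range of tile indexes and the data it holds.
type region struct {
	start, end int
	sum        float64
}

// splitEven groups consecutive tiles, closing a region as soon as it holds
// at least total/n of the data. Any leftover tiles form a final region.
func splitEven(tiles []float64, n int) []region {
	if n < 1 || len(tiles) == 0 {
		return nil
	}
	var total float64
	for _, v := range tiles {
		total += v
	}
	target := total / float64(n)

	var out []region
	cur := region{}
	for i, v := range tiles {
		cur.sum += v
		cur.end = i + 1
		if cur.sum >= target {
			out = append(out, cur)
			cur = region{start: i + 1, end: i + 1}
		}
	}
	if cur.end > cur.start {
		out = append(out, cur)
	}
	return out
}

func main() {
	// Hypothetical summed tile sizes with one dense tile in the middle.
	tiles := []float64{1, 1, 1, 9, 1, 1, 1, 1}
	for _, r := range splitEven(tiles, 2) {
		fmt.Printf("tiles [%d,%d) sum=%.0f\n", r.start, r.end, r.sum)
	}
}
```

With uneven input the regions cannot be perfectly balanced, which matches the note above that sizes are only approximately equal.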

On a modest laptop with an SSD, indexsplit can generate even-coverage regions in ~4 seconds for 45 bams. The time is independent of the number of regions.

When a single 16KB chunk has more data than the determined chunk size, indexsplit will output sub-regions of that chunk even though it doesn't know the exact placement of the data within it.
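Since the index gives no positions inside a chunk, one plausible way to produce sub-regions is to cut the chunk into equal-width pieces and spread its data evenly over them. The sketch below makes that uniformity assumption explicitly; the `piece` type and `subdivide` helper are hypothetical and may differ from what indexsplit actually does.

```go
package main

import (
	"fmt"
	"math"
)

// piece is one sub-region of an oversized chunk.
type piece struct {
	start, end int
	sum        float64
}

// subdivide cuts [start, end) into equal-width pieces so that each piece is
// credited with roughly target or less. The data is assumed to be spread
// uniformly, because its real placement within the chunk is unknown.
func subdivide(start, end int, sum, target float64) []piece {
	span := end - start
	if span <= 0 {
		return nil
	}
	k := 1
	if target > 0 && sum > target {
		k = int(math.Ceil(sum / target))
	}
	if k > span {
		k = span // never emit zero-width pieces
	}
	width := span / k
	out := make([]piece, 0, k)
	for i := 0; i < k; i++ {
		s := start + i*width
		e := s + width
		if i == k-1 {
			e = end // last piece absorbs the remainder
		}
		out = append(out, piece{s, e, sum / float64(k)})
	}
	return out
}

func main() {
	// A 16KB chunk holding three times the target amount of data.
	for _, p := range subdivide(0, 16384, 30, 10) {
		fmt.Printf("chr1\t%d\t%d\t%.1f\n", p.start, p.end, p.sum)
	}
}
```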

Usage

goleft indexsplit -N 5000 /path/to/*.bam > regions.bed

To do this from CRAM files, pass the .crai files and a fasta index via the --fai flag:

goleft indexsplit -N 8000 --fai reference.fa.fai /path/to/*.crai > regions.bed

The user is responsible for ensuring that the crai chromosome order matches the .fai order (this will be the case if the same fasta was used for alignment).

Documentation ¶

Overview ¶

Package indexsplit is used to quickly generate evenly sized (by amount of data) regions across a cohort. It does this by reading the bam (or cram) index and using the file offsets as proxies for the amount of data. It sums the values in these bins across all samples. This gives a good estimate for actual reads in the region but without having to parse the bam file.

A common use is to generate regions for parallelizing variant calling fairly, by splitting into `N` regions with approximately equal amounts of data **across the cohort**.

Index ¶

func Main()
func Split(paths []string, refs []*sam.Reference, N int, ...) chan Chunk
type Chunk
- func (c Chunk) String() string

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Main ¶

func Main()

Main is called from the goleft dispatcher.

func Split ¶

func Split(paths []string, refs []*sam.Reference, N int, probs map[string]*interval.IntTree) chan Chunk

Split takes paths of bams or crais and generates `N` `Chunk`s.

Types ¶

type Chunk ¶

type Chunk struct {
	Chrom  string
	Start  int
	End    int
	Sum    float64 // amount of data in this Chunk
	Splits int     // number of splits
}

Chunk is a region of the genome created by `Split`.
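Consuming a channel of `Chunk` values might look like the sketch below. To stay self-contained it does not import the real package: the `Chunk` type is a local copy of the struct shown above, `fakeSplit` is a hypothetical stand-in for `Split`, and the sketch assumes the producer closes the channel after the last value.

```go
package main

import "fmt"

// Chunk is a local copy of the documented struct so the sketch compiles alone.
type Chunk struct {
	Chrom  string
	Start  int
	End    int
	Sum    float64
	Splits int
}

// fakeSplit is a hypothetical stand-in for Split. It sends a few fixed
// chunks and then closes the channel (an assumption about the real producer).
func fakeSplit() chan Chunk {
	ch := make(chan Chunk)
	go func() {
		defer close(ch)
		ch <- Chunk{Chrom: "chr1", Start: 0, End: 1000000, Sum: 2.5}
		ch <- Chunk{Chrom: "chr1", Start: 1000000, End: 2400000, Sum: 2.5}
		ch <- Chunk{Chrom: "chr2", Start: 0, End: 900000, Sum: 1.0}
	}()
	return ch
}

// summarize drains the channel and returns the chunk count and total Sum.
func summarize(ch chan Chunk) (n int, total float64) {
	for c := range ch {
		n++
		total += c.Sum
	}
	return n, total
}

func main() {
	n, total := summarize(fakeSplit())
	fmt.Printf("%d chunks, total data %.1f\n", n, total)
}
```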

func (Chunk) String ¶

func (c Chunk) String() string

Source Files ¶

indexsplit.go