gsort

package module
v0.1.4 Latest
Published: May 21, 2020 License: MIT Imports: 18 Imported by: 0

README

gsort

Binaries Available Here

gsort is a tool to sort genomic files according to a genomefile.

For example, you may for some reason want to sort your VCF in the order X, Y, 2, 1, 3, ... while keeping the header at the top.

As a more likely example, you may want to sort your file to match GATK order (1 ... X, Y, MT) which is not possible with any other sorting tool. With gsort one can simply place MT as the last chrom in the .genome file.

Given a genome file (lines of chrom\tlength), you can sort a BED/VCF/GTF/... file in the order dictated by that file with:

gsort --memory 1500 my.vcf.gz crazy.genome | bgzip -c > my.crazy-order.vcf.gz

where memory use will be limited to 1500 megabytes.
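
To make the genome-file ordering concrete, here is a minimal Go sketch of how such a file could be turned into a chromosome ranking. The `chromOrder` helper and the sample lengths are illustrative assumptions; this is not gsort's API or its actual implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// chromOrder is a hypothetical helper (not part of gsort). It reads
// genome-file content (lines of chrom<TAB>length) and returns each
// chromosome's rank, i.e. its zero-based position in the file.
func chromOrder(genome string) (map[string]int, error) {
	order := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(genome))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // skip blank lines
		}
		fields := strings.Split(line, "\t")
		if len(fields) < 2 {
			return nil, fmt.Errorf("expected chrom<TAB>length, got %q", line)
		}
		if _, seen := order[fields[0]]; seen {
			return nil, fmt.Errorf("duplicate chromosome %q", fields[0])
		}
		rank := len(order) // rank is the number of chromosomes seen so far
		order[fields[0]] = rank
	}
	return order, sc.Err()
}

func main() {
	// Illustrative genome file with MT placed last, as described above.
	genome := "1\t249250621\n2\t243199373\nX\t155270560\nY\t59373566\nMT\t16569\n"
	order, err := chromOrder(genome)
	if err != nil {
		panic(err)
	}
	fmt.Println(order["1"], order["X"], order["MT"]) // prints: 0 2 4
}
```

A record's chromosome rank can then serve as the first sort key, followed by its position.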

We will use this to enforce chromosome ordering in ggd.

It will also be useful for getting your files ready for use in bedtools.

GFF parent

In GFF, the Parent attribute may refer to a row that would otherwise be sorted after it (based on the end position). However, some programs require that the row referenced in a Parent attribute be sorted first. If this is required, use the --parent flag introduced in version 0.0.6.

Performance

gsort can sort the 2 million variants in ESP in 15 seconds. It takes a few minutes to sort the ~10 million ExAC variants because of the huuuuge INFO strings in that file.

Usage

gsort will error if your genome file has 'chr' prefix and your file does not (or vice-versa).
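
A check of this kind could look like the following sketch. `prefixMismatch` is a hypothetical helper written for illustration only; how gsort itself detects the mismatch is not described here and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixMismatch is a hypothetical helper (not gsort's actual check). It
// reports true when chrom is absent from the genome set but would be present
// if its "chr" prefix were added or removed.
func prefixMismatch(genome map[string]bool, chrom string) bool {
	if genome[chrom] {
		return false // exact match, nothing to report
	}
	if strings.HasPrefix(chrom, "chr") {
		return genome[strings.TrimPrefix(chrom, "chr")]
	}
	return genome["chr"+chrom]
}

func main() {
	genome := map[string]bool{"chr1": true, "chrX": true}
	fmt.Println(prefixMismatch(genome, "1"))    // data file lacks the prefix
	fmt.Println(prefixMismatch(genome, "chr1")) // names agree
}
```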

It will write temporary files to your $TMPDIR (usually /tmp/) as needed to avoid using too much memory.

TODO

  • Specify a VCF for the genome file and pull order from the @SQ tags
  • Avoid temp file when everything can fit in memory. (more universally, last chunk can always be kept in memory).

API Documentation

-- import "github.com/brentp/gsort"

Package gsort is a library for sorting a stream of tab-delimited lines ([]bytes) (from a reader) using the amount of memory requested.

Instead of using a compare function as most sorts do, this accepts a user-defined function with signature `func(line []byte) []int`, where the []ints are used to determine ordering. For example, if we were sorting on 2 columns, one of months and another of day of the month, the function would replace "Jan" with 1 and "Feb" with 2 for the first column and just return the Atoi of the 2nd column.
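
The month/day example can be sketched as a function with that signature. The name `monthDayKey`, the `monthRank` table, and the choice to map unknown or malformed values to 0 are assumptions made for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// monthRank maps month abbreviations to their calendar position.
var monthRank = map[string]int{
	"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
	"Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

// monthDayKey has the shape func(line []byte) []int. It expects a
// tab-delimited line whose first column is a month abbreviation and whose
// second column is the day of the month.
func monthDayKey(line []byte) []int {
	cols := bytes.SplitN(line, []byte("\t"), 3)
	if len(cols) < 2 {
		return []int{0, 0} // assumption: malformed lines get the lowest key
	}
	day, err := strconv.Atoi(string(bytes.TrimSpace(cols[1])))
	if err != nil {
		day = 0 // assumption: an unparsable day is treated as 0
	}
	// An unknown month yields 0 from the map lookup.
	return []int{monthRank[string(cols[0])], day}
}

func main() {
	fmt.Println(monthDayKey([]byte("Feb\t14\tvalentine"))) // prints: [2 14]
	fmt.Println(monthDayKey([]byte("Jan\t3\tnote")))       // prints: [1 3]
}
```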

func Sort
func Sort(rdr io.Reader, wtr io.Writer, preprocess Processor, memMB int, chromosomeMappings map[string]string) error

Sort accepts a tab-delimited io.Reader and writes to wtr, using preprocess to determine ordering.

type Processor
type Processor func(line []byte) []int

Processor is a function that takes a line and returns a slice of ints that determine ordering.

Documentation

Overview

Header lines are assumed to start with '#'. To indicate other lines that are header lines, the user function to Sort() can return `[]int{gsort.HEADER_LINE}`.

Constants

const HEADER_LINE = math.MinInt32

HEADER_LINE indicates that a line is a header line even if it does not have the '#' prefix.

Functions

func Sort

func Sort(rdr io.Reader, wtr io.Writer, preprocess Processor, memMB int, chromosomeMappings map[string]string) error

Sort accepts a tab-delimited io.Reader and writes to wtr, using preprocess to determine ordering.
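
To show how integer keys can drive the ordering, here is a small in-memory stand-in. It is not gsort.Sort: it ignores the memory limit, temporary files, header handling, and the chromosomeMappings argument, and it assumes keys are compared element by element, which the documentation implies but does not spell out. `lessKeys`, `sortByKey`, and `twoIntKey` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// lessKeys compares two keys element by element; if one key is a prefix of
// the other, the shorter key sorts first.
func lessKeys(a, b []int) bool {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return len(a) < len(b)
}

// sortByKey computes each line's key once, then sorts the lines stably by key.
func sortByKey(lines []string, key func(line []byte) []int) []string {
	type keyed struct {
		line string
		key  []int
	}
	items := make([]keyed, len(lines))
	for i, l := range lines {
		items[i] = keyed{line: l, key: key([]byte(l))}
	}
	sort.SliceStable(items, func(i, j int) bool {
		return lessKeys(items[i].key, items[j].key)
	})
	out := make([]string, len(items))
	for i, it := range items {
		out[i] = it.line
	}
	return out
}

// twoIntKey parses the first two tab-delimited columns as ints (0 on error).
func twoIntKey(line []byte) []int {
	cols := strings.SplitN(string(line), "\t", 3)
	key := make([]int, 0, 2)
	for i := 0; i < len(cols) && i < 2; i++ {
		n, err := strconv.Atoi(strings.TrimSpace(cols[i]))
		if err != nil {
			n = 0
		}
		key = append(key, n)
	}
	return key
}

func main() {
	lines := []string{"2\t5", "1\t9", "1\t3"}
	for _, l := range sortByKey(lines, twoIntKey) {
		fmt.Println(l)
	}
}
```

Computing the key once per line, rather than inside the comparison, keeps the parsing cost linear in the number of lines.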

Types

type Processor

type Processor func(line []byte) []int

Processor is a function that takes a line and returns a slice of ints that determine ordering.

Directories

Path Synopsis
cmd
