fusion

package

v0.0.0-...-d966d87 Latest Latest Go to latest Published: Aug 18, 2020 License: Apache-2.0 Imports: 24 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/grailbio/bio

Links

Open Source Insights

README ¶

AF4

AF4 is an alignment free fusion detector that can work with RNA-seq data, cfRNA data and cfDNA data. It consists of two stages. In stage one, it identifies candidate fusion fragments and the corresponding fusion pairs. In the second stage, a list of filtering criteria are applied to narrow down the fusion candidates.

Method: AF4 decomposes each fragment into kmers and identifies for each kmer the corresponding genes. Then it infers the max-cover of the fragment that is best explained by a single gene or gene-pair. This effectively removes most of the fragments that are not supporting any fusions while kept real fusion fragments. These fragments are then filtered in the second stage using criteria like the minimum supporting fragments per fusion event, genomic distance of fusion genes etc.

Contact: saito@grail.com, xyang@grail.com

Installation

Currently, AF4 only supports UNIX platforms with ivy bridge CPU or newer. Our reference machines are Ubuntu16.04 with 256GiB memory and 56 to 64 CPUs.

If you have go environment set up, please use the following command directly
```
  go install github.com/grailbio/bio/cmd/bio-fusion
```
The binary bio-fusion should be ready to use, for usage, type
```
  bio-fusion -h
```

Setup go then install

Assuming your current working folder is $HOME/go_install
Obtain pre-compiled go package (this works for linux only, please check sha256 to make sure the .tar.gz file is properly downloaded)
```
  wget https://dl.google.com/go/go1.11.5.linux-amd64.tar.gz
  tar -C $HOME/go_install -xvzf go1.11.5.linux-amd64.tar.gz
```
The go binary path should be at: $HOME/go_install/go/bin/go

Set up go env

  export GOROOT=${HOME}/go_install/go
  export PATH="${HOME}/go_install/go/bin:${PATH}"

Set up go work space & install fusion package

  # create workspace assuming you want to use $HOME/workdir folder
  mkdir -p $HOME/workdir
  cd $HOME/workdir
  go mod init playground/
  go install github.com/grailbio/bio/cmd/bio-fusion

The binary bio-fusion should be ready to use, for usage, type
```
  bio-fusion -h
```

Running

bio-fusion -r1=a0.fastq.gz,b0.fastq.gz,... -r2=a1.fastq.gz,b1.fastq.gz,... -transcript=transcipt.fa [-cosmic-fusion=fusion.txt] [-fasta-output=all.fa] [-filtered-output=filtered.fa]

Flags r1=... and r2=... specify comma-separated lists of FASTQ files. In the above example, files a0.fastq.gz and a1.fastq.gz should contain paired reads from R1 and R2, respectively. The two lists must contain the same number of path names, and file-pair in the lists must contain exactly the same number of reads. If a path ends with .gz or .bz2, they will be decompressed by gzip and bz2, respectively.
Flag -transcript specifies the transcriptome. The next section describes the format of this file in more detail.
Flag -cosmic-fusion is used only in target mode. It provides curated fusion gene pairs. This file should be in COSMIC TSV format. The first line is a header, which is ignored by bio-fusion. The rest of file should contain at least one TSV column, in form "gene1/gene2", per line. The rest of the columns are ignored by AF4. For example:

Genes	Samples	Mutations	Papers
AAGAB/FOSL2	10	1	1
ABTB1/SDC2	10	1	1
ACAD10/SELP	10	1	1
ACBD6/RRP15	2	2	1

Flag -fasta-output specifies the path name of the 1st-stage output. The default value is all.fa. The output file format is described in a later section.
Flag -filtered-output specifies the path name of the 2nd-stage output. It is a subset of the fasta-output. The default value is filtered.fa.
Passing -h will show more minor flags supported by bio-fusion.

PCR duplicates, UMIs

By default, AF4 deals with 6bp dual UMIs and collapse sequences when their UMIs are within Hamming distance of two. However, when UMIs are absent, PCR duplicates are identified based on sequence similarity, where sequneces are collapsed if they are highly similar.

Reference transcriptome

The transcriptome is a FASTA file. Each key should be of form >transcriptname|genename|chr:start-end:index|len0,len1,...,lenN

The sequence should list all the exons without any separator. E.g.:

>ENST00000420190.6|SAMD11|chr1:923928-944581:49|1021,92,182,51,125,90,17
CGGAGTCTCCCAAGTCCCCGCCGGGCGGGCGCGCGCCAGTGGACGCGGGTGCACGACTGACGCGGCCCGGGCGGCGGGGCGGGGGC ...

start, end are 1-based positions in the chromosome, both closed. index is the rank of the transcript's start position within the chromosome. That is, the transcript with the smallest start position for a particular chromosome has index of 0, the 2nd smallest start position will have index of 1, so on. len0, len1, ... are length of each exon, but the exon lengths are currently unused by AF4.

Generating reference transcriptome from Gencode annotations

The AF4 transcriptome file can be generated from Gencode annotation GTF files and a reference human genome, usually hg38. The reference files can be found in the following locations:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # you must uncompress the file before feeding it to bio-fusion

To generate gencode.v26.250padded_separate_jns_transcripts_parsed_no_mt_no_overlap_no_pary_no_versioned.fa, used for cfRNA datasets, run the following:

bio-fusion -generate-transcriptome -exon-padding 250 -retained-exon-bases 18  -separate-junctions -output gencode.v26.250padded_separate_jns_transcripts_parsed_no_mt_no_overlap_no_pary_no_versioned.fa /path/to/gencode.v26.annotation.gtf /path/to/hg38.fa

For non-padded transcrptome used for other datasets, run the folowing:

bio-fusion -generate-transcriptome -output gencode.v26.whole_genes.fa -keep-mitochondrial-genes -keep-readthrough-transcripts -keep-pary-locus-transcripts -keep-versioned-genes -/path/to/gencode.v26.annotation.gtf /path/to/hg38.fa

Pre-generated transcriptome files used in our benchmark are found in s3://grail-publications/2019-ISMB/references.

Output format

AF4 produces two outputs, all.fa and filtered.fa. File all.fa is the output after the first stage of the program. filtered.fa is the final output that went through the built-in filters in AF4.

Each sequence in the output is represented as a FASTA format. The header consists of |-delimited fields:

Read name: the R1 name is reported here.
Comma-separated fusion gene pairs. Each genepair is in form Ga0/Ga1. Most often, this column contains just one gene pair, Ga0/Ga1, which means that the left half of the fragment matches Ga0, the right half of the fragment matches Ga1. Only when one fragment matches multiple genepairs equially well, more than one pairs are shown in this column.
Comma-separated genepair positions. Each position-pair is in form chr0:start0-end0:index0/chr1:start1-end1:index1. The positions is copied from the transcriptome. See the previous section for the meaning of start[01], end[01], and index[01]. The number of positionpairs in this column always matches the number of genepairs in the previous column.
Comma-separated numbers of bases supporting G1/G2. The number of entries in this column always matches the number of genepairs.
Range of bases on stitched fragment supporting G1/G2

The sequence is made up of the stitched readpair. When the readpair cannot be stitched (e.g., because they don't overlap), then the line contains the R1 sequence, followed by |, followed dy the reverse complement of the R2 sequence.

For example,

>E00481:58:H53VWALXX:1:1101:28574:38754:AAATCC+CTATAC 1:N:0:CTGAAGCT+ACGTCCTG|BORCS8/MEF2B|chr19:19176903-19192591:840/chr19:19145568-19170289:839|200/27|1:257/258:284

E00481:58:H53VWALXX:1:1101:28574:38754:AAATCC+CTATAC 1:N:0:CTGAAGCT+ACGTCCTG: read name
BORCS8/MEF2B: fusion gene pairs
chr19:19176903-19192591:840: BORCS8 is on chr19, position 19176903-19192591, and 840th gene from 5' region
200/27: 200 and 27 bases supporting BORCS8 and MEF2B, respectively, on the stitched fragment
1:257/258:284 : 1:257bp supports BORCS8 and 258:284bp supports MEF2B.

Running the benchmarks

References and FASTQ files used in the ISMB paper are in s3://grail-publications/2019-ISMB. Directory [benchmark] also contains helper scripts to run the benchmarks.

Simulated dataset

Simulated fusion FASTQ files are in s3://grail-publications/2019-ISMB/simulated_benchmark. First download the reference files and the FASTQ files in a local directory, then run:

bio-fusion -r1=path_to/r1.fastq.gz -r2=path_to/r2.fastq.gz -transcript=path_to/gencode.v26.250padded_separate_jns_transcripts_parsed_no_mt_no_overlap_no_pary_no_versioned.fa -umi-in-name -max-genes-per-kmer=2 -max-proximity-distance=1000 -max-proximity-genes=0 [-cosmic_fusion=path_to/all_pair_art_lod_gpair_merged.txt]

The results will be created in ./all.fa and ./filtered.fa.

CfRNA dataset

CfRNA files are in s3://grail-publications/2019-ISMB/rna_benchmark. First download the reference files and the FASTQ files in a local directory, then run:

bio-fusion -r1=path_to/r1.fastq.gz,... -r2=path_to/r2.fastq.gz,... -transcript=path_to/gencode.v26.250padded_separate_jns_transcripts_parsed_no_mt_no_overlap_no_pary_no_versioned.fa -umi-in-name -max-genes-per-kmer=2 -max-proximity-distance=1000 -max-proximity-genes=0 [-cosmic_fusion=path_to/all_pair_art_lod_gpair_merged.txt]

The results will be created in ./all.fa and ./filtered.fa.

CfDNA dataset

CfDNA files are in s3://grail-publications/2019-ISMB/titration_benchmark. First download the reference files and the FASTQ files in a local directory, then run:

bio-fusion -r1=path_to/r1.fastq.gz,... -r2=path_to/r2.fastq.gz,... -transcript=path_to/gencode.v26.whole_genes.fa -umi-in-name -max-genes-per-kmer=2 -max-proximity-distance=1000 -max-proximity-genes=0 [-cosmic_fusion=path_to/all_pair_art_lod_gpair_merged.txt]

The results will be created in ./all.fa and ./filtered.fa.

Documentation ¶

Index ¶

Constants
Variables
func CloseProximity(geneDB *GeneDB, fi FusionInfo, maxProximityDistance, maxProximityGenes int) bool
func DiscardAbundantPartners(candidatesPtr *[]Candidate, maxGenePartners int)
func FilterByMinSpan(hasUMI bool, minSpan int, candidatesPtr *[]Candidate, minReadSupport int)
func FilterDuplicates(candidatesPtr *[]Candidate, hasUMI bool)
func IsLowComplexity(seq string, lowComplexityFrac float64) bool
func LinkedByLowComplexSubstring(frag Fragment, fi FusionInfo, lowComplexityFraction float64) bool
func MaybeRemoveUMI(name, r1Seq, r2Seq string, opts Opts) (string, string, string)
func ParseTranscriptomeKey(seqName string) (ensemblID, gene, chrom string, start, end, index int, err error)
func RemoveLowComplexityReads(r1Seq, r2Seq string, stats *Stats, opts Opts) (newR1Seq, newR2Seq string)
func SortGenePair(geneDB *GeneDB, g1, g2 GeneID, order GenePairOrder) (GeneID, GeneID)
type Candidate
type CrossReadPosRange
- func (r CrossReadPosRange) Equal(other CrossReadPosRange) bool
type Fragment
- func (r *Fragment) HammingDistance(other Fragment) int
- func (r *Fragment) SubSeq(p CrossReadPosRange) string
- func (f *Fragment) UMI() string
type FusionInfo
- func DetectFusion(geneDB *GeneDB, frag Fragment, stats *Stats, opts Opts) []FusionInfo
- func (fi FusionInfo) Name(geneDB *GeneDB, opts Opts) string
type GeneDB
- func NewGeneDB(opts Opts) *GeneDB
- func (m *GeneDB) GeneIDRange() (GeneID, GeneID)
- func (m *GeneDB) GeneInfo(id GeneID) *GeneInfo
- func (m *GeneDB) GeneInfoByName(name string) *GeneInfo
- func (m *GeneDB) IsFusionPair(g1, g2 GeneID) bool
- func (m *GeneDB) PrepopulateGeneInfo(genes []GeneInfo)
- func (m *GeneDB) PrepopulateGenes(names []string)
- func (m *GeneDB) ReadFusionEvents(ctx context.Context, path string)
- func (m *GeneDB) ReadTranscriptome(ctx context.Context, fastaPath string, filter bool)
type GeneID
type GeneInfo
type GenePairOrder
type Kmer
type Opts
type Pos
- func (pos Pos) R2Off() int
- func (pos Pos) ReadType() ReadType
type PosRange
- func (r PosRange) Equal(other PosRange) bool
type ReadType
type Stats
- func (s Stats) Merge(o Stats) Stats
type Stitcher
- func NewStitcher(kmerLength int, lowComplexityFraction float64) *Stitcher
- func (s *Stitcher) FreeFragment(f Fragment)
- func (s *Stitcher) Stitch(name, r1Seq, r2Seq string, stats *Stats) Fragment

Constants ¶

View Source

const ReproduceBug = true

ReproduceBug introduces extra logic to reproduce suspicious behaviours of //bio/rna/fusion/ code.

Variables ¶

View Source

var DefaultOpts = Opts{
	UMIInRead:                    false,
	UMIInName:                    false,
	KmerLength:                   19,
	MaxGap:                       9,
	MaxHomology:                  15,
	MinSpan:                      25,
	LowComplexityFraction:        0.9,
	MaxGenesPerKmer:              5,
	MaxGeneCandidatesPerFragment: 5,
	MaxProximityDistance:         100000,
	MaxProximityGenes:            5,
	MaxGenePartners:              5,
	MinReadSupport:               2,
}

DefaultOpts sets the default values to Opts.

Functions ¶

func CloseProximity ¶

func CloseProximity(geneDB *GeneDB, fi FusionInfo,
	maxProximityDistance, maxProximityGenes int) bool

Return true if two genes in a candidate are deemed to be within close proximity of each other (Either they are within prox_dist bases of each other, or they are within prox_num genes of each other, on the same chromosome)

func DiscardAbundantPartners ¶

func DiscardAbundantPartners(
	candidatesPtr *[]Candidate,
	maxGenePartners int)

Discard calls where one of the partners is involved in numerous events

func FilterByMinSpan ¶

func FilterByMinSpan(hasUMI bool, minSpan int, candidatesPtr *[]Candidate, minReadSupport int)

FilterByMinSpan filters candidates that aren't covered minSpan bases either G1 or G2. It also performs UMI collapsing and returns all valid fragment indices. More specifically, this function scans through all candidate gene pairs, and considers it as a valid pair if there are at least a prescribed minimum number of unique fragment supporting this fusion.

func FilterDuplicates ¶

func FilterDuplicates(candidatesPtr *[]Candidate, hasUMI bool)

Filter duplicate reads that call the same event

func IsLowComplexity ¶

func IsLowComplexity(seq string, lowComplexityFrac float64) bool

IsLowComplexity returns true if input DNA is low complexity sequence, i.e. any two bases present at over lowComplexityFrac of the total sequence length.

func LinkedByLowComplexSubstring ¶

func LinkedByLowComplexSubstring(frag Fragment, fi FusionInfo, lowComplexityFraction float64) bool

Return true if the substring used to infer candidate gene pair is of low complexity.

func MaybeRemoveUMI ¶

func MaybeRemoveUMI(name, r1Seq, r2Seq string, opts Opts) (string, string, string)

MaybeRemoveUMI removes an UMI from the sequences and add add it to the name part, if the options prescribe such operations. It returns <new name, new r1 seq, new r2seq>.

func ParseTranscriptomeKey ¶

func ParseTranscriptomeKey(seqName string) (ensemblID, gene, chrom string, start, end, index int, err error)

ParseTranscriptomeKey parses a transcriptome fasta key.

Transcriptome key example: "ENST00000279783.3|OR8K1|chr11:56346039-56346998:1051|960"

func RemoveLowComplexityReads ¶

func RemoveLowComplexityReads(r1Seq, r2Seq string, stats *Stats, opts Opts) (newR1Seq, newR2Seq string)

RemoveLowComplexityReads check if r1Seq or r2Seq is low complexity, i.e., if the two most frequent nucleotide types dominate the sequence. If so, it converts them to an empty string.

func SortGenePair ¶

func SortGenePair(geneDB *GeneDB, g1, g2 GeneID, order GenePairOrder) (GeneID, GeneID)

Types ¶

type Candidate ¶

type Candidate struct {
	Frag    Fragment
	Fusions []FusionInfo
}

Candidate is a combination of a fragment and possible fusions detected for the fragment.

type CrossReadPosRange ¶

type CrossReadPosRange struct{ Start, End Pos }

CrossReadPosRange is the same as PosRange, except the range may cross a R1/R2 boundary.

func (CrossReadPosRange) Equal ¶

func (r CrossReadPosRange) Equal(other CrossReadPosRange) bool

Equal checks if the two ranges are identical.

type Fragment ¶

type Fragment struct {
	// Fragment name. It's a copy of the R1 name from the fastq.
	//
	// Example: E00469:245:HHK5TCCXY:1:1101:19634:35080:GTATCT+AGCAAT
	Name string
	// R1Seq is the sequence from the R1 fastq.  When the R1 and R2 sequences are
	// found to have overlapping regions, they are stitched together, and R1Seq
	// will store the combined sequence and R2Seq will be empty.
	R1Seq string
	// R2Seq is the sequence from the R2 fastq. It is nonempty only when the
	// stitcher fails to stitch the R1 and R2 sequences.
	R2Seq string
	// contains filtered or unexported fields
}

Fragment is a union of two (unpaired) reads (R1 & R2). Created by the Stitcher.

func (*Fragment) HammingDistance ¶

func (r *Fragment) HammingDistance(other Fragment) int

HammingDistance computes the hamming distance of sequences. If the sequences aren't of the same length, it returns a infiniteHammingDistance.

func (*Fragment) SubSeq ¶

func (r *Fragment) SubSeq(p CrossReadPosRange) string

SubSeq extracts part of the RNA sequence. The arg may cross the R1/R2 boundary, in which case this function returns a suffix of R1 plus a prefix of R2.

func (*Fragment) UMI ¶

func (f *Fragment) UMI() string

UMI extracts the UMI component from the Name field. Returns false if the name doesn't contain an UMI, or on any other error.

Caution: This methods is very slow. Don't use it in a perf-critical path.

type FusionInfo ¶

type FusionInfo struct {
	// G1ID is the ID of the gene 1. The gene info can be looked up by calling GeneDB.GeneInfo().
	G1ID GeneID
	// G2ID is the ID of the gene 2. The gene info can be looked up by calling GeneDB.GeneInfo().
	G2ID GeneID
	// G1Span is the total length of the gene1 that intersects with the fragment.
	G1Span int
	// G2Span is the total length of the gene2 that intersects with the fragment.
	G2Span int
	// JointSpan is the total length covered by either G1 or G2.
	//
	// REQUIRES: JointSpan >= max(G1Span, G2Span)
	JointSpan int
	// FusionOrder=true iff g1 is aligned before g2.
	FusionOrder bool
	// G1Range is the [min,max) range of the fragment covered by G1.
	G1Range CrossReadPosRange
	// G2Range is the [min,max) range of the fragment covered by G2.
	G2Range CrossReadPosRange
}

FusionInfo represents a fusion event between two genes.

func DetectFusion ¶

func DetectFusion(geneDB *GeneDB, frag Fragment, stats *Stats, opts Opts) []FusionInfo

DetectFusion is the toplevel entry point. It determines whether the given fragment is a fusion of two genes. It returns the list of candidate fusion events. If no event is found, it returns an empty slice.

func (FusionInfo) Name ¶

func (fi FusionInfo) Name(geneDB *GeneDB, opts Opts) string

Name returns a human-readable string that encodes key attributes of this fusion.

type GeneDB ¶

type GeneDB struct {
	// contains filtered or unexported fields
}

GeneDB is a singleton object that stores transcriptomes, kmers generated from the transcripts, and candidate fusion event pairs. Thread compatible.

func NewGeneDB ¶

func NewGeneDB(opts Opts) *GeneDB

NewGeneDB creates an empty GeneDB.

func (*GeneDB) GeneIDRange ¶

func (m *GeneDB) GeneIDRange() (GeneID, GeneID)

GeneIDRange returns the range of gene IDs registered in this object. The low end is closed, high end is open. For example, the return value of (1, 95) means this DB holds 94 genes, IDs from 1 to 94. You can use GeneInfo() to get the information about the gene.

func (*GeneDB) GeneInfo ¶

func (m *GeneDB) GeneInfo(id GeneID) *GeneInfo

GeneInfo gets the GeneInfo given an ID. It always returns a non-nil info.

REQUIRES: ID is valid.

func (*GeneDB) GeneInfoByName ¶

func (m *GeneDB) GeneInfoByName(name string) *GeneInfo

GeneInfoByName gets GeneINfo given a gene name. It returns nil if the gene is not registered.

func (*GeneDB) IsFusionPair ¶

func (m *GeneDB) IsFusionPair(g1, g2 GeneID) bool

IsFusionPair checks if the given pair of genes appear in the cosmic TSV file added by ReadFusionEvents.

REQUIRES: IDs are valid.

func (*GeneDB) PrepopulateGeneInfo ¶

func (m *GeneDB) PrepopulateGeneInfo(genes []GeneInfo)

PrepopulateGeneInfo x fills geneinfo in batch. NOT FOR GENERAL USE. It is used to populate gene DB from recordio dump.

func (*GeneDB) PrepopulateGenes ¶

func (m *GeneDB) PrepopulateGenes(names []string)

PrepopulateGenes x assigns gene IDs to genes. NOT FOR GENERAL USE. It is used only to change the genename <-> geneID assignments to reproduce the behavior of the C++ code.

func (*GeneDB) ReadFusionEvents ¶

func (m *GeneDB) ReadFusionEvents(ctx context.Context, path string)

ReadFusionEvents reads from a Cosmic TSV file the names of gene pairs that form fusions. The first column of each line must be of form "gene1/gene2", for example "ACSL3/ETV1".

func (*GeneDB) ReadTranscriptome ¶

func (m *GeneDB) ReadTranscriptome(ctx context.Context, fastaPath string, filter bool)

ReadTranscriptome reads a transcriptome reference fasta file.

type GeneID ¶

type GeneID int32

GeneID is a dense sequence number (1, 2, 3, ...) assigned to a gene (e.g., "MAPK10"). IDs are valid only within one process invocation.

Caution: this type must be signed. kmer_index uses negative geneids to indicate outlined slices.

type GeneInfo ¶

type GeneInfo struct {
	// ID is a dense sequence number (1, 2, ...). It is valid only during the current run.
	ID GeneID
	// EnsemblID is parsed from the transcriptome FASTA key. E.g., "ENST00000279783.3"
	EnsemblID string
	// Gene is parsed from the transcriptome FASTA key. E.g., "OR8K1"
	Gene string
	// Chrom is parsed from the transcriptome FASTA key. E.g., "chr11"
	Chrom string
	// Start is parsed from the transcriptome FASTA key. E.g., 56346039
	Start int
	// End is parsed from the transcriptome FASTA key. E.g., 56346998
	End int
	// Index is the rank of this gene in Chrom. Gene with the smallest Start in
	// the given Chrom will have Index of zero.
	Index int
	// FusionEvent is true if this gene appears in the cosmic TSV file added via
	// ReadFusionEvents.
	FusionEvent bool
}

GeneInfo stores the info about a gene. It is parsed out from the transcriptome fasta key.

Transcriptome key example: "ENST00000279783.3|OR8K1|chr11:56346039-56346998:1051|960"

type GenePairOrder ¶

type GenePairOrder int

GenePairOrder defines the order at which a gene pair is printed (A/B or B/A). Possible values are CosmicOrder and AlphabeticalOrder.

const (
	// Output the genes in the order that's listed in the cosmic DB.  This format
	// requires that the either <g1,g2> or <g2,g2> to be listed in the cosmic.
	CosmicOrder GenePairOrder = iota
	// Output the genes in the alphabetical order.
	AlphabeticalOrder
)

type Kmer ¶

type Kmer uint64

Kmer is a compact encoding of a sequence of ACGT, up to 32bases.

type Opts ¶

type Opts struct {
	UMIInRead bool
	UMIInName bool
	// LowComplexityFraction determines whether a fragment (or read) should be
	// dropped because it contains too many repetition of the same base types.  If
	// LowComplexityFraction of bases in a fragment are such repetitions, it is
	// dropped without further analyses.
	LowComplexityFraction float64
	// KmerLength is the length of kmer used to match DNA sequences.
	KmerLength int
	Denovo     bool
	// maxGap specifies the max gap allowed between
	// two consecutive ranges (this is to tolerate sequence errors).
	MaxGap int

	// MaxHomology is the max overlap allowed b/w genes in a fusion
	MaxHomology int
	// Min base evidence for a gene in the fusion.
	MinSpan int

	// MaxKmerFrequency the number of genes a kmer belongs to, default 5. Used to
	// be --cap flag in the C++ code.
	MaxGenesPerKmer int

	// MaxGeneCandidatesPerFragment caps the number of genes considered for each
	// fragment. Doing so will reduce the number of pairs that will be processed
	// for every read.
	//
	// TODO(saito,xyang) report read through events but differentiating this from
	// the overlapping genes
	MaxGeneCandidatesPerFragment int

	// MaxProximityDistance is the distance cutoff below which a candidate will be
	// rejected as a readthrough event
	MaxProximityDistance int

	// MaxProximityGenes is number of genes separating a gene pair (If on the same
	// chromsosome) below which they will be flagged as read-through events.
	MaxProximityGenes int

	// MaxGenePartners is the maximum number of partners a gene can have this is
	// used in the filtering stage.
	MaxGenePartners int

	// Minimum number of supporting reads required
	// to consider a fusion
	MinReadSupport int
}

type Pos ¶

type Pos int64

Pos is the position in a read or a fragment (i.e., paired reads).

When Pos refers to the position in a read, it is simply the zero-based index within the read sequence.

When Pos refers to the position in a fragment, it can be either position in R1 or in R2. A position in R2 has a value >= r2PosOffset. To get the actual index within the R2 sequence, subtract r2PosOffset from the value.

func (Pos) R2Off ¶

func (pos Pos) R2Off() int

R2Off returns the offset of this position from the start of R2.

REQUIRES: pos.Type()==R2.

func (Pos) ReadType ¶

func (pos Pos) ReadType() ReadType

ReadType returns the type of read.

type PosRange ¶

type PosRange struct{ Start, End Pos }

PosRange is a half-open range [start, end).

INVARIANT: Start and End never cross a R1/R2 boundary.

func (PosRange) Equal ¶

func (r PosRange) Equal(other PosRange) bool

Equal checks if the two ranges are identical.

type ReadType ¶

type ReadType uint8

ReadType defines the read type (R1 or R2) for a paired fragment.

const (
	// R1 means the read is either raw read from R1 fastq file, or it is a result
	// of stitching R1 and R2.
	R1 ReadType = iota
	// R2 means the read is from R2 fastq file. Note: when the read pair could be
	// stitched, the combined result will be stored in Fragment.R1Seq and
	// R2Seq. will be empty.
	R2
)

type Stats ¶

type Stats struct {
	// LowComplexityReads2 is the # of readpairs where both reads are
	// found to have low complexity.
	LowComplexityReads2 int
	// LowComplexityReads2 is the # of readpairs where one of the reads are found
	// to have low complexity.
	LowComplexityReads1 int
	// LowComplexityStitched is the # of readpairs that were successfully
	// stitched, but then found to have low complexity.
	LowComplexityReadsStitched int
	// Stitched is the # of reads successuflly stitched
	Stitched int
	// RawGenes is the total genes found during kmer lookup.
	RawGenes int
	// Genes is the total genes found during kmer lookup, after
	// Opts.MaxGeneCandidatesPerFragment cutoff.
	Genes int
	// Fragments counts the total number of fragments processed.
	Fragments int
	// FragmentsWithMatchingGenes[k] (0<=k<4) counts the total # of fragments that
	// are found to have k genes matching one of its kmers. The last element in
	// this array counts all the fragmenst with >=4 matching genes.
	FragmentsWithMatchingGenes [5]int
	// Ranges is the total # of ranges covered by any gene.
	RawRanges int
	// Ranges is the total # of ranges covered by any gene, after
	// Opts.MaxGeneCandidatesPerFragment cutoff.
	Ranges int
}

Stats represents high-level statistics during the run of the stage1 of AF4.

func (Stats) Merge ¶

func (s Stats) Merge(o Stats) Stats

Merge adds the field values of the two Stats objects and creates new Stats.

type Stitcher ¶

type Stitcher struct {
	// contains filtered or unexported fields
}

Stitcher stitches two reads (R1 and R2) and produces a Fragment. It can be used to stitch multiple read pairs. Thread compatible.

func NewStitcher ¶

func NewStitcher(kmerLength int, lowComplexityFraction float64) *Stitcher

NewStitcher creates a new stitcher. kmerLength and lowComplexityFraction should be copied from the counterparts in Opts.

func (*Stitcher) FreeFragment ¶

func (s *Stitcher) FreeFragment(f Fragment)

FreeFragment puts the fragment in a freepool. The caller must not retain any reference to the fragment after the call. The future calls to Stitch will use fragments in the freepool.

func (*Stitcher) Stitch ¶

func (s *Stitcher) Stitch(name, r1Seq, r2Seq string, stats *Stats) Fragment

Stitch tries to find a segment shared between the readpair and combine them into once sequence. Name should be the R1 name from the FASTQ.

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
cmd This is the main package for parsegencode.	This is the main package for parsegencode.
parsegencode Package parsegencode contains the required methods for parsing a gencode GTF annotation and printing out the transcripts to an output file.	Package parsegencode contains the required methods for parsing a gencode GTF annotation and printing out the transcripts to an output file.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL