Documentation
¶
Overview ¶
Merge sort will sort a set of io.Reader by fields in parallel and return the result. Similar to how bufio.Scanner works, it is up to the reader interface to make sure that memory is copied or used per scan call as mergesort reuses the byte slices upon a new scan call.
Index ¶
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BytesCompareDedup ¶
Simple comparison function with deduplication between files
func ScanLines ¶
ScanLines is a split function for a Scanner that returns each line of text, stripped of any trailing end-of-line marker. The returned line may be empty. The end-of-line marker is one optional carriage return followed by one mandatory newline. In regular expression notation, it is `\r?\n`. The last non-empty line of input will be returned even if it has no newline.
Types ¶
type CompareFunc ¶
Compare returns an integer comparing two byte slices lexicographically. The result will be 0 if a == b, -1 if a < b, and +1 if a > b. A nil argument is equivalent to an empty slice.
Record removal is also possible, if -2 is provided then only `a` will be used and if +2 is provided then only `b` will be used.
type FilterFunc ¶
Filter allows one to filter results just after scanning. If there is computational work needed to be done (like type conversions or looking up an index in a map) the filter is the place to do this, not in the SplitFunc.
The used() func is called after a record has been used (if one is provided). The purpose of the used func handle is to help with memory management. An example of the use case for used is to return a byte slice to a sync.Pool.
type Scanner ¶
type Scanner struct {
// contains filtered or unexported fields
}
This scanner interface is similar to the bufio.Scanner interface so as to make implementation in already existing code trivial. The input type has to be a io.Reader interface so as to avoid any collisions in the parallel sorting reusing the same bytes buffer. Effort has been made to avoid any memory copies.
func New ¶
New returns a new Scanner to read from a set of scanners which expect ordered input.
As the bulk of the sorting is done in goroutines and in the background, the context is the best method to cancel any on-going sorting functions.
Example ¶
package main import ( "context" "fmt" "io" "strings" ms "github.com/pschou/go-mergesort" ) func main() { list := []io.Reader{ strings.NewReader("b\ne\nf\ng\n"), strings.NewReader("f\nz\n"), strings.NewReader("a\nc\nd\n"), strings.NewReader("x\ny\n"), } s := ms.New(context.TODO(), list...) // s.Split(ms.ScanLines) // s.Compare(ms.BytesCompare) s.Scan() fmt.Println(s.Text()) }
Output: a
func (*Scanner) Bytes ¶
Bytes returns the most recent token generated by a call to Scanner.Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan.
func (*Scanner) Compare ¶
func (s *Scanner) Compare(compare CompareFunc)
Compare sets the compare function for the Scanner. The default compare function is BytesCompare.
Compare panics if it is called after scanning has started.
func (*Scanner) Filter ¶
func (s *Scanner) Filter(filter FilterFunc)
Filter sets the filter function for the Scanner. The default filter function is none.
Filter panics if it is called after scanning has started.
func (*Scanner) Scan ¶
Scan advances the Scanner to the next token, which will then be available through the Scanner.Bytes or Scanner.Text method. It returns false when there are no more tokens, either by reaching the end of the input or an error.
type SplitFunc ¶
SplitFunc is the signature of the split function used to tokenize the input. The arguments are an initial substring of the remaining unprocessed data and a flag, atEOF, that reports whether the [Reader] has no more data to give, and i, which is the index of the [Reader] input. The return values are the number of bytes to advance the input and the next token to return to the user, if any, plus an error, if any.
Scanning stops if the function returns an error, in which case some of the input may be discarded. If that error is [ErrFinalToken], scanning stops with no error. A non-nil token delivered with [ErrFinalToken] will be the last token, and a nil token with [ErrFinalToken] immediately stops the scanning.
Otherwise, the Scanner advances the input. If the token is not nil, the Scanner returns it to the user. If the token is nil, the Scanner reads more data and continues scanning; if there is no more data--if atEOF was true--the Scanner returns. If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, a SplitFunc can return (0, nil, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.
The function is never called with an empty data slice unless atEOF is true. If atEOF is true, however, data may be non-empty and, as always, holds unprocessed text.