mergesort

README

Parallel Merge Sorting Scanner

A bufio.Scanner-like reader which reads from multiple io.Readers and returns their records in sorted order. Multiple goroutines and connecting channels are created to speed up the sorting: the file reads happen in parallel, while each node in the binary merge tree sorts bottom-up. The final sorted channel is exposed through the standard Scan(), Bytes(), and Text() methods familiar from bufio.Scanner.

An additional FilterFunc is provided to parallelize any computation that needs to be done to a record before it is consumed. A typical example is a record type conversion, such as parsing an integer into a 4-byte unsigned value to reduce data size.

The CompareFunc is useful when the records in the original files are ordered case-insensitively; a compare function can then be provided which ignores the case of the input byte slices (see the sketch after the example below).

import (
  "context"
  "fmt"
  "strings"

  ms "github.com/pschou/go-mergesort"
)

func ExampleNew() {
  s := ms.New(context.TODO(),
    strings.NewReader("a\nc\nd\n"),
    strings.NewReader("b\ne\nf\ng\n"),
    strings.NewReader("f\nz\n"),
  )

  for s.Scan() {
    fmt.Println(s.Text())
  }

  // Output:
  // a
  // b
  // c
  // d
  // e
  // f
  // g
  // z
}
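
If the input files are ordered case-insensitively, a matching compare function can be set before the first Scan call. A minimal sketch, assuming the imports above plus "bytes" (the helper name caseInsensitiveCompare is illustrative, not part of the package):

func caseInsensitiveCompare(a, b []byte, ai, bi int) int {
  return bytes.Compare(bytes.ToLower(a), bytes.ToLower(b))
}

// Then, before the first Scan:
//   s := ms.New(context.TODO(), readers...)
//   s.Compare(caseInsensitiveCompare)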

Documentation

Overview

Merge sort sorts a set of io.Reader inputs by fields in parallel and returns the result. As with bufio.Scanner, it is up to the caller to copy any data it needs to keep, because mergesort reuses the byte slices on each new Scan call.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func BytesCompare

func BytesCompare(a, b []byte, ai, bi int) int

BytesCompare is a simple lexicographic comparison function.

func BytesCompareDedup

func BytesCompareDedup(a, b []byte, ai, bi int) (c int)

BytesCompareDedup is a simple comparison function which also deduplicates equal records between files.

func ScanLines

func ScanLines(data []byte, atEOF bool, i int) (advance int, token []byte, err error)

ScanLines is a split function for a Scanner that returns each line of text, stripped of any trailing end-of-line marker. The returned line may be empty. The end-of-line marker is one optional carriage return followed by one mandatory newline. In regular expression notation, it is `\r?\n`. The last non-empty line of input will be returned even if it has no newline.

Types

type CompareFunc

type CompareFunc func(a, b []byte, ai, bi int) int

Compare returns an integer comparing two byte slices lexicographically. The result will be 0 if a == b, -1 if a < b, and +1 if a > b. A nil argument is equivalent to an empty slice.

Record removal is also possible: if -2 is returned, only `a` will be used, and if +2 is returned, only `b` will be used.
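
For example, a compare function could deduplicate byte-equal records across readers by returning -2 so that only `a` is kept. A hedged sketch, assuming "bytes" is imported (the name dedupCompare is illustrative; BytesCompareDedup already provides this kind of behavior):

func dedupCompare(a, b []byte, ai, bi int) int {
	c := bytes.Compare(a, b)
	if c == 0 {
		return -2 // equal: use only a, dropping the duplicate b
	}
	return c
}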

type FilterFunc

type FilterFunc func(input []byte, id int) (output []byte, used func(), err error)

Filter allows one to filter results just after scanning. If computational work needs to be done (such as type conversions or looking up an index in a map), the filter is the place to do it, not the SplitFunc.

The used() func, if one is provided, is called after a record has been consumed. Its purpose is to help with memory management; for example, used can return a byte slice to a sync.Pool.
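
A hedged sketch of both ideas, assuming "encoding/binary", "strconv", and "sync" are imported (the names u32Pool and uint32Filter are illustrative, not part of the package): each decimal record is parsed into a 4-byte big-endian value, and the buffer is handed back to a pool once the record has been consumed.

var u32Pool = sync.Pool{New: func() any { b := make([]byte, 4); return &b }}

func uint32Filter(input []byte, id int) (output []byte, used func(), err error) {
	v, err := strconv.ParseUint(string(input), 10, 32)
	if err != nil {
		return nil, nil, err
	}
	bp := u32Pool.Get().(*[]byte)
	binary.BigEndian.PutUint32(*bp, uint32(v))
	return *bp, func() { u32Pool.Put(bp) }, nil
}

Set it with s.Filter(uint32Filter) before the first Scan call.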

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

This Scanner interface is similar to the bufio.Scanner interface so as to make adoption in existing code trivial. The input type has to be an io.Reader so that the parallel sorting can reuse the same byte buffers without collisions. Effort has been made to avoid any memory copies.

func New

func New(ctx context.Context, list ...io.Reader) *Scanner

New returns a new Scanner that reads from a set of readers, each of which is expected to provide already-ordered input.

As the bulk of the sorting is done by goroutines in the background, the context is the best way to cancel any ongoing sorting.

Example
package main

import (
	"context"
	"fmt"
	"io"
	"strings"

	ms "github.com/pschou/go-mergesort"
)

func main() {
	list := []io.Reader{
		strings.NewReader("b\ne\nf\ng\n"),
		strings.NewReader("f\nz\n"),
		strings.NewReader("a\nc\nd\n"),
		strings.NewReader("x\ny\n"),
	}

	s := ms.New(context.TODO(), list...)
	// s.Split(ms.ScanLines)
	// s.Compare(ms.BytesCompare)

	s.Scan()
	fmt.Println(s.Text())
}
Output:

a
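
Because the sorting runs in background goroutines, a cancellable context can stop it early. Continuing from the example above, a minimal sketch:

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	s := ms.New(ctx, list...)
	for s.Scan() {
		line := s.Text()
		fmt.Println(line)
		if line == "c" {
			cancel() // stop the background sorting goroutines early
			break
		}
	}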

func (*Scanner) Bytes

func (s *Scanner) Bytes() []byte

Bytes returns the most recent token generated by a call to Scanner.Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan.
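
Because the slice may be reused, copy a token before retaining it across Scan calls. A minimal sketch, assuming s is a *Scanner as above:

	var kept [][]byte
	for s.Scan() {
		buf := make([]byte, len(s.Bytes()))
		copy(buf, s.Bytes()) // copy, since the underlying array is reused on the next Scan
		kept = append(kept, buf)
	}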

func (*Scanner) Compare

func (s *Scanner) Compare(compare CompareFunc)

Compare sets the compare function for the Scanner. The default compare function is BytesCompare.

Compare panics if it is called after scanning has started.

func (*Scanner) Err

func (s *Scanner) Err() error

Err returns the first non-EOF error that was encountered by the Scanner.

func (*Scanner) Filter

func (s *Scanner) Filter(filter FilterFunc)

Filter sets the filter function for the Scanner. By default no filter function is set.

Filter panics if it is called after scanning has started.

func (*Scanner) Scan

func (s *Scanner) Scan() bool

Scan advances the Scanner to the next token, which will then be available through the Scanner.Bytes or Scanner.Text method. It returns false when there are no more tokens, either by reaching the end of the input or an error.

func (*Scanner) Split

func (s *Scanner) Split(split SplitFunc)

Split sets the split function for the Scanner. The default split function is ScanLines.

Split panics if it is called after scanning has started.

func (*Scanner) Text

func (s *Scanner) Text() string

Text returns the most recent token generated by a call to Scanner.Scan as a newly allocated string holding its bytes.

type SplitFunc

type SplitFunc func(data []byte, atEOF bool, i int) (advance int, token []byte, err error)

SplitFunc is the signature of the split function used to tokenize the input. The arguments are an initial substring of the remaining unprocessed data and a flag, atEOF, that reports whether the [Reader] has no more data to give, and i, which is the index of the [Reader] input. The return values are the number of bytes to advance the input and the next token to return to the user, if any, plus an error, if any.

Scanning stops if the function returns an error, in which case some of the input may be discarded. If that error is [ErrFinalToken], scanning stops with no error. A non-nil token delivered with [ErrFinalToken] will be the last token, and a nil token with [ErrFinalToken] immediately stops the scanning.

Otherwise, the Scanner advances the input. If the token is not nil, the Scanner returns it to the user. If the token is nil, the Scanner reads more data and continues scanning; if there is no more data--if atEOF was true--the Scanner returns. If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, a SplitFunc can return (0, nil, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

The function is never called with an empty data slice unless atEOF is true. If atEOF is true, however, data may be non-empty and, as always, holds unprocessed text.
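
As a hedged sketch, a whitespace-delimited tokenizer can delegate to bufio.ScanWords and ignore the reader index (the name scanWords is illustrative, and "bufio" must be imported):

func scanWords(data []byte, atEOF bool, i int) (advance int, token []byte, err error) {
	// The reader index i is unused here; bufio.ScanWords handles the tokenizing.
	return bufio.ScanWords(data, atEOF)
}

Install it with s.Split(scanWords) before the first Scan call.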
