Documentation
¶
Overview ¶
Package extsort implements an unstable external sort for all the records in a chan or iterator
Index ¶
- func NewComparisonError(cause interface{}, context string) error
- func NewDeserializationError(cause interface{}, dataSize int, context string) error
- func NewDiskError(err error, operation, path string) error
- func NewResourceError(err error, resource, context string) error
- func NewSerializationError(cause interface{}, context string) error
- func UniqStringChan(in chan string) chan string
- type CompareGeneric
- type CompareLessFuncdeprecated
- type ComparisonError
- type Config
- type ConfigError
- type DeserializationError
- type FromBytesdeprecated
- type FromBytesGeneric
- type GenericSorter
- type OrderedSorter
- type SerializationError
- type SortTypedeprecated
- type SortTypeSorterdeprecated
- type Sorter
- type StringSorter
- type ToBytesGeneric
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func NewComparisonError ¶ added in v1.2.0
NewComparisonError creates a ComparisonError
func NewDeserializationError ¶ added in v1.2.0
NewDeserializationError creates a DeserializationError
func NewDiskError ¶ added in v1.2.0
NewDiskError creates a DiskError wrapping the underlying I/O error
func NewResourceError ¶ added in v1.2.0
NewResourceError creates a ResourceError wrapping the underlying error
func NewSerializationError ¶ added in v1.2.0
NewSerializationError creates a SerializationError
func UniqStringChan ¶
UniqStringChan returns a channel that filters out consecutive duplicate strings from the input. This function assumes the input channel provides strings in sorted order and uses string equality to detect duplicates. It preserves the first occurrence of each unique string while filtering out subsequent duplicates. The function is optimized for sorted inputs where duplicates appear consecutively, making it suitable for post-processing sorted results from external sort operations.
The returned channel will be closed when the input channel is closed. This function spawns a goroutine that will terminate when the input channel is closed.
Types ¶
type CompareGeneric ¶ added in v1.3.0
CompareGeneric is a function type for comparing two items of type E. It must implement a strict weak ordering: reflexivity, antisymmetry, and transitivity. Returns a negative integer if a should be ordered before b, zero if they are equal, and a positive integer if a should be ordered after b in the final sorted output. The function must be consistent and must handle any errors by panicking. This follows the same semantics as cmp.Compare and can be implemented using cmp.Compare[T] for ordered types.
type CompareLessFunc
deprecated
type ComparisonError ¶ added in v1.2.0
type ComparisonError struct { // Cause is the original panic or error that occurred during comparison Cause interface{} // Context provides additional information about when the comparison failed Context string }
ComparisonError represents an error that occurred during item comparison
func (*ComparisonError) Error ¶ added in v1.2.0
func (e *ComparisonError) Error() string
func (*ComparisonError) Unwrap ¶ added in v1.2.0
func (e *ComparisonError) Unwrap() error
type Config ¶
type Config struct { // ChunkSize specifies the maximum number of records to store in each chunk // before writing to disk. Larger chunks use more memory but reduce I/O operations. // Default: 1,000,000 records. Must be > 1. ChunkSize int // NumWorkers controls the maximum number of goroutines used for parallel // chunk sorting and merging. More workers can improve CPU utilization on multi-core systems. // Default: 2 workers. Must be > 1. NumWorkers int // ChanBuffSize sets the buffer size for internal channels used during chunk merging. // Larger buffers can improve throughput but use more memory. // Default: 1. Must be >= 0. ChanBuffSize int // SortedChanBuffSize sets the buffer size for the output channel that delivers // sorted results. Larger buffers allow more decoupling between sorting and consumption. // Default: 1000. Must be >= 0. SortedChanBuffSize int // TempFilesDir specifies the directory for temporary files during sorting. // When empty (default), the library uses intelligent directory selection that // prefers disk-backed locations over potentially memory-backed filesystems // (like tmpfs on Linux). This helps prevent out-of-memory issues when sorting // datasets larger than available RAM. // // For production use with large datasets, it's recommended to explicitly set // this to a known disk-backed directory (such as "/var/tmp" on Unix systems) // to ensure optimal performance and avoid memory limitations. On Linux systems, // prefer "/var/tmp" over "/tmp" since "/tmp" may be mounted as tmpfs. // // Default: "" (intelligent selection). TempFilesDir string }
Config holds configuration settings for external sorting operations. All fields have sensible defaults and can be left as zero values to use defaults.
func DefaultConfig ¶
func DefaultConfig() *Config
DefaultConfig returns a Config with sensible default values optimized for general-purpose external sorting. These defaults balance memory usage, I/O efficiency, and parallelism for typical workloads.
type ConfigError ¶ added in v1.2.0
type ConfigError struct { // Field is the name of the configuration field that's invalid Field string // Value is the invalid value provided Value interface{} // Reason explains why the value is invalid Reason string }
ConfigError represents an error in configuration parameters
func (*ConfigError) Error ¶ added in v1.2.0
func (e *ConfigError) Error() string
type DeserializationError ¶ added in v1.2.0
type DeserializationError struct { // Cause is the original panic or error that occurred during deserialization Cause interface{} // DataSize is the size of the data that failed to deserialize DataSize int // Context provides additional information about what was being deserialized Context string }
DeserializationError represents an error that occurred during item deserialization (FromBytes)
func (*DeserializationError) Error ¶ added in v1.2.0
func (e *DeserializationError) Error() string
func (*DeserializationError) Unwrap ¶ added in v1.2.0
func (e *DeserializationError) Unwrap() error
type FromBytesGeneric ¶ added in v1.2.0
FromBytesGeneric is a function type for deserializing bytes back to type E. It's used during the merge phase to reconstruct items from temporary storage. The function should be the inverse of the corresponding ToBytesGeneric function. It returns an error for any deserialization failures, which will be wrapped in a DeserializationError by the external sorter.
type GenericSorter ¶ added in v1.2.0
type GenericSorter[E any] struct { // contains filtered or unexported fields }
GenericSorter implements external sorting for any type E using a divide-and-conquer approach. It reads input from a channel, splits data into chunks that fit in memory, sorts each chunk, saves them to temporary files, then merges all chunks back into a sorted output stream. The sorter uses configurable parallelism for both sorting and merging phases, and employs memory pools to reduce garbage collection pressure during operation.
func Generic ¶ added in v1.2.0
func Generic[E any](input <-chan E, fromBytes FromBytesGeneric[E], toBytes ToBytesGeneric[E], compareFunc CompareGeneric[E], config *Config) (*GenericSorter[E], <-chan E, <-chan error)
Generic creates a new external sorter for any type E and returns the sorter instance, output channel with sorted results, and error channel.
Parameters:
- input: Channel providing the data to be sorted
- fromBytes: Function to deserialize E from bytes when reading from disk
- toBytes: Function to serialize E to bytes when writing to disk
- compareFunc: Comparison function that returns negative/zero/positive for less/equal/greater
- config: Configuration options (nil uses defaults)
The sorting process:
- Reads data from input channel into memory chunks
- Sorts each chunk in parallel using the provided compareFunc
- Saves sorted chunks to temporary files using toBytes serialization
- Merges all chunks back into sorted order using fromBytes deserialization
Call Sort() on the returned sorter to begin the sorting process. Results are delivered via the output channel, errors via the error channel. On error or interruption, temporary files may remain in config.TempFilesDir.
func MockGeneric ¶ added in v1.2.0
func MockGeneric[E any](input <-chan E, fromBytes FromBytesGeneric[E], toBytes ToBytesGeneric[E], compareFunc CompareGeneric[E], config *Config, n int) (*GenericSorter[E], <-chan E, <-chan error)
MockGeneric creates an external sorter that uses in-memory storage instead of disk files. This is primarily useful for testing and benchmarking without filesystem I/O overhead. The parameter n specifies the initial capacity of the in-memory buffer. All other behavior is identical to Generic().
func (*GenericSorter[E]) Sort ¶ added in v1.2.0
func (s *GenericSorter[E]) Sort(ctx context.Context)
Sort sorts the Sorter's input chan and returns a new sorted chan, and error Chan Sort is a chunking operation that runs multiple workers asynchronously this blocks while sorting chunks and unblocks when merging NOTE: the context passed to Sort must outlive Sort() returning. Merge uses the same context and runs in a goroutine after Sort returns(). for example, if calling sort in an errGroup, you must pass the group's parent context into sort.
type OrderedSorter ¶ added in v1.2.0
type OrderedSorter[T cmp.Ordered] struct { GenericSorter[T] // contains filtered or unexported fields }
OrderedSorter provides external sorting for types that implement cmp.Ordered. It embeds GenericSorter and adds optimized byte serialization using gob encoding with a sync.Pool for buffer reuse to reduce allocations.
func Ordered ¶ added in v1.2.0
func Ordered[T cmp.Ordered](input <-chan T, config *Config) (*OrderedSorter[T], <-chan T, <-chan error)
Ordered performs external sorting on a channel of cmp.Ordered types. It returns the sorter instance, output channel with sorted results, and error channel. Uses gob encoding for serialization and the < operator for comparison.
func OrderedMock ¶ added in v1.2.0
func OrderedMock[T cmp.Ordered](input <-chan T, config *Config, n int) (*OrderedSorter[T], <-chan T, <-chan error)
OrderedMock performs external sorting with a mock implementation that limits the number of items to sort (useful for testing). Takes the same parameters as Ordered plus n which limits the number of items processed.
type SerializationError ¶ added in v1.2.0
type SerializationError struct { // Cause is the original panic or error that occurred during serialization Cause interface{} // Context provides additional information about what was being serialized Context string }
SerializationError represents an error that occurred during item serialization (ToBytes)
func (*SerializationError) Error ¶ added in v1.2.0
func (e *SerializationError) Error() string
func (*SerializationError) Unwrap ¶ added in v1.2.0
func (e *SerializationError) Unwrap() error
type SortType
deprecated
type SortType interface {
ToBytes() []byte // ToBytes used for marshaling
}
SortType defines the interface required by the extsort library to be able to sort the items
Deprecated: Use Generic() with custom types instead for new code. This interface is maintained for backward compatibility.
type SortTypeSorter
deprecated
type SortTypeSorter struct { GenericSorter[SortType] }
SortTypeSorter provides external sorting for types implementing the SortType interface, maintaining backward compatibility with the legacy interface-based API. It embeds GenericSorter[SortType] and adapts the interface methods to the generic implementation.
Deprecated: Use GenericSorter[T] instead for new code. This type is maintained for backward compatibility.
func New
deprecated
func New(input <-chan SortType, fromBytes FromBytes, lessFunc CompareLessFunc, config *Config) (*SortTypeSorter, <-chan SortType, <-chan error)
New performs external sorting on a channel of SortType items using the legacy interface-based API. It takes a FromBytes function for deserialization and a CompareLessFunc for comparison. Returns the sorter instance, output channel with sorted items, and error channel. This function provides backward compatibility with the original extsort API.
Deprecated: Use Generic() instead for new code. This function is maintained for backward compatibility.
func NewMock
deprecated
func NewMock(input <-chan SortType, fromBytes FromBytes, lessFunc CompareLessFunc, config *Config, n int) (*SortTypeSorter, <-chan SortType, <-chan error)
NewMock performs external sorting on SortType items with a mock implementation that limits the number of items to sort. Useful for testing with a controlled dataset size. The parameter n specifies the maximum number of items to process. Uses the same interface-based API as New for backward compatibility.
Deprecated: Use MockGeneric() instead for new code. This function is maintained for backward compatibility.
type Sorter ¶
type Sorter interface { // Sort performs the complete external sorting operation. // It reads from the input channel, sorts the data using temporary storage, // and delivers results to the output channel. The operation can be cancelled // or timed out using the provided context. Sort(context.Context) }
Sorter is the interface that all extsort sorter implementations must satisfy. It provides a single Sort method that performs the complete external sorting operation within the provided context, allowing for cancellation and timeout control.
type StringSorter ¶
type StringSorter struct { GenericSorter[string] }
StringSorter provides external sorting for strings, maintaining backward compatibility with the legacy string-specific API. It embeds GenericSorter[string] and uses simple byte slice conversion for serialization.
func Strings ¶
func Strings(input <-chan string, config *Config) (*StringSorter, <-chan string, <-chan error)
Strings performs external sorting on a channel of strings using lexicographic ordering. Returns the sorter instance, output channel with sorted strings, and error channel. This function provides backward compatibility with the legacy string-specific API.
func StringsMock ¶
func StringsMock(input <-chan string, config *Config, n int) (*StringSorter, <-chan string, <-chan error)
StringsMock performs external sorting on strings with a mock implementation that limits the number of strings to sort. Useful for testing with a controlled dataset size. The parameter n specifies the maximum number of strings to process.
type ToBytesGeneric ¶ added in v1.2.0
ToBytesGeneric is a function type for serializing type E to bytes. It's used during chunk saving to store items in temporary files. The function should produce deterministic output that can be read back by the corresponding FromBytesGeneric function. It returns an error for any serialization failures, which will be wrapped in a SerializationError by the external sorter.
Source Files
¶
Directories
¶
Path | Synopsis |
---|---|
Package diff provides functionality for comparing two sorted data streams and identifying differences between them.
|
Package diff provides functionality for comparing two sorted data streams and identifying differences between them. |
update_examples.go updates README.md code examples with actual example code.
|
update_examples.go updates README.md code examples with actual example code. |
custom_types
command
|
|
diff
command
|
|
diff_generic
command
|
|
diff_parallel
command
|
|
generic
command
|
|
legacy
command
|
|
strings
command
|
|
Package queue provides a generic priority queue implementation optimized for external sorting.
|
Package queue provides a generic priority queue implementation optimized for external sorting. |
Package tempfile provides an abstraction for creating virtual temporary files that are mapped to sections of a single physical file on disk.
|
Package tempfile provides an abstraction for creating virtual temporary files that are mapped to sections of a single physical file on disk. |