csvplus

Version: v0.3.3 | Published: May 18, 2018 | License: BSD-3-Clause | Imports: 12 | Imported by: 1

Repository

github.com/maxim2266/csvplus

README ¶

csvplus

Package csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream processing operations, indices and joins.

The library is primarily designed for ETL-like processes. It is mostly useful where the more advanced searching and joining capabilities of a fully-featured SQL database are not required, but at the same time the data transformations needed still include SQL-like operations.

License: BSD

Examples

Simple sequential processing:

people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

err := csvplus.Take(people).
	Filter(csvplus.Like(csvplus.Row{"name": "Amelia"})).
	Map(func(row csvplus.Row) csvplus.Row { row["name"] = "Julia"; return row }).
	ToCsvFile("out.csv", "name", "surname")

if err != nil {
	return err
}

More involved example:

customers := csvplus.FromFile("people.csv").SelectColumns("id", "name", "surname")
custIndex, err := csvplus.Take(customers).UniqueIndexOn("id")

if err != nil {
	return err
}

products := csvplus.FromFile("stock.csv").SelectColumns("prod_id", "product", "price")
prodIndex, err := csvplus.Take(products).UniqueIndexOn("prod_id")

if err != nil {
	return err
}

orders := csvplus.FromFile("orders.csv").SelectColumns("cust_id", "prod_id", "qty", "ts")
iter := csvplus.Take(orders).Join(custIndex, "cust_id").Join(prodIndex)

return iter(func(row csvplus.Row) error {
	// prints lines like:
	//	John Doe bought 38 oranges for £0.03 each on 2016-09-14T08:48:22+01:00
	_, e := fmt.Printf("%s %s bought %s %ss for £%s each on %s\n",
		row["name"], row["surname"], row["qty"], row["product"], row["price"], row["ts"])
	return e
})

Design principles

The package functionality is based on the operations on the following entities:

type Row
type DataSource
type Index

Type `Row`

Row represents one row from a DataSource. It is a map from column names to the string values under those columns on the current row. The package expects a unique name assigned to every column at source. Compared to using integer indices this provides more convenience when complex transformations get applied to each row during processing.

Type `DataSource`

Type DataSource represents any source of zero or more rows, such as a .csv file. It is a function that, when invoked, feeds the given callback with the data from its source, one Row at a time. The type also has a number of operations defined on it that make it easy to compose operations on the DataSource, forming a so-called fluent interface. All these operations are 'lazy', i.e. they are not performed immediately; instead, each of them returns a new DataSource.

There are also a number of convenience operations that actually invoke the DataSource function to produce a specific type of output:

IndexOn to build an index on the specified column(s);
UniqueIndexOn to build a unique index on the specified column(s);
ToCsv to serialise the DataSource to the given io.Writer in .csv format;
ToCsvFile to store the DataSource in the specified file in .csv format;
ToJSON to serialise the DataSource to the given io.Writer in JSON format;
ToJSONFile to store the DataSource in the specified file in JSON format;
ToRows to convert the DataSource to a slice of Rows.

Type `Index`

Index is a sorted collection of rows. The sorting is performed on the columns specified when the index is created. Iteration over an index yields a sorted sequence of rows. An Index can be joined with a DataSource. The type has operations for finding rows and creating sub-indices in O(log(n)) time. Another useful operation is resolving duplicates. Building an index takes O(n*log(n)) time. Note that building an Index requires the entire dataset to be read into memory, so care should be taken when indexing huge datasets. An index can also be stored to, or loaded from, a disk file.
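The complexity figures above follow from keeping rows sorted and searching them with binary search. The following is a minimal sketch of that idea using only the standard library. The `index` type and the `buildIndex` and `find` helpers are local stand-ins invented for illustration; they are not the package's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a local stand-in for the package's Row type.
type Row map[string]string

// index is a hypothetical structure: rows kept sorted on the key columns.
type index struct {
	columns []string
	rows    []Row
}

// buildIndex copies the rows and sorts the copy on the given columns,
// compared left to right. Sorting costs O(n*log(n)).
func buildIndex(rows []Row, columns ...string) *index {
	sorted := append([]Row(nil), rows...)
	sort.SliceStable(sorted, func(i, j int) bool {
		for _, col := range columns {
			if sorted[i][col] != sorted[j][col] {
				return sorted[i][col] < sorted[j][col]
			}
		}
		return false
	})
	return &index{columns: columns, rows: sorted}
}

// compareKey compares the leading key columns of row with values.
func (ix *index) compareKey(row Row, values []string) int {
	for i, v := range values {
		c := row[ix.columns[i]]
		if c < v {
			return -1
		}
		if c > v {
			return 1
		}
	}
	return 0
}

// find returns the rows whose leading key columns equal values,
// located with two binary searches, i.e. in O(log(n)) time.
func (ix *index) find(values ...string) []Row {
	if len(values) > len(ix.columns) {
		return nil
	}
	lo := sort.Search(len(ix.rows), func(i int) bool { return ix.compareKey(ix.rows[i], values) >= 0 })
	hi := sort.Search(len(ix.rows), func(i int) bool { return ix.compareKey(ix.rows[i], values) > 0 })
	return ix.rows[lo:hi]
}

func main() {
	orders := []Row{
		{"cust_id": "2", "prod_id": "b"},
		{"cust_id": "1", "prod_id": "b"},
		{"cust_id": "2", "prod_id": "a"},
		{"cust_id": "1", "prod_id": "a"},
	}
	ix := buildIndex(orders, "cust_id", "prod_id")

	fmt.Println(len(ix.find("2")))      // 2
	fmt.Println(len(ix.find("2", "a"))) // 1
	fmt.Println(len(ix.find("3")))      // 0
}
```

Passing fewer values than there are key columns matches on a key prefix, which is also how a sub-index over part of the key could be carved out of the same sorted slice.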

For more details see the documentation.

Project status

The project is in a usable state usually called "beta". Tested on Linux Mint 18.3 using Go version 1.10.2.

Documentation ¶

Overview ¶

Package csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream processing operations, indices and joins.

Index ¶

func All(funcs ...func(Row) bool) func(Row) bool
func Any(funcs ...func(Row) bool) func(Row) bool
func Like(match Row) func(Row) bool
func Not(pred func(Row) bool) func(Row) bool
type DataSource
- func Take(src interface{ ... }) DataSource
- func TakeRows(rows []Row) DataSource
type DataSourceError
- func (e *DataSourceError) Error() string
type Index
- func LoadIndex(fileName string) (*Index, error)
type Reader
type Row
type RowFunc

Functions ¶

func All ¶

func All(funcs ...func(Row) bool) func(Row) bool

All is a predicate combinator that takes any number of other predicates and produces a new predicate which returns 'true' only if all the specified predicates return 'true' for the same input Row.

func Any ¶

func Any(funcs ...func(Row) bool) func(Row) bool

Any is a predicate combinator that takes any number of other predicates and produces a new predicate which returns 'true' if any of the specified predicates returns 'true' for the same input Row.

func Like ¶

func Like(match Row) func(Row) bool

Like produces a predicate that returns 'true' if its input Row matches all the corresponding values from the specified 'match' Row.

func Not ¶

func Not(pred func(Row) bool) func(Row) bool

Not produces a new predicate that negates the return value of the given predicate.
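To make the behaviour of the four predicate helpers concrete, here is a small standard-library sketch of how such combinators can be written and composed. The lowercase functions `like`, `all`, `anyOf` and `not` are local stand-ins written for this illustration, not the package's own code, and their behaviour for an empty argument list is an assumption of the sketch.

```go
package main

import "fmt"

// Row is a local stand-in for the package's Row type.
type Row map[string]string

// like matches rows that contain every column of match with an equal value.
func like(match Row) func(Row) bool {
	return func(row Row) bool {
		for col, want := range match {
			if got, ok := row[col]; !ok || got != want {
				return false
			}
		}
		return true
	}
}

// all is true only when every predicate is true for the row.
func all(preds ...func(Row) bool) func(Row) bool {
	return func(row Row) bool {
		for _, p := range preds {
			if !p(row) {
				return false
			}
		}
		return true
	}
}

// anyOf is true when at least one predicate is true for the row.
func anyOf(preds ...func(Row) bool) func(Row) bool {
	return func(row Row) bool {
		for _, p := range preds {
			if p(row) {
				return true
			}
		}
		return false
	}
}

// not negates the given predicate.
func not(pred func(Row) bool) func(Row) bool {
	return func(row Row) bool { return !pred(row) }
}

func main() {
	row := Row{"name": "Amelia", "surname": "Pond", "id": "7"}
	isAmelia := like(Row{"name": "Amelia"})
	isSmith := like(Row{"surname": "Smith"})

	fmt.Println(all(isAmelia, not(isSmith))(row))   // true
	fmt.Println(anyOf(isSmith, not(isAmelia))(row)) // false
}
```

Because every helper returns a plain `func(Row) bool`, the results can be nested freely and handed to any operation that accepts a predicate, such as Filter.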

Types ¶

type DataSource ¶

type DataSource func(RowFunc) error

DataSource is the iterator type used throughout this library. The iterator calls the given RowFunc once for each row. The iteration continues until either the data source is exhausted or the supplied RowFunc returns a non-nil error, in which case the error is returned to the caller of the iterator. As a special case, an io.EOF error simply stops the iteration and the iterator function returns a nil error.
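The iteration contract described above can be sketched with the standard library alone. In the block below the three type declarations repeat the documented signatures, while `fromRows` and `filter` are hypothetical helpers written for this illustration; the use of `errors.Is` to detect io.EOF is an assumption of the sketch, not a statement about the package's internals.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Local stand-ins that mirror the documented type signatures.
type Row map[string]string
type RowFunc func(Row) error
type DataSource func(RowFunc) error

// fromRows builds a DataSource over a slice, following the contract:
// stop on the first error, and treat io.EOF as a clean stop.
func fromRows(rows []Row) DataSource {
	return func(fn RowFunc) error {
		for _, row := range rows {
			if err := fn(row); err != nil {
				if errors.Is(err, io.EOF) {
					return nil // io.EOF only ends the iteration
				}
				return err
			}
		}
		return nil
	}
}

// filter is a lazy stage: it returns a new DataSource and does no work
// until that DataSource is invoked with a callback.
func filter(src DataSource, pred func(Row) bool) DataSource {
	return func(fn RowFunc) error {
		return src(func(row Row) error {
			if pred(row) {
				return fn(row)
			}
			return nil
		})
	}
}

func main() {
	src := fromRows([]Row{{"name": "Amelia"}, {"name": "John"}, {"name": "Amelia"}})
	amelias := filter(src, func(r Row) bool { return r["name"] == "Amelia" })

	count := 0
	err := amelias(func(Row) error {
		count++
		return io.EOF // stop after the first matching row
	})
	fmt.Println(count, err) // 1 <nil>
}
```

Each stage wraps the callback of the stage before it, which is why a pipeline of any length still makes a single pass over the data.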

func Take ¶

func Take(src interface {
	Iterate(fn RowFunc) error
}) DataSource

Take converts anything with an Iterate() method to a DataSource.

func TakeRows ¶ added in v0.2.1

func TakeRows(rows []Row) DataSource

TakeRows converts a slice of Rows to a DataSource.

func (DataSource) Drop ¶ added in v0.3.0

func (src DataSource) Drop(n uint64) DataSource

Drop specifies the number of Rows to ignore before passing the remaining rows down the pipeline.

func (DataSource) DropColumns ¶ added in v0.3.0

func (src DataSource) DropColumns(columns ...string) DataSource

DropColumns removes the specified columns from each row.

func (DataSource) DropWhile ¶ added in v0.3.0

func (src DataSource) DropWhile(pred func(Row) bool) DataSource

DropWhile ignores all the Rows for as long as the specified predicate is true; afterwards all the remaining Rows are passed down the pipeline.

func (DataSource) Except ¶ added in v0.3.0

func (src DataSource) Except(index *Index, columns ...string) DataSource

Except returns a table containing all the rows not in the specified Index, unchanged. The specified columns are matched against those from the index, in the order of specification. If no columns are specified then the columns list is taken from the index.

func (DataSource) Filter ¶ added in v0.3.0

func (src DataSource) Filter(pred func(Row) bool) DataSource

Filter takes a predicate which, when applied to a Row, decides if that Row should be passed down for further processing. The predicate should return 'true' to pass the Row.

func (DataSource) IndexOn ¶ added in v0.3.0

func (src DataSource) IndexOn(columns ...string) (*Index, error)

IndexOn iterates the input source, building an index on the specified columns. Columns are taken from the specified list from left to right.

func (DataSource) Join ¶ added in v0.3.0

func (src DataSource) Join(index *Index, columns ...string) DataSource

Join returns a DataSource which is a join between the current DataSource and the specified Index. The specified columns are matched against those from the index, in the order of specification. An empty 'columns' list yields a join on the columns from the Index (aka "natural join"), all of which must exist in the current DataSource. Each row in the resulting table contains all the columns from both the current table and the index. This is a lazy operation: the actual join is performed only when the resulting table is iterated over.

func (DataSource) Map ¶ added in v0.3.0

func (src DataSource) Map(mf func(Row) Row) DataSource

Map takes a function which gets applied to each Row when the source is iterated over. The function may return a modified input Row, or an entirely new Row.

func (DataSource) SelectColumns ¶ added in v0.3.0

func (src DataSource) SelectColumns(columns ...string) DataSource

SelectColumns leaves only the specified columns on each row. It is an error if any such column does not exist.

func (DataSource) TakeWhile ¶ added in v0.3.0

func (src DataSource) TakeWhile(pred func(Row) bool) DataSource

TakeWhile takes a predicate which gets applied to each Row upon iteration. The iteration stops when the predicate returns 'false'.
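TakeWhile and DropWhile are easy to confuse with Filter, since all three take a predicate. The sketch below shows the difference on plain slices: both only look at the leading run of rows. The `takeWhile` and `dropWhile` functions are local, eager stand-ins written for illustration, whereas the package's operations are lazy stages on a DataSource.

```go
package main

import "fmt"

// Row is a local stand-in for the package's Row type.
type Row map[string]string

// takeWhile keeps the leading rows for which pred holds
// and stops at the first row that fails it.
func takeWhile(rows []Row, pred func(Row) bool) []Row {
	for i, row := range rows {
		if !pred(row) {
			return rows[:i]
		}
	}
	return rows
}

// dropWhile skips the leading rows for which pred holds
// and keeps everything from the first failing row onwards.
func dropWhile(rows []Row, pred func(Row) bool) []Row {
	for i, row := range rows {
		if !pred(row) {
			return rows[i:]
		}
	}
	return nil
}

func main() {
	rows := []Row{{"status": "old"}, {"status": "old"}, {"status": "new"}, {"status": "old"}}
	isOld := func(r Row) bool { return r["status"] == "old" }

	fmt.Println(len(takeWhile(rows, isOld))) // 2
	fmt.Println(len(dropWhile(rows, isOld))) // 2: the trailing "old" row is kept
}
```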

func (DataSource) ToCsv ¶ added in v0.3.0

func (src DataSource) ToCsv(out io.Writer, columns ...string) (err error)

ToCsv iterates the data source and writes the selected columns in .csv format to the given io.Writer. The data are written in the "canonical" form with the header on the first line and with all the lines having the same number of fields, using default settings for the underlying csv.Writer.

func (DataSource) ToCsvFile ¶ added in v0.3.0

func (src DataSource) ToCsvFile(name string, columns ...string) error

ToCsvFile iterates the data source and writes the selected columns in .csv format to the given file. The data are written in the "canonical" form with the header on the first line and with all the lines having the same number of fields, using default settings for the underlying csv.Writer.

func (DataSource) ToJSON ¶ added in v0.3.3

func (src DataSource) ToJSON(out io.Writer) (err error)

ToJSON iterates over the data source and writes all Rows to the given io.Writer in JSON format.

func (DataSource) ToJSONFile ¶ added in v0.3.3

func (src DataSource) ToJSONFile(name string) error

ToJSONFile iterates over the data source and writes all Rows to the given file in JSON format.

func (DataSource) ToRows ¶ added in v0.3.0

func (src DataSource) ToRows() (rows []Row, err error)

ToRows iterates the DataSource storing the result in a slice of Rows.

func (DataSource) Top ¶ added in v0.3.0

func (src DataSource) Top(n uint64) DataSource

Top specifies the number of Rows to pass down the pipeline before stopping the iteration.

func (DataSource) Transform ¶ added in v0.3.0

func (src DataSource) Transform(trans func(Row) (Row, error)) DataSource

Transform is the most generic operation on a Row. It takes a function that maps a Row to another Row and an error. Any error returned from that function stops the iteration, otherwise the returned Row, if not empty, gets passed down to the next stage of the processing pipeline.
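Since Transform can drop a row, replace it, or abort the iteration, it covers what Filter, Map and Validate each do separately. The following standard-library sketch shows those three outcomes. The type declarations repeat the documented signatures, and `fromRows`, `transform` and `nonZeroQty` are hypothetical names introduced here; this is not the package's implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// Local stand-ins that mirror the documented type signatures.
type Row map[string]string
type RowFunc func(Row) error
type DataSource func(RowFunc) error

// fromRows feeds the rows of a slice to the callback, stopping on error.
func fromRows(rows []Row) DataSource {
	return func(fn RowFunc) error {
		for _, row := range rows {
			if err := fn(row); err != nil {
				return err
			}
		}
		return nil
	}
}

// transform applies trans to each row: an error stops the iteration,
// an empty result drops the row, anything else is passed downstream.
func transform(src DataSource, trans func(Row) (Row, error)) DataSource {
	return func(fn RowFunc) error {
		return src(func(row Row) error {
			out, err := trans(row)
			if err != nil {
				return err
			}
			if len(out) == 0 {
				return nil
			}
			return fn(out)
		})
	}
}

// nonZeroQty validates the "qty" column, drops rows with zero quantity
// and returns a new Row for the rest.
func nonZeroQty(row Row) (Row, error) {
	qty, err := strconv.Atoi(row["qty"])
	if err != nil {
		return nil, fmt.Errorf("item %s: %w", row["item"], err)
	}
	if qty == 0 {
		return nil, nil
	}
	return Row{"item": row["item"], "qty": row["qty"]}, nil
}

func main() {
	src := fromRows([]Row{
		{"item": "apple", "qty": "3"},
		{"item": "pear", "qty": "0"},
		{"item": "plum", "qty": "x"},
	})
	err := transform(src, nonZeroQty)(func(row Row) error {
		fmt.Println(row["item"]) // prints "apple" only
		return nil
	})
	fmt.Println(err != nil) // true: the "plum" row fails validation
}
```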

func (DataSource) UniqueIndexOn ¶ added in v0.3.0

func (src DataSource) UniqueIndexOn(columns ...string) (*Index, error)

UniqueIndexOn iterates the input source, building a unique index on the specified columns. Columns are taken from the specified list from left to right.

func (DataSource) Validate ¶ added in v0.3.0

func (src DataSource) Validate(vf func(Row) error) DataSource

Validate takes a function which checks every Row upon iteration and returns an error if the validation fails. The iteration stops at the first error encountered.

type DataSourceError ¶

type DataSourceError struct {
	Line uint64 // counting from 1
	Err  error
}

DataSourceError is the type of the error returned from Reader.Iterate method.

func (*DataSourceError) Error ¶

func (e *DataSourceError) Error() string

Error returns a human-readable error message string.

type Index ¶

type Index struct {
	// contains filtered or unexported fields
}

Index is a sorted collection of Rows with O(log(n)) complexity of search on the indexed columns. Iteration over the Index yields a sequence of Rows sorted on the index.

func LoadIndex ¶ added in v0.2.4

func LoadIndex(fileName string) (*Index, error)

LoadIndex reads the index from the specified file.

func (*Index) Find ¶

func (index *Index) Find(values ...string) DataSource

Find returns a DataSource of all Rows from the Index that match the specified values in the indexed columns, left to right. The number of specified values may be less than the number of the indexed columns.

func (*Index) Iterate ¶ added in v0.3.0

func (index *Index) Iterate(fn RowFunc) error

Iterate iterates over all rows of the index. The rows are sorted by the values of the columns specified when the Index was created.

func (*Index) ResolveDuplicates ¶

func (index *Index) ResolveDuplicates(resolve func(rows []Row) (Row, error)) error

ResolveDuplicates calls the specified function once for each group of duplicate rows that share the same key. The specified function must not modify its parameter and is expected to do one of the following:

- Select and return one row from the input list. The row will be used as the only row with its key;

- Return an empty row. The entire set of rows will be ignored;

- Return an error which will be passed back to the caller of ResolveDuplicates().
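A resolver that follows the three rules above might look like the sketch below. The function name `keepLargestQty` and the "qty" column are assumptions made for this example, and `Row` is declared locally so that the block compiles on its own.

```go
package main

import (
	"fmt"
	"strconv"
)

// Row is a local stand-in for the package's Row type.
type Row map[string]string

// keepLargestQty has the shape func(rows []Row) (Row, error).
// It picks the duplicate with the largest "qty" value, reads its input
// without modifying it, and reports unparsable quantities as an error.
func keepLargestQty(rows []Row) (Row, error) {
	var best Row
	bestQty := 0
	for _, row := range rows {
		qty, err := strconv.Atoi(row["qty"])
		if err != nil {
			return nil, fmt.Errorf("invalid qty %q: %w", row["qty"], err)
		}
		if best == nil || qty > bestQty {
			best, bestQty = row, qty
		}
	}
	return best, nil
}

func main() {
	duplicates := []Row{
		{"id": "1", "qty": "3"},
		{"id": "1", "qty": "12"},
		{"id": "1", "qty": "7"},
	}
	row, err := keepLargestQty(duplicates)
	fmt.Println(row["qty"], err) // 12 <nil>
}
```

With the real package, a function of this shape written against its Row type could be passed as the `resolve` argument of ResolveDuplicates.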

func (*Index) SubIndex ¶

func (index *Index) SubIndex(values ...string) *Index

SubIndex returns an Index containing only the rows where the values of the indexed columns match the supplied values, left to right. The number of specified values must be less than the number of indexed columns.

func (*Index) WriteTo ¶ added in v0.2.4

func (index *Index) WriteTo(fileName string) (err error)

WriteTo writes the index to the specified file.

type Reader ¶ added in v0.3.2

type Reader struct {
	// contains filtered or unexported fields
}

Reader is an iterable csv reader. The iteration is performed in its Iterate() method, which may be invoked only once per instance of the Reader.

func FromFile ¶ added in v0.3.0

func FromFile(name string) *Reader

FromFile constructs a new csv reader bound to the specified file, with default settings.

func FromReadCloser ¶ added in v0.3.2

func FromReadCloser(input io.ReadCloser) *Reader

FromReadCloser constructs a new csv reader from the given io.ReadCloser, with default settings.

func FromReader ¶ added in v0.3.2

func FromReader(input io.Reader) *Reader

FromReader constructs a new csv reader from the given io.Reader, with default settings.

func (*Reader) AssumeHeader ¶ added in v0.3.2

func (r *Reader) AssumeHeader(spec map[string]int) *Reader

AssumeHeader sets the header for those input sources that do not have their column names specified on the first row. The header specification is a map from the assigned column names to their corresponding column indices.

func (*Reader) CommentChar ¶ added in v0.3.2

func (r *Reader) CommentChar(c rune) *Reader

CommentChar sets the symbol that starts a comment.

func (*Reader) Delimiter ¶ added in v0.3.2

func (r *Reader) Delimiter(c rune) *Reader

Delimiter sets the symbol to be used as a field delimiter.

func (*Reader) ExpectHeader ¶ added in v0.3.2

func (r *Reader) ExpectHeader(spec map[string]int) *Reader

ExpectHeader sets the header for input sources that have their column names specified on the first row. The row gets verified against this specification when the reading starts. The header specification is a map from the expected column names to their corresponding column indices. A negative value for an index means that the real value of the index will be found by searching the first row for the specified column name.
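One way to read the header specification rules is shown in the sketch below, which resolves a `map[string]int` spec against the first row of the input. The `resolveHeader` and `indexOf` helpers are hypothetical and written only for this illustration. In particular, treating verification as "the name found at a non-negative index must equal the expected name" is an assumption of the sketch rather than documented behaviour.

```go
package main

import "fmt"

// indexOf returns the position of name in fields, or -1 if it is absent.
func indexOf(fields []string, name string) int {
	for i, field := range fields {
		if field == name {
			return i
		}
	}
	return -1
}

// resolveHeader turns a header specification into concrete column indices.
// A negative index is resolved by searching the first row for the name;
// a non-negative index is checked against the name found at that position.
func resolveHeader(spec map[string]int, firstRow []string) (map[string]int, error) {
	resolved := make(map[string]int, len(spec))
	for name, idx := range spec {
		if idx < 0 {
			idx = indexOf(firstRow, name)
			if idx < 0 {
				return nil, fmt.Errorf("column %q not found", name)
			}
		} else if idx >= len(firstRow) || firstRow[idx] != name {
			return nil, fmt.Errorf("column %q not found at index %d", name, idx)
		}
		resolved[name] = idx
	}
	return resolved, nil
}

func main() {
	firstRow := []string{"id", "name", "surname"}
	resolved, err := resolveHeader(map[string]int{"id": 0, "surname": -1}, firstRow)
	fmt.Println(resolved, err) // map[id:0 surname:2] <nil>
}
```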

func (*Reader) Iterate ¶ added in v0.3.2

func (r *Reader) Iterate(fn RowFunc) error

Iterate reads the input row by row, converts each line to the Row type, and calls the supplied RowFunc.

func (*Reader) LazyQuotes ¶ added in v0.3.2

func (r *Reader) LazyQuotes() *Reader

LazyQuotes specifies that a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field of the input.

func (*Reader) NumFields ¶ added in v0.3.2

func (r *Reader) NumFields(n int) *Reader

NumFields sets the expected number of fields on each row of the input. It is an error if any row does not have this exact number of fields.

func (*Reader) NumFieldsAny ¶ added in v0.3.2

func (r *Reader) NumFieldsAny() *Reader

NumFieldsAny specifies that each row of the input may have a different number of fields. Rows shorter than the maximum column index in the header specification will be padded with empty fields.

func (*Reader) NumFieldsAuto ¶ added in v0.3.2

func (r *Reader) NumFieldsAuto() *Reader

NumFieldsAuto specifies that the number of fields on each row must match that of the first row of the input.

func (*Reader) SelectColumns ¶ added in v0.3.2

func (r *Reader) SelectColumns(names ...string) *Reader

SelectColumns specifies the names of the columns to read from the input source. The header specification is built by searching the first row of the input for the names specified and recording the indices of those columns. It is an error if any column name is not found.

func (*Reader) TrimLeadingSpace ¶ added in v0.3.2

func (r *Reader) TrimLeadingSpace() *Reader

TrimLeadingSpace specifies that the leading white space in a field should be ignored.

type Row ¶

type Row map[string]string

Row represents one line from a data source like a .csv file.

Each Row is a map from column names to the string values under those columns on the current line. It is assumed that each column has a unique name. In a .csv file, the column names may either come from the first line of the file ("expected header"), or they can be set up via configuration of the reader object ("assumed header").

Using meaningful column names instead of indices is usually more convenient when the columns get rearranged during the execution of the processing pipeline.

func (Row) Clone ¶

func (row Row) Clone() Row

Clone returns a copy of the current Row.

func (Row) HasColumn ¶

func (row Row) HasColumn(col string) (found bool)

HasColumn is a predicate returning 'true' when the specified column is present.

func (Row) Header ¶

func (row Row) Header() []string

Header returns a slice of all column names, sorted via sort.Strings.

func (Row) SafeGetValue ¶

func (row Row) SafeGetValue(col, subst string) string

SafeGetValue returns the value under the specified column, if present, otherwise it returns the substitution value.

func (Row) Select ¶

func (row Row) Select(cols ...string) (Row, error)

Select takes a list of column names and returns a new Row containing only the specified columns, or an error if any column is not present.

func (Row) SelectExisting ¶

func (row Row) SelectExisting(cols ...string) Row

SelectExisting takes a list of column names and returns a new Row containing only those columns from the list that are present in the current Row.

func (Row) SelectValues ¶

func (row Row) SelectValues(cols ...string) ([]string, error)

SelectValues takes a list of column names and returns a slice of their corresponding values, or an error if any column is not present.

func (Row) String ¶

func (row Row) String() string

String returns a string representation of the Row.

func (Row) ValueAsFloat64 ¶ added in v0.2.3

func (row Row) ValueAsFloat64(column string) (res float64, err error)

ValueAsFloat64 returns the value of the given column converted to floating point type, or an error. The column must be present on the row.

func (Row) ValueAsInt ¶ added in v0.2.3

func (row Row) ValueAsInt(column string) (res int, err error)

ValueAsInt returns the value of the given column converted to integer type, or an error. The column must be present on the row.

type RowFunc ¶

type RowFunc func(Row) error

RowFunc is the function type used when iterating Rows.

Source Files ¶

csvplus.go
