csv_cleaning/

README

CSV Cleaning and Organization

When dealing with CSV data or other forms of tabular data, you will likely want to filter the data on certain fields, extract subsets of it, and so on. For example, you might be interested only in the rows where the Iris Species column has a certain value, or you may be interested in splitting the dataset into training and test sets for a machine learning algorithm. The Go data science community has produced a few great packages that can help you with these tasks.

Notes

Use encoding/csv unless there is a need to do more complicated filtering, merging, etc.
Dataframes are useful for quick filtering, subsetting, merging, etc. with your dataset in memory.
The CSV driver for database/sql is useful for iterating over your dataset while cleaning or organizing it, without pulling it all into memory. (On Windows, either set up gcc or install the purego version of ql: go get -tags purego github.com/cznic/ql)

Links

github.com/kniren/gota - Dataframes package
go-hep.org/x/hep/csvutil - CSV library and utility for database/sql

Code Review

Create and print a dataframe from a CSV file
Filter/select/subset a dataframe
Iterate over CSV records, reading data into a struct
Register a CSV as a table, execute SQL statements on the CSV

Exercises

Exercise 1

Use Gota dataframes to read iris.csv and output three files, one per Iris species (setosa.csv, versicolor.csv, and virginica.csv), each containing only the rows for that species.

Template | Answer

Exercise 2

Use csvutil/csvdriver to read iris.csv, sum the float values in the first four columns, and output a processed CSV file with two columns delimited by semicolons, the first having the sum value for the row and the second having the respective species.

Template | Answer

All material is licensed under the Apache License Version 2.0, January 2004.

Directories

Path	Synopsis
example1	Sample program to read in records from an example CSV file to a dataframe.
example2	Sample program to create a dataframe and subsequently filter and subset the dataframe.
example3	Sample program to register a CSV file as an in-memory SQL database and execute SQL queries on the CSV.
example4	Sample program to register a CSV file as an in-memory SQL database and execute SQL queries on the CSV.
exercises
exercise1	Sample program to read in a CSV, create three filtered datasets, and save those datasets to three separate files.
exercise2	Sample program to register a CSV file as an in-memory SQL database, sum float columns, and output a processed CSV.
template1	Sample program to read in a CSV, create three filtered datasets, and save those datasets to three separate files.
template2	Sample program to register a CSV file as an in-memory SQL database, sum float columns, and output a processed CSV.
