data/

directory
v0.0.0-...-b3f521c
Published: Apr 19, 2017 License: Apache-2.0

README

Ultimate Data

This is material for developers who have some experience with Go and statistics and want to learn how to work with data to make better decisions. We believe these classes are perfect for data analysts/scientists/engineers interested in working in Go or Go programmers interested in doing data analysis.

Design Guidelines

You must develop a design philosophy that establishes a set of guidelines. This is more important than developing a set of rules or patterns you apply blindly. Guidelines help to formulate, drive and validate decisions. You can't begin to make the best decisions without understanding the impact of your decisions. Every decision you make, every line of code you write comes with trade-offs.

What is data analysis?

Data analysis uses Datasets to make Decisions that have corresponding Actions and Consequences.

Prepare your mind

Every data analytics or data science project must begin by considering the:

  1. Decisions
  2. Actions
  3. Consequences

Before and during any data analytics project, you must be able to answer the following questions:

  • What decisions do I want to make based on the results?
  • What actions are triggered by the decisions that will be made?
  • What are the consequences of those actions?
  • What do the results need to contain?
  • What is the data required to produce a valid result?
  • How will I measure whether the results are valid?
  • Can the results be effectively conveyed to decision makers?
  • Am I confident in the results?

Remember, uncertainty is not a license to guess but a directive to stop.

Order of Operations

Data analytics projects should follow these steps in this order:

  1. Understand the decisions, actions and consequences involved.
  2. Understand the relevant data to be gathered and analyzed.
  3. Gather and organize the relevant data.
  4. Understand the readability and expectations for determining valid results.
  5. Determine the most interpretable process to produce the valid results.
  6. Determine how you will test the validity of the results.
  7. Develop the determined process and tests.
  8. Test the results and evaluate against your expectations.
  9. Refactor as necessary.
  10. Look for ways to simplify, minimize and reduce.

When the results don’t meet the expectations, ask yourself whether modifying the determined process or data would improve the validity of the results.

  • If YES, you must re-evaluate:
    • Are such modifications warranted?
    • Can the modification be tested against the expectations?
    • Do I need to increase complexity?
    • Have I tested the most simplistic and interpretable solutions first?
  • If NO, you must re-evaluate:
    • Am I using the best determined process?
    • Am I using the best data?
    • Are my expectations incorrect?

Guidelines, Decision Making and Trade-Offs

Develop your design philosophy around these major categories in this order: Integrity, Value, Readability/Interpretability, and Performance. You must consciously and with great reason be able to explain the category you are choosing.

Note: There are exceptions to everything, but when you are not sure an exception applies, follow the guidelines presented as best you can.

1) Integrity - If data science uses Datasets to make Decisions, a breakdown in integrity results in bad decisions. These decisions impact people, and therefore, making bad decisions may cause irreparable damage to real people. Nothing trumps integrity - EVER.

Rules of Integrity:

  • Error handling code is the main code.
  • You must understand the data.
  • Control the input and output of your processes.
  • You must be able to reproduce results.

2) Value - Effort without actionable results is not valuable. Just because you can produce a result does not mean the result contains value.

Rules of Value:

  • If an action cannot be taken based on a result, the result does not have value.
  • If the impact of a result cannot be measured, the result does not have value.

3) Readability and Interpretability - This is about writing simple analyses that are easy to read and understand without mental exhaustion. However, this is also about avoiding unnecessary data transformations and analysis complexity that hides:

  • The cost/impact of individual steps of the analyses.
  • The underlying purpose of the data transformations and analyses.

4) Performance - This is about making your analyses run as fast as possible and produce results that minimize a given measure of error. When code is written with this as the priority, it is very difficult to write code that is readable, simple or idiomatic. If, for example, increasing the accuracy of a given result by 0.001% takes a significant increase in effort or complexity and doesn’t produce more value or different actions, the optimization effort is not warranted.

Directories

Path Synopsis
caching
example1
All material is licensed under the Apache License Version 2.0, January 2004 http://www.apache.org/licenses/LICENSE-2.0 go build ./example1 Sample program to show how to cache data from an API.
example2
Sample program to save data from an API in an embedded k/v store.
exercises/exercise1
Sample program to show how to cache data from an API, and then use that data in analyzing a dataset.
exercises/template1
Sample program to show how to cache data from an API, and then use that data in analyzing a dataset.
classification_kNN
example1
Sample program to profile our data set.
example2
Sample program to train and validate a kNN model with cross validation.
exercises/exercise1
Program for finding an optimal k value for a k nearest neighbors model.
exercises/template1
Template program for finding an optimal k value for a k nearest neighbors model.
classification_trees
example1
Sample program to train and validate a decision tree model with cross validation.
example2
Sample program to determine an optimal value of the decision tree pruning parameter.
exercises/exercise1
Sample program to visualize the accuracy of models with various decision tree pruning parameters.
exercises/template1
Sample program to visualize the accuracy of models with various decision tree pruning parameters.
csv_cleaning
example1
Sample program to read in records from an example CSV file to a dataframe.
example2
Sample program to create a dataframe and subsequently filter and subset the dataframe.
example3
Sample program to register a CSV file as an in-memory SQL database and execute SQL queries on the CSV.
example4
Sample program to register a CSV file as an in-memory SQL database and execute SQL queries on the CSV.
exercises/exercise1
Sample program to read in a CSV, create three filtered datasets, and save those datasets to three separate files.
exercises/exercise2
Sample program to register a CSV file as an in-memory SQL database, sum float columns, and output a processed CSV.
exercises/template1
Sample program to read in a CSV, create three filtered datasets, and save those datasets to three separate files.
exercises/template2
Sample program to register a CSV file as an in-memory SQL database, sum float columns, and output a processed CSV.
csv_io
example1
Sample program to read in records from an example CSV file.
example2
Sample program to read in records from an example CSV file, and catch an unexpected extra field in the data.
example3
Sample program to read in records from an example CSV file, and catch unexpected types in a single column.
example4
Sample program to save records to a CSV file.
exercises/exercise1
Sample program to read in records from an example CSV file, and catch unexpected types in any of the columns.
exercises/exercise2
Sample program to read in records from an example CSV file, catch unexpected types in any of the columns, and output processed data to a different CSV file.
exercises/template1
Sample program to read in records from an example CSV file, and catch unexpected types in any of the columns.
exercises/template2
Sample program to read in records from an example CSV file, catch unexpected types in any of the columns, and output processed data to a different CSV file.
data_versioning
example1
Sample program that connects to a running instance of Pachyderm.
example2
Sample program that creates a Pachyderm data repository.
example3
Sample program that commits data into Pachyderm data versioning.
example4
Sample program that gets a versioned dataset/file from Pachyderm.
exercises/exercise1
Sample program that creates a Pachyderm data repository.
exercises/exercise2
Sample program that commits data into Pachyderm's data versioning.
exercises/template1
Sample program that creates a Pachyderm data repository.
exercises/template2
Sample program that commits data into Pachyderm's data versioning.
dimensionality_reduction
example1
Sample program to illustrate the calculation of principal components.
example2
Sample program to visualize the impact of dimensionality reduction.
example3
Sample program to project iris data onto principal components.
exercises/exercise1
Sample program to project iris data onto 3 principal components.
exercises/template1
Sample program to project iris data onto 3 principal components.
evaluation
example1
Sample program to calculate an R^2 value.
example2
Sample program to calculate a mean absolute error.
example3
Sample program to calculate accuracy.
example4
Sample program to calculate precision.
example5
Sample program to calculate recall.
exercises/exercise1
Sample program to calculate specificity.
exercises/exercise2
Sample program to calculate a mean squared error.
exercises/template1
Sample program to calculate specificity.
exercises/template2
Sample program to calculate a mean squared error.
hypothesis_testing
example1
Sample program to calculate expected values.
example2
Sample program to calculate a chi-squared value.
example3
Sample program to output the result of the test, based on a critical value.
integrity
example2
Sample program to compare parsing a clean CSV with Go to parsing a clean CSV with Python.
example4
Sample program to illustrate maintaining integrity with Go in the presence of messy data.
exercises/exercise1
Sample program to illustrate maintaining integrity with Go in the presence of messy data.
exercises/template1
Sample program to illustrate maintaining integrity with Go in the presence of messy data.
json
example1
Sample program to show how to unmarshal JSON data from an API.
example2
Sample program to show how to save JSON data to a file.
exercises/exercise1
Sample program to show how to unmarshal JSON data from an API.
exercises/template1
Sample program to show how to unmarshal JSON data from an API.
matrices
example1
Sample program to read in records from an example CSV file and form a matrix with gonum.
example2
Sample program to show modifications to matrices.
example3
Sample program to access values within a matrix.
example4
Sample program to illustrate various ways to format matrix output.
exercises/exercise1
Sample program to read in records from a CSV file and form a matrix with gonum.
exercises/template1
Sample program to read in records from a CSV file and form a matrix with gonum.
matrix_operations
example1
Sample program to show basic matrix operations.
example2
Sample program to compute the transpose, determinant, and inverse of a matrix.
example3
Sample program to solve an eigenvalue/vector problem.
example4
Sample program to compute vector and matrix norms.
exercises/exercise1
Sample program to divide a matrix by its norm.
exercises/template1
Sample program to divide a matrix by its norm.
regression
example1
Sample program to profile our data set.
example2
Sample program to investigate correlations between our target and our features.
example3
Sample program to create training, test, and holdout data sets.
example4
Sample program to train and test a regression model.
example5
Sample program to validate a trained regression model on a holdout data set.
exercises/exercise1b
Sample program to train and test a multiple regression model.
exercises/exercise1c
Sample program to validate a trained multiple regression model on a holdout data set.
exercises/template1b
Sample program to train and test a multiple regression model.
exercises/template1c
Sample program to validate a trained multiple regression model on a holdout data set.
sql
example1
Sample program to connect to and ping a database connection.
example2
Sample program to load the iris dataset into a database.
example3
Sample program to retrieve results from a database.
example4
Sample program to modify data in a database.
exercises/exercise1
Sample program to retrieve results from a database.
exercises/exercise2
Sample program to delete rows in a database table.
exercises/template1
Sample program to retrieve results from a database.
exercises/template2
Sample program to delete rows in a database table.
stats_measures
example1
Sample program to calculate means, modes, and medians.
example2
Sample program to calculate means, modes, and medians.
example3
Sample program to calculate standard deviation and variance.
example4
Sample program to calculate quantiles.
exercises/exercise1
Sample program to calculate both central tendency and statistical dispersion measures for the iris dataset.
exercises/template1
Sample program to calculate both central tendency and statistical dispersion measures for the iris dataset.
stats_visualization
example1
Sample program to generate a histogram of a normal distribution.
example2
Sample program to generate a histogram of the iris data variables.
example3
Sample program to generate a box plot of example distributions.
example4
Sample program to generate box plots of the iris data variables.
exercises/exercise1
Sample program to generate a box plot of diabetes BMI values.
exercises/exercise2
Sample program to generate a histogram of diabetes BMI values.
exercises/template1
Sample program to generate a box plot of diabetes BMI values.
exercises/template2
Sample program to generate a histogram of diabetes BMI values.
