validator

v0.0.0-...-96dee63
Published: May 11, 2016 License: BSD-2-Clause Imports: 17 Imported by: 0

README

Data Models Validator

A validator for CSV files adhering to the Data Models specification.

Install

Download the latest binary from the releases page for your architecture: Windows, Linux, or OS X. The following examples assume the binary has been placed on your PATH with the name data-models-validator.

Usage

Run the validator by specifying the model and version of the data model to check against, followed by one or more input files. If a file is not named after its corresponding table, specify the table by appending :table_name to the file name.

Validate the person.csv file against the PEDSnet v2 data model:

$ data-models-validator -model pedsnet -version 2.0.0 person.csv

Validate foo.csv against the person table in the PEDSnet v2 data model:

$ data-models-validator -model pedsnet -version 2.0.0 foo.csv:person
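
The file:table argument form shown above can be illustrated with a small sketch. The helper name splitInput is hypothetical and is not taken from the validator's source; the fallback to the file's base name is an assumption based on the description of files "named after the corresponding tables".

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitInput is a hypothetical helper, not part of the validator's API.
// It splits an argument of the form "file.csv:table" into its file and
// table parts. Without a ":table" suffix, the table name is assumed to be
// the file's base name with the extension removed.
func splitInput(arg string) (file, table string) {
	if i := strings.LastIndex(arg, ":"); i != -1 {
		return arg[:i], arg[i+1:]
	}
	base := filepath.Base(arg)
	return arg, strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	for _, arg := range []string{"person.csv", "foo.csv:person"} {
		file, table := splitInput(arg)
		fmt.Printf("file=%s table=%s\n", file, table)
	}
}
```

This sketch does not handle paths that themselves contain a colon, such as Windows drive letters.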

Run the following to see the full usage:

$ data-models-validator -help

Functionality

The validator checks the following:

  • header matches fields of specified table
  • each row of data has the correct number of fields
  • data is encoded in UTF-8
  • quotes within data values are escaped
  • date and datetime data is valid and properly formatted
  • integer and number (float) data is valid and fits in 32-bit types
  • required data is not left null
  • string data does not exceed defined max lengths

The validator does not check:

  • foreign key referential integrity
  • data model conventions such as correct concept usage
  • uniqueness of data across rows

Known Bugs

If the validator is run several times in quick succession, an error from the underlying data models service is thrown:

$ data-models-validator -model i2b2_pedsnet -version 2.0.0 obs-fact.csv:observation_fact
error decoding model revisions: invalid character '<' looking for beginning of value

No harm is done to any data, and the problem can be circumvented by waiting several minutes before re-running. The development team is working on a fix.

Future Directions

Soon, we hope to add higher level validation checks, especially of referential integrity.

Output Examples

If the filename does not match a table in the model, a list of known tables in the model is printed. The solution is to specify the table by adding :table_name to the end of your file name:

$ data-models-validator -model pedsnet -version 2.0.0 PEDSNET_DRUG_EXPOSURE.csv
Validating against model 'pedsnet/2.0.0'
* Unknown table 'PEDSNET_DRUG_EXPOSURE'.
Choices are: care_site, concept, concept_ancestor, concept_class, concept_relationship, concept_synonym, condition_occurrence, death, domain, drug_exposure, drug_strength, fact_relationship, location, measurement, observation, observation_period, person, procedure_occurrence, provider, relationship, source_to_concept_map, visit_occurrence, visit_payer, vocabulary

$ data-models-validator -model pedsnet -version 2.0.0 PEDSNET_DRUG_EXPOSURE.csv:drug_exposure

If the header does not match the expected set of fields, the expected and actual number of fields as well as any unknown and/or missing fields found in the header are printed:

$ data-models-validator -model pedsnet -version 2.0.0 original_visit_occurrence.csv:visit_occurrence
Validating against model 'pedsnet/2.0.0'
* Evaluating 'visit_occurrence' table in 'original_visit_occurrence.csv'...
* Problem reading CSV header: line 0: [code: 201] Header does not contain the correct set of fields: (expectedLength:12, actualLength:13, unknownFields:[visit_occurrence_source_id])
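
The header check described above compares the header against the table's expected fields and reports unknown and missing names. A minimal sketch of that comparison follows; diffHeader is a hypothetical name and this is not the validator's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// diffHeader is a hypothetical helper. It returns the fields present in
// actual but not in expected (unknown) and the fields present in expected
// but not in actual (missing), each sorted alphabetically.
func diffHeader(expected, actual []string) (unknown, missing []string) {
	want := make(map[string]bool, len(expected))
	for _, f := range expected {
		want[f] = true
	}
	seen := make(map[string]bool, len(actual))
	for _, f := range actual {
		seen[f] = true
		if !want[f] {
			unknown = append(unknown, f)
		}
	}
	for _, f := range expected {
		if !seen[f] {
			missing = append(missing, f)
		}
	}
	sort.Strings(unknown)
	sort.Strings(missing)
	return unknown, missing
}

func main() {
	expected := []string{"visit_occurrence_id", "person_id"}
	actual := []string{"visit_occurrence_id", "person_id", "visit_occurrence_source_id"}
	unknown, missing := diffHeader(expected, actual)
	fmt.Println("unknown:", unknown, "missing:", missing)
}
```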

If the file passes validation, success is reported:

$ data-models-validator -model pedsnet -version 2.0.0 person.csv
Validating against model 'pedsnet/2.0.0'
* Evaluating 'person' table in 'person.csv'...
* Everything looks good!

If errors are found in the data, they are reported. For each field in which an error is found, and for each type of error in that field, the number of occurrences and a small random sample of the offending values (each prefixed with its line number) are shown:

$ data-models-validator -model pedsnet -version 2.0.0 measurement.csv
Validating against model 'pedsnet/2.0.0'
* Evaluating 'measurement' table in 'measurement.csv'...
* A few issues were found
+--------------------------+------+--------------------------------+-------------+--------------------------------+
|          FIELD           | CODE |             ERROR              | OCCURRENCES |            SAMPLES             |
+--------------------------+------+--------------------------------+-------------+--------------------------------+
| measurement_source_value |  300 | Value is required              |         159 | 96897086:'' 96920063:''        |
|                          |      |                                |             | 96973571:'' 96899225:''        |
|                          |      |                                |             | 96912743:''                    |
| unit_source_value        |  300 | Value is required              |     8938919 | 56172499:'' 84397591:''        |
|                          |      |                                |             | 64597721:'' 64982471:''        |
|                          |      |                                |             | 63311022:''                    |
| value_source_value       |  203 | Value contains bare double     |           7 | 96847421:'COMPLEXITY           |
|                          |      | quotes (")                     |             | CLINICAL LABORATORY            |
|                          |      |                                |             | TESTING."' 96847441:'"THIS     |
|                          |      |                                |             | TEST WAS DEVELOPED AND ITS     |
|                          |      |                                |             | PERFORMANCE CHARACTERISTICS'   |
|                          |      |                                |             | 64833023:'"X"=30.0%'           |
|                          |      |                                |             | 96847452:'"THIS TEST           |
|                          |      |                                |             | WAS DEVELOPED AND ITS          |
|                          |      |                                |             | PERFORMANCE CHARACTERISTICS'   |
|                          |      |                                |             | 64833023:'"X"=30.0%'           |
+--------------------------+------+--------------------------------+-------------+--------------------------------+

Developers

Glide is used for vendoring dependencies, so it must be installed and on the PATH.

Install the dependencies:

$ glide install

Build the binary:

$ make build

Documentation

Overview

Adapted from: https://github.com/gwenn/yacr/blob/b33898940948270a0198c7db28d6b7efc18b783e/reader.go

Constants

const (
	DateLayout     = "2006-01-02"
	DatetimeLayout = "2006-01-02 15:04:05"
)

Variables

var DateValidator = &Validator{
	Name: "Date",

	Description: "Validates the input value is a valid date.",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		if _, err := time.Parse(DateLayout, s); err != nil {

			if err := DatetimeValidator.Validate(s, cxt); err == nil {
				return nil
			}

			return &ValidationError{
				Err: ErrTypeMismatchDate,
			}
		}

		return nil
	},
}

DateValidator validates that the raw value is a date. A value in the datetime format is also accepted.
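
The fallback in DateValidator means a datetime string also passes the date check. The following standalone sketch reproduces that logic with the two layouts from the Constants section; isDate is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	dateLayout     = "2006-01-02"          // same value as DateLayout
	datetimeLayout = "2006-01-02 15:04:05" // same value as DatetimeLayout
)

// isDate mirrors the DateValidator logic: a value is accepted if it parses
// as a date or, failing that, as a datetime.
func isDate(s string) bool {
	if _, err := time.Parse(dateLayout, s); err == nil {
		return true
	}
	_, err := time.Parse(datetimeLayout, s)
	return err == nil
}

func main() {
	for _, s := range []string{"2016-05-11", "2016-05-11 13:45:00", "05/11/2016", "2016-02-30"} {
		fmt.Printf("%q accepted: %v\n", s, isDate(s))
	}
}
```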

var DatetimeValidator = &Validator{
	Name: "Datetime",

	Description: "Validates the input value is a valid date time.",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		if _, err := time.Parse(DatetimeLayout, s); err != nil {
			return &ValidationError{
				Err: ErrTypeMismatchDateTime,
			}
		}

		return nil
	},
}

DatetimeValidator validates that the raw value is a datetime.

var EncodingValidator = &Validator{
	Name: "Encoding",

	Description: "Validates a string only contains utf-8 characters.",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		if !utf8.ValidString(s) {
			var bad []rune

			for i, r := range s {
				if r == utf8.RuneError {
					bs, size := utf8.DecodeRuneInString(s[i:])

					if size == 1 {
						bad = append(bad, bs)
					}
				}
			}

			return &ValidationError{
				Err: ErrBadEncoding,
				Context: Context{
					"badRunes": bad,
				},
			}
		}

		return nil
	},
}
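
As a related sketch, the same standard-library calls can be used to locate invalid bytes rather than collect runes. The function invalidOffsets is hypothetical and is not part of this package; it only illustrates how utf8.RuneError with a decoded size of 1 distinguishes an invalid byte from a legitimately encoded U+FFFD.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidOffsets returns the byte offsets in s at which an invalid UTF-8
// byte was found. A correctly encoded U+FFFD decodes with size 3 and is
// therefore not reported.
func invalidOffsets(s string) []int {
	var offsets []int
	for i, r := range s {
		if r != utf8.RuneError {
			continue
		}
		if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
			offsets = append(offsets, i)
		}
	}
	return offsets
}

func main() {
	fmt.Println(invalidOffsets("caf\xe9")) // Latin-1 byte, invalid as UTF-8
	fmt.Println(invalidOffsets("café"))    // valid UTF-8
}
```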

var ErrBadEncoding = &Error{
	Code:        100,
	Description: "UTF-8 encoding required",
}

var ErrBadHeader = &Error{
	Code:        201,
	Description: "Header does not contain the correct set of fields",
}

var ErrBareQuote = &Error{
	Code:        203,
	Description: `Value contains bare double quotes (")`,
}

var ErrExtraColumns = &Error{
	Code:        202,
	Description: "Extra columns were detected in line",
}

var ErrLengthExceeded = &Error{
	Code:        302,
	Description: "Value exceeds the maximum length",
}

var ErrPrecisionExceeded = &Error{
	Code:        303,
	Description: "Numeric precision exceeded",
}

var ErrRequiredValue = &Error{
	Code:        300,
	Description: "Value is required",
}

var ErrScaleExceeded = &Error{
	Code:        304,
	Description: "Numeric scale exceeded",
}

var ErrTypeMismatch = &Error{
	Code:        301,
	Description: "Value is not the correct type",
}

var ErrTypeMismatchDate = &Error{
	Code:        307,
	Description: "Value is not a date (YYYY-MM-DD)",
}

var ErrTypeMismatchDateTime = &Error{
	Code:        308,
	Description: "Value is not a datetime (YYYY-MM-DD HH:MM:SS)",
}

var ErrTypeMismatchInt = &Error{
	Code:        305,
	Description: "Value is not an integer (int32)",
}

var ErrTypeMismatchNum = &Error{
	Code:        306,
	Description: "Value is not a number (float32)",
}

var ErrUnquotedColumn = &Error{
	Code:        205,
	Description: `Non-empty column must be quoted.`,
}

var ErrUnterminatedColumn = &Error{
	Code:        204,
	Description: `Column is not terminated with a quote.`,
}

Map of errors by code.

var EscapedQuotesValidator = &Validator{
	Name: "EscapedQoutes",

	Description: "Validates any quote characters in a string are escaped.",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		i := strings.Index(s, `"`)

		for i != -1 {
			if i == len(s)-1 || s[i+1] != '"' {
				return &ValidationError{
					Err: ErrBareQuote,
				}
			} else {
				s = s[i+2:]
			}

			i = strings.Index(s, `"`)
		}

		return nil
	},
}
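
To make the escaping rule concrete, here is a standalone sketch of the same scan: every double quote must be immediately followed by a second one. The function hasBareQuote is a hypothetical name; the logic follows the Validate function above.

```go
package main

import (
	"fmt"
	"strings"
)

// hasBareQuote reports whether s contains a double quote that is not part
// of an escaped pair (""). It follows the same scan as the validator:
// find a quote, require the next byte to be a quote, skip both, repeat.
func hasBareQuote(s string) bool {
	for {
		i := strings.Index(s, `"`)
		if i == -1 {
			return false
		}
		if i == len(s)-1 || s[i+1] != '"' {
			return true
		}
		s = s[i+2:]
	}
}

func main() {
	for _, s := range []string{`"X"=30.0%`, `""X""=30.0%`, `no quotes`} {
		fmt.Printf("%q bare quote: %v\n", s, hasBareQuote(s))
	}
}
```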
var IntegerValidator = &Validator{
	Name: "Integer",

	Description: "Validates the input string is a valid integer.",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		if _, err := strconv.ParseInt(s, 10, 32); err != nil {
			return &ValidationError{
				Err: ErrTypeMismatchInt,
			}
		}

		return nil
	},
}

IntegerValidator validates the raw value is an integer.

var NumberValidator = &Validator{
	Name: "Number",

	Description: "Validates the input string is a valid number (float).",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		if _, err := strconv.ParseFloat(s, 32); err != nil {
			return &ValidationError{
				Err: ErrTypeMismatchNum,
			}
		}

		return nil
	},
}

NumberValidator validates the raw value is a number.
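
Both numeric validators pass a bit size of 32 to strconv, so values that are syntactically valid but out of range for 32-bit types are rejected. The sketch below shows the effect with two hypothetical wrappers, fitsInt32 and fitsFloat32.

```go
package main

import (
	"fmt"
	"strconv"
)

// fitsInt32 reports whether s is a base-10 integer within the int32 range.
func fitsInt32(s string) bool {
	_, err := strconv.ParseInt(s, 10, 32)
	return err == nil
}

// fitsFloat32 reports whether s parses as a number representable as float32.
func fitsFloat32(s string) bool {
	_, err := strconv.ParseFloat(s, 32)
	return err == nil
}

func main() {
	for _, s := range []string{"2147483647", "2147483648", "12.5"} {
		fmt.Printf("%-12s int32 ok: %v\n", s, fitsInt32(s))
	}
	for _, s := range []string{"12.5", "3.5e38", "abc"} {
		fmt.Printf("%-12s float32 ok: %v\n", s, fitsFloat32(s))
	}
}
```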

var RequiredValidator = &Validator{
	Name: "Required",

	Description: "Validates the input value is not empty.",

	Validate: func(s string, cxt Context) *ValidationError {
		if s == "" {
			return &ValidationError{
				Err: ErrRequiredValue,
			}
		}

		return nil
	},
}

RequiredValidator validates that the raw value is not empty. This only applies to fields that are marked as required in the spec.

var StringLengthValidator = &Validator{
	Name: "String Length",

	Description: "Validates the input value is not longer than a pre-defined length.",

	RequiresValue: true,

	Validate: func(s string, cxt Context) *ValidationError {
		length := cxt["length"].(int)

		if len(s) > length {
			return &ValidationError{
				Err: ErrLengthExceeded,
				Context: Context{
					"maxLength": length,
				},
			}
		}

		return nil
	},
}

StringLengthValidator validates the string value does not exceed a pre-defined length.
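
Note that the check above uses len(s), which in Go counts bytes rather than characters, so multi-byte UTF-8 values reach the limit sooner. The sketch below contrasts the two ways of counting; exceedsRunes is a hypothetical alternative shown only for comparison, and this page does not state whether the model's maximum lengths are meant in bytes or characters.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// exceedsBytes applies the same comparison as the validator: byte length.
func exceedsBytes(s string, max int) bool {
	return len(s) > max
}

// exceedsRunes is a hypothetical variant that counts Unicode code points.
func exceedsRunes(s string, max int) bool {
	return utf8.RuneCountInString(s) > max
}

func main() {
	s := "José" // 4 characters, 5 bytes in UTF-8
	fmt.Println("bytes:", len(s), "runes:", utf8.RuneCountInString(s))
	fmt.Println("exceeds 4 by bytes:", exceedsBytes(s, 4))
	fmt.Println("exceeds 4 by runes:", exceedsRunes(s, 4))
}
```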

var (

	// Full semantic version for the service.
	Version = semver.Version{
		Major: progMajor,
		Minor: progMinor,
		Patch: progPatch,
		Pre: []semver.PRVersion{{
			VersionStr: progReleaseLevel,
		}},
	}
)

Functions

This section is empty.

Types

type BoundValidator

type BoundValidator struct {
	Validator *Validator
	Context   Context
}

BoundValidator binds a validator to a context.

func Bind

func Bind(v *Validator, cxt Context) *BoundValidator

Bind returns a bound validator given a validator and context.

func BindFieldValidators

func BindFieldValidators(f *client.Field) []*BoundValidator

BindFieldValidators returns a set of validators for the field.

func (*BoundValidator) String

func (b *BoundValidator) String() string

func (*BoundValidator) Validate

func (b *BoundValidator) Validate(s string) *ValidationError

type CSVReader

type CSVReader struct {

	// If true, the scanner will continue scanning if field-level errors are
	// encountered. The error should be checked after each call to Scan to
	// handle the error.
	ContinueOnError bool
	// contains filtered or unexported fields
}

CSVReader provides an interface for reading CSV data (compatible with rfc4180 and extended with the option of having a separator other than ","). Successive calls to the Scan method will step through the 'fields', skipping the separator/newline between the fields. The EndOfRecord method tells when a field is terminated by a line break.
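
The scanning pattern described above can be sketched as a loop that collects fields until EndOfRecord reports a line break. To keep the example self-contained, it is written against a small interface listing the documented method names; fakeScanner is a stand-in invented for this sketch, and the exact behavior of the real CSVReader at end of input is an assumption.

```go
package main

import "fmt"

// fieldScanner lists the CSVReader methods used below, as documented.
type fieldScanner interface {
	Scan() bool
	Text() string
	EndOfRecord() bool
	Err() error
}

// readRecords groups scanned fields into records using EndOfRecord.
func readRecords(s fieldScanner) ([][]string, error) {
	var records [][]string
	var row []string
	for s.Scan() {
		row = append(row, s.Text())
		if s.EndOfRecord() {
			records = append(records, row)
			row = nil
		}
	}
	// Flush a final record in case the input lacked a trailing newline.
	if len(row) > 0 {
		records = append(records, row)
	}
	return records, s.Err()
}

// fakeScanner is a stand-in that replays in-memory rows (each non-empty).
type fakeScanner struct {
	rows [][]string
	r, c int // position of the next field
	cur  string
	eor  bool
}

func (f *fakeScanner) Scan() bool {
	if f.r >= len(f.rows) {
		return false
	}
	f.cur = f.rows[f.r][f.c]
	f.c++
	f.eor = f.c == len(f.rows[f.r])
	if f.eor {
		f.r, f.c = f.r+1, 0
	}
	return true
}

func (f *fakeScanner) Text() string      { return f.cur }
func (f *fakeScanner) EndOfRecord() bool { return f.eor }
func (f *fakeScanner) Err() error        { return nil }

func main() {
	s := &fakeScanner{rows: [][]string{{"person_id", "year"}, {"1", "1990"}}}
	records, err := readRecords(s)
	fmt.Println(records, err)
}
```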

func DefaultCSVReader

func DefaultCSVReader(rd io.Reader) *CSVReader

DefaultCSVReader creates a "standard" CSV reader.

func NewCSVReader

func NewCSVReader(r io.Reader, sep byte) *CSVReader

NewCSVReader returns a new CSV scanner.

func (*CSVReader) ColumnNumber

func (s *CSVReader) ColumnNumber() int

ColumnNumber returns the column index of the current field.

func (*CSVReader) EndOfRecord

func (s *CSVReader) EndOfRecord() bool

EndOfRecord returns true when the most recent field has been terminated by a newline (not a separator).

func (*CSVReader) Err

func (s *CSVReader) Err() error

Err returns an error if one occurred during scanning.

func (*CSVReader) Line

func (s *CSVReader) Line() string

Line returns the current line as a string.

func (*CSVReader) LineNumber

func (s *CSVReader) LineNumber() int

LineNumber returns the current line number.

func (*CSVReader) Read

func (s *CSVReader) Read() ([]string, error)

Read scans all fields in one line and builds a slice of values.

func (*CSVReader) Scan

func (s *CSVReader) Scan() bool

func (*CSVReader) ScanLine

func (s *CSVReader) ScanLine(r []string) error

ScanLine scans all fields in one line and puts the values in the passed slice.

func (*CSVReader) Text

func (s *CSVReader) Text() string

Text returns the text of the current field.

type Context

type Context map[string]interface{}

func (Context) String

func (c Context) String() string

type Error

type Error struct {
	Code        int
	Description string
}

Error defines a specific type of error denoted by the description. A code is defined as a shorthand for the error and to act as a lookup for the error itself. Errors are classified by code:

  • 1xx: encoding related issues
  • 2xx: parse related issues
  • 3xx: value related issues
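
The classification above depends only on the hundreds digit of the code. A minimal sketch, using a hypothetical category function that is not part of this package:

```go
package main

import "fmt"

// category maps an error code to the class listed in the documentation.
// The function name and the returned labels are illustrative only.
func category(code int) string {
	switch code / 100 {
	case 1:
		return "encoding"
	case 2:
		return "parse"
	case 3:
		return "value"
	default:
		return "unknown"
	}
}

func main() {
	for _, code := range []int{100, 203, 305} {
		fmt.Printf("%d: %s\n", code, category(code))
	}
}
```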

func (Error) Error

func (e Error) Error() string

type Plan

type Plan struct {
	FieldValidators map[string][]*BoundValidator
}

Plan is composed of the set of validators used to evaluate the field values.

type Reader

type Reader struct {
	Name        string
	Compression string
	// contains filtered or unexported fields
}

Reader encapsulates a stdin stream.

func Open

func Open(name, compr string) (*Reader, error)

Open a reader by name with optional compression. If no name is specified, STDIN is used.

func (*Reader) Close

func (r *Reader) Close()

Close implements the io.Closer interface.

func (*Reader) Read

func (r *Reader) Read(buf []byte) (int, error)

Read implements the io.Reader interface.

type Result

type Result struct {
	// contains filtered or unexported fields
}

Result maintains the validation results currently consisting of validation errors.

func NewResult

func NewResult() *Result

func (*Result) FieldErrors

func (r *Result) FieldErrors(f string) map[*Error][]*ValidationError

FieldErrors returns errors for field grouped by error code.
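
The return type map[*Error][]*ValidationError suggests grouping by the shared *Error pointer. The following sketch shows one way such grouping could work; the types are trimmed local copies of the documented structs, and groupByError is a hypothetical function rather than the package's implementation.

```go
package main

import "fmt"

// Trimmed local copies of the documented types, for illustration only.
type Error struct {
	Code        int
	Description string
}

type ValidationError struct {
	Err   *Error
	Line  int
	Field string
	Value string
}

// groupByError collects the errors recorded for one field, keyed by the
// *Error pointer. Grouping works because each kind of error is a single
// shared value, as with the package-level Err... variables.
func groupByError(errs []*ValidationError, field string) map[*Error][]*ValidationError {
	out := make(map[*Error][]*ValidationError)
	for _, e := range errs {
		if e.Field == field {
			out[e.Err] = append(out[e.Err], e)
		}
	}
	return out
}

func main() {
	required := &Error{Code: 300, Description: "Value is required"}
	errs := []*ValidationError{
		{Err: required, Line: 2, Field: "unit_source_value"},
		{Err: required, Line: 9, Field: "unit_source_value"},
		{Err: required, Line: 4, Field: "measurement_source_value"},
	}
	for e, list := range groupByError(errs, "unit_source_value") {
		fmt.Printf("[%d] %s: %d occurrences\n", e.Code, e.Description, len(list))
	}
}
```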

func (*Result) LineErrors

func (r *Result) LineErrors() map[*Error][]*ValidationError

LineErrors returns the line errors.

func (*Result) LogError

func (r *Result) LogError(verr *ValidationError)

LogError logs an error to the result.

type TableValidator

type TableValidator struct {
	Fields *client.Fields
	Header []string

	Plan *Plan
	// contains filtered or unexported fields
}

func New

func New(reader io.Reader, table *client.Table) *TableValidator

New returns a TableValidator that validates the data read from an io.Reader against a data model table.

func (*TableValidator) Init

func (t *TableValidator) Init() error

Init initializes the validator by checking the header and compiling a set of validators for each field.

func (*TableValidator) Next

func (t *TableValidator) Next() error

Next reads the next row and validates it. Row and field level errors are logged and not returned. Errors that are returned are EOF and unexpected errors.
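
Because Next returns only EOF and unexpected errors, a caller can drive it in a loop that treats io.EOF as normal completion. The sketch below is written against a one-method interface so it runs on its own; drain and countdown are hypothetical names, and the real TableValidator would be used in place of the stand-in.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// rowValidator lists the single TableValidator method used here.
type rowValidator interface {
	Next() error
}

// drain calls Next until io.EOF and returns the number of rows processed.
// Any other error stops the loop and is returned to the caller.
func drain(v rowValidator) (int, error) {
	rows := 0
	for {
		err := v.Next()
		if errors.Is(err, io.EOF) {
			return rows, nil
		}
		if err != nil {
			return rows, err
		}
		rows++
	}
}

// countdown is a stand-in that yields n rows and then io.EOF.
type countdown struct{ n int }

func (c *countdown) Next() error {
	if c.n == 0 {
		return io.EOF
	}
	c.n--
	return nil
}

func main() {
	rows, err := drain(&countdown{n: 3})
	fmt.Println(rows, err)
}
```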

func (*TableValidator) Result

func (t *TableValidator) Result() *Result

Result returns the result of the validation.

func (*TableValidator) Run

func (t *TableValidator) Run() error

Run executes all of the validators for the input. All parse and validation errors are handled so the only error that should stop the validator is EOF.

type UniversalReader

type UniversalReader struct {
	// contains filtered or unexported fields
}

UniversalReader wraps an io.Reader to replace carriage returns with newlines. This is used with the csv.Reader so it can properly delimit lines.

func (*UniversalReader) Read

func (r *UniversalReader) Read(buf []byte) (int, error)

type ValidateFunc

type ValidateFunc func(value string, cxt Context) *ValidationError

type ValidationError

type ValidationError struct {
	Err     *Error
	Line    int
	Field   string
	Value   string
	Context Context
}

ValidationError is composed of an error with an optional line and field that the error is specific to. Additional context can be supplied in the Context field.

func (ValidationError) Error

func (e ValidationError) Error() string

type Validator

type Validator struct {
	Name          string
	Description   string
	Validate      ValidateFunc
	RequiresValue bool
}

func (*Validator) String

func (v *Validator) String() string

Directories

Path Synopsis
cmd
