structured

package
v0.0.0-...-9f035b8 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 18, 2015 License: Apache-2.0 Imports: 17 Imported by: 0

Documentation

Overview

Package structured provides a high-level API for application access to an underlying cockroach datastore. Cockroach itself provides a single monolithic, sorted key-value map. Structured translates between familiar RDBMS concepts such as tables, columns and indexes to the key-value store.

Schemas

Schemas declare the structure of data for application access to an underlying Cockroach datastore. Data schemas are declared in YAML files with support for familiar RDBMS concepts such as tables, columns, and indexes. Tables in Cockroach's structured data API are semi-relational. Every row (synonymous with "tuple") in a table must have a unique primary key.

The term "schema" used here is synonymous with "database". A cockroach cluster may contain multiple schemas (or "databases"). Schemas are comprised of tables. Tables are comprised of columns. Columns store values corresponding to basic types including integer (int64), float (float64), string (utf8), blob ([]byte). Columns also store certain composite types including Time and LatLong (for location), IntegerSet (map[int64]struct{}), StringSet (map[string]struct{}), IntegerMap (map[string]int64) and StringMap (map[string]string).

Columns can be designated to form an index. Indexes include secondary indexes, unique secondary indexes, location indexes, and full-text indexes. Additional index types may be added.

An important convention is the use of both a name and a key to describe schemas, tables and columns. The name is used for all external access; the key is used internally for storing keys and values efficiently. Both are specified in the schema declaration; each (name, key) pair for a schema, table or column should be chosen to correspond; the key should be reminiscent of the name in order to aid when debugging. Keys should be short; they are limited to 1-3 characters and may not contain the '/' or ':' characters. Both names and keys must be unique according to the following rules: schemas must be unique in a Cockroach cluster; tables must be unique in a schema; columns must be unique in a table.

Cockroach configuration zones can be specified for the schema as a whole and overridden for individual tables.

YAML Schema Declaration

The general structure of schema files is as follows:

db:     <DB Name>
db_key: <DB Key>
tables: [<Table>, <Table>, ...]

Tables are specified as follows:

  • table: <Table Name> table_key: <Table Key> columns: [<Column>, <Column>, ...]

Columns are specified as follows:

  • column: <Column Name> column_key: <Column Key> type: (integer | float | string | blob | time | latlong | integerset | stringset | integermap | stringmap)> auto_increment: <start-value> foreign_key: <Table>.<Column> index: (secondary | unique | location | fulltext) interleave true on_delete: (cascade | setnull) primary_key: true scatter: true

An example YAML schema configuration file:

db:     PhotoDB
db_key: pdb
tables:
- table:     User
  table_key: us
  columns:
  - column:      ID
    column_key:  id
    type:        integer
    primary_key: true

  - column:     Name
    column_key: na
    type:       string

  - column:     Email
    column_key: em
    type:       string
    index:      uniquesecondary

- table:     Identity
  table_key: id
  columns:
  - column:      Key
    column_key:  ke
    type:        string
    primary_key: true

  - column:      UserID
    column_key:  ui
    type:        integer
    foreign_key: User.ID

Go-Centric Schema Declarations

Package structured also provides Go-centric access via support of struct field tags with the designation "roach". Data schemas may be declared in a style natural to Go, but which map directly to the canonical Cockroach YAML schema format. Methods are provided for easily querying individual items by key, ranges of items by key if key is sorted, or by secondary index if indexed.

Schemas may be defined simply by declaring Go structs with appropriate field tags and then invoking NewGoSchema(...). The struct field tags specify column properties, such as whether the column is part of the primary key, a foreign key, etc. See below for a full list of tag specifications.

import "github.com/cockroach/schema"

// User is a top-level table. User IDs are scattered, meaning a two
// byte hash of the ID from the UserID sequence is prepended to yield
// a randomly distributed keyspace.
type User struct {
	ID   int64 `roach:"pk,auto,scatter"`
	Name string
}

// Identity key takes form of "email:<valid-email-address>"
// or "phone:<valid-phone-number>". A single user may have multiple
// identities.
type Identity struct {
	Key    string `roach:"pk,scatter"`
	UserID int64  `roach:"fk=User.ID,ondelete=setnull"`
}

// Get the schema from the go struct declarations.
sm := map[string]interface{}{
  'us': User{},
  'id': Identity{},
}
s := NewGoSchema('db', sm)

Cockroach tags for struct fields have tag name "roach" followed by the column key and a comma-separated list of <key>[=<value>] specifications, where value can be optional depending on the key. The general form is as follows:

roach:"<column-key>,[<key1[=value1]>,<key2[=value2]>...]"

The <column-key> is a 1-3 letter designation for the column (e.g. 'ui' for 'UserId' and unique within a table) which maintains stable, human readable keys for debugging but which takes minimal bytes for storage. Each instance of this struct will have non-default values encoded using the column keys. Keeping these small can make a significant difference in storage efficiency.

Tag Specifications

"roach" tag specifications are as follows:

fk=<table[.column]>: (Foreign Key) specifies the field is a foreign
key. <table.column> specifies which table and column the foreign key
references. If a foreign key references an object with a composite
primary key, all fields comprising the composite primary key must be
included in the struct with "fk" tags. "column" may be omitted if
the referenced table contains a single-valued primary key.
"ondelete" optionally specifies behavior if referenced object is
deleted. Foreign keys create a secondary index. It isn't necessary
to specify secondaryindex; however, to specify that a foreign key
denotes a one-to-one relation, specify "uniqueindex". "interleave"
optionally specifies that all of the data for structs of this type
will be placed "next to" the referenced table for data locality.

fulltextindex: the column is indexed by segmenting the text into
words and indexing each word separately. The index supports phrase
queries using word position data for each indexed term.

interleave: specified with foreign keys to co-locate dependent data
within a single "entity group" (to use Google's Megastore parlance).
Interleaved data yields faster lookups when querying complete sets
of data (e.g. a comment topic and all comments posted to it), and
faster transactional writes in certain common cases.

locationindex: the column is indexed by location. The details are
implementation-dependent, but the reference impl uses S2 geometry
patches to canvas the specified location. The column type must be
schema.LatLong.

ondelete=<behavior>: the behavior in the event that the object which
a foreign key column references is deleted. The two supported values
are "cascade" and "setnull". "cascade" deletes the object with the
foreign key reference. "setnull" is less destructive, merely setting
the foreign key column to nil. If "interleave" was specified for
this foreign key, then "cascade" is the mandatory default value;
specifying "setnull" for an interleaved foreign key results in a
schema validation error.

pk: (Primary Key) specifies the field is the primary key or part of
a composite primary key. The first field with pk specified will form
the prefix of the key, each additional field with pk specified will
be appended in order. "scatter" may be included only on the first
field with pk specified.

scatter: randomizes the placement of the data within the table's
keyspace by prepending a two-byte hash of the entire primary key to
the actual key used to store the value in cockroach. "scatter" may
only be specified on the first field with "pk" specified.

secondaryindex: a secondary index on the column value. Tuples from
this table (or alternatively, instances of this struct) may be
queried by this column value using the resulting index.

auto[=<start-value>]: specifies that the value of this field
auto-increments from a monotonically increasing sequence starting
at the optional start value.

uniqueindex: a secondary index where uniqueness of the column value
is enforced.

Disconnected Mode

Client may be used in a disconnected mode, which uses internal data structures to stand in for an actual Cockroach cluster. This is useful for unittesting data schemas and migration functions.

How Data is Stored

The structured package contrives to co-locate an arbitrary number of schemas within a single Cockroach cluster without namespace collisions. An example schema:

db:     PhotoDB
db_key: pdb
tables:
- table:     User
  table_key: us
  columns:
  - column:      ID
    column_key:  id
    type:        integer
    primary_key: true

  - column:     Name
    column_key: na
    type:       string

  - column:     Email
    column_key: em
    type:       string
    index:      uniquesecondary

- table:     Identity
  table_key: id
  columns:
  - column:      Key
    column_key:  ke
    type:        string
    primary_key: true

  - column:      UserID
    column_key:  ui
    type:        integer
    foreign_key: User.ID

All data belonging to a schema is stored with keys prefixed by <db_key>. In the configuration example above, this would correspond to a key prefix of "pdb". Tables are further prefixed by the table key. A tuple from table User would have key prefix "pdb/us". The tuple's key suffix is an encoded concatenation of column values marked as part of the primary key. For a user with ID=531, the full key would be "pdb/us/<OrderedEncode(531)>" (from here on out, we'll shorten OrderedEncode to just "E"). The value stored with the key is a gob-encoded map[string]interface{}, with each column value (excepting primary keys) as an entry in the map, indexed by column key.

pdb/us/<E(529)>: <data for user 529>
pdb/us/<E(530)>: <data for user 530>
pdb/us/<E(531)>: <data for user 531>

If a primary key column has the "scatter" option specified, the encoded value for the key is additionally prefixed with the first two bytes of a hash of the entire primary key value. For example, If the hash of 531 is 1532079858. The first two bytes are 0xf2 and 0xae. The key would be "pdb/us/\xf2\xae<E(531)>". This provides an essentially random location for the columnar data of User 531 within the "pdb/us/..." keyspace. Using the "scatter" option prevents hotspots in non-uniformly distributed data. For example, primary keys generated from a monotonically-increasing sequence or from the current time. Keep in mind, however, that using "scatter" makes range scans impossible.

pdb/us/\x09\x31<E(530)>: <data for user 530>
...
pdb/us/\x50\xae<E(531)>: <data for user 531>
...
pdb/us/\xf2\xb9<E(529)>: <data for user 529>

If a foreign key column has the "interleave" option specified, the data for the table is co-located with the table referenced by the foreign key. For example, let's consider an additional table in the schema, Address, containing addresses for Users (abbreviated for concision).

  • table: Address table_key: ad columns:

  • column: ID column_key: id type: integer primary_key: true

  • column: UserID column_key: ui type: integer foreign_key: User.ID interleave: true

  • column: Street column_key: st type: string

  • column: City column_key: cy type: string

Because an address has no meaning outside of a user, there is value in co-locating keys for a user's addresses with the user data. Transactions modifying both a user and an address (on account creation, for example) are likely to be part of the same cockroach key range and therefore will be committed without requiring a distributed transaction. If "interleave" is specified, the referenced table entity's key is used as a key prefix. This makes for longer keys, but guarantees all data is proximately located. If User 531 has two addresses, with ids 35 & 56 respectively, the underlying data would look like:

pdb/us/<E(529)>: <data for user 529>
...
pdb/us/<E(531)>: <data for user 531>
pdb/us/<E(531)>/ad/<E(35)>: <data for address 35>
pdb/us/<E(531)>/ad/<E(56)>: <data for address 56>
...
pdb/us/<E(600)>: <data for user 600>
...

Indexes

Indexes provide efficient access to rows in the database by columnar data other than primary keys. In the example above, we might want to lookup a User by Email instead of ID, as in the case of login by email address. Email is not the primary key, so without an index, there would be no way to lookup a user without scanning the entire User table to find the email address in question.

A sensible alternative would be to create another table, UserByEmail, with (email, user) as a composite primary key. We could then do a range scan on UserByEmail starting at (<email>, _) to find a matching email and discover the associated user ID in order to then lookup the user for login. This is exactly how indexes work, but Cockroach takes care of all of the necessary work to create and maintain them behind the scenes.

An index on email address for users would require a unique secondary index. Unique because no two users should be allowed the same email address for login. Secondary simply means that the exact value of the email address must be supplied in order to do the lookup. Non-unique secondary indexes are also possible. For example, If User.Name were indexed, two or more users might have identical names. A non-unique secondary index yields a list of users on lookup.

Other index types include "location", which allows queries by latitude and longitude; and "fulltext", which segments string-valued columns into "words", any subset of which may be used to query the index. If fulltext were specified on User.Name, then user IDs could be queried by first, middle or last name separately.

How Index Data is Stored

As discussed in the thought experiment about creating the UserByEmail table, index data is simply stored as another table. In the example of a secondary index on User.Email, a new table with table_key 'us:em' is created. The primary key for this table is a composite of the index term (User.Email) and the primary key of the User table (User.ID).

pdb/us/<E(531)>: <data for user 531, email foo@bar.com>
pdb/us/<E(10247)>: <data for user 10247, email baz@fubar.com>
...
pdb/us:em/<E(baz@fubar.com)><E(10247)>: nil
pdb/us:em/<E(foo@bar.com)><E(531)>: nil

As can be seen above, the keyspace taken by the UserEmail index table follows the User table. In order to lookup the user with email "foo@bar.com", a range scan is initiated for the lower bound of "foo@bar.com", and keys with matching prefix are decoded to yield the list of matching user IDs.

Secondary indexes have a single term and which exactly mirrors their value. User.Email is an example of this. For email=X, the term is X. Secondary indexes contain only keys, no values. If the secondary index is unique, then a range scan is done on insert to verify that no other index term with the same prefix is already present.

Index types other than secondary, such as "location" and "fulltext", may yield multiple index terms for a column value. Full text indexes, for example, yield one term per segmented word. With a fulltext index on User.Name, "Spencer Woolley Kimball" would yield three terms: {"Spencer", "Woolley", "Kimball"}. Full text indexing also stores a special value for each term which indicates the list of positions that term appears in the source string. Further, index types which yield multiple terms store the indexed terms in a special additional "term" column (named <column_key>:t). This list of terms is used to delete terms from the index in the event the column is deleted or updated. If Spencer had user ID 1:

pdb/us/<E(1)>: {em: "spencer.kimball@gmail.com,
                na: "Spencer Woolley Kimball",
                na:t: ["kimball", "spencer", "woolley"]}
...
pdb/us:na/<E(kimball)><E(1)>: {tp: [2]}
pdb/us:na/<E(spencer)><E(1)>: {tp: [0]}
pdb/us:na/<E(woolley)><E(1)>: {tp: [1]}

Terms can and should efficiently combine multiple source ids into a list instead of requiring a separate key for every instance. This is left as future work, as it's non-trivial to do efficiently.

Index

Constants

View Source
const StructuredKeyPrefix = "/schema"

StructuredKeyPrefix is the prefix for RESTful endpoints used to interact with structured data schemas.

Variables

This section is empty.

Functions

This section is empty.

Types

type Column

type Column struct {
	// Name is used externally (i.e. in JSON or Go structs) to refer to
	// this column value.
	Name string `yaml:"column"`

	// Key should be a 1-3 character abbreviation of the Name. The key is
	// used internally for efficiency.
	Key string `yaml:"column_key"`

	// Type is one of "integer", "float", "string", "blob", "time",
	// "latlong", "integerset", "stringset", "integermap", or "stringmap".
	// Integers are int64s. Floats are float64s. Strings must be UTF8
	// encoded. Blobs are arbitrary byte arrays. If sending over JSON,
	// they should be base64 encoded. Latlong are (latitude, longitude,
	// altitude, accuracy) quadruplets, each a float64 (altitude and
	// accuracy are in meters). Integersets are a set of int64
	// values. Integermaps are map[string]int64. Stringsets and
	// stringmaps are similar, but with string values.
	Type string `yaml:"type"`

	// ForeignKey is a foreign key reference specified as
	// <table>[.<column>]. The column suffix is optional in the event
	// that the table has a non-composite primary key. Foreign keys
	// presuppose a secondary index, so specifying one isn't necessary
	// unless the foreign key is a one-to-one relation, in which case
	// a unique index should be specified.
	ForeignKey string `yaml:"foreign_key,omitempty"`

	// Index specifies that this column generates an index. The index
	// type is one of "fulltext", "location", "secondary", or "unique".
	// "fulltext" is valid only for "string"-type columns. It generates
	// a full text index by segmenting the text from a UTF8 string
	// column and indexing each word as a separate term. Full text
	// indexes support phrase searches. "location" is valid only for
	// "latlong"-type columns. It generates S2 geometry patches to
	// canvas the specified latitude/longitude location. "secondary"
	// creates an index with terms equal to this column value.
	// Secondary indexes are created automatically for foreign keys, in
	// which case their terms may be a concatenation of foreign key
	// values in the event the foreign key references a table with a
	// composite primary key. "unique" indexes are the same as
	// "secondary", except a one-to-one relation is enforced; that is,
	// only a single instance of a particular value for this column (or
	// values if part of a composite foreign key) is allowed.
	Index string `yaml:"index,omitempty"`

	// Interleave is specified with foreign keys to co-locate dependent
	// data. Interleaved data is physically proximate to data referenced
	// by this foreign key for faster lookups when querying complete
	// sets of data (e.g. a merchant and all of its inventory), and for
	// faster transactional writes in certain common cases.
	Interleave bool `yaml:"interleave,omitempty"`

	// OnDelete is specified with foreign keys to dictate database
	// behavior in the event the object which the foreign key column
	// references is deleted. The two supported values are "cascade" and
	// "setnull". "cascade" deletes the object with the foreign key
	// reference. "setnull" is less destructive, merely setting the
	// foreign key column to nil. If "interleave" was specified for this
	// foreign key, then "cascade" is the mandatory default value;
	// specifying setnull result in a schema validation error.
	OnDelete string `yaml:"ondelete,omitempty"`

	// PrimaryKey specifies this column is the primary key for the table
	// or part of a composite primary key. The order in which primary
	// key columns are declared dictates the order in which their values
	// are concatenated to form the key. This has obvious implications
	// for range queries.
	PrimaryKey bool `yaml:"primary_key,omitempty"`

	// Scatter is specified with primary keys to effectively randomize
	// the placement of the data within the table's keyspace. This is
	// accomplished by prepending a two-byte hash of the entire primary
	// key to the actual key used to store the value.
	Scatter bool `yaml:"scatter,omitempty"`

	// Auto specifies that the value of this column auto-increments from
	// a monotonically-increasing sequence starting at this field's
	// value. If Auto is nil, the column does not auto-increment.
	Auto *int64 `yaml:"auto_increment,omitempty"`
}

Column contains the schema for a column. The Key should be a shortened identifier related to Name. The Column Key is stored within the row value itself and is used to maintain human readability for debugging, as well as a stable identifier for column data. Column Keys must be unique within a Table.

type DB

type DB interface {
	PutSchema(*Schema) error
	DeleteSchema(*Schema) error
	GetSchema(string) (*Schema, error)
}

A DB interface provides methods to access a datastore using a structured data API.

func NewDB

func NewDB(kvDB *client.DB) DB

NewDB returns a key-value datastore client which connects to the Cockroach cluster via the supplied gossip instance.

type IntegerMap

type IntegerMap map[string]int64

IntegerMap is a map from string key to int64 integer value.

type IntegerSet

type IntegerSet map[int64]struct{}

IntegerSet is a set of int64 integer values.

type LatLong

type LatLong struct {
	// contains filtered or unexported fields
}

LatLong specifies a (latitude, longitude, altitude, accuracy) quadruplet with 64-bit floating point precision. Altitude and accuracy are in meters.

type RESTServer

type RESTServer struct {
	// contains filtered or unexported fields
}

A RESTServer provides a RESTful HTTP API to interact with structured data schemas.

A resource is represented using the following URL scheme:

/schema[/<schema key>][/<table key>][/<primary key>][?<name>=<value>,<name>=<value>...][limit=<int>][offset=<int>]

Some examples:

/schema/pdb/us/531 -> <data for user 531>
/schema/pdb/us/?email=andybons@gmail.com -> <data for user with email andybons@gmail.com>

A user can provide just the top-level schema key in order to introspect the schema layout:

/schema/pdb -> <schema with key pdb>

Or simply provide /schema to list all schemas within the datastore.

Results are always returned within a JSON-serialized array (even for only one result) to provide uniformity for responses and to accommodate for multiple entries that may satisfy a query. If the limit param is not provided, a maximum of 50 items is returned. TODO(andybons): Paging?

func NewRESTServer

func NewRESTServer(db DB) *RESTServer

NewRESTServer allocates and returns a new server.

func (*RESTServer) ServeHTTP

func (s *RESTServer) ServeHTTP(w http.ResponseWriter, r *http.Request)

ServeHTTP arbitrates requests to the appropriate function based on the request’s HTTP method.

type Schema

type Schema struct {
	Name   string     `yaml:"db" json:"db"`
	Key    string     `yaml:"db_key" json:"db_key"`
	Tables TableSlice `yaml:",omitempty" json:"tables,omitempty"`
	// contains filtered or unexported fields
}

Schema contains a named sequence of Table schemas. The Key should be a shortened identifier related to Name. The Schema Key is stored as part of every row key within the database, so size matters and small is better (for once). Schema Keys must be unique within a Cockroach cluster.

func NewGoSchema

func NewGoSchema(name, schemaKey string, schemaMap map[string]interface{}) (*Schema, error)

NewGoSchema returns a schema using name and key and a map from Table Key to an instance of a Go struct corresponding to the Table (the Table name is derived from the Go struct name).

func NewYAMLSchema

func NewYAMLSchema(in []byte) (*Schema, error)

NewYAMLSchema returns a schema based on the YAML input string.

func (*Schema) ToYAML

func (s *Schema) ToYAML() ([]byte, error)

ToYAML marshals the Schema into YAML.

func (*Schema) Validate

func (s *Schema) Validate() error

Validate validates the schema for consistency, correctness and completeness. Foreign keys are matched to their respective tables. Parameters are verified as valid (e.g. OnDelete can only be "cascade" or "setnull"). Refer to the source for the complete list of checks.

type StringMap

type StringMap map[string]string

StringMap is a map from string key to string value.

type StringSet

type StringSet map[string]struct{}

StringSet is a set of string values.

type Table

type Table struct {
	Name    string    `yaml:"table"`
	Key     string    `yaml:"table_key"`
	Columns []*Column `yaml:",omitempty"`
	// contains filtered or unexported fields
}

Table contains the schema for a table. The Key should be a shortened identifier related to Name. The Table Key is stored as part of every row key within the database, so shorter is better. Table Keys must be unique within a Schema.

type TableSlice

type TableSlice []*Table

TableSlice helpfully implements the sort interface.

func (TableSlice) Len

func (ts TableSlice) Len() int

func (TableSlice) Less

func (ts TableSlice) Less(i, j int) bool

func (TableSlice) Swap

func (ts TableSlice) Swap(i, j int)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL