mgodatagen

v0.5.0
Published: Jan 14, 2018 License: MIT Imports: 18 Imported by: 0

A small CLI tool to quickly generate millions of pseudo-random BSON documents and insert them into a MongoDB instance. Quickly test new data structures or how your application responds when your database grows!

Features

  • Support all BSON types listed in MongoDB BSON types
  • Generate real data using faker
  • Create referenced fields across collections
  • Aggregate data across collections
  • Create sharded collections
  • Create collections in multiple databases
  • Cross-platform. Tested on Unix / OSX / Windows

Installation

Download the binary from the release page

or

Build from source:

First, make sure that Go is installed on your machine (see install go for details). Then, use go get:

go get -u "github.com/feliixx/mgodatagen"

Options

Several options are available (you can see the list from mgodatagen --help):

mgodatagen version 0.5.0

Usage:
  mgodatagen

template:
      --new=<filename>         create an empty configuration file

configuration:
  -f, --file=<configfile>      JSON config file. This field is required
  -i, --indexonly              if present, mgodatagen will just try to rebuild index
  -s, --shortname              if present, JSON keys in the documents will be reduced
                               to the first two letters only ('name' => 'na')
  -a, --append                 if present, append documents to the collection without
                               removing older documents or deleting the collection
  -n, --numWorker=<nb>         number of concurrent workers inserting documents
                               in database. Default is number of CPU+1
  -b, --batchsize=<size>       bulk insert batch size (default: 1000)

connection infos:
  -h, --host=<hostname>        mongodb host to connect to (default: 127.0.0.1)
      --port=<port>            server port (default: 27017)
  -u, --username=<username>    username for authentification
  -p, --password=<password>    password for authentification

general:
      --help                   show this help message
  -v, --version                print the tool version and exit
  -q, --quiet                  quieter output

Only the configuration file needs to be specified (-f | --file flag). A basic usage of mgodatagen would be:

./mgodatagen -f config.json 

If no host/port is specified, mgodatagen tries to connect to mongodb://127.0.0.1:27017.

Configuration file

The config file is an array of JSON documents, where each document holds the configuration for one collection to create.

See the MongoDB documentation for details on parameters:


[
  // first collection to create 
  {  
   // REQUIRED FIELDS
   // 
   "database": <string>,              // required, database name
   "collection": <string>,            // required, collection name
   "count": <int>,                    // required, number of documents to insert in the collection 
   "content": {                       // required, the actual schema to generate documents   
     "fieldName1": <generator>,       // required
     "fieldName2": <generator>,       // required, see Generator below
     ...
   },
   // OPTIONAL FIELDS
   //
   // compression level (for WiredTiger engine only)
   // possible values:
   // - none
   // - snappy
   // - zlib 
   "compressionLevel": <string>,      // optional, default: snappy

   // configuration for sharded collection
   "shardConfig": {                   // optional 
      "shardCollection": <string>.<string>, // required. <database>.<collection>
      "key": <object>,                // required, shard key, eg: {"_id": "hashed"}
      "unique": <boolean>,            // optional, default: false
      "numInitialChunks": <int>,      // optional 

      "collation": {                  // optional 
        "locale": <string>,
        "caseLevel": <boolean>,
        "caseFirst": <string>,
        "strength": <int>,
        "numericOrdering": <boolean>,
        "alternate": <string>,
        "maxVariable": <string>,
        "backwards": <boolean>
      }
   },

   // list of indexes to build
   "indexes": [                       // optional  
      {
         "name": <string>,            // required, index name
         "key": <object>,             // required, index key, eg: {"name": 1}
         "sparse": <boolean>,         // optional, default: false
         "unique": <boolean>,         // optional, default: false
         "background": <boolean>,     // optional, default: false
         "bits": <int>,               // optional, for 2d indexes only, default: 26
         "min": <double>,             // optional, for 2d indexes only, default: -180.0
         "max": <double>,             // optional, for 2d index only, default: 180.0
         "bucketSize": <double>,      // optional, for geoHaystack indexes only
         "expireAfterSeconds": <int>, // optional, for TTL indexes only
         "weights": <string>,         // optional, for text indexes only 
         "defaultLanguage": <string>, // optional, for text index only 
         "languageOverride": <string>,// optional, for text index only
         "textIndexVersion": <int>,   // optional, for text index only
         "partialFilterExpression": <object>, // optional 

         "collation": {               // optional 
           "locale": <string>,
           "caseLevel": <boolean>,
           "caseFirst": <string>,
           "strength": <int>,
           "numericOrdering": <boolean>,
           "alternate": <string>,
           "maxVariable": <string>,
           "backwards": <boolean>
         }
      }
   ]
  },
  // second collection to create 
  {
    ...
  }
]
Example

A set of sample config files can be found in the samples directory. To use one, make sure that you have a MongoDB instance running (on 127.0.0.1:27017 for example) and run:

./mgodatagen -f samples/config.json

This will insert 1000000 random documents into the collections test and link of the database test, with the structure defined in the config file.

Generator types

Generators have a common structure:

"fieldName": {                 // required, field name in generated document
  "type": <string>,            // required, type of the field 
  "nullPercentage": <int>,     // optional, int between 0 and 100. Percentage of documents 
                               // that will not have this field
  "maxDistinctValue": <int>,   // optional, maximum number of distinct values for this field
  "typeParam": ...             // specific parameters for this type
}

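List of main <generator> types:

  • string
  • int
  • long
  • double
  • decimal
  • boolean
  • objectId
  • array
  • object
  • binary
  • date

List of custom <generator> types:

  • position
  • constant
  • autoincrement
  • ref
  • fromArray
  • countAggregator
  • valueAggregator
  • boundAggregator

List of Faker <generator> types:

  • faker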

String

Generate a random string whose length is between minLength and maxLength. The string is composed of characters from this list: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_

"fieldName": {
    "type": "string",          // required
    "nullPercentage": <int>,   // optional 
    "maxDistinctValue": <int>, // optional
    "unique": <bool>,          // optional, see details below 
    "minLength": <int>,        // required,  must be >= 0 
    "maxLength": <int>         // required,  must be >= minLength
}
Unique String

If unique is set to true, the field will only contain unique strings. Unique strings have a fixed length: minLength is taken as the length of the string. There are 64^x possible unique strings of length x. This number has to be greater than or equal to the number of documents you want to generate. For example, for unique strings of length 3, there are 64 * 64 * 64 = 262144 possible strings.

They will look like

"aaa",
"aab",
"aac",
"aad",
...
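
The capacity rule and the ordering shown above can be sketched in Go. This is only an illustration of counting in base 64 over the alphabet listed in the String section; `capacity` and `uniqueString` are hypothetical helpers, not mgodatagen's implementation.

```go
package main

import "fmt"

// alphabet is the 64-character set listed in the String section.
const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_"

// capacity returns 64^length, the number of distinct strings of that length.
// It overflows uint64 for length > 10, which is enough for this sketch.
func capacity(length int) uint64 {
	n := uint64(1)
	for i := 0; i < length; i++ {
		n *= uint64(len(alphabet))
	}
	return n
}

// uniqueString returns the i-th string of the given length, treating the
// alphabet as base-64 digits with 'a' as zero (hypothetical helper).
func uniqueString(i uint64, length int) string {
	buf := make([]byte, length)
	for pos := length - 1; pos >= 0; pos-- {
		buf[pos] = alphabet[i%64]
		i /= 64
	}
	return string(buf)
}

func main() {
	fmt.Println(capacity(3)) // 262144
	for i := uint64(0); i < 4; i++ {
		fmt.Println(uniqueString(i, 3)) // aaa, aab, aac, aad
	}
}
```

A config asking for more unique strings than `capacity(minLength)` cannot be satisfied, so checking `count <= capacity(minLength)` before running is a cheap sanity test.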
Int

Generate random int within bounds.

"fieldName": {
    "type": "int",             // required
    "nullPercentage": <int>,   // optional 
    "maxDistinctValue": <int>, // optional
    "minInt": <int>,           // required
    "maxInt": <int>            // required, must be >= minInt
}
Long

Generate random long within bounds.

"fieldName": {
    "type": "long",            // required
    "nullPercentage": <int>,   // optional 
    "maxDistinctValue": <int>, // optional
    "minLong": <long>,         // required
    "maxLong": <long>          // required, must be >= minLong
}
Double

Generate random double within bounds.

"fieldName": {
    "type": "double",          // required
    "nullPercentage": <int>,   // optional
    "maxDistinctValue": <int>, // optional 
    "minDouble": <double>,     // required
    "maxDouble": <double>      // required, must be >= minDouble
}
Decimal

Generate random decimal128

"fieldName": {
    "type": "decimal",         // required
    "nullPercentage": <int>,   // optional
    "maxDistinctValue": <int>  // optional 
}
Boolean

Generate random boolean

"fieldName": {
    "type": "boolean",         // required
    "nullPercentage": <int>,   // optional 
    "maxDistinctValue": <int>  // optional
}
ObjectId

Generate random and unique objectId

"fieldName": {
    "type": "objectId",        // required
    "nullPercentage": <int>,   // optional
    "maxDistinctValue": <int>  // optional 
}
Array

Generate a random array of BSON objects

"fieldName": {
    "type": "array",             // required
    "nullPercentage": <int>,     // optional
    "maxDistinctValue": <int>,   // optional
    "size": <int>,               // required, size of the array 
    "arrayContent": <generator>  // generator used to create the elements that fill the array.
                                 // can be of any type specified in generator types
}
Object

Generate random nested object

"fieldName": {
    "type": "object",                    // required
    "nullPercentage": <int>,             // optional
    "maxDistinctValue": <int>,           // optional
    "objectContent": {                   // required, list of generators used to 
       "nestedFieldName1": <generator>,  // generate the nested document 
       "nestedFieldName2": <generator>,
       ...
    }
}
Binary

Generate random binary data of length within bounds

"fieldName": {
    "type": "binary",           // required
    "nullPercentage": <int>,    // optional 
    "maxDistinctValue": <int>,  // optional
    "minLength": <int>,         // required,  must be >= 0 
    "maxLength": <int>          // required,  must be >= minLength
}
Date

Generate a random date (stored as ISODate)

startDate and endDate are string representations of a date following RFC3339:

format: "yyyy-MM-ddThh:mm:ss+00:00"

"fieldName": {
    "type": "date",            // required
    "nullPercentage": <int>,   // optional 
    "maxDistinctValue": <int>, // optional
    "startDate": <string>,     // required
    "endDate": <string>        // required,  must be >= startDate
}
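
Before running the tool, the two bounds can be checked against the format above. The sketch below uses Go's standard `time.RFC3339` layout; `validateRange` is a hypothetical helper and the sample dates are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// validateRange parses both bounds as RFC 3339 timestamps and checks that
// endDate is not before startDate (hypothetical helper).
func validateRange(startDate, endDate string) error {
	start, err := time.Parse(time.RFC3339, startDate)
	if err != nil {
		return fmt.Errorf("invalid startDate: %w", err)
	}
	end, err := time.Parse(time.RFC3339, endDate)
	if err != nil {
		return fmt.Errorf("invalid endDate: %w", err)
	}
	if end.Before(start) {
		return fmt.Errorf("endDate %s is before startDate %s", endDate, startDate)
	}
	return nil
}

func main() {
	// Valid range: prints <nil>.
	fmt.Println(validateRange("2010-01-10T00:00:00+00:00", "2017-01-01T22:00:00+00:00"))
	// Reversed range: prints an error.
	fmt.Println(validateRange("2017-01-01T00:00:00+00:00", "2010-01-01T00:00:00+00:00"))
}
```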
Position

Generate a random GPS position in Decimal Degrees (WGS 84), eg: [40.741895, -73.989308]

"fieldName": {
    "type": "position",         // required
    "nullPercentage": <int>,    // optional 
    "maxDistinctValue": <int>   // optional
}
Constant

Add the same value to each document

"fieldName": {
    "type": "constant",       // required
    "nullPercentage": <int>,  // optional
    "constVal": <object>      // required, can be of any type including object and array
                              // eg: {"k": 1, "v": "val"} 
}
Autoincrement

Create an autoincremented field (type <long> or <int>)

"fieldName": {
    "type": "autoincrement",  // required
    "nullPercentage": <int>,  // optional
    "autoType": <string>,     // required, can be `int` or `long`
    "startLong": <long>,      // start value if autoType = long
    "startInt": <int>         // start value if autoType = int
}
Ref

If a field references another field in another collection, you can use a ref generator.

generator in first collection:

"fieldName":{  
    "type":"ref",               // required
    "nullPercentage": <int>,    // optional
    "maxDistinctValue": <int>,  // optional
    "id": <int>,                // required, generator id used to link
                                // field between collections
    "refContent": <generator>   // required
}

generator in other collections:

"fieldName": {
    "type": "ref",              // required
    "nullPercentage": <int>,    // optional
    "maxDistinctValue": <int>,  // optional
    "id": <int>                 // required, same id as previous generator 
}
FromArray

Randomly pick a value from an array as the value for the field. Currently, objects in the array have to be of the same type.

"fieldName": {
    "type": "fromArray",      // required
    "nullPercentage": <int>,  // optional   
    "in": [                   // required. Can't be empty. An array of object of 
      <object>,               // any type, including object and array. 
      <object>
      ...
    ]
}
CountAggregator

Count documents from <database>.<collection> matching a specific query. To use a variable of the document in the query, prefix it with "$$"

For the moment, the query can't be empty or null

"fieldName": {
  "type": "countAggregator", // required
  "database": <string>,      // required, db to use to perform aggregation
  "collection": <string>,    // required, collection to use to perform aggregation
  "query": <object>          // required, query that selects which documents to count in the collection 
}

Example:

Assuming that the collection first contains:

{"_id": 1, "field1": 1, "field2": "a" }
{"_id": 2, "field1": 1, "field2": "b" }
{"_id": 3, "field1": 2, "field2": "c" }

and that the generator for collection second is:

{
  "database": "test",
  "collection": "second",
  "count": 2,
  "content": {
    "_id": {
      "type": "autoincrement",
      "autoType": "int",
      "startInt": 0
    },
    "count": {
      "type": "countAggregator",
      "database": "test",
      "collection": "first",
      "query": {
        "field1": "$$_id"
      }
    }
  }
}

The collection second will contain:

{"_id": 1, "count": 2}
{"_id": 2, "count": 1}
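
The "$$" substitution and the resulting counts can be reproduced in memory. The following Go sketch assumes simple equality matching only (no query operators); `doc`, `resolve`, and `count` are hypothetical names, and this is not how mgodatagen queries the database.

```go
package main

import (
	"fmt"
	"strings"
)

// doc is a flat document with comparable values (ints and strings here).
type doc map[string]interface{}

// resolve replaces every string value prefixed with "$$" by the value of the
// matching field in the document being generated.
func resolve(query, current doc) doc {
	out := doc{}
	for k, v := range query {
		if s, ok := v.(string); ok && strings.HasPrefix(s, "$$") {
			out[k] = current[strings.TrimPrefix(s, "$$")]
		} else {
			out[k] = v
		}
	}
	return out
}

// count returns how many documents equal the query on every queried key.
func count(coll []doc, query doc) int {
	n := 0
	for _, d := range coll {
		match := true
		for k, v := range query {
			if d[k] != v {
				match = false
				break
			}
		}
		if match {
			n++
		}
	}
	return n
}

func main() {
	first := []doc{
		{"_id": 1, "field1": 1, "field2": "a"},
		{"_id": 2, "field1": 1, "field2": "b"},
		{"_id": 3, "field1": 2, "field2": "c"},
	}
	query := doc{"field1": "$$_id"}
	for id := 1; id <= 2; id++ {
		// Prints "1 2" then "2 1", matching the collection shown above.
		fmt.Println(id, count(first, resolve(query, doc{"_id": id})))
	}
}
```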
ValueAggregator

Get distinct values for a specific field for documents from <database>.<collection> matching a specific query. To use a variable of the document in the query, prefix it with "$$"

For the moment, the query can't be empty or null

"fieldName": {
  "type": "valueAggregator", // required
  "database": <string>,      // required, db to use to perform aggregation
  "collection": <string>,    // required, collection to use to perform aggregation
  "key": <string>,           // required, the field for which to return distinct values. 
  "query": <object>          // required, query that specifies the documents from which 
                             // to retrieve the distinct values
}

Example:

Assuming that the collection first contains:

{"_id": 1, "field1": 1, "field2": "a" }
{"_id": 2, "field1": 1, "field2": "b" }
{"_id": 3, "field1": 2, "field2": "c" }

and that the generator for collection second is:

{
  "database": "test",
  "collection": "second",
  "count": 2,
  "content": {
    "_id": {
      "type": "autoincrement",
      "autoType": "int",
      "startInt": 0
    },
    "values": {
      "type": "valueAggregator",
      "database": "test",
      "collection": "first",
      "key": "field2",
      "query": {
        "field1": "$$_id"
      }
    }
  }
}

The collection second will contain:

{"_id": 1, "values": ["a", "b"]}
{"_id": 2, "values": ["c"]}
BoundAggregator

Get the lowest and highest values for a specific field for documents from <database>.<collection> matching a specific query. To use a variable of the document in the query, prefix it with "$$"

For the moment, the query can't be empty or null

"fieldName": {
  "type": "boundAggregator", // required
  "database": <string>,      // required, db to use to perform aggregation
  "collection": <string>,    // required, collection to use to perform aggregation
  "key": <string>,           // required, the field for which to return the lowest/highest values. 
  "query": <object>          // required, query that specifies the documents from which 
                             // to retrieve the lowest/highest values
}

Example:

Assuming that the collection first contains:

{"_id": 1, "field1": 1, "field2": 0 }
{"_id": 2, "field1": 1, "field2": 10 }
{"_id": 3, "field1": 2, "field2": 20 }
{"_id": 4, "field1": 2, "field2": 30 }
{"_id": 5, "field1": 2, "field2": 15 }
{"_id": 6, "field1": 2, "field2": 200 }

and that the generator for collection second is:

{
  "database": "test",
  "collection": "second",
  "count": 2,
  "content": {
    "_id": {
      "type": "autoincrement",
      "autoType": "int",
      "startInt": 0
    },
    "values": {
      "type": "boundAggregator",
      "database": "test",
      "collection": "first",
      "key": "field2",
      "query": {
        "field1": "$$_id"
      }
    }
  }
}

The collection second will contain:

{"_id": 1, "values": {"m": 0, "M": 10}}
{"_id": 2, "values": {"m": 15, "M": 200}}

where m is the min value, and M the max value
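
The m / M computation can be checked in memory. This Go sketch assumes field2 holds numeric values (numeric ordering is what yields 15 and 200 for the second document); `record` and `bounds` are hypothetical names and the tool itself works against the database.

```go
package main

import "fmt"

// record keeps only the two fields used by the example.
type record struct {
	Field1 int
	Field2 int
}

// bounds returns the lowest (m) and highest (M) Field2 among the records whose
// Field1 equals key. ok is false when no record matches.
func bounds(records []record, key int) (m, M int, ok bool) {
	for _, r := range records {
		if r.Field1 != key {
			continue
		}
		if !ok {
			// First match initialises both bounds.
			m, M, ok = r.Field2, r.Field2, true
			continue
		}
		if r.Field2 < m {
			m = r.Field2
		}
		if r.Field2 > M {
			M = r.Field2
		}
	}
	return m, M, ok
}

func main() {
	first := []record{{1, 0}, {1, 10}, {2, 20}, {2, 30}, {2, 15}, {2, 200}}
	for id := 1; id <= 2; id++ {
		m, M, _ := bounds(first, id)
		// Prints "1 0 10" then "2 15 200".
		fmt.Println(id, m, M)
	}
}
```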

Faker

Generate 'real' data using the Faker library

"fieldName": {
    "type": "faker",             // required
    "nullPercentage": <int>,     // optional
    "maxDistinctValue": <int>,   // optional
    "method": <string>           // faker method to use, for example: City / Email...
}

If you're building large datasets (1000000+ items), you should avoid faker generators and use main or custom generators instead, as faker generators are much slower.

Currently, only the "en" locale is available.

Directories

Package generators is used to create BSON objects. Relevant documentation: http://bsonspec.org/#/specification

Currently supported BSON types:

  • string
  • int
  • long
  • double
  • boolean
  • date
  • objectId
  • object
  • array
  • binary data

Custom types:

  • GPS position
  • constant
  • autoincrement
  • reference
  • from array
  • faker (cf https://github.com/manveru/faker)

It was created as part of mgodatagen, but is standalone and may be used on its own.