gitcollector

v0.1.0
Warning

This package is not in the latest version of its module.

Published: Oct 28, 2019 License: GPL-3.0 Imports: 5 Imported by: 0

README

gitcollector

gitcollector collects and stores git repositories.

gitcollector is the source{d} tool to download and update git repositories at large scale. To that end, it uses a custom repository storage file format called siva optimized for saving storage space and keeping repositories up-to-date.

Status

The project is in a preliminary stable stage and under active development.

Storing repositories using rooted repositories

A rooted repository is a bare Git repository that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the Siva file format.

Root Repository explanatory diagram

Rooted repositories have a few particularities that you should know to work with them effectively:

  • They have no HEAD reference.
  • All references are of the following form: {REFERENCE_NAME}/{REMOTE_NAME}. For example, the reference refs/heads/master of the remote foo would be refs/heads/master/foo.
  • Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
  • A rooted repository is simply a repository with all the objects from all the repositories which share the same root commit.
  • The root commit for a repository is obtained by following the first parent of each commit, starting from HEAD.
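
The reference naming scheme and the first-parent walk described above can be illustrated with a small, self-contained sketch. It does not use gitcollector or any Git library: RootedRefName, Commit and RootCommit are hypothetical names introduced only for this example, and the commit graph is a toy in-memory map that is assumed to be acyclic, as a Git history is.

```go
package main

import (
	"fmt"
	"strings"
)

// RootedRefName is a hypothetical helper (not part of gitcollector) that
// builds a reference name of the form {REFERENCE_NAME}/{REMOTE_NAME}.
func RootedRefName(refName, remote string) string {
	return strings.TrimSuffix(refName, "/") + "/" + remote
}

// Commit is a toy commit: its hash and its parent hashes, first parent first.
type Commit struct {
	Hash    string
	Parents []string
}

// RootCommit starts at head and follows the first parent of each commit
// until it reaches a commit without parents, which is taken as the root.
func RootCommit(commits map[string]Commit, head string) (string, error) {
	cur := head
	for {
		c, ok := commits[cur]
		if !ok {
			return "", fmt.Errorf("commit %q not found", cur)
		}
		if len(c.Parents) == 0 {
			return c.Hash, nil
		}
		cur = c.Parents[0]
	}
}

func main() {
	// History: a <- b <- c, where c is a merge whose second parent is x.
	commits := map[string]Commit{
		"a": {Hash: "a"},
		"b": {Hash: "b", Parents: []string{"a"}},
		"x": {Hash: "x"},
		"c": {Hash: "c", Parents: []string{"b", "x"}},
	}

	root, err := RootCommit(commits, "c")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("root commit:", root)
	fmt.Println("rooted ref:", RootedRefName("refs/heads/master", "foo"))
}
```

In this toy history the merge commit c has two parents, but only the first one is followed, so the walk ends at a rather than at x.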

Getting started

Plain command

gitcollector is used through the download subcommand (currently the only subcommand):

Usage:
  gitcollector [OPTIONS] download [download-OPTIONS]

Help Options:
  -h, --help                                     Show this help message

[download command options]
          --library=                             path where download to [$GITCOLLECTOR_LIBRARY]
          --bucket=                              library bucketization level (default: 2) [$GITCOLLECTOR_LIBRARY_BUCKET]
          --tmp=                                 directory to place generated temporal files (default: /tmp) [$GITCOLLECTOR_TMP]
          --workers=                             number of workers, default to GOMAXPROCS [$GITCOLLECTOR_WORKERS]
          --half-cpu                             set the number of workers to half of the set workers [$GITCOLLECTOR_HALF_CPU]
          --no-updates                           don't allow updates on already downloaded repositories [$GITCOLLECTOR_NO_UPDATES]
          --no-forks                             github forked repositories will not be downloaded [$GITCOLLECTOR_NO_FORKS]
          --orgs=                                list of github organization names separated by comma [$GITHUB_ORGANIZATIONS]
          --excluded-repos=                      list of repos to exclude separated by comma [$GITCOLLECTOR_EXCLUDED_REPOS]
          --token=                               github token [$GITHUB_TOKEN]
          --metrics-db=                          uri to a database where metrics will be sent [$GITCOLLECTOR_METRICS_DB_URI]
          --metrics-db-table=                    table name where the metrics will be added (default: gitcollector_metrics) [$GITCOLLECTOR_METRICS_DB_TABLE]
          --metrics-sync-timeout=                timeout in seconds to send metrics (default: 30) [$GITCOLLECTOR_METRICS_SYNC]

    Log Options:
          --log-level=[info|debug|warning|error] Logging level (default: info) [$LOG_LEVEL]
          --log-format=[text|json]               log format, defaults to text on a terminal and json otherwise [$LOG_FORMAT]
          --log-fields=                          default fields for the logger, specified in json [$LOG_FIELDS]
          --log-force-format                     ignore if it is running on a terminal or not [$LOG_FORCE_FORMAT]

Usage example (--library and --orgs are always required):

gitcollector download --library=/path/to/repos/directory --orgs=src-d

To collect repositories from several github organizations:

gitcollector download --library=/path/to/repos/directory --orgs=src-d,bblfsh

Note that all the download command options are also configurable with environment variables.

Docker

gitcollector uploads a new Docker image to Docker Hub on each new release. To use it:

docker run --rm --name gitcollector_1 \
-e "GITHUB_ORGANIZATIONS=src-d,bblfsh" \
-e "GITHUB_TOKEN=foo" \
-v /path/to/repos/directory:/library \
srcd/gitcollector:latest

Note that you must mount a local directory at the container path /library, as shown in -v /path/to/repos/directory:/library. The repositories are downloaded into this directory as rooted repositories stored in siva files.

License

GPL v3.0, see LICENSE

Documentation

Index

Constants

This section is empty.

Variables

var (
	// ErrNewJobsNotFound must be returned by a JobScheduleFn when it can't
	// find new Jobs.
	ErrNewJobsNotFound = errors.NewKind(
		"couldn't find new jobs to schedule")

	// ErrJobSource must be returned by a JobScheduleFn when the source of
	// job is closed.
	ErrJobSource = errors.NewKind("job source is closed")
)

Functions

This section is empty.

Types

type Job

type Job interface {
	// Process performs the necessary work on the job.
	Process(context.Context) error
}

Job represents a gitcollector task.

type JobScheduleFn

type JobScheduleFn func(context.Context) (Job, error)

JobScheduleFn is a function to schedule the next Job.
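
As a minimal sketch of how a JobScheduleFn might report the two error conditions listed under Variables, the example below backs the schedule function with a channel. To stay self-contained it redeclares Job and JobScheduleFn locally and replaces the package's error kinds (built with errors.NewKind) with plain standard-library sentinel errors; NewChannelSchedule, noopJob and demo are hypothetical names, not part of gitcollector.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Local copies of the package's declarations so the sketch compiles alone.
type Job interface {
	Process(context.Context) error
}

type JobScheduleFn func(context.Context) (Job, error)

// Stand-ins for the package's error kinds (assumption: plain sentinels).
var (
	ErrNewJobsNotFound = errors.New("couldn't find new jobs to schedule")
	ErrJobSource       = errors.New("job source is closed")
)

// noopJob is a hypothetical Job that does nothing.
type noopJob struct{}

func (noopJob) Process(context.Context) error { return nil }

// NewChannelSchedule returns a JobScheduleFn that takes Jobs from queue
// without blocking: an empty queue yields ErrNewJobsNotFound and a closed
// queue yields ErrJobSource.
func NewChannelSchedule(queue <-chan Job) JobScheduleFn {
	return func(ctx context.Context) (Job, error) {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case job, ok := <-queue:
			if !ok {
				return nil, ErrJobSource
			}
			return job, nil
		default:
			return nil, ErrNewJobsNotFound
		}
	}
}

// demo calls the schedule function three times and records each outcome.
func demo() []string {
	queue := make(chan Job, 1)
	schedule := NewChannelSchedule(queue)
	ctx := context.Background()

	var out []string
	step := func() {
		job, err := schedule(ctx)
		switch {
		case errors.Is(err, ErrNewJobsNotFound):
			out = append(out, "no new jobs")
		case errors.Is(err, ErrJobSource):
			out = append(out, "source closed")
		case err != nil:
			out = append(out, "error: "+err.Error())
		default:
			if perr := job.Process(ctx); perr != nil {
				out = append(out, "failed: "+perr.Error())
				return
			}
			out = append(out, "processed")
		}
	}

	queue <- noopJob{}
	step() // a job is waiting in the queue
	step() // the queue is open but empty
	close(queue)
	step() // the queue is closed
	return out
}

func main() {
	for _, outcome := range demo() {
		fmt.Println(outcome)
	}
}
```

Separating "nothing to do right now" from "the source is gone" lets a caller decide whether to retry later or shut down; how WorkerPool reacts to each error is not documented here.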

type MetricsCollector

type MetricsCollector interface {
	// Start starts collecting metrics.
	Start()
	// Stop stops collecting metrics.
	Stop(immediate bool)
	// Success registers metrics about a successfully processed Job.
	Success(Job)
	// Fail registers metrics about a Job that failed to be processed.
	Fail(Job)
	// Discover registers metrics about a discovered Job.
	Discover(Job)
}

MetricsCollector represents a component in charge of collecting job metrics.
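
To make the interface concrete, here is a minimal in-memory sketch that satisfies it by counting events. Job and MetricsCollector are redeclared locally so the example compiles on its own; CountingCollector, fakeJob and demo are hypothetical names. The meaning of the immediate argument to Stop is not documented on this page, so the sketch ignores it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Local copies of the package's declarations so the sketch compiles alone.
type Job interface {
	Process(context.Context) error
}

type MetricsCollector interface {
	Start()
	Stop(immediate bool)
	Success(Job)
	Fail(Job)
	Discover(Job)
}

// CountingCollector is a hypothetical collector that only keeps counters.
type CountingCollector struct {
	mu         sync.Mutex
	discovered int
	succeeded  int
	failed     int
}

// Compile-time check that CountingCollector satisfies the interface.
var _ MetricsCollector = (*CountingCollector)(nil)

// Start and Stop are no-ops: there is no background work in this sketch,
// and the immediate flag is ignored (its semantics are an open assumption).
func (c *CountingCollector) Start()              {}
func (c *CountingCollector) Stop(immediate bool) {}

func (c *CountingCollector) Success(Job)  { c.mu.Lock(); c.succeeded++; c.mu.Unlock() }
func (c *CountingCollector) Fail(Job)     { c.mu.Lock(); c.failed++; c.mu.Unlock() }
func (c *CountingCollector) Discover(Job) { c.mu.Lock(); c.discovered++; c.mu.Unlock() }

// Counts returns a consistent snapshot of the three counters.
func (c *CountingCollector) Counts() (discovered, succeeded, failed int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.discovered, c.succeeded, c.failed
}

// fakeJob is a hypothetical Job that optionally fails.
type fakeJob struct{ fail bool }

func (j fakeJob) Process(context.Context) error {
	if j.fail {
		return errors.New("simulated failure")
	}
	return nil
}

// demo processes three jobs sequentially and reports the final counters.
func demo() (discovered, succeeded, failed int) {
	c := &CountingCollector{}
	c.Start()
	ctx := context.Background()
	for _, job := range []Job{fakeJob{}, fakeJob{fail: true}, fakeJob{}} {
		c.Discover(job)
		if err := job.Process(ctx); err != nil {
			c.Fail(job)
			continue
		}
		c.Success(job)
	}
	c.Stop(false)
	return c.Counts()
}

func main() {
	d, s, f := demo()
	fmt.Printf("discovered=%d succeeded=%d failed=%d\n", d, s, f)
}
```

The mutex is there because a collector handed to a pool of workers would presumably be called from several goroutines at once.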

type Provider

type Provider interface {
	Start() error
	Stop() error
}

Provider interface represents a service to generate new Jobs.
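
The sketch below shows one way a type could satisfy Provider by feeding a fixed list of Jobs into a channel. It rests on assumptions the page does not confirm: that Start blocks until the provider has nothing left to emit or is stopped, and that Stop may be called from another goroutine. Job and Provider are redeclared locally; SliceProvider, NewSliceProvider, errStopped, noopJob and demo are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Local copies of the package's declarations so the sketch compiles alone.
type Job interface {
	Process(context.Context) error
}

type Provider interface {
	Start() error
	Stop() error
}

// errStopped is a hypothetical error returned when Start is interrupted.
var errStopped = errors.New("provider stopped")

// noopJob is a hypothetical Job that does nothing.
type noopJob struct{}

func (noopJob) Process(context.Context) error { return nil }

// SliceProvider is a hypothetical Provider that emits a fixed list of Jobs.
type SliceProvider struct {
	jobs  []Job
	queue chan<- Job
	done  chan struct{}
	once  sync.Once
}

// Compile-time check that SliceProvider satisfies the interface.
var _ Provider = (*SliceProvider)(nil)

func NewSliceProvider(jobs []Job, queue chan<- Job) *SliceProvider {
	return &SliceProvider{jobs: jobs, queue: queue, done: make(chan struct{})}
}

// Start sends every Job to the queue, returning early if Stop was called.
func (p *SliceProvider) Start() error {
	for _, job := range p.jobs {
		// Check for a stop request first so it always wins over sending.
		select {
		case <-p.done:
			return errStopped
		default:
		}
		select {
		case <-p.done:
			return errStopped
		case p.queue <- job:
		}
	}
	return nil
}

// Stop asks Start to return; calling it more than once is safe.
func (p *SliceProvider) Stop() error {
	p.once.Do(func() { close(p.done) })
	return nil
}

// demo runs one provider to completion and starts another after stopping it.
func demo() (queued int, err error) {
	jobs := []Job{noopJob{}, noopJob{}}
	queue := make(chan Job, len(jobs))

	if err := NewSliceProvider(jobs, queue).Start(); err != nil {
		return 0, err
	}
	queued = len(queue)

	stopped := NewSliceProvider(jobs, queue)
	stopped.Stop()
	return queued, stopped.Start()
}

func main() {
	queued, err := demo()
	fmt.Println("jobs queued:", queued)
	fmt.Println("stopped provider:", err)
}
```

A channel fed this way could serve as the job source behind a JobScheduleFn, although the page does not describe how the real providers are wired to the pool.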

type WorkerPool

type WorkerPool struct {
	// contains filtered or unexported fields
}

WorkerPool holds a pool of workers to process Jobs.

func NewWorkerPool

func NewWorkerPool(
	schedule JobScheduleFn,
	opts *WorkerPoolOpts,
) *WorkerPool

NewWorkerPool builds a new WorkerPool.

func (*WorkerPool) Close

func (wp *WorkerPool) Close()

Close stops all the workers in the pool, waiting for the jobs to finish.

func (*WorkerPool) Run

func (wp *WorkerPool) Run()

Run notifies the workers to start.

func (*WorkerPool) SetWorkers

func (wp *WorkerPool) SetWorkers(n int)

SetWorkers sets the number of workers in the pool to n.

func (*WorkerPool) Size

func (wp *WorkerPool) Size() int

Size returns the current number of workers in the pool.

func (*WorkerPool) Stop

func (wp *WorkerPool) Stop()

Stop stops all the workers in the pool immediately.

func (*WorkerPool) Wait

func (wp *WorkerPool) Wait()

Wait waits for the workers to finish.

type WorkerPoolOpts

type WorkerPoolOpts struct {
	SchedulerCapacity  int
	ScheduleJobTimeout time.Duration
	NotWaitNewJobs     bool
	Metrics            MetricsCollector
}

WorkerPoolOpts are configuration options for a WorkerPool.
