indexer

package

Version: v5.1.0+incompatible | Published: Jul 31, 2019 | License: Apache-2.0 | Imports: 26 | Imported by: 0

Repository

github.com/src-d/datasets

README

pga-create

Tool to create the PGA dataset.

The following commands exist:

repack - downloads the latest GHTorrent MySQL dump and repacks it with only the required files (optional step).
discover - extracts the needed information from the GHTorrent MySQL dump on the fly. Requires only 1.5 GB of storage.
select - compiles the list of repositories to clone according to various filters, such as stars or languages.
index - creates the index.
set-forks - adds fork counts.

Installation

64-bit binaries for Linux, macOS and Windows are available on the Releases page.

Build from source

go get -v github.com/src-d/datasets/PublicGitArchive/pga-create

Obtain the list of repositories to clone

The list must be a text file with one URL per line. The paper chooses repositories on GitHub with ≥50 stars, which is equivalent to the following commands, which generate repository_list.txt:

pga-create discover
pga-create select -m 50 > repository_list.txt

Cloning repositories

You are going to need Borges and all its dependencies: RabbitMQ and PostgreSQL. The following commands are an artificially simplified cloning scenario; please refer to the Borges docs for the detailed manual.

In the first terminal execute

borges init
borges producer --source=file --file repository_list.txt

In the second terminal execute

export CONFIG_ROOT_REPOSITORIES_DIR=/path/where/repositories/will/be/stored
borges consumer

Processing repositories

To process the downloaded repositories, run the pga-create index command against the database populated in the previous step. This generates a CSV file with the information extracted from all those repositories.

The same environment variables as in Borges can be used to configure the database access.

pga-create index --debug --logfile=pga-create-index.log

The options accepted by pga-create index are the following:

-o, --output=        csv file path with the results (default: data/index.csv)
    --debug          show debug logs
    --logfile=       write logs to file
    --limit=         max number of repositories to process
    --offset=        skip initial n repositories
    --workers=       number of workers to use (defaults to number of CPUs)
    --repos-file=    path to a file with a repository per line, only those will be processed
-s, --stars=         input path for the file with the numbers of stars per repository (default: data/stars.gz)
-r, --repositories=  input path for the gzipped file with the repository names and identifiers (default: data/repositories.gz)
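The interaction of --offset and --limit can be pictured as a window over the repository list. This is only an illustrative sketch: window is a hypothetical helper, and treating a limit of 0 as "no limit" is an assumption, not documented behavior of pga-create index.

```go
package main

import "fmt"

// window is a hypothetical helper (not part of pga-create).
// It skips the first `offset` entries and keeps at most `limit` of the rest.
// Assumption: a limit of 0 means "no limit".
func window(repos []string, offset, limit uint64) []string {
	n := uint64(len(repos))
	if offset >= n {
		return nil
	}
	end := n
	if limit > 0 && limit < n-offset {
		end = offset + limit
	}
	return repos[offset:end]
}

func main() {
	repos := []string{"a", "b", "c", "d"}
	fmt.Println(window(repos, 1, 2)) // prints [b c]
}
```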

To set the SIZE field properly, pga-create index relies on the default temporary directory configuration of the core-retrieval dependency, with one exception: the CONFIG_CLEAN_TEMP_DIR environment variable must be set to true:

CONFIG_CLEAN_TEMP_DIR=true pga-create index --debug --logfile=pga-create-index.log

NOTE: by default this spawns as many workers as there are CPUs available in the machine. Some repositories may be considerably large, so this process may use a very large amount of memory.

After processing with index you will have a result.csv file with all the content you need. The only missing content will be the FORK_COUNT, which you can add with the set-forks command, also included.

pga-create set-forks

This takes result.csv and adds the fork counts to it, producing a result_forks.csv file with the same data as the original CSV plus the forks.

Documentation

Index

func Index(store *model.RepositoryStore, txer repository.RootedTransactioner, ...)

Functions

func Index

func Index(
	store *model.RepositoryStore,
	txer repository.RootedTransactioner,
	outputFile string,
	workers int,
	limit uint64,
	offset uint64,
	reposList map[string]uint32,
)

Index writes all the processed repositories in the given store to the given output CSV file.

Directories

Path	Synopsis
cmd/pga-create