Documentation

Overview
Given a single file with crossref works API messages, create a potentially smaller file that contains only the most recent version of each document.
Works in a three-stage, two-pass fashion: (1) extract, (2) identify, (3) extract. Performance data point (30M compressed records, 11m33.871s total):
2017/07/24 18:26:10 stage 1: 8m13.799431646s
2017/07/24 18:26:55 stage 2: 45.746997314s
2017/07/24 18:29:30 stage 3: 2m34.23537293s
$ span-crossref-snapshot -z crossref.ndj.gz -o out.ndj.gz
Anecdata: we started the new "span-crossref-sync" based workflow on 2022-05-30 and have been requesting daily slices from crossref since 2022-01-01. As of 2023-12-04 we had downloaded 701 files (zstd compressed).
sz
count           701
mean     2816941417
std      2175766872
min               0
25%      1138093994
50%      2739488108
75%      4058532166
max     13751449046
The median daily shipment is about 2.7GB. If we only consider days on which we actually saw data, the median increases to about 3GB:
sz
count           636
mean     3104836373
std      2079246455
min             573
25%      1541247728
50%      2998517796
75%      4150155730
max     13751449046
At most about 13GB on a single day. The total sum of downloaded data is 1.796TiB compressed (zstd level 3); if we recompress with zstd level 19 we get around 1.3TB, or 8.07TiB uncompressed. A typical update day contains 1-2M docs (lines).