metha

v0.1.6
Published: May 10, 2016 License: GPL-3.0 Imports: 26 Imported by: 0

README

metha

Command line OAI-PMH incremental harvester. Data is harvested in chunks.

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below METHA_DIR. The -dir flag prints that directory:

$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

The default METHA_DIR is $HOME/.metha.
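
The directory name in the example above is base64 text. Decoding it, as in the sketch below, yields the metadata format and the endpoint URL joined by "#". Reading the leading empty field as the set is an assumption drawn from this single example, not a documented layout, and decodeDirName is a hypothetical helper, not part of metha.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeDirName reverses the base64 encoding of a harvest directory name.
// Hypothetical helper for illustration only.
func decodeDirName(name string) (string, error) {
	b, err := base64.StdEncoding.DecodeString(name)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	decoded, err := decodeDirName("I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(decoded) // #oai_dc#http://export.arxiv.org/oai2

	// Assumed layout: set, format and URL joined by "#"; the set is empty here.
	parts := strings.SplitN(decoded, "#", 3)
	fmt.Printf("set=%q format=%q url=%q\n", parts[0], parts[1], parts[2])
}
```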

Harvesting can be CTRL-C'd any time. The data is harvested up to the last full day, so there is a small latency. The HTTP client is resilient: failed requests are retried up to 8 times with an exponential backoff.

Example: If the current date is Thu Apr 21 14:28:10 CEST 2016, the harvester requests all data between the repository's earliest date and 2016-04-20 23:59:59.
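
The upper bound in this example can be computed as in the following sketch. It only illustrates the "last full day" rule; lastFullDay and untilFor are hypothetical names, and metha's own implementation may differ.

```go
package main

import (
	"fmt"
	"time"
)

// lastFullDay returns the final second of the day before now, in now's location.
func lastFullDay(now time.Time) time.Time {
	y, m, d := now.Date()
	midnight := time.Date(y, m, d, 0, 0, 0, 0, now.Location())
	return midnight.Add(-time.Second)
}

// untilFor parses an RFC 3339 timestamp and formats the harvest upper bound.
func untilFor(now string) (string, error) {
	t, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return "", err
	}
	return lastFullDay(t).Format("2006-01-02 15:04:05"), nil
}

func main() {
	// Thu Apr 21 14:28:10 CEST 2016, written with its UTC+2 offset.
	until, err := untilFor("2016-04-21T14:28:10+02:00")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(until) // 2016-04-20 23:59:59
}
```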

You can stream records to stdout, too.

$ metha-cat http://export.arxiv.org/oai2

This will stream all harvested records to stdout. You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp on or after 2016-01-01.

To just stream all data really fast, use find and zcat on the harvesting dir.

To display basic repository information:

$ metha-id http://export.arxiv.org/oai2

To list all harvested endpoints:

$ metha-ls

Installation

Use a release or

$ go get github.com/miku/metha/cmd/...

Harvesting Roulette

$ metha-sync $(sort -R contrib/sites.tsv | head -1)

Errors this harvester can somewhat handle

  • responses with resumption tokens that lead to empty responses
  • gzipped responses that are not advertised as such
  • funny (illegal) control characters in XML responses
  • repositories that won't respond unless the dates are given with the exact granularity
  • repositories with endless token loops
  • repositories that do not support selective harvesting (use metha-sync -no-intervals URL)
  • limited repositories; metha will try up to 8 times with an exponential backoff
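
The last item, retrying with an exponential backoff, can be sketched as follows. This is a generic illustration, not metha's retry code: retry and failNTimes are hypothetical helpers, and the initial pause is an arbitrary choice. Only the limit of 8 attempts comes from the list above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls fn until it succeeds or maxRetries attempts have failed,
// doubling the pause after every failed attempt. It returns the number of
// attempts made and the last error.
func retry(maxRetries int, initial time.Duration, fn func() error) (int, error) {
	var err error
	pause := initial
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return attempt, nil
		}
		if attempt < maxRetries {
			time.Sleep(pause)
			pause *= 2
		}
	}
	return maxRetries, err
}

// failNTimes returns a function that fails on its first n calls and
// succeeds afterwards; it stands in for a flaky repository.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("temporary failure")
		}
		return nil
	}
}

func main() {
	// Pauses of 1ms, 2ms and 4ms precede the fourth, successful attempt.
	attempts, err := retry(8, time.Millisecond, failNTimes(3))
	fmt.Println(attempts, err) // 4 <nil>
}
```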

Documentation

Constants

const (
	DefaultTimeout    = 5 * time.Minute
	DefaultMaxRetries = 8
)
const Day = 24 * time.Hour
const Version = "0.1.6"

Variables

var (
	StdClient     = Client{Doer: http.DefaultClient}
	DefaultClient = Client{Doer: CreateDoer(DefaultTimeout, DefaultMaxRetries)}
	// Examples of broken XML: http://eprints.vu.edu.au/perl/oai2 and
	// http://digitalcommons.gardner-webb.edu/do/oai/?from=2016-02-29&metadataPrefix=oai_dc&until=2016-03-31&verb=ListRecords.
	// Add more weird things to be cleaned before XML parsing here. Replace
	// control chars outside XML char range.
	ControlCharReplacer = strings.NewReplacer(
		"\u0001", "", "\u0002", "", "\u0003", "",
		"\u0004", "", "\u0005", "", "\u0006", "",
		"\u0007", "", "\u0008", "", "\u0009", "",
		"\u000B", "", "\u000C", "", "\u000E", "",
		"\u000F", "", "\u0010", "", "\u0011", "",
		"\u0012", "", "\u0013", "", "\u0014", "",
		"\u0015", "", "\u0016", "", "\u0017", "",
		"\u0018", "", "\u0019", "", "\u001A", "",
		"\u001B", "", "\u001C", "", "\u001D", "",
		"\u001E", "", "\u001F", "")
)
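
A replacer of this kind is applied to the raw response text before XML parsing. The sketch below uses a shortened, hypothetical replacer with three of the characters listed above; clean is an illustrative helper, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// replacer strips a few control characters. It is a shortened stand-in for
// the full list in ControlCharReplacer.
var replacer = strings.NewReplacer("\u0001", "", "\u0008", "", "\u001F", "")

// clean removes the listed control characters from s.
func clean(s string) string {
	return replacer.Replace(s)
}

func main() {
	dirty := "<title>Har\u0008vest\u0001</title>"
	fmt.Println(clean(dirty)) // <title>Harvest</title>
}
```
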
var (
	// BaseDir is where all downloaded data is stored
	BaseDir = filepath.Join(UserHomeDir(), ".metha")

	ErrAlreadySynced       = errors.New("already synced")
	ErrInvalidEarliestDate = errors.New("invalid earliest date")
)
var (
	ErrInvalidVerb      = errors.New("invalid OAI verb")
	ErrMissingVerb      = errors.New("missing verb")
	ErrCannotGenerateID = errors.New("cannot generate ID")
	ErrMissingURL       = errors.New("missing URL")
	ErrParameterMissing = errors.New("missing required parameter")
)

Functions

func MoveAndCompress added in v0.1.6

func MoveAndCompress(src, dst string) error

MoveAndCompress will move src to dst, gzipping in the process.

func MustGlob

func MustGlob(pattern string) []string

MustGlob is like filepath.Glob, but panics on bad pattern.

func PrependSchema

func PrependSchema(s string) string

PrependSchema prepends http to s, if missing.
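
A minimal sketch of such a helper, assuming that "missing" means the string has neither an http:// nor an https:// prefix and that the prepended text is "http://". The actual PrependSchema may check differently; prependScheme is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// prependScheme adds "http://" unless s already starts with an HTTP scheme.
// The exact prefix check is an assumption, not taken from the package.
func prependScheme(s string) string {
	if strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://") {
		return s
	}
	return "http://" + s
}

func main() {
	fmt.Println(prependScheme("export.arxiv.org/oai2"))        // http://export.arxiv.org/oai2
	fmt.Println(prependScheme("http://export.arxiv.org/oai2")) // http://export.arxiv.org/oai2
}
```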

func UserHomeDir

func UserHomeDir() string

UserHomeDir returns the home directory of the user.

Types

type About

type About struct {
	Body []byte `xml:",innerxml" json:"body,omitempty"`
}

About holds additional record information.

func (About) GoString

func (ab About) GoString() string

Formatter for About content.

type Client

type Client struct {
	Doer Doer
}

A client that can execute requests.

func CreateClient

func CreateClient(timeout time.Duration, retries int) Client

Create a client with timeout and retry properties.

func (*Client) Do

func (c *Client) Do(r *Request) (*Response, error)

Do executes a single OAIRequest. ResumptionToken handling must happen in the caller. Only Identify and GetRecord requests will return a complete response.
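
Because Do performs a single request, the caller has to follow resumption tokens. The loop below sketches that pattern against a fake in-memory repository so it runs on its own. The page type, fetchFunc, harvestAll and fakeFetch are all hypothetical. A real caller would issue one request per iteration and read the token from the response. The request cap mirrors the idea of guarding against endless token loops.

```go
package main

import (
	"errors"
	"fmt"
)

// page is a stand-in for one partial list response.
type page struct {
	records []string
	token   string // empty when the list is complete
}

// fetchFunc stands in for a single request; it receives the resumption
// token of the previous page ("" for the first request).
type fetchFunc func(token string) (page, error)

var errTooManyRequests = errors.New("too many requests, possible token loop")

// harvestAll follows resumption tokens until the list is complete or
// maxRequests is reached.
func harvestAll(fetch fetchFunc, maxRequests int) ([]string, error) {
	var all []string
	token := ""
	for i := 0; i < maxRequests; i++ {
		p, err := fetch(token)
		if err != nil {
			return all, err
		}
		all = append(all, p.records...)
		if p.token == "" {
			return all, nil
		}
		token = p.token
	}
	return all, errTooManyRequests
}

// fakeFetch serves three fixed pages keyed by token.
func fakeFetch(token string) (page, error) {
	pages := map[string]page{
		"":   {records: []string{"r1", "r2"}, token: "t1"},
		"t1": {records: []string{"r3"}, token: "t2"},
		"t2": {records: []string{"r4"}},
	}
	p, ok := pages[token]
	if !ok {
		return page{}, fmt.Errorf("unknown token %q", token)
	}
	return p, nil
}

func main() {
	records, err := harvestAll(fakeFetch, 10)
	fmt.Println(records, err) // [r1 r2 r3 r4] <nil>
}
```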

type Description

type Description struct {
	Body []byte `xml:",innerxml"`
}

Description holds information about a set.

func (Description) GoString

func (desc Description) GoString() string

Formatter for Description content.

type DirLaster

type DirLaster struct {
	Dir           string
	DefaultValue  string
	ExtractorFunc func(os.FileInfo) string
}

DirLaster extracts the maximum value from the files of a directory. The values are extracted per file via ExtractorFunc, which gets an os.FileInfo and returns a token. The tokens are sorted and the lexicographically largest element is returned.

func (DirLaster) Last

func (l DirLaster) Last() (string, error)

Last extracts the maximum value from a directory, given an extractor function.
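
The description above can be sketched with the standard library as follows. This is an illustration under assumptions, not the package's code: lastToken and demo are hypothetical, the date-based file names are made up, and skipping empty tokens is a choice made here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// lastToken returns the lexicographically largest token extracted from the
// entries of dir, or defaultValue if no entry yields a token.
func lastToken(dir, defaultValue string, extract func(os.FileInfo) string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	var tokens []string
	for _, entry := range entries {
		info, err := entry.Info()
		if err != nil {
			return "", err
		}
		if token := extract(info); token != "" {
			tokens = append(tokens, token)
		}
	}
	if len(tokens) == 0 {
		return defaultValue, nil
	}
	sort.Strings(tokens)
	return tokens[len(tokens)-1], nil
}

// demo creates a temporary directory with date-named files (a made-up
// naming scheme) and returns the largest date.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "laster")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"2016-01-31.xml", "2016-03-31.xml", "2016-02-29.xml"} {
		if err := os.WriteFile(filepath.Join(dir, name), nil, 0o644); err != nil {
			return "", err
		}
	}
	return lastToken(dir, "", func(fi os.FileInfo) string {
		return strings.TrimSuffix(fi.Name(), ".xml")
	})
}

func main() {
	last, err := demo()
	fmt.Println(last, err) // 2016-03-31 <nil>
}
```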

type Doer

type Doer interface {
	Do(*http.Request) (*http.Response, error)
}

Doer is a minimal HTTP interface.

func CreateDoer

func CreateDoer(timeout time.Duration, retries int) Doer

CreateDoer returns an HTTP client with specific timeout and retry properties.

type GetRecord

type GetRecord struct {
	Record Record `xml:"record,omitempty" json:"record,omitempty"`
}

GetRecord returns a single record.

type Harvest

type Harvest struct {
	BaseURL string
	Format  string
	Set     string
	From    string
	Until   string

	MaxRequests                int
	DisableSelectiveHarvesting bool
	CleanBeforeDecode          bool
	SkipBroken                 bool
	MaxEmptyResponses          int

	Identify *Identify
	Started  time.Time

	// protects the (rare) case, where we are in the process of renaming
	// harvested files and get a termination signal at the same time.
	sync.Mutex
}

Harvest contains parameters for a mass download. MaxRequests and CleanBeforeDecode are switches to handle broken token implementations and funny chars in responses. Some repositories do not support selective harvesting (e.g. zvdd.org/oai2); set DisableSelectiveHarvesting to try to grab metadata from these. Set SkipBroken to ignore errors. From and Until must always be given in the 2006-01-02 layout. TODO(miku): make the zero value work (lazily run identify).

func NewHarvest

func NewHarvest(baseURL string) (*Harvest, error)

func (*Harvest) DateLayout

func (h *Harvest) DateLayout() string

DateLayout converts the repository endpoint's advertised granularity to a Go date layout string.
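
OAI-PMH defines two granularity values, YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ. A mapping to Go layouts could look like the sketch below. layoutFor is a hypothetical name, and falling back to day granularity for unknown values is an assumption made here.

```go
package main

import "fmt"

// layoutFor maps an OAI-PMH granularity value to a Go time layout.
// Unknown values fall back to day granularity, which is an assumption.
func layoutFor(granularity string) string {
	switch granularity {
	case "YYYY-MM-DDThh:mm:ssZ":
		return "2006-01-02T15:04:05Z"
	default:
		return "2006-01-02"
	}
}

func main() {
	fmt.Println(layoutFor("YYYY-MM-DD"))           // 2006-01-02
	fmt.Println(layoutFor("YYYY-MM-DDThh:mm:ssZ")) // 2006-01-02T15:04:05Z
}
```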

func (*Harvest) Dir

func (h *Harvest) Dir() string

Dir returns the absolute path to the harvesting directory.

func (*Harvest) Files

func (h *Harvest) Files() []string

Files returns all already harvested files (no temporary files).

func (*Harvest) MkdirAll

func (h *Harvest) MkdirAll() error

MkdirAll creates necessary directories.

func (*Harvest) Run

func (h *Harvest) Run() error

type Header

type Header struct {
	Status     string   `xml:"status,attr" json:"status,omitempty"`
	Identifier string   `xml:"identifier,omitempty" json:"identifier,omitempty"`
	DateStamp  string   `xml:"datestamp,omitempty" json:"datestamp,omitempty"`
	SetSpec    []string `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
}

A Header is part of other requests.

type Identify

type Identify struct {
	RepositoryName    string        `xml:"repositoryName,omitempty" json:"repositoryName,omitempty"`
	BaseURL           string        `xml:"baseURL,omitempty" json:"baseURL,omitempty"`
	ProtocolVersion   string        `xml:"protocolVersion,omitempty" json:"protocolVersion,omitempty"`
	AdminEmail        []string      `xml:"adminEmail,omitempty" json:"adminEmail,omitempty"`
	EarliestDatestamp string        `xml:"earliestDatestamp,omitempty" json:"earliestDatestamp,omitempty"`
	DeletedRecord     string        `xml:"deletedRecord,omitempty" json:"deletedRecord,omitempty"`
	Granularity       string        `xml:"granularity,omitempty" json:"granularity,omitempty"`
	Description       []Description `xml:"description,omitempty" json:"description,omitempty"`
}

Identify reports information about a repository.

type Interval

type Interval struct {
	Begin time.Time
	End   time.Time
}

Interval represents a span of time.

func (Interval) MonthlyIntervals

func (iv Interval) MonthlyIntervals() []Interval

MonthlyIntervals segments a given interval into monthly chunks.
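
One way to segment an interval by month is sketched below. The boundary handling (each chunk ends one second before the next month starts, or at the interval end) is an assumption for illustration; the interval type, monthly and describe are hypothetical and the package's own boundaries may differ.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a local stand-in for a span of time.
type interval struct {
	begin, end time.Time
}

// monthly splits iv into chunks that do not cross month boundaries. Each
// chunk ends one second before the next month starts, or at iv.end.
func monthly(iv interval) []interval {
	var chunks []interval
	start := iv.begin
	for start.Before(iv.end) {
		// time.Date normalizes month 13 to January of the following year.
		nextMonth := time.Date(start.Year(), start.Month()+1, 1, 0, 0, 0, 0, start.Location())
		end := nextMonth.Add(-time.Second)
		if end.After(iv.end) {
			end = iv.end
		}
		chunks = append(chunks, interval{begin: start, end: end})
		start = nextMonth
	}
	return chunks
}

// describe renders the chunks between two dates in 2006-01-02 layout.
func describe(from, until string) ([]string, error) {
	b, err := time.Parse("2006-01-02", from)
	if err != nil {
		return nil, err
	}
	e, err := time.Parse("2006-01-02", until)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, c := range monthly(interval{begin: b, end: e}) {
		out = append(out, c.begin.Format("2006-01-02")+".."+c.end.Format("2006-01-02"))
	}
	return out, nil
}

func main() {
	chunks, err := describe("2016-01-15", "2016-03-10")
	fmt.Println(chunks, err)
	// [2016-01-15..2016-01-31 2016-02-01..2016-02-29 2016-03-01..2016-03-10] <nil>
}
```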

type Laster

type Laster interface {
	Last() (string, error)
}

Extracts some maximum value as string.

type ListIdentifiers

type ListIdentifiers struct {
	Headers         []Header `xml:"header,omitempty" json:"header,omitempty"`
	ResumptionToken string   `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}

ListIdentifiers lists headers only.

type ListMetadataFormats

type ListMetadataFormats struct {
	MetadataFormat []MetadataFormat `xml:"metadataFormat,omitempty" json:"metadataFormat,omitempty"`
}

ListMetadataFormats lists supported metadata formats.

type ListRecords

type ListRecords struct {
	Records         []Record `xml:"record" json:"record"`
	ResumptionToken string   `xml:"resumptionToken" json:"resumptionToken"`
}

ListRecords lists records.

type ListSets

type ListSets struct {
	Set             []Set  `xml:"set,omitempty"  json:"set,omitempty"`
	ResumptionToken string `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}

ListSets lists available sets. TODO(miku): resumptiontoken can have expiration date, etc.

type Metadata

type Metadata struct {
	Body []byte `xml:",innerxml"`
}

Metadata contains the actual metadata, conforming to various schemas.

func (Metadata) GoString

func (md Metadata) GoString() string

Formatter for Metadata content.

func (Metadata) MarshalJSON

func (md Metadata) MarshalJSON() ([]byte, error)

type MetadataFormat

type MetadataFormat struct {
	MetadataPrefix    string `xml:"metadataPrefix,omitempty" json:"metadataPrefix,omitempty"`
	Schema            string `xml:"schema,omitempty" json:"schema,omitempty"`
	MetadataNamespace string `xml:"metadataNamespace,omitempty" json:"metadataNamespace,omitempty"`
}

MetadataFormat holds information about a format.

type MultiError

type MultiError struct {
	Errors []error
}

func (*MultiError) Error

func (e *MultiError) Error() string

type OAIError

type OAIError struct {
	Code    string `xml:"code,attr" json:"code,omitempty"`
	Message string `xml:",chardata" json:"message,omitempty"`
}

An OAI protocol error.

func (OAIError) Error

func (e OAIError) Error() string

Error formats code and message.

type Record

type Record struct {
	Header   Header   `xml:"header,omitempty" json:"header,omitempty"`
	Metadata Metadata `xml:"metadata,omitempty" json:"metadata,omitempty"`
	About    About    `xml:"about,omitempty" json:"about,omitempty"`
}

Record represents a single record.

type Repository

type Repository struct {
	BaseURL string
}

func (Repository) Formats

func (r Repository) Formats() ([]MetadataFormat, error)

func (Repository) Sets

func (r Repository) Sets() ([]Set, error)

type Request

type Request struct {
	BaseURL           string
	Verb              string
	Identifier        string
	MetadataPrefix    string
	From              string
	Until             string
	Set               string
	ResumptionToken   string
	CleanBeforeDecode bool
}

A Request can express any request that can be sent to an OAI server. Not all combinations of values will yield valid requests.

func (*Request) URL

func (r *Request) URL() (*url.URL, error)

URL returns the URL for a given request. Invalid verbs and missing parameters are reported here.
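
The following sketch shows how a request URL of this kind can be assembled with net/url. It is not the package's implementation: buildURL and errMissingVerb are hypothetical, and only the missing-verb check is modeled.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

var errMissingVerb = errors.New("missing verb")

// buildURL assembles an OAI-PMH request URL from a base URL, a verb and
// optional arguments. Empty arguments are skipped.
func buildURL(baseURL, verb string, args map[string]string) (string, error) {
	if verb == "" {
		return "", errMissingVerb
	}
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	q := url.Values{}
	q.Set("verb", verb)
	for k, v := range args {
		if v != "" {
			q.Set(k, v)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts by key, so the output is stable.
	return u.String(), nil
}

func main() {
	s, err := buildURL("http://export.arxiv.org/oai2", "ListRecords", map[string]string{
		"metadataPrefix": "oai_dc",
		"from":           "2016-01-01",
	})
	fmt.Println(s, err)
	// http://export.arxiv.org/oai2?from=2016-01-01&metadataPrefix=oai_dc&verb=ListRecords <nil>
}
```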

type RequestNode

type RequestNode struct {
	Verb           string `xml:"verb,attr" json:"verb,omitempty"`
	Set            string `xml:"set,attr" json:"set,omitempty"`
	MetadataPrefix string `xml:"metadataPrefix,attr" json:"metadataPrefix,omitempty"`
}

RequestNode carries the request information into the response.

type Response

type Response struct {
	ResponseDate string      `xml:"responseDate,omitempty" json:"responseDate,omitempty"`
	Request      RequestNode `xml:"request,omitempty" json:"request,omitempty"`
	Error        OAIError    `xml:"error,omitempty" json:"error,omitempty"`

	GetRecord           GetRecord           `xml:"GetRecord,omitempty" json:"GetRecord,omitempty"`
	Identify            Identify            `xml:"Identify,omitempty" json:"Identify,omitempty"`
	ListIdentifiers     ListIdentifiers     `xml:"ListIdentifiers,omitempty" json:"ListIdentifiers,omitempty"`
	ListMetadataFormats ListMetadataFormats `xml:"ListMetadataFormats,omitempty" json:"ListMetadataFormats,omitempty"`
	ListRecords         ListRecords         `xml:"ListRecords,omitempty" json:"ListRecords,omitempty"`
	ListSets            ListSets            `xml:"ListSets,omitempty" json:"ListSets,omitempty"`
}

Response is the envelope. It can hold any OAI response kind.

func Do

func Do(r *Request) (*Response, error)

Do is a shortcut for DefaultClient.Do.

func (*Response) GetResumptionToken

func (response *Response) GetResumptionToken() string

GetResumptionToken returns the resumption token or an empty string if it does not have a token.

func (*Response) HasResumptionToken

func (response *Response) HasResumptionToken() bool

HasResumptionToken determines if the response has a ResumptionToken.

type Set

type Set struct {
	SetSpec        string      `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
	SetName        string      `xml:"setName,omitempty" json:"setName,omitempty"`
	SetDescription Description `xml:"setDescription,omitempty" json:"setDescription,omitempty"`
}

A Set has a spec, name and description.

type Values

type Values struct {
	url.Values
}

func NewValues

func NewValues() Values

func (Values) EncodeVerbatim

func (v Values) EncodeVerbatim() string

EncodeVerbatim is like Encode(), but does not escape the keys and values.
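
To show the difference from Encode, the sketch below encodes the same values both ways. encodeVerbatim and compare are hypothetical helpers; sorting by key follows what url.Values.Encode does and is an assumption about the package's behavior. The token value is made up.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// encodeVerbatim joins keys and values as key=value pairs separated by "&",
// sorted by key, without percent-escaping.
func encodeVerbatim(v url.Values) string {
	keys := make([]string, 0, len(v))
	for k := range v {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var pairs []string
	for _, k := range keys {
		for _, value := range v[k] {
			pairs = append(pairs, k+"="+value)
		}
	}
	return strings.Join(pairs, "&")
}

// compare encodes a resumption token with Encode and with encodeVerbatim.
func compare(token string) (escaped, verbatim string) {
	v := url.Values{}
	v.Set("verb", "ListRecords")
	v.Set("resumptionToken", token)
	return v.Encode(), encodeVerbatim(v)
}

func main() {
	escaped, verbatim := compare("a/b=c|1")
	fmt.Println(escaped)  // resumptionToken=a%2Fb%3Dc%7C1&verb=ListRecords
	fmt.Println(verbatim) // resumptionToken=a/b=c|1&verb=ListRecords
}
```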

Directories

cmd/metha-cat (command)
cmd/metha-files (command)
cmd/metha-id (command)
cmd/metha-ls (command)
cmd/metha-sync (command)