oaimi

package module
v0.2.10
Published: Dec 10, 2015 License: GPL-3.0 Imports: 19 Imported by: 0

README

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. https://www.openarchives.org/pmh/

No-frills OAI harvesting. It acts as a cache and takes care of incrementally retrieving new records.

Installation

$ go get github.com/miku/oaimi/cmd/oaimi

There are deb and rpm packages as well.

Usage

Show repository information:

$ oaimi -id http://digital.ub.uni-duesseldorf.de/oai
{
  "formats": [
    {
      "prefix": "oai_dc",
      "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
    },
    ...
    {
      "prefix": "epicur",
      "schema": "http://www.persistent-identifier.de/xepicur/version1.0/xepicur.xsd"
    }
  ],
  "identify": {
    "name": "Visual Library Server der Universitäts- und Landesbibliothek Düsseldorf",
    "url": "http://digital.ub.uni-duesseldorf.de/oai/",
    "version": "2.0",
    "email": "docserv@uni-duesseldorf.de",
    "earliest": "2008-04-18T07:54:14Z",
    "delete": "no",
    "granularity": "YYYY-MM-DDThh:mm:ssZ"
  },
  "sets": [
    {
      "spec": "ulbdvester",
      "name": "Sammlung Vester (DFG)"
    },
    ...
    {
      "spec": "ulbd_rsh",
      "name": "RSH"
    }
  ]
}

Harvest the complete repository into a single file (default format is oai_dc, might take a few minutes on first run):

$ oaimi -verbose http://digital.ub.uni-duesseldorf.de/oai > metadata.xml

Harvest only a slice (e.g. set ulbdvester in format epicur for 2010 only):

$ oaimi -set ulbdvester -prefix epicur -from 2010-01-01 \
        -until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai > slice.xml

Harvest and wrap the output in an artificial root element, so the result has a single root and is closer to well-formed XML:

$ oaimi -root records http://digital.ub.uni-duesseldorf.de/oai > withroot.xml

To list the harvested files, run:

$ ls $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)

Add any parameter to see the resulting cache dir:

$ ls $(oaimi -dirname -set ulbdvester -prefix epicur -from 2010-01-01 \
             -until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai)

To remove all cached files:

$ rm -rf $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)

Play well with others:

$ oaimi http://acceda.ulpgc.es/oai/request | \
    xmlcutty -path /Response/ListRecords/record/metadata -root collection | \
    xmllint --format -

<?xml version="1.0"?>
<collection>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="ht...... dc.xsd">
      <dc:title>Elementos m&#xED;ticos y paralelos estructurales en la ...</dc:title>
...

Options:

$ oaimi -h
Usage of oaimi:
  -cache string
      oaimi cache dir (default "/Users/tir/.oaimicache")
  -dirname
      show shard directory for request
  -from string
      OAI from
  -id
      show repository info
  -prefix string
      OAI metadataPrefix (default "oai_dc")
  -root string
      name of artificial root element tag to use
  -set string
      OAI set
  -until string
      OAI until (default "2015-11-30")
  -v  prints current program version
  -verbose
      more output

Experimental oaimi-id and oaimi-sync for identifying or harvesting in parallel:

$ oaimi-id -h
Usage of oaimi-id:
  -timeout duration
      deadline for requests (default 30m0s)
  -v  prints current program version
  -verbose
      be verbose
  -w int
      requests in parallel (default 8)

$ oaimi-sync
Usage of oaimi-sync:
  -cache string
      where to cache responses (default "/Users/tir/.oaimicache")
  -v  prints current program version
  -verbose
      be verbose
  -w int
      requests in parallel (default 8)

How it works

The harvesting is performed in chunks (weekly at the moment). The raw data is downloaded and appended to a single temporary file per source, set, prefix and month. Once a month has been harvested successfully, the temporary file is moved below a cache dir. In short: The cache dir will not contain partial files.

If you request the data for a given data source, oaimi will try to reuse the cache and harvest only data that is not yet cached. The output file is the concatenated content for the requested date range. The output is not valid XML, because a root element is missing. You can add a custom root element with the -root flag.
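The effect of the missing root element can be illustrated with a small sketch. It is not part of oaimi: `wrapRoot` and `countRoots` are hypothetical helpers, and the sample string only stands in for concatenated harvest output. The sketch counts top-level elements before and after wrapping.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wrapRoot surrounds raw harvested XML with an artificial root element.
// Hypothetical helper, comparable in spirit to the -root flag.
func wrapRoot(doc, tag string) string {
	return "<" + tag + ">" + doc + "</" + tag + ">"
}

// countRoots returns the number of top-level elements found in doc.
func countRoots(doc string) (int, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	depth, roots := 0, 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return roots, nil
		}
		if err != nil {
			return roots, err
		}
		switch tok.(type) {
		case xml.StartElement:
			if depth == 0 {
				roots++
			}
			depth++
		case xml.EndElement:
			depth--
		}
	}
}

func main() {
	// Two concatenated chunks, as a stand-in for harvested output.
	raw := "<record><id>1</id></record><record><id>2</id></record>"
	before, _ := countRoots(raw)
	after, _ := countRoots(wrapRoot(raw, "records"))
	fmt.Println("top-level elements before:", before, "after:", after)
}
```

A document with more than one top-level element is rejected by strict XML parsers, which is why the wrapped form is easier to feed into other tools.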

The value proposition of oaimi is that you get a single file containing the raw data for a specific source with a single command and that incremental updates are relatively cheap - at most the last 7 days need to be fetched.

For the moment, any further processing must happen in the client (like handling deletions).

More Docs: http://godoc.org/github.com/miku/oaimi

Similar projects

More sites

Distributions

Over 2038 repositories.

Miscellaneous

License

  • GPLv3
  • This project uses ioutil2, Copyright 2012, Google Inc. All rights reserved. Use of this source code is governed by a BSD-style license.

Documentation

Overview

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de
                  The Finc Authors, http://finc.info
                  Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Foobar. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

+build linux darwin

Package oaimi implements a few helpers to mirror OAI repositories. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability.

This project aims to make it simple to create a local, single file view of the repository metadata. It comes with a command line tool, called `oaimi`.

Basic usage:

   $ oaimi http://digitalcommons.unmc.edu/do/oai/ > metadata.xml

Copyright 2012, Google Inc. All rights reserved. Use of this source code is governed by a BSD-style license that can be found in the LICENSE file.

Index

Constants

const CompressThreshold = 1024
const Version = "0.2.10"

Version

Variables

var (
	ErrFileNotWriteable = errors.New("not opened for writing")
	ErrFileNotReadable  = errors.New("not opened for reading")
)
var (
	ErrNoEndpoint         = errors.New("an endpoint is required")
	ErrNoVerb             = errors.New("no verb")
	ErrBadVerb            = errors.New("bad verb")
	ErrCannotCreatePath   = errors.New("cannot create path")
	ErrNoHost             = errors.New("no host")
	ErrMissingFromOrUntil = errors.New("missing from or until")
	// ErrTooManyRequests might be encountered with broken resumptiontoken implementations.
	ErrTooManyRequests = errors.New("too many requests")

	// Verbose logs actions
	Verbose = false
	// UserAgent to use for requests
	UserAgent = fmt.Sprintf("oaimi/%s (https://github.com/miku/oaimi)", Version)
	// DefaultEarliestDate is used, if the repository does not supply one.
	DefaultEarliestDate = time.Date(1970, 1, 1, 0, 0, 0, 0, time.UTC)
	// CutoffDate is used if the repository reports an earliest date that looks unrealistic, such as year 0007.
	CutoffDate = time.Date(1458, 1, 1, 0, 0, 0, 0, time.UTC)
	// DefaultFormat should be supported by most endpoints.
	DefaultFormat = "oai_dc"
	// DefaultCacheDir
	DefaultCacheDir = ".oaimicache"
	// DefaultClient should suffice for most use cases.
	DefaultClient = NewClient()
	// OAIVerbMap (4. Protocol Requests and Responses)
	OAIVerbMap = map[string]bool{
		"Identify":            true,
		"ListIdentifiers":     true,
		"ListSets":            true,
		"ListMetadataFormats": true,
		"ListRecords":         true,
		"GetRecord":           true,
	}
)

Functions

func WriteFileAtomic added in v0.2.1

func WriteFileAtomic(filename string, data []byte, perm os.FileMode) error

WriteFileAtomic writes data to a temporary file and atomically moves it to filename once the write has succeeded.
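The write-then-rename pattern can be sketched with the standard library alone. The function below is a hypothetical stand-in with the same signature, not the package's actual implementation; it assumes the temporary file lives in the target directory so that the rename stays on one filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic is a hypothetical sketch, not the package's code. It
// writes data to a temporary file next to the target and renames it into
// place only after write, chmod and close have all succeeded.
func writeFileAtomic(filename string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(filename), ".tmp-*")
	if err != nil {
		return err
	}
	// Clean up the temporary file on failure; after a successful rename
	// the name no longer exists and the error from Remove is ignored.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filename)
}

// demo writes a small file atomically and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-example")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "chunk.xml")
	if err := writeFileAtomic(target, []byte("<record/>"), 0644); err != nil {
		return "", err
	}
	b, err := os.ReadFile(target)
	return string(b), err
}

func main() {
	content, err := demo()
	fmt.Println(content, err)
}
```

Readers of the target path see either the old content or the complete new content, never a partial file, which matches the cache guarantee described in the README.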

Types

type BatchingClient added in v0.2.1

type BatchingClient struct {
	// MaxRequests, zero means no limit. Default of 1024 will prevent endless
	// loop due to broken resumptionToken implementations (e.g.
	// http://goo.gl/KFb9iM).
	MaxRequests int
	// contains filtered or unexported fields
}

BatchingClient takes a single OAI request but will do more than one HTTP request to fulfill it, if necessary.

func NewBatchingClient added in v0.2.1

func NewBatchingClient() BatchingClient

NewBatchingClient returns a client that batches HTTP requests and uses a resilient HTTP client.

func (*BatchingClient) Do added in v0.2.1

func (c *BatchingClient) Do(req Request) (resp Response, err error)

Do will turn a single request into a single response by combining many responses into a single one. This is potentially very memory consuming.
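The idea of combining many responses into one, guarded by a request limit, can be sketched as follows. Everything here is hypothetical (`page`, `fetchFunc`, `collect` and the fake repositories are invented for illustration); the sketch only shows how following resumption tokens with a cap avoids an endless loop when a server keeps returning the same token.

```go
package main

import (
	"errors"
	"fmt"
)

var errTooManyRequests = errors.New("too many requests")

// page is a simplified, hypothetical response: some records plus an
// optional resumption token. An empty token marks the last page.
type page struct {
	records []string
	token   string
}

// fetchFunc returns the page for a token; the empty token requests the
// first page.
type fetchFunc func(token string) (page, error)

// collect follows resumption tokens and gathers all records in memory.
// A maxRequests of zero means no limit.
func collect(fetch fetchFunc, maxRequests int) ([]string, error) {
	var all []string
	token := ""
	for n := 1; ; n++ {
		if maxRequests > 0 && n > maxRequests {
			return all, errTooManyRequests
		}
		p, err := fetch(token)
		if err != nil {
			return all, err
		}
		all = append(all, p.records...)
		if p.token == "" {
			return all, nil
		}
		token = p.token
	}
}

// fakeRepository serves three pages and then stops.
func fakeRepository(token string) (page, error) {
	switch token {
	case "":
		return page{[]string{"a", "b"}, "t1"}, nil
	case "t1":
		return page{[]string{"c"}, "t2"}, nil
	default:
		return page{[]string{"d"}, ""}, nil
	}
}

// brokenRepository always hands out the same token, so it never finishes.
func brokenRepository(token string) (page, error) {
	return page{[]string{"x"}, "same"}, nil
}

func main() {
	records, err := collect(fakeRepository, 1024)
	fmt.Println(records, err)
	_, err = collect(brokenRepository, 5)
	fmt.Println(err)
}
```

Because all records are appended to one slice, memory use grows with the size of the result set, which is the cost noted above.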

type CachingClient added in v0.2.1

type CachingClient struct {
	// RootTag is an optional root element.
	RootTag string
	// NameSpaces allow to add custom XML namespace declarations to the root element.
	NameSpaces map[string]string
	// CacheDir stores the directory, where all the downloads go.
	CacheDir string
	// contains filtered or unexported fields
}

CachingClient will write XML to a given writer. This client encapsulates cache logic which helps to make subsequent requests fast. A root element is optional.

func NewCachingClient added in v0.2.1

func NewCachingClient(w io.Writer) CachingClient

NewCachingClient creates a new client, with a default location for cached files. All XML responses will be written to the given io.Writer.

func NewCachingClientDir added in v0.2.1

func NewCachingClientDir(w io.Writer, dir string) CachingClient

NewCachingClientDir creates a new client that stores cached files in the given directory. All XML responses will be written to the given io.Writer.

func (CachingClient) Do added in v0.2.1

func (c CachingClient) Do(req Request) error

Do executes a given request. If the request is not yet cached, the content is retrieved and persisted. Requests are internally split up into weekly windows to reduce load and latency in case of errors.

func (CachingClient) RequestCacheDir added in v0.2.1

func (c CachingClient) RequestCacheDir(req Request) (string, error)

RequestCacheDir returns the cache directory for a given request.

type Client added in v0.2.1

type Client struct {
	// contains filtered or unexported fields
}

Client is a simple client that can turn an OAI request into an OAI response.

func NewClient added in v0.2.1

func NewClient() Client

NewClient creates a default client with a resilient HTTP client.

func NewClientDoer added in v0.2.1

func NewClientDoer(doer HttpRequestDoer) Client

NewClientDoer creates a new OAI client with a user-supplied HTTP client, e.g. pester.Client, http.DefaultClient.

func (Client) Do added in v0.2.1

func (c Client) Do(req Request) (Response, error)

Do takes an OAI request and turns it into at most one single OAI response.

type HttpRequestDoer added in v0.2.1

type HttpRequestDoer interface {
	Do(*http.Request) (*http.Response, error)
}

HttpRequestDoer lets us use pester, DefaultClient or other HTTP client implementations interchangeably.
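A short sketch of why a one-method interface is convenient here: any value with a matching Do method qualifies, including http.DefaultClient and a test double that never touches the network. The interface is re-declared locally so the example is self-contained; `cannedDoer`, `fetch` and the example URL are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// httpRequestDoer mirrors the interface shown above.
type httpRequestDoer interface {
	Do(*http.Request) (*http.Response, error)
}

// cannedDoer is a hypothetical test double that returns a fixed body.
type cannedDoer struct {
	body string
}

func (d cannedDoer) Do(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader(d.body)),
		Request:    req,
	}, nil
}

// Compile-time checks that both implementations satisfy the interface.
var (
	_ httpRequestDoer = http.DefaultClient
	_ httpRequestDoer = cannedDoer{}
)

// fetch issues a GET through any doer and returns the body as a string.
func fetch(doer httpRequestDoer, url string) (string, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	resp, err := doer.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	body, err := fetch(cannedDoer{body: "<OAI-PMH/>"}, "http://example.org/oai?verb=Identify")
	fmt.Println(body, err)
}
```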

type Identify added in v0.2.4

type Identify struct {
	Name              string `xml:"repositoryName,omitempty" json:"name,omitempty"`
	URL               string `xml:"baseURL,omitempty" json:"url,omitempty"`
	Version           string `xml:"protocolVersion,omitempty" json:"version,omitempty"`
	AdminEmail        string `xml:"adminEmail,omitempty" json:"email,omitempty"`
	EarliestDatestamp string `xml:"earliestDatestamp,omitempty" json:"earliest,omitempty"`
	DeletePolicy      string `xml:"deletedRecord,omitempty" json:"delete,omitempty"`
	Granularity       string `xml:"granularity,omitempty" json:"granularity,omitempty"`
	Description       struct {
		Friends    []string `xml:"friends>baseURL,omitempty" json:"friends,omitempty"`
		Identifier struct {
			Scheme               string `xml:"scheme,omitempty" json:"scheme,omitempty"`
			RepositoryIdentifier string `xml:"repositoryIdentifier,omitempty" json:"repositoryIdentifier,omitempty"`
			Delimiter            string `xml:"delimiter,omitempty" json:"delimiter,omitempty"`
			SampleIdentifier     string `xml:"sampleIdentifier,omitempty" json:"sampleIdentifier,omitempty"`
		} `xml:"oai-identifier,omitempty" json:"identifier,omitempty"`
	} `xml:"description,omitempty" json:"description,omitempty"`
}

Identify response.

type ListIdentifiers added in v0.2.4

type ListIdentifiers struct {
	Header []header        `xml:"header"`
	Token  resumptionToken `xml:"resumptionToken"`
}

ListIdentifiers response.

type ListMetadataFormats added in v0.2.4

type ListMetadataFormats struct {
	xml.Name `xml:"ListMetadataFormats" json:"formats"`
	Formats  []struct {
		Prefix string `xml:"metadataPrefix" json:"prefix"`
		Schema string `xml:"schema" json:"schema"`
	} `xml:"metadataFormat" json:"format"`
}

ListMetadataFormats response.

type ListRecords added in v0.2.4

type ListRecords struct {
	Records []struct {
		Header   header `xml:"header"`
		Metadata struct {
			Verbatim string `xml:",innerxml"`
		} `xml:"metadata"`
	} `xml:"record"`
	Token resumptionToken `xml:"resumptionToken"`
}

ListRecords response.
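The `,innerxml` tag in the struct above keeps each record's metadata as raw XML instead of decoding it. The following sketch shows that behavior with encoding/xml on a made-up response. The `header` fields and the plain string token are assumptions, since the package's unexported `header` and `resumptionToken` types are not shown here.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// header is an assumed shape; the package's own header type is unexported.
type header struct {
	Identifier string `xml:"identifier"`
	Datestamp  string `xml:"datestamp"`
}

// listRecords follows the struct shown above, with the token simplified
// to a plain string for this sketch.
type listRecords struct {
	Records []struct {
		Header   header `xml:"header"`
		Metadata struct {
			Verbatim string `xml:",innerxml"`
		} `xml:"metadata"`
	} `xml:"record"`
	Token string `xml:"resumptionToken"`
}

type response struct {
	ListRecords listRecords `xml:"ListRecords"`
}

// sample is a made-up, minimal OAI-PMH ListRecords response.
const sample = `<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2015-12-10T00:00:00Z</responseDate>
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier><datestamp>2010-05-01</datestamp></header>
      <metadata><dc><title>First</title></dc></metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier><datestamp>2010-05-02</datestamp></header>
      <metadata><dc><title>Second</title></dc></metadata>
    </record>
    <resumptionToken>abc</resumptionToken>
  </ListRecords>
</OAI-PMH>`

// parse decodes an OAI-PMH response document into the sketch types.
func parse(data string) (response, error) {
	var r response
	err := xml.Unmarshal([]byte(data), &r)
	return r, err
}

func main() {
	r, err := parse(sample)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, rec := range r.ListRecords.Records {
		fmt.Println(rec.Header.Identifier, rec.Metadata.Verbatim)
	}
	fmt.Println("next token:", r.ListRecords.Token)
}
```

Keeping the metadata verbatim means the client does not need to know the schema of every metadata format a repository offers.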

type ListSets added in v0.2.4

type ListSets struct {
	Sets []struct {
		Spec        string `xml:"setSpec" json:"spec,omitempty"`
		Name        string `xml:"setName" json:"name,omitempty"`
		Description string `xml:"setDescription>dc>description" json:"description,omitempty"`
	} `xml:"set" json:"set"`
	Token resumptionToken `xml:"resumptionToken"`
}

ListSets response.

type MaybeCompressedFile added in v0.2.9

type MaybeCompressedFile struct {
	// contains filtered or unexported fields
}

func CreateMaybeCompressedFile added in v0.2.9

func CreateMaybeCompressedFile(filename string) *MaybeCompressedFile

CreateMaybeCompressedFile creates a file, that may be compressed, if a certain amount of data is written to it.

func OpenMaybeCompressedFile added in v0.2.9

func OpenMaybeCompressedFile(filename string) (*MaybeCompressedFile, error)

OpenMaybeCompressedFile returns a file, that may be transparently decompressed on the fly.
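How the package decides whether a file is compressed is not documented here. One common way to get transparent decompression, shown below purely as an assumption, is to peek at the first two bytes and look for the gzip magic number. `maybeDecompress`, `gzipBytes` and `readAll` are hypothetical helpers.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// maybeDecompress peeks at the first two bytes of r. Gzip streams start
// with 0x1f 0x8b; anything else is passed through unchanged.
func maybeDecompress(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(2)
	if err == nil && magic[0] == 0x1f && magic[1] == 0x8b {
		zr, err := gzip.NewReader(br)
		if err != nil {
			return nil, err
		}
		return zr, nil
	}
	// Plain or very short input: hand back the buffered reader as is.
	return br, nil
}

// gzipBytes compresses s in memory, for demonstration purposes.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

// readAll returns the logical content of data, compressed or not.
func readAll(data []byte) (string, error) {
	r, err := maybeDecompress(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	b, err := io.ReadAll(r)
	return string(b), err
}

func main() {
	plain, _ := readAll([]byte("<record/>"))
	packed, _ := readAll(gzipBytes("<record/>"))
	fmt.Println(plain == packed)
}
```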

func (*MaybeCompressedFile) Close added in v0.2.9

func (f *MaybeCompressedFile) Close() error

func (*MaybeCompressedFile) Name added in v0.2.9

func (f *MaybeCompressedFile) Name() string

func (*MaybeCompressedFile) Read added in v0.2.9

func (f *MaybeCompressedFile) Read(p []byte) (n int, err error)

func (*MaybeCompressedFile) Write added in v0.2.9

func (f *MaybeCompressedFile) Write(p []byte) (n int, err error)

type OAIError

type OAIError struct {
	Code    string
	Message string
}

OAIError wraps OAI error codes and messages.

func (OAIError) Error

func (e OAIError) Error() string

Error to satisfy interface.
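Because the error type carries the OAI error code, callers can branch on it. The sketch below re-declares a similar type locally (the message format of the real Error method is an assumption) and uses errors.As to treat the protocol's `noRecordsMatch` code as an empty result. `fakeHarvest` and `isNoRecordsMatch` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// oaiError is a local stand-in with the same fields as OAIError.
type oaiError struct {
	Code    string
	Message string
}

// Error formats code and message; the real format may differ.
func (e oaiError) Error() string {
	return fmt.Sprintf("oai: %s %s", e.Code, e.Message)
}

// isNoRecordsMatch reports whether err wraps an OAI error with the
// noRecordsMatch code, which signals an empty result set.
func isNoRecordsMatch(err error) bool {
	var oe oaiError
	return errors.As(err, &oe) && oe.Code == "noRecordsMatch"
}

// fakeHarvest simulates a harvest step that fails with the given code.
func fakeHarvest(code string) error {
	return fmt.Errorf("harvest window: %w", oaiError{Code: code, Message: "simulated"})
}

func main() {
	for _, code := range []string{"noRecordsMatch", "badArgument"} {
		err := fakeHarvest(code)
		if isNoRecordsMatch(err) {
			fmt.Println("empty window, continuing:", err)
			continue
		}
		fmt.Println("giving up:", err)
	}
}
```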

type RepositoryInfo added in v0.1.9

type RepositoryInfo struct {
	Endpoint string              `json:"endpoint,omitempty"`
	Elapsed  float64             `json:"elapsed,omitempty"`
	About    Identify            `json:"about,omitempty"`
	Formats  ListMetadataFormats `json:"formats,omitempty"`
	Sets     ListSets            `json:"sets,omitempty"`
	Errors   []error             `json:"errors,omitempty"`
}

RepositoryInfo holds some information about the repository.

func AboutEndpoint added in v0.2.4

func AboutEndpoint(endpoint string, timeout time.Duration) (*RepositoryInfo, error)

AboutEndpoint returns information about a repository. Execution time limited by timeout.

func (RepositoryInfo) MarshalJSON added in v0.2.4

func (ri RepositoryInfo) MarshalJSON() ([]byte, error)

MarshalJSON formats the RepositoryInfo a bit terser than the default serialization.

type Request

type Request struct {
	Endpoint        string
	Verb            string
	From            time.Time
	Until           time.Time
	Set             string
	Prefix          string
	Identifier      string
	ResumptionToken string
}

Request can hold any parameter, that you want to send to an OAI server.

func (*Request) URL

func (r *Request) URL() (s string, err error)

URL returns the absolute URL for a given request. Catches basic errors like missing endpoint or bad verb.
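As a rough illustration of what building such a URL involves, the sketch below assembles OAI-PMH query parameters with net/url and rejects a missing endpoint or an unknown verb. `buildURL` is a hypothetical function, not the package's implementation; the parameter names (`verb`, `metadataPrefix`, `set`, `from`, `until`) come from the OAI-PMH protocol.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

var (
	errNoEndpoint = errors.New("an endpoint is required")
	errBadVerb    = errors.New("bad verb")
)

// verbs lists the six OAI-PMH verbs.
var verbs = map[string]bool{
	"Identify":            true,
	"ListIdentifiers":     true,
	"ListSets":            true,
	"ListMetadataFormats": true,
	"ListRecords":         true,
	"GetRecord":           true,
}

// buildURL is a hypothetical sketch. Empty optional values are omitted;
// url.Values.Encode sorts parameters by key, so the output is stable.
func buildURL(endpoint, verb, prefix, set, from, until string) (string, error) {
	if endpoint == "" {
		return "", errNoEndpoint
	}
	if !verbs[verb] {
		return "", errBadVerb
	}
	v := url.Values{}
	v.Set("verb", verb)
	optional := map[string]string{
		"metadataPrefix": prefix,
		"set":            set,
		"from":           from,
		"until":          until,
	}
	for key, val := range optional {
		if val != "" {
			v.Set(key, val)
		}
	}
	return endpoint + "?" + v.Encode(), nil
}

func main() {
	u, err := buildURL("http://example.org/oai", "ListRecords",
		"epicur", "ulbdvester", "2010-01-01", "2010-12-31")
	fmt.Println(u, err)
	_, err = buildURL("http://example.org/oai", "ListEverything", "", "", "", "")
	fmt.Println(err)
}
```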

func (*Request) UseDefaults added in v0.2.1

func (r *Request) UseDefaults()

UseDefaults will fill in default values for From, Until and Prefix if they are missing.

type Response

type Response struct {
	xml.Name `xml:"response"`
	Date     string `xml:"responseDate"`
	Request  struct {
		Verb     string `xml:"verb,attr"`
		Endpoint string `xml:",chardata"`
	} `xml:"request,omitempty"`
	Error struct {
		Code    string `xml:"code,attr"`
		Message string `xml:",chardata"`
	} `xml:"error"`
	ListIdentifiers     ListIdentifiers     `xml:"ListIdentifiers,omitempty"`
	ListMetadataFormats ListMetadataFormats `xml:"ListMetadataFormats,omitempty" json:"sets"`
	ListSets            ListSets            `xml:"ListSets,omitempty" json:"sets"`
	ListRecords         ListRecords         `xml:"ListRecords,omitempty"`
	Identify            Identify            `xml:"Identify,omitempty" json:"identity,omitempty"`
}

Response can hold most answers to a request to an OAI server.

type TimeShiftFunc added in v0.2.1

type TimeShiftFunc func(time.Time) time.Time

type Window added in v0.2.1

type Window struct {
	From  time.Time
	Until time.Time
}

Window represents a span of time; both From and Until are inclusive.

func (Window) Monthly added in v0.2.1

func (w Window) Monthly() []Window

func (Window) Weekly added in v0.2.1

func (w Window) Weekly() []Window
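Monthly and Weekly are not documented further. A minimal sketch of splitting an inclusive date range into windows of at most seven days might look like the following; the local `window` type and the `weekly` and `weeklyStrings` functions are hypothetical, and the package's own Weekly may align boundaries differently (for example to calendar weeks).

```go
package main

import (
	"fmt"
	"time"
)

// window is a local stand-in for Window, with inclusive bounds.
type window struct {
	From, Until time.Time
}

// weekly splits w into consecutive windows of at most seven days. The
// last window is shortened so that it ends exactly at w.Until.
func weekly(w window) []window {
	var out []window
	for start := w.From; !start.After(w.Until); start = start.AddDate(0, 0, 7) {
		end := start.AddDate(0, 0, 6)
		if end.After(w.Until) {
			end = w.Until
		}
		out = append(out, window{start, end})
	}
	return out
}

// weeklyStrings parses two dates and formats the resulting windows as
// "from..until" strings.
func weeklyStrings(from, until string) ([]string, error) {
	const layout = "2006-01-02"
	f, err := time.Parse(layout, from)
	if err != nil {
		return nil, err
	}
	u, err := time.Parse(layout, until)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, w := range weekly(window{f, u}) {
		out = append(out, w.From.Format(layout)+".."+w.Until.Format(layout))
	}
	return out, nil
}

func main() {
	ws, err := weeklyStrings("2010-01-01", "2010-01-20")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, s := range ws {
		fmt.Println(s)
	}
}
```

Small windows keep each request cheap to retry, which fits the weekly chunking described for the caching client.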

type WriterClient added in v0.2.1

type WriterClient struct {
	// RootTag is used as synthetic root element.
	RootTag string
	// MaxRequests, zero means no limit. Default of 4096 will prevent endless
	// loops due to broken resumptionToken implementations (e.g.
	// http://goo.gl/KFb9iM).
	MaxRequests int
	// contains filtered or unexported fields
}

WriterClient can execute requests, but writes results to a given writer.

func NewWriterClient added in v0.2.1

func NewWriterClient(w io.Writer) WriterClient

func (WriterClient) Do added in v0.2.1

func (c WriterClient) Do(req Request) error

Do will execute a request and write all XML to the writer.

Directories

Path Synopsis
cmd
