sentry

command module
v0.0.0-...-20d8710
Published: Oct 2, 2018 License: AGPL-3.0 Imports: 25 Imported by: 0

README

Sentry

Sentry is a parallelized web crawler written in Go that writes URLs, links, and response headers to a Postgres database, then stores the response itself on Amazon S3. It keeps a list of “sources”, which use simple string comparison to keep it from wandering outside of designated domains or URL paths.
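As a rough illustration of how a string-comparison scope check could work, here is a minimal sketch. It assumes sources are matched as prefixes of the URL after the scheme is removed; the function names `stripScheme` and `withinSources` and the example sources are hypothetical and are not taken from sentry's code.

```go
package main

import (
	"fmt"
	"strings"
)

// stripScheme removes a leading https:// or http:// so that sources can be
// written with or without a scheme (an assumption of this sketch).
func stripScheme(rawURL string) string {
	rawURL = strings.TrimPrefix(rawURL, "https://")
	return strings.TrimPrefix(rawURL, "http://")
}

// withinSources reports whether rawURL starts with any of the source strings.
func withinSources(rawURL string, sources []string) bool {
	u := stripScheme(rawURL)
	for _, s := range sources {
		if strings.HasPrefix(u, stripScheme(s)) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical sources: one restricted to a URL path, one to a whole domain.
	sources := []string{"foo.com/data", "https://example.org"}
	for _, u := range []string{
		"https://foo.com/data/file.csv",
		"https://foo.com/blog",
		"http://example.org/about",
	} {
		fmt.Printf("%s in scope: %v\n", u, withinSources(u, sources))
	}
}
```

A plain prefix match is cheap but loose: a source of `example.org` would also match `example.org.evil.com`, so a stricter check would compare the parsed host and path separately.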

The big difference from other crawlers is a tunable “stale duration”: if the time since the last GET request to a page exceeds the stale duration, the crawler captures an updated snapshot of that page. This gives it a continual “watching” property.
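The staleness rule amounts to a single time comparison. The sketch below assumes a page that has never been fetched counts as stale; `isStale`, the 72-hour duration, and the example timestamps are all hypothetical values chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Example values for this sketch; both are hypothetical.
var (
	staleDuration = 72 * time.Hour
	now           = time.Date(2018, time.October, 2, 12, 0, 0, 0, time.UTC)
)

// isStale reports whether more than stale has elapsed between lastGet and now.
// A zero lastGet (never fetched) is always treated as stale.
func isStale(lastGet, now time.Time, stale time.Duration) bool {
	if lastGet.IsZero() {
		return true
	}
	return now.Sub(lastGet) > stale
}

func main() {
	fresh := now.Add(-24 * time.Hour) // fetched one day ago
	old := now.Add(-96 * time.Hour)   // fetched four days ago
	fmt.Println("fresh page stale:", isStale(fresh, now, staleDuration))
	fmt.Println("old page stale:", isStale(old, now, staleDuration))
}
```

Passing `now` as a parameter rather than calling `time.Now()` inside the function keeps the check deterministic and easy to test.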

Sentry maintains a separate scraping stream for any URL that looks like a dataset. When it encounters a URL such as https://foo.com/file.csv, it assumes from the file ending that the target may be a static asset, and places that URL on a separate thread for archiving.
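One way to recognise dataset-like URLs is to inspect the file extension of the URL path. The following is a sketch under that assumption; the extension list and the name `looksLikeDataset` are hypothetical, and the list sentry actually uses may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// datasetExtensions is a hypothetical list of file endings treated as datasets.
var datasetExtensions = map[string]bool{
	".csv":  true,
	".tsv":  true,
	".json": true,
	".xls":  true,
	".xlsx": true,
	".zip":  true,
}

// looksLikeDataset reports whether the path of rawURL ends in a known
// dataset extension. Query strings and letter case are ignored.
func looksLikeDataset(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	ext := strings.ToLower(path.Ext(u.Path))
	return datasetExtensions[ext]
}

func main() {
	var pages, datasets []string
	for _, u := range []string{
		"https://foo.com/file.csv",
		"https://foo.com/about",
		"https://foo.com/archive.ZIP?download=1",
	} {
		// Route each URL to the queue that matches its kind.
		if looksLikeDataset(u) {
			datasets = append(datasets, u)
		} else {
			pages = append(pages, u)
		}
	}
	fmt.Println("pages:", pages)
	fmt.Println("datasets:", datasets)
}
```

Parsing the URL first means a query string such as `?download=1` does not hide the extension.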

Copyright (C) 2017 Data Together

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests, and Pull Requests (PRs) for submitting changes.

Usage

Though it has mostly been used with the Data Together webapp, sentry is a stand-alone web crawler and can be used on its own. It currently requires a somewhat elaborate infrastructure; for instance, it cannot simply be fed a job over the command line.

At present, sentry reads crawling instructions directly from a Postgres database (see the schema file for details of the database structure), and places crawled resources in an S3 bucket. For every domain to be crawled, create a record in the sources table with crawl set to true. Sentry will crawl that domain repeatedly. Resources will be hashed and stored on S3, where they can be retrieved by content or any other service capable of reverse-engineering the identifying hash. Other storage backends are planned (see roadmap, below), and if you are interested in helping to develop them please contact us!

Installation and Configuration

Docker installation

To get started developing using Docker and Docker Compose, run:

$ git clone git@github.com:datatogether/webapp.git
$ cd webapp
$ docker-compose up

Manual installation

  1. Install Go language
  2. Download and build repository
    export GOPATH=$(go env GOPATH)
    mkdir -pv "$GOPATH"
    
    # go get places the source under $GOPATH/src
    go get github.com/datatogether/sentry
    cd "$GOPATH/src/github.com/datatogether/sentry"
    go install
    
  3. Configure Postgres server and then set connection URL
    export POSTGRES_DB_URL=postgres://[USERNAME_HERE]:[PASSWORD_HERE]@localhost:[PORT]/[DB_NAME]
    
  4. Run sentry
    $GOPATH/bin/sentry
    
  5. Configure S3 buckets [TODO]
    • on production
    • on development (how do you work with them in development env?)

Roadmap

Two major changes to sentry will make it much more generally usable:

  • we plan to shift the storage backend from S3 to IPFS. Once this is accomplished, any local or remote IPFS node can be used as a storage node.
  • we are considering additional mechanisms for adding crawls to sentry's queue. This should make sentry distinctly more flexible.

In parallel to building this tool, we have engaged in efforts to map the landscape of similar projects:

👀 See: Comparison of web archiving software

Documentation

Overview

Sentry is a service for archiving URLs with a clean, current, auditable trail of digital provenance. Its main job is to issue GET requests to a given URL and record all information related to the request, namely: timestamp, HTTP response headers, SHA-256 of the response body, content-type sniff, content size, and download time. If desired, sentry can also store the response, currently to Amazon S3.
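To make the body-derived fields concrete, here is a minimal sketch that computes the SHA-256, a content-type sniff, and the size of a response body using only the Go standard library. The `BodyInfo` type and `describeBody` function are hypothetical names for illustration; timestamp, headers, and download time are left out because they come from the request itself rather than the body.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
)

// BodyInfo is a hypothetical record of the values derived from a response body.
type BodyInfo struct {
	Sha256       string // hex-encoded SHA-256 of the body
	ContentSniff string // MIME type guessed from the first bytes of the body
	Size         int    // body length in bytes
}

// describeBody computes the hash, sniffed content type, and size of body.
func describeBody(body []byte) BodyInfo {
	sum := sha256.Sum256(body)
	return BodyInfo{
		Sha256:       hex.EncodeToString(sum[:]),
		ContentSniff: http.DetectContentType(body), // inspects at most 512 bytes
		Size:         len(body),
	}
}

func main() {
	info := describeBody([]byte("<html><body>hello</body></html>"))
	fmt.Printf("%+v\n", info)
}
```

Because the hash depends only on the bytes of the body, two snapshots of an unchanged page produce the same identifier, which is what makes a content-addressed store practical.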

Sentry archives URLs in two ways:

  • using a configurable built-in web crawler
