clamber

Version: v0.0.0-...-df4f15f
Published: Dec 20, 2019 License: Apache-2.0

A fast, efficient web crawler with an API for bidirectional path finding.

The distributed version is in progress. A standalone version is also available.

I chose a purely AWS infrastructure stack because I wanted a project where I could apply all the technologies covered in the AWS DevOps Engineer Certification syllabus.

If I were to choose a stack based on what I believe to be the most appropriate, I would use the following:

  • AWS/Azure/GCP
  • Apache Kafka
  • Dgraph
  • Kubernetes (Preferably PaaS)
  • EFK
  • Prometheus, Alertmanager & Grafana
  • Drop the Page Store and store the Page data in Dgraph.

Software Design

Infrastructure Design

Stay tuned...

Endpoints

The /search endpoint takes a URL, a depth, and allow_external_links, then checks the Page Database to see whether the data is already stored. If it is, the stored data is queried and returned. If not, a recursive crawl is initiated.
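The lookup-then-crawl flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `Page`, `PageStore`, `Fetcher`, `Search`, and `crawl` are hypothetical names, the Page Database is replaced by an in-memory map, and fetching is injected as a function so the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// Page is a hypothetical record for one crawled URL and its outgoing links.
type Page struct {
	URL   string
	Links []string
}

// PageStore is an in-memory stand-in for the Page Database (an assumption).
type PageStore struct {
	mu    sync.Mutex
	pages map[string]Page
}

func NewPageStore() *PageStore {
	return &PageStore{pages: make(map[string]Page)}
}

func (s *PageStore) Get(url string) (Page, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pages[url]
	return p, ok
}

func (s *PageStore) Put(p Page) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pages[p.URL] = p
}

// Fetcher returns the links found on a page. A real crawler would issue an
// HTTP request and parse the HTML here.
type Fetcher func(url string) []string

// Search returns the stored page if it is already known; otherwise it starts
// a recursive crawl and returns the freshly stored page.
func Search(store *PageStore, fetch Fetcher, url string, depth int) Page {
	if p, ok := store.Get(url); ok {
		return p
	}
	crawl(store, fetch, url, depth, 0)
	p, _ := store.Get(url)
	return p
}

// crawl visits url and follows its links until maxDepth is reached.
// A maxDepth of -1 means no depth limit. Pages already in the store are
// skipped, which also prevents infinite loops on cyclic links.
func crawl(store *PageStore, fetch Fetcher, url string, maxDepth, level int) {
	if _, seen := store.Get(url); seen {
		return
	}
	links := fetch(url)
	store.Put(Page{URL: url, Links: links})
	if maxDepth >= 0 && level >= maxDepth {
		return
	}
	for _, link := range links {
		crawl(store, fetch, link, maxDepth, level+1)
	}
}

func main() {
	// A tiny fake site used in place of real HTTP fetches.
	site := map[string][]string{
		"https://example.com":         {"https://example.com/about", "https://example.com/contact"},
		"https://example.com/about":   {"https://example.com"},
		"https://example.com/contact": {},
	}
	fetches := 0
	fetch := func(u string) []string { fetches++; return site[u] }

	store := NewPageStore()
	first := Search(store, fetch, "https://example.com", 1)
	Search(store, fetch, "https://example.com", 1) // served from the store
	fmt.Println("links on root:", len(first.Links), "fetches:", fetches)
}
```

The sketch crawls depth-first for brevity and treats any stored page as a hit regardless of the depth it was originally crawled at; a production crawler would likely need to track crawl depth and freshness per page.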

/search will take the following query parameters:

| Parameter | Type | Stability | Description |
|---|---|---|---|
| url | string | Tested | Starting URL for the sitemap |
| depth | int | Tested | -1 is infinite. If you specify 10, that is the maximum depth to crawl. |
| display_depth | int | Experimental | How deep a depth to return in the JSON |
| allow_external_links | bool | Not Yet Implemented | Whether to crawl external links or not |
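As an illustration of how a client might assemble these query parameters, here is a small sketch using Go's standard `net/url` package. The `SearchParams` type, the `BuildSearchURL` helper, and the base address `http://localhost:8000` are assumptions made for the example; the README does not state where the API is served.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// SearchParams mirrors the documented /search query parameters.
// The struct itself is hypothetical and not part of clamber.
type SearchParams struct {
	URL                string
	Depth              int // -1 means infinite
	DisplayDepth       int
	AllowExternalLinks bool
}

// BuildSearchURL appends the /search path and the encoded query string
// to the given base address.
func BuildSearchURL(base string, p SearchParams) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/search"

	q := url.Values{}
	q.Set("url", p.URL)
	q.Set("depth", strconv.Itoa(p.Depth))
	q.Set("display_depth", strconv.Itoa(p.DisplayDepth))
	q.Set("allow_external_links", strconv.FormatBool(p.AllowExternalLinks))
	u.RawQuery = q.Encode() // keys are emitted in sorted order

	return u.String(), nil
}

func main() {
	// The base address is an assumed placeholder.
	s, err := BuildSearchURL("http://localhost:8000", SearchParams{
		URL:          "https://example.com",
		Depth:        1,
		DisplayDepth: 10,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Using `url.Values` rather than string concatenation ensures the target URL is percent-encoded correctly inside the query string.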

Sample response:

{
    "query": {
      "url": "https://example.com",
      "depth": 1, 
      "display_depth": 10,
      "allow_external_links": false
    },
    "status": {
      "message": "5 pages found at a depth of 1.",
      "code": "200"
    },
    "results": [
        {
            "URL": "https://example.com",
            "timestamp": "<time>",
            "links": [
                {
                    "URL": "https://example.com/about",
                    "timestamp": "<time>",
                    "links": []
                },
                {
                    "URL": "https://example.com/contact",
                    "timestamp": "<time>",
                    "links": []
                },
                {
                    "URL": "https://example.com/faq",
                    "timestamp": "<time>",
                    "links": []
                },
                {
                    "URL": "https://example.com/offices",
                    "timestamp": "<time>",
                    "links": []
                }
            ]
        }
    ]
}
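A response of this shape could be decoded in Go along the following lines. The struct and function names (`SearchResponse`, `Page`, `ParseResponse`, `CountPages`) are hypothetical and chosen for this sketch; only the JSON field names are taken from the sample above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Page models one entry in "results"; links nest recursively.
type Page struct {
	URL       string `json:"URL"`
	Timestamp string `json:"timestamp"`
	Links     []Page `json:"links"`
}

// SearchResponse models the top-level sample response.
type SearchResponse struct {
	Query struct {
		URL                string `json:"url"`
		Depth              int    `json:"depth"`
		DisplayDepth       int    `json:"display_depth"`
		AllowExternalLinks bool   `json:"allow_external_links"`
	} `json:"query"`
	Status struct {
		Message string `json:"message"`
		Code    string `json:"code"` // the sample encodes the code as a string
	} `json:"status"`
	Results []Page `json:"results"`
}

// ParseResponse decodes a /search response body.
func ParseResponse(data []byte) (SearchResponse, error) {
	var r SearchResponse
	err := json.Unmarshal(data, &r)
	return r, err
}

// CountPages counts every page in the tree, including nested links.
func CountPages(pages []Page) int {
	n := 0
	for _, p := range pages {
		n += 1 + CountPages(p.Links)
	}
	return n
}

func main() {
	// A shortened version of the sample response.
	body := []byte(`{
		"query": {"url": "https://example.com", "depth": 1, "display_depth": 10, "allow_external_links": false},
		"status": {"message": "2 pages found at a depth of 1.", "code": "200"},
		"results": [{"URL": "https://example.com", "timestamp": "<time>", "links": [
			{"URL": "https://example.com/about", "timestamp": "<time>", "links": []}
		]}]
	}`)
	resp, err := ParseResponse(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(resp.Status.Code, CountPages(resp.Results))
}
```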

url depth startUrl parentUrl

Directories

| Path | Synopsis |
|---|---|
| api | module |
| app | module |
| clamber | |
| cmd/api | Package app serves the clamber package as an API. |
| common | module |
| pkg | |
| crawl | Package app provides the clamber crawling package. |
| service | |
| app | module |
