scrapify

package module
v0.0.1
Published: Sep 16, 2024 License: MIT Imports: 3 Imported by: 0

README

Scrapify

Scrapify is a flexible and decoupled Go package for scraping paginated web pages. It is designed to work with any type of page that lists items and supports pagination. The package allows users to provide their own scraping logic through a callback mechanism, making it versatile and easy to integrate into various projects.

Features

  • Flexible Scraping: Plug in custom scraping logic by implementing the IScraper interface.
  • Paginated Requests: Automatically handles pagination and scraping of multiple pages.
  • Callback Processing: Users can define a callback function to process scraped data.
  • Configurable Request Intervals: Set the interval between requests to manage load and avoid rate limiting.

Installation

To use Scrapify in your Go project, first install the package with:

go get github.com/ricardocastanho/scrapify

Usage

Here is a basic example of how to use Scrapify:

1. Define Your Scraper Implementation

Implement the IScraper interface with your custom scraping logic.

package main

import (
    "context"

    "github.com/ricardocastanho/scrapify"
)

type ExampleScraper struct{}

// Compile-time check that ExampleScraper satisfies scrapify.IScraper[string].
var _ scrapify.IScraper[string] = ExampleScraper{}

func (e ExampleScraper) GetUrls(ctx context.Context, url string) ([]string, []string) {
    // Implement URL extraction logic here: return the item URLs found on the
    // page, then the URLs of the next pages to follow. Return an empty second
    // slice when there are no more pages.
    return []string{"url1", "url2"}, []string{"nextPageUrl"}
}

func (e ExampleScraper) GetData(ctx context.Context, ch chan<- string, data *string, url string) {
    // Implement data extraction logic here, then send the result to the channel.
    *data = "Example data from " + url
    ch <- *data
}
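
For a more realistic scraper, GetUrls can fetch and parse actual HTML. The sketch below is one possible approach, not part of Scrapify itself: it assumes golang.org/x/net/html for parsing and a hypothetical "page=" query marker to tell pagination links apart from item links.

package main

import (
    "context"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

type HTMLScraper struct{}

func (h HTMLScraper) GetUrls(ctx context.Context, url string) ([]string, []string) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, nil
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, nil
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        return nil, nil
    }

    // Collect every <a href="..."> on the page.
    var links []string
    var walk func(n *html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, attr := range n.Attr {
                if attr.Key == "href" {
                    links = append(links, attr.Val)
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)

    // Split item links from pagination links; "page=" stands in for whatever
    // pattern the target site actually uses.
    var items, nextPages []string
    for _, link := range links {
        if strings.Contains(link, "page=") {
            nextPages = append(nextPages, link)
        } else {
            items = append(items, link)
        }
    }
    return items, nextPages
}

func (h HTMLScraper) GetData(ctx context.Context, ch chan<- string, data *string, url string) {
    // A real implementation would fetch url and extract the fields it needs.
    *data = "scraped: " + url
    ch <- *data
}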

2. Create and Run the Scraper

Instantiate the Scraper with a list of ScraperStrategy values, a callback function, and the interval to wait between requests.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/ricardocastanho/scrapify"
)

func main() {
    strategies := []scrapify.ScraperStrategy[string]{
        {
            Scraper: ExampleScraper{},
            Url:     "https://example.com",
        },
    }

    callback := func(data string) {
        fmt.Println("Processed data:", data)
    }

    scraper := scrapify.NewScraper(strategies, callback, time.Second*2)
    scraper.Run(context.Background())
}
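
Note: the stub GetUrls above always reports a further page ("nextPageUrl"), so as written the crawl would only terminate if the library deduplicates visited URLs. A real implementation should return an empty next-page slice once pagination is exhausted.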

API

type Scraper[T any]

Scraper is the main struct for orchestrating the scraping process.

  • func NewScraper[T any](s []ScraperStrategy[T], callback func(T), requestDelay time.Duration) *Scraper[T]: Creates a new Scraper instance.

  • func (s *Scraper[T]) Run(ctx context.Context): Starts the scraping process.

  • func (s *Scraper[T]) getData(ctx context.Context): Handles data extraction and processing (unexported).

  • func (s *Scraper[T]) runScraper(ctx context.Context, strategy ScraperStrategy[T]): Executes the scraping logic for a single strategy (unexported).

type IScraper[T any]

IScraper is an interface for implementing custom scraping logic.

  • GetUrls(ctx context.Context, url string) ([]string, []string): Returns the item URLs found on the current page and the URLs of the next pages to paginate through.

  • GetData(ctx context.Context, ch chan<- T, data *T, url string): Performs the data scraping for a given URL and sends the result to the channel.
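
Because the interface is generic, T need not be a string. A minimal sketch with a struct payload (the Listing type and its fields are illustrative, not part of the package):

package main

import "context"

type Listing struct {
    Title string
    URL   string
}

type ListingScraper struct{}

func (l ListingScraper) GetUrls(ctx context.Context, url string) ([]string, []string) {
    // Illustrative stub: one item URL, no further pages.
    return []string{url + "/item/1"}, nil
}

func (l ListingScraper) GetData(ctx context.Context, ch chan<- Listing, data *Listing, url string) {
    // Fill the caller-provided slot, then publish it on the channel.
    *data = Listing{Title: "Example listing", URL: url}
    ch <- *data
}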

Contributing

Feel free to open issues or submit pull requests if you have suggestions or improvements.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type IScraper

type IScraper[T any] interface {
	// GetUrls retrieves the URLs from the current page and the URLs of the next pages for pagination.
	GetUrls(ctx context.Context, url string) ([]string, []string)

	// GetData scrapes the data from a given URL and sends it to the provided channel.
	GetData(ctx context.Context, ch chan<- T, data *T, url string)
}

IScraper is an interface that defines the methods required for any scraper implementation. T is a generic type representing the data being scraped.

type Scraper

type Scraper[T any] struct {
	// contains filtered or unexported fields
}

Scraper represents the main structure that coordinates scraping jobs across multiple strategies. It manages the scraping process, handles concurrency, and invokes a user-defined callback when data is scraped.

func NewScraper

func NewScraper[T any](s []ScraperStrategy[T], callback func(T), requestDelay time.Duration) *Scraper[T]

NewScraper creates a new Scraper instance. s is the list of strategies to run, callback is the function that processes each piece of scraped data, and requestDelay is the delay to wait between requests.
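
For example, wiring the README's ExampleScraper to a print callback (the two-second delay is an arbitrary choice):

	scraper := scrapify.NewScraper(
		[]scrapify.ScraperStrategy[string]{
			{Scraper: ExampleScraper{}, Url: "https://example.com"},
		},
		func(data string) { fmt.Println("Processed:", data) },
		2*time.Second,
	)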

func (*Scraper[T]) Run

func (s *Scraper[T]) Run(ctx context.Context)

Run starts the entire scraping process by running each strategy and managing concurrency. It waits for all scraping jobs to complete before closing the channels.
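
Because Run takes a context, callers can bound the whole crawl, for instance with a timeout (a sketch; this assumes the implementation observes cancellation):

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	scraper.Run(ctx) // blocks until all jobs complete; the timeout is a safety bound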

type ScraperJob

type ScraperJob[T any] struct {
	// contains filtered or unexported fields
}

ScraperJob represents a job containing the scraper and a list of URLs to process. T is the type of data being scraped.

type ScraperStrategy

type ScraperStrategy[T any] struct {
	Scraper IScraper[T] // The scraper implementation used to scrape data from the target URL.
	Url     string      // The URL to start scraping from.
}

ScraperStrategy defines the strategy for scraping a specific URL with a given scraper implementation. T represents the type of data being scraped.
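
A single Scraper can run several strategies, each pairing a start URL with its own IScraper implementation, as long as they produce the same T (the second scraper and both URLs are illustrative):

	strategies := []scrapify.ScraperStrategy[string]{
		{Scraper: ExampleScraper{}, Url: "https://example.com/list"},
		{Scraper: HTMLScraper{}, Url: "https://example.org/catalog"},
	}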
