pdf2go

package module
v0.1.1
Published: May 28, 2023 | License: MIT | Imports: 6 | Imported by: 0

README

pdf2go

A Go wrapper for the Poppler PDF rendering library. A simple Go module for converting PDF files to text and HTML.


Installation

go get github.com/rudolfoborges/pdf2go

Poppler Installation

Poppler must be installed on your system.

macOS
brew install poppler
Ubuntu
sudo apt-get install poppler-utils
CentOS
sudo yum install poppler-utils
Windows

Download the latest version from poppler
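
On any platform, a quick way to confirm the installation is to check that the Poppler command-line tools are on PATH. The sketch below is only an illustration: the README does not say which executables the library invokes, so pdftotext and pdftohtml are assumed here because they ship with poppler-utils and match the text and HTML features. The helper name missingTools is hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names from tools that cannot be found on PATH.
func missingTools(tools ...string) []string {
	var missing []string
	for _, tool := range tools {
		if _, err := exec.LookPath(tool); err != nil {
			missing = append(missing, tool)
		}
	}
	return missing
}

func main() {
	// Assumption: these are the Poppler tools relevant to text and HTML extraction.
	if missing := missingTools("pdftotext", "pdftohtml"); len(missing) > 0 {
		fmt.Println("Poppler tools not found on PATH:", missing)
		return
	}
	fmt.Println("Poppler tools found")
}
```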

Usage

Extract Text
package main

import (
    "fmt"
    "github.com/rudolfoborges/pdf2go"
)

func main() {
    pdf, err := pdf2go.New("path/to/file.pdf", pdf2go.Config{
        LogLevel: pdf2go.LogLevelError,
    })

    if err != nil {
        panic(err)
    }

    text, err := pdf.Text()
    if err != nil {
        panic(err)
    }

    fmt.Println(text)

    pages, err := pdf.Pages()

    if err != nil {
        panic(err)
    }

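    for _, page := range pages {
        // Page.Text returns (string, error), so check the error before printing.
        pageText, err := page.Text()
        if err != nil {
            panic(err)
        }

        fmt.Println(pageText)
    }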
}
Extract HTML
package main

import (
    "fmt"
    "github.com/rudolfoborges/pdf2go"
)

func main() {
    pdf, err := pdf2go.New("path/to/file.pdf", pdf2go.Config{
        LogLevel: pdf2go.LogLevelError,
    })

    if err != nil {
        panic(err)
    }

    html, err := pdf.Html()
    if err != nil {
        panic(err)
    }

    fmt.Println(html)

    pages, err := pdf.Pages()

    if err != nil {
        panic(err)
    }

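    for _, page := range pages {
        // Page.Html returns (string, error), so check the error before printing.
        pageHTML, err := page.Html()
        if err != nil {
            panic(err)
        }

        fmt.Println(pageHTML)
    }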
}
More examples are available in the examples folder.

Next Steps

  • Add image extraction
  • Extract image from specific page

Documentation

Index

Constants

const (
	LogLevelError = "error"
	LogLevelDegub = "debug"
)

LogLevelDegub is the debug log level.

Variables

var (
	ErrInvalidPath        = errors.New("Invalid pdf path")
	ErrInvalidPagesNumber = errors.New("Invalid pages number. It must be greater than 0")
)

Functions

This section is empty.

Types

type Config

type Config struct {
	// Logger is the logger used by the PDFReader.
	// If nil, the DefaultLogger is used.
	Logger Logger

	// LogLevel is the log level used by the logger.
	// It can be "error" or "debug".
	LogLevel string
}

Config is the configuration used by the PDFReader. It is used to configure the logger and the log level.

type DefaultLogger

type DefaultLogger struct {
	// contains filtered or unexported fields
}

DefaultLogger is the default logger.

func NewDefaultLogger

func NewDefaultLogger(logLevel string) *DefaultLogger

NewDefaultLogger creates a new DefaultLogger.

func (*DefaultLogger) Debugf

func (l *DefaultLogger) Debugf(format string, args ...interface{})

Debugf logs a debug message. It logs only if the level is "debug".

func (*DefaultLogger) Errorf

func (l *DefaultLogger) Errorf(format string, args ...interface{})

Errorf logs an error message.

type Logger

type Logger interface {
	// Debugf logs a debug message.
	Debugf(format string, args ...interface{})

	// Errorf logs an error message.
	Errorf(format string, args ...interface{})
}

Logger is the interface that wraps the Debugf and Errorf methods.
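
Any type with these two methods can serve as a Logger. As a rough sketch, the example below redeclares the interface locally so it compiles on its own and implements it with a hypothetical bufferLogger that collects messages in memory, which can be handy in tests.

```go
package main

import (
	"fmt"
	"strings"
)

// Logger mirrors the interface documented above.
type Logger interface {
	Debugf(format string, args ...interface{})
	Errorf(format string, args ...interface{})
}

// bufferLogger is a hypothetical Logger that stores formatted messages.
type bufferLogger struct {
	lines []string
}

// Debugf records a message with a DEBUG prefix.
func (b *bufferLogger) Debugf(format string, args ...interface{}) {
	b.lines = append(b.lines, "DEBUG "+fmt.Sprintf(format, args...))
}

// Errorf records a message with an ERROR prefix.
func (b *bufferLogger) Errorf(format string, args ...interface{}) {
	b.lines = append(b.lines, "ERROR "+fmt.Sprintf(format, args...))
}

// Compile-time check that *bufferLogger satisfies Logger.
var _ Logger = (*bufferLogger)(nil)

func main() {
	logger := &bufferLogger{}
	logger.Debugf("extracting page %d", 3)
	logger.Errorf("extraction failed: %s", "timeout")
	fmt.Println(strings.Join(logger.lines, "\n"))
}
```

With the real package, such a value would be supplied through the Logger field of Config or as the logger argument of NewPDFReader, per the signatures on this page.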

type PDFReader

type PDFReader struct {
	// contains filtered or unexported fields
}

PDFReader represents a PDF file.

func New

func New(path string, config Config) (*PDFReader, error)

New creates a new PDFReader for the PDF file at path, using config to set up the logger and log level. It returns an error if the PDFReader cannot be created.

func NewPDFReader

func NewPDFReader(path string, logger Logger) (*PDFReader, error)

NewPDFReader creates a new PDFReader for the PDF file at path, using the given logger. It returns an error if the PDFReader cannot be created.

func (*PDFReader) Author

func (r *PDFReader) Author() string

Author returns the author of the PDF file or an empty string if the author is not defined.

func (*PDFReader) CreationDate

func (r *PDFReader) CreationDate() string

CreationDate returns the creation date of the PDF file or an empty string if the creation date is not defined.

func (*PDFReader) Encrypted

func (r *PDFReader) Encrypted() bool

Encrypted returns true if the PDF file is encrypted.

func (*PDFReader) Html

func (r *PDFReader) Html() (string, error)

Html returns the HTML of all pages from the PDF file. It returns an error if the HTML cannot be extracted.

func (*PDFReader) Pages

func (r *PDFReader) Pages() ([]*Page, error)

Pages returns a slice of pointers to Page structs. Each Page struct contains the page number and the text and HTML extractors.

func (*PDFReader) PagesNumber

func (r *PDFReader) PagesNumber() int

PagesNumber returns the number of pages in the PDF file associated with the PDFReader.

func (*PDFReader) Text

func (r *PDFReader) Text() (string, error)

Text returns the text of all pages from the PDF file. It returns an error if the text cannot be extracted.

func (*PDFReader) Title

func (r *PDFReader) Title() string

Title returns the title of the PDF file or an empty string if the title is not defined.
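
The metadata accessors above return zero values (an empty string) for undefined fields, so callers may want a fallback when printing them. The sketch below is illustrative only: it declares a local interface with the documented method set and exercises it with a hypothetical fakeDoc, so it runs without Poppler. Per the signatures on this page, a *PDFReader would satisfy the same interface.

```go
package main

import "fmt"

// metadata lists the accessor methods documented for PDFReader.
type metadata interface {
	Title() string
	Author() string
	CreationDate() string
	Encrypted() bool
	PagesNumber() int
}

// orUnknown substitutes a placeholder for an undefined (empty) field.
func orUnknown(s string) string {
	if s == "" {
		return "(unknown)"
	}
	return s
}

// summary formats the metadata on a single line.
func summary(m metadata) string {
	return fmt.Sprintf("title=%s author=%s created=%s encrypted=%t pages=%d",
		orUnknown(m.Title()), orUnknown(m.Author()), orUnknown(m.CreationDate()),
		m.Encrypted(), m.PagesNumber())
}

// fakeDoc is a hypothetical stand-in for a PDFReader.
type fakeDoc struct {
	title, author, created string
	encrypted              bool
	pages                  int
}

func (d fakeDoc) Title() string        { return d.title }
func (d fakeDoc) Author() string       { return d.author }
func (d fakeDoc) CreationDate() string { return d.created }
func (d fakeDoc) Encrypted() bool      { return d.encrypted }
func (d fakeDoc) PagesNumber() int     { return d.pages }

func main() {
	fmt.Println(summary(fakeDoc{title: "Report", pages: 2}))
}
```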

type Page

type Page struct {
	Number        int
	TextExtractor core.Extractor
	HtmlExtractor core.Extractor
	// contains filtered or unexported fields
}

Page represents a page from the PDF file.

func NewPage

func NewPage(number int, textExtractor, htmlExtractor core.Extractor, logger Logger) *Page

NewPage creates a new Page.

func (*Page) Html

func (p *Page) Html() (string, error)

Html returns the HTML of the page. It returns an error if the HTML cannot be extracted.

func (*Page) Text

func (p *Page) Text() (string, error)

Text returns the text of the page. It returns an error if the text cannot be extracted.

Directories

Path Synopsis
internal
