pdf2go

v0.1.1 Published: May 28, 2023 License: MIT Imports: 6 Imported by: 0

Repository

github.com/rudolfoborges/pdf2go

README ¶

pdf2go

A Go wrapper for the Poppler PDF rendering library. A simple Go module for converting PDF to text and HTML.

Installation

go get github.com/rudolfoborges/pdf2go

Poppler Installation

Poppler must be installed on your system.

macOS

brew install poppler

Ubuntu

sudo apt-get install poppler-utils

CentOS

sudo yum install poppler-utils

Windows

Download the latest version from poppler
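
Because the module depends on Poppler being installed, it can help to verify the command-line tools are reachable before opening a PDF. The following is a minimal sketch using only the standard library. The binary names (pdftotext, pdftohtml, pdfinfo) are the standard Poppler utilities; that pdf2go calls exactly these three is an assumption, and missingTools is a hypothetical helper, not part of the module.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names from tools that cannot be found on PATH.
// It is a hypothetical helper, not part of pdf2go.
func missingTools(tools []string) []string {
	var missing []string
	for _, name := range tools {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumption: these standard Poppler utilities are the ones needed.
	tools := []string{"pdftotext", "pdftohtml", "pdfinfo"}

	if missing := missingTools(tools); len(missing) > 0 {
		fmt.Println("missing Poppler tools:", missing)
		return
	}
	fmt.Println("all Poppler tools found")
}
```

Running a check like this at startup gives a clearer failure message than an extraction error later on.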

Usage

Extract Text

package main

import (
    "fmt"
    "github.com/rudolfoborges/pdf2go"
)

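func main() {
    // Open the PDF; only errors are logged.
    pdf, err := pdf2go.New("path/to/file.pdf", pdf2go.Config{
        LogLevel: pdf2go.LogLevelError,
    })
    if err != nil {
        panic(err)
    }

    // Extract the text of the whole document.
    text, err := pdf.Text()
    if err != nil {
        panic(err)
    }
    fmt.Println(text)

    // Extract the text page by page.
    pages, err := pdf.Pages()
    if err != nil {
        panic(err)
    }

    for _, page := range pages {
        pageText, err := page.Text()
        if err != nil {
            panic(err)
        }
        fmt.Println(pageText)
    }
}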

Extract HTML

package main

import (
    "fmt"
    "github.com/rudolfoborges/pdf2go"
)

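func main() {
    // Open the PDF; only errors are logged.
    pdf, err := pdf2go.New("path/to/file.pdf", pdf2go.Config{
        LogLevel: pdf2go.LogLevelError,
    })
    if err != nil {
        panic(err)
    }

    // Extract the HTML of the whole document.
    html, err := pdf.Html()
    if err != nil {
        panic(err)
    }
    fmt.Println(html)

    // Extract the HTML page by page.
    pages, err := pdf.Pages()
    if err != nil {
        panic(err)
    }

    for _, page := range pages {
        pageHTML, err := page.Html()
        if err != nil {
            panic(err)
        }
        fmt.Println(pageHTML)
    }
}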

More examples are available in the examples folder.

Next Steps

Add image extraction
Extract image from specific page

Documentation ¶

Index ¶

Constants
Variables
type Config
type DefaultLogger
- func NewDefaultLogger(logLevel string) *DefaultLogger
- func (l *DefaultLogger) Debugf(format string, args ...interface{})
- func (l *DefaultLogger) Errorf(format string, args ...interface{})
type Logger
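type PDFReader
- func New(path string, config Config) (*PDFReader, error)
- func NewPDFReader(path string, logger Logger) (*PDFReader, error)
- func (r *PDFReader) Author() string
- func (r *PDFReader) CreationDate() string
- func (r *PDFReader) Encrypted() bool
- func (r *PDFReader) Html() (string, error)
- func (r *PDFReader) Pages() ([]*Page, error)
- func (r *PDFReader) PagesNumber() int
- func (r *PDFReader) Text() (string, error)
- func (r *PDFReader) Title() string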
type Page
- func NewPage(number int, textExtractor, htmlExtractor core.Extractor, logger Logger) *Page
- func (p *Page) Html() (string, error)
- func (p *Page) Text() (string, error)

Constants ¶

const (
	LogLevelError = "error"
	LogLevelDegub = "debug"
)

LogLevelError is the error log level. LogLevelDegub is the debug log level; note that the identifier is spelled "Degub" in the package.

Variables ¶

var (
	ErrInvalidPath        = errors.New("Invalid pdf path")
	ErrInvalidPagesNumber = errors.New("Invalid pages number. It must be greater than 0")
)
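
Exported sentinel errors like these are usually matched with errors.Is, which also works if the error has been wrapped. The sketch below shows that pattern with local stand-ins so it compiles on its own. The lowercase variables mirror the documented ones, and open and describe are hypothetical helpers; which conditions make the real New return each error is not stated here, so the empty-path check is purely illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins mirroring the documented sentinel errors. Real code would use
// pdf2go.ErrInvalidPath and pdf2go.ErrInvalidPagesNumber instead.
var (
	errInvalidPath        = errors.New("Invalid pdf path")
	errInvalidPagesNumber = errors.New("Invalid pages number. It must be greater than 0")
)

// open is a hypothetical stand-in for pdf2go.New. It fails on an empty
// path only to give the example something to match against.
func open(path string) error {
	if path == "" {
		return fmt.Errorf("open %q: %w", path, errInvalidPath)
	}
	return nil
}

// describe maps an error to a user-facing message using errors.Is,
// so wrapped sentinel errors are matched too.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errInvalidPath):
		return "check the file path"
	case errors.Is(err, errInvalidPagesNumber):
		return "the PDF reports no pages"
	default:
		return "unexpected error: " + err.Error()
	}
}

func main() {
	fmt.Println(describe(open("")))
	fmt.Println(describe(open("path/to/file.pdf")))
}
```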

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	// Logger is the logger used by the PDFReader.
	// If nil, the DefaultLogger is used.
	Logger Logger

	// LogLevel is the log level used by the logger.
	// It can be "error" or "debug".
	LogLevel string
}

Config is the configuration used by the PDFReader. It is used to configure the logger and the log level.

type DefaultLogger ¶

type DefaultLogger struct {
	// contains filtered or unexported fields
}

DefaultLogger is the default logger.

func NewDefaultLogger ¶

func NewDefaultLogger(logLevel string) *DefaultLogger

NewDefaultLogger creates a new DefaultLogger.

func (*DefaultLogger) Debugf ¶

func (l *DefaultLogger) Debugf(format string, args ...interface{})

Debugf logs a debug message. It logs only if the level is "debug".

func (*DefaultLogger) Errorf ¶

func (l *DefaultLogger) Errorf(format string, args ...interface{})

Errorf logs an error message.

type Logger ¶

type Logger interface {
	// Debugf logs a debug message.
	Debugf(format string, args ...interface{})

	// Errorf logs an error message.
	Errorf(format string, args ...interface{})
}

Logger is the interface that wraps the Debugf and Errorf methods.
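
Any type with these two methods can serve as a logger. As a minimal sketch, the example below defines an in-memory logger that drops debug messages unless enabled. The Logger interface is redeclared locally (copied from the definition above) so the code compiles without the module, and bufferLogger is a hypothetical type, not part of pdf2go.

```go
package main

import (
	"fmt"
	"strings"
)

// Logger mirrors the documented pdf2go.Logger interface so this sketch
// compiles on its own.
type Logger interface {
	Debugf(format string, args ...interface{})
	Errorf(format string, args ...interface{})
}

// bufferLogger is a hypothetical Logger that collects messages in memory,
// which can be handy in tests.
type bufferLogger struct {
	debug bool     // when false, Debugf messages are dropped
	lines []string // collected messages, in order
}

// Debugf records a debug message only when debug logging is enabled.
func (l *bufferLogger) Debugf(format string, args ...interface{}) {
	if !l.debug {
		return
	}
	l.lines = append(l.lines, "DEBUG "+fmt.Sprintf(format, args...))
}

// Errorf always records the message.
func (l *bufferLogger) Errorf(format string, args ...interface{}) {
	l.lines = append(l.lines, "ERROR "+fmt.Sprintf(format, args...))
}

// Compile-time check that bufferLogger satisfies Logger.
var _ Logger = (*bufferLogger)(nil)

func main() {
	logger := &bufferLogger{debug: false}
	logger.Debugf("reading page %d", 1) // dropped: debug is disabled
	logger.Errorf("cannot read %s", "file.pdf")
	fmt.Println(strings.Join(logger.lines, "\n"))
}
```

With the real module, a value like this would be supplied through the Logger field of Config or passed to NewPDFReader.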

type PDFReader ¶

type PDFReader struct {
	// contains filtered or unexported fields
}

PDFReader represents a PDF file.

func New ¶

func New(path string, config Config) (*PDFReader, error)

New creates a new PDFReader for the PDF file at path, using the logger and log level from config. It returns an error if the PDFReader cannot be created.

func NewPDFReader ¶

func NewPDFReader(path string, logger Logger) (*PDFReader, error)

NewPDFReader creates a new PDFReader for the PDF file at path, using the given logger. It returns an error if the PDFReader cannot be created.

func (*PDFReader) Author ¶

func (r *PDFReader) Author() string

Author returns the author of the PDF file or an empty string if the author is not defined.

func (*PDFReader) CreationDate ¶

func (r *PDFReader) CreationDate() string

CreationDate returns the creation date of the PDF file or an empty string if the creation date is not defined.

func (*PDFReader) Encrypted ¶

func (r *PDFReader) Encrypted() bool

Encrypted returns true if the PDF file is encrypted.

func (*PDFReader) Html ¶

func (r *PDFReader) Html() (string, error)

Html returns the HTML of all pages from the PDF file. It returns an error if the HTML cannot be extracted.

func (*PDFReader) Pages ¶

func (r *PDFReader) Pages() ([]*Page, error)

Pages returns a slice of pointers to Page structs. Each Page struct contains the page number and the text and HTML extractors.
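
To illustrate how per-page extraction can work on top of Poppler, the sketch below builds the arguments for a single-page pdftotext call. The -f (first page) and -l (last page) flags and the "-" output target for stdout are real pdftotext options; whether pdf2go invokes the tool this way is an assumption, and pageTextArgs is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageTextArgs builds pdftotext arguments that restrict extraction to a
// single page and write the result to stdout ("-").
// It is a hypothetical helper, not part of pdf2go.
func pageTextArgs(path string, page int) ([]string, error) {
	if page < 1 {
		return nil, fmt.Errorf("page must be >= 1, got %d", page)
	}
	n := strconv.Itoa(page)
	// -f and -l set the first and last page to convert.
	return []string{"-f", n, "-l", n, path, "-"}, nil
}

func main() {
	args, err := pageTextArgs("path/to/file.pdf", 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("pdftotext", args)
	// With Poppler installed, the text could then be read with:
	//   out, err := exec.Command("pdftotext", args...).Output()
}
```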

func (*PDFReader) PagesNumber ¶

func (r *PDFReader) PagesNumber() int

PagesNumber returns the number of pages in the PDF file associated with the PDFReader.

func (*PDFReader) Text ¶

func (r *PDFReader) Text() (string, error)

Text returns the text of all pages from the PDF file. It returns an error if the text cannot be extracted.

func (*PDFReader) Title ¶

func (r *PDFReader) Title() string

Title returns the title of the PDF file or an empty string if the title is not defined.
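
The metadata accessors above (Author, CreationDate, Encrypted, PagesNumber, Title) correspond to fields that Poppler's pdfinfo tool prints as "Key: value" lines. As a rough sketch of how such output can be parsed, and assuming nothing about how pdf2go does it internally, the hypothetical parseInfo helper below turns that format into a map. The sample input is made up for illustration and does not come from a real file.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseInfo splits pdfinfo-style "Key:   value" lines into a map.
// Lines without a colon are ignored. It is a hypothetical helper,
// not part of pdf2go.
func parseInfo(output string) map[string]string {
	info := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		// Cut at the first colon so values containing colons
		// (such as timestamps) stay intact.
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue
		}
		info[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return info
}

func main() {
	// Illustrative sample only, not output from a real file.
	sample := "Title:          Example\n" +
		"Author:         Jane Doe\n" +
		"CreationDate:   Sun May 28 10:00:00 2023\n" +
		"Pages:          2\n" +
		"Encrypted:      no\n"

	info := parseInfo(sample)
	fmt.Println("title:", info["Title"])
	fmt.Println("author:", info["Author"])
	fmt.Println("created:", info["CreationDate"])
	fmt.Println("pages:", info["Pages"])
	fmt.Println("encrypted:", info["Encrypted"] == "yes")
}
```

A missing key yields an empty string from the map, which matches the documented behavior of returning an empty string when a field is not defined.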

type Page ¶

type Page struct {
	Number        int
	TextExtractor core.Extractor
	HtmlExtractor core.Extractor
	// contains filtered or unexported fields
}

Page represents a page from the PDF file.

func NewPage ¶

func NewPage(number int, textExtractor, htmlExtractor core.Extractor, logger Logger) *Page

NewPage creates a new Page.

func (*Page) Html ¶

func (p *Page) Html() (string, error)

Html returns the HTML of the page. It returns an error if the HTML cannot be extracted.

func (*Page) Text ¶

func (p *Page) Text() (string, error)

Text returns the text of the page. It returns an error if the text cannot be extracted.

Directories ¶

Path	Synopsis
examples/html
examples/text
internal/core
internal/core/pdf_info
internal/core/pdf_to_html
internal/core/pdf_to_text
