extractor

package

Version: v3.0.0-...-55e877b | Published: Jul 29, 2023 | License: GPL-2.0 | Imports: 23 | Imported by: 0

Repository

gitee.com/zhaobingss/unipdf.git

Documentation ¶

Overview ¶

Package extractor is used for quickly extracting PDF content through a simple interface. Currently offers functionality for extracting textual content.

Index ¶

type Extractor
type Font
type ImageExtractOptions
type ImageMark
type Options
type PageFonts
type PageImages
type PageText
type PageTextOptions
type RenderMode
type TableCell
type TextMark
- func (_cced TextMark) String() string
- func (_ffd *TextMark) TableInfo() (*TextTable, [][]int)
type TextMarkArray
type TextTable

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Extractor ¶

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor stores and offers functionality for extracting content from PDF pages.

func New ¶

func New(page *_ec.PdfPage) (*Extractor, error)

New returns an Extractor instance for extracting content from the input PDF page.

func NewFromContents ¶

func NewFromContents(contents string, resources *_ec.PdfPageResources) (*Extractor, error)

NewFromContents creates a new extractor from contents and page resources.

func NewWithOptions ¶

func NewWithOptions(page *_ec.PdfPage, options *Options) (*Extractor, error)

NewWithOptions returns an Extractor instance for extracting content from the input PDF page, configured by the given options.

func (*Extractor) ExtractFonts ¶

func (_fcg *Extractor) ExtractFonts(previousPageFonts *PageFonts) (*PageFonts, error)

ExtractFonts returns all font information from the page extractor, including font name, font type, the raw data of the embedded font file (if embedded), font descriptor and more.

The argument `previousPageFonts` is used when trying to build a complete font catalog for multiple pages or the entire document. The entries from `previousPageFonts` are added to the returned result unless already included in the page, i.e. no duplicate entries.

NOTE: If previousPageFonts is nil, all fonts from the page will be returned. Pass the result of an earlier call as previousPageFonts when building up a full list of fonts for a document or page range.
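The merge rule described above can be sketched in plain Go. This is a minimal illustration, not the library's implementation: the `fontEntry` type, the `mergeFonts` helper, and the choice of the font name as the deduplication key are all assumptions made for the example.

```go
package main

import "fmt"

// fontEntry is a hypothetical stand-in for an extracted font.
// Only the name matters for this sketch.
type fontEntry struct {
	Name string
}

// mergeFonts returns the fonts of the current page followed by every entry
// of previous whose name does not already occur in the result.
// Assumption: two fonts are duplicates when their names are equal.
func mergeFonts(page, previous []fontEntry) []fontEntry {
	seen := make(map[string]bool, len(page))
	merged := make([]fontEntry, 0, len(page)+len(previous))
	for _, f := range page {
		if !seen[f.Name] {
			seen[f.Name] = true
			merged = append(merged, f)
		}
	}
	for _, f := range previous {
		if !seen[f.Name] {
			seen[f.Name] = true
			merged = append(merged, f)
		}
	}
	return merged
}

func main() {
	page1 := []fontEntry{{"Times-Roman"}, {"Helvetica"}}
	page2 := []fontEntry{{"Helvetica"}, {"Courier"}}
	// Fonts of page 2 come first, then the entries carried over from page 1.
	for _, f := range mergeFonts(page2, page1) {
		fmt.Println(f.Name)
	}
}
```

Passing nil as the second argument leaves only the fonts of the current page, which matches the nil case in the note above.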

func (*Extractor) ExtractPageImages ¶

func (_fca *Extractor) ExtractPageImages(options *ImageExtractOptions) (*PageImages, error)

ExtractPageImages returns the image contents of the page extractor, including the data, position, and size of each image. A set of options to control page image extraction can be passed in. The options parameter can be nil for the default options. By default, inline stencil masks are not extracted.

func (*Extractor) ExtractPageText ¶

func (_cdd *Extractor) ExtractPageText() (*PageText, int, int, error)

ExtractPageText returns the text contents of `e` (an Extractor for a page) as a PageText.

TODO(peterwilliams97): The stats complicate this function signature and aren't very useful. Replace with a function like Extract() (*PageText, error).

func (*Extractor) ExtractText ¶

func (_acb *Extractor) ExtractText() (string, error)

ExtractText processes and extracts all text data in content streams and returns it as a string. It takes into account character encodings in the PDF file, which are decoded by CharcodeBytesToUnicode. Characters that can't be decoded are replaced with MissingCodeRune ('\ufffd' = �).

func (*Extractor) ExtractTextWithStats ¶

func (_cde *Extractor) ExtractTextWithStats() (_bge string, _efa int, _afe int, _gca error)

ExtractTextWithStats works like ExtractText but also returns the number of characters in the output (`numChars`, the first int result) and the number of characters that were not decoded (`numMisses`, the second int result).
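To make the two statistics concrete, here is a small stand-alone sketch of decoding with a replacement rune. It does not use the package: `decodeWithStats` and the `toUnicode` table are hypothetical, and the real decoder works on PDF font encodings rather than a plain map.

```go
package main

import (
	"fmt"
	"strings"
)

// missingCodeRune mirrors the replacement character named in the text above.
const missingCodeRune = '\ufffd'

// decodeWithStats maps each character code through toUnicode (a hypothetical
// code-to-rune table) and substitutes missingCodeRune for unknown codes.
// It returns the decoded text, the number of characters written, and the
// number of codes that could not be decoded.
func decodeWithStats(codes []byte, toUnicode map[byte]rune) (text string, numChars, numMisses int) {
	var sb strings.Builder
	for _, c := range codes {
		r, ok := toUnicode[c]
		if !ok {
			r = missingCodeRune
			numMisses++
		}
		sb.WriteRune(r)
		numChars++
	}
	return sb.String(), numChars, numMisses
}

func main() {
	table := map[byte]rune{1: 'P', 2: 'D', 3: 'F'}
	// Code 9 has no mapping, so it is counted as a miss.
	text, numChars, numMisses := decodeWithStats([]byte{1, 2, 9, 3}, table)
	fmt.Printf("%q chars=%d misses=%d\n", text, numChars, numMisses)
}
```

A high ratio of misses to characters is a practical hint that a page uses a font without a usable encoding.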

type Font ¶

type Font struct {
	PdfFont *_ec.PdfFont

	// FontName represents Font Name from font properties.
	FontName string

	// FontType represents Font Subtype entry in the font dictionary inside page resources.
	// Examples: Type0, Type1, MMType1, Type3, TrueType, CIDFont.
	FontType string

	// ToUnicode is true if font provides a `ToUnicode` mapping.
	ToUnicode bool

	// IsCID is true if underlying font is a composite font.
	// Composite font is represented by a font dictionary whose Subtype is `Type0`
	IsCID bool

	// IsSimple is true if font is simple font.
	// A simple font is limited to 8-bit character codes (0-255).
	IsSimple bool

	// FontData represents the raw data of the embedded font file.
	// It can be in TrueType (TTF), PostScript Font (PFB) or Compact Font Format (CFF) format.
	// The FontData value is taken from the `FontFile`, `FontFile2` or `FontFile3` entry of the Font Descriptor.
	// At most one of `FontFile`, `FontFile2` or `FontFile3` supplies the FontData value.
	FontData []byte

	// FontFileName is a name representing the font. It has the format:
	// (Font Name) + (Font Type Extension), for example: helvetica.ttf.
	FontFileName string

	// FontDescriptor represents metrics and other attributes inside font properties from PDF Structure (Font Descriptor).
	FontDescriptor *_ec.PdfFontDescriptor
}

Font represents the font properties on a PDF page.

type ImageExtractOptions ¶

type ImageExtractOptions struct{ IncludeInlineStencilMasks bool }

ImageExtractOptions contains options for controlling image extraction from PDF pages.

type ImageMark ¶

type ImageMark struct {
	Image *_ec.Image

	// Dimensions of the image as displayed in the PDF.
	Width  float64
	Height float64

	// Position of the image in PDF coordinates (lower left corner).
	X float64
	Y float64

	// Angle in degrees, if rotated.
	Angle float64
}

ImageMark represents an image drawn on a page together with its position. All coordinates are in device coordinates.

type Options ¶

type Options struct {

	// DisableDocumentTags, if set to `true`, disables the use of document tags during list extraction.
	DisableDocumentTags bool

	// ApplyCropBox will extract page text based on page cropbox if set to `true`.
	ApplyCropBox bool

	// UseSimplerExtractionProcess will skip topological text ordering and table processing.
	//
	// NOTE: The extra processing is normally beneficial, but it can cause problems when it does not work.
	// This flag lets the user control the process.
	//
	// Skipping these extraction steps also reduces processing time.
	UseSimplerExtractionProcess bool
}

Options holds the extractor options.

type PageFonts ¶

type PageFonts struct{ Fonts []Font }

PageFonts represents extracted fonts on a PDF page.

type PageImages ¶

type PageImages struct{ Images []ImageMark }

PageImages represents extracted images on a PDF page with spatial information: display position and size.

type PageText ¶

type PageText struct {
	// contains filtered or unexported fields
}

PageText represents the layout of text on a device page.

func (*PageText) ApplyArea ¶

func (_faa *PageText) ApplyArea(bbox _ec.PdfRectangle)

ApplyArea processes the page text only within the specified area `bbox`. Each time ApplyArea is called, it updates the result set in `pt`. Can be called multiple times in a row with different bounding boxes.
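One way to picture area-restricted processing is a filter over positioned marks. The sketch below is an illustration only: the `rect` and `mark` types, the `applyArea` helper, and the rule "keep a mark when its box intersects the area" are assumptions, not the library's documented behaviour.

```go
package main

import "fmt"

// rect is a hypothetical axis-aligned rectangle (lower-left and upper-right corners).
type rect struct{ Llx, Lly, Urx, Ury float64 }

// mark is a hypothetical piece of text with its bounding box.
type mark struct {
	Text string
	BBox rect
}

// intersects reports whether a and b share any area.
func intersects(a, b rect) bool {
	return a.Llx < b.Urx && b.Llx < a.Urx && a.Lly < b.Ury && b.Lly < a.Ury
}

// applyArea returns the marks of all whose bounding boxes intersect area.
// It always filters the full set, so repeated calls with different areas
// do not narrow each other.
func applyArea(all []mark, area rect) []mark {
	var kept []mark
	for _, m := range all {
		if intersects(m.BBox, area) {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	all := []mark{
		{"header", rect{0, 700, 200, 720}},
		{"footer", rect{0, 20, 200, 40}},
	}
	top := applyArea(all, rect{0, 600, 612, 792})
	bottom := applyArea(all, rect{0, 0, 612, 100})
	fmt.Println(top[0].Text, bottom[0].Text)
}
```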

func (*PageText) GetContentStreamOps ¶

func (_fcag *PageText) GetContentStreamOps() *_fcb.ContentStreamOperations

GetContentStreamOps returns the contentStreamOps field of `pt`.

func (PageText) List ¶

func (_cfce PageText) List() lists

List returns all the list objects detected on the page. It detects all the bullet point lists on a given PDF page and builds a slice of bullet list objects. Each bullet list object has a tree structure: a bullet point is extracted with the text it contains, and all the sub-lists found under it become its children in the tree. The rest of the PDF content is ignored; only text in the bullet point lists is extracted.

The list extraction is done in one of two ways:

1. If the document is tagged, the lists are extracted using the tags provided in the document.
2. Otherwise, the bullet lists are extracted from the raw text using regex matching.

By default the document tags are used if available. This can be disabled using `DisableDocumentTags` in the `Options` object. Disabling document tags can give better bullet list extraction if the document was tagged incorrectly.

    options := &Options{
        DisableDocumentTags: false, // this means use document tags if available
    }
    ex, err := NewWithOptions(page, options)
    // handle error
    pageText, _, _, err := ex.ExtractPageText()
    // handle error
    lists := pageText.List()
    txt := lists.Text()
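The regex fallback mentioned above can be illustrated with a small stand-alone sketch. The pattern, the `item` type, and the `findBullets` helper are assumptions made for this example; the patterns the library actually uses are not shown in this documentation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bulletRe is an assumed pattern: optional indentation, then a bullet glyph
// or a number followed by "." or ")", then whitespace and the item text.
var bulletRe = regexp.MustCompile(`^(\s*)(?:[•\-*]|\d+[.)])\s+(.+)$`)

// item is a hypothetical flat list entry. Indent is the width of the leading
// whitespace; a deeper indent suggests a sub-list entry.
type item struct {
	Indent int
	Text   string
}

// findBullets scans text line by line and keeps only lines that look like
// list items. All other lines are ignored.
func findBullets(text string) []item {
	var items []item
	for _, line := range strings.Split(text, "\n") {
		if m := bulletRe.FindStringSubmatch(line); m != nil {
			items = append(items, item{Indent: len(m[1]), Text: m[2]})
		}
	}
	return items
}

func main() {
	text := "Intro\n• first\n  - nested\n2. second\nplain text"
	for _, it := range findBullets(text) {
		fmt.Printf("indent=%d text=%q\n", it.Indent, it.Text)
	}
}
```

Building the tree described above would then amount to attaching each item to the nearest preceding item with a smaller indent.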

func (PageText) Marks ¶

func (_ggg PageText) Marks() *TextMarkArray

Marks returns the TextMark collection for a page. It represents all the text on the page.

func (PageText) String ¶

func (_aea PageText) String() string

String returns a string describing `pt`.

func (PageText) Tables ¶

func (_dec PageText) Tables() []TextTable

Tables returns the tables extracted from the page.

func (PageText) Text ¶

func (_bcbe PageText) Text() string

Text returns the extracted page text.

func (PageText) ToText ¶

func (_fffee PageText) ToText() string

ToText returns the page text as a single string.

Deprecated: This function is deprecated and will be removed in a future major version. Please use Text() instead.

type PageTextOptions ¶

type PageTextOptions struct {
	// contains filtered or unexported fields
}

PageTextOptions holds the options available in the extraction process.

type RenderMode ¶

type RenderMode int

RenderMode specifies the text rendering mode (Tmode), which determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three. Stroking, filling, and clipping shall have the same effects for a text object as they do for a path object (see 8.5.3, "Path-Painting Operators" and 8.5.4, "Clipping Path Operators").

const (
	RenderModeStroke RenderMode = 1 << iota
	RenderModeFill
	RenderModeClip
)
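Because the constants are declared with `1 << iota`, each mode is a separate bit and modes can be combined with bitwise OR. The sketch below re-declares the constants locally (lower-case names) so that it compiles on its own; the `has` method is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// renderMode is a local copy of the flag layout shown above.
type renderMode int

const (
	renderModeStroke renderMode = 1 << iota // 1
	renderModeFill                          // 2
	renderModeClip                          // 4
)

// has reports whether every bit of flag is set in m.
func (m renderMode) has(flag renderMode) bool {
	return m&flag == flag
}

func main() {
	// Text that is both filled and stroked.
	mode := renderModeFill | renderModeStroke
	fmt.Println(mode.has(renderModeFill), mode.has(renderModeClip)) // true false
}
```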

type TableCell ¶

type TableCell struct {
	_ec.PdfRectangle

	// Text is the extracted text.
	Text string

	// Marks holds the TextMarks corresponding to the text in Text.
	Marks TextMarkArray
}

TableCell is a cell in a TextTable.

type TextMark ¶

type TextMark struct {

	// Text is the extracted text.
	Text string

	// Original is the text in the PDF. It has not been decoded like `Text`.
	Original string

	// BBox is the bounding box of the text.
	BBox _ec.PdfRectangle

	// Font is the font the text was drawn with.
	Font *_ec.PdfFont

	// FontSize is the font size the text was drawn with.
	FontSize float64

	// Offset is the offset of the start of TextMark.Text in the extracted text. If you do this
	//
	//	text, textMarks := pageText.Text(), pageText.Marks()
	//	marks := textMarks.Elements()
	//
	// then marks[i].Offset is the offset of marks[i].Text in text.
	Offset int

	// Meta is set true for spaces and line breaks that we insert in the extracted text. We insert
	// spaces (line breaks) when we see characters that are over a threshold horizontal (vertical)
	// distance apart. See wordJoiner (lineJoiner) in PageText.computeViews().
	Meta bool

	// FillColor is the fill color of the text.
	// The color is nil for spaces and line breaks (i.e. the Meta field is true).
	FillColor _gcg.Color

	// StrokeColor is the stroke color of the text.
	// The color is nil for spaces and line breaks (i.e. the Meta field is true).
	StrokeColor _gcg.Color

	// Orientation is the text orientation
	Orientation int

	// DirectObject is the underlying PdfObject (text object) that represents the visible text. It gives
	// simple access to the text object in case some text needs to be edited or replaced, e.g. during redaction.
	DirectObject _c.PdfObject

	// ObjString is the decoded string operand of a text-showing operator. It has the same value as the `Text` field,
	// except when one text object draws many glyphs from a multi-character string operand. In that case
	// ObjString spans more than one character and is shared by several TextMark objects.
	ObjString []string
	Tw        float64
	Th        float64
	Tc        float64
	Index     int
	// contains filtered or unexported fields
}

TextMark represents extracted text on a page with information regarding its textual content, formatting (font and size) and positioning. It is the smallest unit of text on a PDF page, typically a single character.
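The relationship between the Offset and Meta fields can be shown with a simplified model. In this sketch the `glyph` and `outMark` types, the `joinWords` helper, and the fixed gap threshold are assumptions; the package's own joining logic is internal and more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// glyph is a hypothetical simplified input: one character and its horizontal extent.
type glyph struct {
	Text     string
	Llx, Urx float64
}

// outMark mirrors three ideas from TextMark: the text, its offset in the
// extracted string, and whether it was inserted rather than drawn.
type outMark struct {
	Text   string
	Offset int
	Meta   bool
}

// joinWords walks the glyphs left to right and inserts a Meta space whenever
// the horizontal gap to the previous glyph exceeds threshold.
func joinWords(glyphs []glyph, threshold float64) []outMark {
	var marks []outMark
	offset := 0
	for i, g := range glyphs {
		if i > 0 && g.Llx-glyphs[i-1].Urx > threshold {
			marks = append(marks, outMark{Text: " ", Offset: offset, Meta: true})
			offset++
		}
		marks = append(marks, outMark{Text: g.Text, Offset: offset})
		offset += len(g.Text)
	}
	return marks
}

func main() {
	glyphs := []glyph{{"H", 0, 5}, {"i", 5, 7}, {"y", 12, 17}, {"o", 17, 22}}
	var sb strings.Builder
	for _, m := range joinWords(glyphs, 2) {
		sb.WriteString(m.Text)
		fmt.Printf("%q offset=%d meta=%v\n", m.Text, m.Offset, m.Meta)
	}
	fmt.Println(sb.String())
}
```

Each mark's Offset is the byte position of its Text in the concatenated string, which is the property the Offset field comment describes.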

getBBox() in test_text.go shows how to compute bounding boxes of substrings of extracted text. The following code extracts the text on PDF page `page` into `text` then finds the bounding box `bbox` of substring `term` in `text`.

    ex, err := New(page)
    // handle errors
    pageText, _, _, err := ex.ExtractPageText()
    // handle errors
    text := pageText.Text()
    textMarks := pageText.Marks()

    start := strings.Index(text, term) // -1 if term does not occur in text
    end := start + len(term)
    spanMarks, err := textMarks.RangeOffset(start, end)
    // handle errors
    bbox, ok := spanMarks.BBox()
    // check ok before using bbox

func (TextMark) String ¶

func (_cced TextMark) String() string

String returns a string describing `tm`.

func (*TextMark) TableInfo ¶

func (_ffd *TextMark) TableInfo() (*TextTable, [][]int)

TableInfo gets table information of the textmark `tm`.

type TextMarkArray ¶

type TextMarkArray struct {
	// contains filtered or unexported fields
}

TextMarkArray is a collection of TextMarks.

func (*TextMarkArray) Append ¶

func (_cegg *TextMarkArray) Append(mark TextMark)

Append appends `mark` to the mark array.

func (*TextMarkArray) BBox ¶

func (_fdeb *TextMarkArray) BBox() (_ec.PdfRectangle, bool)

BBox returns the smallest axis-aligned rectangle that encloses all the TextMarks in `ma`.
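The smallest enclosing axis-aligned rectangle is the union of the individual boxes: the minimum of the lower-left corners and the maximum of the upper-right corners. The sketch below shows that computation on a local `rect` type; the type, the `union` helper, and the meaning given to the boolean result (false for an empty input) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// rect is a hypothetical axis-aligned rectangle (lower-left and upper-right corners).
type rect struct{ Llx, Lly, Urx, Ury float64 }

// union returns the smallest axis-aligned rectangle enclosing all rects.
// The second result is false when there is nothing to enclose.
func union(rects []rect) (rect, bool) {
	if len(rects) == 0 {
		return rect{}, false
	}
	box := rects[0]
	for _, r := range rects[1:] {
		box.Llx = math.Min(box.Llx, r.Llx)
		box.Lly = math.Min(box.Lly, r.Lly)
		box.Urx = math.Max(box.Urx, r.Urx)
		box.Ury = math.Max(box.Ury, r.Ury)
	}
	return box, true
}

func main() {
	box, ok := union([]rect{{10, 10, 20, 22}, {18, 8, 30, 20}})
	fmt.Println(box, ok)
}
```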

func (*TextMarkArray) Elements ¶

func (_gafa *TextMarkArray) Elements() []TextMark

Elements returns the TextMarks in `ma`.

func (*TextMarkArray) Len ¶

func (_cgdb *TextMarkArray) Len() int

Len returns the number of TextMarks in `ma`.

func (*TextMarkArray) RangeOffset ¶

func (_cgf *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)

RangeOffset returns the TextMarks in `ma` that overlap text[start:end] in the extracted text. These are the marks tm for which

    start <= tm.Offset + len(tm.Text) && tm.Offset < end

where `start` and `end` are offsets in the extracted text.

NOTE: A TextMark can contain multiple characters, e.g. "ffi" for the ﬃ ligature, so the first and last elements of the returned TextMarkArray may only partially overlap text[start:end].
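The overlap selection can be sketched on a simplified mark type. This example treats both the query and each mark as half-open byte ranges, which is an assumption: it is slightly stricter than the inequality as printed, because a mark that ends exactly at `start` is not selected. The `mark` type and `rangeOffset` helper are hypothetical.

```go
package main

import "fmt"

// mark is a hypothetical simplified TextMark: its text and where that text
// starts in the extracted string.
type mark struct {
	Text   string
	Offset int
}

// rangeOffset returns the marks whose byte range [Offset, Offset+len(Text))
// overlaps the half-open query range [start, end).
func rangeOffset(marks []mark, start, end int) []mark {
	var out []mark
	for _, m := range marks {
		if m.Offset < end && m.Offset+len(m.Text) > start {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// "office" drawn as o + ffi (one ligature mark) + c + e.
	marks := []mark{{"o", 0}, {"ffi", 1}, {"c", 4}, {"e", 5}}
	// Query text[2:5] = "fic": the ligature mark only partially overlaps it.
	for _, m := range rangeOffset(marks, 2, 5) {
		fmt.Printf("%q at %d\n", m.Text, m.Offset)
	}
}
```

The ligature mark is returned whole even though only part of it lies inside the query, which is the partial-overlap case the note describes.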

func (TextMarkArray) String ¶

func (_bgdcf TextMarkArray) String() string

String returns a string describing `ma`.

type TextTable ¶

type TextTable struct {
	_ec.PdfRectangle
	W, H  int
	Cells [][]TableCell
}

TextTable represents a table. Cells are ordered top-to-bottom, left-to-right. Cells[y] is the (0-offset) y'th row in the table. Cells[y][x] is the cell in the (0-offset) x'th column of that row.
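The row-major indexing can be seen in a short stand-alone sketch. The `cell` and `table` types below are simplified local stand-ins that keep only the Cells layout; the `rows` helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// cell is a simplified stand-in for a table cell.
type cell struct{ Text string }

// table keeps only the Cells layout: Cells[y][x] is row y, column x.
type table struct {
	Cells [][]cell
}

// rows renders each table row as a tab-separated line, top to bottom.
func rows(t table) []string {
	lines := make([]string, 0, len(t.Cells))
	for _, row := range t.Cells {
		texts := make([]string, 0, len(row))
		for _, c := range row {
			texts = append(texts, c.Text)
		}
		lines = append(lines, strings.Join(texts, "\t"))
	}
	return lines
}

func main() {
	t := table{Cells: [][]cell{
		{{"Name"}, {"Qty"}},
		{{"Bolt"}, {"12"}},
	}}
	fmt.Println(t.Cells[1][0].Text) // second row, first column
	for _, line := range rows(t) {
		fmt.Println(line)
	}
}
```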

Source Files ¶

extractor.go
