epub

package
v0.12.0
Published: May 4, 2026 License: MIT Imports: 13 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CleanForNLP

func CleanForNLP(s string) string

The output of pandoc can contain invisible characters, for example:

—Si hombres salvajes venir —<2060>dijo<2060>— ellos comerme a mí, vos salvaros.

The <2060> markers are U+2060 WORD JOINER, an invisible zero-width character used in typesetting to prevent line breaks around em-dashes. It reaches the text as follows:

Unicode defines U+2060.

HTML/XHTML can express it as &#x2060;.

EPUB (XHTML inside a zip) inherits it.

Pandoc converts to plain text and keeps the raw character.

A terminal or vim shows it as <2060> because it has no glyph.

Several Unicode categories of invisible or typesetting characters are all noise for NLP. Unicode defines a general category Cf (Format characters) that covers most of them.
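As an illustration of the Cf idea, here is a minimal sketch that drops every rune in the Unicode Cf category using only the standard library. The helper name stripFormatChars is hypothetical, and this is not necessarily how CleanForNLP is implemented; it may handle more or fewer characters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripFormatChars is a hypothetical helper (not part of this package).
// It removes every rune in Unicode general category Cf, which includes
// U+2060 WORD JOINER.
func stripFormatChars(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Cf, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// \u2014 is an em-dash, \u2060 is the invisible word joiner.
	in := "\u2014\u2060dijo\u2060\u2014"
	out := stripFormatChars(in)
	fmt.Println(len([]rune(in)), len([]rune(out))) // prints "8 6"
}
```

Note that Cf also contains characters that carry meaning in some scripts (for example U+200D ZERO WIDTH JOINER), so a blanket removal is a trade-off that suits some NLP pipelines better than others.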

func FindOPFPath added in v0.9.0

func FindOPFPath(r *zip.Reader) (path string, err error)

FindOPFPath locates the OPF file in the EPUB archive by reading META-INF/container.xml.
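The lookup described above can be sketched with the standard library alone. The names findOPF, container and buildZip are hypothetical and only illustrate the technique; the package's own FindOPFPath may differ in validation and error handling.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
)

// container mirrors the shape of META-INF/container.xml (illustrative copy).
type container struct {
	Rootfiles []struct {
		FullPath string `xml:"full-path,attr"`
	} `xml:"rootfiles>rootfile"`
}

// findOPF is a hypothetical stand-in: it reads container.xml and returns
// the full-path of the first rootfile entry.
func findOPF(r *zip.Reader) (string, error) {
	f, err := r.Open("META-INF/container.xml")
	if err != nil {
		return "", err
	}
	defer f.Close()

	var c container
	if err := xml.NewDecoder(f).Decode(&c); err != nil {
		return "", err
	}
	if len(c.Rootfiles) == 0 || c.Rootfiles[0].FullPath == "" {
		return "", errors.New("container.xml lists no rootfile")
	}
	return c.Rootfiles[0].FullPath, nil
}

// buildZip creates an in-memory archive so the sketch runs without a file.
func buildZip(files map[string]string) (*zip.Reader, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for name, body := range files {
		fw, err := w.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := fw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
}

func main() {
	const containerXML = `<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`
	r, err := buildZip(map[string]string{"META-INF/container.xml": containerXML})
	if err != nil {
		panic(err)
	}
	p, err := findOPF(r)
	fmt.Println(p, err)
}
```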

Types

type Book added in v0.7.0

type Book struct {
	Package PackageXML
	// contains filtered or unexported fields
}

Book represents a parsed EPUB. It holds a reference to an open *zip.Reader but does not own its lifecycle. Callers MUST ensure the underlying zip source (e.g., an os.File or a memory buffer) remains open and valid for the entire lifetime of the Book instance.

CRITICAL: If the reader came from a *zip.ReadCloser (e.g., via zip.OpenReader), do NOT call Close() on it until you have finished calling all methods on Book (such as Text()).
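For the memory-buffer case, a *zip.Reader can be built directly over a byte slice. The sketch below uses only archive/zip and bytes; openFromMemory and makeArchive are hypothetical helpers. Such a reader has no Close method, so the only lifetime requirement is that the byte slice is not modified while the reader is in use.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// openFromMemory is a hypothetical helper that wraps an in-memory archive
// in a *zip.Reader, suitable for passing to a constructor that takes one.
func openFromMemory(data []byte) (*zip.Reader, error) {
	return zip.NewReader(bytes.NewReader(data), int64(len(data)))
}

// makeArchive builds a tiny one-entry archive so the sketch is runnable.
func makeArchive() ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	fw, err := w.Create("mimetype")
	if err != nil {
		return nil, err
	}
	if _, err := fw.Write([]byte("application/epub+zip")); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := makeArchive()
	if err != nil {
		panic(err)
	}
	r, err := openFromMemory(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(r.File), r.File[0].Name) // prints "1 mimetype"
}
```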

func New added in v0.7.0

func New(z *zip.Reader) (*Book, error)

New creates a new Book from an existing zip reader. It parses the metadata immediately. The provided *zip.Reader must remain open to allow subsequent calls to methods like Text() to lazily read files.

Example usage with a file:

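r, err := zip.OpenReader("file.epub")
if err != nil {
	return err
}
defer r.Close()                // Deferred, so it runs only after Text() has returned
book, _ := epub.New(&r.Reader) // Pass the embedded Reader (error ignored for brevity)
text, _ := book.Text()         // Safe: r is still open here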

func (*Book) Labels added in v0.7.0

func (b *Book) Labels() map[string]string

Labels extracts Dublin Core (DC) metadata as a map of prefix → raw value. Only non-empty fields are included. Values are returned without normalization; the caller or storage layer is responsible for sanitizing them before persistence.
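What sanitization means depends on the storage layer. As one possible sketch, the hypothetical helper sanitizeLabels below collapses runs of whitespace, trims the ends, and drops values that become empty. The keys "title" and "language" are made up for the example; the actual keys returned by Labels are not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabels is a hypothetical caller-side helper, not part of the
// package. It normalizes whitespace in each value and omits empty results.
func sanitizeLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		// strings.Fields splits on any Unicode whitespace, so joining
		// with a single space collapses newlines, tabs and repeated spaces.
		clean := strings.Join(strings.Fields(v), " ")
		if clean != "" {
			out[k] = clean
		}
	}
	return out
}

func main() {
	raw := map[string]string{
		"title":    "  Example\n  Book ", // assumed key and value
		"language": "es",
	}
	fmt.Println(sanitizeLabels(raw)["title"]) // prints "Example Book"
}
```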

func (*Book) Text added in v0.7.0

func (b *Book) Text() (string, error)

Text extracts the plain text content of the EPUB in reading order.

type ContainerXML

type ContainerXML struct {
	Rootfiles []struct {
		FullPath string `xml:"full-path,attr"`
	} `xml:"rootfiles>rootfile"`
}

ContainerXML represents the contents of META-INF/container.xml.

type Creator

type Creator struct {
	Value  string `xml:",chardata"`
	Role   string `xml:"role,attr"`
	FileAs string `xml:"file-as,attr"`
}

type Date

type Date struct {
	Value string `xml:",chardata"`
	Event string `xml:"event,attr"`
}

type Item added in v0.7.0

type Item struct {
	ID   string `xml:"id,attr"`
	Href string `xml:"href,attr"` // Path relative to OPF
}

type ItemRef added in v0.7.0

type ItemRef struct {
	IDRef string `xml:"idref,attr"` // References an ID in the Manifest
}

type Manifest added in v0.7.0

type Manifest struct {
	Items []Item `xml:"item"`
}

Manifest lists *all* resources (HTML, CSS, images, fonts) in the EPUB. Think of it as an inventory. Each item has a unique 'id' and an 'href' (path).

<manifest>

<item id="intro" href="intro.xhtml" media-type="application/xhtml+xml"/>
<item id="chap1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="css" href="style.css" media-type="text/css"/>

</manifest>

type Metadata

type Metadata struct {
	Titles       []string  `xml:"title"`
	Creators     []Creator `xml:"creator"`
	Contributors []Creator `xml:"contributor"`
	Dates        []Date    `xml:"date"`
	Language     []string  `xml:"language"`
	Description  []string  `xml:"description"`
}
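To show how the struct tags on Metadata, Creator and Date map onto OPF markup, here is a small sketch that decodes a hand-written metadata fragment with encoding/xml. The types are copied from the declarations above so the example is self-contained; the sample XML (title, author, date) is invented, and parseMetadata is a hypothetical helper. Because the tags carry no namespace, encoding/xml matches elements and attributes by local name, so dc:creator and opf:role are picked up.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Creator struct {
	Value  string `xml:",chardata"`
	Role   string `xml:"role,attr"`
	FileAs string `xml:"file-as,attr"`
}

type Date struct {
	Value string `xml:",chardata"`
	Event string `xml:"event,attr"`
}

type Metadata struct {
	Titles       []string  `xml:"title"`
	Creators     []Creator `xml:"creator"`
	Contributors []Creator `xml:"contributor"`
	Dates        []Date    `xml:"date"`
	Language     []string  `xml:"language"`
	Description  []string  `xml:"description"`
}

// sample is invented data in the style of an EPUB 2 metadata block.
const sample = `<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:opf="http://www.idpf.org/2007/opf">
  <dc:title>Example Book</dc:title>
  <dc:creator opf:role="aut" opf:file-as="Doe, Jane">Jane Doe</dc:creator>
  <dc:date opf:event="publication">2001-02-03</dc:date>
  <dc:language>es</dc:language>
</metadata>`

// parseMetadata is a hypothetical helper that decodes a metadata fragment.
func parseMetadata(doc string) (Metadata, error) {
	var m Metadata
	err := xml.Unmarshal([]byte(doc), &m)
	return m, err
}

func main() {
	m, err := parseMetadata(sample)
	if err != nil {
		panic(err)
	}
	c := m.Creators[0]
	fmt.Printf("%s | %s (%s, %s) | %s\n", m.Titles[0], c.Value, c.Role, c.FileAs, m.Dates[0].Event)
}
```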

type PackageXML

type PackageXML struct {
	Metadata Metadata `xml:"metadata"`
	Manifest Manifest `xml:"manifest"`
	Spine    Spine    `xml:"spine"`
}

PackageXML represents the OPF file content (<package> element). It acts as the "brain" of the EPUB, linking everything together.

type Spine added in v0.7.0

type Spine struct {
	ItemRefs []ItemRef `xml:"itemref"`
}

Spine defines the linear *reading order* of the book. It does NOT contain file paths directly. Instead, it contains 'itemref' elements that point to 'id's in the Manifest.

This indirection separates the "order" logic from the "file" logic: the Manifest says which files exist, and the Spine says in which sequence to read them.

<spine>

<itemref idref="intro"/>  <!-- First, read the item with id="intro" -->
<itemref idref="chap1"/>  <!-- Then, read the item with id="chap1" -->

</spine>
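The Spine-to-Manifest lookup can be sketched as follows. The function readingOrder is hypothetical and the Item and ItemRef types are trimmed copies of the ones above (without XML tags). The sketch assumes hrefs are plain relative paths; percent-encoded hrefs and fragment identifiers are not handled. It resolves each href against the directory of the OPF file, since Href is documented as relative to the OPF.

```go
package main

import (
	"fmt"
	"path"
)

type Item struct {
	ID   string
	Href string // Path relative to OPF
}

type ItemRef struct {
	IDRef string // References an ID in the Manifest
}

// readingOrder is a hypothetical helper. It returns the archive paths of
// the spine documents in reading order.
func readingOrder(opfPath string, items []Item, refs []ItemRef) ([]string, error) {
	// Index the manifest by id for constant-time lookups.
	byID := make(map[string]string, len(items))
	for _, it := range items {
		byID[it.ID] = it.Href
	}

	base := path.Dir(opfPath) // zip entries always use forward slashes
	out := make([]string, 0, len(refs))
	for _, ref := range refs {
		href, ok := byID[ref.IDRef]
		if !ok {
			return nil, fmt.Errorf("spine references unknown manifest id %q", ref.IDRef)
		}
		out = append(out, path.Join(base, href))
	}
	return out, nil
}

func main() {
	items := []Item{
		{ID: "intro", Href: "intro.xhtml"},
		{ID: "chap1", Href: "chapter1.xhtml"},
		{ID: "css", Href: "style.css"},
	}
	refs := []ItemRef{{IDRef: "intro"}, {IDRef: "chap1"}}

	paths, err := readingOrder("OEBPS/content.opf", items, refs)
	fmt.Println(paths, err) // prints "[OEBPS/intro.xhtml OEBPS/chapter1.xhtml] <nil>"
}
```

The stylesheet is in the manifest but not in the spine, so it never appears in the result.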
