epub

package

Version: v0.12.0 (latest) Published: May 4, 2026 License: MIT Imports: 13 Imported by: 0


Repository

github.com/revelaction/segrob


Documentation ¶

Index ¶

func CleanForNLP(s string) string
func FindOPFPath(r *zip.Reader) (path string, err error)
type Book
- func New(z *zip.Reader) (*Book, error)
- func (b *Book) Labels() map[string]string
- func (b *Book) Text() (string, error)
type ContainerXML
type Creator
type Date
type Item
type ItemRef
type Manifest
type Metadata
type PackageXML
type Spine

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CleanForNLP ¶

func CleanForNLP(s string) string

CleanForNLP addresses invisible characters in extracted text. The output of pandoc can contain lines such as:

—Si hombres salvajes venir —<2060>dijo<2060>— ellos comerme a mí, vos salvaros.

Those <2060> markers are U+2060 WORD JOINER, an invisible zero-width character used in typesetting to prevent line breaks around em-dashes. It reaches the plain text through the following chain:

Unicode defines U+2060

↓

HTML/XHTML can express it as the character reference &#x2060; (decimal &#8288;)

↓

Epub (XHTML inside a zip) inherits it

↓

Pandoc converts to plain text, keeps the raw character

↓

A terminal or Vim shows it as <2060> because the character has no glyph

There are several Unicode categories of invisible/typesetting characters that are all noise for NLP. Unicode defines a formal category Cf (Format characters) that covers most of these.
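The Cf category mentioned above can be filtered with the standard library alone. The following is a minimal sketch, not the implementation of CleanForNLP: stripFormatChars is a hypothetical name, and it assumes that dropping every Cf rune is acceptable for the text at hand.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripFormatChars is a hypothetical helper, not the package's CleanForNLP.
// It drops every rune in Unicode category Cf (Format), which includes
// U+2060 WORD JOINER, U+200B ZERO WIDTH SPACE and U+00AD SOFT HYPHEN.
func stripFormatChars(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Cf, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// \u2014 is the em-dash, \u2060 is the invisible WORD JOINER.
	in := "venir \u2014\u2060dijo\u2060\u2014 ellos"
	out := stripFormatChars(in)
	fmt.Printf("%q\n", out)
	fmt.Println(len([]rune(in)), len([]rune(out)))
}
```

Cf also contains U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER, which carry meaning in some scripts and in emoji sequences, so a blanket filter like this one may be too aggressive for multilingual corpora.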

func FindOPFPath ¶ added in v0.9.0

func FindOPFPath(r *zip.Reader) (path string, err error)

FindOPFPath locates the OPF file in the EPUB archive by reading META-INF/container.xml.
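As an illustration of the lookup described above, the sketch below decodes a sample container.xml with encoding/xml and returns the first rootfile path. The helper name opfPathFromContainer and the sample document are assumptions made for this example; the struct mirrors the ContainerXML type documented in this package, and the real FindOPFPath may behave differently (for instance when several rootfiles are present).

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
)

// containerXML mirrors the ContainerXML type documented in this package.
type containerXML struct {
	Rootfiles []struct {
		FullPath string `xml:"full-path,attr"`
	} `xml:"rootfiles>rootfile"`
}

// opfPathFromContainer is a hypothetical helper: it returns the full-path of
// the first <rootfile> found in the bytes of META-INF/container.xml.
func opfPathFromContainer(data []byte) (string, error) {
	var c containerXML
	if err := xml.Unmarshal(data, &c); err != nil {
		return "", fmt.Errorf("parse container.xml: %w", err)
	}
	if len(c.Rootfiles) == 0 || c.Rootfiles[0].FullPath == "" {
		return "", errors.New("container.xml has no rootfile with a full-path")
	}
	return c.Rootfiles[0].FullPath, nil
}

// sampleContainer is an invented but typical container.xml.
const sampleContainer = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`

func main() {
	p, err := opfPathFromContainer([]byte(sampleContainer))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p)
}
```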

Types ¶

type Book ¶ added in v0.7.0

type Book struct {
	Package PackageXML
	// contains filtered or unexported fields
}

Book represents a parsed EPUB. It holds a reference to an open *zip.Reader but does not own its lifecycle. Callers MUST ensure the underlying zip source (e.g., os.File or memory buffer) remains open and valid for the entire lifetime of the Book instance. CRITICAL: If you built this from a *zip.ReadCloser (e.g., zip.OpenReader), do NOT call Close() on it until you are completely finished calling all methods on Book (like Text()).

func New ¶ added in v0.7.0

func New(z *zip.Reader) (*Book, error)

New creates a new Book from an existing zip reader. It parses the metadata immediately. The provided *zip.Reader must remain open to allow subsequent calls to methods like Text() to lazily read files.

Example usage with a file:

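r, err := zip.OpenReader("file.epub")
if err != nil {
	return err // r is nil on failure; &r.Reader would panic
}
defer r.Close()                // Runs at function exit, after Text() has returned
book, _ := epub.New(&r.Reader) // Pass the embedded Reader (check the error in real code)
text, _ := book.Text()         // Must happen BEFORE r.Close()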
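When the EPUB is already in memory, a *zip.Reader can be built without any file handle. The sketch below uses only the standard library; readerFromBytes and tinyArchive are hypothetical helpers, and the call to New is left as a comment because this example does not import the package.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// readerFromBytes wraps raw archive bytes in a *zip.Reader. There is nothing
// to close, but data must stay unmodified for as long as the reader is used.
func readerFromBytes(data []byte) (*zip.Reader, error) {
	return zip.NewReader(bytes.NewReader(data), int64(len(data)))
}

// tinyArchive builds a one-file zip in memory so that the sketch runs
// without a real EPUB on disk. It is not a valid EPUB.
func tinyArchive() ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	f, err := w.Create("mimetype")
	if err != nil {
		return nil, err
	}
	if _, err := f.Write([]byte("application/epub+zip")); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := tinyArchive()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	zr, err := readerFromBytes(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With real EPUB bytes, zr could now be passed to New(zr).
	fmt.Println(zr.File[0].Name)
}
```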

func (*Book) Labels ¶ added in v0.7.0

func (b *Book) Labels() map[string]string

Labels extracts Dublin Core (DC) metadata as a map from prefix to raw value. Only non-empty fields are included. Values are returned without normalization; the caller or storage layer is responsible for sanitizing them before persistence.
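Because the values come back raw, a caller may want a small cleanup pass before storing them. The following is one possible sketch under stated assumptions: sanitizeLabels is a hypothetical caller-side helper, the map keys in main are invented for illustration, and whitespace collapsing is only one of many reasonable policies.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabels is a hypothetical caller-side helper. It trims each value,
// collapses internal whitespace runs (including newlines) to a single space,
// and drops entries that end up empty.
func sanitizeLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		v = strings.Join(strings.Fields(v), " ")
		if v != "" {
			out[k] = v
		}
	}
	return out
}

func main() {
	// These keys are invented; the actual prefixes returned by Labels
	// are not specified in this documentation.
	raw := map[string]string{
		"title":    "  Robinson\n Crusoe ",
		"language": "es",
		"empty":    " \t ",
	}
	clean := sanitizeLabels(raw)
	fmt.Printf("%q %d\n", clean["title"], len(clean))
}
```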

func (*Book) Text ¶ added in v0.7.0

func (b *Book) Text() (string, error)

Text extracts the plain text content of the EPUB in reading order.

type ContainerXML ¶

type ContainerXML struct {
	Rootfiles []struct {
		FullPath string `xml:"full-path,attr"`
	} `xml:"rootfiles>rootfile"`
}

ContainerXML represents the META-INF/container.xml file, which lists the root files (OPF documents) of the EPUB.

type Creator ¶

type Creator struct {
	Value  string `xml:",chardata"`
	Role   string `xml:"role,attr"`
	FileAs string `xml:"file-as,attr"`
}

type Date ¶

type Date struct {
	Value string `xml:",chardata"`
	Event string `xml:"event,attr"`
}

type Item ¶ added in v0.7.0

type Item struct {
	ID   string `xml:"id,attr"`
	Href string `xml:"href,attr"` // Path relative to OPF
}

type ItemRef ¶ added in v0.7.0

type ItemRef struct {
	IDRef string `xml:"idref,attr"` // References an ID in the Manifest
}

type Manifest ¶ added in v0.7.0

type Manifest struct {
	Items []Item `xml:"item"`
}

Manifest lists *all* resources (HTML, CSS, Images, Fonts) in the EPUB. Think of it as an inventory. Each item has a unique 'id' and a 'href' (path).

<manifest>

<item id="intro" href="intro.xhtml" media-type="application/xhtml+xml"/>
<item id="chap1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="css" href="style.css" media-type="text/css"/>

</manifest>

type Metadata ¶

type Metadata struct {
	Titles       []string  `xml:"title"`
	Creators     []Creator `xml:"creator"`
	Contributors []Creator `xml:"contributor"`
	Dates        []Date    `xml:"date"`
	Language     []string  `xml:"language"`
	Description  []string  `xml:"description"`
}
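To show how struct tags like these pick up namespaced Dublin Core elements, here is a small sketch that decodes an invented OPF fragment. The lowercase types are local copies trimmed to a few fields, parseMetadata is a hypothetical helper, and the sample values are made up. It relies on encoding/xml matching elements and attributes by local name when the tag names no namespace, so dc:creator and opf:role are found by "creator" and "role".

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// creator and metadata are trimmed local copies of the documented types.
type creator struct {
	Value  string `xml:",chardata"`
	Role   string `xml:"role,attr"`
	FileAs string `xml:"file-as,attr"`
}

type metadata struct {
	Titles   []string  `xml:"title"`
	Creators []creator `xml:"creator"`
	Language []string  `xml:"language"`
}

type packageXML struct {
	Metadata metadata `xml:"metadata"`
}

// sampleOPF is an invented EPUB 2 style fragment.
const sampleOPF = `<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Robinson Crusoe</dc:title>
    <dc:creator opf:role="aut" opf:file-as="Defoe, Daniel">Daniel Defoe</dc:creator>
    <dc:language>es</dc:language>
  </metadata>
</package>`

// parseMetadata is a hypothetical helper that decodes only the metadata block.
func parseMetadata(data []byte) (metadata, error) {
	var p packageXML
	err := xml.Unmarshal(data, &p)
	return p.Metadata, err
}

func main() {
	m, err := parseMetadata([]byte(sampleOPF))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	c := m.Creators[0]
	fmt.Println(m.Titles[0], "|", c.Value, "|", c.Role, "|", c.FileAs)
}
```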

type PackageXML ¶

type PackageXML struct {
	Metadata Metadata `xml:"metadata"`
	Manifest Manifest `xml:"manifest"`
	Spine    Spine    `xml:"spine"`
}

PackageXML represents the OPF file content (<package> element). It acts as the "brain" of the EPUB, linking everything together.

type Spine ¶ added in v0.7.0

type Spine struct {
	ItemRefs []ItemRef `xml:"itemref"`
}

Spine defines the linear *reading order* of the book. It does NOT contain file paths directly. Instead, it contains 'itemref' elements that point to 'id's in the Manifest.

This allows the book to re-use the same content file in different places if needed, or just separates the "order" logic from the "file" logic.

<spine>

<itemref idref="intro"/>  <!-- First, read the item with id="intro" -->
<itemref idref="chap1"/>  <!-- Then, read the item with id="chap1" -->

</spine>
