Documentation
Functions ¶
func CleanForNLP ¶
CleanForNLP removes invisible typesetting characters from text. Pandoc's plain-text output can contain lines such as:
—Si hombres salvajes venir —<2060>dijo<2060>— ellos comerme a mí, vos salvaros.
Those <2060> markers are U+2060 WORD JOINER, an invisible zero-width character used in typesetting to prevent line breaks around em-dashes. It reaches the plain text through the following chain:
Unicode defines U+2060
↓
HTML/XHTML can express it as the numeric character reference &#x2060; (decimal &#8288;)
↓
Epub (XHTML inside a zip) inherits it
↓
Pandoc converts to plain text, keeps the raw character
↓
Your terminal/vim shows it as <2060> because it has no glyph
There are several Unicode categories of invisible/typesetting characters that are all noise for NLP. Unicode defines a formal category Cf (Format characters) that covers most of these.
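The Cf category mentioned above is exposed in Go's standard library as unicode.Cf, so filtering it out takes only a few lines. The following is a minimal sketch of the idea, not the actual implementation of CleanForNLP; the helper name stripFormatChars is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripFormatChars is a hypothetical helper (not part of this package's API).
// It drops every rune in Unicode general category Cf (Format), which includes
// U+2060 WORD JOINER, U+200B ZERO WIDTH SPACE and U+00AD SOFT HYPHEN.
func stripFormatChars(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Cf, r) {
			return -1 // a negative return value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	in := "dijo\u2060 ellos" // contains an invisible WORD JOINER
	out := stripFormatChars(in)
	fmt.Printf("%q -> %q\n", in, out)
}
```

One caveat with this approach: Cf also contains U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER, which carry meaning in some scripts and in emoji sequences, so a blanket Cf filter may be too aggressive for such text.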
Types ¶
type Book ¶ added in v0.7.0
type Book struct {
	Package PackageXML
	// contains filtered or unexported fields
}
Book represents a parsed EPUB. It holds a reference to an open *zip.Reader but does not own its lifecycle. Callers MUST ensure the underlying zip source (e.g., os.File or memory buffer) remains open and valid for the entire lifetime of the Book instance. CRITICAL: If you built this from a *zip.ReadCloser (e.g., zip.OpenReader), do NOT call Close() on it until you are completely finished calling all methods on Book (like Text()).
func New ¶ added in v0.7.0
New creates a new Book from an existing zip reader. It parses the metadata immediately. The provided *zip.Reader must remain open to allow subsequent calls to methods like Text() to lazily read files.
Example usage with a file:
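r, err := zip.OpenReader("file.epub")
if err != nil {
	log.Fatal(err) // r is nil on failure, so &r.Reader below would panic
}
defer r.Close()                // runs only after every Book method has returned
book, _ := epub.New(&r.Reader) // Pass the embedded Reader (check the error in real code)
text, _ := book.Text()         // Safe: r is still open here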
type ContainerXML ¶
type ContainerXML struct {
	Rootfiles []struct {
		FullPath string `xml:"full-path,attr"`
	} `xml:"rootfiles>rootfile"`
}
ContainerXML represents META-INF/container.xml. Each rootfile's full-path attribute gives the location of an OPF package file inside the archive.
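As an illustration of how these struct tags map onto the file, the sketch below decodes a sample container.xml with encoding/xml and extracts the path of the first rootfile. The sample document, its OPF path, and the helper name opfPath are assumptions made for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// ContainerXML is copied from the type documented above.
type ContainerXML struct {
	Rootfiles []struct {
		FullPath string `xml:"full-path,attr"`
	} `xml:"rootfiles>rootfile"`
}

// sampleContainer is an illustrative container.xml; the OPF path is made up.
const sampleContainer = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`

// opfPath (hypothetical helper) returns the full-path of the first rootfile.
func opfPath(data []byte) (string, error) {
	var c ContainerXML
	if err := xml.Unmarshal(data, &c); err != nil {
		return "", err
	}
	if len(c.Rootfiles) == 0 {
		return "", fmt.Errorf("container.xml lists no rootfile")
	}
	return c.Rootfiles[0].FullPath, nil
}

func main() {
	path, err := opfPath([]byte(sampleContainer))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(path)
}
```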
type ItemRef ¶ added in v0.7.0
type ItemRef struct {
	IDRef string `xml:"idref,attr"` // References an ID in the Manifest
}
type Manifest ¶ added in v0.7.0
type Manifest struct {
	Items []Item `xml:"item"`
}
Manifest lists *all* resources (HTML, CSS, Images, Fonts) in the EPUB. Think of it as an inventory. Each item has a unique 'id' and a 'href' (path).
<manifest>
  <item id="intro" href="intro.xhtml" media-type="application/xhtml+xml"/>
  <item id="chap1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  <item id="css" href="style.css" media-type="text/css"/>
</manifest>
type PackageXML ¶
type PackageXML struct {
	Metadata Metadata `xml:"metadata"`
	Manifest Manifest `xml:"manifest"`
	Spine    Spine    `xml:"spine"`
}
PackageXML represents the OPF file content (<package> element). It acts as the "brain" of the EPUB, linking everything together.
type Spine ¶ added in v0.7.0
type Spine struct {
	ItemRefs []ItemRef `xml:"itemref"`
}
Spine defines the linear *reading order* of the book. It does NOT contain file paths directly. Instead, it contains 'itemref' elements that point to 'id's in the Manifest.
This indirection separates the "order" logic from the "file" logic: the Spine says what comes next, and the Manifest says where each item lives.
<spine>
  <itemref idref="intro"/> <!-- First, read the item with id="intro" -->
  <itemref idref="chap1"/> <!-- Then, read the item with id="chap1" -->
</spine>