Documentation ¶
Overview ¶
Package epub provides a pure-Go library for reading and parsing ePub 2 and ePub 3 files.
It extracts Dublin Core metadata, the table of contents (NCX and Nav), spine-ordered chapters with lazy content loading, cover images, and landmarks. DRM-protected files are detected and rejected with ErrDRMProtected.
Opening an ePub ¶
Use Open to open a file by path, or NewReader to read from an io.ReaderAt:
book, err := epub.Open("book.epub")
if err != nil {
	log.Fatal(err)
}
defer book.Close()
Metadata ¶
The Book.Metadata method returns a Metadata struct containing titles, authors, language, identifiers (ISBN/UUID), publisher, date, description, subjects, and more:
md := book.Metadata()
fmt.Println(md.Titles[0])
Table of Contents ¶
The Book.TOC method returns a tree of TOCItem entries. Each item includes a spine index range indicating which spine items it covers:
for _, item := range book.TOC() {
	fmt.Println(item.Title, item.Href)
}
Chapters ¶
Chapters are returned in spine order via Book.Chapters. Content is loaded lazily; call Chapter.RawContent for raw XHTML, Chapter.TextContent for plain text, or Chapter.BodyHTML for sanitised inner HTML with rewritten image paths:
for _, ch := range book.Chapters() {
	text, _ := ch.TextContent()
	fmt.Println(ch.Title, len(text))
}
Use Book.ContentChapters to exclude Project Gutenberg license pages.
Cover Image ¶
Book.Cover attempts multiple strategies (ePub 3 properties, ePub 2 meta, guide reference, manifest heuristic, first spine item) to locate the cover:
cover, err := book.Cover()
if err == nil {
	os.WriteFile("cover.jpg", cover.Data, 0644)
}
Error Handling ¶
The package defines sentinel errors for common failure cases:
- ErrDRMProtected – the file is DRM-protected
- ErrInvalidEPub – structural validation failed
- ErrInvalidChapter – a Chapter handle is invalid
- ErrFileNotFound – a requested file is not in the archive
- ErrNoCover – no cover image could be detected
If no table of contents is present, Book.TOC returns an empty slice and Book.HasTOC returns false.
Index ¶
- Variables
- type Author
- type Book
- func (b *Book) Chapters() []Chapter
- func (b *Book) Close() error
- func (b *Book) ContentChapters() []Chapter
- func (b *Book) Cover() (CoverImage, error)
- func (b *Book) HasTOC() bool
- func (b *Book) Landmarks() []TOCItem
- func (b *Book) Metadata() Metadata
- func (b *Book) ReadFile(name string) ([]byte, error)
- func (b *Book) TOC() []TOCItem
- func (b *Book) Warnings() []string
- type Chapter
- type CoverImage
- type Identifier
- type Metadata
- type TOCItem
Constants ¶
This section is empty.
Variables ¶
var (
	// ErrDRMProtected indicates the ePub file is protected by DRM
	// (e.g., Adobe ADEPT, Apple FairPlay, Readium LCP) and cannot be read.
	ErrDRMProtected = errors.New("epub: file is DRM protected")

	// ErrInvalidEPub indicates the file is not a valid ePub
	// (e.g., missing container.xml and no .opf file found).
	ErrInvalidEPub = errors.New("epub: invalid ePub file")

	// ErrInvalidChapter indicates a Chapter handle is invalid
	// (for example, a zero-value Chapter without an associated Book).
	ErrInvalidChapter = errors.New("epub: invalid chapter handle")

	// ErrFileNotFound indicates the requested file does not exist
	// in the ePub archive.
	ErrFileNotFound = errors.New("epub: file not found in archive")

	// ErrNoCover indicates no cover image could be detected
	// using any of the supported strategies.
	ErrNoCover = errors.New("epub: no cover image found")
)
Sentinel errors returned by the epub package.
Functions ¶
This section is empty.
Types ¶
type Author ¶
type Author struct {
	// Name is the display name of the author (dc:creator text content).
	Name string
	// FileAs is the opf:file-as attribute value (e.g., "Dickens, Charles").
	FileAs string
	// Role is the opf:role attribute value (e.g., "aut", "edt", "trl").
	Role string
}
Author represents a dc:creator entry with optional file-as and role attributes.
type Book ¶
type Book struct {
	// contains filtered or unexported fields
}
Book is the main public API type for reading ePub files. Use Open or NewReader to create a Book instance.
A Book is not safe for concurrent use by multiple goroutines.
func NewReader ¶
NewReader creates a Book from an io.ReaderAt with the given size. The caller is responsible for the lifetime of r; Close only cleans up internal state.
Example ¶
package main

import (
	"log"
	"os"

	"github.com/simp-lee/epub"
)

func main() {
	f, err := os.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close() // the caller owns f; Book.Close does not close it
	info, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}
	book, err := epub.NewReader(f, info.Size())
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()
}
func Open ¶
Open opens an ePub file at the given path. The caller must call Close when done reading from the book.
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	md := book.Metadata()
	fmt.Println(md.Titles[0])
}
func (*Book) Chapters ¶
func (b *Book) Chapters() []Chapter
Chapters returns the chapters in spine order. Each Chapter is a lightweight handle; content is loaded lazily when RawContent, TextContent, or BodyHTML is called. Title is derived from the TOC by matching Href (ignoring fragment). The result is cached after the first call.
Note: IsLicense is not populated by Chapters(). Call ContentChapters() to trigger Gutenberg license detection; after that call, the cached chapters returned by Chapters() will also have IsLicense set.
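Matching a chapter to its TOC title while ignoring the fragment can be sketched as follows. The tocEntry type and titlesByFile helper are hypothetical stand-ins, and letting the first TOC entry for a file win is an assumption rather than documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// tocEntry is a hypothetical, flattened stand-in for a TOC item.
type tocEntry struct {
	Title string
	Href  string
}

// titlesByFile maps each content file to the title of the first TOC
// entry that points into it, ignoring any "#fragment" suffix.
func titlesByFile(toc []tocEntry) map[string]string {
	titles := make(map[string]string, len(toc))
	for _, e := range toc {
		file, _, _ := strings.Cut(e.Href, "#")
		if _, seen := titles[file]; !seen {
			titles[file] = e.Title
		}
	}
	return titles
}

func main() {
	toc := []tocEntry{
		{Title: "Chapter 1", Href: "text/ch01.xhtml"},
		{Title: "Section 1.1", Href: "text/ch01.xhtml#s1"},
		{Title: "Chapter 2", Href: "text/ch02.xhtml#start"},
	}
	titles := titlesByFile(toc)
	// A spine item without a TOC entry gets an empty title.
	for _, href := range []string{"text/ch01.xhtml", "text/ch02.xhtml", "text/notes.xhtml"} {
		fmt.Printf("%s -> %q\n", href, titles[href])
	}
}
```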
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	for _, ch := range book.Chapters() {
		text, err := ch.TextContent()
		if err != nil {
			continue
		}
		fmt.Printf("%-20s %d chars\n", ch.Title, len(text))
	}
}
func (*Book) Close ¶
func (b *Book) Close() error
Close releases resources held by the Book. When the Book was created via Open, Close closes the underlying file. Close is idempotent.
func (*Book) ContentChapters ¶
func (b *Book) ContentChapters() []Chapter
ContentChapters returns the chapters in spine order, excluding any detected Project Gutenberg license pages (IsLicense == true). On the first call, it reads every chapter file to perform license detection; subsequent calls use the cached result. After this call, Chapters() also returns chapters with IsLicense correctly set.
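The detect-once-then-filter behaviour described above might look roughly like the sketch below. Everything in it is hypothetical: the book and chapter types are local stand-ins, and the marker string is illustrative only, since the patterns the package actually matches are not listed in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// chapter is a hypothetical stand-in for the package's Chapter type.
type chapter struct {
	Href      string
	IsLicense bool
	text      string // the real package reads content from the archive instead
}

// book caches the result of license detection after the first call.
type book struct {
	chapters []chapter
	detected bool
}

// looksLikeLicense uses an illustrative marker only (an assumption).
func looksLikeLicense(text string) bool {
	return strings.Contains(strings.ToUpper(text), "PROJECT GUTENBERG LICENSE")
}

// contentChapters runs detection once, then filters on the cached flags.
func (b *book) contentChapters() []chapter {
	if !b.detected {
		for i := range b.chapters {
			b.chapters[i].IsLicense = looksLikeLicense(b.chapters[i].text)
		}
		b.detected = true
	}
	var out []chapter
	for _, ch := range b.chapters {
		if !ch.IsLicense {
			out = append(out, ch)
		}
	}
	return out
}

func main() {
	b := &book{chapters: []chapter{
		{Href: "ch01.xhtml", text: "It was the best of times."},
		{Href: "license.xhtml", text: "The Full Project Gutenberg License ..."},
	}}
	for _, ch := range b.contentChapters() {
		fmt.Println(ch.Href)
	}
	// The flag is now set on the shared chapter slice as well.
	fmt.Println(b.chapters[1].IsLicense)
}
```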
func (*Book) Cover ¶
func (b *Book) Cover() (CoverImage, error)
Cover detects and returns the cover image using multiple strategies. Strategies are tried in priority order:
- ePub 3 manifest item with properties="cover-image"
- ePub 2 <meta name="cover" content="ID"/> → manifest lookup
- <guide> reference type="cover" → parse XHTML for first <img>
- Manifest item whose ID or href contains "cover" with image/* media-type
- First spine item's XHTML → first <img>
Returns ErrNoCover if no strategy succeeds.
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	cover, err := book.Cover()
	if err != nil {
		fmt.Println("no cover found")
		return
	}
	fmt.Printf("Cover: %s (%s, %d bytes)\n", cover.Path, cover.MediaType, len(cover.Data))
}
func (*Book) Landmarks ¶
func (b *Book) Landmarks() []TOCItem
Landmarks returns the landmarks from an ePub 3 nav document. Returns nil for ePub 2 files or when no landmarks are present.
func (*Book) Metadata ¶
func (b *Book) Metadata() Metadata
Metadata returns the extracted metadata from the ePub.
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	md := book.Metadata()
	fmt.Printf("Title: %s\n", md.Titles[0])
	fmt.Printf("Version: %s\n", md.Version)
	for _, a := range md.Authors {
		fmt.Printf("Author: %s\n", a.Name)
	}
}
func (*Book) ReadFile ¶
func (b *Book) ReadFile(name string) ([]byte, error)
ReadFile reads a file from the ePub archive by its ZIP-internal path. If no exact match is found, the lookup falls back to a case-insensitive match.
func (*Book) TOC ¶
func (b *Book) TOC() []TOCItem
TOC returns the table of contents as a tree of TOCItem. Each item's SpineIndex is set to the index of the corresponding spine item, or -1 if no match was found.
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	for _, item := range book.TOC() {
		fmt.Printf("%s → %s\n", item.Title, item.Href)
	}
}
type Chapter ¶
type Chapter struct {
	// Title is the chapter title derived from the TOC (empty if not in TOC).
	Title string
	// Href is the content file path within the ePub archive.
	Href string
	// ID is the manifest item ID for this chapter.
	ID string
	// Linear indicates whether this chapter is part of the linear reading order.
	Linear bool
	// IsLicense indicates whether this chapter is a Project Gutenberg license page.
	// Detection is based on known Gutenberg license patterns in the text content.
	IsLicense bool
	// contains filtered or unexported fields
}
Chapter represents a spine item with methods for content access. Content is loaded lazily from the underlying ePub archive.
func (Chapter) BodyHTML ¶
BodyHTML extracts the inner HTML of the <body> element from this chapter's XHTML. Image paths are rewritten to ZIP-root-relative paths. Script and style elements and event handler attributes are stripped.
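Rewriting a chapter-relative image reference to a ZIP-root-relative path amounts to resolving it against the chapter's directory. The sketch below does this with the standard path package; zipRootRelative is a hypothetical helper, and it deliberately ignores cases such as absolute URLs, data: URIs, and percent-encoded names, which a full implementation would need to handle.

```go
package main

import (
	"fmt"
	"path"
)

// zipRootRelative resolves an image reference found in a chapter
// against the chapter's own location in the archive. path.Join also
// cleans the result, collapsing any ".." segments.
func zipRootRelative(chapterHref, src string) string {
	return path.Join(path.Dir(chapterHref), src)
}

func main() {
	fmt.Println(zipRootRelative("OEBPS/text/ch01.xhtml", "../images/fig1.png"))
	fmt.Println(zipRootRelative("OEBPS/text/ch01.xhtml", "fig2.png"))
	fmt.Println(zipRootRelative("ch01.xhtml", "images/fig3.png"))
}
```

The resulting paths have the form expected by an archive lookup such as Book.ReadFile.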
func (Chapter) RawContent ¶
RawContent reads the raw XHTML bytes of this chapter from the ePub archive. A leading UTF-8 BOM is stripped if present.
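The UTF-8 byte order mark is the three-byte sequence EF BB BF. A minimal sketch of stripping it, using a hypothetical stripBOM helper, is shown below.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the byte sequence that some editors prepend to UTF-8 files.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM returns b without a leading UTF-8 BOM. Input that does not
// start with a BOM is returned unchanged.
func stripBOM(b []byte) []byte {
	return bytes.TrimPrefix(b, utf8BOM)
}

func main() {
	raw := append([]byte{0xEF, 0xBB, 0xBF}, []byte(`<?xml version="1.0"?>`)...)
	clean := stripBOM(raw)
	fmt.Printf("%d bytes before, %d bytes after\n", len(raw), len(clean))
	fmt.Println(string(clean))
}
```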
func (Chapter) TextContent ¶
TextContent extracts the plain text content from this chapter's XHTML. Block-level elements produce line breaks; script and style content is skipped.
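A rough approximation of this kind of extraction can be built on the encoding/xml tokenizer, as sketched below. This is not the package's implementation: plainText is a hypothetical helper, the set of block-level elements is an illustrative subset, and whitespace handling is reduced to trimming the final result.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// blockElems is an illustrative subset of block-level elements.
var blockElems = map[string]bool{
	"p": true, "div": true, "h1": true, "h2": true, "h3": true, "li": true, "br": true,
}

// plainText walks the XHTML token stream, copying character data,
// adding a newline at the end of each block element, and skipping
// everything inside script and style elements.
func plainText(xhtml string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(xhtml))
	dec.Strict = false
	dec.Entity = xml.HTMLEntity

	var sb strings.Builder
	skip := 0 // nesting depth inside script/style
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "script" || t.Name.Local == "style" {
				skip++
			}
		case xml.EndElement:
			switch {
			case t.Name.Local == "script" || t.Name.Local == "style":
				skip--
			case blockElems[t.Name.Local]:
				sb.WriteByte('\n')
			}
		case xml.CharData:
			if skip == 0 {
				sb.Write(t) // copies the bytes before the next Token call
			}
		}
	}
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	doc := `<html><body><h1>Title</h1><style>p{color:red}</style><p>First &amp; second.</p><p>Third.</p></body></html>`
	text, err := plainText(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```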
type CoverImage ¶
type CoverImage struct {
	// Path is the ZIP-internal path to the cover image file.
	Path string
	// MediaType is the MIME type of the cover image (e.g., "image/jpeg").
	MediaType string
	// Data is the raw image bytes.
	Data []byte
}
CoverImage holds the detected cover image data.
type Identifier ¶
type Identifier struct {
	// Value is the identifier text content (e.g., ISBN, UUID, URI).
	Value string
	// Scheme is the opf:scheme attribute value (e.g., "ISBN", "UUID").
	Scheme string
	// ID is the xml id attribute of this identifier element.
	ID string
}
Identifier represents a dc:identifier entry.
type Metadata ¶
type Metadata struct {
	// Version is the ePub specification version (e.g., "2.0", "3.0").
	Version string
	// Titles contains all dc:title values. The first entry is the primary title.
	Titles []string
	// Authors contains all dc:creator entries with their roles and file-as values.
	Authors []Author
	// Language contains all dc:language values (BCP 47 tags, e.g., "en", "zh-CN").
	Language []string
	// Identifiers contains all dc:identifier entries (ISBN, UUID, URI, etc.).
	Identifiers []Identifier
	// Publisher is the dc:publisher value.
	Publisher string
	// Date is the dc:date value (publication date as raw string).
	Date string
	// Description is the dc:description value.
	Description string
	// Subjects contains all dc:subject values.
	Subjects []string
	// Rights is the dc:rights value.
	Rights string
	// Source is the dc:source value.
	Source string
}
Metadata holds the Dublin Core and other metadata extracted from the OPF file.
type TOCItem ¶
type TOCItem struct {
	// Title is the display text of the TOC entry.
	Title string
	// Href is the content file reference (may include a fragment, e.g., "chapter01.xhtml#section2").
	Href string
	// Children contains nested TOC entries under this item.
	Children []TOCItem
	// SpineIndex is the index into the spine that this TOC entry points to.
	// A value of -1 indicates no spine association was found.
	SpineIndex int
	// SpineEndIndex is the exclusive end index into the spine for this TOC entry.
	// The entry covers spine[SpineIndex:SpineEndIndex]. For example, if SpineIndex=0
	// and SpineEndIndex=3, the entry covers spine items 0, 1, and 2.
	// A value of -1 indicates no spine association was found.
	SpineEndIndex int
}
TOCItem represents a single entry in the table of contents. The TOC is a tree structure; each item may have nested children.
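Because the TOC is a tree, rendering it usually means a depth-first walk. The sketch below mirrors the documented TOCItem fields in a local type so that it compiles on its own; the outline helper and the sample data are hypothetical. It also shows the half-open SpineIndex:SpineEndIndex range and the -1 "no match" value.

```go
package main

import (
	"fmt"
	"strings"
)

// TOCItem mirrors the documented fields so the sketch is self-contained.
type TOCItem struct {
	Title         string
	Href          string
	Children      []TOCItem
	SpineIndex    int
	SpineEndIndex int
}

// outline renders the tree depth-first, one line per entry, indented
// by two spaces per nesting level.
func outline(items []TOCItem, depth int) []string {
	var lines []string
	for _, it := range items {
		span := "no spine match"
		if it.SpineIndex >= 0 && it.SpineEndIndex >= 0 {
			span = fmt.Sprintf("spine[%d:%d]", it.SpineIndex, it.SpineEndIndex)
		}
		lines = append(lines, fmt.Sprintf("%s%s (%s)", strings.Repeat("  ", depth), it.Title, span))
		lines = append(lines, outline(it.Children, depth+1)...)
	}
	return lines
}

func main() {
	toc := []TOCItem{
		{Title: "Part I", Href: "part1.xhtml", SpineIndex: 0, SpineEndIndex: 3, Children: []TOCItem{
			{Title: "Chapter 1", Href: "ch01.xhtml", SpineIndex: 1, SpineEndIndex: 2},
			{Title: "Chapter 2", Href: "ch02.xhtml", SpineIndex: 2, SpineEndIndex: 3},
		}},
		{Title: "Appendix", Href: "missing.xhtml", SpineIndex: -1, SpineEndIndex: -1},
	}
	for _, line := range outline(toc, 0) {
		fmt.Println(line)
	}
}
```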