epub

package module
v0.0.0-...-e0b9638 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 18, 2026 License: MIT Imports: 15 Imported by: 0

README

epub

A pure-Go library for reading and parsing ePub 2 and ePub 3 files.

Features

  • ePub 2 and ePub 3 support
  • Dublin Core metadata extraction (titles, authors, identifiers, language, etc.)
  • Table of contents parsing (NCX for ePub 2, Nav document for ePub 3)
  • Landmarks extraction (ePub 3)
  • Spine-ordered chapter access with lazy content loading
  • Plain text, raw XHTML, and sanitised body HTML output
  • Cover image detection via multiple strategies
  • Project Gutenberg license page detection
  • DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP)
  • Font obfuscation awareness
  • ZIP bomb protection

Installation

go get github.com/simp-lee/epub

Quick Start

package main

import (
    "fmt"
    "log"

    "github.com/simp-lee/epub"
)

func main() {
    book, err := epub.Open("book.epub")
    if err != nil {
        log.Fatal(err)
    }
    defer book.Close()

    // Metadata
    md := book.Metadata()
    if len(md.Titles) > 0 {
        fmt.Println("Title:", md.Titles[0])
    }
    for _, a := range md.Authors {
        fmt.Println("Author:", a.Name)
    }

    // Table of Contents
    for _, item := range book.TOC() {
        fmt.Printf("  %s → %s\n", item.Title, item.Href)
    }

    // Chapters
    for _, ch := range book.Chapters() {
        text, err := ch.TextContent()
        if err != nil {
            continue
        }
        fmt.Printf("  [%s] %d chars\n", ch.Title, len(text))
    }

    // Cover image
    cover, err := book.Cover()
    if err == nil {
        fmt.Printf("Cover: %s (%d bytes)\n", cover.MediaType, len(cover.Data))
    }
}

API Overview

Opening

Function            Description
Open(path)          Open an ePub file by path
NewReader(r, size)  Open from an io.ReaderAt

Book Methods

Method             Description
Close()            Release resources
Metadata()         Dublin Core metadata
TOC()              Table of contents tree
Landmarks()        ePub 3 landmarks
Chapters()         Spine-ordered chapters
ContentChapters()  Chapters excluding license pages
Cover()            Detect and return cover image
ReadFile(name)     Read any file from the archive
HasTOC()           Whether a TOC is present
Warnings()         Non-fatal parsing warnings

Chapter Methods

Method         Description
RawContent()   Raw XHTML bytes
TextContent()  Extracted plain text
BodyHTML()     Sanitised <body> inner HTML

Error Handling

The package provides sentinel errors for common failure cases:

errors.Is(err, epub.ErrDRMProtected)   // DRM-encrypted file
errors.Is(err, epub.ErrInvalidEPub)    // Invalid ePub structure
errors.Is(err, epub.ErrInvalidChapter) // Invalid chapter handle (zero-value)
errors.Is(err, epub.ErrNoCover)        // No cover image found
errors.Is(err, epub.ErrFileNotFound)   // File not in archive

When a book has no NCX/nav table of contents, TOC() returns an empty slice.

License

MIT

Documentation

Overview

Package epub provides a pure-Go library for reading and parsing ePub 2 and ePub 3 files.

It extracts metadata (Dublin Core), table of contents (NCX and Nav), spine-ordered chapters with lazy content loading, cover images, and landmarks. DRM-protected files are detected and rejected with ErrDRMProtected.

Opening an ePub

Use Open to open a file by path, or NewReader to read from an io.ReaderAt:

book, err := epub.Open("book.epub")
if err != nil {
    log.Fatal(err)
}
defer book.Close()

Metadata

The Book.Metadata method returns a Metadata struct containing titles, authors, language, identifiers (ISBN/UUID), publisher, date, description, subjects, and more:

md := book.Metadata()
if len(md.Titles) > 0 { // Titles can be empty; guard before indexing
    fmt.Println(md.Titles[0])
}

Table of Contents

The Book.TOC method returns a tree of TOCItem entries. Each item includes a spine index range indicating which spine items it covers:

for _, item := range book.TOC() {
    fmt.Println(item.Title, item.Href)
}

Chapters

Chapters are returned in spine order via Book.Chapters. Content is loaded lazily; call Chapter.RawContent for raw XHTML, Chapter.TextContent for plain text, or Chapter.BodyHTML for sanitised inner HTML with rewritten image paths:

for _, ch := range book.Chapters() {
    text, err := ch.TextContent()
    if err != nil {
        continue // skip chapters whose content cannot be read
    }
    fmt.Println(ch.Title, len(text))
}

Use Book.ContentChapters to exclude Project Gutenberg license pages.

Cover Image

Book.Cover attempts multiple strategies (ePub 3 properties, ePub 2 meta, guide reference, manifest heuristic, first spine item) to locate the cover:

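cover, err := book.Cover()
if err == nil {
    // "cover.jpg" assumes a JPEG; check cover.MediaType for the actual format.
    if err := os.WriteFile("cover.jpg", cover.Data, 0644); err != nil {
        log.Fatal(err)
    }
}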

Error Handling

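The package defines sentinel errors for common failure cases:

  • ErrDRMProtected: the file is protected by DRM and cannot be read
  • ErrInvalidEPub: the file is not a valid ePub
  • ErrInvalidChapter: the Chapter handle is invalid (for example, a zero-value Chapter)
  • ErrFileNotFound: the requested file does not exist in the archive
  • ErrNoCover: no cover image could be detected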

If no table of contents is present, Book.TOC returns an empty slice and Book.HasTOC returns false.

Constants

This section is empty.

Variables

var (
	// ErrDRMProtected indicates the ePub file is protected by DRM
	// (e.g., Adobe ADEPT, Apple FairPlay, Readium LCP) and cannot be read.
	ErrDRMProtected = errors.New("epub: file is DRM protected")

	// ErrInvalidEPub indicates the file is not a valid ePub
	// (e.g., missing container.xml and no .opf file found).
	ErrInvalidEPub = errors.New("epub: invalid ePub file")

	// ErrInvalidChapter indicates a Chapter handle is invalid
	// (for example, a zero-value Chapter without an associated Book).
	ErrInvalidChapter = errors.New("epub: invalid chapter handle")

	// ErrFileNotFound indicates the requested file does not exist
	// in the ePub archive.
	ErrFileNotFound = errors.New("epub: file not found in archive")

	// ErrNoCover indicates no cover image could be detected
	// using any of the supported strategies.
	ErrNoCover = errors.New("epub: no cover image found")
)

Sentinel errors returned by the epub package.
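
Because these are plain sentinel values, callers can branch on them with errors.Is even after an error has been wrapped. The sketch below is self-contained: it declares local stand-ins with the same names and messages as three of the sentinels above rather than importing the package, and the describe and wrap helpers are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins mirroring the documented sentinels so the sketch runs
// without the epub package. Real code would use epub.ErrDRMProtected etc.
var (
	ErrDRMProtected = errors.New("epub: file is DRM protected")
	ErrInvalidEPub  = errors.New("epub: invalid ePub file")
	ErrNoCover      = errors.New("epub: no cover image found")
)

// wrap (hypothetical helper) adds context while keeping the sentinel
// reachable through errors.Is, thanks to the %w verb.
func wrap(path string, err error) error {
	return fmt.Errorf("open %q: %w", path, err)
}

// describe (hypothetical helper) maps an error to a user-facing message.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrDRMProtected):
		return "book is DRM protected"
	case errors.Is(err, ErrInvalidEPub):
		return "not a valid ePub"
	case errors.Is(err, ErrNoCover):
		return "no cover, using placeholder"
	default:
		return "unexpected error: " + err.Error()
	}
}

func main() {
	fmt.Println(describe(wrap("book.epub", ErrDRMProtected)))
	fmt.Println(describe(ErrNoCover))
	fmt.Println(describe(nil))
}
```

Comparing with == would miss the wrapped case, which is why errors.Is is the safer default here.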

Functions

This section is empty.

Types

type Author

type Author struct {
	// Name is the display name of the author (dc:creator text content).
	Name string

	// FileAs is the opf:file-as attribute value (e.g., "Dickens, Charles").
	FileAs string

	// Role is the opf:role attribute value (e.g., "aut", "edt", "trl").
	Role string
}

Author represents a dc:creator entry with optional file-as and role attributes.
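
One plausible use of FileAs is sorting a list of authors by surname. The following sketch copies the Author struct locally so it runs on its own; sortKey and sortAuthors are hypothetical helpers, and falling back to Name when FileAs is empty is an assumption about what a caller would want, not documented package behaviour.

```go
package main

import (
	"fmt"
	"sort"
)

// Author is a local copy of the documented struct, so the sketch is standalone.
type Author struct {
	Name   string
	FileAs string
	Role   string
}

// sortKey (hypothetical helper) prefers FileAs and falls back to Name.
func sortKey(a Author) string {
	if a.FileAs != "" {
		return a.FileAs
	}
	return a.Name
}

// sortAuthors returns a sorted copy and leaves the input slice untouched.
func sortAuthors(in []Author) []Author {
	out := append([]Author(nil), in...)
	sort.SliceStable(out, func(i, j int) bool {
		return sortKey(out[i]) < sortKey(out[j])
	})
	return out
}

func main() {
	authors := []Author{
		{Name: "Charles Dickens", FileAs: "Dickens, Charles", Role: "aut"},
		{Name: "Anonymous"},
		{Name: "Jane Austen", FileAs: "Austen, Jane", Role: "aut"},
	}
	for _, a := range sortAuthors(authors) {
		fmt.Println(sortKey(a))
	}
}
```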

type Book

type Book struct {
	// contains filtered or unexported fields
}

Book is the main public API type for reading ePub files. Use Open or NewReader to create a Book instance.

A Book is not safe for concurrent use by multiple goroutines.

func NewReader

func NewReader(r io.ReaderAt, size int64) (*Book, error)

NewReader creates a Book from an io.ReaderAt with the given size. The caller is responsible for the lifetime of r; Close only cleans up internal state.

Example
package main

import (
	"log"
	"os"

	"github.com/simp-lee/epub"
)

func main() {
	// NewReader works with any io.ReaderAt, such as an *os.File or *bytes.Reader.
	f, err := os.Open("book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close() // the caller owns f; Book.Close does not close it

	info, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	book, err := epub.NewReader(f, info.Size())
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()
}
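
NewReader takes the same (io.ReaderAt, size) pair as the standard library's zip.NewReader, so data that is already in memory can be passed through a *bytes.Reader. The sketch below shows that pattern using only archive/zip, so that it runs without the epub package. The archive it builds is just a ZIP with two empty entries, not a valid ePub, and buildZip and entryNames are hypothetical helper names.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
)

// buildZip creates a tiny in-memory ZIP archive with empty entries.
func buildZip(names ...string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for _, n := range names {
		if _, err := w.Create(n); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// entryNames opens data through an io.ReaderAt plus an explicit size,
// the same pair of arguments that NewReader expects.
func entryNames(data []byte) ([]string, error) {
	r := bytes.NewReader(data) // *bytes.Reader implements io.ReaderAt
	zr, err := zip.NewReader(r, int64(len(data)))
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(zr.File))
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	data, err := buildZip("mimetype", "META-INF/container.xml")
	if err != nil {
		log.Fatal(err)
	}
	names, err := entryNames(data)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names)
}
```

With the real package, the equivalent call would presumably be epub.NewReader(bytes.NewReader(data), int64(len(data))), given the signature documented above.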

func Open

func Open(path string) (*Book, error)

Open opens an ePub file at the given path. The caller must call Close when done reading from the book.

Example
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	md := book.Metadata()
	if len(md.Titles) > 0 {
		fmt.Println(md.Titles[0])
	}
}

func (*Book) Chapters

func (b *Book) Chapters() []Chapter

Chapters returns the chapters in spine order. Each Chapter is a lightweight handle; content is loaded lazily when RawContent, TextContent, or BodyHTML is called. Title is derived from the TOC by matching Href (ignoring fragment). The result is cached after the first call.

Note: IsLicense is not populated by Chapters(). Call ContentChapters() to trigger Gutenberg license detection; after that call, the cached chapters returned by Chapters() will also have IsLicense set.

Example
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	for _, ch := range book.Chapters() {
		text, err := ch.TextContent()
		if err != nil {
			continue
		}
		fmt.Printf("%-20s %d chars\n", ch.Title, len(text))
	}
}
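
Chapter titles are described above as being derived from the TOC by matching Href with the fragment ignored. A minimal sketch of that kind of lookup follows. It uses a trimmed local copy of TOCItem, and the depth-first, first-match-wins order is an assumption for illustration; the package's actual matching rules may differ. stripFragment and titleFor are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// TOCItem is a trimmed local copy of the documented struct.
type TOCItem struct {
	Title    string
	Href     string
	Children []TOCItem
}

// stripFragment drops everything from the first '#' onward.
func stripFragment(href string) string {
	base, _, _ := strings.Cut(href, "#")
	return base
}

// titleFor walks the tree depth-first and returns the title of the first
// entry whose Href matches href once fragments are removed from both.
// It returns "" when nothing matches.
func titleFor(toc []TOCItem, href string) string {
	want := stripFragment(href)
	for _, it := range toc {
		if stripFragment(it.Href) == want {
			return it.Title
		}
		if t := titleFor(it.Children, href); t != "" {
			return t
		}
	}
	return ""
}

func main() {
	toc := []TOCItem{
		{Title: "Part I", Href: "text/part1.xhtml", Children: []TOCItem{
			{Title: "Chapter 1", Href: "text/ch1.xhtml#start"},
			{Title: "Chapter 2", Href: "text/ch2.xhtml"},
		}},
	}
	fmt.Println(titleFor(toc, "text/ch1.xhtml"))
	fmt.Println(titleFor(toc, "text/ch2.xhtml#notes"))
	fmt.Printf("%q\n", titleFor(toc, "text/missing.xhtml"))
}
```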

func (*Book) Close

func (b *Book) Close() error

Close releases resources held by the Book. When the Book was created via Open, Close closes the underlying file. Close is idempotent.
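
Idempotent here means that a second Close is harmless, so a deferred Close can coexist with an explicit early one. One common way to get that property in Go is sync.Once, sketched below on a hypothetical resource type. This illustrates the general pattern and makes no claim about how Book implements it.

```go
package main

import (
	"fmt"
	"sync"
)

// resource is a hypothetical stand-in for a type that owns something closable.
type resource struct {
	once   sync.Once
	err    error
	closes int // counts real cleanups, for demonstration only
}

// Close runs the cleanup at most once and returns the same result every time.
func (r *resource) Close() error {
	r.once.Do(func() {
		r.closes++
		// Real cleanup (for example closing a file) would go here,
		// storing its error in r.err.
	})
	return r.err
}

func main() {
	r := &resource{}
	defer r.Close() // safe even though Close is also called below

	if err := r.Close(); err != nil {
		fmt.Println("close failed:", err)
	}
	_ = r.Close()
	fmt.Println("cleanups run:", r.closes)
}
```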

func (*Book) ContentChapters

func (b *Book) ContentChapters() []Chapter

ContentChapters returns the chapters in spine order, excluding any detected Project Gutenberg license pages (IsLicense == true). On the first call, it reads every chapter file to perform license detection; subsequent calls use the cached result. After this call, Chapters() also returns chapters with IsLicense correctly set.

func (*Book) Cover

func (b *Book) Cover() (CoverImage, error)

Cover detects and returns the cover image using multiple strategies. Strategies are tried in priority order:

  1. ePub 3 manifest item with properties="cover-image"
  2. ePub 2 <meta name="cover" content="ID"/> → manifest lookup
  3. <guide> reference type="cover" → parse XHTML for first <img>
  4. Manifest item whose ID or href contains "cover" with image/* media-type
  5. First spine item's XHTML → first <img>

Returns ErrNoCover if no strategy succeeds.

Example
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	cover, err := book.Cover()
	if err != nil {
		fmt.Println("no cover found")
		return
	}

	fmt.Printf("Cover: %s (%s, %d bytes)\n", cover.Path, cover.MediaType, len(cover.Data))
}
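
The priority order described for Cover amounts to trying a list of strategies and returning the first hit. A rough sketch of that control flow is shown below. The strategy type, firstMatch, and the paths are invented for illustration, and ErrNoCover is a local stand-in for the package's sentinel; the real strategies parse the manifest, guide, and spine.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoCover is a local stand-in for the documented sentinel.
var ErrNoCover = errors.New("epub: no cover image found")

// strategy (hypothetical type) reports a cover path and whether it found one.
type strategy func() (string, bool)

// firstMatch tries each strategy in order and returns the first hit.
// Later strategies are not called once one succeeds.
func firstMatch(strategies ...strategy) (string, error) {
	for _, s := range strategies {
		if p, ok := s(); ok {
			return p, nil
		}
	}
	return "", ErrNoCover
}

func main() {
	miss := func() (string, bool) { return "", false }
	manifestHit := func() (string, bool) { return "OEBPS/images/cover.jpg", true }
	spineHit := func() (string, bool) { return "OEBPS/images/first.png", true }

	p, err := firstMatch(miss, manifestHit, spineHit)
	fmt.Println(p, err)

	_, err = firstMatch(miss, miss)
	fmt.Println(errors.Is(err, ErrNoCover))
}
```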

func (*Book) HasTOC

func (b *Book) HasTOC() bool

HasTOC reports whether the ePub contains a table of contents.

func (*Book) Landmarks

func (b *Book) Landmarks() []TOCItem

Landmarks returns the landmarks from an ePub 3 nav document. Returns nil for ePub 2 files or when no landmarks are present.

func (*Book) Metadata

func (b *Book) Metadata() Metadata

Metadata returns the extracted metadata from the ePub.

Example
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	md := book.Metadata()

	if len(md.Titles) > 0 {
		fmt.Printf("Title:   %s\n", md.Titles[0])
	}
	fmt.Printf("Version: %s\n", md.Version)

	for _, a := range md.Authors {
		fmt.Printf("Author:  %s\n", a.Name)
	}
}

func (*Book) ReadFile

func (b *Book) ReadFile(name string) ([]byte, error)

ReadFile reads a file from the ePub archive by its ZIP-internal path. If no entry matches the name exactly, the lookup falls back to a case-insensitive match.
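
An exact-then-case-insensitive lookup can be sketched with archive/zip and strings.EqualFold. This is an assumption about the general shape of such a fallback, not the package's implementation. findFile, readFile, and sampleArchive are hypothetical helpers, and ErrFileNotFound is a local stand-in for the documented sentinel.

```go
package main

import (
	"archive/zip"
	"bytes"
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// ErrFileNotFound is a local stand-in for the documented sentinel.
var ErrFileNotFound = errors.New("epub: file not found in archive")

// findFile prefers an exact name match and only then compares case-insensitively.
func findFile(zr *zip.Reader, name string) (*zip.File, error) {
	for _, f := range zr.File {
		if f.Name == name {
			return f, nil
		}
	}
	for _, f := range zr.File {
		if strings.EqualFold(f.Name, name) {
			return f, nil
		}
	}
	return nil, ErrFileNotFound
}

// readFile returns the uncompressed bytes of the matched entry.
func readFile(zr *zip.Reader, name string) ([]byte, error) {
	f, err := findFile(zr, name)
	if err != nil {
		return nil, err
	}
	rc, err := f.Open()
	if err != nil {
		return nil, err
	}
	defer rc.Close()
	return io.ReadAll(rc)
}

// sampleArchive builds an in-memory ZIP with a single mixed-case entry.
func sampleArchive() (*zip.Reader, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	fw, err := w.Create("OEBPS/Content.opf")
	if err != nil {
		return nil, err
	}
	if _, err := fw.Write([]byte("<package/>")); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
}

func main() {
	zr, err := sampleArchive()
	if err != nil {
		log.Fatal(err)
	}
	data, err := readFile(zr, "oebps/content.opf") // differs only in case
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", data)
}
```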

func (*Book) TOC

func (b *Book) TOC() []TOCItem

TOC returns the table of contents as a tree of TOCItem. Each item's SpineIndex is set to the index of the corresponding spine item, or -1 if no match was found.

Example
package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	for _, item := range book.TOC() {
		fmt.Printf("%s → %s\n", item.Title, item.Href)
	}
}

func (*Book) Warnings

func (b *Book) Warnings() []string

Warnings returns the list of non-fatal warnings accumulated during parsing.

type Chapter

type Chapter struct {
	// Title is the chapter title derived from the TOC (empty if not in TOC).
	Title string

	// Href is the content file path within the ePub archive.
	Href string

	// ID is the manifest item ID for this chapter.
	ID string

	// Linear indicates whether this chapter is part of the linear reading order.
	Linear bool

	// IsLicense indicates whether this chapter is a Project Gutenberg license page.
	// Detection is based on known Gutenberg license patterns in the text content.
	IsLicense bool
	// contains filtered or unexported fields
}

Chapter represents a spine item with methods for content access. Content is loaded lazily from the underlying ePub archive.

func (Chapter) BodyHTML

func (c Chapter) BodyHTML() (string, error)

BodyHTML extracts the inner HTML of the <body> element from this chapter's XHTML. Image paths are rewritten to ZIP-root-relative paths. Script and style elements and event handler attributes are stripped.
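
Rewriting an image path to be ZIP-root-relative means resolving the src attribute against the directory of the chapter file. The sketch below shows that resolution with the standard path package. zipRootRelative is a hypothetical helper, and it deliberately ignores absolute URLs, data: URIs, and percent-encoding, which a real implementation would need to handle.

```go
package main

import (
	"fmt"
	"path"
)

// zipRootRelative resolves src, as written inside the chapter at chapterHref,
// to a slash-separated path relative to the archive root.
func zipRootRelative(chapterHref, src string) string {
	return path.Join(path.Dir(chapterHref), src)
}

func main() {
	fmt.Println(zipRootRelative("OEBPS/text/ch1.xhtml", "../images/fig1.png"))
	fmt.Println(zipRootRelative("OEBPS/text/ch1.xhtml", "fig2.png"))
	fmt.Println(zipRootRelative("ch1.xhtml", "img/fig3.png"))
}
```

The resulting path is the kind of name that could then be passed to ReadFile to fetch the image bytes.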

func (Chapter) RawContent

func (c Chapter) RawContent() ([]byte, error)

RawContent reads the raw XHTML bytes of this chapter from the ePub archive. Leading UTF-8 BOM is stripped if present.
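
For readers who handle archive bytes themselves, for example via ReadFile, stripping a UTF-8 byte order mark is a one-liner with bytes.TrimPrefix. stripBOM below is a hypothetical helper illustrating the idea.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes a single leading UTF-8 BOM, if present.
func stripBOM(b []byte) []byte {
	return bytes.TrimPrefix(b, utf8BOM)
}

func main() {
	withBOM := append([]byte{0xEF, 0xBB, 0xBF}, []byte("<html/>")...)
	fmt.Println(len(withBOM), len(stripBOM(withBOM)))
	fmt.Println(string(stripBOM([]byte("<html/>"))))
}
```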

func (Chapter) TextContent

func (c Chapter) TextContent() (string, error)

TextContent extracts the plain text content from this chapter's XHTML. Block-level elements produce line breaks; script and style content is skipped.
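
To make the behaviour concrete, here is a rough sketch of text extraction over well-formed XHTML using encoding/xml from the standard library. It is not the package's implementation: the set of block-level elements is a small assumed sample, whitespace is not normalised, and plainText is a hypothetical name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// blockElems is an assumed, incomplete set of block-level elements.
var blockElems = map[string]bool{
	"p": true, "div": true, "li": true, "br": true,
	"h1": true, "h2": true, "h3": true,
}

// plainText collects character data, emits a newline at the end of each
// block-level element, and skips everything inside script and style.
func plainText(xhtml string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(xhtml))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity // lets entities such as &nbsp; decode

	var sb strings.Builder
	skip := 0 // depth inside script/style elements
	for {
		tok, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "script" || t.Name.Local == "style" {
				skip++
			}
		case xml.EndElement:
			name := t.Name.Local
			if name == "script" || name == "style" {
				skip--
			} else if blockElems[name] {
				sb.WriteByte('\n')
			}
		case xml.CharData:
			if skip == 0 {
				sb.Write(t)
			}
		}
	}
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	doc := `<html><body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>`
	text, err := plainText(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}
```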

type CoverImage

type CoverImage struct {
	// Path is the ZIP-internal path to the cover image file.
	Path string

	// MediaType is the MIME type of the cover image (e.g., "image/jpeg").
	MediaType string

	// Data is the raw image bytes.
	Data []byte
}

CoverImage holds the detected cover image data.

type Identifier

type Identifier struct {
	// Value is the identifier text content (e.g., ISBN, UUID, URI).
	Value string

	// Scheme is the opf:scheme attribute value (e.g., "ISBN", "UUID").
	Scheme string

	// ID is the xml id attribute of this identifier element.
	ID string
}

Identifier represents a dc:identifier entry.
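
Since Metadata.Identifiers can hold several entries, callers often want the value for one particular scheme. A small sketch with a local copy of the struct follows. byScheme is a hypothetical helper, and comparing schemes case-insensitively is an assumption, since files in the wild are not consistent about casing.

```go
package main

import (
	"fmt"
	"strings"
)

// Identifier is a local copy of the documented struct.
type Identifier struct {
	Value  string
	Scheme string
	ID     string
}

// byScheme returns the value of the first identifier with the given scheme.
func byScheme(ids []Identifier, scheme string) (string, bool) {
	for _, id := range ids {
		if strings.EqualFold(id.Scheme, scheme) {
			return id.Value, true
		}
	}
	return "", false
}

func main() {
	ids := []Identifier{
		{Value: "urn:uuid:00000000-0000-0000-0000-000000000000", Scheme: "UUID", ID: "pub-id"},
		{Value: "9780000000000", Scheme: "isbn"},
	}
	if v, ok := byScheme(ids, "ISBN"); ok {
		fmt.Println("ISBN:", v)
	}
	_, ok := byScheme(ids, "DOI")
	fmt.Println("has DOI:", ok)
}
```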

type Metadata

type Metadata struct {
	// Version is the ePub specification version (e.g., "2.0", "3.0").
	Version string

	// Titles contains all dc:title values. The first entry is the primary title.
	Titles []string

	// Authors contains all dc:creator entries with their roles and file-as values.
	Authors []Author

	// Language contains all dc:language values (BCP 47 tags, e.g., "en", "zh-CN").
	Language []string

	// Identifiers contains all dc:identifier entries (ISBN, UUID, URI, etc.).
	Identifiers []Identifier

	// Publisher is the dc:publisher value.
	Publisher string

	// Date is the dc:date value (publication date as raw string).
	Date string

	// Description is the dc:description value.
	Description string

	// Subjects contains all dc:subject values.
	Subjects []string

	// Rights is the dc:rights value.
	Rights string

	// Source is the dc:source value.
	Source string
}

Metadata holds the Dublin Core and other metadata extracted from the OPF file.

type TOCItem

type TOCItem struct {
	// Title is the display text of the TOC entry.
	Title string

	// Href is the content file reference (may include a fragment, e.g., "chapter01.xhtml#section2").
	Href string

	// Children contains nested TOC entries under this item.
	Children []TOCItem

	// SpineIndex is the index into the spine that this TOC entry points to.
	// A value of -1 indicates no spine association was found.
	SpineIndex int

	// SpineEndIndex is the exclusive end index into the spine for this TOC entry.
	// The entry covers spine[SpineIndex:SpineEndIndex]. For example, if SpineIndex=0
	// and SpineEndIndex=3, the entry covers spine items 0, 1, and 2.
	// A value of -1 indicates no spine association was found.
	SpineEndIndex int
}

TOCItem represents a single entry in the table of contents. TOC is a tree structure; each item may have nested children.
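
Two things follow from this struct: the tree is naturally walked recursively, and the half-open range [SpineIndex, SpineEndIndex) maps directly onto a Go slice expression. The sketch below illustrates both with a local copy of TOCItem and a spine represented as a plain []string of hrefs, which is an assumption made for the example. outline and covered are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// TOCItem is a local copy of the documented struct.
type TOCItem struct {
	Title         string
	Href          string
	Children      []TOCItem
	SpineIndex    int
	SpineEndIndex int
}

// outline renders the tree as lines indented two spaces per nesting level.
func outline(items []TOCItem, depth int) []string {
	var lines []string
	for _, it := range items {
		lines = append(lines, strings.Repeat("  ", depth)+it.Title)
		lines = append(lines, outline(it.Children, depth+1)...)
	}
	return lines
}

// covered returns spine[SpineIndex:SpineEndIndex], or nil when the entry has
// no spine association (-1) or the range does not fit the spine.
func covered(spine []string, it TOCItem) []string {
	if it.SpineIndex < 0 || it.SpineEndIndex < it.SpineIndex || it.SpineEndIndex > len(spine) {
		return nil
	}
	return spine[it.SpineIndex:it.SpineEndIndex]
}

func main() {
	spine := []string{"cover.xhtml", "ch1.xhtml", "ch1b.xhtml", "ch2.xhtml"}
	toc := []TOCItem{
		{Title: "Part I", Href: "ch1.xhtml", SpineIndex: 1, SpineEndIndex: 4, Children: []TOCItem{
			{Title: "Chapter 1", Href: "ch1.xhtml", SpineIndex: 1, SpineEndIndex: 3},
			{Title: "Chapter 2", Href: "ch2.xhtml", SpineIndex: 3, SpineEndIndex: 4},
		}},
	}
	for _, line := range outline(toc, 0) {
		fmt.Println(line)
	}
	fmt.Println(covered(spine, toc[0].Children[0]))
}
```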
