epub

package module

Version: v0.0.0-...-e0b9638 (latest) | Published: Feb 18, 2026 | License: MIT | Imports: 15 | Imported by: 0

Repository

github.com/simp-lee/epub

README ¶

epub

A pure-Go library for reading and parsing ePub 2 and ePub 3 files.

Features

ePub 2 and ePub 3 support
Dublin Core metadata extraction (titles, authors, identifiers, language, etc.)
Table of contents parsing (NCX for ePub 2, Nav document for ePub 3)
Landmarks extraction (ePub 3)
Spine-ordered chapter access with lazy content loading
Plain text, raw XHTML, and sanitised body HTML output
Cover image detection via multiple strategies
Project Gutenberg license page detection
DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP)
Font obfuscation awareness
ZIP bomb protection

Installation

go get github.com/simp-lee/epub

Quick Start

package main

import (
    "fmt"
    "log"

    "github.com/simp-lee/epub"
)

func main() {
    book, err := epub.Open("book.epub")
    if err != nil {
        log.Fatal(err)
    }
    defer book.Close()

    // Metadata
    md := book.Metadata()
    if len(md.Titles) > 0 {
        fmt.Println("Title:", md.Titles[0])
    }
    for _, a := range md.Authors {
        fmt.Println("Author:", a.Name)
    }

    // Table of Contents
    for _, item := range book.TOC() {
        fmt.Printf("  %s → %s\n", item.Title, item.Href)
    }

    // Chapters
    for _, ch := range book.Chapters() {
        text, err := ch.TextContent()
        if err != nil {
            continue
        }
        fmt.Printf("  [%s] %d chars\n", ch.Title, len(text))
    }

    // Cover image
    cover, err := book.Cover()
    if err == nil {
        fmt.Printf("Cover: %s (%d bytes)\n", cover.MediaType, len(cover.Data))
    }
}

API Overview

Opening

Function	Description
`Open(path)`	Open an ePub file by path
`NewReader(r, size)`	Open from an `io.ReaderAt`

Book Methods

Method	Description
`Close()`	Release resources
`Metadata()`	Dublin Core metadata
`TOC()`	Table of contents tree
`Landmarks()`	ePub 3 landmarks
`Chapters()`	Spine-ordered chapters
`ContentChapters()`	Chapters excluding license pages
`Cover()`	Detect and return cover image
`ReadFile(name)`	Read any file from the archive
`HasTOC()`	Whether a TOC is present
`Warnings()`	Non-fatal parsing warnings

Chapter Methods

Method	Description
`RawContent()`	Raw XHTML bytes
`TextContent()`	Extracted plain text
`BodyHTML()`	Sanitised `<body>` inner HTML

Error Handling

The package provides sentinel errors for common failure cases:

errors.Is(err, epub.ErrDRMProtected)   // DRM-encrypted file
errors.Is(err, epub.ErrInvalidEPub)    // Invalid ePub structure
errors.Is(err, epub.ErrInvalidChapter) // Invalid chapter handle (zero-value)
errors.Is(err, epub.ErrNoCover)        // No cover image found
errors.Is(err, epub.ErrFileNotFound)   // File not in archive

When a book has no NCX/nav table of contents, TOC() returns an empty slice.

License

MIT

Documentation ¶

Overview ¶

Package epub provides a pure-Go library for reading and parsing ePub 2 and ePub 3 files.

It extracts metadata (Dublin Core), table of contents (NCX and Nav), spine-ordered chapters with lazy content loading, cover images, and landmarks. DRM-protected files are detected and rejected with ErrDRMProtected.

Opening an ePub ¶

Use Open to open a file by path, or NewReader to read from an io.ReaderAt:

book, err := epub.Open("book.epub")
if err != nil {
    log.Fatal(err)
}
defer book.Close()

Metadata ¶

The Book.Metadata method returns a Metadata struct containing titles, authors, language, identifiers (ISBN/UUID), publisher, date, description, subjects, and more:

md := book.Metadata()
if len(md.Titles) > 0 { // Titles may be empty; indexing [0] unguarded would panic
    fmt.Println(md.Titles[0])
}

Table of Contents ¶

The Book.TOC method returns a tree of TOCItem entries. Each item includes a spine index range indicating which spine items it covers:

for _, item := range book.TOC() {
    fmt.Println(item.Title, item.Href)
}
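
Because the TOC is a tree, a flat range loop like the one above only visits the top-level entries. The following is a minimal sketch of a depth-first walk that also reaches nested entries. To keep it runnable without an ePub file, TOCItem here is a trimmed local copy of the documented struct, and flatten is a hypothetical helper name, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// TOCItem is a trimmed local stand-in for epub.TOCItem (assumption:
// only Title, Href, and Children are needed for this sketch).
type TOCItem struct {
	Title    string
	Href     string
	Children []TOCItem
}

// flatten walks the tree depth-first and returns one line per entry,
// indented by two spaces per nesting level.
func flatten(items []TOCItem, depth int) []string {
	var lines []string
	for _, it := range items {
		lines = append(lines, strings.Repeat("  ", depth)+it.Title+" -> "+it.Href)
		lines = append(lines, flatten(it.Children, depth+1)...)
	}
	return lines
}

func main() {
	toc := []TOCItem{
		{Title: "Part I", Href: "part1.xhtml", Children: []TOCItem{
			{Title: "Chapter 1", Href: "ch1.xhtml"},
			{Title: "Chapter 2", Href: "ch2.xhtml#start"},
		}},
		{Title: "Appendix", Href: "appendix.xhtml"},
	}
	for _, line := range flatten(toc, 0) {
		fmt.Println(line)
	}
}
```

With the real package, the same recursion should apply to the slice returned by book.TOC(), since each TOCItem carries its own Children.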

Chapters ¶

Chapters are returned in spine order via Book.Chapters. Content is loaded lazily; call Chapter.RawContent for raw XHTML, Chapter.TextContent for plain text, or Chapter.BodyHTML for sanitised inner HTML with rewritten image paths:

for _, ch := range book.Chapters() {
    text, err := ch.TextContent()
    if err != nil {
        continue // skip chapters whose content cannot be read
    }
    fmt.Println(ch.Title, len(text))
}

Use Book.ContentChapters to exclude Project Gutenberg license pages.

Cover Image ¶

Book.Cover attempts multiple strategies (ePub 3 properties, ePub 2 meta, guide reference, manifest heuristic, first spine item) to locate the cover:

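cover, err := book.Cover()
if err == nil {
    // The .jpg name assumes a JPEG cover; check cover.MediaType if that matters.
    if err := os.WriteFile("cover.jpg", cover.Data, 0644); err != nil {
        log.Fatal(err)
    }
}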
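
A cover is not always a JPEG, so the output file name is better derived from CoverImage.MediaType. The sketch below shows one way to do that. coverFilename and its MIME-to-extension table are illustrative assumptions, not part of the epub package.

```go
package main

import "fmt"

// coverFilename picks an output file name from a cover's MIME type.
// The mapping is an illustrative assumption, not part of the epub package.
func coverFilename(mediaType string) string {
	switch mediaType {
	case "image/jpeg":
		return "cover.jpg"
	case "image/png":
		return "cover.png"
	case "image/gif":
		return "cover.gif"
	case "image/svg+xml":
		return "cover.svg"
	case "image/webp":
		return "cover.webp"
	default:
		// Unknown type: keep the bytes but avoid a misleading extension.
		return "cover.bin"
	}
}

func main() {
	for _, mt := range []string{"image/jpeg", "image/png", "application/octet-stream"} {
		fmt.Printf("%s -> %s\n", mt, coverFilename(mt))
	}
}
```

In real code the result would replace the hard-coded name, for example os.WriteFile(coverFilename(cover.MediaType), cover.Data, 0644).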

Error Handling ¶

The package defines sentinel errors for common failure cases:

ErrDRMProtected – the file is DRM encrypted
ErrInvalidEPub – structural validation failed
ErrInvalidChapter – a Chapter handle is invalid
ErrFileNotFound – a requested file is not in the archive
ErrNoCover – no cover image could be detected

If no table of contents is present, Book.TOC returns an empty slice and Book.HasTOC returns false.
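
Callers will often want to treat these errors differently: a DRM-protected file is a hard stop, while a missing cover is usually harmless. The sketch below shows that pattern with errors.Is. It declares local stand-ins for three of the sentinels so that it runs on its own; classify and wrap are hypothetical helpers. Whether the library wraps its sentinels with extra context is not stated here, but errors.Is matches in either case.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the package's sentinel errors so the sketch is
// self-contained. Real code would use epub.ErrDRMProtected and so on.
var (
	ErrDRMProtected = errors.New("epub: file is DRM protected")
	ErrInvalidEPub  = errors.New("epub: invalid ePub file")
	ErrNoCover      = errors.New("epub: no cover image found")
)

// classify maps an error to a short user-facing message. errors.Is
// also matches when the sentinel has been wrapped with %w.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrDRMProtected):
		return "cannot read: DRM protected"
	case errors.Is(err, ErrInvalidEPub):
		return "not a valid ePub"
	case errors.Is(err, ErrNoCover):
		return "no cover (not fatal)"
	default:
		return "unexpected: " + err.Error()
	}
}

// wrap simulates a caller adding context to an error.
func wrap(err error) error { return fmt.Errorf("open book.epub: %w", err) }

func main() {
	fmt.Println(classify(wrap(ErrDRMProtected)))
	fmt.Println(classify(ErrNoCover))
	fmt.Println(classify(nil))
}
```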

Index ¶

Variables
type Author
type Book
- func NewReader(r io.ReaderAt, size int64) (*Book, error)
- func Open(path string) (*Book, error)
type Chapter
type CoverImage
type Identifier
type Metadata
type TOCItem

Constants ¶

This section is empty.

Variables ¶

var (
	// ErrDRMProtected indicates the ePub file is protected by DRM
	// (e.g., Adobe ADEPT, Apple FairPlay, Readium LCP) and cannot be read.
	ErrDRMProtected = errors.New("epub: file is DRM protected")

	// ErrInvalidEPub indicates the file is not a valid ePub
	// (e.g., missing container.xml and no .opf file found).
	ErrInvalidEPub = errors.New("epub: invalid ePub file")

	// ErrInvalidChapter indicates a Chapter handle is invalid
	// (for example, a zero-value Chapter without an associated Book).
	ErrInvalidChapter = errors.New("epub: invalid chapter handle")

	// ErrFileNotFound indicates the requested file does not exist
	// in the ePub archive.
	ErrFileNotFound = errors.New("epub: file not found in archive")

	// ErrNoCover indicates no cover image could be detected
	// using any of the supported strategies.
	ErrNoCover = errors.New("epub: no cover image found")
)

Sentinel errors returned by the epub package.

Functions ¶

This section is empty.

Types ¶

type Author ¶

type Author struct {
	// Name is the display name of the author (dc:creator text content).
	Name string

	// FileAs is the opf:file-as attribute value (e.g., "Dickens, Charles").
	FileAs string

	// Role is the opf:role attribute value (e.g., "aut", "edt", "trl").
	Role string
}

Author represents a dc:creator entry with optional file-as and role attributes.
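
FileAs exists so that names can be sorted in catalogue order ("Dickens, Charles") while Name is shown to readers. A minimal sketch of that use, assuming a fallback to Name when FileAs is empty. Author is a local copy of the struct above, and sortKey and sortAuthors are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// Author is a local copy of the documented struct.
type Author struct {
	Name   string
	FileAs string
	Role   string
}

// sortKey prefers FileAs and falls back to Name when FileAs is empty
// (the fallback is an assumption of this sketch).
func sortKey(a Author) string {
	if a.FileAs != "" {
		return a.FileAs
	}
	return a.Name
}

// sortAuthors orders authors in place by their sort key.
func sortAuthors(authors []Author) {
	sort.SliceStable(authors, func(i, j int) bool {
		return sortKey(authors[i]) < sortKey(authors[j])
	})
}

func main() {
	authors := []Author{
		{Name: "Charles Dickens", FileAs: "Dickens, Charles", Role: "aut"},
		{Name: "Jane Austen", FileAs: "Austen, Jane", Role: "aut"},
		{Name: "Anonymous"},
	}
	sortAuthors(authors)
	for _, a := range authors {
		fmt.Println(a.Name)
	}
}
```

The comparison is a plain byte-wise string comparison; locale-aware collation would need a different comparator.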

type Book ¶

type Book struct {
	// contains filtered or unexported fields
}

Book is the main public API type for reading ePub files. Use Open or NewReader to create a Book instance.

A Book is not safe for concurrent use by multiple goroutines.

func NewReader ¶

func NewReader(r io.ReaderAt, size int64) (*Book, error)

NewReader creates a Book from an io.ReaderAt with the given size. The caller is responsible for the lifetime of r; Close only cleans up internal state.

Example ¶

package main

import (
	"log"
	"os"

	"github.com/simp-lee/epub"
)

func main() {
	// NewReader works with any io.ReaderAt, such as an *os.File or *bytes.Reader.
	f, err := os.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close() // the caller owns f; Book.Close does not close it
	info, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}
	book, err := epub.NewReader(f, info.Size())
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()
}

func Open ¶

func Open(path string) (*Book, error)

Open opens an ePub file at the given path. The caller must call Close when done reading from the book.

Example ¶

package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	md := book.Metadata()
	if len(md.Titles) > 0 { // guard against books with no dc:title
		fmt.Println(md.Titles[0])
	}
}

func (*Book) Chapters ¶

func (b *Book) Chapters() []Chapter

Chapters returns the chapters in spine order. Each Chapter is a lightweight handle; content is loaded lazily when RawContent, TextContent, or BodyHTML is called. Title is derived from the TOC by matching Href (ignoring fragment). The result is cached after the first call.

Note: IsLicense is not populated by Chapters(). Call ContentChapters() to trigger Gutenberg license detection; after that call, the cached chapters returned by Chapters() will also have IsLicense set.
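
The title-matching rule described above (compare hrefs with any "#fragment" removed) can be sketched as follows. This is an illustration of the rule, not the package's implementation: entry, stripFragment, and titleFor are hypothetical names, and taking the first matching TOC entry is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a minimal stand-in for a flattened TOC entry.
type entry struct {
	Title string
	Href  string
}

// stripFragment removes a trailing "#fragment" from an href, if present.
func stripFragment(href string) string {
	if i := strings.IndexByte(href, '#'); i >= 0 {
		return href[:i]
	}
	return href
}

// titleFor returns the title of the first entry whose href, ignoring
// any fragment, equals chapterHref. It returns "" when nothing matches,
// mirroring the documented "empty if not in TOC" behaviour.
func titleFor(entries []entry, chapterHref string) string {
	for _, e := range entries {
		if stripFragment(e.Href) == chapterHref {
			return e.Title
		}
	}
	return ""
}

func main() {
	toc := []entry{
		{Title: "Chapter 1", Href: "text/ch1.xhtml#start"},
		{Title: "Chapter 2", Href: "text/ch2.xhtml"},
	}
	fmt.Printf("%q\n", titleFor(toc, "text/ch1.xhtml"))
	fmt.Printf("%q\n", titleFor(toc, "text/colophon.xhtml"))
}
```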

Example ¶

package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	for _, ch := range book.Chapters() {
		text, err := ch.TextContent()
		if err != nil {
			continue
		}
		fmt.Printf("%-20s %d chars\n", ch.Title, len(text))
	}
}

func (*Book) Close ¶

func (b *Book) Close() error

Close releases resources held by the Book. When the Book was created via Open, Close closes the underlying file. Close is idempotent.

func (*Book) ContentChapters ¶

func (b *Book) ContentChapters() []Chapter

ContentChapters returns the chapters in spine order, excluding any detected Project Gutenberg license pages (IsLicense == true). On the first call, it reads every chapter file to perform license detection; subsequent calls use the cached result. After this call, Chapters() also returns chapters with IsLicense correctly set.

func (*Book) Cover ¶

func (b *Book) Cover() (CoverImage, error)

Cover detects and returns the cover image using multiple strategies. Strategies are tried in priority order:

1. ePub 3 manifest item with properties="cover-image"
2. ePub 2 <meta name="cover" content="ID"/> → manifest lookup
3. <guide> reference type="cover" → parse XHTML for first <img>
4. Manifest item whose ID or href contains "cover" with image/* media-type
5. First spine item's XHTML → first <img>

Returns ErrNoCover if no strategy succeeds.
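
The "try each strategy in priority order, first hit wins" behaviour described above follows a common fallback-chain pattern. The sketch below shows that pattern in isolation; it is not the package's implementation. strategy, firstMatch, and errNoCover are hypothetical names, and the strategies are stubs that return fixed values.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoCover is a local stand-in for epub.ErrNoCover.
var errNoCover = errors.New("no cover image found")

// strategy reports a candidate cover path and whether it found one.
type strategy func() (path string, ok bool)

// firstMatch tries each strategy in order and returns the first hit.
// Later strategies are not evaluated once one succeeds.
func firstMatch(strategies ...strategy) (string, error) {
	for _, s := range strategies {
		if path, ok := s(); ok {
			return path, nil
		}
	}
	return "", errNoCover
}

func main() {
	miss := func() (string, bool) { return "", false }
	guide := func() (string, bool) { return "images/cover.jpg", true }
	spine := func() (string, bool) { return "images/first.png", true }

	// The third strategy succeeds, so the fifth is never consulted.
	path, err := firstMatch(miss, miss, guide, miss, spine)
	fmt.Println(path, err)

	_, err = firstMatch(miss, miss)
	fmt.Println(errors.Is(err, errNoCover))
}
```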

Example ¶

package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	cover, err := book.Cover()
	if err != nil {
		fmt.Println("no cover found")
		return
	}

	fmt.Printf("Cover: %s (%s, %d bytes)\n", cover.Path, cover.MediaType, len(cover.Data))
}

func (*Book) HasTOC ¶

func (b *Book) HasTOC() bool

HasTOC reports whether the ePub contains a table of contents.

func (*Book) Landmarks ¶

func (b *Book) Landmarks() []TOCItem

Landmarks returns the landmarks from an ePub 3 nav document. Returns nil for ePub 2 files or when no landmarks are present.

func (*Book) Metadata ¶

func (b *Book) Metadata() Metadata

Metadata returns the extracted metadata from the ePub.

Example ¶

package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	md := book.Metadata()

	if len(md.Titles) > 0 { // guard against books with no dc:title
		fmt.Printf("Title:   %s\n", md.Titles[0])
	}
	fmt.Printf("Version: %s\n", md.Version)

	for _, a := range md.Authors {
		fmt.Printf("Author:  %s\n", a.Name)
	}
}

func (*Book) ReadFile ¶

func (b *Book) ReadFile(name string) ([]byte, error)

ReadFile reads a file from the ePub archive by its ZIP-internal path. If no exact match is found, the lookup falls back to a case-insensitive match.
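
One plausible reading of the fallback is "exact match first, then a case-insensitive match". The sketch below illustrates that two-pass lookup over a list of archive names. It is an assumption about the behaviour, not the package's code; lookup is a hypothetical helper, and choosing the first case-insensitive hit is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// lookup returns the archive name matching want. An exact match wins;
// otherwise the first name equal under Unicode case folding is used.
func lookup(names []string, want string) (string, bool) {
	for _, n := range names {
		if n == want {
			return n, true
		}
	}
	for _, n := range names {
		if strings.EqualFold(n, want) {
			return n, true
		}
	}
	return "", false
}

func main() {
	names := []string{"META-INF/container.xml", "OEBPS/Content.opf", "OEBPS/text/ch1.xhtml"}

	name, ok := lookup(names, "oebps/content.opf")
	fmt.Println(name, ok)

	name, ok = lookup(names, "OEBPS/missing.xhtml")
	fmt.Printf("%q %v\n", name, ok)
}
```

A fallback like this helps with archives whose manifest paths differ from the stored ZIP entry names only by letter case.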

func (*Book) TOC ¶

func (b *Book) TOC() []TOCItem

TOC returns the table of contents as a tree of TOCItem. Each item's SpineIndex is set to the index of the corresponding spine item, or -1 if no match was found.

Example ¶

package main

import (
	"fmt"
	"log"

	"github.com/simp-lee/epub"
)

func main() {
	book, err := epub.Open("testdata/book.epub")
	if err != nil {
		log.Fatal(err)
	}
	defer book.Close()

	for _, item := range book.TOC() {
		fmt.Printf("%s → %s\n", item.Title, item.Href)
	}
}

func (*Book) Warnings ¶

func (b *Book) Warnings() []string

Warnings returns the list of non-fatal warnings accumulated during parsing.

type Chapter ¶

type Chapter struct {
	// Title is the chapter title derived from the TOC (empty if not in TOC).
	Title string

	// Href is the content file path within the ePub archive.
	Href string

	// ID is the manifest item ID for this chapter.
	ID string

	// Linear indicates whether this chapter is part of the linear reading order.
	Linear bool

	// IsLicense indicates whether this chapter is a Project Gutenberg license page.
	// Detection is based on known Gutenberg license patterns in the text content.
	IsLicense bool
	// contains filtered or unexported fields
}

Chapter represents a spine item with methods for content access. Content is loaded lazily from the underlying ePub archive.

func (Chapter) BodyHTML ¶

func (c Chapter) BodyHTML() (string, error)

BodyHTML extracts the inner HTML of the <body> element from this chapter's XHTML. Image paths are rewritten to ZIP-root-relative paths. Script and style elements and event handler attributes are stripped.

func (Chapter) RawContent ¶

func (c Chapter) RawContent() ([]byte, error)

RawContent reads the raw XHTML bytes of this chapter from the ePub archive. Leading UTF-8 BOM is stripped if present.
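
For reference, the UTF-8 byte order mark is the three-byte sequence EF BB BF. A minimal sketch of stripping it with the standard library, shown only to make the behaviour concrete; stripBOM is a hypothetical helper and not necessarily how the package does it.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM returns data without a leading UTF-8 BOM. Input that does
// not start with a BOM is returned unchanged.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	raw := append([]byte{0xEF, 0xBB, 0xBF}, "<html/>"...)
	clean := stripBOM(raw)
	fmt.Printf("%d bytes before, %d bytes after\n", len(raw), len(clean))
	fmt.Println(string(clean))
}
```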

func (Chapter) TextContent ¶

func (c Chapter) TextContent() (string, error)

TextContent extracts the plain text content from this chapter's XHTML. Block-level elements produce line breaks; script and style content is skipped.

type CoverImage ¶

type CoverImage struct {
	// Path is the ZIP-internal path to the cover image file.
	Path string

	// MediaType is the MIME type of the cover image (e.g., "image/jpeg").
	MediaType string

	// Data is the raw image bytes.
	Data []byte
}

CoverImage holds the detected cover image data.

type Identifier ¶

type Identifier struct {
	// Value is the identifier text content (e.g., ISBN, UUID, URI).
	Value string

	// Scheme is the opf:scheme attribute value (e.g., "ISBN", "UUID").
	Scheme string

	// ID is the xml id attribute of this identifier element.
	ID string
}

Identifier represents a dc:identifier entry.
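
A book can carry several identifiers, so callers usually pick one by scheme. A minimal sketch, assuming a case-insensitive comparison on Scheme is acceptable. Identifier is a local copy of the struct above and findByScheme is a hypothetical helper. Entries without an opf:scheme attribute have an empty Scheme and are skipped unless an empty scheme is requested.

```go
package main

import (
	"fmt"
	"strings"
)

// Identifier is a local copy of the documented struct.
type Identifier struct {
	Value  string
	Scheme string
	ID     string
}

// findByScheme returns the value of the first identifier whose Scheme
// matches scheme, ignoring case.
func findByScheme(ids []Identifier, scheme string) (string, bool) {
	for _, id := range ids {
		if strings.EqualFold(id.Scheme, scheme) {
			return id.Value, true
		}
	}
	return "", false
}

func main() {
	ids := []Identifier{
		{Value: "urn:uuid:123e4567-e89b-12d3-a456-426614174000", Scheme: "UUID", ID: "pub-id"},
		{Value: "9780000000002", Scheme: "isbn"},
	}
	if isbn, ok := findByScheme(ids, "ISBN"); ok {
		fmt.Println("ISBN:", isbn)
	}
	if _, ok := findByScheme(ids, "DOI"); !ok {
		fmt.Println("no DOI")
	}
}
```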

type Metadata ¶

type Metadata struct {
	// Version is the ePub specification version (e.g., "2.0", "3.0").
	Version string

	// Titles contains all dc:title values. The first entry is the primary title.
	Titles []string

	// Authors contains all dc:creator entries with their roles and file-as values.
	Authors []Author

	// Language contains all dc:language values (BCP 47 tags, e.g., "en", "zh-CN").
	Language []string

	// Identifiers contains all dc:identifier entries (ISBN, UUID, URI, etc.).
	Identifiers []Identifier

	// Publisher is the dc:publisher value.
	Publisher string

	// Date is the dc:date value (publication date as raw string).
	Date string

	// Description is the dc:description value.
	Description string

	// Subjects contains all dc:subject values.
	Subjects []string

	// Rights is the dc:rights value.
	Rights string

	// Source is the dc:source value.
	Source string
}

Metadata holds the Dublin Core and other metadata extracted from the OPF file.

type TOCItem ¶

type TOCItem struct {
	// Title is the display text of the TOC entry.
	Title string

	// Href is the content file reference (may include a fragment, e.g., "chapter01.xhtml#section2").
	Href string

	// Children contains nested TOC entries under this item.
	Children []TOCItem

	// SpineIndex is the index into the spine that this TOC entry points to.
	// A value of -1 indicates no spine association was found.
	SpineIndex int

	// SpineEndIndex is the exclusive end index into the spine for this TOC entry.
	// The entry covers spine[SpineIndex:SpineEndIndex]. For example, if SpineIndex=0
	// and SpineEndIndex=3, the entry covers spine items 0, 1, and 2.
	// A value of -1 indicates no spine association was found.
	SpineEndIndex int
}

TOCItem represents a single entry in the table of contents. TOC is a tree structure; each item may have nested children.
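
SpineIndex and SpineEndIndex form a half-open range, which maps directly onto a Go slice expression. The sketch below shows that mapping, including the -1 "no association" case. TOCItem is a trimmed local copy, the spine is modelled as a slice of hrefs, and spineSlice is a hypothetical helper. Applying the same indices to the slice returned by Book.Chapters assumes that slice has one entry per spine item.

```go
package main

import "fmt"

// TOCItem is a trimmed local stand-in for epub.TOCItem.
type TOCItem struct {
	Title         string
	SpineIndex    int
	SpineEndIndex int
}

// spineSlice returns spine[SpineIndex:SpineEndIndex] for item. It
// returns nil when the item has no spine association (-1) or when the
// range does not fit inside the spine.
func spineSlice(spine []string, item TOCItem) []string {
	if item.SpineIndex < 0 ||
		item.SpineEndIndex < item.SpineIndex ||
		item.SpineEndIndex > len(spine) {
		return nil
	}
	return spine[item.SpineIndex:item.SpineEndIndex]
}

func main() {
	spine := []string{"cover.xhtml", "ch1a.xhtml", "ch1b.xhtml", "ch2.xhtml"}

	ch1 := TOCItem{Title: "Chapter 1", SpineIndex: 1, SpineEndIndex: 3}
	fmt.Println(ch1.Title, spineSlice(spine, ch1))

	orphan := TOCItem{Title: "External link", SpineIndex: -1, SpineEndIndex: -1}
	fmt.Println(orphan.Title, len(spineSlice(spine, orphan)))
}
```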
