README

go-zim

Package zim implements reading support for the ZIM File Format.

Documentation at https://godoc.org/github.com/tim-st/go-zim.

Download and install package zim and its tools with

go get -u github.com/tim-st/go-zim/...

or download a prebuilt binary from the releases section.

You can download a ZIM file for testing here.

Commands

The command above installs the tools of this package to $GOPATH/bin/.

zimserver

Tool for browsing a ZIM file in your web browser via an HTTP interface.

  • Starting a ZIM server at TCP port 8080: zimserver -filename="filename.zim" -port=8080
  • Browsing the ZIM file in a web browser is now possible at http://localhost:8080/
  • The last part of the URL can be used as a basic prefix search by passing the search term after the last / in the URL

zimindex

Tool for creating a full text index of a given ZIM file.

zimsearch

Tool that lists search results for a given ZIM file and text query. If no index file created by zimindex is found, a builtin prefix search is used. Otherwise the index file is used to retrieve search results sorted by score, where the result set is computed by either a union or an intersection operation.
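The difference between the union and intersection modes can be illustrated with a minimal sketch. The hit type, the combine function, and the summing of per-term scores are assumptions made for illustration; the README does not describe how zimsearch computes or stores scores.

```go
package main

import "sort"

// hit is a hypothetical search result: a document ID with a relevance score.
type hit struct {
	ID    uint32
	Score float64
}

// combine merges per-term score maps (document ID -> score).
// With intersect=false (union) a document may match any term;
// with intersect=true it must match every term.
// Scores are summed and the hits are sorted by score, highest first.
func combine(perTerm []map[uint32]float64, intersect bool) []hit {
	total := make(map[uint32]float64)
	seen := make(map[uint32]int)
	for _, m := range perTerm {
		for id, s := range m {
			total[id] += s
			seen[id]++
		}
	}
	hits := make([]hit, 0, len(total))
	for id, s := range total {
		if intersect && seen[id] != len(perTerm) {
			continue // missing in at least one term's result set
		}
		hits = append(hits, hit{ID: id, Score: s})
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Score != hits[j].Score {
			return hits[i].Score > hits[j].Score
		}
		return hits[i].ID < hits[j].ID // deterministic tie-break
	})
	return hits
}
```

Union tends to return more, looser matches, while intersection returns only documents that contain every query term.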

zimtext

Tool to extract clean texts from a Wikipedia ZIM file. Each clean HTML paragraph is written on a single line in a text file.

  • Extracting first 1000 clean texts from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt" -limit=1000
  • Extracting all clean texts from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt"
  • Extracting first 1000 clean sentences (lines that are likely a sentence) from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt" -limit=1000 -sentences
  • Extracting all clean sentences (lines that are likely a sentence) from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt" -sentences
  • If you want to support your language or use case better, it's recommended to define your own regular expression to extract only the texts you accept. The RE syntax is defined here and can be tested here (select Flavor=Golang).

Example:

zimtext -zim="wikipedia_de_top_nopic_2019-08.zim" -txt="de.txt" -limit=10000 -regexFilter="^(?:\p{Lu}|\p{N})[ \pL\pN\,\;\:\-]{10,}[\.\)\]\?\"…«»›‹‘“’”]{1}$"
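To try a filter before running it over a whole ZIM file, the pattern can be checked against a few sample lines with Go's regexp package. This is a minimal sketch: keepAccepted and sentenceFilter are hypothetical names, and how zimtext applies the filter internally is not shown here.

```go
package main

import "regexp"

// sentenceFilter is the pattern from the example command. Inside a Go raw
// string the \" escape is kept as written; RE2 reads it as a plain quote.
var sentenceFilter = regexp.MustCompile(`^(?:\p{Lu}|\p{N})[ \pL\pN\,\;\:\-]{10,}[\.\)\]\?\"…«»›‹‘“’”]{1}$`)

// keepAccepted returns only the lines that the filter matches.
// The pattern is anchored with ^ and $, so a match covers the whole line.
func keepAccepted(re *regexp.Regexp, lines []string) []string {
	var out []string
	for _, line := range lines {
		if re.MatchString(line) {
			out = append(out, line)
		}
	}
	return out
}
```

With this pattern a line must start with an uppercase letter or a digit, continue with at least ten letters, digits, spaces, or basic punctuation, and end with exactly one closing character such as a period or a quotation mark.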


Documentation

Overview

    Package zim implements reading support for the ZIM File Format.

    Index

    Constants

    const (
    	MagicNumber  = uint32(72173914)
    	NoMainPage   = ^uint32(0)
    	NoLayoutPage = NoMainPage
    )

      Some useful constants belonging to a ZIM file.

      const (
      	MimetypeDeletedEntry  = Mimetype(0xFFFD)
      	MimetypeLinkTarget    = Mimetype(0xFFFE)
      	MimetypeRedirectEntry = Mimetype(0xFFFF)
      )

        Possible fixed Mimetype values for Directory Entry.

        const (
        	NamespaceLayout                           = Namespace('-') // layout, eg. the LayoutPage, CSS, favicon.png (48x48), JavaScript and images not related to the articles
        	NamespaceArticles                         = Namespace('A')
        	NamespaceArticleMetadata                  = Namespace('B')
        	NamespaceImagesFiles                      = Namespace('I')
        	NamespaceImagesText                       = Namespace('J')
        	NamespaceZimMetadata                      = Namespace('M')
        	NamespaceCategoriesText                   = Namespace('U')
        	NamespaceCategoriesArticleList            = Namespace('V')
        	NamespaceCategoriesPerArticleCategoryList = Namespace('W')
        	NamespaceFulltextIndex                    = Namespace('X') // Xapian fulltext index
        )

          Possible values for a Namespace.

          Variables

          This section is empty.

          Functions

          This section is empty.

          Types

          type Cluster

          type Cluster struct {
          	// contains filtered or unexported fields
          }

            Cluster stores the uncompressed cluster data (blob positions followed by a sequence of blobs). Each blob belongs to a Directory Entry.

            func (*Cluster) BlobAt

            func (c *Cluster) BlobAt(blobPosition uint32) ([]byte, error)

              BlobAt returns the blob data at the given blob position of a Cluster. This is only useful when iterating over all blobs in a Cluster. When only a single blob of a Cluster should be retrieved, it's better to use z.BlobReaderAt(clusterPosition, blobPosition) instead. Blob positions start at 0; iteration ends when an error is returned.
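The iteration pattern described for BlobAt can be sketched as a loop that counts up from 0 until an error is returned. The blobSource interface, forEachBlob, and fakeCluster are hypothetical helpers for illustration; only the BlobAt signature is taken from the documentation.

```go
package main

import "errors"

// blobSource mirrors the documented BlobAt signature so the loop can be
// shown without a real ZIM file.
type blobSource interface {
	BlobAt(blobPosition uint32) ([]byte, error)
}

// forEachBlob calls fn for blob 0, 1, 2, ... and stops at the first error.
// It returns the number of blobs visited.
func forEachBlob(c blobSource, fn func(pos uint32, blob []byte)) uint32 {
	var pos uint32
	for {
		blob, err := c.BlobAt(pos)
		if err != nil {
			return pos
		}
		fn(pos, blob)
		pos++
	}
}

// fakeCluster is an in-memory stand-in used to exercise the loop.
type fakeCluster [][]byte

// BlobAt returns the blob at position p or an error when p is out of range.
func (f fakeCluster) BlobAt(p uint32) ([]byte, error) {
	if int(p) >= len(f) {
		return nil, errors.New("blob position out of range")
	}
	return f[p], nil
}
```

Because any error ends the loop, this sketch cannot tell the end of a cluster apart from a read failure; a caller that needs that distinction would have to inspect the error.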

              func (*Cluster) WasCompressed

              func (c *Cluster) WasCompressed() bool

                WasCompressed reports whether the cluster data was compressed. This information can be used as an indicator about the cluster contents.

                type DirectoryEntry

                type DirectoryEntry struct {
                	// contains filtered or unexported fields
                }

                  DirectoryEntry holds the information about a specific article, image or other object in a ZIM file.

                  func (*DirectoryEntry) BlobNumber

                  func (e *DirectoryEntry) BlobNumber() uint32

                    BlobNumber is the blob number inside the uncompressed cluster, where the contents are stored.

                    func (*DirectoryEntry) ClusterNumber

                    func (e *DirectoryEntry) ClusterNumber() uint32

                      ClusterNumber in which the data of this Directory Entry is stored.

                      func (*DirectoryEntry) IsArticle

                      func (e *DirectoryEntry) IsArticle() bool

                        IsArticle checks whether the Directory Entry is an Article.

                        func (*DirectoryEntry) IsDeletedEntry

                        func (e *DirectoryEntry) IsDeletedEntry() bool

                          IsDeletedEntry checks whether the Directory Entry is a DeletedEntry.

                          func (*DirectoryEntry) IsLinkTarget

                          func (e *DirectoryEntry) IsLinkTarget() bool

                            IsLinkTarget checks whether the Directory Entry is a LinkTarget.

                            func (*DirectoryEntry) IsRedirect

                            func (e *DirectoryEntry) IsRedirect() bool

                              IsRedirect checks whether the Directory Entry is a Redirect to another Directory Entry.

                              func (*DirectoryEntry) Mimetype

                              func (e *DirectoryEntry) Mimetype() Mimetype

                                Mimetype is the Mimetype of the Directory Entry.

                                func (*DirectoryEntry) Namespace

                                func (e *DirectoryEntry) Namespace() Namespace

                                  Namespace defines to which namespace the Directory Entry belongs.

                                  func (*DirectoryEntry) RedirectIndex

                                  func (e *DirectoryEntry) RedirectIndex() uint32

                                    RedirectIndex is a pointer to the Directory Entry of the Redirect Target.

                                    func (*DirectoryEntry) Revision

                                    func (e *DirectoryEntry) Revision() uint32

                                      Revision identifies a revision of the contents of the Directory Entry, needed to identify updates or revisions in the original history.

                                      func (*DirectoryEntry) String

                                      func (e *DirectoryEntry) String() string

                                      func (*DirectoryEntry) Title

                                      func (e *DirectoryEntry) Title() []byte

                                        Title is the title of the Directory Entry.

                                        func (*DirectoryEntry) URL

                                        func (e *DirectoryEntry) URL() []byte

                                          URL is the URL of the Directory Entry, which is unique for the specific Namespace.

                                          type File

                                          type File struct {
                                          	// contains filtered or unexported fields
                                          }

                                            File represents a ZIM file and contains the most important information that is retrieved once and used again.

                                            func Open

                                            func Open(filename string) (*File, error)

                                              Open opens the file and checks for a valid ZIM header.

                                              func (*File) ArticleCount

                                              func (z *File) ArticleCount() uint32

                                                ArticleCount is the total number of articles defined in the pointerlists of the ZIM file.

                                                func (*File) BlobReader

                                                func (z *File) BlobReader(e *DirectoryEntry) (
                                                	reader io.Reader, blobSize int64, err error)

                                                  BlobReader returns a LimitedReader for the blob data of the given Directory Entry.

                                                  func (*File) BlobReaderAt

                                                  func (z *File) BlobReaderAt(clusterPosition, blobPosition uint32) (
                                                  	reader io.Reader, blobSize int64, err error)

                                                    BlobReaderAt returns a LimitedReader for the blob data at the given positions.
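Both BlobReader and BlobReaderAt return a reader together with the blob size. A minimal sketch of consuming such a pair with the standard library follows; readBlob, limit, and firstBytes are hypothetical helpers, and io.LimitedReader stands in for whatever reader the package actually returns.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// readBlob drains a reader that is expected to yield exactly size bytes,
// the shape of the (reader, blobSize) pair documented above.
func readBlob(r io.Reader, size int64) ([]byte, error) {
	if size < 0 {
		return nil, errors.New("negative blob size")
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, fmt.Errorf("reading %d-byte blob: %w", size, err)
	}
	return buf, nil
}

// limit caps reads from src at n bytes using io.LimitedReader.
func limit(src io.Reader, n int64) io.Reader {
	return &io.LimitedReader{R: src, N: n}
}

// firstBytes reads the first n bytes of s through a limited reader.
func firstBytes(s string, n int64) ([]byte, error) {
	return readBlob(limit(strings.NewReader(s), n), n)
}
```

Allocating the buffer from the reported size avoids repeated growth, but for large blobs it may be preferable to stream from the reader with io.Copy instead.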

                                                    func (*File) CalculateChecksum

                                                    func (z *File) CalculateChecksum() ([md5.Size]byte, error)

                                                      CalculateChecksum calculates the MD5 checksum of the ZIM file. This can take some time, depending on the size of the file.

                                                      func (*File) Close

                                                      func (z *File) Close()

                                                        Close closes the ZIM file.

                                                        func (*File) ClusterAt

                                                        func (z *File) ClusterAt(clusterPosition uint32) (Cluster, error)

                                                          ClusterAt returns the Cluster of the ZIM file at the given cluster position. The complete cluster data is stored uncompressed in memory. If the size of the cluster data is more than 32MB, an error is returned and the data is not read into memory. Note: Only use this function when every single blob of a ZIM file has to be read into memory (for example, when iterating over all contents, where it improves performance).

                                                          func (*File) ClusterCount

                                                          func (z *File) ClusterCount() uint32

                                                            ClusterCount is the number of clusters the ZIM file contains.

                                                            func (*File) Counter

                                                            func (z *File) Counter() string

                                                              Counter returns a String containing the number of Directory Entries per Mimetype.

                                                              func (*File) Creator

                                                              func (z *File) Creator() string

                                                                Creator returns the Creator of the ZIM file as found in the Metadata.

                                                                func (*File) Date

                                                                func (z *File) Date() string

                                                                  Date returns the Date of the ZIM file as found in the Metadata.

                                                                  func (*File) Description

                                                                  func (z *File) Description() string

                                                                    Description returns the Description of the ZIM file as found in the Metadata.

                                                                    func (*File) EntriesWithNamespace

                                                                    func (z *File) EntriesWithNamespace(namespace Namespace, limit int) []DirectoryEntry

                                                                      EntriesWithNamespace returns the first n Directory Entries in the Namespace where n <= limit. When limit is <= 0, the default value 100 is used.

                                                                      func (*File) EntriesWithSimilarity

                                                                      func (z *File) EntriesWithSimilarity(namespace Namespace, prefix []byte, limit int) []DirectoryEntry

                                                                        EntriesWithSimilarity returns Directory Entries in the Namespace that have a URL prefix or Title prefix similar to the given one. When limit is <= 0, the default value 100 is used.

                                                                        func (*File) EntriesWithTitlePrefix

                                                                        func (z *File) EntriesWithTitlePrefix(namespace Namespace, prefix []byte, limit int) []DirectoryEntry

                                                                          EntriesWithTitlePrefix returns all Directory Entries in the Namespace that have the given Title prefix. When limit is <= 0, the default value 100 is used.

                                                                          func (*File) EntriesWithURLPrefix

                                                                          func (z *File) EntriesWithURLPrefix(namespace Namespace, prefix []byte, limit int) []DirectoryEntry

                                                                            EntriesWithURLPrefix returns all Directory Entries in the Namespace that have the given URL prefix. When limit is <= 0, the default value 100 is used.

                                                                            func (*File) EntryAtTitlePosition

                                                                            func (z *File) EntryAtTitlePosition(position uint32) (DirectoryEntry, error)

                                                                              EntryAtTitlePosition returns the Directory Entry at the position as defined in the ordered title pointerlist. If 0 <= position < z.ArticleCount() the returned error is nil. Redirects are not followed automatically.

                                                                              func (*File) EntryAtURLPosition

                                                                              func (z *File) EntryAtURLPosition(position uint32) (DirectoryEntry, error)

                                                                                EntryAtURLPosition returns the Directory Entry at the position as defined in the ordered URL pointerlist. If 0 <= position < z.ArticleCount() the returned error is nil. Redirects are not followed automatically.

                                                                                func (*File) EntryWithNamespace

                                                                                func (z *File) EntryWithNamespace(namespace Namespace) (
                                                                                	entry DirectoryEntry, position uint32, found bool)

                                                                                  EntryWithNamespace searches for the first Directory Entry in the namespace. If it was found, found is set to true and the returned position will be the position in the URL pointerlist. This can be used to iterate over the next n Directory Entries using z.EntryAtURLPosition(position+n).

                                                                                  func (*File) EntryWithTitlePrefix

                                                                                  func (z *File) EntryWithTitlePrefix(namespace Namespace, prefix []byte) (
                                                                                  	entry DirectoryEntry, position uint32, found bool)

                                                                                    EntryWithTitlePrefix searches for the first Directory Entry in the namespace having the given title prefix. If it was found, found is set to true and the returned position will be the position in the title pointerlist. This can be used to iterate over the next n Directory Entries using z.EntryAtTitlePosition(position+n).
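The "find the first match, then walk position+n" pattern can be sketched on a plain sorted slice. The sketch assumes the list is sorted bytewise and uses a binary search; whether the package searches its pointerlists this way is an assumption, and firstWithPrefix and withPrefix are hypothetical names.

```go
package main

import (
	"sort"
	"strings"
)

// firstWithPrefix finds the first position in a sorted title list whose
// entry starts with prefix, returning the same (position, found) shape as
// the methods documented above.
func firstWithPrefix(sorted []string, prefix string) (position int, found bool) {
	// sort.SearchStrings returns the first index whose entry is >= prefix.
	// In a sorted list, that entry carries the prefix if any entry does.
	i := sort.SearchStrings(sorted, prefix)
	if i < len(sorted) && strings.HasPrefix(sorted[i], prefix) {
		return i, true
	}
	return 0, false
}

// withPrefix walks forward from the first match (position+n) until the
// prefix stops matching or limit entries have been collected.
func withPrefix(sorted []string, prefix string, limit int) []string {
	pos, found := firstWithPrefix(sorted, prefix)
	if !found {
		return nil
	}
	var out []string
	for n := 0; pos+n < len(sorted) && len(out) < limit; n++ {
		if !strings.HasPrefix(sorted[pos+n], prefix) {
			break
		}
		out = append(out, sorted[pos+n])
	}
	return out
}
```

Because entries sharing a prefix are contiguous in a sorted list, the walk can stop at the first entry that no longer matches.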

                                                                                    func (*File) EntryWithURL

                                                                                    func (z *File) EntryWithURL(namespace Namespace, url []byte) (
                                                                                    	entry DirectoryEntry, urlPosition uint32, found bool)

                                                                                      EntryWithURL searches for the Directory Entry with the exact URL. If the Directory Entry was found, found is set to true and the returned position will be the position in the URL pointerlist. This can be used to iterate over the next n Directory Entries using z.EntryAtURLPosition(position+n).

                                                                                      func (*File) EntryWithURLPrefix

                                                                                      func (z *File) EntryWithURLPrefix(namespace Namespace, prefix []byte) (
                                                                                      	entry DirectoryEntry, position uint32, found bool)

                                                                                        EntryWithURLPrefix searches for the first Directory Entry in the namespace having the given URL prefix. If it was found, found is set to true and the returned position will be the position in the URL pointerlist. This can be used to iterate over the next n Directory Entries using z.EntryAtURLPosition(position+n).

                                                                                        func (*File) Favicon

                                                                                        func (z *File) Favicon() (entry DirectoryEntry, err error)

                                                                                          Favicon returns the Directory Entry for the Favicon of the ZIM file.

                                                                                          func (*File) Filename

                                                                                          func (z *File) Filename() string

                                                                                            Filename is the filename of the ZIM file on the disk.

                                                                                            func (*File) Filesize

                                                                                            func (z *File) Filesize() int

                                                                                              Filesize is the filesize in Bytes of the ZIM file.

                                                                                              func (*File) FollowRedirect

                                                                                              func (z *File) FollowRedirect(redirectEntry *DirectoryEntry) (DirectoryEntry, error)

                                                                                                FollowRedirect returns the target Directory Entry of the given Redirect Entry.

                                                                                                func (*File) InternalChecksum

                                                                                                func (z *File) InternalChecksum() ([md5.Size]byte, error)

                                                                                                  InternalChecksum is the MD5 checksum for the ZIM file. It's precalculated and saved in the header.

                                                                                                  func (*File) Language

                                                                                                  func (z *File) Language() string

                                                                                                    Language returns the Language of the ZIM file as found in the Metadata.

                                                                                                    func (*File) LayoutPage

                                                                                                    func (z *File) LayoutPage() (DirectoryEntry, error)

                                                                                                      LayoutPage returns the Directory Entry for the LayoutPage of the ZIM file.

                                                                                                      func (*File) License

                                                                                                      func (z *File) License() string

                                                                                                        License returns the License of the ZIM file as found in the Metadata.

                                                                                                        func (*File) LongDescription

                                                                                                        func (z *File) LongDescription() string

                                                                                                          LongDescription returns the LongDescription of the ZIM file as found in the Metadata.

                                                                                                          func (*File) MainPage

                                                                                                          func (z *File) MainPage() (DirectoryEntry, error)

                                                                                                            MainPage returns the Directory Entry for the MainPage of the ZIM file.

                                                                                                            func (*File) Metadata

                                                                                                            func (z *File) Metadata() map[string]string

                                                                                                              Metadata returns a copy of the internal metadata map of the ZIM file.

                                                                                                              func (*File) MetadataFor

                                                                                                              func (z *File) MetadataFor(key string) string

                                                                                                                MetadataFor returns the metadata value for a given key. If the key is not set, an empty string is returned.

                                                                                                                func (*File) MimetypeList

                                                                                                                func (z *File) MimetypeList() []string

                                                                                                                  MimetypeList returns the internal Mimetype list of the ZIM file.

                                                                                                                  func (*File) Name

                                                                                                                  func (z *File) Name() string

                                                                                                                    Name returns the Name of the ZIM file as found in the Metadata.

                                                                                                                    func (*File) Publisher

                                                                                                                    func (z *File) Publisher() string

                                                                                                                      Publisher returns the Publisher of the ZIM file as found in the Metadata.

                                                                                                                      func (*File) Relation

                                                                                                                      func (z *File) Relation() string

                                                                                                                        Relation returns the Relation of the ZIM file as found in the Metadata.

                                                                                                                        func (*File) Source

                                                                                                                        func (z *File) Source() string

                                                                                                                          Source returns the Source of the ZIM file as found in the Metadata.

                                                                                                                          func (*File) Tags

                                                                                                                          func (z *File) Tags() string

                                                                                                                            Tags returns the Tags of the ZIM file as found in the Metadata.

                                                                                                                            func (*File) Title

                                                                                                                            func (z *File) Title() string

                                                                                                                              Title returns the Title of the ZIM file as found in the Metadata.

                                                                                                                              func (*File) UUID

                                                                                                                              func (z *File) UUID() UUID

                                                                                                                                UUID is the unique id of a ZIM file.

                                                                                                                                func (*File) ValidateChecksum

                                                                                                                                func (z *File) ValidateChecksum() error

                                                                                                                                  ValidateChecksum compares the internal MD5 checksum of the ZIM file with the calculated one.
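The comparison between a stored and a calculated MD5 checksum can be sketched with the standard library alone. The functions checksumOf, validate, and validateBytes are hypothetical; which bytes of the ZIM file the package feeds into the hash is not specified here and is left to the caller.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"errors"
	"io"
)

// checksumOf streams r through MD5 so large inputs need not fit in memory.
func checksumOf(r io.Reader) ([md5.Size]byte, error) {
	var sum [md5.Size]byte
	h := md5.New()
	if _, err := io.Copy(h, r); err != nil {
		return sum, err
	}
	copy(sum[:], h.Sum(nil))
	return sum, nil
}

// validate returns an error when the calculated checksum differs from the
// stored one, and nil when both are equal.
func validate(r io.Reader, stored [md5.Size]byte) error {
	got, err := checksumOf(r)
	if err != nil {
		return err
	}
	if got != stored {
		return errors.New("checksum mismatch")
	}
	return nil
}

// validateBytes is a convenience wrapper for in-memory data.
func validateBytes(data []byte, stored [md5.Size]byte) error {
	return validate(bytes.NewReader(data), stored)
}
```

Fixed-size arrays are comparable in Go, so the two [md5.Size]byte values can be compared directly with != without a helper.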

                                                                                                                                  func (*File) Version

                                                                                                                                  func (z *File) Version() (majorVersion, minorVersion uint16)

                                                                                                                                    Version is the version tuple of the ZIM file.

                                                                                                                                    type Header

                                                                                                                                    type Header struct {
                                                                                                                                    	// contains filtered or unexported fields
                                                                                                                                    }

                                                                                                                                      Header is the header of a ZIM file.

                                                                                                                                      type Mimetype

                                                                                                                                      type Mimetype uint16

                                                                                                                                        Mimetype describes one of the three possible fixed Mimetypes for a Directory Entry.

                                                                                                                                        type Namespace

                                                                                                                                        type Namespace byte

                                                                                                                                          Namespace is an ASCII character representing the category of a Directory Entry.

                                                                                                                                          func (Namespace) String

                                                                                                                                          func (n Namespace) String() string

                                                                                                                                          type UUID

                                                                                                                                          type UUID []byte

                                                                                                                                            UUID is the unique ID of a ZIM file.

                                                                                                                                            func (UUID) String

                                                                                                                                            func (u UUID) String() string

                                                                                                                                            Directories

                                                                                                                                            Path Synopsis
                                                                                                                                            cmd