tika

package
v0.0.0-...-aafbd7e
Warning

This package is not in the latest version of its module.

Published: Apr 1, 2023 License: Apache-2.0 Imports: 14 Imported by: 0

Documentation

Overview

Package tika provides a client and server for downloading, starting, and using Apache Tika's (http://tika.apache.org) Server.

Start with basic imports:

import "github.com/google/go-tika/tika"

API calls require a running Tika Server. If no server is running and you have not already downloaded the Server JAR, you can download one with DownloadServer. The caller is responsible for removing the file when it is no longer needed.

Version is a custom type and should be passed as such; the package defines a constant for each supported version. The following example downloads version 1.21 to the named JAR in the current working directory.

err := tika.DownloadServer(context.Background(), tika.Version121, "tika-server-1.21.jar")
if err != nil {
	log.Fatal(err)
}

If you don't have a running Tika Server, you can start one.

// Optionally pass a port as the second argument; the default port is 9998.
s, err := tika.NewServer("tika-server-1.21.jar", "")
if err != nil {
	log.Fatal(err)
}
// err is already declared above, so assign with = rather than :=.
err = s.Start(context.Background())
if err != nil {
	log.Fatal(err)
}
defer s.Stop()

To parse the contents of a file (or any io.Reader), you will need to open the io.Reader, create a client, and call client.Parse.

// import "os"
f, err := os.Open("path/to/file")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

client := tika.NewClient(nil, s.URL())
body, err := client.Parse(context.Background(), f)
if err != nil {
	log.Fatal(err)
}
log.Println(body)

If you pass an *http.Client to tika.NewClient, it will be used for all requests.

Some functions return a custom type, like Parsers(), Detectors(), and MIMETypes(). Use these to see what features are supported by the current Tika server.

Constants

const XTIKAContent = "X-TIKA:content"

XTIKAContent is the metadata field of the content of a file after recursive parsing. See ParseRecursive and MetaRecursive.

Variables

Versions is a list of supported versions of Apache Tika.

Functions

func DownloadServer

func DownloadServer(ctx context.Context, v Version, path string) error

DownloadServer downloads and validates the given server version, saving it at path. DownloadServer returns an error if the server could not be downloaded or validated. It is the caller's responsibility to remove the file when it is no longer needed. If the file already exists and has the correct SHA-512 checksum, DownloadServer does nothing.

Types

type ChildOptions

type ChildOptions struct {
	MaxFiles          int
	TaskPulseMillis   int
	TaskTimeoutMillis int
	PingPulseMillis   int
	PingTimeoutMillis int
}

ChildOptions represent command line parameters that can be used when Tika is run with the -spawnChild option. If a field is less than or equal to 0, the associated flag is not included.
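
The rule that non-positive fields are omitted can be sketched as follows. This is only an illustration and not the package's implementation: childOptions is a local copy of the documented struct, flagArgs is a hypothetical helper, and the flag spellings are assumed to follow the field names.

```go
package main

import (
	"fmt"
	"strconv"
)

// childOptions mirrors the documented fields of tika.ChildOptions so the
// sketch compiles without the tika package.
type childOptions struct {
	MaxFiles          int
	TaskPulseMillis   int
	TaskTimeoutMillis int
	PingPulseMillis   int
	PingTimeoutMillis int
}

// flagArgs emits a flag and its value for every field greater than 0 and
// skips the rest. The flag spellings are assumptions made for illustration.
func flagArgs(o childOptions) []string {
	fields := []struct {
		flag  string
		value int
	}{
		{"-maxFiles", o.MaxFiles},
		{"-taskPulseMillis", o.TaskPulseMillis},
		{"-taskTimeoutMillis", o.TaskTimeoutMillis},
		{"-pingPulseMillis", o.PingPulseMillis},
		{"-pingTimeoutMillis", o.PingTimeoutMillis},
	}
	var args []string
	for _, f := range fields {
		if f.value <= 0 {
			continue // zero or negative: leave the flag out
		}
		args = append(args, f.flag, strconv.Itoa(f.value))
	}
	return args
}

func main() {
	opts := childOptions{MaxFiles: 1000, TaskTimeoutMillis: 120000, PingPulseMillis: -1}
	fmt.Println(flagArgs(opts))
}
```

With the real package, the equivalent options would be passed to Server.ChildMode before the server is started.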

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client represents a connection to a Tika Server.

func NewClient

func NewClient(httpClient *http.Client, urlString string) *Client

NewClient creates a new Client. If httpClient is nil, the http.DefaultClient will be used.

func (*Client) Detect

func (c *Client) Detect(ctx context.Context, input io.Reader) (string, error)

Detect gets the mimetype of the given input, returning the mimetype and an error. If the error is not nil, the mimetype is undefined.

func (*Client) Detectors

func (c *Client) Detectors(ctx context.Context) (*Detector, error)

Detectors returns the list of available Detectors for this server. To get all available detectors, iterate through the Children of every Detector.
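
Iterating through the Children of every Detector amounts to a depth-first walk. The sketch below uses a local detector type that mirrors the documented fields of Detector; flatten and the sample names are hypothetical.

```go
package main

import "fmt"

// detector mirrors the documented fields of tika.Detector so the sketch
// compiles without the tika package.
type detector struct {
	Name      string
	Composite bool
	Children  []detector
}

// flatten returns the name of d followed by the names of all of its
// descendants, visiting children depth first.
func flatten(d detector) []string {
	names := []string{d.Name}
	for _, child := range d.Children {
		names = append(names, flatten(child)...)
	}
	return names
}

func main() {
	// Sample tree with made-up names.
	root := detector{
		Name:      "root",
		Composite: true,
		Children: []detector{
			{Name: "a"},
			{Name: "b", Composite: true, Children: []detector{{Name: "c"}}},
		},
	}
	for _, name := range flatten(root) {
		fmt.Println(name)
	}
}
```

With the real package, the same walk would start from the *Detector returned by Detectors.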

func (*Client) Language

func (c *Client) Language(ctx context.Context, input io.Reader) (string, error)

Language detects the language of the given input, returning the two letter language code and an error. If the error is not nil, the language is undefined.

func (*Client) LanguageString

func (c *Client) LanguageString(ctx context.Context, input string) (string, error)

LanguageString detects the language of the given string, returning the two letter language code and an error. If the error is not nil, the language is undefined.

func (*Client) MIMETypes

func (c *Client) MIMETypes(ctx context.Context) (map[string]MIMEType, error)

MIMETypes returns a map from MIME type name to a MIMEType value that holds properties of that MIME type.

func (*Client) Meta

func (c *Client) Meta(ctx context.Context, input io.Reader) (string, error)

Meta parses the metadata from the given input, returning the metadata and an error. If the error is not nil, the metadata is undefined.

func (*Client) MetaField

func (c *Client) MetaField(ctx context.Context, input io.Reader, field string) (string, error)

MetaField parses the metadata from the given input and returns the given field. If the error is not nil, the result string is undefined.

func (*Client) MetaFieldWithHeader

func (c *Client) MetaFieldWithHeader(ctx context.Context, input io.Reader, field string, header http.Header) (string, error)

MetaFieldWithHeader parses the metadata from the given input and returns the given field. If the error is not nil, the result string is undefined. This function also accepts a header so the caller can specify request headers such as `Accept`.

func (*Client) MetaRecursive

func (c *Client) MetaRecursive(ctx context.Context, input io.Reader) ([]map[string][]string, error)

MetaRecursive parses the given input and all embedded documents. The result is a list of maps from metadata key to value for each document. The content of each document is in the XTIKAContent field in text form. See ParseRecursive to just get the content of each document. If the error is not nil, the result list is undefined.
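
As a sketch of how the result might be consumed, the helper below pulls the content field out of each document's metadata map. The constant is a local copy of XTIKAContent, and contents and the sample data are hypothetical.

```go
package main

import "fmt"

// xtikaContent mirrors the documented tika.XTIKAContent constant.
const xtikaContent = "X-TIKA:content"

// contents returns the first XTIKAContent value of each document, or an
// empty string for documents that carry no content field.
func contents(docs []map[string][]string) []string {
	out := make([]string, 0, len(docs))
	for _, meta := range docs {
		text := ""
		if values := meta[xtikaContent]; len(values) > 0 {
			text = values[0]
		}
		out = append(out, text)
	}
	return out
}

func main() {
	// Made-up data in the shape returned by MetaRecursive.
	docs := []map[string][]string{
		{"Content-Type": {"application/zip"}},
		{"Content-Type": {"text/plain"}, xtikaContent: {"hello"}},
	}
	for i, text := range contents(docs) {
		fmt.Printf("document %d: %q\n", i, text)
	}
}
```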

func (*Client) MetaRecursiveType

func (c *Client) MetaRecursiveType(ctx context.Context, input io.Reader, contentType string) ([]map[string][]string, error)

MetaRecursiveType parses the given input and all embedded documents. The result is a list of maps from metadata key to value for each document. The content of each document is in the XTIKAContent field, and is of the type indicated by the contentType parameter. An empty string can be passed in for a default type of XML. See ParseRecursive to just get the content of each document. If the error is not nil, the result list is undefined.

func (*Client) MetaWithHeader

func (c *Client) MetaWithHeader(ctx context.Context, input io.Reader, header http.Header) (string, error)

MetaWithHeader parses the metadata from the given input, returning the metadata and an error. If the error is not nil, the metadata is undefined. This function also accepts a header so the caller can specify request headers such as `Accept`.

func (*Client) Parse

func (c *Client) Parse(ctx context.Context, input io.Reader) (string, error)

Parse parses the given input, returning the body of the input as a string and an error. If the error is not nil, the body is undefined.

func (*Client) ParseReader

func (c *Client) ParseReader(ctx context.Context, input io.Reader) (io.ReadCloser, error)

ParseReader parses the given input, returning the body of the input as a reader and an error. If the error is nil, the caller must close the returned reader; otherwise, the reader is nil.

func (*Client) ParseReaderWithHeader

func (c *Client) ParseReaderWithHeader(ctx context.Context, input io.Reader, header http.Header) (io.ReadCloser, error)

ParseReaderWithHeader parses the given input, returning the body of the input as a reader and an error. If the error is nil, the caller must close the returned reader; otherwise, the reader is nil. This function also accepts a header so the caller can specify request headers such as `Accept`.

func (*Client) ParseRecursive

func (c *Client) ParseRecursive(ctx context.Context, input io.Reader) ([]string, error)

ParseRecursive parses the given input and all embedded documents, returning a list of the contents of the input with one element per document. See MetaRecursive for access to all metadata fields. If the error is not nil, the result is undefined.

func (*Client) ParseWithHeader

func (c *Client) ParseWithHeader(ctx context.Context, input io.Reader, header http.Header) (string, error)

ParseWithHeader parses the given input, returning the body of the input as a string and an error. If the error is not nil, the body is undefined. This function also accepts a header so the caller can specify request headers such as `Accept`.
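
A header for the WithHeader variants can be built with the standard net/http package. In this sketch acceptHeader is a hypothetical helper, and "text/html" is only an example media type; whether the server honors it depends on the Tika Server version and endpoint.

```go
package main

import (
	"fmt"
	"net/http"
)

// acceptHeader builds an http.Header that asks for the given media type.
func acceptHeader(mediaType string) http.Header {
	h := make(http.Header)
	h.Set("Accept", mediaType)
	return h
}

func main() {
	h := acceptHeader("text/html")
	fmt.Println(h.Get("Accept"))
}
```

The resulting header would then be passed as the last argument, for example client.ParseWithHeader(ctx, f, h).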

func (*Client) Parsers

func (c *Client) Parsers(ctx context.Context) (*Parser, error)

Parsers returns the list of available parsers and an error. If the error is not nil, the list is undefined. To get all available parsers, iterate through the Children of every Parser.

func (*Client) Translate

func (c *Client) Translate(ctx context.Context, input io.Reader, t Translator, src, dst string) (string, error)

Translate translates the given input from src language to dst language using t, returning the translation and an error. If the error is not nil, the translation is undefined.

func (*Client) TranslateReader

func (c *Client) TranslateReader(ctx context.Context, input io.Reader, t Translator, src, dst string) (io.ReadCloser, error)

TranslateReader translates the given input from src language to dst language using t. It returns the translated document as a reader. If an error occurs, the reader is nil; otherwise, the caller must close the reader after use.

func (*Client) Version

func (c *Client) Version(ctx context.Context) (string, error)

Version returns the default hello message from Tika server.

type ClientError

type ClientError struct {
	// StatusCode is the HTTP status code returned by the Tika server.
	StatusCode int
}

ClientError is returned by Client's various parse methods and represents an error response from the Tika server. Example usage:

client := tika.NewClient(nil, tikaURL)
s, err := client.Parse(context.Background(), input)
var tikaErr tika.ClientError
if errors.As(err, &tikaErr) {
    switch tikaErr.StatusCode {
    case http.StatusUnsupportedMediaType, http.StatusUnprocessableEntity:
        // Handle content related error
    default:
        // Handle possibly intermittent http error
    }
} else if err != nil {
    // Handle non-http error
}

func (ClientError) Error

func (e ClientError) Error() string

type Detector

type Detector struct {
	Name      string
	Composite bool
	Children  []Detector
}

A Detector represents a Tika Detector. Detectors are used to get the filetype of a file. To get a list of all Detectors, see Detectors().

type MIMEType

type MIMEType struct {
	Alias     []string
	SuperType string
}

MIMEType represents a Tika MIME Type. To get a list of all MIME Types, see MIMETypes.

type Parser

type Parser struct {
	Name           string
	Decorated      bool
	Composite      bool
	Children       []Parser
	SupportedTypes []string
}

A Parser represents a Tika Parser. To get a list of all Parsers, see Parsers().
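
One use of the parser tree is to find which parsers claim support for a given MIME type. The sketch below works on a local parser type that mirrors the documented fields of Parser; supporting and the sample names are hypothetical.

```go
package main

import "fmt"

// parser mirrors the documented fields of tika.Parser so the sketch
// compiles without the tika package.
type parser struct {
	Name           string
	Decorated      bool
	Composite      bool
	Children       []parser
	SupportedTypes []string
}

// supporting returns the names of p and any descendants whose
// SupportedTypes list contains mimeType.
func supporting(p parser, mimeType string) []string {
	var names []string
	for _, t := range p.SupportedTypes {
		if t == mimeType {
			names = append(names, p.Name)
			break
		}
	}
	for _, child := range p.Children {
		names = append(names, supporting(child, mimeType)...)
	}
	return names
}

func main() {
	// Sample tree with made-up parser names.
	root := parser{
		Name:      "composite",
		Composite: true,
		Children: []parser{
			{Name: "pdf", SupportedTypes: []string{"application/pdf"}},
			{Name: "text", SupportedTypes: []string{"text/plain", "text/csv"}},
		},
	}
	fmt.Println(supporting(root, "text/plain"))
}
```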

type Server

type Server struct {
	JavaProps map[string]string
	// contains filtered or unexported fields
}

Server represents a Tika server. Create a new Server with NewServer, start it with Start, and shut it down with Stop or Shutdown. There is no need to create a Server for an already running Tika Server, since you can pass its URL directly to a Client. Additional Java system properties can be added to a Tika Server before startup by adding to the JavaProps map.

func NewServer

func NewServer(jar, port string) (*Server, error)

NewServer creates a new Server. The default port is 9998.

func (*Server) ChildMode

func (s *Server) ChildMode(ops *ChildOptions) error

ChildMode sets up the server to use the -spawnChild option. If used, ChildMode must be called before starting the server. If you want to turn off the -spawnChild option, call Server.ChildMode(nil).

func (*Server) Shutdown

func (s *Server) Shutdown(ctx context.Context) error

Shutdown attempts to close the server gracefully before resorting to SIGKILL. Stop, by contrast, uses SIGKILL right away, which causes the kernel to stop the Java process instantly.

func (*Server) Start

func (s *Server) Start(ctx context.Context) error

Start starts the given server. Start will start a new Java process. The caller must call Stop() to shut down the process when finished with the Server. Start will wait for the server to be available or until ctx is cancelled.

func (*Server) Stop

func (s *Server) Stop() error

Stop shuts the server down, killing the underlying Java process. Stop must be called when finished with the server to avoid leaking the Java process. If s has not been started, Stop will panic. If not running in a Windows environment, it is recommended to use Shutdown for a more graceful shutdown of the Java process.

func (*Server) URL

func (s *Server) URL() string

URL returns the URL of this Server.

type Translator

type Translator string

Translator represents the Java package of a Tika Translator.

const (
	Lingo24Translator   Translator = "org.apache.tika.language.translate.Lingo24Translator"
	GoogleTranslator    Translator = "org.apache.tika.language.translate.GoogleTranslator"
	MosesTranslator     Translator = "org.apache.tika.language.translate.MosesTranslator"
	JoshuaTranslator    Translator = "org.apache.tika.language.translate.JoshuaTranslator"
	MicrosoftTranslator Translator = "org.apache.tika.language.translate.MicrosoftTranslator"
	YandexTranslator    Translator = "org.apache.tika.language.translate.YandexTranslator"
)

Translators available by default in Tika. You must configure all required authentication details in Tika Server (for example, an API key).

type Version

type Version string

A Version represents a Tika Server version.

const (
	Version119  Version = "1.19"
	Version120  Version = "1.20"
	Version121  Version = "1.21"
	Version122  Version = "1.22"
	Version123  Version = "1.23"
	Version124  Version = "1.24"
	Version125  Version = "1.25"
	Version126  Version = "1.26"
	Version127  Version = "1.27"
	Version128  Version = "1.28"
	Version1285 Version = "1.28.5"
)

Supported versions of Tika Server.
