README

gosseract OCR


Go OCR package built on the Tesseract C++ library.

OCR Server

Do you just want an OCR server, or a working example of this package? There is a ready-made server application that is easy to deploy:

👉 https://github.com/otiai10/ocrserver

Example

package main

import (
	"fmt"
	"github.com/otiai10/gosseract"
)

func main() {
	client := gosseract.NewClient()
	defer client.Close()
	client.SetImage("path/to/image.png")
	text, _ := client.Text()
	fmt.Println(text)
	// Hello, World!
}

Install

  1. tesseract-ocr, including library and headers
  2. go get -t github.com/otiai10/gosseract

See the Dockerfile for installation details, or just try it with docker run -it --rm otiai10/gosseract.

Test

If you have tesseract-ocr installed locally, just run

% go test .

Otherwise, if you don't want to install tesseract-ocr locally, run ./test/runtime, which uses Docker and Vagrant to test the source code on several runtimes.

% ./test/runtime --driver docker
% ./test/runtime --driver vagrant

Check ./test/runtimes for more information about runtime tests.

Issues


Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

func ClearPersistentCache

func ClearPersistentCache()

    ClearPersistentCache clears any library-level memory caches. There are a variety of expensive-to-load constant data structures (mostly language dictionaries) that are cached globally, surviving the Init() and End() of individual TessBaseAPI instances. This function clears those caches.

    func Version

    func Version() string

      Version returns the version of Tesseract-OCR

      Types

      type BoundingBox

      type BoundingBox struct {
      	Box        image.Rectangle
      	Word       string
      	Confidence float64
      }

        BoundingBox contains the position, confidence, and UTF-8 text of a recognized word.
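As a rough illustration of how the fields of BoundingBox might be consumed, the sketch below drops low-confidence words. The struct is a local copy so the snippet builds without Tesseract installed, filterByConfidence is a hypothetical helper that is not part of this package, and the threshold assumes confidence is reported on Tesseract's usual 0 to 100 scale.

```go
package main

import (
	"fmt"
	"image"
)

// BoundingBox is a local copy of the struct documented above, so this
// sketch compiles without the Tesseract libraries installed.
type BoundingBox struct {
	Box        image.Rectangle
	Word       string
	Confidence float64
}

// filterByConfidence is a hypothetical helper (not part of gosseract).
// It keeps only the boxes whose confidence is at least threshold.
func filterByConfidence(boxes []BoundingBox, threshold float64) []BoundingBox {
	var kept []BoundingBox
	for _, b := range boxes {
		if b.Confidence >= threshold {
			kept = append(kept, b)
		}
	}
	return kept
}

func main() {
	// Hand-written sample data, not real OCR output.
	boxes := []BoundingBox{
		{Box: image.Rect(10, 10, 60, 30), Word: "Hello,", Confidence: 96.5},
		{Box: image.Rect(70, 10, 130, 30), Word: "W0rld", Confidence: 41.0},
	}
	for _, b := range filterByConfidence(boxes, 80) {
		fmt.Printf("%q at %v (%.1f)\n", b.Word, b.Box, b.Confidence)
	}
}
```

In real use the slice would come from GetBoundingBoxes rather than being written by hand.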

        type Client

        type Client struct {
        
        	// Trim specifies whether unnecessary characters are trimmed from the result string.
        	// OCR output often carries unnecessary characters, such as newlines, at the head or foot of the string.
        	// If `Trim` is set, this client removes those characters from the result.
        	Trim bool
        
        	// TessdataPrefix is the directory path to `tessdata`.
        	// By default it is `/usr/local/share/tessdata/` or a similar location.
        	// TODO: Implement and test
        	TessdataPrefix *string
        
        	// Languages are the languages to be detected. If not specified, it defaults to "eng".
        	Languages []string
        
        	// Variables is a pool of values that are passed to "tesseract::TessBaseAPI->SetVariable" later, once the API is initialized.
        	// TODO: Think if it should be public, or private property.
        	Variables map[SettableVariable]string
        
        	// ConfigFilePath is the path to a configuration file for Tesseract.
        	// See http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
        	// TODO: Fix link to official page
        	ConfigFilePath string
        	// contains filtered or unexported fields
        }

          Client is an argument builder for tesseract::TessBaseAPI.
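To make the effect of the Trim field concrete, here is a minimal sketch of what trimming an OCR result could look like. trimResult is a hypothetical helper, and the assumption that only newlines are stripped is mine; the package's actual trimming rule may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// trimResult is a hypothetical helper (not part of gosseract). It mimics a
// boolean Trim option by stripping leading and trailing newlines, which is
// an assumption about which characters count as unnecessary.
func trimResult(raw string, trim bool) string {
	if !trim {
		return raw
	}
	return strings.Trim(raw, "\n")
}

func main() {
	raw := "\nHello, World!\n\n" // hand-written stand-in for raw OCR output
	fmt.Printf("%q\n", trimResult(raw, false))
	fmt.Printf("%q\n", trimResult(raw, true))
}
```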

          func NewClient

          func NewClient() *Client

            NewClient constructs a new Client. The caller is responsible for calling Close on this client.

            func (*Client) Close

            func (client *Client) Close() (err error)

              Close frees allocated API. This MUST be called for ANY client constructed by "NewClient" function.

              func (*Client) DisableOutput

              func (client *Client) DisableOutput() error

              func (*Client) GetBoundingBoxes

              func (client *Client) GetBoundingBoxes(level PageIteratorLevel) (out []BoundingBox, err error)

                GetBoundingBoxes returns bounding boxes for each matched word

                func (*Client) HOCRText

                func (client *Client) HOCRText() (out string, err error)

                  HOCRText initializes tesseract::TessBaseAPI, executes OCR, and returns the result as hOCR text. See https://en.wikipedia.org/wiki/HOCR for more information about hOCR.
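hOCR is HTML in which each recognized word is typically a span of class ocrx_word carrying its box in the title attribute. The sketch below pulls words and boxes out of such markup with a regular expression. The sample markup is hand-written, parseHOCRWords and hocrWord are hypothetical names, and a real consumer would likely prefer an HTML parser, since actual output can nest further tags inside a word span.

```go
package main

import (
	"fmt"
	"image"
	"regexp"
	"strconv"
)

// wordSpan matches a simple ocrx_word span and captures the four bbox
// coordinates and the inner text. It does not handle nested tags.
var wordSpan = regexp.MustCompile(
	`<span class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)</span>`)

// hocrWord is a hypothetical type for this sketch.
type hocrWord struct {
	Box  image.Rectangle
	Text string
}

// parseHOCRWords is a hypothetical helper (not part of gosseract).
func parseHOCRWords(hocr string) []hocrWord {
	var words []hocrWord
	for _, m := range wordSpan.FindAllStringSubmatch(hocr, -1) {
		var c [4]int
		for i := range c {
			// The regexp only captures digits here, so the error is ignored.
			c[i], _ = strconv.Atoi(m[i+1])
		}
		words = append(words, hocrWord{
			Box:  image.Rect(c[0], c[1], c[2], c[3]),
			Text: m[5],
		})
	}
	return words
}

func main() {
	// Hand-written sample in the usual hOCR shape, not real output.
	sample := `<span class='ocr_line' title='bbox 36 92 200 116'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello,</span>
<span class='ocrx_word' id='word_1_2' title='bbox 110 92 200 116; x_wconf 93'>World!</span>
</span>`
	for _, w := range parseHOCRWords(sample) {
		fmt.Printf("%q %v\n", w.Text, w.Box)
	}
}
```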

                  func (*Client) SetBlacklist

                  func (client *Client) SetBlacklist(whitelist string) error

                    SetBlacklist sets blacklist chars, that is, characters that should not be recognized. See the official documentation on character lists here https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns

                    func (*Client) SetConfigFile

                    func (client *Client) SetConfigFile(fpath string) error

                      SetConfigFile sets the file path to config file.

                      func (*Client) SetImage

                      func (client *Client) SetImage(imagepath string) error

                        SetImage sets the path to the image file to be processed by OCR.

                        func (*Client) SetImageFromBytes

                        func (client *Client) SetImageFromBytes(data []byte) error

                          SetImageFromBytes sets the image data to be processed by OCR.

                          func (*Client) SetLanguage

                          func (client *Client) SetLanguage(langs ...string) error

                            SetLanguage sets the languages to use. The default is English.
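Tesseract itself takes multiple languages as a single string joined with "+", for example "eng+jpn". The sketch below shows how a list of languages could be turned into that form with "eng" as the fallback. languageString is a hypothetical helper, and whether this package builds the string exactly this way is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// languageString is a hypothetical helper (not part of gosseract). It joins
// language codes in the "+"-separated form Tesseract expects and falls back
// to "eng" when none are given.
func languageString(langs []string) string {
	if len(langs) == 0 {
		return "eng"
	}
	return strings.Join(langs, "+")
}

func main() {
	fmt.Println(languageString(nil))
	fmt.Println(languageString([]string{"eng", "jpn"}))
}
```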

                            func (*Client) SetPageSegMode

                            func (client *Client) SetPageSegMode(mode PageSegMode) error

                              SetPageSegMode sets the "Page Segmentation Mode" (PSM) used to detect the layout of characters. See the official documentation for PSM at https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method and https://github.com/otiai10/gosseract/issues/52 for more information.

                              func (*Client) SetVariable

                              func (client *Client) SetVariable(key SettableVariable, value string) error

                                SetVariable sets parameters, representing tesseract::TessBaseAPI->SetVariable. See the official documentation here: https://zdenop.github.io/tesseract-doc/classtesseract_1_1_tess_base_a_p_i.html#a2e09259c558c6d8e0f7e523cbaf5adf5. Because `api->SetVariable` must be called after `api->Init`, this method cannot detect an unexpected variable key at the time it is called. Check `client.setVariablesToInitializedAPI` for more information.
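The deferred behaviour described above can be sketched as follows: values are only stored when SetVariable is called, and an unknown key surfaces later, when the stored values are flushed to an initialized engine. lazyClient, apply, and the fake engine callback are all hypothetical names for illustration and do not reflect the package's internals.

```go
package main

import "fmt"

// SettableVariable mirrors the type documented on this page.
type SettableVariable string

// lazyClient is a hypothetical stand-in for Client. It only pools variables.
type lazyClient struct {
	pending map[SettableVariable]string
}

// SetVariable stores the value. Nothing can be validated yet because no
// engine has been initialized, so it always returns nil in this sketch.
func (c *lazyClient) SetVariable(key SettableVariable, value string) error {
	if c.pending == nil {
		c.pending = make(map[SettableVariable]string)
	}
	c.pending[key] = value
	return nil
}

// apply flushes the pooled variables to an initialized engine. The set
// callback stands in for the engine's SetVariable and reports whether the
// name was accepted.
func (c *lazyClient) apply(set func(name, value string) bool) error {
	for key, value := range c.pending {
		if !set(string(key), value) {
			return fmt.Errorf("unknown variable %q", key)
		}
	}
	return nil
}

func main() {
	c := &lazyClient{}
	c.SetVariable("tessedit_char_whitelist", "0123456789")
	c.SetVariable("no_such_variable", "1") // accepted for now

	// A fake engine that only knows one variable name.
	engine := func(name, value string) bool {
		return name == "tessedit_char_whitelist"
	}
	fmt.Println(c.apply(engine)) // the bad key is reported only here
}
```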

                                func (*Client) SetWhitelist

                                func (client *Client) SetWhitelist(whitelist string) error

                                  SetWhitelist sets whitelist chars. See official documentation for whitelist here https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns

                                  Example
                                  Output:
                                  
                                  IO- IOO 10-100
                                  

                                  func (*Client) Text

                                  func (client *Client) Text() (out string, err error)

                                    Text initializes tesseract::TessBaseAPI, executes OCR, and returns the detected text as a string.

                                    Example
                                    Output:
                                    
                                    Hello, World! <nil>
                                    

                                    type PageIteratorLevel

                                    type PageIteratorLevel int

                                      PageIteratorLevel maps directly to Tesseract's enum tesseract::PageIteratorLevel, which represents the hierarchy of the page elements used in ResultIterator. See https://github.com/tesseract-ocr/tesseract/blob/a18620cfea33d03032b71fe1b9fc424777e34252/ccstruct/publictypes.h#L219-L225

                                      const (
                                      	// RIL_BLOCK - Block of text/image/separator line.
                                      	RIL_BLOCK PageIteratorLevel = iota
                                      	// RIL_PARA - Paragraph within a block.
                                      	RIL_PARA
                                      	// RIL_TEXTLINE - Line within a paragraph.
                                      	RIL_TEXTLINE
                                      	// RIL_WORD - Word within a textline.
                                      	RIL_WORD
                                      	// RIL_SYMBOL - Symbol/character within a word.
                                      	RIL_SYMBOL
                                      )
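Because the constants are declared with iota, they take the values 0 through 4 in declaration order, from the coarsest element (block) to the finest (symbol). The sketch below redeclares them locally, so it builds without Tesseract, and adds a hypothetical levelName helper for readable output.

```go
package main

import "fmt"

// Local mirror of the declarations above, for illustration only.
type PageIteratorLevel int

const (
	RIL_BLOCK PageIteratorLevel = iota
	RIL_PARA
	RIL_TEXTLINE
	RIL_WORD
	RIL_SYMBOL
)

// levelName is a hypothetical helper (not part of gosseract).
func levelName(l PageIteratorLevel) string {
	names := [...]string{"block", "paragraph", "text line", "word", "symbol"}
	if l < 0 || int(l) >= len(names) {
		return "unknown"
	}
	return names[l]
}

func main() {
	for l := RIL_BLOCK; l <= RIL_SYMBOL; l++ {
		fmt.Printf("%d = %s\n", int(l), levelName(l))
	}
}
```

Passing RIL_WORD to GetBoundingBoxes would ask for one box per word, while a coarser level such as RIL_TEXTLINE would ask for one box per line.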

                                      type PageSegMode

                                      type PageSegMode int

                                        PageSegMode represents tesseract::PageSegMode. See https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method and https://github.com/tesseract-ocr/tesseract/blob/a18620cfea33d03032b71fe1b9fc424777e34252/ccstruct/publictypes.h#L158-L183 for more information.

                                        const (
                                        	// PSM_OSD_ONLY - Orientation and script detection (OSD) only.
                                        	PSM_OSD_ONLY PageSegMode = iota
                                        	// PSM_AUTO_OSD - Automatic page segmentation with OSD.
                                        	PSM_AUTO_OSD
                                        	// PSM_AUTO_ONLY - Automatic page segmentation, but no OSD, or OCR.
                                        	PSM_AUTO_ONLY
                                        	// PSM_AUTO - (DEFAULT) Fully automatic page segmentation, but no OSD.
                                        	PSM_AUTO
                                        	// PSM_SINGLE_COLUMN - Assume a single column of text of variable sizes.
                                        	PSM_SINGLE_COLUMN
                                        	// PSM_SINGLE_BLOCK_VERT_TEXT - Assume a single uniform block of vertically aligned text.
                                        	PSM_SINGLE_BLOCK_VERT_TEXT
                                        	// PSM_SINGLE_BLOCK - Assume a single uniform block of text.
                                        	PSM_SINGLE_BLOCK
                                        	// PSM_SINGLE_LINE - Treat the image as a single text line.
                                        	PSM_SINGLE_LINE
                                        	// PSM_SINGLE_WORD - Treat the image as a single word.
                                        	PSM_SINGLE_WORD
                                        	// PSM_CIRCLE_WORD - Treat the image as a single word in a circle.
                                        	PSM_CIRCLE_WORD
                                        	// PSM_SINGLE_CHAR - Treat the image as a single character.
                                        	PSM_SINGLE_CHAR
                                        	// PSM_SPARSE_TEXT - Find as much text as possible in no particular order.
                                        	PSM_SPARSE_TEXT
                                        	// PSM_SPARSE_TEXT_OSD - Sparse text with orientation and script detection.
                                        	PSM_SPARSE_TEXT_OSD
                                        	// PSM_RAW_LINE - Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
                                        	PSM_RAW_LINE
                                        
                                        	// PSM_COUNT - Just a number of enum entries. This is NOT a member of PSM ;)
                                        	PSM_COUNT
                                        )
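PSM_COUNT is a sentinel rather than a usable mode: since it is the last iota constant, its value equals the number of real modes, which makes a range check straightforward. The sketch below redeclares the constants locally and adds a hypothetical isValidPSM helper that is not part of this package.

```go
package main

import "fmt"

// Local mirror of the declarations above, for illustration only.
type PageSegMode int

const (
	PSM_OSD_ONLY PageSegMode = iota
	PSM_AUTO_OSD
	PSM_AUTO_ONLY
	PSM_AUTO
	PSM_SINGLE_COLUMN
	PSM_SINGLE_BLOCK_VERT_TEXT
	PSM_SINGLE_BLOCK
	PSM_SINGLE_LINE
	PSM_SINGLE_WORD
	PSM_CIRCLE_WORD
	PSM_SINGLE_CHAR
	PSM_SPARSE_TEXT
	PSM_SPARSE_TEXT_OSD
	PSM_RAW_LINE

	PSM_COUNT // number of modes above, not a mode itself
)

// isValidPSM is a hypothetical helper (not part of gosseract). It reports
// whether mode is one of the real modes declared before PSM_COUNT.
func isValidPSM(mode PageSegMode) bool {
	return mode >= PSM_OSD_ONLY && mode < PSM_COUNT
}

func main() {
	fmt.Println(int(PSM_AUTO), isValidPSM(PSM_AUTO))
	fmt.Println(int(PSM_COUNT), isValidPSM(PSM_COUNT))
}
```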

                                        type SettableVariable

                                        type SettableVariable string

                                          SettableVariable represents available strings for TessBaseAPI::SetVariable. See https://groups.google.com/forum/#!topic/tesseract-ocr/eHTBzrBiwvQ and https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/tesseractclass.h

                                          const (
                                          	// DEBUG_FILE - File to send output to.
                                          	DEBUG_FILE SettableVariable = "debug_file"
                                          	// TESSEDIT_CHAR_WHITELIST - Whitelist of chars to recognize
                                          	// There is a known issue in 4.00 with LSTM
                                          	// https://github.com/tesseract-ocr/tesseract/issues/751
                                          	TESSEDIT_CHAR_WHITELIST SettableVariable = "tessedit_char_whitelist"
                                          	// TESSEDIT_CHAR_BLACKLIST - Blacklist of chars not to recognize
                                          	// There is a known issue in 4.00 with LSTM
                                          	// https://github.com/tesseract-ocr/tesseract/issues/751
                                          	TESSEDIT_CHAR_BLACKLIST SettableVariable = "tessedit_char_blacklist"
                                          )

                                            These are variables that can be used with TessBaseAPI::SetVariable. If anything is missing (I know there are many), please add it.