bidi

package
v0.1.0
Published: Nov 13, 2021 License: BSD-3-Clause Imports: 13 Imported by: 0

README

Unicode Bidirectional Algorithm

Package bidi will implement a variant of the Unicode UAX#9 Bidirectional Algorithm. It is not fully standards-conforming, but good enough for practical purposes.

Unicode Annex UAX#9 presents an algorithm to identify directional runs within texts. The algorithm deals with characters and character runs, which UAX#9 maps to Bidi character classes. Bidi classes are then grouped according to certain rules to determine writing directions. The algorithm is not perfect and there are some cases where manual overriding will be necessary to produce correct output, but it is good enough for many real-life cases.

Deviations from the Standard

This package interprets some of the Bidi algorithm's rules a bit differently than strict adherence to the standard would require. The reason is that we postulate some general requirements which make it hard to conform to the standard 100%. The main requirement is a restriction on how the input text is accessed: we operate on an io.Reader and do not buffer the characters read from it. As a consequence, we never travel backwards over characters and never read a character twice. However, some parts of the UAX#9 algorithm are presented as operations on “look-behinds”, as setting properties per character (Bidi class, embedding level), or as a multi-pass approach. This package employs strategies borrowed from parsing theory to arrive at the same results as the original UAX#9 algorithm.

That said, this package will implement UAX#9 in a way that conforms to the standard for “reasonable texts”, i.e. text produced by humans for humans. Deviation from the standard is confined to areas of the standard that deal with rather obscure border cases. As an example, the Bidi Annex specifies a maximum nesting level of bracket pairings (63 levels) per isolating run sequence. However, this package will ignore this boundary in a certain case when markers ending an isolating run sequence go missing. The only clients ever likely to notice this deviation are UAX#9 conformance tests.

There is one limitation, however, which ignores the standard in a relevant way: We do not implement legacy formatting directives, which the Annex calls “Explicit Directional Embedding and Override Formatting Characters”, i.e. the formatting directives LRE, RLE, LRO, RLO and PDF. Unicode recommends sticking to the more modern “Isolate Formatting Characters” LRI, RLI, FSI and PDI. This package will deal with isolate run sequences produced by isolate formatting characters (or external markup) only. The need to deal with legacy formatting characters may arise in the future, but currently I do not plan to implement them.

API

As the algorithms in this package do not copy any input characters, the burden of storing the text is left to the calling client. This package returns Bidi runs as intervals of text positions, which means clients must be able to reproduce the text identified by a text position. That's trivially true for text stored in a bytes buffer or string, but one can imagine other situations where this requirement involves some additional effort, such as an input stream read from a file.

Attention: Work in progress, not yet fully functional.

Documentation

Overview

Deviations from the Standard

This package implements UAX#9 in a way that is standards-conforming for “reasonable texts”, i.e. text produced by humans for humans. Deviations from the standard are confined to two areas of the standard: error handling and legacy formatting directives.

As an example of error handling, the Bidi Annex specifies a maximum nesting level of bracket pairings (63 levels) per isolating run sequence. However, this package will ignore this boundary in a certain case when markers ending an isolating run sequence go missing. The only clients ever likely to notice this deviation are UAX#9 conformance tests.

We do not implement legacy formatting directives, which the Annex calls “Explicit Directional Embedding and Override Formatting Characters”, i.e. the formatting directives LRE, RLE, LRO, RLO and PDF. Unicode recommends sticking to the more modern “Isolate Formatting Characters” LRI, RLI, FSI and PDI. This package will deal with isolate run sequences produced by isolate formatting characters (or external markup) only. The need to deal with legacy formatting characters may arise in the future, but currently I do not plan to implement them.

________________________________________________________________________________

BSD License

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of this software nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Index

Constants

const (
	MarkupLRI    int = int(bidi.LRI) << 8
	MarkupRLI    int = int(bidi.RLI) << 8
	MarkupPDI    int = int(bidi.PDI)
	MarkupPDILRI int = MarkupPDI | MarkupLRI
	MarkupPDIRLI int = MarkupPDI | MarkupRLI
)

Constants for clients to use as OutOfLineBidiMarkup return values.
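The declarations above pack two pieces of information into one int: an isolate initiator (LRI or RLI) shifted into the second byte, and a PDI in the lowest byte, so that a combined value such as MarkupPDIRLI can carry both. The following sketch reproduces that packing scheme with placeholder class numbers; the real numbers come from the `bidi.LRI`, `bidi.RLI` and `bidi.PDI` class constants, and the `unpack` helper is purely illustrative.

```go
package main

import "fmt"

// Placeholder class numbers, chosen for illustration only.
const (
	classLRI = 1
	classRLI = 2
	classPDI = 3
)

// Same packing scheme as the package constants: an isolate initiator goes
// into the second byte, a PDI into the lowest byte.
const (
	markupLRI    = classLRI << 8
	markupRLI    = classRLI << 8
	markupPDI    = classPDI
	markupPDILRI = markupPDI | markupLRI
	markupPDIRLI = markupPDI | markupRLI
)

// unpack splits a markup value into its closing part (low byte) and its
// opening part (second byte). A zero part means "not present".
func unpack(m int) (closing, opening int) {
	return m & 0xff, m >> 8
}

func main() {
	closing, opening := unpack(markupPDIRLI)
	fmt.Println(closing == classPDI, opening == classRLI)
}
```

Presumably the combined values let a client end one isolate and start the next at the same text position, but that reading is an assumption.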

const BD16MaxNesting = 63

BD16MaxNesting is the maximum stack depth for rule BD16 as defined in UAX#9.
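For readers unfamiliar with BD16, the sketch below shows the bounded-stack bracket pairing it describes, restricted to ASCII brackets. It is a simplified illustration and not this package's code: all names are invented, and keeping the pairs found before a stack overflow is one reading of the rule's “stop processing” clause.

```go
package main

import (
	"fmt"
	"sort"
)

const maxNesting = 63 // stack limit given in BD16

// pair holds the positions of an opening bracket and its closing bracket.
type pair struct{ Open, Close int }

// entry remembers which closer an opening bracket waits for, and where it was.
type entry struct {
	closer rune
	pos    int
}

var closers = map[rune]rune{'(': ')', '[': ']', '{': '}'}

func isCloser(r rune) bool { return r == ')' || r == ']' || r == '}' }

// bracketPairs identifies bracket pairs in text, sorted by opening position.
func bracketPairs(text []rune) []pair {
	var stack []entry
	var pairs []pair
scan:
	for i, r := range text {
		if c, ok := closers[r]; ok {
			if len(stack) == maxNesting {
				break scan // no room left: stop processing
			}
			stack = append(stack, entry{c, i})
			continue
		}
		if !isCloser(r) {
			continue
		}
		// Search from the top of the stack for a matching opener.
		for j := len(stack) - 1; j >= 0; j-- {
			if stack[j].closer == r {
				pairs = append(pairs, pair{stack[j].pos, i})
				stack = stack[:j] // pop through the matched entry
				break
			}
		}
	}
	sort.Slice(pairs, func(a, b int) bool { return pairs[a].Open < pairs[b].Open })
	return pairs
}

func main() {
	fmt.Println(bracketPairs([]rune("a(b[c)d]")))
	fmt.Println(bracketPairs([]rune("([])")))
}
```

In the first input only the round brackets pair up, because matching the closing parenthesis discards the unmatched opening square bracket from the stack.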

const UnicodeVersion = "13.0.0"

UnicodeVersion is the UAX#9 version this implementation follows.

Variables

This section is empty.

Functions

func T

func T() tracing.Trace

T traces to a global core tracer

Types

type Direction

type Direction int

A Direction indicates the overall flow of text.

const (
	// LeftToRight indicates a requirement to order the characters of a script
	// from left to right.
	LeftToRight Direction = iota
	// RightToLeft indicates a requirement to order the characters of a script
	// from right to left.
	RightToLeft
)

func (Direction) String

func (dir Direction) String() string

String returns either "L2R" or "R2L".

type Option

type Option func(p *bidiScanner)

Option configures a Bidi algorithm
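Option follows the functional-options pattern that is common in Go: each option is a function that mutates the scanner's configuration. Since bidiScanner is unexported and its fields are not documented here, the following sketch uses an invented `scanner` struct and invented option names to show the pattern only.

```go
package main

import "fmt"

// scanner stands in for the package's unexported bidiScanner; its fields
// are invented for this sketch.
type scanner struct {
	rtlDefault       bool
	ignoreSeparators bool
}

// option mirrors the shape of the Option type: a function configuring a scanner.
type option func(*scanner)

// defaultRTL is a hypothetical option setting the default direction.
func defaultRTL(b bool) option {
	return func(s *scanner) { s.rtlDefault = b }
}

// ignoreSeparators is a hypothetical option toggling separator handling.
func ignoreSeparators(b bool) option {
	return func(s *scanner) { s.ignoreSeparators = b }
}

// newScanner starts from defaults and applies the options in order.
func newScanner(opts ...option) *scanner {
	s := &scanner{}
	for _, o := range opts {
		o(s)
	}
	return s
}

func main() {
	s := newScanner(ignoreSeparators(true))
	fmt.Printf("%+v\n", *s)
}
```

Options that are not passed leave the corresponding default untouched, which is why a variadic parameter works well for this kind of API.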

func DefaultDirection

func DefaultDirection(dir Direction) Option

DefaultDirection sets the outer embedding level for a paragraph (LeftToRight is the normal default).

func IgnoreParagraphSeparators

func IgnoreParagraphSeparators(b bool) Option

IgnoreParagraphSeparators determines whether paragraph separators (i.e., newlines et al.) are to be ignored and interpreted as whitespace instead. The default value for this option is `false`, which effectively treats any paragraph separator as `end of input`. In this case the paragraph separator is cut off from the input.

func RecognizeLegacy

func RecognizeLegacy(b bool) Option

RecognizeLegacy is not yet implemented. It was intended to make the resolver recognize legacy formatting, i.e. LRE, RLE, LRO, RLO, PDF. However, I changed my mind and currently do not intend to support legacy formatting types, thus setting this option will have no effect.

func TestMode

func TestMode(b bool) Option

TestMode will set up the scanner to recognize UPPERCASE letters as having R2L class. This is a common pattern in bidi algorithm development and testing. Additionally we follow a convention of the UAX#9 algorithm documentation: “The invisible, zero-width formatting characters LRI, RLI, and PDI are represented with the symbols '>', '<', and '=', respectively.” Thus it is possible to replay the examples of section 3.4 of UAX#9:

<car MEANS CAR.=

or

DID YOU SAY ’>he said “<car MEANS CAR=”=‘?
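The test-mode convention can be pictured as a small classification function. The sketch below is an illustration of the convention as described above, not the scanner's actual code: the function name and the string labels are invented, and the real Bidi classes are finer-grained than the catch-all "other" used here.

```go
package main

import (
	"fmt"
	"unicode"
)

// testClass maps a rune to a coarse class under the test-mode convention:
// '>', '<' and '=' stand for LRI, RLI and PDI, and UPPERCASE letters are
// treated as right-to-left.
func testClass(r rune) string {
	switch {
	case r == '>':
		return "LRI"
	case r == '<':
		return "RLI"
	case r == '=':
		return "PDI"
	case unicode.IsUpper(r):
		return "R"
	case unicode.IsLetter(r):
		return "L"
	default:
		return "other"
	}
}

func main() {
	for _, r := range "<car MEANS CAR.=" {
		fmt.Printf("%q:%s ", r, testClass(r))
	}
	fmt.Println()
}
```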

type Ordering

type Ordering struct {
	Runs []Run
}

An Ordering holds the computed visual order of bidi-runs of a paragraph of text.

type OutOfLineBidiMarkup

type OutOfLineBidiMarkup func(uint64) int

OutOfLineBidiMarkup is queried during read of input text for out-of-line Bidi delimiters (LRI, RLI, PDI). Such markup may result, e.g., from HTML attributes or CSS styles. It receives a text position and—if appropriate—returns a Bidi class to be inserted. It will be treated by the resolver as a Bidi delimiter of byte-length zero.
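One plausible way for a client to build such a callback is from a table of positions collected while parsing markup, for example the start and end of an element with a right-to-left direction attribute. The sketch below declares a local type with the same signature. The constant values are placeholders, and returning 0 for “no markup at this position” is an assumption, as the documentation above does not state the neutral return value.

```go
package main

import "fmt"

// outOfLineBidiMarkup has the same signature as OutOfLineBidiMarkup.
type outOfLineBidiMarkup func(uint64) int

// Placeholder values standing in for MarkupRLI and MarkupPDI.
const (
	markupRLI = 2 << 8
	markupPDI = 3
)

// markupFromTable returns a callback answering from a position table.
// Positions without an entry yield 0 (assumed to mean "no markup").
func markupFromTable(table map[uint64]int) outOfLineBidiMarkup {
	return func(pos uint64) int {
		return table[pos]
	}
}

func main() {
	// Hypothetical: an RTL element covers bytes 4 up to 10 of the text.
	markup := markupFromTable(map[uint64]int{
		4:  markupRLI,
		10: markupPDI,
	})
	fmt.Println(markup(4) == markupRLI, markup(5) == 0, markup(10) == markupPDI)
}
```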

type ResolvedLevels

type ResolvedLevels struct {
	// contains filtered or unexported fields
}

ResolvedLevels is a type for holding the result of phase 3.3 “Resolving Embedding Levels”.

func ResolveParagraph

func ResolveParagraph(inp io.Reader, markup OutOfLineBidiMarkup, opts ...Option) *ResolvedLevels

ResolveParagraph accepts character input and returns the resolved Bidi levels for the characters. inp should be the text of a single paragraph, but this is not enforced.

UAX#9 lists the following phases for bidi typesetting:

3.3  Resolving Embedding Levels
3.4  Reordering Resolved Levels
3.5  Shaping

Resolving means identifying runs of left-to-right or right-to-left text fragments.

The subsequent phases (3.4 and 3.5) require the text to be segmented into lines, which is not handled by this package. Reordering is done on a line by line basis and this package contains functions to support that phase, but will not help in line-breaking.

markup may be provided to inform the resolver about out-of-line Bidi delimiter locations; can be nil.

func (*ResolvedLevels) DirectionAt

func (rl *ResolvedLevels) DirectionAt(pos uint64) Direction

DirectionAt returns the text direction at byte position pos.

func (*ResolvedLevels) Reorder

func (rl *ResolvedLevels) Reorder() *Ordering

Reorder reorders runs of resolved levels and returns an ordering reflecting runs of characters with either L2R or R2L direction.

func (*ResolvedLevels) Split

func (rl *ResolvedLevels) Split(at uint64, shift0 bool) (*ResolvedLevels, *ResolvedLevels)

Split cuts a resolved level run into 2 pieces at position at. The character at position at will be the first character of the second (cut-off) piece.

Clients typically use this for line-wrapping. Cut-off level runs (= lines) can then be reordered one by one.

If parameter shift0 is set, all indices within resolved levels will be lowered by `at`, so that the first level has a left boundary of zero. This is useful for cases where the client splits the underlying text congruently to Bidi levels and characters are therefore “re-positioned”.
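The effect of Split and shift0 can be modelled on a plain list of level intervals. This is a toy model, not the package's data structure: `levelRun` and `split` are invented names, intervals are assumed to be half-open, and applying the shift to the second (cut-off) piece is an interpretation of the description above.

```go
package main

import "fmt"

// levelRun is a half-open interval [From, To) of positions sharing one level.
type levelRun struct {
	From, To uint64
	Level    int
}

// split cuts runs at position at. The position at becomes the first
// position of the tail. With shift0 set, tail positions are lowered by at.
func split(runs []levelRun, at uint64, shift0 bool) (head, tail []levelRun) {
	for _, r := range runs {
		switch {
		case r.To <= at:
			head = append(head, r)
		case r.From >= at:
			tail = append(tail, r)
		default: // the run straddles the cut and is divided
			head = append(head, levelRun{r.From, at, r.Level})
			tail = append(tail, levelRun{at, r.To, r.Level})
		}
	}
	if shift0 {
		for i := range tail {
			tail[i].From -= at
			tail[i].To -= at
		}
	}
	return head, tail
}

func main() {
	runs := []levelRun{{0, 5, 0}, {5, 12, 1}}
	head, tail := split(runs, 8, true)
	fmt.Println(head, tail)
}
```

With shift0 set, the tail can be used together with a text slice that starts at the cut, since both then count positions from zero.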

func (*ResolvedLevels) String

func (rl *ResolvedLevels) String() string

type Run

type Run struct {
	Dir    Direction // either LeftToRight or RightToLeft
	Length int64     // length of run in bytes
	// contains filtered or unexported fields
}

A Run represents a directional run of text (i.e., a continuous sequence of characters of a single direction). Type Run holds the positions of characters, not the characters themselves.

func (*Run) IsOpposite

func (r *Run) IsOpposite(dir Direction) bool

IsOpposite returns true if the run's direction is opposite to the given direction.

func (*Run) SegmentIterator

func (r *Run) SegmentIterator(reverse bool) *SegmentIterator

SegmentIterator creates an iterator for the text segments contained within a Bidi run. Assume the output device's current direction is set to left-to-right:

it := run.SegmentIterator(run.IsOpposite(bidi.LeftToRight))
for it.Next() {
    dir, from, to := it.Segment()
    var segment string
    segment = myGetSegString(from, to)  // client func to get the text-segment by positions
    if dir == bidi.RightToLeft {
        segment = reverse(segment)      // client func to reverse graphemes of a string
    }
    …                                   // visual output of segment
}

Clients of this package should proceed like this for every Run of an Ordering.

Parameter reverse should be set to true if the client expects fragments of the run in reverse order. This will be the case in situations where the direction of the run is opposite to the visual direction of an output device and the application has to handle visual re-ordering.
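The two client functions used in the loop above, myGetSegString and reverse, are left to the application. A minimal sketch of what they might look like for text held in a string follows. The variable `text` is hypothetical, positions are assumed to be byte offsets of a half-open interval, and reverse works on runes rather than graphemes, so it is only an approximation for text with combining marks.

```go
package main

import "fmt"

// text is the client-side copy of the paragraph (hypothetical content:
// "abc", a space, and three Hebrew letters).
var text = "abc \u05d0\u05d1\u05d2"

// myGetSegString returns the text between byte positions from and to.
func myGetSegString(from, to uint64) string {
	return text[from:to]
}

// reverse returns s with its runes in reverse order. A production version
// would reverse grapheme clusters instead of single runes.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	seg := myGetSegString(4, 10)
	fmt.Println(len(seg), reverse(seg) == "\u05d2\u05d1\u05d0")
}
```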

func (Run) String

func (r Run) String() string

type SegmentIterator

type SegmentIterator struct {
	//run      *Run
	Dir Direction
	// contains filtered or unexported fields
}

SegmentIterator iterates over the text segments contained in a run. Runs are the product of a re-ordering of text, which may lead to segments of text being shuffled around. A segment starts and ends at text positions of the unshuffled text. Clients will need this information to create the correct visual order of text segments.

func (*SegmentIterator) EOF

func (it *SegmentIterator) EOF() bool

EOF returns true if the iterator has read the last segment.

func (*SegmentIterator) Next

func (it *SegmentIterator) Next() bool

Next advances the iterator to the next segment of text.

func (*SegmentIterator) Segment

func (it *SegmentIterator) Segment() (Direction, uint64, uint64)

Segment returns the bounds of the current segment of text.

Directories

Path Synopsis
internal
gen
Package trie implements a trie data-structure similar to the one described by Donald E Knuth in “Programming Pearls”.
