bidi

package

v0.1.0 Published: Nov 13, 2021 License: BSD-3-Clause Imports: 13 Imported by: 0

Repository

github.com/npillmayer/uax

Documentation ¶

Overview ¶

Package bidi will implement a variant of the Unicode UAX#9 Bidirectional Algorithm. It is not fully standards-conforming, but good enough for practical purposes.

Unicode Annex UAX#9 presents an algorithm to identify directional runs within texts. The algorithm deals with characters and character runs, which UAX#9 maps to Bidi character classes. Bidi classes are then grouped according to certain rules to determine writing directions. The algorithm is not perfect and there are some cases where manual overriding will be necessary to produce correct output, but it is good enough for many real-life cases.

This package interprets some of the Bidi algorithm's rules a bit differently than strict adherence to the standard would require, because we postulate some general requirements that make it hard to conform to the standard 100%. The main requirement is a restriction on the mode of access to the input text: we operate on an `io.Reader` and do not buffer the characters read from it. As a consequence, we never travel backwards over characters and never read a character twice. However, some parts of the UAX#9 algorithm are presented as operations on “look-behinds,” as setting properties per character (Bidi class, embedding level), or as a multi-pass approach. This package employs strategies borrowed from parsing theory to arrive at the same results as the original UAX#9 algorithm.

Deviations from the Standard ¶

This package implements UAX#9 in a way that is standards-conforming for “reasonable texts”, i.e. text produced by humans for humans. Deviations from the standard are confined to two areas of the standard: error handling and legacy formatting directives.

As an example of error handling, the Bidi Annex postulates a clear maximum nesting level of bracket pairings (63 levels) per isolating run sequence. However, this package ignores this boundary in a certain case where markers ending an isolating run sequence go missing. The only clients ever to notice this deviation are most probably UAX#9 conformity tests.

We do not implement legacy formatting directives, which the Annex calls “Explicit Directional Embedding and Override Formatting Characters”, i.e. the formatting directives LRE, RLE, LRO, RLO and PDF. Unicode recommends sticking to the more modern “Isolate Formatting Characters” LRI, RLI, FSI and PDI. This package will deal with isolate run sequences produced by isolate formatting characters (or external markup) only. The need to deal with legacy formatting characters may arise in the future, but currently I do not plan to implement them.

API ¶

As the algorithms in this package do not copy any input characters, the burden of storing the text is left to the calling client. This package returns Bidi runs as intervals of text positions, which means clients must be able to reproduce the text identified by a text position. That's trivially true for text stored in a bytes buffer or string, but one can imagine other situations where this requirement involves some additional effort, like an input stream read from a file.

Attention: Work in progress, not yet fully functional.

________________________________________________________________________________

BSD License ¶

Copyright © 2019–2021, Norbert Pillmayer ¶

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of this software nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Index ¶

Constants
func T() tracing.Trace
type Direction
- func (dir Direction) String() string
type Option
type Ordering
type OutOfLineBidiMarkup
type ResolvedLevels
- func ResolveParagraph(inp io.Reader, markup OutOfLineBidiMarkup, opts ...Option) *ResolvedLevels
type Run
type SegmentIterator

Constants ¶

const (
	MarkupLRI    int = int(bidi.LRI) << 8
	MarkupRLI    int = int(bidi.RLI) << 8
	MarkupPDI    int = int(bidi.PDI)
	MarkupPDILRI int = MarkupPDI | MarkupLRI
	MarkupPDIRLI int = MarkupPDI | MarkupRLI
)

Constants for clients to use as OutOfLineBidiMarkup return values.
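The constant definitions above suggest a simple packing: a closing PDI occupies the low byte and an opening LRI or RLI is shifted into the next byte, so that one return value can express "close one isolate and open another" at the same position. The sketch below mirrors that layout with local stand-ins. The numeric class values and the `decode` helper are assumptions for illustration only; how the resolver actually unpacks the value is not documented here.

```go
package main

import "fmt"

// Placeholder class values; the real ones come from the bidi class
// constants referenced in the package source.
const (
	classLRI = 20
	classRLI = 21
	classPDI = 23
)

// Local mirrors of the package constants, built the same way.
const (
	markupLRI    = classLRI << 8
	markupRLI    = classRLI << 8
	markupPDI    = classPDI
	markupPDILRI = markupPDI | markupLRI
	markupPDIRLI = markupPDI | markupRLI
)

// decode splits a packed markup value into its two parts (hypothetical helper).
// closes reports a PDI in the low byte; opens is the class in the high byte, or 0.
func decode(m int) (closes bool, opens int) {
	return m&0xff == classPDI, m >> 8
}

func main() {
	for _, m := range []int{markupLRI, markupPDI, markupPDIRLI} {
		closes, opens := decode(m)
		fmt.Printf("markup=%#06x closes=%v opens=%d\n", m, closes, opens)
	}
}
```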

const BD16MaxNesting = 63

BD16MaxNesting is the maximum stack depth for rule BD16 as defined in UAX#9.

const UnicodeVersion = "13.0.0"

UnicodeVersion is the UAX#9 version this implementation follows.

Variables ¶

This section is empty.

Functions ¶

func T ¶

func T() tracing.Trace

T traces to a global core tracer.

Types ¶

type Direction ¶

type Direction int

A Direction indicates the overall flow of text.

const (
	// LeftToRight indicates a requirement to order the characters of a script
	// from left to right.
	LeftToRight Direction = iota
	// RightToLeft indicates a requirement to order the characters of a script
	// from right to left.
	RightToLeft
)

func (Direction) String ¶

func (dir Direction) String() string

String returns either "L2R" or "R2L".

type Option ¶

type Option func(p *bidiScanner)

Option configures a Bidi algorithm.
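Option follows the functional-options pattern: each option is a function that mutates the scanner before resolving starts. The sketch below shows the pattern in isolation. The `scanner` struct, its fields and the `with...` constructors are invented for this example, because the fields of the unexported `bidiScanner` are not documented.

```go
package main

import "fmt"

type direction int

const (
	leftToRight direction = iota
	rightToLeft
)

// scanner is a hypothetical stand-in for the unexported bidiScanner.
type scanner struct {
	defaultDir   direction
	ignoreParSep bool
}

// option has the same shape as the package's Option type.
type option func(*scanner)

func withDefaultDirection(d direction) option {
	return func(s *scanner) { s.defaultDir = d }
}

func withIgnoreParagraphSeparators(b bool) option {
	return func(s *scanner) { s.ignoreParSep = b }
}

// newScanner starts from defaults and applies the options in order.
func newScanner(opts ...option) *scanner {
	s := &scanner{defaultDir: leftToRight}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := newScanner(withDefaultDirection(rightToLeft), withIgnoreParagraphSeparators(true))
	fmt.Printf("%+v\n", *s)
}
```

The variadic parameter lets callers pass no options at all and get the defaults, which matches the signature of ResolveParagraph.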

func DefaultDirection ¶

func DefaultDirection(dir Direction) Option

DefaultDirection sets the outer embedding level for a paragraph (LeftToRight is the normal default).

func IgnoreParagraphSeparators ¶

func IgnoreParagraphSeparators(b bool) Option

IgnoreParagraphSeparators determines whether paragraph separators (i.e., newlines et al.) are to be ignored and interpreted as whitespace instead. The default value for this option is `false`, which effectively treats any paragraph separator as `end of input`. In this case the paragraph separator is cut off from the input.

func RecognizeLegacy ¶

func RecognizeLegacy(b bool) Option

RecognizeLegacy is not implemented. It was intended to make the resolver recognize legacy formatting, i.e. LRE, RLE, LRO, RLO, PDF. However, I changed my mind and currently do not intend to support legacy formatting types, thus setting this option has no effect.

func TestMode ¶

func TestMode(b bool) Option

TestMode will set up the scanner to recognize UPPERCASE letters as having R2L class. This is a common pattern in bidi algorithm development and testing. Additionally we follow a convention of the UAX#9 algorithm documentation: “The invisible, zero-width formatting characters LRI, RLI, and PDI are represented with the symbols '>', '<', and '=', respectively.” Thus it is possible to replay the examples of section 3.4 of UAX#9:

<car MEANS CAR.=

or

DID YOU SAY ’>he said “<car MEANS CAR=”=‘?
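The test-mode convention can be sketched as a small classifier. This is an illustration only: the `classify` function and its string labels are made up here, and the real scanner certainly distinguishes more classes. The mapping of uppercase letters to R and of '>', '<', '=' to LRI, RLI, PDI is taken from the description above; treating lowercase letters as L follows from their regular Unicode Bidi class.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify maps a rune to a Bidi class label under the test-mode convention.
// Anything not covered by the convention is reported as "other".
func classify(r rune) string {
	switch {
	case r == '>':
		return "LRI"
	case r == '<':
		return "RLI"
	case r == '=':
		return "PDI"
	case unicode.IsUpper(r):
		return "R"
	case unicode.IsLower(r):
		return "L"
	default:
		return "other"
	}
}

func main() {
	for _, r := range "<car MEANS CAR.=" {
		fmt.Printf("%q:%s ", r, classify(r))
	}
	fmt.Println()
}
```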

type Ordering ¶

type Ordering struct {
	Runs []Run
}

An Ordering holds the computed visual order of bidi-runs of a paragraph of text.

type OutOfLineBidiMarkup ¶

type OutOfLineBidiMarkup func(uint64) int

OutOfLineBidiMarkup is queried during read of input text for out-of-line Bidi delimiters (LRI, RLI, PDI). Such markup may result, e.g., from HTML attributes or CSS styles. It receives a text position and—if appropriate—returns a Bidi class to be inserted. It will be treated by the resolver as a Bidi delimiter of byte-length zero.
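One way a client might build such a callback is from a table of positions collected while parsing its own markup. The sketch below uses a local `markupFunc` type with the same signature and placeholder markup values. Returning 0 for "no markup at this position" is an assumption of this example; the documentation does not say which value signals that case.

```go
package main

import "fmt"

// markupFunc has the same signature as OutOfLineBidiMarkup.
type markupFunc func(uint64) int

// Placeholder markup values; a real client would use the package's
// MarkupLRI, MarkupRLI and MarkupPDI constants instead.
const (
	markupRLI = 21 << 8
	markupPDI = 23
)

// markupFromTable returns a callback that looks up each queried position.
// Positions missing from the table yield 0 (assumed to mean "no markup").
func markupFromTable(table map[uint64]int) markupFunc {
	return func(pos uint64) int {
		return table[pos]
	}
}

func main() {
	// Hypothetical: a right-to-left span covering bytes [4,10) of the text,
	// e.g. derived from an HTML element with dir="rtl".
	markup := markupFromTable(map[uint64]int{
		4:  markupRLI,
		10: markupPDI,
	})
	for _, pos := range []uint64{0, 4, 7, 10} {
		fmt.Printf("pos %d -> %#x\n", pos, markup(pos))
	}
}
```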

type ResolvedLevels ¶

type ResolvedLevels struct {
	// contains filtered or unexported fields
}

ResolvedLevels is a type for holding the result of phase 3.3 “Resolving Embedding Levels”.

func ResolveParagraph ¶

func ResolveParagraph(inp io.Reader, markup OutOfLineBidiMarkup, opts ...Option) *ResolvedLevels

ResolveParagraph accepts character input and returns a BiDi ordering for the characters. inp should be the text of a single paragraph, but this is not enforced.

UAX#9 lists the following phases for bidi typesetting:

3.3  Resolving Embedding Levels
3.4  Reordering Resolved Levels
3.5  Shaping

Resolving means identifying runs of left-to-right or right-to-left text fragments.

The subsequent phases (3.4 and 3.5) require the text to be segmented into lines, which is not handled by this package. Reordering is done on a line by line basis and this package contains functions to support that phase, but will not help in line-breaking.

markup may be provided to inform the resolver about out-of-line Bidi delimiter locations; can be nil.

func (*ResolvedLevels) DirectionAt ¶

func (rl *ResolvedLevels) DirectionAt(pos uint64) Direction

DirectionAt returns the text direction at byte position pos.

func (*ResolvedLevels) Reorder ¶

func (rl *ResolvedLevels) Reorder() *Ordering

Reorder reorders runs of resolved levels and returns an ordering reflecting runs of characters with either L2R or R2L direction.
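For orientation, the heart of UAX#9 phase 3.4 is rule L2: from the highest level down to the lowest odd level, reverse every contiguous sequence of characters at that level or higher. The sketch below applies the rule to a plain slice of per-character levels. It is a textbook rendering, not this package's implementation, which works on runs of positions rather than on individual characters; the `reorder` function is hypothetical.

```go
package main

import "fmt"

// reorder returns the visual order of logical indices for one line,
// given the resolved embedding level of each character (UAX#9 rule L2).
func reorder(levels []int) []int {
	order := make([]int, len(levels))
	for i := range order {
		order[i] = i
	}
	highest, lowestOdd := 0, -1
	for _, l := range levels {
		if l > highest {
			highest = l
		}
		if l%2 == 1 && (lowestOdd == -1 || l < lowestOdd) {
			lowestOdd = l
		}
	}
	if lowestOdd == -1 {
		return order // purely left-to-right, nothing to reverse
	}
	for lvl := highest; lvl >= lowestOdd; lvl-- {
		for i := 0; i < len(levels); {
			if levels[i] < lvl {
				i++
				continue
			}
			j := i
			for j < len(levels) && levels[j] >= lvl {
				j++
			}
			// Reverse the sequence order[i:j] in place.
			for a, b := i, j-1; a < b; a, b = a+1, b-1 {
				order[a], order[b] = order[b], order[a]
			}
			i = j
		}
	}
	return order
}

func main() {
	fmt.Println(reorder([]int{0, 0, 1, 1, 1, 0}))
	fmt.Println(reorder([]int{0, 1, 2, 2, 1, 0}))
}
```

In the second call the two characters at level 2 keep their relative order inside the reversed level-1 sequence, which is how a left-to-right number inside right-to-left text stays readable.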

func (*ResolvedLevels) Split ¶

func (rl *ResolvedLevels) Split(at uint64, shift0 bool) (*ResolvedLevels, *ResolvedLevels)

Split cuts a resolved level run into 2 pieces at position at. The character at position at will be the first character of the second (cut-off) piece.

Clients typically use this for line-wrapping. Cut-off level runs (= lines) can then be reordered one by one.

If parameter shift0 is set, all indices within resolved levels will be lowered by `at`, so that the first level has a left boundary of zero. This is useful for cases where the client splits the underlying text congruently to Bidi levels and characters are therefore “re-positioned”.
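The effect of Split can be pictured on a list of level intervals. The sketch below is a hypothetical model: the `levelRun` type, the half-open interval convention and the assumption that shift0 applies to the second (cut-off) piece are choices made for this example, not documented behaviour.

```go
package main

import "fmt"

// levelRun is a hypothetical interval [From, To) with one embedding level.
type levelRun struct {
	From, To uint64
	Level    int
}

// split cuts runs at position at; the character at `at` starts the tail.
// With shift0 set, the tail is re-based so that it starts at position 0.
func split(runs []levelRun, at uint64, shift0 bool) (head, tail []levelRun) {
	for _, r := range runs {
		switch {
		case r.To <= at:
			head = append(head, r)
		case r.From >= at:
			tail = append(tail, r)
		default: // the run straddles the cut and is divided in two
			head = append(head, levelRun{From: r.From, To: at, Level: r.Level})
			tail = append(tail, levelRun{From: at, To: r.To, Level: r.Level})
		}
	}
	if shift0 {
		for i := range tail {
			tail[i].From -= at
			tail[i].To -= at
		}
	}
	return head, tail
}

func main() {
	runs := []levelRun{{0, 5, 0}, {5, 12, 1}}
	head, tail := split(runs, 8, true)
	fmt.Println(head)
	fmt.Println(tail)
}
```

A line-breaking client would call such a function once per break position and then reorder each resulting line separately.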

func (*ResolvedLevels) String ¶

func (rl *ResolvedLevels) String() string

type Run ¶

type Run struct {
	Dir    Direction // either LeftToRight or RightToLeft
	Length int64     // length of run in bytes
	// contains filtered or unexported fields
}

A Run represents a directional run of text (i.e., a continuous sequence of characters of a single direction). Type Run holds the positions of characters, not the characters themselves.

func (*Run) IsOpposite ¶

func (r *Run) IsOpposite(dir Direction) bool

IsOpposite returns true if the run's direction is opposite to the given direction.

func (*Run) SegmentIterator ¶

func (r *Run) SegmentIterator(reverse bool) *SegmentIterator

SegmentIterator creates an iterator for the text segments contained within a Bidi run. Assume the output device's current direction is set to left-to-right:

it := run.SegmentIterator(run.IsOpposite(bidi.LeftToRight))
for it.Next() {
    dir, from, to := it.Segment()
    segment := myGetSegString(from, to) // client func to get the text segment by positions
    if dir == bidi.RightToLeft {
        segment = reverse(segment)      // client func to reverse graphemes of a string
    }
    // … visual output of segment
}

Clients of this package should proceed like this for every Run of an Ordering.

Parameter reverse should be set to true if the client expects fragments of the run in reverse order. This will be the case in situations where the direction of the run is opposite to the visual direction of an output device and the application has to handle visual re-ordering.
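The two client functions used in the loop above are left to the application. A minimal sketch of what they might look like is given below. The `seg` type, `reverseRunes` and `visualLine` are hypothetical names. Note that `reverseRunes` reverses by code point, not by grapheme cluster, so combining marks would end up detached; a production client would need a grapheme-aware reversal.

```go
package main

import (
	"fmt"
	"strings"
)

// seg is a hypothetical record of what a segment iterator reports:
// a direction flag and byte positions [From, To) into the client's text.
type seg struct {
	RTL      bool
	From, To uint64
}

// reverseRunes reverses a string code point by code point.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// visualLine assembles the display string for a left-to-right output device.
func visualLine(text string, segs []seg) string {
	var sb strings.Builder
	for _, s := range segs {
		part := text[s.From:s.To]
		if s.RTL {
			part = reverseRunes(part)
		}
		sb.WriteString(part)
	}
	return sb.String()
}

func main() {
	// Uppercase stands in for right-to-left letters, as in test mode.
	text := "abc DEF"
	segs := []seg{{RTL: false, From: 0, To: 4}, {RTL: true, From: 4, To: 7}}
	fmt.Println(visualLine(text, segs))
}
```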

func (Run) String ¶

func (r Run) String() string

type SegmentIterator ¶

type SegmentIterator struct {
	//run      *Run
	Dir Direction
	// contains filtered or unexported fields
}

SegmentIterator iterates over the text segments contained in a run. Runs are the product of a re-ordering of text, which may lead to segments of text being shuffled around. A segment starts and ends at text positions of the unshuffled text. Clients will need this information to create the correct visual order of text segments.

func (*SegmentIterator) EOF ¶

func (it *SegmentIterator) EOF() bool

EOF returns true if the iterator has read the last segment.

func (*SegmentIterator) Next ¶

func (it *SegmentIterator) Next() bool

Next advances the iterator to the next segment of text.

func (*SegmentIterator) Segment ¶

func (it *SegmentIterator) Segment() (Direction, uint64, uint64)

Segment returns the bounds of the current segment of text.

Directories ¶

Path	Synopsis
internal
gen
trie	Package trie implements a trie data-structure similar to the one described by Donald E Knuth in “Programming Pearls”.