stemmer

package
v0.14.4
Published: Dec 15, 2020 License: MIT Imports: 1 Imported by: 0

Documentation

Overview

http://snowballstem.org/otherapps/schinke/
http://caio.ueberalles.net/a_stemming_algorithm_for_latin_text_databases-schinke_et_al.pdf

The Schinke Latin stemming algorithm is described in Schinke R, Greengrass M, Robertson AM and Willett P (1996), A stemming algorithm for Latin text databases, Journal of Documentation, 52: 172-187.

It stems each word to two forms, a noun stem and a verb stem. For example:

            NOUN        VERB
            ----        ----
aquila      aquil       aquila
portat      portat      porta
portis      port        por

Here (slightly reformatted) are the rules of the stemmer:

1. (start)

  2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u', respectively.
  3. If the word ends in '-que' then, if the word is on the list shown in Figure 4, write the original word to both the noun-based and verb-based stem dictionaries and go to 8; otherwise remove '-que'.

    [Figure 4 was

    atque quoque neque itaque absque apsque abusque adaeque adusque denique deque susque oblique peraeque plenisque quandoque quisque quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque quibusque quosque quasque quotusquisque quousque ubique undique usque uterque utique utroque utribique torque coque concoque contorque detorque decoque excoque extorque obtorque optorque retorque recoque attorque incoque intorque praetorque]

  4. Match the end of the word against the suffix list shown in Figure 6(a), removing the longest matching suffix (if any).

    [Figure 6(a) was

    -ibus -ius -ae -am -as -em -es -ia -is -nt -os -ud -um -us -a -e -i -o -u]

  5. If the resulting stem contains at least two characters then write this stem to the noun-based stem dictionary.
  6. Match the end of the word against the suffix list shown in Figure 6(b), identifying the longest matching suffix (if any).

    [Figure 6(b) was

    -iuntur -beris -erunt -untur -iunt -mini -ntur -stis -bor -ero -mur -mus -ris -sti -tis -tur -unt -bo -ns -nt -ri -m -r -s -t]

    If any of the following suffixes are found then convert them as shown:

    '-iuntur', '-erunt', '-untur', '-iunt', and '-unt' to '-i'; '-beris', '-bor', and '-bo' to '-bi'; '-ero' to '-eri'

    otherwise remove the suffix in the normal way.

  7. If the resulting stem contains at least two characters then write this stem to the verb-based stem dictionary.

8. (end)
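
The rules above can be sketched in Go roughly as follows. This is an illustrative reading of the published rules, not this package's implementation: the names schinkeStems, longestSuffix and toSet are hypothetical, the input is assumed to be a single lowercase ASCII word, the verb step is assumed to start from the word as it stands after the '-que' step, and an empty string is used to signal that no stem was written.

```go
package main

import (
	"fmt"
	"strings"
)

// queWords holds the Figure 4 list: words whose final "que" is kept.
var queWords = toSet(`atque quoque neque itaque absque apsque abusque adaeque
	adusque denique deque susque oblique peraeque plenisque quandoque quisque
	quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque
	quibusque quosque quasque quotusquisque quousque ubique undique usque
	uterque utique utroque utribique torque coque concoque contorque detorque
	decoque excoque extorque obtorque optorque retorque recoque attorque
	incoque intorque praetorque`)

// Figure 6(a) and 6(b), ordered longest first so the first hit is the longest.
var nounSuffixes = []string{"ibus", "ius", "ae", "am", "as", "em", "es", "ia",
	"is", "nt", "os", "ud", "um", "us", "a", "e", "i", "o", "u"}

var verbSuffixes = []string{"iuntur", "beris", "erunt", "untur", "iunt", "mini",
	"ntur", "stis", "bor", "ero", "mur", "mus", "ris", "sti", "tis", "tur",
	"unt", "bo", "ns", "nt", "ri", "m", "r", "s", "t"}

// verbRewrites lists the verb suffixes that are converted instead of removed.
var verbRewrites = map[string]string{
	"iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
	"beris": "bi", "bor": "bi", "bo": "bi",
	"ero": "eri",
}

// toSet turns a whitespace-separated word list into a lookup set.
func toSet(list string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(list) {
		set[w] = true
	}
	return set
}

// longestSuffix returns the first (and therefore longest) suffix that the
// word ends with, or "" when none matches.
func longestSuffix(word string, suffixes []string) string {
	for _, s := range suffixes {
		if strings.HasSuffix(word, s) {
			return s
		}
	}
	return ""
}

// schinkeStems returns the noun-based and verb-based stems of a lowercase
// word. A returned "" means the stem was shorter than two characters.
func schinkeStems(word string) (noun, verb string) {
	// Step 2: j -> i, v -> u.
	w := strings.NewReplacer("j", "i", "v", "u").Replace(word)

	// Step 3: handle the enclitic "-que".
	if strings.HasSuffix(w, "que") {
		if queWords[w] {
			return w, w
		}
		w = strings.TrimSuffix(w, "que")
	}

	// Steps 4 and 5: noun stem.
	if stem := strings.TrimSuffix(w, longestSuffix(w, nounSuffixes)); len(stem) >= 2 {
		noun = stem
	}

	// Steps 6 and 7: verb stem, converting some suffixes instead of removing them.
	suffix := longestSuffix(w, verbSuffixes)
	if stem := strings.TrimSuffix(w, suffix) + verbRewrites[suffix]; len(stem) >= 2 {
		verb = stem
	}
	return noun, verb
}

func main() {
	for _, w := range []string{"aquila", "portat", "portis"} {
		noun, verb := schinkeStems(w)
		fmt.Printf("%-8s noun=%-8s verb=%s\n", w, noun, verb)
	}
}
```

Run on aquila, portat and portis, the sketch yields the noun and verb stems listed in the example table in the overview.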

Functions

func StemCanonical

func StemCanonical(c string) string

StemCanonical takes the short form of a canonical name and returns the stemmed specific and infraspecific epithets. It assumes that the input string has the following properties:

  1. The string has no leading or trailing spaces.
  2. All spaces within the string are single.
  3. All characters in the string are ASCII, with the exception of the hybrid sign.
  4. The string always starts with a capitalized word.
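
Because StemCanonical assumes these properties rather than checking them, a caller may want to verify them first. The helper below is a hypothetical sketch and is not part of this package; the name meetsAssumptions is invented for illustration, the hybrid sign is assumed to be U+00D7, and "capitalized" is assumed to mean that the first character is an ASCII uppercase letter.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// hybridSign is the multiplication sign used in hybrid names (assumed U+00D7).
const hybridSign = '\u00D7'

// meetsAssumptions reports whether s satisfies the four input properties
// listed for StemCanonical. It is a hypothetical helper, not a package API.
func meetsAssumptions(s string) bool {
	if s == "" {
		return false
	}
	// 1. No leading or trailing whitespace.
	if strings.TrimSpace(s) != s {
		return false
	}
	// 2. No run of two or more spaces.
	if strings.Contains(s, "  ") {
		return false
	}
	// 3. ASCII only, apart from the hybrid sign.
	for _, r := range s {
		if r > unicode.MaxASCII && r != hybridSign {
			return false
		}
	}
	// 4. Starts with a capitalized word.
	return s[0] >= 'A' && s[0] <= 'Z'
}

func main() {
	for _, name := range []string{"Betula alba", "betula alba", "Betula  alba"} {
		fmt.Printf("%q: %v\n", name, meetsAssumptions(name))
	}
}
```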

Types

type StemmedWord

type StemmedWord struct {
	Orig   string
	Stem   string
	Suffix string
}

func Stem

func Stem(wrd string) StemmedWord

Stem takes a word and, assuming the word is a noun, removes its Latin suffix if one is detected.
