stemmer

package

v0.14.4 Published: Dec 15, 2020 License: MIT Imports: 1 Imported by: 0

Repository

gitlab.com/gogna/gnparser

Documentation

Overview

http://snowballstem.org/otherapps/schinke/
http://caio.ueberalles.net/a_stemming_algorithm_for_latin_text_databases-schinke_et_al.pdf

The Schinke Latin stemming algorithm is described in Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.

It stems each word to two forms, a noun-based stem and a verb-based stem. For example,

            NOUN        VERB
            ----        ----
aquila      aquil       aquila
portat      portat      porta
portis      port        por

Here (slightly reformatted) are the rules of the stemmer:

1. (start)

2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u', respectively.

3. If the word ends in '-que', then if the word is on the list shown in Figure 4, write the original word to both the noun-based and verb-based stem dictionaries and go to 8; else remove '-que'.
[Figure 4 was
atque quoque neque itaque absque apsque abusque adaeque adusque denique deque susque oblique peraeque plenisque quandoque quisque quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque quibusque quosque quasque quotusquisque quousque ubique undique usque uterque utique utroque utribique torque coque concoque contorque detorque decoque excoque extorque obtorque optorque retorque recoque attorque incoque intorque praetorque]

4. Match the end of the word against the suffix list shown in Figure 6(a), removing the longest matching suffix (if any).
[Figure 6(a) was
-ibus -ius -ae -am -as -em -es -ia -is -nt -os -ud -um -us -a -e -i -o -u]

5. If the resulting stem contains at least two characters, then write this stem to the noun-based stem dictionary.

6. Match the end of the word against the suffix list shown in Figure 6(b), identifying the longest matching suffix (if any).
[Figure 6(b) was
-iuntur -beris -erunt -untur -iunt -mini -ntur -stis -bor -ero -mur -mus -ris -sti -tis -tur -unt -bo -ns -nt -ri -m -r -s -t]
If any of the following suffixes are found, then convert them as shown:
'-iuntur', '-erunt', '-untur', '-iunt', and '-unt' to '-i'; '-beris', '-bor', and '-bo' to '-bi'; '-ero' to '-eri';
else remove the suffix in the normal way.

7. If the resulting stem contains at least two characters, then write this stem to the verb-based stem dictionary.

8. (end)
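
The rules above can be sketched in Go as follows. This is a minimal illustration, not the code of this package: the function name schinkeStems and the helper names are hypothetical, the input is assumed to be a lowercase ASCII word, and returning an empty string when a stem has fewer than two characters is an assumed way of expressing "do not write this stem".

```go
package main

import (
	"fmt"
	"strings"
)

// queWords holds the Figure 4 exceptions, whose final "-que" is kept.
var queWords = map[string]bool{}

func init() {
	const figure4 = "atque quoque neque itaque absque apsque abusque adaeque adusque denique deque susque oblique peraeque plenisque quandoque quisque quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque quibusque quosque quasque quotusquisque quousque ubique undique usque uterque utique utroque utribique torque coque concoque contorque detorque decoque excoque extorque obtorque optorque retorque recoque attorque incoque intorque praetorque"
	for _, w := range strings.Fields(figure4) {
		queWords[w] = true
	}
}

// Figure 6(a) and Figure 6(b). Both lists are ordered longest first,
// so the first suffix that matches is also the longest one.
var nounSuffixes = []string{"ibus", "ius", "ae", "am", "as", "em", "es", "ia",
	"is", "nt", "os", "ud", "um", "us", "a", "e", "i", "o", "u"}

var verbSuffixes = []string{"iuntur", "beris", "erunt", "untur", "iunt", "mini",
	"ntur", "stis", "bor", "ero", "mur", "mus", "ris", "sti", "tis", "tur",
	"unt", "bo", "ns", "nt", "ri", "m", "r", "s", "t"}

// verbRewrites lists the verb suffixes that are converted instead of removed.
var verbRewrites = map[string]string{
	"iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
	"beris": "bi", "bor": "bi", "bo": "bi",
	"ero": "eri",
}

// longestSuffix returns the first suffix in the list that ends the word,
// or "" if none does.
func longestSuffix(word string, suffixes []string) string {
	for _, s := range suffixes {
		if strings.HasSuffix(word, s) {
			return s
		}
	}
	return ""
}

// schinkeStems (hypothetical name) returns the noun-based and verb-based
// stems of a lowercase word. Assumption: "" means the stem was rejected by
// rule 5 or rule 7 for having fewer than two characters.
func schinkeStems(word string) (noun, verb string) {
	// Rule 2: normalize j/v to i/u.
	word = strings.NewReplacer("j", "i", "v", "u").Replace(word)

	// Rule 3: keep listed "-que" words whole, otherwise strip "-que".
	if strings.HasSuffix(word, "que") {
		if queWords[word] {
			return word, word
		}
		word = strings.TrimSuffix(word, "que")
	}

	// Rules 4 and 5: remove the longest noun suffix.
	// len counts bytes, which equals characters for ASCII input.
	if n := strings.TrimSuffix(word, longestSuffix(word, nounSuffixes)); len(n) >= 2 {
		noun = n
	}

	// Rules 6 and 7: remove or convert the longest verb suffix.
	suffix := longestSuffix(word, verbSuffixes)
	if v := strings.TrimSuffix(word, suffix) + verbRewrites[suffix]; len(v) >= 2 {
		verb = v
	}
	return noun, verb
}

func main() {
	for _, w := range []string{"aquila", "portat", "portis", "atque"} {
		noun, verb := schinkeStems(w)
		fmt.Printf("%-8s noun=%-8s verb=%s\n", w, noun, verb)
	}
}
```

For the three words in the example table (aquila, portat, portis), this sketch yields the same noun and verb stems as the table.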

Functions

func StemCanonical

func StemCanonical(c string) string

StemCanonical takes the short form of a canonical name and returns the stemmed specific and infraspecific epithets. It assumes the following properties of the string:

The string has no leading or trailing spaces.
All spaces within the string are single.
All characters in the string are ASCII, with the exception of the hybrid sign.
The string always starts with a capitalized word.
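
Because these properties are assumed rather than enforced, a caller may want to check them before calling StemCanonical. The sketch below shows one way to do so. The function checkCanonical is a hypothetical helper, not part of this package, and it assumes the hybrid sign is the multiplication sign '×' (U+00D7) and that "capitalized" means an initial ASCII uppercase letter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

// hybridSign is assumed to be the multiplication sign U+00D7.
const hybridSign = '×'

// checkCanonical (hypothetical helper) reports the first violated
// assumption, or nil if the string satisfies all of them.
func checkCanonical(s string) error {
	switch {
	case s == "":
		return errors.New("empty string")
	case s != strings.TrimSpace(s):
		return errors.New("leading or trailing space")
	case strings.Contains(s, "  "):
		return errors.New("consecutive spaces")
	case s[0] < 'A' || s[0] > 'Z':
		return errors.New("first word is not capitalized")
	}
	// Every rune must be ASCII, except for the hybrid sign.
	for _, r := range s {
		if r > unicode.MaxASCII && r != hybridSign {
			return fmt.Errorf("unexpected non-ASCII character %q", r)
		}
	}
	return nil
}

func main() {
	for _, s := range []string{"Betula alba", " Betula alba", "Betula  alba", "betula alba"} {
		fmt.Printf("%q: %v\n", s, checkCanonical(s))
	}
}
```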

Types

type StemmedWord

type StemmedWord struct {
	Orig   string
	Stem   string
	Suffix string
}

func Stem

func Stem(wrd string) StemmedWord

Stem takes a word and, assuming the word is a noun, removes its Latin suffix if one is detected.

Source Files

stemmer.go