metaphone3

package module
v0.0.0-...-5fe87fc Latest Latest
Published: Sep 3, 2019 License: BSD-3-Clause Imports: 2 Imported by: 3

README

metaphone3 - a sound-a-like index for names

Metaphone3 is a more accurate version of the original Soundex algorithm. It is designed so that similar-sounding words in American English share the same keys. For example, Smith, Smyth, Smithe, and Smythe all encode to SM0 as the primary key and XMT as the secondary (alternate) key, whereas Schmidt encodes to XMT as the primary key with no secondary.

Searching for matches where either primary or secondary matches will give the best results.
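The matching rule above can be sketched as follows. This is a minimal illustration, not part of the package: the keyPair type and the soundsAlike helper are hypothetical names, and the rule "any non-empty key of one word equals any non-empty key of the other" is one reasonable reading of "either primary or secondary matches". The key values are the ones quoted above for Smith and Schmidt.

```go
package main

import "fmt"

// keyPair holds the two keys produced for one word.
// Hypothetical helper type; it is not part of the metaphone3 package.
type keyPair struct {
	Primary   string
	Secondary string // empty when the word has only one encoding
}

// soundsAlike reports whether any non-empty key of a equals any key of b.
// Empty keys never match, so two words without keys are not considered alike.
func soundsAlike(a, b keyPair) bool {
	for _, key := range []string{a.Primary, a.Secondary} {
		if key == "" {
			continue
		}
		if key == b.Primary || key == b.Secondary {
			return true
		}
	}
	return false
}

func main() {
	smith := keyPair{Primary: "SM0", Secondary: "XMT"} // keys quoted in this README
	schmidt := keyPair{Primary: "XMT"}                 // no secondary key

	// Smith's secondary key equals Schmidt's primary key.
	fmt.Println(soundsAlike(smith, schmidt))
}
```

Comparing only the primary keys would miss this pair, which is why checking both keys tends to give better recall.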

You can read more about Metaphone on Wikipedia.

Usage

Basic usage of the encoder looks like this:

	e := &metaphone3.Encoder{}
	prim, second := e.Encode("Smith")

An Encoder is designed to be reused to reduce memory pressure at scale and has three settable options. An Encoder is not thread-safe, so do not share one Encoder across goroutines. If you're comparing encoded values, both must have been produced with exactly the same options.
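One way to respect the goroutine restriction while still reusing encoders is to give each worker goroutine its own instance. The sketch below assumes only the documented method shape, Encode(in string) (primary, secondary string); the phoneticEncoder interface, the encodeAll helper, and the stubEncoder stand-in are hypothetical names used so the example runs without the package. In real code the factory would presumably return &metaphone3.Encoder{} configured with the same options for every worker.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// phoneticEncoder mirrors the documented Encode method signature.
// Hypothetical interface; a *metaphone3.Encoder should satisfy it.
type phoneticEncoder interface {
	Encode(in string) (primary, secondary string)
}

// keyPair stores the result for one input word (hypothetical type).
type keyPair struct {
	Primary, Secondary string
}

// encodeAll encodes names with a fixed number of workers. Each worker
// builds its own encoder from newEncoder and reuses it for every job,
// so no encoder is ever shared between goroutines.
func encodeAll(names []string, workers int, newEncoder func() phoneticEncoder) []keyPair {
	if workers < 1 {
		workers = 1
	}
	out := make([]keyPair, len(names))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			enc := newEncoder() // one encoder per goroutine
			for i := range jobs {
				p, s := enc.Encode(names[i])
				out[i] = keyPair{Primary: p, Secondary: s} // each index is written once
			}
		}()
	}
	for i := range names {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

// stubEncoder is a stand-in that upper-cases its input.
// It is NOT a phonetic encoder; it only keeps this sketch self-contained.
type stubEncoder struct{}

func (*stubEncoder) Encode(in string) (string, string) {
	return strings.ToUpper(in), ""
}

func main() {
	keys := encodeAll([]string{"smith", "schmidt"}, 2, func() phoneticEncoder {
		return &stubEncoder{}
	})
	fmt.Println(keys)
}
```

Creating every encoder through the same factory also keeps the options identical across workers, which the comparison requirement above depends on.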

| Option | Type | Default | Purpose |
| --- | --- | --- | --- |
| EncodeExact | bool | false | Setting EncodeExact to true tightens the output so that certain sounds are differentiated, e.g. more separation between hard "G" sounds and hard "K" sounds. |
| EncodeVowels | bool | false | Setting EncodeVowels to true includes non-first-letter vowel sounds in the output. By default only consonant sounds are included. |
| MaxLength | int | metaphone3.DefaultMaxLength | Limits the output length for long words, which reduces the cycles and memory spent on processing them. |
| metaphone3.DefaultMaxLength | int | 8 | If MaxLength is 0 (or negative), it defaults to metaphone3.DefaultMaxLength, which starts as 8 (like the Java implementation). |
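Because a MaxLength of 0 and a MaxLength of 8 behave the same under the default, checking that two sets of keys were produced with "the exact same options" is easiest after normalizing. The sketch below is an illustration under assumptions: the options struct, normalized, and sameOptions are hypothetical helpers that mirror the documented fields, and defaultMaxLength assumes metaphone3.DefaultMaxLength has been left at its initial value of 8.

```go
package main

import "fmt"

// defaultMaxLength mirrors the documented initial value of
// metaphone3.DefaultMaxLength. Assumption: the variable was not changed.
const defaultMaxLength = 8

// options mirrors the three documented Encoder settings.
// Hypothetical type used only for this comparison sketch.
type options struct {
	EncodeExact  bool
	EncodeVowels bool
	MaxLength    int
}

// normalized applies the documented rule that a MaxLength of zero or
// less falls back to the default.
func (o options) normalized() options {
	if o.MaxLength <= 0 {
		o.MaxLength = defaultMaxLength
	}
	return o
}

// sameOptions reports whether keys produced under a and b can be compared.
func sameOptions(a, b options) bool {
	return a.normalized() == b.normalized()
}

func main() {
	fmt.Println(sameOptions(options{}, options{MaxLength: 8}))       // equal after normalization
	fmt.Println(sameOptions(options{}, options{EncodeVowels: true})) // vowel handling differs
}
```

Storing the normalized options next to a persisted index is one way to detect an accidental mismatch before comparing keys.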

Additional usage details are available in the godocs.

Basis for algorithm

The reference implementation of metaphone3 in Java can be found here.

Differences from v2.1.3 Java Implementation

  • Fix ROBILL
  • Fix lengths for very long words, where in certain situations the primary or secondary would get too long and the other would be truncated (e.g. Villafranca when EncodeVowels is true)
  • Fix JAKOB
  • Fix ending CIAS and CIOS (e.g. MECIAS)
  • Fix words starting with HARGER
  • Fix SUPERNODE (prevent D from being silent)

Documentation

Overview

Package metaphone3 is a Go implementation of the Metaphone 3 algorithm. Metaphone 3 is designed to return an *approximate* phonetic key (and an alternate approximate phonetic key when appropriate) that should be the same for English words, and most names familiar in the United States, that are pronounced *similarly*. The key value is *not* intended to be an *exact* phonetic, or even phonemic, representation of the word. This is because a certain degree of 'fuzziness' has proven to be useful in compensating for variations in pronunciation, as well as misheard pronunciations. For example, although Americans are not usually aware of it, the letter 's' is normally pronounced 'z' at the end of words such as "sounds".

The 'approximate' aspect of the encoding is implemented according to the following rules:

(1) All vowels are encoded to the same value - 'A'. If the EncodeVowels option is set to false, only *initial* vowels will be encoded at all. If EncodeVowels is set to true, 'A' will be encoded at all places in the word where a vowel is normally pronounced. 'W' as well as 'Y' are treated as vowels. Although there are differences in the pronunciation of 'W' and 'Y' in different circumstances that lead to their being classified as vowels under some circumstances and as consonants in others, for the purposes of the 'fuzziness' component of the Soundex and Metaphone family of algorithms they will always be treated here as vowels.

(2) Voiced and unvoiced consonant pairs are mapped to the same encoded value. This means that:

- 'D' and 'T' -> 'T'
- 'B' and 'P' -> 'P'
- 'G' and 'K' -> 'K'
- 'Z' and 'S' -> 'S'
- 'V' and 'F' -> 'F'

- In addition to the above voiced/unvoiced rules, 'CH' and 'SH' -> 'X', where 'X' represents the "-SH-" and "-CH-" sounds in Metaphone 3 encoding.

- Also, the sound that is spelled as "TH" in English is encoded to '0' (zero symbol). (Although Americans are not usually aware of it, "TH" is pronounced in a voiced (e.g. "that") as well as an unvoiced (e.g. "theater") form, which are naturally mapped to the same encoding.)
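The folding rules above can be made concrete with a toy letter-by-letter sketch. This is emphatically not the Metaphone 3 algorithm, which applies many context-sensitive pronunciation rules; it only demonstrates rules (1) and (2) plus the 'X' and '0' conventions on simple spellings. The foldLetters function and its helpers are hypothetical names, and only ASCII letters are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// voicedToUnvoiced lists the consonant pairs from rule (2).
var voicedToUnvoiced = map[byte]byte{
	'D': 'T', 'B': 'P', 'G': 'K', 'Z': 'S', 'V': 'F',
}

// isVowel treats 'W' and 'Y' as vowels, as rule (1) describes.
func isVowel(c byte) bool {
	return strings.IndexByte("AEIOUWY", c) >= 0
}

// foldLetters is a toy illustration of the fuzzy folding rules.
// It is NOT Metaphone 3: it ignores pronunciation context entirely.
func foldLetters(word string, encodeVowels bool) string {
	w := strings.ToUpper(word)
	var out strings.Builder
	atStart := true    // true until the first letter has been consumed
	prevVowel := false // true while inside a run of vowels
	for i := 0; i < len(w); i++ {
		c := w[i]
		if c < 'A' || c > 'Z' {
			prevVowel = false
			continue // skip anything that is not an ASCII letter
		}
		initial := atStart
		atStart = false
		if isVowel(c) {
			// One 'A' per vowel run; non-initial runs only if requested.
			if !prevVowel && (initial || encodeVowels) {
				out.WriteByte('A')
			}
			prevVowel = true
			continue
		}
		prevVowel = false
		if i+1 < len(w) && w[i+1] == 'H' {
			if c == 'T' { // "TH" -> '0'
				out.WriteByte('0')
				i++
				continue
			}
			if c == 'C' || c == 'S' { // "CH" and "SH" -> 'X'
				out.WriteByte('X')
				i++
				continue
			}
		}
		if u, ok := voicedToUnvoiced[c]; ok {
			c = u
		}
		out.WriteByte(c)
	}
	return out.String()
}

func main() {
	fmt.Println(foldLetters("Smith", false), foldLetters("Smythe", false))
	fmt.Println(foldLetters("dab", true), foldLetters("tap", true))
}
```

Even this crude folding makes "dab" and "tap" collide, which is the intended fuzziness; words whose spelling and pronunciation diverge (Schmidt, for instance) need the real encoder.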

The encodings in this version of Metaphone 3 follow pronunciations common in the United States. This means that they will be inaccurate for consonant pronunciations that differ in the United Kingdom, for example "tube" -> "CHOOBE" -> XAP rather than the American TAP.

Metaphone 3 was preceded by Soundex, patented in 1919, and by Metaphone and Double Metaphone, developed by Lawrence Philips. All of these algorithms resulted in a significant number of incorrect encodings. Metaphone3 was tested against a database of about 100 thousand English words, names common in the United States, and non-English words found in publications in the United States, with an emphasis on words that are commonly mispronounced, prepared by the Moby Words website, but with the Moby Words 'phonetic' encodings algorithmically mapped to Double Metaphone encodings. Metaphone3 increases the accuracy of encoding English words, common names, and non-English words found in American publications from 89% for Double Metaphone to over 98%.

Constants

This section is empty.

Variables

var DefaultMaxLength = 8

DefaultMaxLength is the maximum number of runes in a result when MaxLength is not specified in the Encoder.

Functions

This section is empty.

Types

type Encoder

type Encoder struct {
	// EncodeVowels determines if Metaphone3 will encode non-initial vowels. However, even
	// if there is more than one vowel sound in a vowel sequence (e.g. a
	// diphthong), only one 'A' will be encoded before the next consonant or the
	// end of the word.
	EncodeVowels bool

	// EncodeExact controls if Metaphone3 will encode consonants as exactly as possible.
	// This does not include 'S' vs. 'Z', since Americans will pronounce 'S' at the
	// end of many words as 'Z', nor does it include "CH" vs. "SH". It does cause
	// a distinction to be made between 'B' and 'P', 'D' and 'T', 'G' and 'K', and 'V'
	// and 'F'.
	EncodeExact bool

	// MaxLength is the maximum allowed length of the output metaphs; if <= 0, DefaultMaxLength is used.
	MaxLength int
	// contains filtered or unexported fields
}

Encoder is a metaphone3 encoder that contains options and state for encoding. It is not safe to use across goroutines.

func (*Encoder) Encode

func (e *Encoder) Encode(in string) (primary, secondary string)

Encode takes in a string and returns primary and secondary metaphones. Both will be blank if given a blank input, and secondary can be blank if there's only one metaphone.
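Since the secondary return value can be empty, code that stores keys needs to skip it rather than index under the empty string. The sketch below shows one possible in-memory index keyed by both values; nameIndex, add, and lookup are hypothetical names, and the keys passed in main are the ones the README quotes for Smith and Schmidt rather than output computed here.

```go
package main

import (
	"fmt"
	"sort"
)

// nameIndex maps a phonetic key to the set of names stored under it.
// Hypothetical helper type; it is not part of the metaphone3 package.
type nameIndex map[string]map[string]struct{}

// add stores name under each non-empty key. An empty secondary key
// (or an empty primary from blank input) is skipped.
func (idx nameIndex) add(name, primary, secondary string) {
	for _, key := range []string{primary, secondary} {
		if key == "" {
			continue
		}
		if idx[key] == nil {
			idx[key] = make(map[string]struct{})
		}
		idx[key][name] = struct{}{}
	}
}

// lookup returns the sorted, de-duplicated names stored under either key.
func (idx nameIndex) lookup(primary, secondary string) []string {
	seen := make(map[string]struct{})
	for _, key := range []string{primary, secondary} {
		if key == "" {
			continue
		}
		for name := range idx[key] {
			seen[name] = struct{}{}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	idx := nameIndex{}
	idx.add("Smith", "SM0", "XMT") // keys quoted in the README
	idx.add("Schmidt", "XMT", "")  // no secondary key

	// Searching with Schmidt's keys also finds Smith through XMT.
	fmt.Println(idx.lookup("XMT", ""))
}
```

In practice the key arguments would come straight from a call such as e.Encode(name), using the same Encoder options for indexing and for lookups.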
