stopwords is a Go package that removes stop words from text content. On request, it also strips HTML tags and unescapes HTML entities. The goal is to prepare text for natural language processing algorithms or text comparison algorithms such as SimHash.
It uses a curated list of the most frequent words used in these languages:
If a function is called with an unsupported language, it doesn't fail; it applies the English filter to the content instead.
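The fallback behaviour described above can be pictured with a small standard-library sketch. It is not the package's implementation: the `stopWordLists` registry, the `listFor` helper and the way a BCP 47 tag is reduced to its primary subtag are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWordLists is a hypothetical registry; the real package ships much
// larger curated lists.
var stopWordLists = map[string]map[string]bool{
	"en": {"the": true, "a": true, "of": true},
	"fr": {"la": true, "un": true, "le": true},
}

// listFor returns the stop word list registered for langCode and falls back
// to the English list when the language is unknown. Only the primary subtag
// of a tag such as "fr-CA" is considered (an assumption of this sketch).
func listFor(langCode string) map[string]bool {
	base := strings.ToLower(langCode)
	if i := strings.IndexAny(base, "-_"); i >= 0 {
		base = base[:i]
	}
	if list, ok := stopWordLists[base]; ok {
		return list
	}
	return stopWordLists["en"]
}

func main() {
	fmt.Println(listFor("fr-CA")["la"]) // French list is selected
	fmt.Println(listFor("xx")["the"])   // unknown code falls back to English
}
```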
How to use this package?
You can find an example at https://github.com/bbalet/gorelated, where the stopwords package is used in conjunction with the SimHash algorithm to find a list of related content for a static website generator:
    import (
        "github.com/bbalet/stopwords"
    )

    // Example with 2 strings containing P HTML tags.
    // "la", "un", etc. are (stop) words without lexical value in French.
    string1 := []byte("<p>la fin d'un bel après-midi d'été</p>")
    string2 := []byte("<p>cet été, nous avons eu un bel après-midi</p>")

    // Return a string where HTML tags and French stop words have been removed.
    // CleanString takes a string, so the byte slice is converted first.
    cleanContent := stopwords.CleanString(string(string1), "fr", true)

    // Get two (Sim) hashes representing the content of each string.
    hash1 := stopwords.Simhash(string1, "fr", true)
    hash2 := stopwords.Simhash(string2, "fr", true)

    // Hamming distance between the two hashes (difference between contents).
    distance := stopwords.CompareSimhash(hash1, hash2)

    // Clean the content of string1 and string2, then compute the Levenshtein distance.
    stopwords.LevenshteinDistance(string1, string2, "fr", true)
Where fr is the ISO 639-1 code for French (a BCP 47 tag is accepted as well). See https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes for the list of codes.
How to load a custom list of stop words from a file/string?
This package comes with a predefined list of stopwords. However, two functions allow you to use your own list of words:
    stopwords.LoadStopWordsFromFile(filePath, langCode, separator)
    stopwords.LoadStopWordsFromString(wordsList, langCode, separator)
They will overwrite the predefined words for a given language.
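To make the role of the separator argument concrete, here is a minimal sketch of how a separated word list can be turned into a lookup set. `parseStopWords` is a hypothetical helper written with the standard library only; trimming and lower-casing the entries are assumptions of the sketch, not documented behaviour of LoadStopWordsFromString.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStopWords is a hypothetical helper: it splits wordsList on sep and
// returns a set of trimmed, lower-cased, non-empty entries.
func parseStopWords(wordsList, sep string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, w := range strings.Split(wordsList, sep) {
		w = strings.ToLower(strings.TrimSpace(w))
		if w != "" {
			set[w] = struct{}{}
		}
	}
	return set
}

func main() {
	// A newline-separated list, as it might be read from a file.
	custom := parseStopWords("foo\nBar\n\n baz ", "\n")
	_, hasBar := custom["bar"]
	fmt.Println(len(custom), hasBar)
}
```

With the package itself, the equivalent step would be a call such as stopwords.LoadStopWordsFromString(wordsList, "en", "\n"), after which the custom words replace the predefined English list.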
You can find an example with the file
How to overwrite the word segmenter?
If you don't want to strip the Unicode characters of the 'Number, Decimal Digit' category, call the function DontStripDigits before using the package.
If you want to use your own segmenter, you can overwrite the regular expression with OverwriteWordSegmenter.
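The effect of a segmenter expression can be tried out with the standard regexp package before passing it to the library. The two expressions below are illustrative assumptions, not the package's defaults: one matches runs of Unicode letters only, the other also keeps characters of the 'Number, Decimal Digit' (Nd) category.

```go
package main

import (
	"fmt"
	"regexp"
)

// Two illustrative segmenter expressions (assumptions, not package defaults).
var (
	lettersOnly      = regexp.MustCompile(`\pL+`)
	lettersAndDigits = regexp.MustCompile(`[\pL\p{Nd}]+`)
)

// segment returns every match of the segmenter expression found in text.
func segment(re *regexp.Regexp, text string) []string {
	return re.FindAllString(text, -1)
}

func main() {
	text := "été 2024, route 66"
	fmt.Println(segment(lettersOnly, text))      // digits are dropped
	fmt.Println(segment(lettersAndDigits, text)) // digits are kept as words
}
```

An expression validated this way could then be handed to stopwords.OverwriteWordSegmenter as a string.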
Please note that this library doesn't break words. If you want to break words prior to using stopwords, you need another library that provides a binding to the ICU library.
These curated lists contain the most used words across various topics; they were not built from a corpus limited to a specialized topic.
Most of the lists were built by IR Multilingual Resources at UniNE http://members.unine.ch/jacques.savoy/clef/index.html
stopwords is released under the BSD license.
Package stopwords allows you to customize the list of stopwords
Package stopwords implements the Levenshtein Distance algorithm to evaluate the difference between 2 strings
Package stopwords implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.
Package stopwords contains various algorithms of text comparison (Simhash, Levenshtein)
- func Clean(content []byte, langCode string, cleanHTML bool) []byte
- func CleanString(content string, langCode string, cleanHTML bool) string
- func CompareSimhash(a uint64, b uint64) uint8
- func DontStripDigits()
- func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int
- func LoadStopWordsFromFile(filePath string, langCode string, sep string)
- func LoadStopWordsFromString(wordsList string, langCode string, sep string)
- func OverwriteWordSegmenter(expression string)
- func Simhash(content []byte, langCode string, cleanHTML bool) uint64
func Clean
Clean removes useless spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code (if unknown, English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
func CleanString
CleanString removes useless spaces and stop words from string content. langCode is a BCP 47 or ISO 639-1 language code (if unknown, English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
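The cleaning steps described for Clean and CleanString can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the package's code: `cleanSketch`, the naive `tagPattern` and the letters-only `wordPattern` are hypothetical, and the stop word list is passed in explicitly instead of being selected by language code.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	tagPattern  = regexp.MustCompile(`<[^>]*>`) // naive tag matcher (assumption)
	wordPattern = regexp.MustCompile(`\pL+`)    // runs of Unicode letters (assumption)
)

// cleanSketch optionally strips HTML tags and unescapes entities, then keeps
// only the lower-cased words that are absent from the stop list, joined by
// single spaces.
func cleanSketch(content string, stop map[string]bool, cleanHTML bool) string {
	if cleanHTML {
		content = tagPattern.ReplaceAllString(content, " ")
		content = html.UnescapeString(content)
	}
	var kept []string
	for _, w := range wordPattern.FindAllString(strings.ToLower(content), -1) {
		if !stop[w] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	stop := map[string]bool{"the": true, "of": true, "a": true}
	fmt.Println(cleanSketch("<p>The smell of a caf&eacute; &amp; bakery</p>", stop, true))
}
```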
func CompareSimhash
CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
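For readers unfamiliar with the Kernighan method, the following is a minimal standalone sketch of the bit-counting technique. `hammingDistance` is a hypothetical name chosen for this example; the package's own code may differ in detail.

```go
package main

import "fmt"

// hammingDistance counts the bit positions in which a and b differ.
// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the loop
// runs once per differing bit rather than once per bit position.
func hammingDistance(a, b uint64) uint8 {
	x := a ^ b // set bits mark the positions where a and b differ
	var count uint8
	for x != 0 {
		x &= x - 1
		count++
	}
	return count
}

func main() {
	fmt.Println(hammingDistance(0b1011, 0b0001)) // two positions differ
}
```

A small distance between two simhash fingerprints suggests similar contents; identical fingerprints give a distance of 0 and fully inverted ones give 64.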
func DontStripDigits
DontStripDigits changes the behaviour of the default word segmenter so that characters of the 'Number, Decimal Digit' Unicode category are kept as part of words.
func LevenshteinDistance
LevenshteinDistance computes the Levenshtein distance between 2 byte slices after removing useless spaces and stop words from each of them. langCode is a BCP 47 or ISO 639-1 language code (if unknown, English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
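As a reminder of what the distance measures, here is a minimal sketch of the classic dynamic-programming Levenshtein algorithm using two rows. It works on runes and skips the cleaning step; whether the package compares bytes or runes is not stated in this documentation, so treat that choice (and the names `levenshtein` and `min3`) as assumptions.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-rune insertions,
// deletions and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the current row
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))
}
```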
func LoadStopWordsFromFile
LoadStopWordsFromFile loads a list of stop words from a file. filePath is the full path to the file to be loaded. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
func LoadStopWordsFromString
LoadStopWordsFromString loads a list of stop words from a string. wordsList is the string containing the stop words. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
func OverwriteWordSegmenter
func OverwriteWordSegmenter(expression string)
OverwriteWordSegmenter allows you to overwrite the default word segmenter with your own regular expression
func Simhash
Simhash returns a 64-bit simhash representing the content, after removing useless spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code (if unknown, English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
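The idea behind a 64-bit simhash can be sketched in a few lines. This is an unweighted illustration only: `simhash64` is a hypothetical name, hashing each token with FNV-1a is an assumption, and the package may tokenize, hash or weight features differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// simhash64 builds a 64-bit fingerprint from tokens. Each token is hashed to
// 64 bits and votes +1 (bit set) or -1 (bit clear) on every bit position;
// positions with a positive tally become 1 in the fingerprint.
func simhash64(tokens []string) uint64 {
	var votes [64]int
	for _, tok := range tokens {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for bit := 0; bit < 64; bit++ {
			if (sum>>uint(bit))&1 == 1 {
				votes[bit]++
			} else {
				votes[bit]--
			}
		}
	}
	var fingerprint uint64
	for bit, v := range votes {
		if v > 0 {
			fingerprint |= uint64(1) << uint(bit)
		}
	}
	return fingerprint
}

func main() {
	// Tokens as they might look after stop words have been removed.
	a := simhash64(strings.Fields("fin bel après-midi été"))
	b := simhash64(strings.Fields("été bel après-midi"))
	fmt.Printf("%064b\n%064b\n", a, b)
}
```

Because documents sharing many tokens produce similar tallies, their fingerprints tend to differ in few bits, which is what the Hamming distance of CompareSimhash measures.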
Source Files