Documentation ¶
Overview ¶
Package textutil provides text processing utilities for social discussion analysis.
Index ¶
- Variables
- func ExtractKeywords(text string, minLength int) []string
- func ExtractKeywordsWithStopWords(text string, minLength int, stopWords map[string]bool) []string
- func RemoveQuotedLines(s string) string
- func StripHTML(s string) string
- func StripHTMLAndQuotes(s string) string
- func Truncate(s string, maxLen int) string
- func TruncateAtSentence(s string, maxLen int) string
- func WordSimilarity(a, b string) float64
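- type ScoreTextResult
  - func ScoreComment(text string, keywords []string) ScoreTextResult
  - func ScoreText(text string, keywords []string) ScoreTextResult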
Constants ¶
This section is empty.
Variables ¶
var DefaultStopWords = map[string]bool{
	"the": true, "a": true, "an": true, "and": true, "or": true, "but": true,
	"in": true, "on": true, "at": true, "to": true, "for": true, "of": true,
	"with": true, "by": true, "from": true, "as": true,
	"is": true, "was": true, "are": true, "were": true, "been": true, "be": true,
	"have": true, "has": true, "had": true, "do": true, "does": true, "did": true,
	"will": true, "would": true, "could": true, "should": true, "may": true, "might": true, "must": true,
	"that": true, "which": true, "who": true, "whom": true, "this": true, "these": true, "those": true,
	"it": true, "its": true, "they": true, "their": true, "them": true,
	"we": true, "our": true, "us": true, "i": true, "me": true, "my": true, "you": true, "your": true,
	"not": true, "no": true, "yes": true, "so": true, "if": true, "then": true, "than": true,
	"too": true, "very": true, "just": true, "only": true, "also": true,
	"about": true, "into": true, "over": true, "after": true,
}
DefaultStopWords is the default set of common English stop words that are filtered out during keyword extraction.
Functions ¶
func ExtractKeywords ¶
func ExtractKeywords(text string, minLength int) []string
ExtractKeywords extracts meaningful keywords from text, filtering stop words. Words shorter than minLength are excluded. Duplicates are removed.
func ExtractKeywordsWithStopWords ¶
func ExtractKeywordsWithStopWords(text string, minLength int, stopWords map[string]bool) []string
ExtractKeywordsWithStopWords extracts keywords using a custom stop word set.
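The extraction steps described for ExtractKeywords and ExtractKeywordsWithStopWords (stop word filtering, a minimum length, and de-duplication) can be sketched as follows. This is an illustrative sketch, not the package's implementation: the helper name extractKeywords, the lowercasing, the tokenization on non-alphanumeric characters, and the first-seen ordering are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// extractKeywords is a hypothetical stand-in for the documented functions.
// Assumptions: text is lowercased, tokens are runs of letters and digits,
// length is measured in runes, and keywords keep their first-seen order.
func extractKeywords(text string, minLength int, stopWords map[string]bool) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	seen := make(map[string]bool)
	var keywords []string
	for _, w := range words {
		// Skip short words, stop words, and words already collected.
		if utf8.RuneCountInString(w) < minLength || stopWords[w] || seen[w] {
			continue
		}
		seen[w] = true
		keywords = append(keywords, w)
	}
	return keywords
}

func main() {
	stop := map[string]bool{"the": true, "is": true, "and": true}
	// prints [parser fast small]
	fmt.Println(extractKeywords("The parser is fast, and the parser is small.", 3, stop))
}
```

Under these assumptions, a default-stop-word variant would simply pass DefaultStopWords as the third argument.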
func RemoveQuotedLines ¶
func RemoveQuotedLines(s string) string
RemoveQuotedLines removes lines that start with ">" (quoted parent text in forums). This quoting convention is common on Hacker News and Reddit, where users quote parent comments.
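A minimal sketch of this kind of quote filtering is shown below. The helper name removeQuotedLines is hypothetical, and treating lines with leading whitespace before ">" as quoted is an assumption; the real function may only match ">" in the first column.

```go
package main

import (
	"fmt"
	"strings"
)

// removeQuotedLines is a hypothetical stand-in for RemoveQuotedLines.
// Assumption: leading whitespace before ">" is ignored when deciding
// whether a line is a quote.
func removeQuotedLines(s string) string {
	var kept []string
	for _, line := range strings.Split(s, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), ">") {
			continue // drop quoted parent text
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	comment := "> quoted parent\nMy reply.\n  > indented quote\nMore reply."
	fmt.Println(removeQuotedLines(comment))
}
```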
func StripHTML ¶
func StripHTML(s string) string
StripHTML removes HTML tags and decodes HTML entities from text. Paragraph breaks are converted to spaces, and anchor tags are removed while their link text is kept.
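One way to get the behavior described above is a regular-expression pass followed by entity decoding, sketched below. The helper stripHTML and both patterns are assumptions made for illustration; the package may use a real HTML parser instead, and a regex approach like this one is not robust against malformed markup.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// Paragraph and line-break tags, which become spaces.
	breakTags = regexp.MustCompile(`(?i)</?p(\s[^>]*)?>|<br\s*/?>`)
	// Any other tag, including <a ...> and </a>, which is dropped.
	anyTag = regexp.MustCompile(`<[^>]*>`)
)

// stripHTML is a hypothetical stand-in for StripHTML.
func stripHTML(s string) string {
	s = breakTags.ReplaceAllString(s, " ")
	s = anyTag.ReplaceAllString(s, "")
	// Decode entities after tag removal so an escaped "&lt;b&gt;" stays as text.
	s = html.UnescapeString(s)
	// Collapse the whitespace left behind by removed tags.
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	in := `<p>See <a href="https://example.com">the docs</a> &amp; tests.</p><p>Thanks!</p>`
	// prints See the docs & tests. Thanks!
	fmt.Println(stripHTML(in))
}
```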
func StripHTMLAndQuotes ¶
func StripHTMLAndQuotes(s string) string
StripHTMLAndQuotes combines StripHTML and RemoveQuotedLines for processing forum comments.
func TruncateAtSentence ¶
func TruncateAtSentence(s string, maxLen int) string
TruncateAtSentence truncates text at a sentence boundary near maxLen. It looks for sentence-ending punctuation (".", "!", or "?") within a range before maxLen.
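The sketch below illustrates sentence-boundary truncation. The documentation does not say how large the search range is or what happens when no punctuation is found, so the window (the second half of the allowed prefix), the hard-cut fallback, and counting length in runes are all assumptions, as is the helper name truncateAtSentence.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateAtSentence is a hypothetical stand-in for TruncateAtSentence.
// Assumptions: maxLen counts runes, the search window is the last half of
// the first maxLen runes, and the fallback is a hard cut at maxLen.
func truncateAtSentence(s string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	lowest := maxLen / 2
	// Walk backwards from maxLen looking for the closest sentence end.
	for i := maxLen - 1; i >= lowest; i-- {
		if strings.ContainsRune(".!?", runes[i]) {
			return string(runes[:i+1])
		}
	}
	return string(runes[:maxLen])
}

func main() {
	text := "First sentence. Second sentence is much longer than the first."
	// prints First sentence.
	fmt.Println(truncateAtSentence(text, 25))
}
```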
func WordSimilarity ¶
func WordSimilarity(a, b string) float64
WordSimilarity computes word overlap similarity between two strings. Returns a value between 0.0 (no overlap) and 1.0 (complete overlap). Uses a Jaccard-like coefficient based on word matching.
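For reference, the plain Jaccard coefficient over word sets is |A ∩ B| / |A ∪ B|. The sketch below implements exactly that; since the documentation only says "Jaccard-like", the real function may normalize or weight words differently. The helper names, the lowercasing, and the whitespace tokenization are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet returns the set of lowercased, whitespace-separated words in s.
func wordSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// wordSimilarity is a hypothetical stand-in for WordSimilarity that
// computes the standard Jaccard coefficient of the two word sets.
func wordSimilarity(a, b string) float64 {
	setA, setB := wordSet(a), wordSet(b)
	if len(setA) == 0 || len(setB) == 0 {
		return 0 // assumed result for empty input
	}
	shared := 0
	for w := range setA {
		if setB[w] {
			shared++
		}
	}
	union := len(setA) + len(setB) - shared
	return float64(shared) / float64(union)
}

func main() {
	// Shared words: "go", "is"; union: "go", "is", "fun", "fast".
	fmt.Println(wordSimilarity("Go is fun", "go is fast")) // prints 0.5
}
```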
Types ¶
type ScoreTextResult ¶
ScoreTextResult holds the result of scoring text against keywords.
func ScoreComment ¶
func ScoreComment(text string, keywords []string) ScoreTextResult
ScoreComment scores a comment with length-based adjustments. Comments of reasonable length (20-200 words) get a bonus; very short (<10 words) or very long (>300 words) comments are penalized.
func ScoreText ¶
func ScoreText(text string, keywords []string) ScoreTextResult
ScoreText scores text against a list of keywords. Longer keywords contribute more to the score. It returns the score and the list of matched keywords.