textutil

package
v0.1.0 Latest

This package is not in the latest version of its module.

Published: Mar 30, 2026 | License: MIT | Imports: 3 | Imported by: 0

Documentation

Overview

Package textutil provides text processing utilities for social discussion analysis.


Constants

This section is empty.

Variables

var DefaultStopWords = map[string]bool{
	"the": true, "a": true, "an": true, "and": true, "or": true, "but": true,
	"in": true, "on": true, "at": true, "to": true, "for": true, "of": true,
	"with": true, "by": true, "from": true, "as": true, "is": true, "was": true,
	"are": true, "were": true, "been": true, "be": true, "have": true, "has": true,
	"had": true, "do": true, "does": true, "did": true, "will": true, "would": true,
	"could": true, "should": true, "may": true, "might": true, "must": true,
	"that": true, "which": true, "who": true, "whom": true, "this": true,
	"these": true, "those": true, "it": true, "its": true, "they": true,
	"their": true, "them": true, "we": true, "our": true, "us": true,
	"i": true, "me": true, "my": true, "you": true, "your": true,
	"not": true, "no": true, "yes": true, "so": true, "if": true, "then": true,
	"than": true, "too": true, "very": true, "just": true, "only": true,
	"also": true, "about": true, "into": true, "over": true, "after": true,
}

DefaultStopWords is the default set of common English stop words that keyword extraction filters out.
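Because DefaultStopWords is an exported map variable, writing to it changes the set for every caller in the program. A possible way to customize it is to copy it first. The sketch below uses a shortened stand-in map and a hypothetical helper, withExtraStopWords; neither is part of this package.

```go
package main

import "fmt"

// defaultStopWords is a shortened stand-in for textutil.DefaultStopWords.
var defaultStopWords = map[string]bool{"the": true, "a": true, "and": true}

// withExtraStopWords (hypothetical helper) returns a new set that contains
// every entry of base plus the extra words. base itself is not modified.
func withExtraStopWords(base map[string]bool, extra ...string) map[string]bool {
	out := make(map[string]bool, len(base)+len(extra))
	for w, v := range base {
		out[w] = v
	}
	for _, w := range extra {
		out[w] = true
	}
	return out
}

func main() {
	custom := withExtraStopWords(defaultStopWords, "edit", "deleted")
	fmt.Println(len(defaultStopWords), len(custom)) // prints "3 5"
}
```

The copied set can then be passed wherever a stopWords map[string]bool argument is accepted.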

Functions

func ExtractKeywords

func ExtractKeywords(text string, minLength int) []string

ExtractKeywords extracts meaningful keywords from text, filtering stop words. Words shorter than minLength are excluded. Duplicates are removed.

func ExtractKeywordsWithStopWords

func ExtractKeywordsWithStopWords(text string, minLength int, stopWords map[string]bool) []string

ExtractKeywordsWithStopWords extracts keywords using a custom stop word set.
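The behavior described for the two keyword functions (stop-word filtering, a minimum length, and de-duplication) can be illustrated with a minimal sketch. The lowercasing, the split on non-alphanumeric characters, and the first-occurrence ordering are assumptions, not documented behavior, and extractKeywords is a hypothetical stand-in rather than this package's code.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a small stand-in for a caller-supplied stop word set.
var stopWords = map[string]bool{"the": true, "a": true, "and": true, "is": true}

// extractKeywords (hypothetical) lowercases text, splits it on anything that
// is not a letter or digit, and keeps each word once if it is at least
// minLength runes long and not in stop. Tokenization rules are assumed.
func extractKeywords(text string, minLength int, stop map[string]bool) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	seen := make(map[string]bool)
	var out []string
	for _, w := range words {
		if len([]rune(w)) < minLength || stop[w] || seen[w] {
			continue
		}
		seen[w] = true
		out = append(out, w)
	}
	return out
}

func main() {
	text := "The parser and the lexer: a parser is fast"
	fmt.Println(extractKeywords(text, 4, stopWords)) // prints "[parser lexer fast]"
}
```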

func RemoveQuotedLines

func RemoveQuotedLines(s string) string

RemoveQuotedLines removes lines that start with ">" (quoted parent text in forums). This quoting style is common on Hacker News and Reddit, where users quote parent comments.
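A minimal sketch of line-based quote removal follows. It assumes lines are separated by "\n" and that leading whitespace before ">" is ignored; the package may treat both differently. removeQuotedLines is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// removeQuotedLines (hypothetical) drops every line whose first
// non-whitespace character is '>' and joins the remaining lines.
func removeQuotedLines(s string) string {
	var kept []string
	for _, line := range strings.Split(s, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), ">") {
			continue // quoted parent text
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	comment := "> original claim\nI disagree.\n  > another quote\nHere is why."
	fmt.Printf("%q\n", removeQuotedLines(comment)) // prints "I disagree.\nHere is why."
}
```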

func StripHTML

func StripHTML(s string) string

StripHTML removes HTML tags and decodes HTML entities from text. It preserves paragraph breaks as spaces and removes anchor tags while keeping their text.
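One way to get the described result with only the standard library is sketched below: paragraph and line-break tags become spaces, all remaining tags (including anchors) are dropped while their text is kept, and entities are decoded with html.UnescapeString. The regular expressions and the whitespace collapsing are assumptions, and regex-based stripping is not a full HTML parser.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// breakTag matches <p>, </p>, <br>, <br/> with optional attributes.
	breakTag = regexp.MustCompile(`(?i)</?(p|br)\b[^>]*>`)
	// anyTag matches every remaining tag, e.g. <a href="..."> and </a>.
	anyTag = regexp.MustCompile(`<[^>]*>`)
	// spaceRun matches one or more whitespace characters.
	spaceRun = regexp.MustCompile(`\s+`)
)

// stripHTML (hypothetical) removes tags, decodes entities, and collapses
// whitespace. Paragraph breaks end up as single spaces.
func stripHTML(s string) string {
	s = breakTag.ReplaceAllString(s, " ")
	s = anyTag.ReplaceAllString(s, "")
	s = html.UnescapeString(s)
	return strings.TrimSpace(spaceRun.ReplaceAllString(s, " "))
}

func main() {
	in := `<p>Tom &amp; Jerry</p><p>See <a href="https://example.com">this link</a>.</p>`
	fmt.Println(stripHTML(in)) // prints "Tom & Jerry See this link."
}
```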

func StripHTMLAndQuotes

func StripHTMLAndQuotes(s string) string

StripHTMLAndQuotes combines StripHTML and RemoveQuotedLines for processing forum comments.

func Truncate

func Truncate(s string, maxLen int) string

Truncate shortens a string to maxLen characters, adding "..." if truncated.
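The description leaves open whether the "..." counts toward maxLen and whether length is measured in bytes or runes. The sketch below assumes runes and assumes the result, ellipsis included, never exceeds maxLen. truncate is a hypothetical stand-in.

```go
package main

import "fmt"

// truncate (hypothetical) returns s unchanged if it has at most maxLen
// runes. Otherwise it cuts s so that the result plus "..." is exactly
// maxLen runes long. For maxLen <= 3 there is no room for the ellipsis.
func truncate(s string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	r := []rune(s)
	if len(r) <= maxLen {
		return s
	}
	if maxLen <= 3 {
		return string(r[:maxLen])
	}
	return string(r[:maxLen-3]) + "..."
}

func main() {
	fmt.Println(truncate("hello world", 8)) // prints "hello..."
	fmt.Println(truncate("short", 10))      // prints "short"
}
```

Working on runes rather than bytes avoids cutting a multi-byte UTF-8 character in half.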

func TruncateAtSentence

func TruncateAtSentence(s string, maxLen int) string

TruncateAtSentence truncates text at a sentence boundary near maxLen. It looks for sentence-ending punctuation (.!?) within a range before maxLen.
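The search range is not specified, so the sketch below assumes the second half of the first maxLen runes, and assumes a fallback to a hard cut with "..." when no sentence end is found there. truncateAtSentence is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateAtSentence (hypothetical) scans backwards from maxLen for '.',
// '!' or '?' and cuts just after it. The scan stops at maxLen/2 (assumed
// window); if nothing is found, it falls back to a hard cut with "...".
func truncateAtSentence(s string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	r := []rune(s)
	if len(r) <= maxLen {
		return s
	}
	for i := maxLen - 1; i >= maxLen/2; i-- {
		if strings.ContainsRune(".!?", r[i]) {
			return string(r[:i+1])
		}
	}
	if maxLen <= 3 {
		return string(r[:maxLen])
	}
	return string(r[:maxLen-3]) + "..."
}

func main() {
	s := "First sentence. Second sentence is much longer than the limit."
	fmt.Println(truncateAtSentence(s, 25)) // prints "First sentence."
}
```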

func WordSimilarity

func WordSimilarity(a, b string) float64

WordSimilarity computes word overlap similarity between two strings. Returns a value between 0.0 (no overlap) and 1.0 (complete overlap). Uses a Jaccard-like coefficient based on word matching.
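For reference, the plain Jaccard coefficient on word sets is the number of shared words divided by the number of distinct words in either string. The sketch below implements that textbook form with whitespace tokenization and lowercasing as assumptions; since the description says "Jaccard-like", the package's exact formula may differ. wordSimilarity is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet returns the set of lowercase, whitespace-separated words in s.
func wordSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// wordSimilarity (hypothetical) returns |A ∩ B| / |A ∪ B| for the word
// sets of a and b, or 0 if either string has no words.
func wordSimilarity(a, b string) float64 {
	setA, setB := wordSet(a), wordSet(b)
	if len(setA) == 0 || len(setB) == 0 {
		return 0
	}
	shared := 0
	for w := range setA {
		if setB[w] {
			shared++
		}
	}
	union := len(setA) + len(setB) - shared
	return float64(shared) / float64(union)
}

func main() {
	fmt.Println(wordSimilarity("go is fast", "go is fun")) // prints "0.5"
}
```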

Types

type ScoreTextResult

type ScoreTextResult struct {
	Score   float64
	Matches []string
}

ScoreTextResult holds the result of scoring text against keywords.

func ScoreComment

func ScoreComment(text string, keywords []string) ScoreTextResult

ScoreComment scores a comment against keywords and adjusts the score for length. Comments of reasonable length (20-200 words) get a bonus. Very short (<10 words) or very long (>300 words) comments are penalized.

func ScoreText

func ScoreText(text string, keywords []string) ScoreTextResult

ScoreText scores text against a list of keywords. Longer keywords contribute more to the score. Returns the score and list of matched keywords.
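As an illustration of "longer keywords contribute more", the sketch below adds each matched keyword's length in runes to the score. The weighting and the case-insensitive substring matching are assumptions; the package's real formula is not documented here. scoreResult and scoreText are hypothetical stand-ins for ScoreTextResult and ScoreText.

```go
package main

import (
	"fmt"
	"strings"
)

// scoreResult mirrors the shape of ScoreTextResult.
type scoreResult struct {
	Score   float64
	Matches []string
}

// scoreText (hypothetical) adds the rune length of every keyword that
// occurs in text, ignoring case, and records the matched keywords.
func scoreText(text string, keywords []string) scoreResult {
	lower := strings.ToLower(text)
	var res scoreResult
	for _, kw := range keywords {
		k := strings.ToLower(strings.TrimSpace(kw))
		if k == "" || !strings.Contains(lower, k) {
			continue
		}
		res.Score += float64(len([]rune(k)))
		res.Matches = append(res.Matches, kw)
	}
	return res
}

func main() {
	res := scoreText("Go generics simplify code", []string{"generics", "go", "rust"})
	fmt.Println(res.Score, res.Matches) // prints "10 [generics go]"
}
```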
