plaintext

v1.1.0 Latest
Published: Aug 21, 2023 License: MIT Imports: 2 Imported by: 1

README

Plain Text Extractor

Plain Text Extractor is a Go library that extracts plain text from HTML and Markdown.

It provides a flexible, extensible interface: use the predefined extractors for HTML and Markdown, or plug in your own extraction functions.

Features

  • Parse HTML and Markdown documents into plain text.
  • Support for custom extraction functions.
  • Easy-to-use API to convert complex documents to simple plain text.

Installation

go get github.com/huantt/plaintext-extractor

Usage

Markdown extractor

markdownContent := "# H1 \n*italic* **bold** `code` `not code [link](https://example.com) ![image](https://image.com/image.png) ~~strikethrough~~"
extractor := plaintext.NewMarkdownExtractor()
output, err := extractor.PlainText(markdownContent)
if err != nil {
    panic(err)
}
// PlainText returns a *string, so dereference it before printing.
fmt.Println(*output)
// Output: H1 \nitalic bold code `not code link image strikethrough

Benchmark

goos: windows
goarch: amd64
pkg: github.com/huantt/plaintext-extractor/markdown
cpu: 11th Gen Intel(R) Core(TM) i5-1155G7 @ 2.50GHz
BenchmarkMarkdownExtractorMediumSize
BenchmarkMarkdownExtractorMediumSize-8   	12194006	        89.09 ns/op	      16 B/op	       1 allocs/op
BenchmarkMarkdownExtractorLargeSize
BenchmarkMarkdownExtractorLargeSize-8    	12645927	        88.25 ns/op	      16 B/op	       1 allocs/op
PASS

Custom Markdown Tag

markdownContent := "This is {color:#0A84FF}red{color}"

// A custom tag is described by three patterns: the full span, its opening
// marker, and its closing marker.
customTag := markdown.Tag{
    Name:       "color-custom-tag",
    FullRegex:  regexp.MustCompile("{color:[a-zA-Z0-9#]+}(.*?){color}"),
    StartRegex: regexp.MustCompile("{color:[a-zA-Z0-9#]+}"),
    EndRegex:   regexp.MustCompile("{color}"),
}

markdownExtractor := plaintext.NewMarkdownExtractor(customTag)
plaintextExtractor := plaintext.NewExtractor(markdownExtractor.PlainText)
output, err := plaintextExtractor.PlainText(markdownContent)
if err != nil {
    panic(err)
}
// PlainText returns a *string, so dereference it before printing.
fmt.Println(*output)
// Output: This is red
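
To make the role of FullRegex more concrete, here is a minimal standard-library sketch of how a pattern with one capture group can reduce a tagged span to its inner text. It is not the library's implementation: the names colorTag and stripColorTags are hypothetical, and the braces are escaped here only to make the literal match explicit.

```go
package main

import (
	"fmt"
	"regexp"
)

// colorTag matches a {color:...}text{color} span and captures the inner text.
// This pattern is an illustration, not the one the library uses internally.
var colorTag = regexp.MustCompile(`\{color:[a-zA-Z0-9#]+\}(.*?)\{color\}`)

// stripColorTags replaces every matched span with its first capture group.
func stripColorTags(input string) string {
	return colorTag.ReplaceAllString(input, "$1")
}

func main() {
	fmt.Println(stripColorTags("This is {color:#0A84FF}red{color}"))
	// Output: This is red
}
```

The non-greedy group (.*?) keeps two tagged spans on the same line from being merged into one match.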

HTML Extractor

html := `<div>This is a <a href="https://example.com">link</a></div>`
extractor := plaintext.NewHtmlExtractor()
output, err := extractor.PlainText(html)
if err != nil {
    panic(err)
}
// PlainText returns a *string, so dereference it before printing.
fmt.Println(*output)
// Output: This is a link

Multiple extractors

input := `<div> html </div> *markdown*`
markdownExtractor := markdown.NewExtractor()
htmlExtractor := html.NewExtractor()
// Extract functions are applied in the order they are passed.
extractor := plaintext.NewExtractor(markdownExtractor.PlainText, htmlExtractor.PlainText)
output, err := extractor.PlainText(input)
if err != nil {
    panic(err)
}
// PlainText returns a *string, so dereference it before printing.
fmt.Println(*output)
// Output: html markdown

Contribution

Contributions to the Plain Text Extractor project are welcome! If you find an issue or want to add a feature, please open an issue or submit a pull request. See CONTRIBUTING.md for more information.

License

This project is released under the MIT License; see the LICENSE file.

Documentation

Types

type ExtractFunc

type ExtractFunc func(input string) (*string, error)

ExtractFunc is the function signature for extracting plain text from a given input string. To support another input format, implement a function with this signature and pass it to Extractor.AddExtractor.
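
As an illustration of that signature, the sketch below defines a custom extract function using only the standard library. The ExtractFunc type is redeclared locally so the example compiles on its own, and collapseWhitespace is a hypothetical helper, not something the library provides.

```go
package main

import (
	"fmt"
	"strings"
)

// ExtractFunc mirrors the documented signature. It is redeclared here only so
// that this sketch compiles without importing the library.
type ExtractFunc func(input string) (*string, error)

// collapseWhitespace is a hypothetical extract function: it trims the input
// and collapses every run of whitespace into a single space.
func collapseWhitespace(input string) (*string, error) {
	out := strings.Join(strings.Fields(input), " ")
	return &out, nil
}

// Compile-time check that collapseWhitespace satisfies ExtractFunc.
var _ ExtractFunc = collapseWhitespace

func main() {
	out, err := collapseWhitespace("  H1 \n italic   bold ")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", *out)
	// Output: "H1 italic bold"
}
```

Because the result is a *string, the function returns the address of a local variable, and callers dereference it after checking the error.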

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor represents a plain text extractor that can parse input strings using multiple extract functions (for example, HTML or Markdown).

func NewExtractor

func NewExtractor(extractFunc ExtractFunc, moreFuncs ...ExtractFunc) *Extractor

NewExtractor creates a new Extractor instance with the given extract function and any additional extract functions passed in moreFuncs.

func NewHtmlExtractor

func NewHtmlExtractor(blockTags ...string) *Extractor

func NewMarkdownExtractor

func NewMarkdownExtractor(customTags ...markdown.Tag) *Extractor

func (*Extractor) AddExtractor

func (p *Extractor) AddExtractor(extractor ExtractFunc) *Extractor

AddExtractor adds an extract function to the Extractor instance.

func (*Extractor) PlainText

func (p *Extractor) PlainText(input string) (plainText *string, err error)

PlainText extracts plain text from the input string using registered extract functions. It iterates over all extract functions, applying them in sequence, and returns the final plain text.
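
The sequential behavior described above can be sketched with a small stand-in type. This is an assumption about the shape of the logic, not the library's source: Extractor's fields are unexported, so pipeline, newPipeline, add, plainText, dropStars, and trim are all hypothetical names, and stopping at the first error is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ExtractFunc mirrors the documented signature (redeclared for this sketch).
type ExtractFunc func(input string) (*string, error)

// pipeline is a hypothetical stand-in for the library's Extractor.
type pipeline struct {
	funcs []ExtractFunc
}

// newPipeline requires at least one function, like the documented constructor.
func newPipeline(first ExtractFunc, more ...ExtractFunc) *pipeline {
	return &pipeline{funcs: append([]ExtractFunc{first}, more...)}
}

// add appends fn and returns the receiver so calls can be chained.
func (p *pipeline) add(fn ExtractFunc) *pipeline {
	p.funcs = append(p.funcs, fn)
	return p
}

// plainText feeds the output of each function into the next one and stops at
// the first error (assumed behavior).
func (p *pipeline) plainText(input string) (*string, error) {
	current := input
	for _, fn := range p.funcs {
		out, err := fn(current)
		if err != nil {
			return nil, err
		}
		current = *out
	}
	return &current, nil
}

// dropStars removes Markdown emphasis markers.
func dropStars(s string) (*string, error) {
	out := strings.ReplaceAll(s, "*", "")
	return &out, nil
}

// trim removes leading and trailing whitespace.
func trim(s string) (*string, error) {
	out := strings.TrimSpace(s)
	return &out, nil
}

func main() {
	p := newPipeline(dropStars).add(trim)
	out, err := p.plainText("  *markdown* ")
	if err != nil {
		panic(err)
	}
	fmt.Println(*out)
	// Output: markdown
}
```

Because each function receives the previous function's output, the order in which extract functions are registered can change the result.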
