README

html-to-markdown


Gopher standing on top of a machine that converts a box of HTML to blocks of Markdown

Convert HTML into Markdown with Go. It uses an HTML parser and avoids regular expressions as much as possible. That prevents many edge cases and makes it suitable for input whose structure is not known in advance.

Installation

go get github.com/JohannesKaufmann/html-to-markdown

Usage

import md "github.com/JohannesKaufmann/html-to-markdown"

converter := md.NewConverter("", true, nil)

html := `<strong>Important</strong>`

markdown, err := converter.ConvertString(html)
if err != nil {
  log.Fatal(err)
}
fmt.Println("md ->", markdown)

If you are already using goquery you can pass a selection to Convert.

// Convert returns only the markdown string (no error value).
markdown := converter.Convert(selec)

Using it on the command line

If you want to use html-to-markdown on the command line without writing any Go code, check out html2md, a CLI wrapper for html-to-markdown that has all of the following options and plugins built in.

Options

The third parameter to md.NewConverter is *md.Options.

For example, you can change the delimiter that surrounds bold text (default "**") to a different one (for example "__") by setting StrongDelimiter.

opt := &md.Options{
  StrongDelimiter: "__", // default: **
  // ...
}
converter := md.NewConverter("", true, opt)

For all the possible options look at the godocs, and for an example look at the example.

Adding Rules

converter.AddRules(
  md.Rule{
    Filter: []string{"del", "s", "strike"},
    Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
      // You need to return a pointer to a string (md.String is just a helper function).
      // If you return nil the next function for that html element
      // will be picked. For example you could only convert an element
      // if it has a certain class name and fallback if not.
      content = strings.TrimSpace(content)
      return md.String("~" + content + "~")
    },
  },
  // more rules
)

For more information have a look at the example add_rules.
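The fallback behaviour described in the comments of the rule above (return nil to hand the element to the next rule) can be modelled without the library. The following is a minimal, self-contained sketch: `replacement`, `str` and `applyRules` are hypothetical names for illustration, the class check stands in for inspecting the goquery selection, and trying the newest rule first is an assumption of this sketch rather than a description of the library's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// replacement is a simplified stand-in for a rule's Replacement function:
// it returns nil to signal "not handled, try the next rule".
// (Hypothetical type; the real signature also receives a goquery selection.)
type replacement func(content, class string) *string

// str returns a pointer to s, playing the role of the md.String helper.
func str(s string) *string { return &s }

// applyRules tries the rules from last added to first added and returns the
// first non-nil result. The newest-first order is an assumption of this sketch.
func applyRules(rules []replacement, content, class string) string {
	for i := len(rules) - 1; i >= 0; i-- {
		if res := rules[i](content, class); res != nil {
			return *res
		}
	}
	return content // no rule handled the element: keep the plain text
}

func main() {
	rules := []replacement{
		// Default rule: always wraps the content in "~".
		func(content, class string) *string {
			return str("~" + strings.TrimSpace(content) + "~")
		},
		// Custom rule: only handles elements with the class "removed"
		// and falls back to the rule above by returning nil.
		func(content, class string) *string {
			if class != "removed" {
				return nil
			}
			return str("~~" + strings.TrimSpace(content) + "~~")
		},
	}

	fmt.Println(applyRules(rules, " old text ", "removed")) // ~~old text~~
	fmt.Println(applyRules(rules, " old text ", ""))        // ~old text~
}
```

Returning a pointer instead of a plain string is what makes the fallback possible: nil means "not handled", while a pointer to an empty string means "handled, and the result is empty".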

Using Plugins

If you want plugins (GitHub Flavored Markdown features like strikethrough, tables, ...) you can pass them to Use.

import "github.com/JohannesKaufmann/html-to-markdown/plugin"

// Use the `GitHubFlavored` plugin from the `plugin` package.
converter.Use(plugin.GitHubFlavored())

Or, if you only want the Strikethrough plugin, pass just that one. You can change the delimiter that surrounds crossed-out text by setting the first argument to a different value (for example "~~" instead of "~").

converter.Use(plugin.Strikethrough(""))

For more information have a look at the example github_flavored.

Writing Plugins

Have a look at the plugin folder for a reference implementation. The most basic one is Strikethrough.

Other Methods

Godoc

func (c *Converter) Keep(tags ...string) *Converter

Determines which elements are to be kept and rendered as HTML.

func (c *Converter) Remove(tags ...string) *Converter

Determines which elements are to be removed altogether i.e. converted to an empty string.

Issues

If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!


Documentation

Overview

    Package md converts html to markdown.

    converter := md.NewConverter("", true, nil)
    
    html := `<strong>Important</strong>`
    
    markdown, err := converter.ConvertString(html)
    if err != nil {
      log.Fatal(err)
    }
    fmt.Println("md ->", markdown)
    

    Or if you are already using goquery:

    markdown := converter.Convert(selec)
    

    Index

    Constants

    This section is empty.

    Variables

    var Timeout = time.Second * 10

      Timeout for the http client

      Functions

      func AddSpaceIfNessesary

      func AddSpaceIfNessesary(selec *goquery.Selection, markdown string) string

        AddSpaceIfNessesary adds spaces to the text based on the neighbors. That makes sure that there is always a space to the side, to recognize the delimiter.

        func CalculateCodeFence

        func CalculateCodeFence(fenceChar rune, content string) string

          CalculateCodeFence can be passed the content of a code block and it returns how many fence characters (` or ~) should be used.

          This is useful if the html content includes the same fence characters, for example ``` -> https://stackoverflow.com/a/49268657

          func CollectText

          func CollectText(n *html.Node) string

            CollectText returns the text of the node and all its children

            func DefaultGetAbsoluteURL

            func DefaultGetAbsoluteURL(selec *goquery.Selection, rawURL string, domain string) string

              DefaultGetAbsoluteURL is the default function and can be overridden through `GetAbsoluteURL` in the options.

              func DomainFromURL

              func DomainFromURL(rawURL string) string

                DomainFromURL returns `u.Host` from the parsed url.
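
How a domain is extracted and how a relative url becomes absolute can be illustrated with net/url from the standard library. This is a sketch under assumptions, not the library's code: `hostOf` and `absoluteURL` are hypothetical helpers, and prefixing the domain with "http://" follows the example given in the Options documentation (http://domain.com/page.html).

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the host part of rawURL, or "" if it cannot be parsed.
func hostOf(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return u.Host
}

// absoluteURL resolves rawURL against "http://" + domain.
// URLs that are already absolute are returned unchanged; an empty domain
// or an unparsable url leaves rawURL untouched.
// (The "http://" scheme is an assumption of this sketch.)
func absoluteURL(rawURL, domain string) string {
	if domain == "" {
		return rawURL
	}
	base, err := url.Parse("http://" + domain)
	if err != nil {
		return rawURL
	}
	ref, err := url.Parse(rawURL)
	if err != nil {
		return rawURL
	}
	return base.ResolveReference(ref).String()
}

func main() {
	domain := hostOf("https://example.com/blog/post")
	fmt.Println(domain)

	fmt.Println(absoluteURL("/image.png", domain))
	fmt.Println(absoluteURL("https://other.org/logo.png", domain))
}
```

A custom GetAbsoluteURL in the options could follow the same pattern, for example to rewrite image urls so they point at a proxy.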

                func EscapeMultiLine

                func EscapeMultiLine(content string) string

                  EscapeMultiLine deals with multiline content inside a link

                  func IsInlineElement

                  func IsInlineElement(e string) bool

                    IsInlineElement can be used to check whether a node name (goquery.Nodename) is an html inline element and not a block element. Used in the rule for the p tag to check whether the text is inside a block element.

                    func String

                    func String(text string) *string

                      String is a helper function to return a pointer.

                      func TrimTrailingSpaces

                      func TrimTrailingSpaces(text string) string

                        TrimTrailingSpaces removes unnecessary spaces from the end of lines.

                        func TrimpLeadingSpaces

                        func TrimpLeadingSpaces(text string) string

                          TrimpLeadingSpaces removes spaces from the beginning of a line but makes sure that list items and code blocks are not affected.

                          Types

                          type AdvancedResult

                          type AdvancedResult struct {
                          	Header   string
                          	Markdown string
                          	Footer   string
                          }

                            AdvancedResult is used for example for links. If you use LinkStyle:referenced the link href is placed at the bottom of the generated markdown (Footer).
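
To make the split between Markdown and Footer concrete, here is a small sketch of reference-style links. The struct is copied from above so the example compiles on its own; `referencedLink` and `assemble` are hypothetical helpers, and the way the parts are joined (body first, then a blank line, then one definition per line) is an assumption rather than the library's exact output.

```go
package main

import (
	"fmt"
	"strings"
)

// AdvancedResult is copied from the type shown above so that the sketch
// is self-contained.
type AdvancedResult struct {
	Header   string
	Markdown string
	Footer   string
}

// referencedLink builds a "full" reference-style link: the inline part goes
// into Markdown, the link definition into Footer. (Hypothetical helper.)
func referencedLink(text, href string, index int) AdvancedResult {
	return AdvancedResult{
		Markdown: fmt.Sprintf("[%s][%d]", text, index),
		Footer:   fmt.Sprintf("[%d]: %s", index, href),
	}
}

// assemble joins the Markdown parts with spaces and appends all non-empty
// footers at the bottom, separated from the body by a blank line.
// (The joining strategy is an assumption of this sketch.)
func assemble(results []AdvancedResult) string {
	var body, footers []string
	for _, r := range results {
		body = append(body, r.Markdown)
		if r.Footer != "" {
			footers = append(footers, r.Footer)
		}
	}
	out := strings.Join(body, " ")
	if len(footers) > 0 {
		out += "\n\n" + strings.Join(footers, "\n")
	}
	return out
}

func main() {
	results := []AdvancedResult{
		referencedLink("Go", "https://go.dev", 1),
		referencedLink("Docs", "https://pkg.go.dev", 2),
	}
	fmt.Println(assemble(results))
}
```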

                            type Afterhook

                            type Afterhook func(markdown string) string

                            type BeforeHook

                            type BeforeHook func(selec *goquery.Selection)

                            type Converter

                            type Converter struct {
                            	// contains filtered or unexported fields
                            }

                              Converter is initialized by NewConverter.

                              func NewConverter

                              func NewConverter(domain string, enableCommonmark bool, options *Options) *Converter

                                NewConverter initializes a new converter and holds all the rules.

                                - `domain` is used for links and images to convert relative urls ("/image.png") to absolute urls.
                                - CommonMark is the default set of rules. Set enableCommonmark to false if you want to customize everything using AddRules and DON'T want to fall back to the default rules.
                                

                                func (*Converter) AddRules

                                func (conv *Converter) AddRules(rules ...Rule) *Converter

                                  AddRules adds the rules that are passed in to the converter.

                                  By default it overrides the rule for that html tag. You can fall back to the default rule by returning nil.

                                  func (*Converter) After

                                  func (conv *Converter) After(hooks ...Afterhook) *Converter

                                    After registers a hook that is run after the conversion. It can be used to transform the markdown document that is about to be returned.

                                    For example, the default after hook trims the returned markdown.

                                    func (*Converter) Before

                                    func (conv *Converter) Before(hooks ...BeforeHook) *Converter

                                      Before registers a hook that is run before the conversion. It can be used to transform the original goquery html document.

                                      For example, the default before hook adds an index to every link, so that the `a` tag rule (for the "referenced" link style with the "full" reference style) can use an incrementing number.

                                      func (*Converter) Convert

                                      func (conv *Converter) Convert(selec *goquery.Selection) string

                                        Convert returns the content from a goquery selection. If you have a goquery document just pass in doc.Selection.

                                        func (*Converter) ConvertBytes

                                        func (conv *Converter) ConvertBytes(bytes []byte) ([]byte, error)

                                          ConvertBytes returns the content from a html byte array.

                                          func (*Converter) ConvertReader

                                          func (conv *Converter) ConvertReader(reader io.Reader) (bytes.Buffer, error)

                                            ConvertReader returns the content from a reader and returns a buffer.

                                            func (*Converter) ConvertResponse

                                            func (conv *Converter) ConvertResponse(res *http.Response) (string, error)

                                              ConvertResponse returns the content from a html response.

                                              func (*Converter) ConvertString

                                              func (conv *Converter) ConvertString(html string) (string, error)

                                                ConvertString returns the content from a html string. If you already have a goquery selection use `Convert`.

                                                func (*Converter) ConvertURL

                                                func (conv *Converter) ConvertURL(url string) (string, error)

                                                  ConvertURL returns the content from the page with that url.

                                                  func (*Converter) Keep

                                                  func (conv *Converter) Keep(tags ...string) *Converter

                                                    Keep certain html tags in the generated output.

                                                    func (*Converter) Remove

                                                    func (conv *Converter) Remove(tags ...string) *Converter

                                                      Remove certain html tags from the source.

                                                      func (*Converter) Use

                                                      func (conv *Converter) Use(plugins ...Plugin) *Converter

                                                        Use can be used to add additional functionality to the converter. It is used when rules alone are not sufficient, for example in plugins.

                                                        type Options

                                                        type Options struct {
                                                        	// "setext" or "atx"
                                                        	// default: "atx"
                                                        	HeadingStyle string
                                                        
                                                        	// Any Thematic break
                                                        	// default: "* * *"
                                                        	HorizontalRule string
                                                        
                                                        	// "-", "+", or "*"
                                                        	// default: "-"
                                                        	BulletListMarker string
                                                        
                                                        	// "indented" or "fenced"
                                                        	// default: "indented"
                                                        	CodeBlockStyle string
                                                        
                                                        	// ``` or ~~~
                                                        	// default: ```
                                                        	Fence string
                                                        
                                                        	// _ or *
                                                        	// default: _
                                                        	EmDelimiter string
                                                        
                                                        	// ** or __
                                                        	// default: **
                                                        	StrongDelimiter string
                                                        
                                                        	// inlined or referenced
                                                        	// default: inlined
                                                        	LinkStyle string
                                                        
                                                        	// full, collapsed, or shortcut
                                                        	// default: full
                                                        	LinkReferenceStyle string
                                                        
                                                        	// GetAbsoluteURL parses the `rawURL` and adds the `domain` to convert relative (/page.html)
                                                        	// urls to absolute urls (http://domain.com/page.html).
                                                        	//
                                                        	// The default is `DefaultGetAbsoluteURL`, unless you override it. That can also
                                                        	// be useful if you want to proxy the images.
                                                        	GetAbsoluteURL func(selec *goquery.Selection, rawURL string, domain string) string
                                                        	// contains filtered or unexported fields
                                                        }

                                                          Options to customize the output. You can change stuff like the character that is used for strong text.

                                                          type Plugin

                                                          type Plugin func(conv *Converter) []Rule

                                                            Plugin can be used to extend functionality beyond what is offered by CommonMark.

                                                            type Rule

                                                            type Rule struct {
                                                            	Filter              []string
                                                            	Replacement         func(content string, selec *goquery.Selection, options *Options) *string
                                                            	AdvancedReplacement func(content string, selec *goquery.Selection, options *Options) (res AdvancedResult, skip bool)
                                                            }

                                                              Rule to convert certain html tags to markdown.

                                                              md.Rule{
                                                                Filter: []string{"del", "s", "strike"},
                                                                Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
                                                                  // You need to return a pointer to a string (md.String is just a helper function).
                                                                  // If you return nil the next function for that html element
                                                                  // will be picked. For example you could only convert an element
                                                                  // if it has a certain class name and fallback if not.
                                                                  return md.String("~" + content + "~")
                                                                },
                                                              }
                                                              

                                                              Directories

                                                              Path      Synopsis
                                                              escape    Package escape escapes characters that are commonly used in markdown like the * for strong/italic.
                                                              examples
                                                              plugin    Package plugin contains all the rules that are not part of Commonmark like GitHub Flavored Markdown.