Documentation ¶
Overview ¶
Package md implements a Markdown parser.
To use this package, call Render with one of the Codec implementations:
HTMLCodec converts Markdown to HTML. This is used in src.elv.sh/website/cmd/md2html, part of Elvish's website toolchain.
FmtCodec formats Markdown. This is used in src.elv.sh/cmd/elvmdfmt, which formats the Markdown files in the Elvish repo.
TTYCodec renders Markdown in the terminal. This will be used in a help system that can be used directly from Elvish to render documentation of Elvish modules.
Why another Markdown implementation? ¶
The Elvish project uses Markdown in the documentation ("elvdoc") for the functions and variables defined in builtin modules. These docs are then converted to HTML as part of the website; for example, you can read the docs for builtin functions and variables at https://elv.sh/ref/builtin.html.
We used to use Pandoc to convert the docs from their Markdown sources to HTML. However, we would also like to expand the elvdoc system in two ways:
We would like to support elvdocs in user-defined modules, not just builtin modules.
We would like to allow users to read elvdocs directly from the Elvish program, in the terminal, without needing a browser or an Internet connection.
With these requirements, Elvish itself needs to know how to parse Markdown sources and render them in the terminal, so we need a Go implementation instead. There is a good Go implementation, github.com/yuin/goldmark, but it is quite large: linking it into Elvish will increase the binary size by more than 1MB. (There is another popular Markdown implementation, github.com/russross/blackfriday/v2, but it doesn't support CommonMark.)
By having a narrower focus, this package is much smaller than goldmark, and can be easily optimized for Elvish's use cases. In contrast to goldmark's 1MB, including Render and HTMLCodec in Elvish only increases the binary size by 150KB. That said, the functionality provided by this package still tries to be as general as possible, and can potentially be used by other people interested in a small Markdown implementation.
Besides elvdocs, Pandoc was also used to convert all the other content on the Elvish website (https://elv.sh) to HTML. Additionally, Prettier was used to format all the Markdown files in the repo. Now that Elvish has its own Markdown implementation, we can use it not just for rendering elvdocs in the terminal, but also to replace Pandoc and Prettier. These external tools are decent, but using them still came with some friction:
Even though both are relatively easy to set up, they can still be a hindrance to casual contributors.
Since the behavior of these tools can change between versions, we explicitly specified their versions in both CI configurations and contributing instructions. But this created another problem: every time these tools released new versions, we had to manually bump the versions, and every contributor also needed to manually update them in their development environments.
Replacing external tools with this package removes this friction.
Additionally, this package is very easy to extend and optimize to suit Elvish's needs:
We used to customize Pandoc using a mix of shell scripts, templates and Lua scripts. While these customization options of Pandoc are well documented, they are not something people are likely to be familiar with.
With this implementation, everything is now done with Go code.
The Markdown formatter is much faster than Prettier, so it's now feasible to run the formatter every time a Markdown file is saved.
Which Markdown variant does this package implement? ¶
This package implements a large subset of the CommonMark spec, with the following omissions:
"\r" and "\r\n" are not supported as line endings. This can be easily worked around by converting them to "\n" first.
Tabs are not supported for defining block structures; use spaces instead. Tabs in other contexts are supported.
Among HTML entities, only a few are supported: &lt; &gt; &quot; &apos; &amp;. This is because the full list of HTML entities is very large and would inflate the binary size.
If full support for HTML entities is desirable, this can be done by overriding the UnescapeHTML variable with html.UnescapeString.
(Numeric character references, such as &#9; for a tab, are fully supported.)
Setext headings are not supported; use ATX headings instead.
Reference links are not supported; use inline links instead.
Lists are always considered loose.
The package also supports the following extensions:
- ATX headings may be followed by Pandoc header attributes {...}.
These omitted features are never used in Elvish's Markdown sources.
All implemented features pass their relevant CommonMark spec tests, currently targeting CommonMark 0.31.2. See testutils_test.go for a complete list of which spec tests are skipped.
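The line-ending workaround mentioned in the list of omissions can be sketched as follows. This is a minimal illustration that uses only the standard library; normalizeLineEndings is a hypothetical helper name, not part of this package. Its result would then be passed to Render.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeLineEndings (a hypothetical helper) rewrites "\r\n" and lone "\r"
// line endings as "\n". The "\r\n" pair is listed first so that it is
// replaced as a unit rather than as a "\r" followed by a "\n".
func normalizeLineEndings(text string) string {
	return strings.NewReplacer("\r\n", "\n", "\r", "\n").Replace(text)
}

func main() {
	// Print with %q so the line endings are visible.
	fmt.Printf("%q\n", normalizeLineEndings("# Title\r\n\r\nCR only\rLF only\n"))
}
```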
Is this package useful outside Elvish? ¶
Yes! Well, hopefully. Assuming you don't use the features this package omits, it can be useful in at least the following ways:
The implementation is quite lightweight, so you can use it instead of a more full-featured Markdown library if small binary size is important.
As shown above, the increase in binary size when using this package in Elvish is about 150KB, compared to more than 1MB when using github.com/yuin/goldmark. Your mileage may vary though, since the binary size increase depends on which packages the binary is already including.
The formatter implemented by FmtCodec is heavily fuzz-tested to ensure that it does not alter the semantics of the Markdown.
Markdown formatting is fraught with tricky edge cases. For example, if a formatter standardizes all bullet markers to "-", it might reformat "* --" to "- --", but the latter will now be parsed as a thematic break.
Thanks to Go's builtin fuzzing support, the formatter is able to handle many such corner cases (at least all the corner cases found by the fuzzer; take a look and try them on other formatters!). There are two areas, namely nested and consecutive emphasis or strong emphasis, that are so tricky to get 100% right that the formatter is not guaranteed to be correct; the fuzz test explicitly skips those cases.
Nonetheless, if you are writing a Markdown formatter and care about correctness, the corner cases will be interesting, regardless of which language you are using to implement the formatter.
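To make the bullet-marker example concrete, here is a small sketch of the check a formatter would need before rewriting a marker. It follows the CommonMark definition of a thematic break; isThematicBreak is a hypothetical helper written for this illustration and is not how FmtCodec is necessarily implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// isThematicBreak (a hypothetical helper) reports whether line is a
// CommonMark thematic break: at most three spaces of indentation, then three
// or more matching "-", "_" or "*" characters, optionally separated by
// spaces or tabs, and nothing else.
func isThematicBreak(line string) bool {
	rest := strings.TrimLeft(line, " ")
	if len(line)-len(rest) > 3 {
		return false // four or more spaces of indentation
	}
	var marker byte
	count := 0
	for i := 0; i < len(rest); i++ {
		switch c := rest[i]; c {
		case '-', '_', '*':
			if marker != 0 && c != marker {
				return false // mixed marker characters
			}
			marker = c
			count++
		case ' ', '\t':
			if marker == 0 {
				return false // a tab before the first marker counts as indentation
			}
		default:
			return false
		}
	}
	return count >= 3
}

func main() {
	// "* --" is a list item, but replacing its marker with "-" yields a line
	// made only of dashes and spaces.
	for _, line := range []string{"* --", "- --", "- ---"} {
		fmt.Printf("%q thematic break: %v\n", line, isThematicBreak(line))
	}
}
```

A formatter that prefers "-" could run such a check on the candidate output line and keep "*" as the marker whenever the check succeeds.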
Index ¶
- Variables
- func Render(text string, codec Codec)
- func RenderInlineContentToHTML(sb *strings.Builder, ops []InlineOp)
- func RenderString(text string, codec StringerCodec) string
- type Codec
- type FmtCodec
- type FmtUnsupported
- type HTMLCodec
- type InlineOp
- type InlineOpType
- type Op
- type OpType
- type SmartPunctsCodec
- type StringerCodec
- type TTYCodec
- type TextBlock
- type TextCodec
- type TraceCodec
Variables ¶
var UnescapeHTML = unescapeHTML
UnescapeHTML is used by the parser to unescape HTML entities and numeric character references.
The default implementation supports numeric character references, plus a minimal set of entities that are necessary for writing valid HTML or can appear in the output of FmtCodec. It can be set to html.UnescapeString for better CommonMark compliance.
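The override mechanism is an ordinary reassignable function variable. The sketch below imitates that pattern with local stand-ins so that it runs without this package: unescapeHTML and minimalEntities are hypothetical names, and the stand-in default handles only the five named entities (unlike the real default, it ignores numeric character references). With the real package, the equivalent step would be assigning html.UnescapeString to UnescapeHTML before rendering.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// minimalEntities is a hypothetical stand-in for a small default: it knows
// only the five entities named in the documentation.
var minimalEntities = strings.NewReplacer(
	"&lt;", "<", "&gt;", ">", "&quot;", `"`, "&apos;", "'", "&amp;", "&")

// unescapeHTML plays the role of the UnescapeHTML variable in this sketch: a
// package-level function variable that callers may reassign.
var unescapeHTML = minimalEntities.Replace

// enableFullEntities swaps in the standard library's complete entity table.
func enableFullEntities() {
	unescapeHTML = html.UnescapeString
}

func main() {
	const input = "&lt;b&gt; &copy;"
	fmt.Println(unescapeHTML(input)) // "&copy;" is not in the minimal set
	enableFullEntities()
	fmt.Println(unescapeHTML(input)) // now decoded by html.UnescapeString
}
```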
Functions ¶
func RenderInlineContentToHTML ¶
RenderInlineContentToHTML renders inline content to HTML, writing to a strings.Builder. This is useful for implementing an alternative HTML-outputting Codec.
func RenderString ¶
func RenderString(text string, codec StringerCodec) string
RenderString calls Render(text, codec) and returns codec.String(). This can be a bit more convenient to use than Render.
Types ¶
type FmtCodec ¶
type FmtCodec struct {
	Width int
	// contains filtered or unexported fields
}
FmtCodec is a codec that formats Markdown in a specific style.
The only supported configuration option is the text width.
The formatted text uses the following style:
Blocks are always separated by a blank line.
Thematic breaks use "***" where possible, falling back to "---" if using the former is problematic.
Code blocks are always fenced, never indented.
Code fences use backquotes (like "```") wherever possible, falling back to "~~~" if using the former is problematic.
Continuation markers of container blocks ("> " for blockquotes and spaces for list items) are never omitted; in other words, lazy continuation is never used.
Blockquotes use "> ", never omitting the space.
Bullet lists use "-" as markers where possible, falling back to "*" if using the former is problematic.
Ordered lists use "X." (X being a number) where possible, falling back to "X)" if using the former is problematic.
Bullet lists and ordered lists are indented 4 spaces where possible.
Emphasis always uses "*".
Strong emphasis always uses "**".
Hard line breaks always use an explicit "\".
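The "where possible, falling back" rules in the style list above all have the same shape: prefer one marker, and switch only when it would change how the text parses. As an illustration, here is a sketch for code fences based on two CommonMark rules: the info string of a backquote fence cannot contain backquotes, and a content line made of at least as many fence characters as the opening fence would close the block. chooseFence and leadingRun are hypothetical helpers, and the exact conditions FmtCodec treats as "problematic" are an assumption here.

```go
package main

import (
	"fmt"
	"strings"
)

// leadingRun returns how many times c repeats at the start of s.
func leadingRun(s string, c byte) int {
	n := 0
	for n < len(s) && s[n] == c {
		n++
	}
	return n
}

// chooseFence (a hypothetical helper) picks a fence for a code block with
// the given info string and content lines.
func chooseFence(info string, lines []string) string {
	c := byte('`')
	if strings.Contains(info, "`") {
		// A backquote fence cannot carry an info string with backquotes.
		c = '~'
	}
	n := 3
	for _, line := range lines {
		// A closing fence may be indented, so ignore leading spaces. Making
		// the fence longer than every such run is conservative but safe.
		if run := leadingRun(strings.TrimLeft(line, " "), c); run >= n {
			n = run + 1
		}
	}
	return strings.Repeat(string(c), n)
}

func main() {
	fmt.Println(chooseFence("go", []string{"x := 1"}))
	fmt.Println(chooseFence("a`b", nil))
	fmt.Println(chooseFence("", []string{"```", "nested fence"}))
}
```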
func (*FmtCodec) Unsupported ¶
func (c *FmtCodec) Unsupported() *FmtUnsupported
Unsupported returns information about use of unsupported features that may make the output incorrect. It returns nil if there is no use of unsupported features.
type FmtUnsupported ¶
type FmtUnsupported struct {
	// Input contains emphasis or strong emphasis nested in another emphasis or
	// strong emphasis (not necessarily of the same type).
	NestedEmphasisOrStrongEmphasis bool
	// Input contains emphasis or strong emphasis that follows immediately after
	// another emphasis or strong emphasis (not necessarily of the same type).
	ConsecutiveEmphasisOrStrongEmphasis bool
}
FmtUnsupported contains information about use of unsupported features.
type HTMLCodec ¶
type HTMLCodec struct {
	strings.Builder
	// If non-nil, will be called for each code block. The return value is
	// inserted into the HTML output and should be properly escaped.
	ConvertCodeBlock func(info, code string) string
}
HTMLCodec converts Markdown to HTML.
type InlineOp ¶
type InlineOp struct {
	Type InlineOpType
	// OpText, OpCodeSpan, OpRawHTML, OpAutolink: Text content
	// OpLinkStart, OpLinkEnd, OpImage: title text
	Text string
	// OpLinkStart, OpLinkEnd, OpImage, OpAutolink
	Dest string
	// For OpImage
	Alt string
}
InlineOp represents an inline operation.
type InlineOpType ¶
type InlineOpType uint
InlineOpType enumerates possible types of an InlineOp.
const (
	// Text elements. Embedded newlines in OpText are turned into OpNewLine, but
	// OpRawHTML can contain embedded newlines. OpCodeSpan never contains
	// embedded newlines.
	OpText InlineOpType = iota
	OpCodeSpan
	OpRawHTML
	OpNewLine

	// Inline markup elements.
	OpEmphasisStart
	OpEmphasisEnd
	OpStrongEmphasisStart
	OpStrongEmphasisEnd
	OpLinkStart
	OpLinkEnd
	OpImage
	OpAutolink
	OpHardLineBreak
)
func (InlineOpType) String ¶
func (i InlineOpType) String() string
type Op ¶
type Op struct {
	Type OpType
	// 1-based line number. If the Op spans multiple lines, this identifies the
	// first line. For the *End types, this identifies the first line that
	// causes the block to be terminated, which can be the first line of another
	// block.
	LineNo int
	// For OpOrderedListStart (the start number) or OpHeading (as the heading
	// level)
	Number int
	// For OpHeading (attributes inside { }) and OpCodeBlock (text after opening
	// fence)
	Info string
	// For OpCodeBlock and OpHTMLBlock
	Lines []string
	// For OpParagraph and OpHeading
	Content []InlineOp
}
Op represents an operation for the Codec.
type OpType ¶
type OpType uint
OpType enumerates possible types of an Op.
type SmartPunctsCodec ¶
type SmartPunctsCodec struct{ Inner Codec }
SmartPunctsCodec wraps another codec, converting certain ASCII punctuations to nicer Unicode counterparts:
A straight double quote (") is converted to a left double quote (“) when it follows whitespace, or a right double quote (”) when it follows non-whitespace.
A straight single quote (') is converted to a left single quote (‘) when it follows whitespace, or a right single quote or apostrophe (’) when it follows non-whitespace.
A run of two dashes (--) is converted to an en-dash (–).
A run of three dashes (---) is converted to an em-dash (—).
A run of three dots (...) is converted to an ellipsis (…).
The start of a line is considered to be whitespace.
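The conversion rules above can be sketched on a plain string as follows. This is a standalone illustration, not the codec's actual code: smartPuncts is a hypothetical function, and leaving runs of other lengths (such as four dashes or two dots) unchanged is an assumption, since the rules only name runs of exactly two or three.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// smartPuncts (a hypothetical function) applies the listed punctuation rules
// to a piece of text.
func smartPuncts(s string) string {
	var sb strings.Builder
	rs := []rune(s)
	prevSpace := true // the start of the text counts as whitespace
	for i := 0; i < len(rs); {
		r := rs[i]
		switch r {
		case '"':
			if prevSpace {
				sb.WriteRune('“')
			} else {
				sb.WriteRune('”')
			}
			i++
		case '\'':
			if prevSpace {
				sb.WriteRune('‘')
			} else {
				sb.WriteRune('’')
			}
			i++
		case '-', '.':
			// Measure the whole run of identical characters.
			j := i
			for j < len(rs) && rs[j] == r {
				j++
			}
			switch n := j - i; {
			case r == '-' && n == 2:
				sb.WriteRune('–')
			case r == '-' && n == 3:
				sb.WriteRune('—')
			case r == '.' && n == 3:
				sb.WriteRune('…')
			default:
				// Assumption: other run lengths are left as they are.
				sb.WriteString(string(rs[i:j]))
			}
			i = j
		default:
			sb.WriteRune(r)
			i++
		}
		// A newline is whitespace, so the start of each later line also
		// counts as following whitespace.
		prevSpace = unicode.IsSpace(r)
	}
	return sb.String()
}

func main() {
	fmt.Println(smartPuncts("\"Hi\" -- it's done..."))
}
```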
func (SmartPunctsCodec) Do ¶
func (c SmartPunctsCodec) Do(op Op)
type StringerCodec ¶
StringerCodec is a Codec that also implements the String method.
type TTYCodec ¶
type TTYCodec struct {
	Width int
	// If non-nil, will be called to highlight the content of code blocks.
	HighlightCodeBlock func(info, code string) ui.Text
	// If non-nil, will be called for each relative link destination.
	ConvertRelativeLink func(dest string) string
	// contains filtered or unexported fields
}
TTYCodec renders Markdown in a terminal.
The rendered text uses the following style:
Adjacent blocks are always separated with one blank line.
Thematic breaks are rendered as "────" (four U+2500 "box drawing light horizontal").
Headings are rendered like "# Heading" in bold, with the same number of hashes as in Markdown.
Code blocks are indented two spaces. The HighlightCodeBlock callback can be supplied to highlight the content of the code block.
HTML blocks are ignored.
Paragraphs are always reflowed to fit the given width.
Blockquotes start with "│ " (U+2502 "box drawing light vertical", then a space) on each line.
Bullet list items start with "• " (U+2022 "bullet", then a space) on the first line. Continuation lines are indented two spaces.
Ordered list items start with "X. " (where X is a number) on the first line. Continuation lines are indented three spaces.
Code spans are underlined.
Emphasis makes the text italic. (Some terminal emulators turn italic text into inverse text, which is not ideal but fine.)
Strong emphasis makes the text bold.
Links are rendered with their text content underlined. If the link is absolute (starts with scheme:), the destination is rendered like " (https://example.com)" after the text content.
Relative link destinations are not shown by default, since they are usually not useful in a terminal. If the ConvertRelativeLink callback is non-nil, it is called for each relative link and non-empty return values are shown.
The link description is ignored for now since Elvish's Markdown sources never use it.
Images are rendered like "Image: alt text (https://example.com/a.png)".
Autolinks have their text content rendered.
Raw HTML is mostly ignored, except that text between <kbd> and </kbd> becomes inverse video.
Hard line breaks are respected.
The structure of the implementation closely mirrors FmtCodec in a lot of places, without the complexity of handling all edge cases correctly, but with the slight complexity of handling styles.
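Two of the style rules above, reflowing paragraphs to the width and indenting continuation lines of bullet items by two spaces, can be sketched together. renderBullet is a hypothetical function for illustration only; it ignores styling and measures width in runes, which is a simplification (a real terminal renderer would need display widths).

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// renderBullet (a hypothetical function) reflows text into lines of at most
// width runes. The first line starts with "• " and continuation lines are
// indented two spaces. A word longer than the available width is kept whole
// on its own line.
func renderBullet(text string, width int) []string {
	const firstPrefix, contPrefix = "• ", "  "
	var lines []string
	cur, curWidth, empty := firstPrefix, 2, true
	for _, word := range strings.Fields(text) {
		wordWidth := utf8.RuneCountInString(word)
		if !empty && curWidth+1+wordWidth > width {
			// The word does not fit; start a continuation line.
			lines = append(lines, cur)
			cur, curWidth, empty = contPrefix, 2, true
		}
		if !empty {
			cur += " "
			curWidth++
		}
		cur += word
		curWidth += wordWidth
		empty = false
	}
	if !empty || len(lines) == 0 {
		lines = append(lines, cur)
	}
	return lines
}

func main() {
	for _, line := range renderBullet("one two three four", 10) {
		fmt.Println(line)
	}
}
```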
type TextCodec ¶
type TextCodec struct {
// contains filtered or unexported fields
}
TextCodec is a codec that dumps the pure text content of Markdown.
type TraceCodec ¶
TraceCodec is a Codec that records all the Ops passed to its Do method.
func (*TraceCodec) Do ¶
func (c *TraceCodec) Do(op Op)
func (*TraceCodec) Ops ¶
func (c *TraceCodec) Ops() []Op