sxhtml - Generate HTML from S-Expressions
HTML can be represented as a symbolic expression, also called s-expression
or sexpr for short. This approach is similar to SXML, an attempt to encode
XML as s-expressions.
For example, the following simple HTML text:
<html>
<head><title>Example</title></head>
<body>
<h1 id="main">Title</h1>
<p>This is some example text.</p>
<hr>
<div class="small" id="footnote">Small text.</div>
</body>
</html>
An s-expression representation could be:
(html
(head (title "Example"))
(body
(h1 (@ (id "main")) "Title")
(p "This is some example text.")
(hr)
(div (@ (class "small") (id "footnote")) "Small text.")
)
)
The s-expression representation is easier to parse than the HTML text. In
addition, an s-expression can be analysed, and possibly optimized, more easily
than a string representation. For example, ((p) (p)) can be simplified to
((p)). Similarly, there are circumstances where (li (p "text")) should be
transformed to (li "text").
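The (li (p "text")) rewrite mentioned above can be sketched in a few lines of Python. This is only an illustration, not part of SxHTML: it assumes a hypothetical representation where an element is a Python list whose first item is the tag name, and the function name `unwrap_single_paragraph` is made up for this example.

```python
def has_attributes(element):
    """True if the element's second item is an attribute list ["@", ...]."""
    return (
        len(element) > 1
        and isinstance(element[1], list)
        and element[1][:1] == ["@"]
    )


def unwrap_single_paragraph(node):
    """Rewrite ["li", ["p", ...]] to ["li", ...], recursively."""
    if not isinstance(node, list) or not node:
        return node  # plain text or an empty list: nothing to do
    tag = node[0]
    children = [unwrap_single_paragraph(child) for child in node[1:]]
    if tag == "li" and len(children) == 1:
        only = children[0]
        # Only unwrap a lone, attribute-free paragraph.
        if isinstance(only, list) and only[:1] == ["p"] and not has_attributes(only):
            children = only[1:]
    return [tag, *children]


tree = ["ul", ["li", ["p", "text"]], ["li", ["p", "a"], ["p", "b"]]]
simplified = unwrap_single_paragraph(tree)
```

A list item with several paragraphs is left untouched, because dropping the paragraph tags would merge the texts.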
This library generates HTML from s-expressions.
Often, HTML is generated by using string template libraries, like Mustache
(many programming languages), Jinja (Python), or html/template (Go).
One problem area is escaping certain characters that have a special meaning
in various parts of the HTML text. Obviously, the less-than character "<"
signals the beginning of a tag and cannot be used literally in normal text.
It must be replaced by "&lt;". The ampersand character "&" has a special
meaning too. It must be replaced with "&amp;". But this is only true for
ordinary HTML content. Within HTML attributes (for example "href" in
"<a href="...">...</a>"), further characters must not occur. If you embed
JavaScript in your HTML text, yet another set of rules applies.
Most string template libraries fail in certain scenarios. Mustache provides
replacement characters only for HTML content, but not for HTML attributes.
The same applies to Jinja. The html/template library for Go requires the
developer to correctly specify the adequate escaping mode.
This is because string template libraries operate on, well, the string level.
All structure of the HTML text is lost.
By using a structured representation of HTML, the HTML generator knows about
the specific context and can automatically select the appropriate escape mode.
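To illustrate why the context matters, here is a minimal Python sketch that uses the standard library function `html.escape`. It is not how SxHTML works internally; the two helper names are hypothetical and only show that the same text needs different replacements in element content and in a double-quoted attribute value.

```python
import html


def escape_content(text):
    """Escape text for ordinary element content: only &, < and >."""
    return html.escape(text, quote=False)


def escape_attribute(value):
    """Escape text for a quoted attribute value: quotes are replaced as well."""
    return html.escape(value, quote=True)


text = 'a < b & "c"'
as_content = escape_content(text)      # quotes stay as they are
as_attribute = escape_attribute(text)  # quotes become &quot;
```

A generator that walks a structured representation knows which of the two functions to call at each position, whereas a string template sees only characters.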
Language
SxHTML is relatively lenient about the supported HTML language. However, if in
doubt, it targets HTML5. All tag and attribute names must be lowercase
symbols. Do not use strings or keywords to specify a tag or an attribute.
SxHTML does not check whether a symbol specifies a valid HTML tag or attribute.
Some tag and attribute symbols have a special meaning.
https://html.spec.whatwg.org/multipage/syntax.html#void-elements specifies
the list of void elements, which do not have an end tag. All other tags will
have an end tag.
https://html.spec.whatwg.org/multipage/indices.html#attributes-1 associates
attribute names with expected content. This will result in an additional
escaping mechanism for specific content types. Currently, only URL content is
recognized and escaped.
In addition to the list above, there are some heuristics for detecting the
content type based on the attribute name.
- A prefix of "data-" is stripped. For example, data-href is also treated as
  a URL attribute.
- If there is no "data-" prefix, any namespace prefix is stripped. For example,
  svg:href is also treated as a URL attribute, but not svg:data-href.
- The namespace "xmlns" will always result in treating the attribute as a URL
  attribute, e.g. xmlns:svg.
- If the attribute name contains one of the strings "url", "uri", "src", it
  will be treated as a URL attribute.
- If the attribute name starts with "on", it will be treated in future versions
  as JavaScript.
- An attribute name "style" will treat the attribute value as CSS in the
  future.
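The URL heuristics above could be approximated as follows. This is a sketch under stated assumptions, not the library's code: `URL_ATTRIBUTES` is a small hand-picked subset standing in for the list from the HTML specification, `is_url_attribute` is a hypothetical name, and applying the substring check to the stripped name is a guess. The "on" and "style" rules are omitted because they describe future behaviour.

```python
# Assumption: a small illustrative subset of URL-valued attributes,
# not the complete list from the HTML specification.
URL_ATTRIBUTES = frozenset({"action", "cite", "href", "poster", "src"})


def is_url_attribute(name):
    """Guess whether an attribute name denotes URL content."""
    if name.startswith("data-"):
        # Rule 1: strip a leading "data-".
        name = name[len("data-"):]
    elif ":" in name:
        # Rule 2: without a "data-" prefix, strip the namespace prefix.
        namespace, name = name.split(":", 1)
        if namespace == "xmlns":
            # Rule 3: the "xmlns" namespace always means URL content.
            return True
    if name in URL_ATTRIBUTES:
        return True
    # Rule 4: names containing "url", "uri" or "src" are treated as URLs.
    return any(part in name for part in ("url", "uri", "src"))
```

Because only one prefix is stripped, svg:data-href is reduced to data-href, which is neither in the list nor matched by the substring rule.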
SxHTML defines some additional symbols, all starting with "@":
- @ specifies the attribute list of an HTML tag. It must follow immediately
  the tag symbol and contains a list of pairs, where the first component is a
  symbol and the second component is a string, a keyword, or a number.
- @C marks some content that should be written as <![CDATA[...]]>.
- @H specifies some HTML content that must not be escaped. For example,
  @H "&" is transformed to &, but not &amp;.
- @@ specifies an HTML comment, e.g. (@@ "comment") is transformed to
  <!-- comment -->.
- @@@ specifies a multiline HTML comment, e.g. (@@@ "line1" "line2") is
  transformed to \n<!--\nline1\nline2\n-->\n.
- @@@@ specifies the doctype statement, e.g. (@@@@ (html ...)) is
  transformed to <!DOCTYPE html>\n<html>...</html>.
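The output shapes listed for @C, @@, @@@ and @@@@ can be written down as small Python functions. The function names are hypothetical and the sketch only reproduces the text layout given in the examples above; it does not handle special cases such as "--" inside a comment, which are not described here.

```python
def render_cdata(text):
    """(@C text): wrap the text in a CDATA section."""
    return "<![CDATA[" + text + "]]>"


def render_comment(text):
    """(@@ text): a single-line HTML comment."""
    return "<!-- " + text + " -->"


def render_multiline_comment(*lines):
    """(@@@ line ...): one comment line per argument, surrounded by newlines."""
    return "\n<!--\n" + "\n".join(lines) + "\n-->\n"


def render_doctype(rendered_html):
    """(@@@@ (html ...)): prefix already rendered HTML with the doctype."""
    return "<!DOCTYPE html>\n" + rendered_html
```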
HTML defines some tags as void elements.
A void element has no content; it has a start tag only.
End tags must not be specified, and SxHTML will not generate them.
Any content except attributes is ignored.
Void elements are: area, base, br, col, embed, hr, img, input, link, meta,
source, track, and wbr.
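The handling of void elements can be sketched with a tiny recursive renderer. It is an illustration only and assumes a hypothetical representation: an element is a Python list whose first item is the tag name, and plain strings are text. Attribute lists are left out to keep the example short.

```python
import html

VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def render(node):
    """Render a nested list such as ["p", "text"] to an HTML string."""
    if isinstance(node, str):
        return html.escape(node, quote=False)  # text content
    tag, children = node[0], node[1:]
    if tag in VOID_ELEMENTS:
        # Start tag only: any content is ignored and no end tag is written.
        return f"<{tag}>"
    inner = "".join(render(child) for child in children)
    return f"<{tag}>{inner}</{tag}>"


page = render(["body", ["p", "a < b"], ["hr", "ignored"]])
```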
Attributes
Attributes are always in the second position of a list containing a tag symbol.
For example, (a (@ (href . "https://codeberg.org/t73fde/sxhtml")) "SxHTML")
specifies a link to the page of this library.
It will be transformed to <a href="https://codeberg.org/t73fde/sxhtml">SxHTML</a>.
The syntax for attributes is as follows:
- The first element of the attribute list must be the symbol @.
- Each remaining element must be a list whose first element is a symbol that names the attribute.
- If there is no second element in the list, the attribute is an empty attribute.
  For example, (input (@ (disabled))) will be transformed to <input disabled>.
- If there is a second element in the list, it must be an atomic value, preferably a string.
  For example, (input (@ (disabled "yes"))) will be transformed to <input disabled="yes">.
- If the list contains more elements, they are ignored.
- If the list is really a cons cell, the second element of the cons cell must be an atomic value, preferably a string.
  For example, (input (@ (disabled . "yes"))) will be transformed to <input disabled="yes">.
Since the attribute list is just a list, there might be duplicate symbols as attribute names.
Only the first occurrence of the symbol will create an attribute.
For example, (input (@ (disabled "no") (disabled . "yes"))) will be transformed to <input disabled="no">.
This allows you to extend the list of attributes at the front if you later want to overwrite the value of an attribute.
If you want to prohibit the generation of some attribute while still extending the list of attributes at the front,
use the boolean value False as the value of the attribute.
For example, (input (@ (disabled False) (disabled . "yes"))) will be transformed to <input>.
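The rules for duplicates and for False can be sketched as follows. This is not SxHTML's implementation: it assumes a hypothetical representation where each attribute is a Python tuple, (name,) for an empty attribute and (name, value) otherwise, so the distinction between a list and a cons cell disappears. The name `render_attributes` is made up for this example.

```python
import html


def render_attributes(pairs):
    """Render attribute tuples; the first occurrence of a name wins."""
    seen = set()
    parts = []
    for pair in pairs:
        name = pair[0]
        if name in seen:
            continue  # later duplicates never create an attribute
        seen.add(name)
        if len(pair) == 1:
            parts.append(name)  # empty attribute, e.g. disabled
        elif pair[1] is False:
            continue  # False suppresses the attribute entirely
        else:
            # Elements after the second one are ignored.
            value = html.escape(str(pair[1]), quote=True)
            parts.append(f'{name}="{value}"')
    return "".join(" " + part for part in parts)


overwritten = "<input" + render_attributes([("disabled", "no"), ("disabled", "yes")]) + ">"
suppressed = "<input" + render_attributes([("disabled", False), ("disabled", "yes")]) + ">"
```

Recording the name before checking for False is what makes suppression work: the later (disabled . "yes") entry is then skipped as a duplicate.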