forest

package module
v1.9.57
Published: Dec 27, 2024 License: MIT Imports: 517 Imported by: 0

README

a Go 🌳 Sitter Forest

Where a Gopher wanders around and meets lots of 🌳 Sitters...

First of all, giving credits where they are due:

This repository started as a fork of @smacker's go-tree-sitter repo. I then realized I did not want to also maintain the bindings library in the same project (i.e. the code in the root of that repo, which exposes the sitter.Language type & co.); I just wanted a (big) collection of all the tree-sitter parsers I could add.

So here it is: it started with the parsers and the automation from the above-mentioned repo, then I added a bunch more parsers on top and updated the automation (to support more parsers and also to automatically update the PARSERS.md file, git tags, etc.).

The credits for the parsers go to all the parsers' respective authors (see grammars.json for the source of each and all of the parsers).

This repository does NOT implement any parsers at all; it simply automates pulling them in from upstream, regenerating them from grammar.js and providing Go bindings for them.

See PARSERS.md for the list of supported parsers. The end goal is to maintain parity with nvim_treesitter and to add any other parsers I find, so that it becomes as complete as possible.

For contributing (or just to see how the automation works) see CONTRIBUTING.md.

Differences

  • ~490 parsers in this repo vs. ~30 in the parent repo;
  • all (but 7) are regenerated from grammar.js (via tree-sitter generate) instead of copying the pre-generated files from the parser repo;
  • end-to-end "tree-sitter version alignment" (see below);
  • queries fetching mechanism, so that you can get the queries along with the parsers;
  • filetype detection mechanism that lets you quickly determine which parser and/or query you should pull for a given file (see filetype.json);
  • 3 different ways you can use parsers;
  • kept up to date pretty much on a daily basis with the parsers' changes;
  • kept up to date with tree-sitter;
  • constantly adding new parsers as they become available, even new/experimental ones;
  • test suite to ensure all the parsers in the repo can actually parse content (can build & run successfully).

Tree-Sitter Version Alignment

This library is designed to ensure that we are aligned end-to-end with the latest tree-sitter version:

  • the bindings library go-tree-sitter-bare (itself forked from the same repo - see its own README for why and/or differences),
  • the tree-sitter-cli in package.json and consequently
  • the generated parsers included, ALL use the same tree-sitter version (generally the latest).

Contrast that with the parent repo, where the bindings lag quite a bit behind tree-sitter (last updated to v0.22.5). On top of that, the parsers' files (parser.c, parser.h, etc.) are copied from the parser repos, meaning each one is built with whatever tree-sitter version its maintainers happened to use last, so versions can vary greatly from one parser to another AND from the version the bindings are built with.

Just as an example, in the parent repo the toml parser was generated with https://github.com/ikatyang/tree-sitter/tree/fc5a692 which is v0.19.3~10 of tree-sitter, whereas here it was last rebuilt with v0.24.2 (actually, with 0.24.4 but there were no changes in parser.c so nothing was committed).

Naming Conventions

The language name used is the same as TreeSitter language name (the name exported by grammar.js) and the same as the query folder name in nvim_treesitter (where applicable).

This keeps things simple and consistent.

In rare cases, the Go package name differs from the language name:

  • go actually has the package name Go because package go does not go well in Go (pun intended) but otherwise the language name remains "go";
  • func language, same problem as above, so package name is actually FunC (but everything else is func as normal: folder, language name, etc.);
  • context language, same problem (conflict with stdlib context package) so it uses the name ConTeXt;
  • COBOL language is named COBOL in grammar.js but we expose it as cobol (for aligning with the rest of the parsers);
  • dotenv language is named env in grammar.js but we expose it as dotenv;
  • walnut language is named cwal in grammar.js but we retain it as walnut;
  • janet language is named janet_simple in grammar.js but in here is simply named janet;
  • cgsql is named cql in grammar.js but we expose it as cgsql;
  • verilog from https://github.com/gmlarumbe/tree-sitter-systemverilog is renamed to systemverilog to disambiguate it from the plain verilog grammar.

Also, some languages may have names that are not very straightforward acronyms. In those cases, an altName field will be populated, e.g. the requirements language has an altName of Pip requirements, query has Tree-Sitter Query Language and so on. Search grammars.json for your grammar of interest.

Usage

Parsers

See the README in go-tree-sitter-bare, as well as the example_*.go files in this repo.

This repo only gives you the GetLanguage() function; you will still use the sibling repo for all your interactions with the tree.

You can use the parsers in this repo in several ways:

1. Standalone

You can use the parsers one (or more) at a time, as you'd use any other Go package:

package main

import (
	"context"
	"fmt"

	"github.com/alexaandru/go-sitter-forest/risor"
	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	content := []byte("print('It works!')\n")
	node, err := sitter.Parse(context.TODO(), content, sitter.NewLanguage(risor.GetLanguage()))
	if err != nil {
		panic(err)
	}

	// Do something interesting with the parsed tree...
	fmt.Println(node)
}

Please note that in standalone mode, GetLanguage() returns an unsafe.Pointer rather than a *sitter.Language. To pass it to sitter.Parse() or parser.SetLanguage() you need to wrap it in sitter.NewLanguage() as above.

The rationale is to enable end users to use the parsers with the bindings library of their choice, whether it's @smacker's library or go-tree-sitter-bare.

2. In Bulk

If (and only IF) you want to use ALL (or most of) the parsers (beware, your binary size will be huge, as in 350MB+ huge) then you can use the root (forest) package:

package main

import (
	"context"
	"fmt"

	forest "github.com/alexaandru/go-sitter-forest"
	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	content := []byte("print('It works!')\n")
	parser := sitter.NewParser()
	parser.SetLanguage(forest.GetLanguage("risor"))

	tree, err := parser.Parse(context.TODO(), nil, content)
	if err != nil {
		panic(err)
	}

	// Do something interesting with the parsed tree...
	fmt.Println(tree.RootNode())
}

This way you can fetch and use any of the parsers dynamically, without having to manually import them. You should rarely need this though, unless you're writing a text editor or something similar.

Note that unlike standalone mode, forest.GetLanguage() returns a *sitter.Language which can be passed directly to parser.SetLanguage().

3. As a Plugin

A third way is to use the included Plugins.make makefile, which allows easy creation of any and all plugins. Simply copy it to your repo, and then you can run make -f Plugins.make plugin-risor, etc., or use the plugin-all target, which creates all the plugins. It may look like the most convenient option, but it is not the most compact one: with all parsers built into the binary the size was ~300MB, whereas the same 354 parsers built as plugins took ~1400MB.

Then you can selectively use them in your app using the plugins mechanism.

IMPORTANT: You MUST use -trimpath when building your app, when using plugins (the Plugins.make file already includes it, but the app that uses them also needs it).
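
As a rough illustration of "the plugins mechanism", here is a minimal sketch that loads one parser plugin with Go's standard plugin package. It assumes (this is not confirmed anywhere above) that each plugin exports a symbol named GetLanguage with the signature func() unsafe.Pointer, mirroring the standalone packages; the ./risor.so path is also hypothetical, so use whatever file Plugins.make produces for you.

```go
package main

import (
	"fmt"
	"plugin"
	"unsafe"
)

// loadLanguage opens a parser plugin and returns the raw language pointer.
// ASSUMPTION: the plugin exports "GetLanguage" as func() unsafe.Pointer.
func loadLanguage(path string) (unsafe.Pointer, error) {
	p, err := plugin.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open %s: %w", path, err)
	}

	sym, err := p.Lookup("GetLanguage")
	if err != nil {
		return nil, fmt.Errorf("lookup GetLanguage: %w", err)
	}

	get, ok := sym.(func() unsafe.Pointer)
	if !ok {
		return nil, fmt.Errorf("unexpected symbol type %T", sym)
	}

	return get(), nil
}

func main() {
	// Hypothetical plugin path, for illustration only.
	lang, err := loadLanguage("./risor.so")
	if err != nil {
		fmt.Println("plugin not loaded:", err)
		return
	}

	// In a real app, wrap the pointer with sitter.NewLanguage(lang)
	// before handing it to a parser.
	fmt.Println("loaded:", lang != nil)
}
```

Remember that both the plugin and the app that loads it need to be built with -trimpath, as noted above.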

4. Your Own Way

You can mix and match the above, obviously.

Probably the best approach would be to build your own "mini-forest", using the forest package as a template but only including the languages you are interested in.

I'm not excluding offering "mini forests" in the future, guarded by build tags, if I ever figure out some subsets that make sense (most used/popular/known/whatever).
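
To make the "mini-forest" idea a bit more concrete, here is a small sketch of a registry with caching. It is not the forest package's actual code: the two stub functions stand in for the real per-language GetLanguage() functions (e.g. lua.GetLanguage and risor.GetLanguage), and the name getLanguage is made up for this example. In a real mini-forest you would import the parser packages you need and cache the *sitter.Language obtained from sitter.NewLanguage() instead of the raw pointer.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// Stand-ins for lua.GetLanguage and risor.GetLanguage (hypothetical stubs,
// used here only so that the sketch has no external dependencies).
var luaStub, risorStub byte

func luaGetLanguage() unsafe.Pointer   { return unsafe.Pointer(&luaStub) }
func risorGetLanguage() unsafe.Pointer { return unsafe.Pointer(&risorStub) }

var (
	mu sync.Mutex

	// registry lists only the languages this mini-forest cares about.
	registry = map[string]func() unsafe.Pointer{
		"lua":   luaGetLanguage,
		"risor": risorGetLanguage,
	}

	// cache avoids calling the constructor more than once per language.
	cache = map[string]unsafe.Pointer{}
)

// getLanguage returns the language for lang, or nil if it is not registered.
func getLanguage(lang string) unsafe.Pointer {
	mu.Lock()
	defer mu.Unlock()

	if l, ok := cache[lang]; ok {
		return l
	}

	get, ok := registry[lang]
	if !ok {
		return nil
	}

	l := get()
	cache[lang] = l

	return l
}

func main() {
	fmt.Println(getLanguage("lua") != nil)   // registered
	fmt.Println(getLanguage("cobol") != nil) // not part of this mini-forest
}
```

Only the packages referenced in the registry end up in the binary, which is the whole point of going "mini".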

Info

Each individual parser (as well as the bulk loader) offers an Info() function which can be used to retrieve information about a parser. It exposes its entry from grammars.json either raw (as a string holding the JSON encoded entry) or as an object (only available in bulk mode).

The returned Grammar type implements Stringer so it should give a nice summary when printed (to screen or logs, etc.).
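
If you only have the raw JSON string (standalone mode), decoding it yourself is straightforward. The sketch below is an assumption-heavy illustration: the entry struct is hypothetical, the "language" key name is a guess, and only altName is mentioned in this README; check grammars.json for the real field names before relying on it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry mirrors a small part of a grammars.json entry.
// ASSUMPTION: the "language" key name is a guess; altName is from the README.
type entry struct {
	Language string `json:"language"`
	AltName  string `json:"altName"`
}

// decodeInfo parses the raw JSON string; unknown fields are simply ignored.
func decodeInfo(raw string) (entry, error) {
	var e entry
	err := json.Unmarshal([]byte(raw), &e)

	return e, err
}

func main() {
	// Hypothetical raw value, shaped after the altName example above.
	raw := `{"language":"requirements","altName":"Pip requirements"}`

	e, err := decodeInfo(raw)
	if err != nil {
		panic(err)
	}

	fmt.Printf("%s (%s)\n", e.Language, e.AltName)
}
```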

Queries

The root package not only includes the "parsers forest" but also the corresponding queries. The queries are compiled from two sources:

  1. nvim_treesitter project and
  2. the individual sitter repos' own queries folders.

The queries are embedded in the packages (at the time of writing this, for 359 parsers, the queries are only 11MB) and they can be fetched exactly the same as languages, just replace GetLanguage() with GetQuery(kind) or forest.GetQuery(lang, kind). They are available for standalone packages, plugins as well as forest itself.

The kind is one of {highlights, indent, folds, etc.} (preferably without the ".scm" extension, but it will work with it included as well). For example, to get the highlights query for Go, one would call forest.GetQuery("go", "highlights").

You can optionally pass the query lookup preference, see the NvimFirst, NativeFirst, etc. in forest.go for details, as in: forest.GetQuery("go", "highlights", forest.NvimOnly).

The queries respect the "inherits:" directive (nvim_treesitter specific), recursively, returning the final query with all inherited queries included, at the forest level. The individual packages' own GetQuery() obviously cannot do that, since they do not have access to other parsers' own queries, only the forest has that. See forest.GetQuery() on how to replicate that on your end if using the individual packages.
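
For those using the individual packages, the recursive "inherits:" resolution can be sketched roughly as below. This is a simplified illustration and not the code of forest.GetQuery(): the queries map is a made-up in-memory store, the resolve helper is hypothetical, and the sketch scans every line for the directive and ignores the finer points of the nvim_treesitter modeline (such as the special meaning of parenthesized languages).

```go
package main

import (
	"fmt"
	"strings"
)

// queries is a hypothetical store: language name -> highlights query text.
var queries = map[string]string{
	"c":   "(identifier) @variable\n",
	"cpp": "; inherits: c\n(this) @variable.builtin\n",
}

const inheritsPrefix = "; inherits:"

// resolve returns the query for lang with all inherited queries prepended,
// recursively. The seen map guards against inheritance cycles.
func resolve(lang string, seen map[string]bool) string {
	if seen[lang] {
		return ""
	}
	seen[lang] = true

	var out strings.Builder

	for _, line := range strings.Split(queries[lang], "\n") {
		rest, ok := strings.CutPrefix(strings.TrimSpace(line), inheritsPrefix)
		if !ok {
			continue
		}

		for _, parent := range strings.Split(rest, ",") {
			// Simplification: parentheses are stripped, not interpreted.
			parent = strings.Trim(strings.TrimSpace(parent), "()")
			if parent != "" {
				out.WriteString(resolve(parent, seen))
			}
		}
	}

	out.WriteString(queries[lang])

	return out.String()
}

func main() {
	fmt.Print(resolve("cpp", map[string]bool{}))
}
```

With the individual packages you would fill such a store by calling each package's own GetQuery() for the languages involved.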

File Type Detection

The root package also includes a file type detector: forest.DetectLanguage(<abs path|rel path|filename>). For best results, the absolute path to the file should be provided as that enables all the available detectors, in order of priority:

  • shebang or vim modeline - whichever is available in the first 255 bytes of the file;
  • glob matching against the path tail (i.e. */*/foo.txt will match .../a/b/foo.txt regardless of the rest of the path);
  • file name;
  • file extension.

The language name is obviously the same as parser and query name.
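
The priority order above can be pictured with the following sketch. It is only an illustration under stated assumptions, not the implementation of forest.DetectLanguage(): the lookup tables are made up (the real mappings live in filetype.json), the detect and tail helpers are hypothetical, the caller passes in the first bytes of the file, and the vim modeline check is left out for brevity.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Hypothetical lookup tables, for illustration only.
var (
	byInterpreter = map[string]string{"python3": "python", "bash": "bash"}
	byGlob        = []struct{ pat, lang string }{{"*/templates/*.yaml", "helm"}}
	byName        = map[string]string{"Makefile": "make", "go.mod": "gomod"}
	byExt         = map[string]string{".go": "go", ".yaml": "yaml"}
)

// tail keeps only the last n segments of a slash-separated path.
func tail(p string, n int) string {
	parts := strings.Split(p, "/")
	if len(parts) > n {
		parts = parts[len(parts)-n:]
	}

	return strings.Join(parts, "/")
}

// detect applies the detectors in priority order. head holds the first
// bytes of the file and may be nil when the content is not available.
func detect(fpath string, head []byte) string {
	// 1. Shebang, looked up within the first 255 bytes.
	if len(head) > 255 {
		head = head[:255]
	}
	if line, _, _ := strings.Cut(string(head), "\n"); strings.HasPrefix(line, "#!") {
		if f := strings.Fields(line[2:]); len(f) > 0 {
			interp := path.Base(f[0])
			if interp == "env" && len(f) > 1 {
				interp = f[1]
			}
			if lang, ok := byInterpreter[interp]; ok {
				return lang
			}
		}
	}

	// 2. Glob against the path tail (as many segments as the pattern has).
	for _, g := range byGlob {
		n := strings.Count(g.pat, "/") + 1
		if ok, _ := path.Match(g.pat, tail(fpath, n)); ok {
			return g.lang
		}
	}

	// 3. File name, then 4. file extension.
	if lang, ok := byName[path.Base(fpath)]; ok {
		return lang
	}

	return byExt[path.Ext(fpath)]
}

func main() {
	fmt.Println(detect("/charts/app/templates/deploy.yaml", nil))
	fmt.Println(detect("/home/me/values.yaml", nil))
}
```

Note how the same .yaml extension resolves to different languages depending on whether a higher priority detector (the glob, in this case) matched first.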

You can optionally register your own "patterns" (only for languages that are part of the forest, as they are validated against it) or override existing patterns (particularly useful where there is file extension clashing, like both V and Verilog using .v file extension - you can opt for one or the other, etc.). See forest.RegisterLanguage() for details.

You can inspect the mapping in the filetype.json file.

Parser Code Changes

For transparency, any and all changes made to the parsers' files are documented below (to be clear, by "parsers' files" I mean ALL the files coming from the parser repos, not just parser.c).

ALL changes are fully automated (any exceptions are noted below); no change is ever made manually, so inspecting the automation should give you a clear picture of everything that is done to the code. The changes are:

  • the include paths are rewritten to use a flat structure (i.e. "tree_sitter/parser.h" becomes "parser.h"). This is needed so that the files are part of the same package, plus it also makes automation simpler;
  • for unison the scanner file includes maybe.c which causes cgo to include the file twice and throw duplicate symbols error. The solution chosen was to copy the content of the included file into the scanner file and set the included file to zero bytes; this way all the code is in one file and the compilation is possible;
  • similar to unison, comment and perl also use the same technique of combining C files;
  • for parsers that include a tag.h file: the TAG_TYPES_BY_TAG_NAME variable clashes between them (when those parsers are all included into one app). The solution chosen was to rename the variable by adding the _<lang> suffix, i.e., we currently have:
    • TAG_TYPES_BY_TAG_NAME_astro;
    • TAG_TYPES_BY_TAG_NAME_html;
    • TAG_TYPES_BY_TAG_NAME_svelte;
    • TAG_TYPES_BY_TAG_NAME_vue;
  • for parsers that define serialize(), deserialize(), scan() (and a few others) (e.g. org, beancount, html & a few others): the offending identifiers are renamed by appending the _<lang> suffix to them (i.e. serialize -> serialize_org, etc.). See the putFile() function in internal/automation/main.go for details;
  • some parsers' grammar.js files were not yet updated to work with latest TreeSitter, in which case we hot patch them before regenerating the parser. See the replMap in downloadGrammar() function.
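
To give a feel for two of the rewrites listed above (flattening the include path and suffixing clashing identifiers), here is a small sketch. It is not the actual putFile() code: the patch helper and the list of identifiers are made up for illustration, and a plain regular expression like this one would also touch matches inside comments or strings, which may or may not be what the real automation does.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// clashing matches a few identifiers that collide across parsers when they
// are linked into one binary (illustrative subset, matched as whole words).
var clashing = regexp.MustCompile(`\b(serialize|deserialize|scan)\b`)

// patch flattens the parser.h include path and appends _<lang> to the
// clashing identifiers.
func patch(src, lang string) string {
	src = strings.ReplaceAll(src, `"tree_sitter/parser.h"`, `"parser.h"`)

	// ${1} (with braces) is required, since $1_org would be read as a
	// reference to a group named "1_org".
	return clashing.ReplaceAllString(src, "${1}_"+lang)
}

func main() {
	src := "#include \"tree_sitter/parser.h\"\n" +
		"bool scan(void) { return serialize(); }\n"

	fmt.Print(patch(src, "org"))
}
```

Because the match is anchored on word boundaries, an identifier such as scanner is left untouched.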

Documentation

Overview

Package forest makes all the parsers, queries and filetype detection available.

This includes every single parser, as long as it is not pending (even new/experimental, etc.).

The parsers, queries and filetype detection work in tandem: the exact same language returned by filetype detection, can be used to pull in the language parser as well as the corresponding queries.

Constants

const (
	NvimFirst   byte // Fetch both, prefer nvim.
	NativeFirst      // Fetch both, prefer native.
	NvimOnly         // Fetch only from nvim.
	NativeOnly       // Fetch only from sitter repo.

)

Query fetching preference.

Variables

This section is empty.

Functions

func DetectLanguage added in v1.5.105

func DetectLanguage(fpath string) string

DetectLanguage detects the language name based on given file path. The given fpath should preferably be the absolute path as that guarantees that all the detectors can be used. It can however work with relative paths, the filename or even with just the file extension (including leading dot) alone. However the success rate will correspondingly be reduced due to the inability to use all the detectors available.

func GetLanguage added in v1.5.11

func GetLanguage(lang string) (l *sitter.Language)

GetLanguage returns the corresponding TS language function for given lang, and caches it so that language copies are not created unnecessarily.

Example
package main

import (
	"context"
	"fmt"

	"github.com/alexaandru/go-sitter-forest/lua"
	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	content := []byte("print('It works!')\n")

	node, err := sitter.Parse(context.TODO(), content, sitter.NewLanguage(lua.GetLanguage()))
	if err != nil {
		panic(err)
	}

	// Do something interesting with the parsed tree...
	fmt.Println(node)
}
Output:

(chunk (function_call name: (identifier) arguments: (arguments (string content: (string_content)))))

func GetQuery added in v1.5.89

func GetQuery(lang, kind string, opts ...byte) (out []byte)

GetQuery returns (if it exists) the `kind`.scm query for `lang` language, using the DefaultPreference, resolving "inherits" directives, recursively. You should omit the ".scm" extension.

func Info

func Info(lang string) (gr *grammar.Grammar)

Info returns the language parser (grammar) related information. TODO: Hmm, now that we also have queries, should it include info about them too? Or offer a similar function?

Example

This is still an example for GetLanguage, but I cannot have two ExampleGetLanguage in the same package.

package main

import (
	"context"
	"fmt"

	forest "github.com/alexaandru/go-sitter-forest"
	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	content := []byte("print('It works!')")
	parser := sitter.NewParser()
	parser.SetLanguage(forest.GetLanguage("lua"))

	tree, err := parser.ParseString(context.TODO(), nil, content)
	if err != nil {
		panic(err)
	}

	// Do something interesting with the parsed tree...
	fmt.Println(tree.RootNode())
}
Output:

(chunk (function_call name: (identifier) arguments: (arguments (string content: (string_content)))))

func RegisterLanguage added in v1.5.107

func RegisterLanguage(pat, lang string) error

RegisterLanguage allows end users to register their own mappings, potentially overriding existing ones. Typical use would be for languages not maintained by this library, or overriding ambiguous ones (e.g. v vs verilog for `v`, ldg or ledger for `ldg`, etc.).

The pattern pat can be a glob, a path, a filename or a file extension (including the leading dot).

func SupportedLanguage added in v1.5.119

func SupportedLanguage(lang string) bool

SupportedLanguage checks if the given language is supported.

func SupportedLanguages added in v1.5.11

func SupportedLanguages() []string

SupportedLanguages returns the list of supported languages' names.

Types

This section is empty.

Directories

Path Synopsis
abap module
abl module
ada module
agda module
aiken module
al module
alcha module
amber module
angular module
animationtxt module
ansible module
anzu module
apex module
applesoft module
arduino module
asciidoc module
asm module
astro module
august module
authzed module
awa5_rs module
awatalk module
awk module
bara module
bash module
bass module
beancount module
bend module
bibtex module
bicep module
bitbake module
blade module
blueprint module
bluespec module
bond module
bp module
bqn module
brightscript module
bruno module
c module
c3 module
c_sharp module
ca65 module
cairo module
calc module
capnp module
carbon module
cds module
cedar module
cel module
cerium module
cfengine module
cg module
cgsql module
chatito module
circom module
clarity module
cleancopy module
clingo module
clojure module
cloudflare module
cmake module
cmdl module
cobol module
cognate module
comment module
commonlisp module
context module
cooklang module
core module
corn module
cpon module
cpp module
crystal module
css module
csv module
cuda module
cue module
cylc module
d module
d2 module
dale module
dart module
dataweave module
dbml module
desktop module
devicetree module
dezyne module
dhall module
diff module
disassembly module
djot module
djot_inline module
dockerfile module
dot module
dotenv module
doxygen module
dtd module
dune module
earthfile module
ebnf module
editorconfig module
eds module
eex module
effekt module
eiffel module
elisp module
elixir module
elm module
elsa module
elvish module
epics_cmd module
epics_db module
erlang module
facility module
familymarkup module
fastbuild module
faust module
fe module
fennel module
fidl module
fin module
firrtl module
fish module
flamingo module
fluentbit module
foam module
forth module
fortran module
fsh module
fsharp module
func module
fusion module
gab module
galvan module
gap module
gaptst module
gdscript module
gdshader module
gemfilelock module
gherkin module
git_config module
git_rebase module
gitcommit module
gitignore module
gleam module
glimmer module
glint module
glsl module
gn module
gnuplot module
go module
gobra module
goctl module
gomod module
gooscript module
gosum module
gotmpl module
gowork module
gpg module
gram module
graphql module
gren module
gritql module
groovy module
gstlaunch module
hack module
haml module
hare module
haskell module
haxe module
hcl module
heex module
helm module
hjson module
hl7 module
hlsl module
hlsplaylist module
hocon module
hoon module
html module
htmlaskama module
htmldjango module
http module
http2 module
hungarian module
hurl module
hy module
hylo module
hyprlang module
i3config module
idl module
idris module
ignis module
ini module
ink module
inko module
integerbasic module
internal
automation/grammar
Package grammar holds the main structure that the automation operates with: the Grammar.
automation/util
Package util holds a few useful functions both for the forest package as well as for automation.
ispc module
jai module
janet module
janet_simple module
jasmin module
java module
javascript module
jinja module
jinja_inline module
jq module
jsdoc module
json module
json5 module
jsonc module
jsonnet module
jule module
julia module
just module
kamailio_cfg module
kanshi module
kappa module
kcl module
kconfig module
kdl module
koan module
koka module
kon module
kos module
kotlin module
koto module
kusto module
lalrpop module
lart module
lat module
latex module
latte module
ldg module
ledger module
leo module
lexc module
lexd module
lilypond module
linkerscript module
liquid module
liquidsoap module
lithia module
llvm module
lookml module
lox module
lua module
luadoc module
luap module
luau module
m68k module
magik module
make module
mandbconfig module
markdown module
marte module
matlab module
mcfuncx module
menhir module
merlin6502 module
mermaid module
meson module
mlir module
modelica module
moonbit module
moonscript module
motoko module
move module
mustache module
muttrc module
mxml module
mylang module
nasm module
nelua module
nesfab module
nftables module
nginx module
nickel module
nim module
ninja module
nix module
norg module
note module
nqc module
nu module
objc module
objdump module
ocaml module
ocamllex module
odin module
org module
ott module
pact module
pascal module
passwd module
pdxinfo module
pem module
perl module
perm module
pgn module
php module
php_only module
phpdoc module
pic module
pint module
pioasm module
pkl module
plantuml module
po module
pod module
poe_filter module
pony module
postscript module
poweron module
powershell module
printf module
prisma module
problog module
prolog module
promql module
properties module
proto module
proxima module
prql module
psv module
pug module
puppet module
purescript module
pymanifest module
pyrope module
python module
qbe module
ql module
qmldir module
qmljs module
quakec module
query module
quint module
r module
racket module
ralph module
rasi module
rbs module
rcl module
re2c module
readline module
regex module
rego module
requirements module
rescript module
risor module
rnoweb module
robot module
robots module
roc module
ron module
rstml module
rsx module
rtx module
ruby module
runescript module
rust module
sage module
scala module
scfg module
scheme module
scss module
sdml module
sflog module
simula module
sincere module
slang module
slim module
slint module
smali module
smith module
smithy module
sml module
snakemake module
snl module
sol module
solidity module
sop module
soql module
sosl module
sourcepawn module
sparql module
spicy module
sql module
sql_bigquery module
sqlite module
squirrel module
ssh_config module
starlark module
strace module
structurizr module
styled module
superhtml module
surface module
surrealql module
sus module
svelte module
sway module
swift module
sxhkdrc module
syphon module
systemtap module
t32 module
tablegen module
tact module
talon module
tcl module
teal module
templ module
tera module
terra module
test module
textproto module
thrift module
tiger module
tlaplus module
tmux module
tnsl module
todolang module
todotxt module
toml module
tort module
tsv module
tsx module
tup module
turtle module
twig module
twolc module
typescript module
typespec module
typoscript module
typst module
udev module
uiua module
ungrammar module
unison module
ursa module
usd module
uxntal module
v module
vala module
vento module
verilog module
vhdl module
vhs module
vim module
vimdoc module
virdant module
virgil module
vrl module
vue module
walnut module
wbproto module
wgsl module
wgsl_bevy module
wing module
wit module
woml module
wtf module
xcompose module
xfst module
xml module
xresources module
yadl module
yaml module
yang module
yaral module
yarnlock module
yuck module
zathurarc module
zeek module
zig module
ziggy module
ziggy_schema module
zoomba module