When it comes to use cases like quick code formatting and syntax highlighting across many languages, tree-sitter is an excellent tool. But it does so much more than that. At Bearer, we use it as the base for our static code analysis feature.
In this article we’ll look at tree sitter, how to use it, and how to avoid some of mistakes we made when implementing it. This should help you in making the decision if tree sitter is a good choice for your use case.
What is tree sitter?
Tree sitter is a parser generator tool and an incremental parsing library. It enables you to generate concrete syntax tree from any language it supports. The official site has a full list of the supported languages and parsers.
And output a concrete tree that looks like this:
Looking at the above results, you can see that tree-sitter generated a bunch of tokens, and that each token has 2 points.
Those points contain a column and line number for the token’s start and end. As each language has their own set of rules which differ slightly from one to another, tokens will be different depending on the language you are currently parsing.
The reason why this tree is valuable is because it provides a level of abstraction, on top of which we can build our data extraction and detection engine, all while keeping the context—like line numbers.
Simple tree sitter queries
A tree on its own can be useful, but where it really shines is the ability to query nodes within the tree. Tree sitter has its own limited query language. It is somewhat of a cross between SQL and JSON querying. Tree sitter is optimized to query the tree on each file save, so you can expect blazing fast performance.
You can see the full spec for tree sitter’s query on their official page, but if like me you are looking to jump right into it we can walk trough some examples.
A basic example would be to look for each time the object property is accessed. You can do this by querying member_expression:
You assign a node name to the @param_expression variable, then from the consuming program you can access the @param_expression value. As a result of that query you get the highlighted nodes:
Those simple queries are quite easy to construct. Things get a bit more complicated as you try to build more complex queries.If you try to get the parent and child of each object, the naive way of building the query might be:
But you will soon find out that a query like that doesn’t work for nested objects. For example, if you change the earlier code to find the address instead of the name, you get a different structure.
And here’s the new result:
As you can see, the query no longer matches this use case. Instead of the object property being identifier, you have nested member_expressions. With one object resolving to a property of child member_expression.
There is no easy way of solving this problem using a tree sitter query without those queries getting huge. As part of our ongoing data type detection at Bearer, we needed those relationships in order to run our classification algorithms and detect personal data usage. We had to build something on top of tree sitter.
We solve this concrete example by programmatically walking through the tree from the bottom nodes upwards and assigning child properties to the parent objects.
Query performance tuning
Tree sitter’s query language is simple by design, and you should treat it as such. While it is optimized for performance, if you try to do too much, you could set yourself up for some problems.
Let's take a look at an example of an html structure:
If you want to query adjacent elements, you might end up with this query:
You might expect to only match valid pairs like "parent, parent", "child1, child2", "child2, child3", "child3, child4", and the second pair of "child1, child2".
If you execute that query, you’ll quickly realize that instead you get every possible combination. This means you would get "child1, child2", "child1, child3", "child1, child4" and so on.
This might not be an issue when running simple queries against small datasets, but since the number of returned results is exponential, your runtime will also be exponential. For 100kb of HTML you might find yourself at 10-100s runtime. That can quickly become a problem.
The only efficient way to get around this is to write a custom program (both NodeJS and Go have great tree-sitter binding libraries) which navigate the syntax tree generated by tree sitter and pulls out the element siblings you need.
Compare to other tool
If your only need is markup languages such as XML, JSON, YAML you might get away without using tree sitter at all. In that case, the JSON parser of your programming language may be enough.
If you do need more control and support, using tree sitter for static code analysis is magnitudes better than using grep and regular expressions. You can learn more on how our tool differs from that approach in our CTO’s article on how we perform static code analysis.
The market for AST tooling isn’t exactly crowded. There are multiple language server protocols (LSP) available for each language, and some are even based on tree sitter. The main difference between tree sitter and LSP is that while tree-sitter is more about understanding a file, LSP is more about understanding the project. So while it isn’t uncommon for LSP to take a couple of seconds to perform its duties, tree-sitter responds in milliseconds.
Hopefully, you now have a better understanding of tree sitter and how to approach using it. It can seem like overkill for some projects, and not quite powerful enough for more advanced use cases, but it succeeds as a foundation for which to build multi-language tooling.
Are you using tree-sitter or a similar tool to parse languages for an interesting use case? We’d love to hear about it! Let us know on @trybearer.