Veneto Devlog #1: Fun times with parsing and macros

Yeehaw! The parser is done for now! That's a pretty big milestone for the early development of this project; it unblocks all of the stuff that needs it, which is like, everything else. The other half of the v0.1 release is the Test Client, which means now I get to try out the real fun stuff - compiling this Rust code to WASM and linking that into a VSCode extension. More on that coming soon!

Let's talk about the parser, though. Parsing is Cool and Fun. It's also an extremely dense topic (take a look at any Wikipedia article on the subject if you don't believe me), so I'd like to provide an entry-level rundown of how it works and how I personally approach it.

Background

Here's a sample of the language - the entire standard library, in fact:

/**
 * A `Ref` is a reference to another existing resource.
 *
 * This is most useful in conjunction with `List`s, as a `List<Ref<T>>`.  
 * That allows you to remove an item from the list by `DELETE`ing its `Ref`, 
 * rather than deleting the entire item itself. 
 */
resource Ref<T> { 
    embed T
    
    /// GET is not mandatory for this type
    GET -> #405;

    /// Removes the referenced item
    DELETE;
}

/**
 * This is a paginated list.  
 */
resource List<T> {

    embed T[]

    links { 
        /**
         * References the next page of the list, if such a page exists.
         * 
         * If empty, this is the final page of the list. 
        */
        next? -> @self, 
    }

    /// Renders the current page of the list.
    GET;

    /// Creates a new item within the list. 
    POST T -> #201 T;
}


/**
 * This is essentially a placeholder for anything that's not data.
 *
 * For example, this is appropriate to allow users to upload and download images. 
*/
resource Media { 
    GET -> @media;
    PUT @media;
}

/// A simple POST-only link with no request body. 
resource Action { 
    POST -> #204;
}

This language describes families of HTTP endpoints, called Resource Classes and denoted here with the resource keyword. Within the Resource Classes we specify the data that they embed with the embed keyword, their HTTP methods, and how they link to other resources with the links keyword. Line comments // and block comments /* */ work like they do in most other languages.

Hopefully this makes enough sense at a glance - I still need to come up with some better documentation and examples for how this might be used in the real world. For now, don't worry too much about the specifics of the language; just take it at face value that it's a good idea to come up with our own language for this purpose, and let's walk through that process together!

So You Want To Design A Language

Languages are defined by their grammar, so let's start there. Formally, a grammar is a set of rules about replacement. Grammars define syntactic variables (or just "variables" for short) which can be replaced by a string (sequence) consisting of symbols (individual units), which themselves might also be variables.

A grammar looks something like the below, where anything on the left side of the = sign can be replaced by whatever's on the right. The unquoted words are variables. The quoted bits are literals - a literal is a symbol that represents the quoted contents. The | operator is called alternation - x | y can evaluate to x or y. In this format, rules always end in a semicolon.

verb = "throw" | "eat" ;
subject = "I" | "You" ;
object = "the apple" | "the football" ;

sentence = subject verb object ;

A language is the set of strings that can be generated by the grammar. Generally, there's one top-level variable - here it's sentence - that defines the entire language. The language generated by sentence consists of these strings:

  • I eat the apple
  • I eat the football
  • You eat the apple
  • You eat the football
  • I throw the apple
  • I throw the football
  • You throw the apple
  • You throw the football

That format above is pretty similar to something called EBNF (Extended Backus-Naur Form), which is very popular for describing computer languages. The main difference is that in standard EBNF you're supposed to use commas to concatenate symbols together, which I don't like, and also that we're making assumptions about the whitespace between the symbols in the language. For our purposes, variables are a single word like foo or bar_baz, and each symbol can be separated by any amount of whitespace - you'll see what I mean later.
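
For comparison, in standard EBNF the sentence rule above would be written with commas gluing the symbols together:

sentence = subject , verb , object ;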

It's important to note another assumption we've made, though - that each syntactic variable is defined only once, and that it's the only thing on the left side of the = in each rule. I talk more about that in the footnotes, but let's carry on for now.

Recursion and Parsing

Righto, so grammars can also be recursive - this is when it starts to get interesting. Consider this grammar:

digit = '1' | '2' | '3' ;
group = '(' sentence ')' ; 
sentence = digit digit | digit group digit ;

In plain English, this grammar says: a sentence can be two digits, or: one digit, followed by another sentence in parentheses, followed by another digit. This grammar is recursive through group and sentence - there's an infinite loop back and forth between those rules. The language described by sentence is now infinitely big! Check out some of the possibilities here:

  • 1 2
  • 1 ( 1 2 ) 3
  • 1 ( 2 ( 3 3 ) 2 ) 1

So how do we interpret this? Our previous grammar was nice and easy, but now there's an infinite set of possibilities. If you're a computer, all you know is a sequence of characters, but you need to understand the structure of the text as a whole. That's what parsing is all about - taking raw text, applying grammar rules to it, and deriving a syntax tree from it.

Take that last example, 1 ( 2 ( 3 3 ) 2 ) 1. The syntax tree for that sentence looks like this:

sentence
┣ digit: "1"
┃
┣ group
┃   ┣ "("
┃   ┣ sentence
┃   ┃   ┣ digit: "2"
┃   ┃   ┃
┃   ┃   ┣ group
┃   ┃   ┃   ┣ "("
┃   ┃   ┃   ┣ sentence
┃   ┃   ┃   ┃   ┣ digit: "3"
┃   ┃   ┃   ┃   ┗ digit: "3"
┃   ┃   ┃   ┗ ")"
┃   ┃   ┃
┃   ┃   ┗ digit: "2"
┃   ┗ ")"
┃
┗ digit: "1"

We've just done manual parsing! If you've ever had to diagram a sentence in school, you've done it before too.

If we can make a data structure that represents the above in memory, then the computer fully understands the text. Suppose this were a programming language - rather than having to try to figure out what the text means one character at a time, it can step through each node in the syntax tree and say, "ah, this is just a simple sentence, two digits" or "nope, this sentence has a group in the middle of it, I have to do something else first". Of course, real grammars are much bigger and more complicated than this, but that concept you've just grasped is the core idea underlying all computer languages.

One quick note: Above, we have a Concrete Syntax Tree (CST) - it includes things like the parentheses in group. When writing a computer language interpreter, however, we don't need that information (we already know it's a group, therefore it must have parentheses), and we have a bit of wiggle room as far as what goes where. For that reason, the representations that most parsers derive are called Abstract Syntax Trees or ASTs, because they don't strictly represent the grammar; rather, just the important parts.
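
To make that concrete, here's a rough sketch of what AST types for the digit/group/sentence grammar might look like in Rust. These names are invented purely for illustration (they're not part of the Veneto parser), and notice that the parentheses from group aren't stored anywhere - we only keep the inner sentence:

// One digit: '1', '2', or '3'
struct Digit(char);

enum Sentence { 
    // sentence = digit digit
    Pair(Digit, Digit), 
    // sentence = digit group digit - a group is just another sentence in parentheses, 
    // so we store that inner sentence directly (boxed, because the type is recursive)
    Grouped(Digit, Box<Sentence>, Digit), 
}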

Optionals and repetition

The only pieces we're missing are optional symbols (sometimes this might be here, sometimes it might not!) and repetition (sometimes there can be more than one of this symbol!). In EBNF and many pseudo-EBNF formats like the one I created, optionals are denoted with [square-brackets] and repetition is denoted with {curly-braces}. They can also be combined [{ like this }] to indicate that it's both - there can be zero, one, or multiple of that symbol.
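
For example, reusing the digit rule from earlier, here's how this notation could describe a parenthesized, comma-separated list of digits - the list can be empty, it can contain any number of digits, and a trailing comma is allowed:

digit_list = '(' [ digit { ',' digit } [','] ] ')' ;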

Congratulations, you now understand formal language theory!

This concept of using recursive grammars to describe patterns is actually thousands of years old; at least as old as the ancient Indian grammarian Pāṇini. In 1956, Noam Chomsky published some important work defining different classes of grammars and how they can be processed, which is the foundation of computer language theory today. It's an incredibly rich topic that plenty of brilliant people have devoted their entire careers to studying. It even has applications far beyond human and computer languages, including things like genome sequencing and studying geological formations. But now, if I've made enough sense up to this point, you have a strong enough grasp of it to implement these ideas on your own. Let's get to it!

How do I teach the computer to do it?

There are lots of parser generators out there that will do this work for you if you give them a well-defined grammar, but this is a blog post about writing parsers. I decided to write this one by hand as a learning experience, and also so that it can be easily customized to fit whatever needs we end up having.

Before we dive into our implementation, though, I need to describe the techniques we will use.

Lexing

Again, all the computer knows is a sequence of individual characters, one at a time. For us as language implementers, though, it's more useful to reason about entire words, rather than also having to piece them together from characters at every step of the way. In theory, you could define what words look like in your grammar itself, but in practice most parser implementations like to "chunk" characters together to form tokens.

This process is called lexing, and different languages go about it in different ways. I'll gloss over most of the details here, but in Veneto, there are a few important properties we want our lexer to understand:

  • Each token is the longest possible sequence of characters of the same "type" - these types are:
    • Number, which is a sequence of digits
    • Punctuation, which are non-alphanumeric characters - these must match a predefined list of operators or an error is returned
    • Words, which start with a letter and then consist of alphanumeric characters or an underscore - these can either match a predefined set of Keywords or they can be used as an Identifier instead
  • At the end of the file, I chose to emit a special EOF token type. Alternatively, every token could be wrapped in an Option (or be null in languages that have it), but I found a dedicated EOF token more convenient and readable.
  • Whitespace can be completely ignored wherever it doesn't separate a token (I don't like whitespace-sensitive languages (looking at you, Python))
  • Comments are also (mostly) ignored, and can basically be treated as whitespace in that regard
  • String literals are a thing, and in general it's easier to treat them as a single token, especially if you need to do things like escape quotation marks
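
If it helps to picture it, here's a rough sketch of what those token categories could look like as a Rust enum. This is just an illustration, not the exact definition from Veneto's lexer:

enum TokenKind { 
    Number(u64), 
    // `Punctuation` and `Keyword` are enums of the predefined operators and keywords
    Punctuation(Punctuation), 
    Keyword(Keyword), 
    // Any other word becomes an identifier
    Identifier(String), 
    StringLiteral(String), 
    EOF, 
}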

With that out of the way, our parser also doesn't have to worry about whitespace or comments, which cleans everything else up a lot. If you're curious, the full description of how Veneto's lexer works can be found here.

Token Streams & Recursive Descent

The most popular way to interface with a lexer is as a token stream, an object which emits tokens. Ours exposes a simple interface with these two methods:

  • peek, which returns the current token without moving on to the next one
  • next, which returns the current token and moves on to the next one
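
In Rust terms, the shape of that interface looks roughly like this - a hedged sketch, since the real methods also have to surface errors from the lexer, which is why everything returns a ParseResult:

impl TokenStream { 
    /// Returns the current token without advancing the stream. 
    pub fn peek(&mut self) -> ParseResult<Token> { 
        todo!() // sketch only 
    }

    /// Returns the current token and advances the stream to the next one. 
    pub fn next(&mut self) -> ParseResult<Token> { 
        todo!() // sketch only 
    }
}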

The basic idea of our parser is that we have parsing functions that correspond to rules in the grammar, and these functions accept the token stream as an argument. When invoked, these functions start parsing a clause at the stream's current position, and return an AST node of the appropriate type. Since we're just passing around the same token stream, each part of the parser can just grab tokens however it needs to, without having to care about what's happening anywhere else.

This concept is called recursive descent, and it's my favorite. There's a huge variety of parsing techniques out there, but in my opinion, recursive descent with a token stream is the easiest to read and write. I go into more detail in the footnotes about why this works, but if you're with me so far, let's keep going.

How we use this technique

With these tools in our belt, we have everything we need to parse a clause in our grammar. We start right after the = sign of our rule, and handle each symbol we run into, from left to right. If the symbol is a literal, that's easy - we just check for a token that matches it. If it's a variable, on the other hand, we can call another function to parse the corresponding clause.

There's a few grammatical circumstances that we have to handle:

  • If the symbol is not optional, we can just expect it to be there, and raise an error if we don't find it.
  • If it is optional, then we have to use the peek operation to see if it's there, and then handle it accordingly
  • To handle repetition, we start a loop, handling all of the optional and non-optional symbols like we would otherwise, and then peek for the next non-repeated symbol so that we know when to terminate the loop.
  • To handle alternation, we can just peek for each possibility accordingly. Generally, we try the possibilities in order, from left to right - and since we always work through each rule from the left, the parse we end up building is what's formally called a leftmost derivation.

So, with that in mind, there's two sets of tools I'll introduce to help us out.

Peeking for tokens

With this method, a common pattern arises very quickly: sometimes you need to peek a token, and advance the stream only if it matches a certain variant you're expecting. To help with this, I introduced a set of helper operations that I call "peeking for" a token:

  • peek_for_identifier peeks at the current token in the stream; if it's an Identifier, then it returns the string containing its contents and advances the stream - otherwise, it doesn't advance the stream and returns None.
  • peek_for_punctuation peeks for a specific kind of Punctuation; if the current token matches it, it advances the stream and returns true - otherwise, it doesn't advance the stream and returns false. We return a boolean here because, unlike Identifier which can have any value the user wants, here we are looking for a specific predefined type of punctuation.
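
Neither of those is complicated. As a rough sketch - assuming the Token type has a kind field like the one we compare against EOF later on, with a variant for punctuation - peek_for_punctuation boils down to this:

impl TokenStream { 
    pub fn peek_for_punctuation(&mut self, expected: Punctuation) -> ParseResult<bool> { 
        if self.peek()?.kind == TokenKind::Punctuation(expected) { 
            self.next()?;  // it matched, so consume it 
            Ok(true)
        } else { 
            Ok(false)      // no match - leave the stream exactly where it was 
        }
    }
}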

Traits: Expecting and Peeking clauses

Here's where we properly dive into the specifics of our parser. The parser is written in Rust, but hopefully it's understandable even if you're not familiar.

Our parsing functions are defined as associated functions on our AST nodes, which are just data structures that we create. There's two flavors of those functions, each with their own trait:

pub trait Expectable where Self: std::marker::Sized { 
    fn parse_expect(stream: &mut TokenStream) -> ParseResult<Self>;
}

pub trait Peekable where Self: std::marker::Sized { 
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>>;
}
  • Expectable provides the parse_expect function, which expects there to be a valid representation of the AST node at the current position in the stream, and returns an error if it finds anything it doesn't expect.
  • Peekable provides the parse_peek function, which peeks the next token to see if this type of AST node could be present. If it finds the token it's looking for, it carries on parsing; if it doesn't, it returns None without advancing the token stream. Not every AST node is peekable, but this is useful in certain situations - we'll get more into that later.

Get on with it, already!

Yes, yes, right! Let's finally write some actual parsing code. We'll focus on one particular aspect of the language as an example.

The language

Let's consider the links statement. It's in the sample at the very top of this article, but here's a better example: Imagine you're building the next Amazon - each item available for sale has an API resource, and that API output contains links to other resources and endpoints. In Veneto, you might represent that like this:

links { 
    seller -> Merchant, 
    addToCart? -> Action, 
    reviews -> List<Review>, 
}
  • The word on the left of the -> is the relation ("rel") of the link - in essence, its name.
  • The stuff to the right of the -> is the type of resource that the link "points to".
  • The ? means it's optional - in this case, maybe addToCart isn't always available because the item could be out of stock.
  • The Veneto standard library contains definitions for Action and List, while Merchant and Review are things that you would create, but that's not super important right now - we just know that it's a reference to another resource class.

The grammar

The Veneto grammar for this statement looks like this:

rc_links    = 'links' links_block  ;
links_block = '{' links_list '}' ; 
links_list  = [  link [',' [links_list]]  ] ;
link        = Identifier ['?'] '->' rc_reference ;

There's a lot going on there! Let's discuss these rules in plain English:

  • A links_block is a links_list enclosed in braces.
  • A links_list can be:
    • Nothing, or:
    • There might be a first link;
    • if there is a first link, it might be followed by a comma;
    • if there is a comma, it might be followed by another links_list!
    In summary, this describes a comma-separated list, with an optional trailing comma at the end.
  • A link has an Identifier (which is a type of token), then maybe a ?, then a ->, then an rc_reference which is defined elsewhere.

Let's start with the link rule. Everything in the links_list rule is optional, which means that we need to be able to see if the link inside of it is present or not, so it's a good idea to go ahead and make it Peekable.

Peeking a link

Here's our AST node:

pub struct Link { 
    pub rel: String, 
    pub optional: bool, 
    pub typ: RCReference,
    // 👆 unfortunately, `type` is a reserved keyword in Rust, so we have to call it something else
} 

And the grammar for this clause, as we mentioned above:

link = Identifier ['?'] '->' rc_reference ;

We use the grammar to parse the input into a Link like this:

  • The Identifier in the grammar becomes the Link's rel
  • optional becomes true if the ? is present
  • We parse the rc_reference clause separately, which becomes the typ

And here's the implementation! We just follow the grammar, from left to right:

impl Peekable for Link { 
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>> {

        let Some(rel) = stream.peek_for_identifier()? else { 
            return Ok(None)
        };

        let optional = stream.peek_for_punctuation(Punctuation::Optional)?; 

        stream.next()?.expect_punctuation(Punctuation::Arrow)?; 

        Ok(Some(Link { 
            rel, 
            optional, 
            typ: RCReference::parse_expect(stream)?,
        }))
    }
}
  1. Have a peek at the current token - if it's an identifier, we capture its contents as rel and continue. Otherwise, there's no Link here - return None.
  2. Check if the Optional operator (?) is present - if it is, we mark it as true in the AST node. Since peek_for_punctuation returns a boolean for us, all we need to do is call it.
  3. Regardless, we expect there to be an Arrow (->) next. expect_punctuation is a helper function that converts the token to the appropriate error if it's not there - this might be rendered to the user like "expected '->' at line 12, column 5".
  4. Finally, we use parse_expect to obtain an RCReference. This happens inside of the implicit return expression that creates the Link AST node.

Expecting a LinksBlock

Time to move on to the LinksBlock! First, though, a quick detour:

Comma-separated lists

Let's talk about this comma-separated list idea. This is actually a super common pattern, used all over the place in our grammar - it doesn't make very much sense to implement it every time we need it. Instead, we can make one function that does this for us everywhere. This is a recursive grammar, and we're writing a recursive descent parser for it, so let's whip up a recursive function to do the job!

The function I made here is generic over any Peekable type, so it will work for our Links. It takes a reference to a Vec (Rust's growable list type) where we're putting the things we find, and a reference to the stream.

To recap, here's the original grammar for our links_list:

links_list  = [  link [',' [links_list]]  ] ;

Our function is generic, though, so here's the same rule with link replaced by a generic T, to make it a bit easier to follow:

T_list  = [  T [',' [T_list]]  ] ;

And here's the function!

fn parse_list_into<T: Peekable>(vec: &mut Vec<T>, stream: &mut TokenStream) -> ParseResult<()> { 

    if let Some(node) = T::parse_peek(stream)? { 
        vec.push(node); 

        if stream.peek_for_punctuation(Punctuation::Comma)? { 
            parse_list_into(vec, stream)?; 
        }
    } 

    Ok(())
}
  1. The whole thing is optional, so we peek for a T first. If we find it, then we add it to the vec, and we continue.
  2. Next, there might be a comma, so we peek for it.
  3. If there's a comma, there might be another list, so we recurse!

Actually implementing LinksBlock

Here's the AST node; this one is super simple, it's just a Vec of Links:

pub type LinksBlock = Vec<Link>;

And the grammar again:

links_block = '{' links_list '}' ; 

And here's the implementation! Factoring out the list part means this one is a piece of cake.

impl Expectable for LinksBlock { 
    fn parse_expect(stream: &mut TokenStream) -> ParseResult<Self> {
        stream.next()?.expect_punctuation(Punctuation::BraceOpen)?; 

        let mut links = LinksBlock::new(); 
        parse_list_into(&mut links, stream)?; 

        stream.next()?.expect_punctuation(Punctuation::BraceClose)?; 
        Ok(links) 
    }
}

Tying it all together

Those two parse nodes are the easiest to understand at a granular level, but if you understood them, you can understand the entire parser! Now, let's demonstrate how they fit into the larger picture.

This LinksBlock node that we've implemented belongs to a Resource Class - a complete example in Veneto could look like this:

resource Item { 

    data { 
        name: string, 
        price: decimal, 
    }

    links { 
        seller -> Merchant, 
        addToCart? -> Action, 
        reviews -> List<Review>, 
    }

}

The Peekable implementation for resource classes is a lot to get into, since there are many other components in play, but essentially it peeks for the resource keyword and then carries on parsing the rest of the clause, all the way until the closing brace.
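
As a heavily simplified sketch (the real implementation also handles the data block, methods, links, and more - and Keyword::Resource is my guess at the variant name), it follows the same shape as everything else we've seen:

impl Peekable for ResourceClass { 
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>> {
        // No `resource` keyword here?  Then there's no resource class - leave the stream alone. 
        if !stream.peek_for_keyword(Keyword::Resource)? { 
            return Ok(None) 
        }

        // Otherwise, parse the name, any generic parameters, and the braced body... 
        todo!()
    }
}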

Here is where we plug in all of the pieces: resource classes are one potential part of a Document, which is the top-level rule of the Veneto grammar. Its definition is pretty simple - just a repeated alternation of all of its parts:

document = [{ 
    use | 
    type_alias | 
    interface_decl | 
    resource_class | 
    rc_entry_directive 
}];

So here's the AST node, which is the root of the tree:

pub struct Document { 
    pub imports : Vec<UseTree>, 
    pub types: HashMap<String, Type>,
    pub interfaces : HashMap<String, InterfaceExpression>, 
    pub resource_classes : Vec<ResourceClass>, // 👈 the thing we just talked about!
    pub entry_points : Vec<EntryPoint>, 
}

Finally, to wrap it all up, there's a special parse method just for the top level. If you speak Rust, notice that the stream parameter is moved and not borrowed - we consume the whole thing for this operation:

impl Document { 
    pub fn parse(mut stream: TokenStream) -> ParseResult<Self> { 

        let mut doc = Self { 
            imports: Vec::new(), 
            types: HashMap::new(), 
            interfaces: HashMap::new(), 
            resource_classes: Vec::new(), 
            entry_points: Vec::new(), 
        };

        loop { 

            if stream.peek_for_keyword(Keyword::Use)? { 
                // ... (not relevant)
            }
            else if stream.peek_for_keyword(Keyword::Type)? { 
                // ... (not relevant)
            }
            else if stream.peek_for_keyword(Keyword::Interface)? { 
                // ... (not relevant)
            }
            else if let Some(rc) = ResourceClass::parse_peek(&mut stream)? { 
                // 👋 Here!
                doc.resource_classes.push(rc); 
            }
            else if stream.peek_for_keyword(Keyword::Entry)? { 
                // ... (not relevant)
            }
            else { 
                let next = stream.next()?; 
                if next.kind == TokenKind::EOF { 
                    return Ok(doc)
                } else { 
                    return Err(next.as_err_unexpected())
                }
            }

        }

    }
}

We simply loop, peeking for each possible component of the Document. If it's a ResourceClass, we add that to our list! If we fall through to the final else, we haven't matched any of those components, so we have a simple choice: if it's the end of the file, we successfully return the parsed Document; otherwise, we return an "unexpected" error. And that's it! That's the whole parser!

To use this, we just create a TokenStream and pass it along to that parse method. Below, you can see how the standard library is loaded. Since it's pretty small, we use the handy include_str! macro to bundle the file into the compiled library as a string, and create a Chars iterator from the string since that's what we need to create a TokenStream.

pub fn parse_std() -> ParseResult<Document> { 
    let stream = TokenStream::new( include_str!("std.veneto").chars() ); 
    Document::parse(stream)
}

For userland files, rather than the standard library, we will have to create a Chars iterator from an opened file instead. This isn't quite as straightforward as it sounds, but there are some crates that may help us; we'll burn that bridge when we come to it.
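
As a naive sketch - assuming we're happy to read the whole file into a String up front rather than streaming it, and that TokenStream::new accepts any Chars iterator; parse_file is a made-up name - it could look something like this:

use std::fs; 
use std::path::Path; 

pub fn parse_file(path: &Path) -> ParseResult<Document> { 
    // Read the entire file into memory first - simple, but not streaming. 
    // (A real version would want proper error handling instead of `expect`.)
    let source = fs::read_to_string(path).expect("couldn't read the source file"); 
    let stream = TokenStream::new(source.chars()); 
    Document::parse(stream)
}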

Two more fun things

While developing this parser, I made use of two Rust features that made the codebase much nicer, and I couldn't resist sharing them here.

Combining Peekable and Expectable

You may have noticed that Peekable and Expectable have a lot in common: in general, if you can peek for something, you should be able to expect it too. That would just mean peeking for it and then throwing the appropriate error if you don't find it.

The most obvious idea is to just throw an error if parse_peek returns None, but there's a problem: As we discussed earlier, errors in our parser have to be constructed from the token they're thrown at, so that the error can have useful information like the location (line & column number) describing where it occurs. What we want is a solution that does that for us while keeping good encapsulation and reusing as much code as possible for the sake of easy maintenance.

Not every clause is that well-behaved, but lots of clauses in Veneto follow a particular pattern: they start with a specific kind of punctuation, such as an opening brace - just like LinksBlock above. To make one of these Peekable, you can peek for that initial token, return None if it's not there, and otherwise carry on with the parsing. Expectable is the same deal, except that you expect that token to be there.

The solution I found comes courtesy of one of my favorite features in Rust: blanket implementations. I think this is best explained by example. I've created a new trait called Finishable, which provides an initial token type (a constant), and a function called parse_finish which finishes parsing the clause after that initial token is consumed. It looks like this:

pub trait Finishable where Self: std::marker::Sized { 
    const INITIAL_TOKEN: Terminal; 

    fn parse_finish(stream: &mut TokenStream) -> ParseResult<Self>; 
}

Terminal basically just means "Punctuation or Keyword", and it has its own peek_for_terminal and expect_terminal methods.
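
In other words, something along these lines (the exact definition might differ):

pub enum Terminal { 
    Punctuation(Punctuation), 
    Keyword(Keyword), 
}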

Here's where the magic happens. In Rust, trait implementations can themselves be generic, so we can create a blanket implementation for all types that satisfy criteria we provide. Here, we automatically implement Expectable for everything that has a Finishable implementation:

impl<T> Expectable for T where T: Finishable { 
    fn parse_expect(stream: &mut TokenStream) -> ParseResult<Self> {
        stream.next()?.expect_terminal(Self::INITIAL_TOKEN)?; 
        T::parse_finish(stream)
    }
}

It's super simple! We just expect the INITIAL_TOKEN, then finish the parsing with parse_finish and return the result. Peekable follows a similar pattern:

impl<T> Peekable for T where T: Finishable { 
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>> {
        if stream.peek_for_terminal(Self::INITIAL_TOKEN)? { 
            Ok(Some(T::parse_finish(stream)?))
        } else { 
            Ok(None) 
        }
    }
}

Now we have two trait implementations for the price of one! It's super useful - some clauses in Veneto need to be peeked for or expected in different places depending on context, and now there are two different functions to do that for us, each with a name that clearly conveys the intent behind it. This kinda stuff goes a long way towards a readable and maintainable code base in the long term.
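
For example, the LinksBlock implementation from earlier could be rewritten as a Finishable instead, and it would then pick up both parse_expect and parse_peek for free. Here's a hedged sketch of what that might look like:

impl Finishable for LinksBlock { 
    const INITIAL_TOKEN: Terminal = Terminal::Punctuation(Punctuation::BraceOpen); 

    fn parse_finish(stream: &mut TokenStream) -> ParseResult<Self> {
        // By the time we get here, the opening brace has already been consumed 
        // by the blanket parse_expect or parse_peek implementation. 
        let mut links = LinksBlock::new(); 
        parse_list_into(&mut links, stream)?; 

        stream.next()?.expect_punctuation(Punctuation::BraceClose)?; 
        Ok(links) 
    }
}

(We'd drop the direct Expectable impl from before, of course, since the blanket implementation now provides it.)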

Blanket implementations were a true stroke of brilliance on the part of the language designers who created them, and it's one of the many reasons I prefer the trait system to traditional object-oriented programming.

I was promised macros!

Right! So, earlier we talked about peek_for_punctuation - the method that peeks a token, checks whether it's the type of punctuation you asked for, and advances the token stream only if that's the case. It's a really common pattern in the code. Very often, you need to branch off into one of several different clauses depending on whether a certain punctuation token is present, but continue as normal if none of them are there. One option is to just use a massive if/else block for this - here's a simple example, similar to the braced-list parsing we did above for LinksBlock:

if stream.peek_for_punctuation(Punctuation::BraceClose)? {
    // Advance to the next token, and return the thing we've parsed
    return Ok(Some(parsedthing))
} else if stream.peek_for_punctuation(Punctuation::Comma)? {
    // Advance to the next token, and continue the loop
    continue;
} else {
    // DO NOT advance to the next token; instead, expect a "Foo" clause to start right here
    foos.push(Foo::parse_expect(stream)?)
}

Eughh! That gets messy. It's really not a big deal, but it inspired me to finally get over my fear of Rust macros. I haven't found a super nice beginner-friendly macros tutorial yet, and I plan to contribute one to the official docs, but I was able to fight my way through it and come up with this. It's called peek_match!, and it emulates the syntax of normal match expressions, basically expanding them into if/else blocks like the one you saw above:

macro_rules! peek_match {
    ($stream:ident.$fn:ident { $first:expr => $first_then:expr, $($other:expr => $other_then:expr),+ , _ => $else_then:expr }) => { 
        if $stream.$fn($first)? { $first_then }
        $(
            else if $stream.$fn($other)? { $other_then }
        )+
        else { $else_then }
    }
}

So instead, we can write stuff like this:

peek_match!(stream.peek_for_punctuation { 
    Punctuation::BraceClose => return Ok(Some(parsedthing)),
    Punctuation::Comma => continue, 
    _ => foos.push( Foo::parse_expect(stream)? )
})

I hope that makes sense! I'm proud of it anyway - I find it a lot nicer to look at. I wish I could go into more detail, but this post is way too long already. If you want another post explaining it, give me a shout at jake@jakeconley.com and I'd be happy to write one!

Thanks for reading!

This was a lot to get into, and there's so much detail I wanted to include in this post that I couldn't quite fit. I hope you learned something or at least enjoyed reading this. As always, questions and comments are welcome at jake@jakeconley.com.

For more updates on the Veneto project, keep checking back here or give the GitHub organization a follow, and of course reach out if you're interested in contributing.

I plan to put out another devlog soon talking about the lessons I've learned so far in this project. I also might set up a mailing list or something for this blog if anyone expresses interest. Cheers!


Footnotes

Here's a few miscellaneous technical observations that I felt the article would be incomplete without.

Symbols and Context

You may have noticed that there are two kinds of symbols - those that can be replaced by something else, like variables, and those that cannot, like literals. Symbols that can be replaced are called nonterminal symbols, which is pretty much a synonym for syntactic variables; the other ones are called terminal symbols, the idea being that you can stop looking for other replacements once you've found them.

In theory, you could define a grammar like this:


subject = "I" | "You" ;
subject verb = "We eat" | "She throws" ;
verb = "eat" | "throw" ;
object = "the apple" | "the football" ;

sentence = subject verb object ;

See how that second line defines two nonterminal symbols, and that subject is defined in two places? This is called context sensitivity, and it's a pain to deal with - now when you're looking at the rule sentence you have to look in multiple places and think about what that rule means in the context of the other stuff you're analyzing.

Instead, for computer languages, we prefer to design a grammar where each nonterminal symbol is defined exactly once, and it's the only thing on the left side of that = sign. These are called context-free grammars, and they're all we want to work with as language designers.

How do we know this will work?

Great question! It actually depends on the grammar you're parsing. There's a name for this: LL(k) grammars can be parsed top-down while looking ahead at only k tokens at a time; our parser only ever peeks one token ahead, so we need the grammar to be LL(1). We like that because it's easy.

In theory, you can prove that a grammar is LL(1) by demonstrating certain properties about its rules (the usual route involves computing what are called FIRST and FOLLOW sets). The way I see it, as far as actually knowing that your grammar is LL(1), you have three options:

  1. Mathematically prove it yourself
  2. Rely on something else to mathematically prove it (there are lots of tools out there that will do this for you if you feed them a standard form like EBNF)
  3. Empirically demonstrate it with rigorous and thorough testing

For now, I'm going with option #3. I've never proven that Veneto's grammar is LL(1) - the draft isn't written in standard EBNF, and it's constantly changing at this time. I've written a solid bunch of tests so far, and I'm always adding more. Furthermore, I've written the entire parser this way and I have yet to run into any dealbreaking issues, so I'm fairly confident that the grammar is, for all practical purposes, LL(1).

If you're drafting your own grammar but have the luxury of making changes as you go, the general idea is that you need to know where your clauses start and end. There's a bit of wiggle room depending on context, but if you know what tokens start or end a certain clause, then you can always just look for that token and you'll know what to do. I had to go back and revise the grammar a few times for this reason, but it was always easy after that.
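
The classic trap to watch out for is left recursion - a rule whose first symbol is the rule itself:

expr = expr '+' digit | digit ;

A recursive descent parser for that rule would call parse_expr, which immediately calls parse_expr again without consuming a single token, forever. The standard fix is to rewrite the rule using repetition - it describes the same language, but now the parser always knows to start by parsing a digit:

expr = digit { '+' digit } ;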

If you're drafting your own grammar but you need to get it right the first time, make sure you can prove that it's LL(1) 😉