Parsing C declarations

The declaration syntax of the C programming language is a source of confusion and is frequently criticized for being over-complicated. There are various explanations floating on the internet that try to make it easier to understand for programmers, like this one. On this page I demonstrate how declarations are simpler than these explanations suggest.

I'm currently writing a parser for a C-like language, and the experience has helped lift the last bit of mystery around C variable and type declarations for me. C type declarations are straightforward and easy to implement. Ease of implementation is probably what drove the design. Here is an example from the inventor of the language himself: Quoting Dennis Ritchie,

        int i, *pi, **ppi;

    declare an integer, a pointer to an integer, a pointer to a pointer to an
    integer. The syntax of these declarations reflects the observation that i,
    *pi, and **ppi all yield an int type when used in an expression.

Let me demonstrate what that means for implementation.

Let's suppose we're parsing the following code

    int *a[20];

This is not a daunting task! The first token the parser sees is the identifier "int".

    int *a[20];

After considering its internal symbol table, the parser recognizes that this is an existing type name. At this point the parser knows that this is a variable declaration.

How is this implemented? Quite simple - the parser proceeds by parsing a plain old expression statement. It calls the same parsing code that can also parse expressions like foo(*a[3]+7, "bla") * baz++; [1] -- or even ones containing comma operators, like i, *pi, **ppi;.

The result from parsing the above expression statement (*a[20];) will be a parse tree like the one from my own parser:

            (VALUE a)
            (VALUE 20)))

We need to write only little extra code to convert this into a variable declaration. We drill down the expression tree and build a's type as we go. We start with int (the typename that lead us to decide that this is a variable declaration). Seeing DEREF we construct Pointer(int). Encountering SUBSCRIPT, and 20 in its index field, we make ARRAY(20, Pointer(int)). Finally there is VALUE a in the subscript expression's base pointer field. Thus we complete the declaration to a := ARRAY(20, Pointer(int)).

    (VARDECL a
        (ARRAY 20
            (TYPENAME int))))

Essentially, we're turning the tree inside out. That's possible because any valid expression has only one declared variable name, and since it's a tree there's only one path to it from the root of the tree.

Some expressions are of course invalid in variable declaration. For example, foo(*a[3]+7, "bla") * baz++; is not valid in a variable declaration context since it contains infix operators.

Let's try something more involved

    Foo (*test(void))(int,float);

The expression parser returns

        (UNOP DEREF
            (BINOP FUNCALL
                (VALUE test)
                (VALUE void)))
        (BINOP COMMA
            (VALUE int)
            (VALUE float)))

Following the above algorithm, we start with Foo. We turn this into Function(Foo, [int,float]), a function with two input values and one return value. Following the DEREF we get Pointer(Function(Foo, [int,float])). Following the inner function this turns into test := Function(Pointer(Function(Foo, [int,float]), [void]). A function that takes no arguments and returns a pointer to a function that takes an int and a float and returns an int!

    (VARDECL test
                    (RET Foo)
                        [(VALUE int)
                         (VALUE float)])))


The code that achieves the transformation from the expression tree to the variable declaration is straightforward code. Just by walking the parse tree, we can parse any complicated variable declaration! [2]

Although I still think that variable declarations can be confusing to humans in more complicated cases, I do have a soft spot for the beauty and simplicity of this approach to variable declaration syntax. It requires almost no additional code and provides a nice guarantee that every constructible type is also usable in an expression context. And, the confusing instances are more theoretical than practical. The syntax is optimized for the common cases. If my type definitions are confusing, I know that my approach is too complicated. In C, you normally don't return function pointers from functions!

By the way, the approach works for typedefs, too: Suppose we have

    typedef int *a[20];

We proceed in the same way - the only difference is that a is declared as a type, not a variable.

    (TYPEDEF a
        (ARRAY 20
                (TYPENAME int)))


[1] Programmers that do not know C might think that this expression is insanity by itself. But the expression parser already exists, and C expressions have turned out to be useful and convenient. Unless one wants to write the bulk of the plain old boring day-to-day code in a language like LISP or (god forbid!) Forth, one is bound to need a C expression parser. It's actually not that hard to implement, we just need: prefix, infix and postfix expressions, and parentheses. This can be implemented with the Shunting Yard algorithm.

[2] The cautious reader might notice a catch: While function declarations work like function calls, in their argument list they need types (or variable declarations). I do not know yet how to implement this in a simple fashion, but I hope it's possible.

Created: 2018-05-06