
I wrote a lexer generator
At some point early this year, I decided that all code editors were terrible. MSVC is slow, XCode is really weird and a little buggy, Notepad++, Pe, and Programmer's Notepad are not cross-platform, and Code::Blocks and CodeLite are super buggy.

So I decided to write my own. Which has turned out to also have a few tough bugs in it. But in the process, I wrote a lexer generator. I started using Flex, but holy cow it is ancient. It's just so...dusty. It works in such a strange way, and generates such huge files. Plus, it's really annoying to use, since it's so flexible that doing anything at all is really difficult.

I looked at Scintilla's SciLexer, but then found it to be similarly weird.

Probably Sci and Flex are really normal and I just have a weird idea of what a lexer should look like. But that's probably a good reason to write my own--if I am wrong, I will see concretely why I am wrong and why the right way is right.

But I think I may be right...Lunar generates small and fast lexers that cover what I need for syntax highlighting. I wrote a few other tools with it, such as a tool to check for all the functions that a certain JavaScript file will call, and Lunar's own file parser (not really done yet). It seems quite capable to me.

I based the API on the styling interface that NEdit-style editors use (I think). That is, you call:

Code: (c)

int ParseString(const char * string, unsigned long len, void (*parse_callback)(const char *, unsigned long, int, void *), void * arg);

And when it hits something to lex, it calls:
Code: (c)

void ParseCallback(const char *str, unsigned long len, int type, void *arg)

Telling you the position, the length, and the type of the token it found.
I really need to add a couple more features before it becomes really useful, mostly allowing spans to be optionally inclusive or exclusive, and allowing a callback for parts that didn't fit any of its rules.
But it's working so far, and it was fun to write!
  • Last Edit: May 13, 2015, 08:28:12 pm by Flying Jester

  • Radnen
  • Senior Staff
  • Wise Warrior
Re: I wrote a lexer generator
Reply #1
Check out:

You can describe a language as a set of regular expressions. When you lex it, you are doing lexical analysis, which is why the classic tool is called "Lex". I used this technique in my JS-based lexer: an 80-line file that can tear apart any language described by a set of regexes.

The next step is to create the AST. Once you have the tokens, you again apply rules, this time at the token level, to build the AST. This includes, among other things, handling operator precedence. I haven't implemented this yet, instead going for a hand-rolled recursive descent parser. But an LALR parser generator such as Yacc is also sufficient. My compilers professor from Portland State created a Java-based version called Jacc. These parser generators take an input describing the tokens, the grammar classes they belong to, and the operator precedences they need in order to create a well-formed AST. I have half a mind to try and create something like this soon.

But anyway, after you have the AST, the last step you kind of do on your own. The AST is like OOP: you'll have a tree of classes (this is a BinExpr, that is an IntExpr, etc.), and you create methods that either output assembly code for the architecture you support (AKA a compiler) or perform the code right then and there (AKA an interpreter).
  • Last Edit: May 13, 2015, 08:29:51 pm by Radnen
If you use code to help you code you can use less code to code. Also, I have approximate knowledge of many things.

Sphere-sfml here
Sphere Studio editor here

Re: I wrote a lexer generator
Reply #2
I've worked with ASTs before. I've even written a couple of (terrible, but somewhat functional things that could be described as) compilers, and a simple bytecode VM as well. That's how I know that I'm trying to do something somewhat different here.

Quote from: Radnen
You can describe a language as a set of regular expressions.

I know I'm in the minority, but I'm just not a fan of regular expressions. I find it's almost always more expressive and easier to read to write a little more code that does a more specific job. That's a big part of why I don't like Lex/Flex and Yacc/Bison.

I also want to do a simpler operation here. I'm not really concerned with whether it can parse literally any regular language; I'm more concerned that I can write description files to generate syntax parsers or tokenizers for most regular languages. It's enough to check for matches against a certain set of words and spans.

  • FBnil
Re: I wrote a lexer generator
Reply #3
@FJ: Being a Perl fanatic, regexes are my hammer for every nail I see. However, the maxim of Perl is "there's more than one way to do it", and there is nothing wrong with that. In fact, remember the OOP fad: it solved some problems, but that was then; now, with ARM and less CPU power, OOP is slow(er), and the decisions are not so clear-cut. Take the latest accelerated JIT JS engines versus Duktape, which uses a minimal amount of RAM: very good for embedded and portable devices that don't want to burn extra memory and empty your battery within the hour.
(Which, as soon as I learn to add libraries to ARM/Debian and cross-compilation, I'll try to run on OpenPandora.)

Re: I wrote a lexer generator
Reply #4
After I wrote this, I learned to write much better recursive descent parsers, rather than this kind of pseudo-LALR parser. But I'm glad I tried this out, mostly because it made me less afraid to write code that generates more code.

Quote from: FBnil
(which, as soon as I learn to add libraries to ARM/Debian, and crosscompilation, I'll try to run on OpenPandora)

Have you tried NetBSD? :)

  • Rahkiin
Re: I wrote a lexer generator
Reply #5
Tried Sublime? It works on Windows, Linux, and OS X, and it's awesome.