I was recently thinking about how hard it would be to make a JIT compiler. The first question is, how would I actually generate code? As in, actually get arbitrary machine code put into memory at run time to execute?
Turns out it's not that hard.
https://gist.github.com/FlyingJester/0e6549a20a141900915b
Note that all these snippets assume you have a Unix-like environment, an amd64 CPU, and are compiling for 64 bits.
Here, I'm making up an array of raw bytes. They are all NOPs (no-operations: the CPU sees one and does nothing about it), followed by a final 'ret' instruction. As long as we are in 64-bit mode or otherwise have a normal calling convention, 'ret' is just like the proper keyword 'return'.
The kind of funky thing is that we not only need to mark the memory we want to execute as executable (which makes sense; here it's the call to mprotect()), we also can't do that on just any memory. Ordinary data memory, such as the stack and the heap, is normally read/write but not executable, and mprotect() works on whole, page-aligned regions.
In Unix/POSIX, we can ask for a memory page with mmap(). This ensures we get a whole page, and that the address returned meets a bunch of special rules that we aren't too concerned with the details of. The important part is that memory returned by mmap() can have its access permissions changed with mprotect().
Conveniently, we can mmap a page for read/write, copy our machine code to it, and then mark it as executable without too much hassle.
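All we have to do then is explain to our C++ compiler that the address lPage can be called like a function (which it kind of is). Interestingly, static_cast won't do this: it refuses to cross the data/function pointer barrier. You need a C-style cast or reinterpret_cast, and the standard only conditionally supports converting an object pointer to a function pointer, although compilers on POSIX systems allow it.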
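For reference, here is a minimal sketch of the mmap, copy, mprotect, call sequence described above. It is not the code from the gist: the helper name run_machine_code is made up for illustration, and it assumes an amd64 Unix-like system where MAP_ANONYMOUS is available.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `size` bytes of machine code into a fresh page,
// flips the page from read/write to read/execute, and calls it.
// Returns false if the code does not fit in one page or a system call fails.
bool run_machine_code(const unsigned char *code, std::size_t size) {
    assert(code != nullptr);

    const long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || size > static_cast<std::size_t>(page_size)) {
        return false;
    }
    const std::size_t length = static_cast<std::size_t>(page_size);

    // Ask for one anonymous, private, read/write page.
    void *lPage = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (lPage == MAP_FAILED) {
        return false;
    }

    std::memcpy(lPage, code, size);

    // Drop write access and add execute access.
    if (mprotect(lPage, length, PROT_READ | PROT_EXEC) != 0) {
        munmap(lPage, length);
        return false;
    }

    // Treat the page as a function taking no arguments and returning nothing.
    void (*function)() = (void (*)())lPage;
    function();

    munmap(lPage, length);
    return true;
}
```

Feeding it four NOPs and a ret, that is the bytes 0x90, 0x90, 0x90, 0x90, 0xC3, should simply return true: the CPU runs the NOPs, hits the ret, and lands back in the caller.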
But, that's not really a compiler of any sort, it's just injecting arbitrary code into a program.
Well, the array of chars that is our machine code could be modified. Say we want to make up some machine code that adds two arbitrary numbers, but we don't want to load the numbers, we want them written into the machine code itself once they are known.
It would look something like this:
https://gist.github.com/FlyingJester/369647a80d62ea5c7e62
So that's actually much cooler. Now, we are generating machine code on the fly!
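A rough sketch of the same idea, again not the gist's code: the template below is `mov eax, imm32; add eax, imm32; ret`, and the two numbers are copied over the placeholder bytes once they are known. The names make_add_code and run_int_code are hypothetical, and the raw memcpy assumes the host is little-endian, which amd64 is.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Builds amd64 machine code that returns a + b, with both numbers
// written directly into the instruction stream as immediates.
std::vector<unsigned char> make_add_code(std::int32_t a, std::int32_t b) {
    std::vector<unsigned char> code = {
        0xB8, 0, 0, 0, 0,  // mov eax, imm32 (placeholder for a)
        0x05, 0, 0, 0, 0,  // add eax, imm32 (placeholder for b)
        0xC3               // ret
    };
    // Assumes a little-endian host, so a raw copy matches the
    // byte order the CPU expects for an immediate.
    std::memcpy(&code[1], &a, sizeof a);
    std::memcpy(&code[6], &b, sizeof b);
    return code;
}

// Copies the code into an executable page and calls it as a function
// returning a 32-bit integer. Returns false if a system call fails.
bool run_int_code(const std::vector<unsigned char> &code, std::int32_t &result) {
    assert(!code.empty());

    const long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || code.size() > static_cast<std::size_t>(page_size)) {
        return false;
    }
    const std::size_t length = static_cast<std::size_t>(page_size);

    void *page = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return false;
    }
    std::memcpy(page, code.data(), code.size());

    if (mprotect(page, length, PROT_READ | PROT_EXEC) != 0) {
        munmap(page, length);
        return false;
    }

    // The return value comes back in eax, as the calling convention expects.
    std::int32_t (*function)() = (std::int32_t (*)())page;
    result = function();

    munmap(page, length);
    return true;
}
```

With this, `run_int_code(make_add_code(2, 3), sum)` should leave 5 in `sum`. The generated code never loads either number from memory; both live inside the instructions themselves.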
But you know what would be even cooler? If we made the code's behaviour even more dynamic. Just modifying data is cool and all, but we could have just coded in addresses and used pointers in our machine code (that also would have been kind of cool, given that now our machine code would have embedded instance-specific addresses...). What if we actually change both instructions and data to generate our code?
https://gist.github.com/FlyingJester/1f6d3464f391045c7a41
Now that's much more like it.
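To give a flavour of what changing both instructions and data can look like, here is a small sketch that is my own illustration rather than the gist's approach. It picks an opcode per step (add, sub, and, or, xor on eax with a 32-bit immediate) and emits the operand right after it. Step, emit_imm32, compile_steps and run_int_code are all made-up names, and the same amd64 Unix-like assumptions apply.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One step of a tiny made-up "program": apply `op` with `value` to eax.
struct Step {
    char op;
    std::int32_t value;
};

// Appends a 32-bit immediate in little-endian byte order, as x86 expects.
void emit_imm32(std::vector<unsigned char> &code, std::int32_t value) {
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    for (int shift = 0; shift < 32; shift += 8) {
        code.push_back(static_cast<unsigned char>((bits >> shift) & 0xFF));
    }
}

// Emits: mov eax, start; one instruction per step; ret.
// Returns false (leaving `code` partially filled) on an unknown operator.
bool compile_steps(std::int32_t start, const std::vector<Step> &steps,
                   std::vector<unsigned char> &code) {
    code.clear();
    code.push_back(0xB8);  // mov eax, imm32
    emit_imm32(code, start);
    for (const Step &step : steps) {
        switch (step.op) {
            case '+': code.push_back(0x05); break;  // add eax, imm32
            case '-': code.push_back(0x2D); break;  // sub eax, imm32
            case '&': code.push_back(0x25); break;  // and eax, imm32
            case '|': code.push_back(0x0D); break;  // or  eax, imm32
            case '^': code.push_back(0x35); break;  // xor eax, imm32
            default: return false;
        }
        emit_imm32(code, step.value);
    }
    code.push_back(0xC3);  // ret
    return true;
}

// Copies the code into an executable page and calls it as a function
// returning a 32-bit integer. Returns false if a system call fails.
bool run_int_code(const std::vector<unsigned char> &code, std::int32_t &result) {
    assert(!code.empty());
    const long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || code.size() > static_cast<std::size_t>(page_size)) {
        return false;
    }
    const std::size_t length = static_cast<std::size_t>(page_size);
    void *page = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return false;
    }
    std::memcpy(page, code.data(), code.size());
    if (mprotect(page, length, PROT_READ | PROT_EXEC) != 0) {
        munmap(page, length);
        return false;
    }
    std::int32_t (*function)() = (std::int32_t (*)())page;
    result = function();
    munmap(page, length);
    return true;
}
```

Compiling the steps `+5`, `-3`, `^1` with a start value of 10 produces 21 bytes of code, and running them should give 13. It's still a long way from a JIT, but the shape of the generated code now depends on the input, not just the constants inside it.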
So, what did I learn from this adventure?
Dynamic code generation and execution is frighteningly easy. I didn't expect this to be so simple, or to work so easily.
Of course, this is bordering on the kind of black magic that could destroy any project. It's ridiculous and completely unnecessary. Don't actually do this unless it is the intended product of your program.
...But it's also really fun to do!