So I started getting weird segfaults in minisphere after a recent change, but they'd only show up in a release build and were 100% non-reproducible in an unoptimized build. I figured I was invoking undefined behavior somewhere, but skimming my code everything looked fine.
I have the following declaration at the top of debugger.c:
static vector_t* s_sources;
This vector maintains a list of sources without a backing script file, e.g. scripts embedded in a map file. Keeping that code in memory enables SSJ to download it later for display. The vector is properly initialized on startup:
s_sources = vector_new(sizeof(struct source));
And if I print out its length right then, I get zero, as expected. However, I was getting segfaults later on when accessing that same vector (verified by adding debugging information to the release build), and examining its contents showed insane garbage values. Very strange. Now mind you, the C standard says static variables without an explicit initializer are always zero-initialized at startup. However, making the following change fixes the crash...
static vector_t* s_sources = NULL;
...and introduces a new crash when accessing a completely unrelated static vector later on! I'm pretty sure the optimizer is at fault here, because turning off optimization makes the segfaults stop, even while still using the release CRT. Are the uninitialized static variables being optimized away even though they're assigned to later on? If so, why didn't this issue show up long before now, since the minisphere codebase has always had a lot of static variables? I'm completely baffled. I mean, even if the pointer isn't actually initialized to zero, it still shouldn't be crashing, since it's assigned a valid value before anything else is done with it.
I would recommend using either (or both) the Clang static analyzer or Valgrind. They tend to help me a lot when I hit "impossible" crashes, either confirming that I'm hitting a real toolchain bug or helping me decipher why some crash actually makes sense.
I've actually tried Valgrind; I can't reproduce the crash at all under Linux regardless of GCC optimization settings, and no out-of-bounds accesses are reported. I haven't tried any static analysis yet, though.
It definitely seems like the MS compiler is at fault here, though. I was able to reproduce the crash with optimization disabled but inlining turned on (this was great, as it avoided variables being optimized away). Running under AppVerifier, I get something to the effect of "corrupt start stamp in heap block", suggesting a buffer underrun; however, Valgrind reports no such clobbering. Turning off inline expansion makes the crash go away.
I finally found the bug, by compiling with /GS to enable bounds checks. It then crashed exactly where the errant write happened (normally it doesn't crash until free(), long after the cause was out of sight). Here was the cause:
static char retval[SPHERE_PATH_MAX];
strncpy(retval, compiled_name, SPHERE_PATH_MAX - 1);
retval[SPHERE_PATH_MAX] = '\0'; // <-- oops! one past the end of the buffer
It was an out-of-bounds write to a local static buffer, which I guess clobbered some other static variable at file scope. Apparently adding initializers can affect the layout of static variables in the compiled code, since, as mentioned above, adding a NULL initializer to the affected variable just shifted the corruption onto a different one.
What's weird is that Valgrind didn't catch this! Although in hindsight it makes sense: Memcheck only puts redzones around heap allocations, so an out-of-bounds write that stays inside the program's static data never trips it.
I used to think heap buffer overflows were the most evil thing ever to debug; I now see that a static buffer overflow is much more insidious, because apparently it can clobber global state in completely unrelated code!
Now that that's over, I'm going to sheepishly go add /GS to my debug builds to catch this kind of thing in the future... :-[
Congrats... hard-to-find bugs take a while to find... thanks for your time!