Even calling uninitialized data “garbage” is misleading. You might expect that the compiler would just leave out some initialization code and compile the remaining code in the expected way, causing the values to be “whatever was in memory previously”. But no - the compiler can (and absolutely will) optimize by assuming the values are whatever would be most convenient for optimization reasons, even if it would be vanishingly unlikely or even impossible.
As an example, consider this code (godbolt: https://godbolt.org/z/TrMrYTKG9):
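The snippet isn't reproduced inline, but a plausible reconstruction consistent with the description (the struct name, field types, and function name are my guesses, not the original):

```cpp
#include <cassert>

struct Pair { unsigned char a, b; };

// Deliberately leaves one field uninitialized on each path.
Pair f(bool x) {
    Pair p;        // p.a and p.b start out uninitialized
    if (x)
        p.a = 13;
    else
        p.b = 37;
    return p;      // one field is always indeterminate here
}
```

With optimization enabled, the compiler is allowed to materialize both assignments unconditionally: packing a=13 into the low byte and b=37 into the next byte gives 0x250D, which is 9485 in decimal.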
At high enough optimization levels, the function compiles to “mov eax, 9485; ret”, which sets both a=13 and b=37 without testing the condition at all - as if both branches of the test were executed. This is perfectly reasonable: because of the lack of initialization, the values could already have been set that way (however unlikely), so the compiler just goes ahead and sets them that way. It’s faster!

Even the notion that uninitialized memory contains values is kind of dangerous. Once you access it, you can't reason about what's going to happen at all. Behaviour can occur that's not self-consistent with any value at all: https://godbolt.org/z/adsP4sxMT
How is this an "optimization" if the compiled result is incorrect? Why would you design a compiler that can produce errors?
It’s not incorrect.
The code says that if x is true then a=13, and if it is false then b=37.
This is the case. It's just that a=13 even when x is false - something the code had nothing to say about, and which the compiler is therefore free to do.
It's not incorrect. Where is the flaw?
I have bumped into this myself, too. It's really annoying. The biggest footgun isn't even discussed explicitly, and it might be how the error got introduced: when a struct goes from POD to non-POD or vice versa, the rules change. So a completely innocent change, like adding a string field, can suddenly introduce undefined behaviour in unrelated code that was previously correct.
Many years ago we had a customer complaint about undefined data changing value in Fortran 77. It turned out that the compiler never allocated storage for uninitialized variables, so they were aliased to something else.
The compiler was changed to allocate storage for any referenced variables.
Symbian's way of avoiding this was to derive from a class called CBase. CBase would memset the entire allocated memory for the object to binary zeros, thus zeroing every member variable.
And by convention, all classes derived from CBase would start their name with C, so something like CHash or CRectangle.
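A sketch of the idea (not the real Symbian API, which I'm not reproducing here): a base class whose replacement `operator new` zero-fills the allocation, so every member of a derived object starts out as binary zeroes.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Base class that zero-fills its own allocation before construction.
class CBase {
public:
    static void* operator new(std::size_t size) {
        void* p = std::malloc(size);
        if (!p) throw std::bad_alloc{};
        std::memset(p, 0, size);     // every member starts as binary zero
        return p;
    }
    static void operator delete(void* p) { std::free(p); }
    virtual ~CBase() {}
};

// Per the naming convention, derived classes start with 'C'.
class CRectangle : public CBase {
public:
    int width;   // zeroed by operator new, even with no constructor
    int height;
};
```

Note this only helps for heap allocation through `new`; a stack-allocated object would bypass the custom allocator, which is one reason Symbian C-classes were heap-only by convention.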
If the two fields in the struct are expected to be false unless changed, then initialize them as such. Nothing is gained by leaving it to the compiler, and a lot is lost.
I think the point is that sometimes variables are defined by the language spec as initialized to zero, and sometimes they aren't.
Perhaps what you mean is, "Nothing is to be gained by relying on the language spec to initialize things to zero, and a lot is lost"; I'd agree with that.
Please don't be pedantic. Compilers implement the standard, otherwise it's just a text document.
Great post. It was both funny and humble. Of course, it probably wasn't at all funny at the time.
tldr; the UB was reading uninitialized data in a struct. The C++ rules for when default initialization occurs are crazy complex.
I think a sanitizer probably would have caught this, but IMHO this is the language's fault.
Hopefully future versions of C++ will mandate default initialization for all cases that are UB today and we can be free of this class of bug.
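A minimal illustration of how context-dependent the current rules are (each comment states what the standard actually guarantees):

```cpp
#include <cassert>

int g;                 // static storage duration: zero-initialized

struct P { int a; int b; };

int demo() {
    static int s;      // static local: also zero-initialized
    int local;         // automatic: indeterminate; reading it is UB
    P p1;              // default-initialized: members indeterminate
    P p2{};            // value-initialized: members are zero
    P p3 = {};         // aggregate-initialized: also zero

    (void)local; (void)p1;           // never read these
    return g + s + p2.a + p3.b;      // all guaranteed zero
}
```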
Yeah... but I wouldn't characterize the bug itself (in its essential form) as UB.
Even if the implementation specified that the data would be indeterminate depending on what existed in that memory location previously, the bug would still exist.
Even if you hand-coded this in assembly, the bug would still exist.
The essence of the bug is uninitialized data being garbage. That's always gonna be a latent bug, regardless of whether the behavior is defined in an ISO standard.
Yeah, I agree. This is a classic “uninitialized variable has garbage memory value” bug, but it is not an “undefined nasal-demons behavior” bug.
That said, we all learn this one! I spent like two weeks debugging a super rare desync bug in a multiplayer game with a P2P lockstep synchronous architecture.
Suffice to say I am now a zealot about providing default values all the time. Thankfully it’s a lot easier since C++11 came out and lets you define default values at the declaration site!
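For example (a hypothetical struct, but the mechanism is standard C++11 default member initializers):

```cpp
// Every constructor that doesn't explicitly override these
// fields applies the declaration-site defaults automatically.
struct Settings {
    bool verbose = false;
    bool dry_run = false;
    int  retries = 3;
};
```

With this, `Settings s;` is fully determined with no constructor boilerplate, and adding a field later can't silently introduce an uninitialized member.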
I prefer that language constructs define new storage as zero-initialized. It doesn't prevent all bugs (e.g. application logic bugs), but at least it gives deterministic results. These days it's zero-cost for local variables and near-zero-cost for fields. This is the case in Virgil.
C & C++ run on systems where it may not be zero cost. If you need low latency startup it could be a liability to zero out large chunks of memory.
For now, best strategy is to initialize everything explicitly.