
Is undefined behavior worth it?

Many bad things happened and continue to happen (or not, who knows, anything can happen) due to undefined behavior. I understand that this was introduced to leave some wiggle-room for compilers to optimize, and maybe also to make C++ easier to port to different platforms and architectures. However the problems caused by undefined behavior seem to be too large to be justified by these arguments. What are other arguments for undefined behavior? If there are none, why does undefined behavior still exist?

Edit To add some motivation for my question: Due to several bad experiences with less C++-crafty co-workers I have gotten used to making my code as safe as possible. Assert every argument, rigorous const-correctness and stuff like that. I try to leave as little room as possible to use my code the wrong way, because experience shows that, if there are loopholes, people will use them, and then they will call me about my code being bad. I consider making my code as safe as possible a good practice. This is why I do not understand why undefined behavior exists. Can someone please give me an example of undefined behavior that cannot be detected at runtime or compile time without considerable overhead?


I think the heart of the concern comes from the C/C++ philosophy of speed above all.

These languages were created at a time when raw computing power was scarce and you needed every optimization you could get just to have something usable.

Specifying how to deal with UB would mean detecting it in the first place, and then of course specifying the handling proper. However, detecting it goes against the speed-first philosophy of the languages!

Today, do we still need fast programs? Yes, for those of us working either with very limited resources (embedded systems) or with very harsh constraints (on response time or transactions per second), we do need to squeeze out as much as we can.

I know the motto: throw more hardware at the problem. We have an application where I work:

  • expected time for an answer? Less than 100 ms, with DB calls in the midst (say thanks to memcached).
  • number of transactions per second? 1,200 on average, with peaks at 1,500-1,700.

It runs on about 40 monsters: 8 dual-core Opterons (2800 MHz) with 32 GB of RAM each. It gets difficult to be "faster" with more hardware at this point, so we need optimized code, and a language that allows it (we did refrain from throwing assembly code in there).

I must say that I don't care much for UB anyway. If you get to the point where your program invokes UB, then it needs fixing whatever behavior actually occurred. Of course it would be easier to fix if it were reported straight away: that's what debug builds are for.
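
For instance, a minimal sketch of the debug-build idea, using a plain assert (the divide function is just an illustration):

#include <cassert>

int divide(int a, int b)
{
    assert(b != 0);  // debug builds report the bad call straight away
    return a / b;    // release builds (NDEBUG) keep the unchecked speed
}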

So perhaps, instead of focusing on UB, we should learn to use the language:

  • don't use unchecked calls
  • (for experts) don't use unchecked calls
  • (for gurus) are you sure you really need an unchecked call here?
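
A minimal sketch of the checked/unchecked distinction, using std::vector:

#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<int> v(10);
    // v[15] = 0;      // unchecked call: out-of-range access is UB
    try {
        v.at(15) = 0;  // checked call: throws instead of invoking UB
    } catch (std::out_of_range const& e) {
        std::cout << "caught: " << e.what() << '\n';
    }
}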

And everything is suddenly better :)


My take on undefined behavior is this:

The standard defines how the language is to be used, and how the implementation is supposed to react when used in the correct manner. However, it would be a lot of work to cover every possible use of every feature, so the standard just leaves it at that.

However, in a compiler implementation, you can't just "leave it at that": the code has to be turned into machine instructions, and you can't just leave blank spots. In many cases, the compiler can throw an error, but that's not always feasible: there are some instances where it would take extra work to check whether the programmer is doing the wrong thing (for instance: calling a destructor twice -- to detect this, the compiler would have to count how many times certain functions have been called, or add extra state, or something). So if the standard doesn't define it, and the compiler just lets it happen, weird things can sometimes happen, maybe, if you're unlucky.
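
To make the destructor case concrete, here is a minimal sketch of how it can happen without the compiler noticing:

struct Resource
{
    ~Resource() { /* release something */ }
};

void f()
{
    Resource r;
    r.~Resource();  // explicit destructor call
}                   // the destructor runs again at scope exit: UB, and
                    // detecting it would require exactly that extra state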


The problems are not caused by undefined behaviour; they are caused by writing the code that leads to it. The answer is simple: don't write that kind of code - not doing so is not exactly rocket science.

As for:

an example of undefined behavior that cannot be detected at runtime or compile time without considerable overhead

A real-world issue:

int * p = new int;
// call loads of stuff, somewhere in which an alias to p is created:
int * q = p;
delete p;

// call more stuff, somewhere in which you do:
delete q;   // double delete: undefined behaviour

Detecting this at compile time is impossible. At run time it is merely extremely difficult, and would require the memory allocation system to do far more book-keeping (i.e. be slower and take up more memory) than is the case if we simply say the second delete is undefined. If you don't like this, perhaps C++ is not the language for you - why not switch to Java?
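
To give an idea of the run-time cost, here is a rough sketch of the extra book-keeping a checked allocator would need (the names are made up for illustration):

#include <cstdlib>
#include <set>

std::set<void*> g_live;  // every live allocation: extra memory, extra lookups

void* checked_alloc(std::size_t n)
{
    void* p = std::malloc(n);
    g_live.insert(p);          // pay on every allocation
    return p;
}

void checked_free(void* p)
{
    if (!p) return;            // freeing a null pointer is legal
    if (g_live.erase(p) == 0)  // pay a lookup on every deallocation
        std::abort();          // double or invalid delete caught here
    std::free(p);
}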


The main source of undefined behaviour is pointers, and that's why C and C++ have a lot of undefined behaviour.

Consider this code:

char * r = (char *)0x012345ff;
std::cout << r;

This code looks very bad, but should it issue an error? What if that address is indeed readable, i.e. it's a value I obtained somehow (maybe a device address, etc.)?

In cases like this, there's no way to know whether the operation is legal or not, and if it isn't, its behaviour is indeed unpredictable.
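
For instance, on an embedded target the very same pattern can be perfectly legitimate (this sketch assumes a memory-mapped device register at that address):

// hypothetical memory-mapped device register on some board
volatile char* uart = reinterpret_cast<volatile char*>(0x012345ff);
*uart = 'A';  // legal on that hardware, but unverifiable by the compiler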

Apart from this: in general, C++ was designed with the "zero overhead rule" in mind (see The Design and Evolution of C++), so it couldn't possibly impose any burden on implementations to check for corner cases etc. You should always keep in mind that this language was designed, and is indeed used, not only on the desktop but also in embedded systems with limited resources.


Many things that are defined as undefined behavior would be hard, if not impossible, to diagnose by the compiler or runtime environment.

The ones that are easy have already turned into de facto defined behavior. Consider calling a pure virtual method: it is undefined behavior, but most compilers/runtime environments will report it in the same terms: "pure virtual method called". The de facto standard is that calling a pure virtual method is a runtime error in all environments I know of.
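
A minimal sketch of how such a call can happen; most implementations terminate with a message along those lines:

struct Base
{
    Base() { helper(); }       // during construction the object is still a Base,
    void helper() { init(); }  // so virtual dispatch lands on the pure virtual
    virtual void init() = 0;
    virtual ~Base() {}
};

struct Derived : Base
{
    void init() override {}    // never reached from Base's constructor
};

int main()
{
    Derived d;  // typically aborts with "pure virtual method called"
}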


The standard leaves "certain" behaviour undefined in order to allow a variety of implementations, without burdening those implementations with the overhead of detecting "certain" situations, or burdening the programmer with constraints required to prevent those situations arising in the first place.

There was a time when avoiding this overhead was a major advantage of C and C++ for a huge range of projects.

Computers are now several thousand times faster than they were when C was invented, and the overheads of things like checking array bounds all the time, or having a few megabytes of code to implement a sandboxed runtime, don't seem like a big deal for most projects. Furthermore, the cost of (e.g.) overrunning a buffer has increased many times over, now that our programs handle many megabytes of potentially malicious data per second.

It is therefore somewhat frustrating that there is no language which has all of C++'s useful features, and which in addition has the property that the behaviour of every program which compiles is defined (subject to implementation-specific behaviour). But only somewhat - it's not actually all that difficult in Java to write code whose behaviour is so confusing that from the POV of debugging, if not security, it might as well be undefined. It's also not at all difficult to write insecure Java code - it's just that the insecurity usually is limited to leaking sensitive information or granting incorrect privileges over the app, rather than giving up complete control of the OS process the JVM is running in.

So the way I see it is that good software engineering requires discipline in all languages; the difference is what happens when our discipline fails, and how much we're charged by other languages (in performance and footprint and C++ features you like) for insurance against that. If the insurance provided by some other language is worth it for your project, take it. If the features provided by C++ are worth paying for with the risk of undefined behaviour, take C++.

I don't think there's much mileage in trying to argue, as if it were a global property that's the same for everyone, whether the benefits of C++ "justify" the costs. They're justified within the terms of reference for the design of the C++ language, which are that you don't pay for what you don't use. Hence, correct programs should not be made slower so that incorrect programs get a useful error message instead of UB, and most of the time behaviour in unusual cases (e.g. shifting a 32-bit value left by 32) should not be defined (e.g. to result in 0) if that would require the unusual case to be checked for explicitly on hardware which the committee wants to support C++ "efficiently".
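
The shift example in concrete terms (a sketch; the x86 behaviour noted in the comments is the usual outcome, not a guarantee):

#include <cstdint>

std::uint32_t shl(std::uint32_t v, unsigned n)
{
    return v << n;  // undefined behaviour when n >= 32
}
// On x86 the hardware masks the shift count to 5 bits, so shl(1, 32)
// typically yields 1, not 0; defining the result as 0 would force every
// shift to be preceded by an explicit check on such hardware.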

Look at another example: I don't think the performance benefits of Intel's professional C and C++ compiler justify the cost of buying it. Hence, I haven't bought it. Doesn't mean others will make the same calculation I made, or that I will always make the same calculation in future.


Compilers and programming languages are among my favorite topics. In the past I did some research related to compilers, and I ran into undefined behavior many, many times.

C++ and Java are very popular. That does not mean they have a great design. They are widely used because they took risks, to the detriment of their design quality, just to gain acceptance. Java went for garbage collection, a virtual machine and a pointer-free appearance. They were partly pioneers and could not learn from many previous projects.

In the case of C++, one of the main goals was to bring object-oriented programming to C users. Even C programs were supposed to compile with a C++ compiler. That left a lot of nasty open points, and C already had many ambiguities. C++'s emphasis was power and popularity, not integrity. Not many languages give you multiple inheritance; C++ gives you that, although not in a very polished way. Undefined behavior will always be there to support its glory and backwards compatibility.

If you really want a robust and well-defined language you must look somewhere else. Sadly, that is not the main concern of most people. Ada, for example, is a great language where clear, defined behavior is important, but hardly anyone cares about it because of its narrow user base. I am biased in the example because I really like that language; I posted something on my blog, but if you want to learn more about how a language definition can help you have fewer bugs even before you compile, have a look at these slides

I am not saying C++ is a bad language! It just has different goals, and I love working with it. You also have a large community, great tools, and much more great stuff such as the STL, Boost and Qt. But your doubt is also the root of becoming a great C++ programmer. If you want to be great with C++ this should be one of your concerns. I would encourage you to read the previous slides and also this critique. It will help you a lot in understanding those times when the language is not doing what you expect.

And by the way: undefined behavior goes totally against portability. In Ada, for example, you have control over the layout of data structures (in C and C++ it can change according to machine and compiler). Threads are part of the language. So porting C and C++ software will give you more pain than pleasure.


It's important to be clear on the differences between undefined behavior and implementation-defined behavior. Implementation-defined behavior gives compiler writers the opportunity to add extensions to the language in order to leverage their platform. Such extensions are necessary in order to write code that works in the real world.

UB on the other hand exists in cases where it is difficult or impossible to engineer a solution without imposing major changes in the language or big differences from C. One example, taken from a page where Bjarne Stroustrup talks about this, is:

int a[10];
a[100] = 0; // range error
int* p = a;
// ...
p[100] = 0; // range error (unless we gave p a better value before that assignment)

The range error is UB. It is an error, but how precisely the platform should deal with it is left undefined by the Standard, because the Standard can't define it: each platform is different. It can't be engineered into a guaranteed error, because that would necessitate including automatic range checking in the language, which would require a major change to the language's feature set. The p[100] = 0 error is even more difficult for the language to generate a diagnostic for, either at compile time or run time, because the compiler can't know what p really points to without run-time support.
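
A sketch of why the compiler is helpless here: by the time p reaches a separately compiled function, the bounds are long gone.

// compiled in another translation unit: nothing here says how big
// the array behind p is, so no static check is possible
void store(int* p, int i)
{
    p[i] = 0;  // fine or a range error, depending entirely on the caller
}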


I asked myself that same question a few years ago. I stopped considering it right away, when I tried to provide a proper definition for the behavior of a program that writes through a null pointer.

Not all devices have a concept of protected memory, so you can't possibly rely on the system to protect you via a segfault or similar. Not all devices have read-only memory, so you can't possibly say that the write simply does nothing. The only other option I could think of is to require that the application raise an exception (or abort, or something) without help from the system. But in that case, the compiler has to insert code before every single memory write to check for null, unless it can guarantee that the pointer has not changed since the last memory write. That is clearly unacceptable.
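
A sketch of what "insert code before every single memory write" would mean in practice (handle_null_write is a hypothetical runtime hook, not a real API):

#include <cstdlib>

void handle_null_write() { std::abort(); }  // hypothetical hook: abort, throw, ...

// what the compiler would effectively have to emit for every `*p = value`
// it cannot prove is non-null
void checked_store(int* p, int value)
{
    if (p == nullptr)
        handle_null_write();
    *p = value;
}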

So, leaving the behavior undefined was the only logical decision I could come to, without saying "Compliant C++ compilers can only be implemented on platforms with protected memory."


Here's my favourite: after you've done delete on a non-null pointer, any use of it (not only dereferencing, but also casting, etc.) is UB (see this question).

How you can run into UB:

{
    char* pointer = new char[10];
    delete[] pointer;
    // some other code
    printf( "deleted %x\n", pointer );
}

Now, on all architectures I know, the code above will run fine, but teaching the compiler or runtime to perform analysis of such situations is very hard and expensive. Don't forget that sometimes there might be millions of lines of code between the delete and the use of the pointer. Setting pointers to null immediately after delete can be costly, so it's not a universal solution either.
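
One more reason it isn't universal, as a minimal sketch: other copies of the pointer keep the stale value.

#include <cstdio>

int main()
{
    char* a = new char[10];
    char* b = a;   // a second copy of the pointer
    delete[] a;
    a = nullptr;   // nulling a doesn't help:
    std::printf("%p\n", static_cast<void*>(b));  // b is still dangling: UB to use it
}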

That's why there's the concept of UB. You don't want UB in your code: maybe it works, maybe it doesn't; it works on this implementation and breaks on another.


There are times when undefined behavior is good. Take a big int for example.

union BitInt
{
    __int64 Whole;
    struct
    {
        int Upper;
        int Lower; // or maybe it's Lower then Upper; depends on the architecture's endianness
    } Parts;
};

The spec says that if we last wrote to Whole, then reading from Parts is undefined.

Now, that's just a tad silly to me because if we couldn't touch any other parts of the union then there is no point in having the union in the first place, right?

But anyway, maybe some functions will take an __int64 while other functions take the two separate ints. Rather than convert every time, we can just use this union. Every compiler I know treats this undefined behavior in a pretty predictable way, so in my opinion undefined behavior isn't so bad here.
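
A sketch of the convenience being described (TakesWhole and TakesParts are hypothetical consumers; BitInt is the union above):

#include <cstdio>

// hypothetical consumers: one wants the 64-bit value, the other its halves
void TakesWhole(__int64 w)      { std::printf("%lld\n", (long long)w); }
void TakesParts(int hi, int lo) { std::printf("%d %d\n", hi, lo); }

void Use(BitInt b)
{
    TakesWhole(b.Whole);                       // no conversion needed
    TakesParts(b.Parts.Upper, b.Parts.Lower);  // reading Parts after writing
                                               // Whole: technically undefined,
                                               // but consistent in practice
}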
