Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!snorkelwacker.mit.edu!bloom-picayune.mit.edu!news From: scs@adam.mit.edu (Steve Summit) Newsgroups: comp.unix.programmer Subject: more on error checking (was: Complexity of reallocating storage) Message-ID: <1991Feb10.071120.9520@athena.mit.edu> Date: 10 Feb 91 07:11:20 GMT References: <1991Feb07.013637.6542@convex.com> Sender: news@athena.mit.edu (News system) Reply-To: scs@adam.mit.edu Organization: Thermal Technologies, Inc. Lines: 147 In article <1991Feb07.013637.6542@convex.com>, Tom Christiansen writes: >As was mentioned earlier, *every* system call should *always* >be checked, even if you "know" it can't fail. Indeed. However, at the risk of confusing the issue further just when this basic fact is starting to sink in, it's important to remember, when you set out to check for every error everywhere, that the thing to do upon finding one may not be to complain and abort. I'll provide several examples. My own code is usually liberally sprinkled with assertions and sanity checks. One fruitful place to put them is in the default case of a switch, when the default should "never" be reached. Suppose I have a little routine which decodes the value of some internal, numeric flag (presumably so it can be printed out somewhere): #include "status.h" char * status_str(sts) int sts; { switch(sts) { case STS_GOLDEN: return "STS_GOLDEN"; break; case STS_DEMENTED: return "STS_DEMENTED"; break; case STS_ANHYDROUS: return "STS_ANHYDROUS"; break; default: panic("status_str: bad status %d", sts); } } The panic is there, of course, so that next week when I add another status code but forget to update the status_str() routine, my mistake will be very noticeable, and there will be a strong reminder and incentive to fix it. But does it really have to be a fatal error? Probably not, especially if there's any chance that the "impossible case" won't be triggered by me, but by some (hopefully beta-test) user. A better way to handle this particular "impossible case" is nonfatally: default: { static char tmpbuf[15]; oddity("status_str: bad status %d", sts); sprintf(tmpbuf, "status %d", sts); return tmpbuf; } In my code, "oddity" is just like panic (i.e. it prints debugging messages that are to be construed as the developer's problem, not the user's) but it's nonfatal. There's still a noisy message, there's still incentive and reminder to fix the bug, but at least the program can continue to run. There was a case like this in the 4.1bsd tty driver. (I don't know if it's still there.) The driver tries to do CRT erase processing (i.e. the backspace-space-backspace stuff that wipes characters off the screen as you press the delete or rubout or backspace key or whatever) perfectly. The tricky part is that you might need 1, 2, or up to 8 backspaces, depending on whether there is a normal character, a control character (echoed as "^X"), or a tab on the screen being erased. The code for figuring out how many backspaces to emit involved a big array full of action codes. There were a few impossible cases. Sure enough, they were handled with a panic. I don't know about you, but I think that aesthetically pleasing crterase processing, though nice, is a bit of a frill, and probably one of the last things that it's vital for an operating system kernel to do perfectly. I would be very, very annoyed (well, to tell the truth, I'd also be ridiculously amused) if a machine crashed, dropping several users on the floor and losing lots of their work, just because some poor schlep tried to hit the delete key to get rid of some "impossible" character on his command line. (No, it never happened, but once I'd seen that code in the tty driver, I was forever slightly haunted with the fear that the system was going to crash because of it. In fact, hitting the impossible case would have represented a programming error, not an impossible character. Still, the panic seemed needlessly draconian.) Finally, consider write(1). On the recipient's screen, it prints something like message from scs on tty23 at 12:34 ... Now, whatever method it uses to find out what terminal the sender is logged into (probably ttyname(3)), it's not going to work if the sender doesn't have one. How could the sender not have one? Actually, the burden of proof should be on the author of write(1) to show why the program can't proceed if the tty is not known, not on the user to suggest circumstances under which it might be meaningful not to have one, but to satisfy the skeptics out there I'll mention a real-world, non-fabricated case: at jobs. (Actually, this case, or another one involving write(1), came up on the net just a week ago.) I sometimes like to send myself notification that one of my at jobs has finished by including echo "at job is done" | write scs in the at script. However, at scripts run with no controlling tty, so write(1) aborts and refuses to send the message, just because it won't be able to say where it's from. Just now I tried write(1)ing to myself from within emacs, so that I could see the exact "message from..." message on my screen. But since I'm running the Gigantic Neanderthal Ubiquitous emacs, which runs shell escapes in some kind of weird subshell with pipes, all I got was the "write: can't find your tty" message in the *Shell Command Output* window. Some time ago I wrote my own version of write(1), mostly so that I could put in the wrinkle that it leave out the "on ttyxx" part, rather than falling back and punting, if it couldn't determine it. (It also prints to several ttys if the user is logged in multiple times, rather than picking the wrong one. Now, of course, in these security-conscious days, private versions of write(1) don't work well, because they should be setgid.) So yes, check for all error returns. And yes, print noisy messages (unless the error is truly benign, as in the write(1) example). But think for a minute whether the error really has to be fatal, whether there's any reason the program can't continue. If it can't; if the error means that vital data has been corrupted such that continuing would be futile or worse; then by all means abort. But if you can safely continue with little or no effort, please do. Steve Summit scs@adam.mit.edu P.S. I am aware that there are better ways of decoding numeric flags than by using a big switch statement as in the first example. (In fact, I usually use a lookup table.)