Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!snorkelwacker.mit.edu!ai-lab!life!burley
From: burley@pogo.ai.mit.edu (Craig Burley)
Newsgroups: comp.software-eng
Subject: Re: Effect of execution-speed on reliability/testing
Message-ID:
Date: 19 Dec 90 14:35:13 GMT
References: <1990Dec19.102005.11830@engin.umich.edu>
Sender: news@ai.mit.edu
Organization: /home/fsf/burley/.organization
Lines: 95
In-reply-to: zarnuk@caen.engin.umich.edu's message of 19 Dec 90 10:20:05 GMT

In article <1990Dec19.102005.11830@engin.umich.edu> zarnuk@caen.engin.umich.edu (Paul Steven Mccarthy) writes:

   Speed has a direct impact on reliability. Systems that are too slow
   to test thoroughly, just don't get tested.

Great example you give. I have a similar, though funnier I think,
example.

At one company I was at, they had an in-house bulletin board system
kind of like this newsgroup system, but less sophisticated and using
the built-in file networking of the systems instead of "real" network
calls. Anyway, it was written and maintained on an "on the programmer's
free time" basis, and the "programmer" at one point was someone who
apparently had little free time except to "keep shoveling", as it were,
to keep his head above the water, since the system required several
kinds of maintenance (dealing with problem postings, creating new
categories, and so on).

One of the tasks, an annual one, was to run a "renumber" program that
would renumber all the postings so searches would go much faster.
Problem was, this program was written in the system's command language,
the equivalent of a unix shell script or an MS-DOS .BAT (?) file I
guess, and that made it a) slow, and b) unable to deal with simple
errors.

I had mentioned this to the programmer in charge, but he said that
because he was so busy, he couldn't take the two weeks or so he thought
it might take to rewrite it in a compiled language.
(And, since he was a compiler person or some such thing, that made
sense, because he wasn't an expert on the OS-level file system calls
that would comprise most of the application.)

So, one night around the New Year, he had started the thing running
over the central bulletin board data base at around 7 pm. This task was
projected to take over 6 hours, based on earlier runs, and prevented
people from using the bulletin board during that time.

Meanwhile, in our group we had a local bulletin board data base that
also needed renumbering, and I was kind of in charge of that one, having
set it up. Rather than just run the renumber program, I decided to
figure it out. After doing that, I decided to write a new version in a
high-level language. (Want a good laugh? I used PL/I Subset G.)

I then used a copy of our smaller local data base to debug and test it,
including the error recovery features I implemented that weren't in the
original. Because it ran fast enough, it was easy to do this --
purposely trash postings, user profiles, whatever, in the sample data
base, run my program, and make sure it did something sensible.

When I was confident in the program, I ran it on our live local data
base, and it worked great.

So I decided to give it a really big test. I made a copy of the huge
central data base and ran my program on that. Some 5-10 minutes later
(maybe it was 15), it was done, and had fixed some "problems" it
encountered that I had programmed it to deal with (and let me know it
had done so, so I could check the original and make sure it did the
right thing).

The result was a new renumber program that was around two orders of
magnitude faster than the old one, just as easy to maintain (I think),
far better at error recovery, and not much larger (the executable
image -- though this wasn't an important consideration).

How long did it take me to do this?
Well, at the time I knew the OS-level file system calls quite well -- I
was writing a manual on these in fact -- so it didn't take me long at
all. In fact, it took only a few hours!

When I was finished, the "production" run of the original renumber
program was still going on the original data base. So I left my copy of
the "renumbered" data base around in case the other programmer wanted
it, sent him mail about it and my new program, and prepared to head
home.

Before I went home, I took another look at the system that was either
running the program or holding the disk with the live central data
base, and noticed the operations staff was just about to shut it down --
thus terminating the program, which would require restarting it from
(essentially) the beginning.

But there wasn't any need to ask operations to delay shutdown -- the
program had already crashed! It had encountered one of those problems
that the new, faster version of the renumber program also had
encountered, but instead of fixing it, it just died.

The other programmer was appreciative, because not only did he have a
new renumber program -- one he'd wanted to write himself, but hadn't had
the time -- but he also had a ready-to-use renumbered central data base
to replace the old one with, so people could use the bulletin board
system again, and he wouldn't have to deal with the deluge of email
asking why it wasn't back again!

The moral? Just because something is slow, unwieldy, old, and appears
to get the job done, doesn't mean it isn't worth rewriting to be fast,
new, and better able to do the same job. Even if the program gets run
only once or so per year!
--
James Craig Burley, Software Craftsperson    burley@ai.mit.edu