Path: utzoo!dptcdc!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!hoptoad!tim
From: tim@hoptoad.uucp (Tim Maroney)
Newsgroups: comp.sys.mac.programmer
Subject: Re: Reading Between the Lines
Message-ID: <7021@hoptoad.uucp>
Date: 16 Apr 89 22:16:07 GMT
References: <451@biar.UUCP> <28839@apple.Apple.COM> <4012@ece-csc.UUCP> <6987@hoptoad.uucp> <4015@ece-csc.UUCP> <7015@hoptoad.uucp> <2551@cps3xx.UUCP>
Reply-To: tim@hoptoad.UUCP (Tim Maroney)
Distribution: na
Organization: Eclectic Software, San Francisco
Lines: 101

In article <2551@cps3xx.UUCP> rang@cpswh.cps.msu.edu (Anton Rang) writes:
>1.  Why should an OS provide newline support when high-level languages
>    also provide it?  To make life easier for the developer of a HLL.
>    Also, suppose that a program uses both C and Pascal, using both
>    fgets() and readln().  If the OS provides the newline support then
>    you don't have (much) duplication of code in the support libraries.

Could be true of Pascal, but not of C.  C's "stdio" buffered i/o library
does a lot more than just read lines.  Most C compilers use code licensed
from AT&T Bell Labs for at least some part of stdio, and that code
assumes the underlying OS file system is used for block-structured
reads.  It would actually be considerably harder (and less efficient)
for stdio to go through the OS for line-oriented reads.  So, the OS
might make it easier for a Pascal
implementer to write readln, but it wouldn't help a C implementer, nor
would it reduce functional overlap in library code between a program
incorporating both C and Pascal.
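
To make the layering concrete, here is a minimal sketch of how a
stdio-style library can sit on top of block reads.  It is not the AT&T
code; the names (reader, reader_getc, reader_gets, mem_read) are all
hypothetical, and an in-memory source stands in for the OS block read
call so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 512  /* assumed block size for this sketch */

/* Hypothetical stand-in for the OS block read: copies up to `want`
 * bytes into dst and returns the count, or 0 at end of file. */
typedef size_t (*block_read_fn)(void *src, char *dst, size_t want);

typedef struct {
    block_read_fn read;   /* underlying block read */
    void *src;            /* opaque handle passed to read */
    char buf[BLOCK];      /* one block of buffered data */
    size_t pos, fill;     /* next unread byte, bytes in buf */
} reader;

void reader_init(reader *r, block_read_fn fn, void *src)
{
    r->read = fn;
    r->src = src;
    r->pos = r->fill = 0;
}

/* Return the next byte, refilling the buffer one block at a time.
 * Returns -1 at end of file. */
int reader_getc(reader *r)
{
    if (r->pos == r->fill) {
        r->fill = r->read(r->src, r->buf, BLOCK);
        r->pos = 0;
        if (r->fill == 0)
            return -1;
    }
    return (unsigned char)r->buf[r->pos++];
}

/* fgets-like call built entirely on reader_getc: reads up to and
 * including `eol`, or until size-1 bytes are stored.  Returns line,
 * or NULL if nothing could be read. */
char *reader_gets(reader *r, char *line, size_t size, char eol)
{
    size_t n = 0;
    int c;

    assert(size > 1);
    while (n + 1 < size && (c = reader_getc(r)) != -1) {
        line[n++] = (char)c;
        if (c == eol)
            break;
    }
    if (n == 0)
        return NULL;
    line[n] = '\0';
    return line;
}

/* In-memory source used only to exercise the sketch. */
typedef struct { const char *data; size_t len, off; } mem_src;

size_t mem_read(void *src, char *dst, size_t want)
{
    mem_src *m = src;
    size_t left = m->len - m->off;

    if (want > left)
        want = left;
    memcpy(dst, m->data + m->off, want);
    m->off += want;
    return want;
}
```

The line-reading call never touches the OS directly; it only sees
bytes handed out of the block buffer, which is why an OS-level "read a
line" primitive has nothing to plug into here.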

>2.  Using individual read calls is slow; why use them?  Well, they're
>    probably always slower than doing stuff at a very low level--I can
>    write my own disk I/O routines and read stuff faster by totally
>    bypassing the file manager.

And break over LANs, other external file systems, new system releases, etc.

>    Just as one answer, maybe there's a
>    reason I don't want to allocate a big fixed-size buffer for
>    reading this file--after all, the smallest size which would make
>    sense for a buffer is a disk block.  Maybe I'm trying to conserve
>    memory in an INIT; maybe I need to read the file without worrying
>    about running out of memory in the process.

First, you allocate the buffer before you do any reading at all, so
there's no chance you can run out in the middle of the operation.
Second, you just get the biggest buffer you can given the current
memory space limitations.  If there's enough for the whole file, go for
it; if there are only 512 bytes in the largest buffer you can allocate,
use that instead.  (Though if you're that low on storage, you probably
won't be able to read in the file anyway....)
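
As a rough sketch of that strategy: ask for the size you want up
front, and if the allocation fails, keep halving the request down to a
512-byte floor.  This uses portable malloc() as a stand-in; on the Mac
you would presumably go through the Memory Manager instead.  The name
alloc_read_buffer and the halving policy are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MIN_BUF 512  /* one disk block: the smallest sensible buffer */

/* Allocate a read buffer before any reading starts.  Tries `want`
 * bytes first, then halves the request on each failure, never going
 * below MIN_BUF.  Returns the buffer (caller frees it) with the size
 * obtained in *got, or NULL with *got == 0 if even MIN_BUF bytes are
 * not available. */
char *alloc_read_buffer(size_t want, size_t *got)
{
    size_t size = want < MIN_BUF ? MIN_BUF : want;
    char *buf;

    assert(got != NULL);
    for (;;) {
        buf = malloc(size);
        if (buf != NULL) {
            *got = size;
            return buf;
        }
        if (size == MIN_BUF)
            break;              /* the floor failed too: give up */
        size /= 2;
        if (size < MIN_BUF)
            size = MIN_BUF;
    }
    *got = 0;
    return NULL;
}
```

Pass the file length as `want` to try for the whole file.  Because the
buffer exists before the first read call, running out of memory can
only happen here, not halfway through the file.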

>3.  Why do stuff inefficiently during development which we'd make more
>    efficient for a production program anyway?  Perhaps I'm porting a
>    program from another operating system.

To the Mac?  Maybe as an MPW Tool, but everyone who's tried to do this
kind of porting on a real application has wound up with awfully ugly
results.  There's a real philosophical difference between prompt-driven
software (the computer telling the user what to do) and event-driven
software (the user telling the computer what to do).  I can see porting
specific libraries without user interfaces to the Mac, e.g., a B-tree
database package for developers, but forget about porting ordinary
programs.

>4.  A bit more complex.  Joseph Hall claims that reading as much as
>    possible on each read call isn't necessarily the key to speed.
>    Tim says it's speculation.  One point here--if allocating a 32K
>    buffer to read a text file quickly means swapping out 32K of code
>    from somewhere, this might be true.  A procedure which counts the
>    number of lines in a text file may well find that using a huge
>    buffer is overkill.

I have to admit -- I never swap out code.  I use too many function
pointers, and segment unloading seems like an anachronism from the 128K
Mac days.  Now everybody gets a chance to take shots at me for not
using this great feature of the Mac.

One more point -- 32K is hardly a huge buffer on a megabyte machine.

>5.  A final note (of my own).  Tim says that "if you're reading a line
>    at a time on any machine, it's likely you're taking a performance
>    hit."  Just to make things a little more complicated, I'd just
>    like to say that there are systems which do NOT require any
>    specific character to mark the end of a line--if you say writeln()
>    it writes out your data, whether it contains ^M or ^J or whatever.
>    On these systems, reading data block-by-block and trying to figure
>    out the end of a line is either near-impossible or just plain slow.

Er, good point.  You're right.  It's been so long since I've done any
VMS programming that I forgot about line-structured files.  Of course,
the VMS people at DEC finally got around to implementing byte-stream
files a few years ago, and everyone treated this as a great step
forward....

>6.  Tim says "And writing a loop to turn blocks into lines on your own
>    is so easy that a first-semester programmer could do it."
>    Probably true.  But writing an *efficient* loop probably means
>    using assembly language, at least until some decent optimizing
>    compilers are widely available on the Mac.

First, MPW C 3.0 supposedly has a pretty smart optimizer.  Second, I
don't agree.  Any good compiler can create reasonably good code for a
simple loop of this kind.  With an old C compiler, you may have to use
register declarations, but there's no reason a compiler can't produce
code as good as assembler for a "for" loop.  (I refuse to use register
declarations in 1989; the techniques of register optimization have been
well understood for more than a dozen years now, and a compiler that
doesn't use them is brain damaged.  I'm only using LSC now because my
client preferred it.)
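
For reference, the kind of loop under discussion might look like the
sketch below: one pass over a block, splitting it into lines in
place.  The function name and interface are hypothetical; the line
terminator is a parameter ('\r' for Mac text files, '\n' on UNIX).

```c
#include <assert.h>
#include <stddef.h>

/* Split the block buf[0..len) into lines in place.  Each `eol` byte
 * is overwritten with '\0' and a pointer to the start of each complete
 * line is stored in lines[].  Returns the number of complete lines
 * found (at most max_lines).  *used is set to the number of bytes
 * consumed, so buf + *used is the start of any partial line that
 * must be carried over to the next block. */
size_t split_block(char *buf, size_t len, char eol,
                   char **lines, size_t max_lines, size_t *used)
{
    size_t i, n = 0, start = 0;

    assert(buf != NULL && lines != NULL && used != NULL);
    for (i = 0; i < len && n < max_lines; i++) {
        if (buf[i] == eol) {
            buf[i] = '\0';          /* terminate this line */
            lines[n++] = buf + start;
            start = i + 1;          /* next line begins after eol */
        }
    }
    *used = start;
    return n;
}
```

The inner loop is a single compare and a conditional store per byte,
which is the sort of thing a reasonable compiler should handle without
help.  A caller would move the len - *used leftover bytes to the front
of the buffer before reading the next block.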
-- 
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"Next prefers its X and T capitalized.  We'd prefer our name in lights in
 Vegas."  -- Louis Trager, San Francisco Examiner