Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!ucsd!network.ucsd.edu!celit!hutch From: hutch@fps.com (Jim Hutchison) Newsgroups: comp.unix.wizards Subject: Re: Need efficient way to read file in reverse Message-ID: <11217@celit.fps.com> Date: 20 Sep 90 21:10:34 GMT References: <1990Sep18.152818.1303@phri.nyu.edu> Sender: daemon@fps.com Reply-To: hutch@fps.com (Jim Hutchison) Organization: FPS Computing Lines: 72 In <1990Sep18.152818.1303@phri.nyu.edu> roy@alanine.phri.nyu.edu (Roy Smith): > Is there a standard, portable, efficient way to read a file in >reverse? No, yes, and pretty much. There isn't much call for this, atleast visibly, so there isn't enough mass for a standard. You can use constants defined for your system, this makes it relatively portable. You can get efficiency along with portability, really. > I'm doing random seeks in a file and occassionally want to be able >to find the beginning of the line into the middle of which I just seeked. >Right now, I'm using a simple-minded revgetc() I wrote which is basically: > > fseek (fp, -2L, FSEEK_CURRENT); > getc (fp); This will spend a lot of time calling fseek and read. By "a lot" I mean you do (line-length / 4) reads & seeks per line you want to read. Remember that stdio trys to read-ahead, so each time you back-up it has to start over. It could be smarter, but depending on that would be very non-portable. >but which turns out to loose big, since it looks like it does a read(2) for >every call to getc(). Life is made worse by the fact that occasionally I >have very long (over 1kbyte) lines. My application is currently spending >about 98% of its CPU time in revgetc(), according to gprof. > > Obviously, the thing to do is write a buffered revgetc(). Modifying >the standard getc() macro to traverse the buffer in the opposite direction >would be easy, but the result would be non-portable. To make it portable, I >want to write it using the stdio functions, and that's the hitch. The >warning in the fseek/ftell man page about stdio read pointers possibly being >magic cookies makes it hard for me to calculate how far back to seek to read >in the preceeding, say, 1024 bytes. What to do? The ideal block size for reading your file can be gotten by doing a stat() or fstat() on it, and using stat->st_blksize (see stat(2)). If you setbuffer() with a buffer that size, you should have a quasi-optimal block size. If you backup to the previous block boundary. You should be able to getc to the current offset, remembering the last time you found the start of the line, at worst you may have to repeat this step many times. This minimizes the disk I/O and system calls, which are expensive. start_of_string = 0 -- If nowhere, it must start at 0 trim = offset % blksize; -- how far from block boundary do { end = offset; -- current end of search if (trim) offset -= trim; -- trim back to boundary else offset -= blksize; -- next boundary is previous block fseek(fp, offset, FSEEK_ABSOLUTE); -- backup while (offset != end) { -- while position is not to the end c = get(c); -- get char from our buffer if (start_of_string(c)) -- decide if you have found a start start_of_string = offset; -- remember current offset offset++; -- move on to next position } } while (start_of_string == 0 && offset > 0); -- find it until BOF You'd get better performance if you read() the file yourself, and used your own buffer instead of using getc(). That way you could proceed in your direction of choice, backwards. I used getc() here because you seem to want to use stdio. You can also back down from "blksize" if it gets to be to gross in terms of your average line length (I'd guess greater than 4x is probably too hefty). - -- - Jim Hutchison {dcdwest,ucbvax}!ucsd!fps!hutch Disclaimer: I am not an official spokesman for FPS computing