Newsgroups: comp.sys.cbm
Path: utzoo!utgpu!jarvis.csri.toronto.edu!godzilla.eecg.toronto.edu!leblanc
From: leblanc@eecg.toronto.edu (Marcel LeBlanc)
Subject: Re: SEQ file access speedup
Message-ID: <89Feb14.171816est.2394@godzilla.eecg.toronto.edu>
Summary: Interleave is important for standard routines on 1541/4040/8050
Keywords: Interleave, fast SEQ access
Organization: EECG, University of Toronto
References: <89Feb10.182100est.2732@godzilla.eecg.toronto.edu> <7124@killer.DALLAS.TX.US> <7143@killer.DALLAS.TX.US>
Date: Tue, 14 Feb 89 17:18:03 EST

In article <7143@killer.DALLAS.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>This is the results of benchmarking
>
>a) loading, and
>b) doing GETIN until EOF, from ML, doing nothing inbetween.
> ...
>My basic thought was that sequential file access can take place just
>as fast as LOAD'ing. The benchmark confirms that for IEEE drives and
>the standard 1541. There's a couple of constraints here. First of all,
 ...
>clrchn/chkin/clrchn/chkout for each individual byte -- a 4-to-1
>overhead). When you do that, SEQ access isn't slow at all.. just look
>at these timings:
>
>C-64, IEEE Flash, SFD-1001, 'load"bbs",8' :   18 seconds
> """"""""""""""""""""""""  ML loop, seq read: 18 seconds
					      ^^^^
>C-64, C-LINK II, SFD-1001    LOAD: 14 seconds
>                             READ: 14 seconds
				   ^^^^
	Yes, times are identical.  Please read on.

>in 64 mode, with 1571: load -- 60 seconds
>                       read -- 62 seconds

	Almost identical.  This supports what I said in my original posting
on this subject.  Here's an excerpt:

	... as much of a speed increase as is possible on LOADs.  This has
	less to do with the transfer protocol, than with the LOW PERFORMANCE
	limitations of the C64 kernal.  To remain compatible with ...

As you pointed out in an earlier posting, the standard C64 load routine does
nothing but repeatedly call ACPTR!  This is a LOW PERFORMANCE limitation
when you have a decent transfer protocol, but it's of no importance when you
have to use the standard serial protocol of the C64!  But, you might say,
the SFD-1001 (IEEE interface) isn't low performance, is it?  It is
reasonably fast, but since the IEEE interfaces just speed up ACPTR/CIOUT
without changing the LOAD routine, the ML loop that you have written should
give the same results as LOAD (since it's basically the same loop), and it
does.  This DOESN'T mean that SEQ read is as fast as block transfers (LOAD);
it just means that you have to optimize ("speed up") the block transfer
software as well as the transfer protocol.  This is demonstrated even better
by the following numbers:

>128 Ramdos: READ: 9 seconds
>64 Ramdos: READ: 8 seconds

	AND, as you stated before, LOAD is virtually instantaneous!

>128 -- 1571 -- load -- 8 seconds
>               read -- 26 seconds
>64 mode, with Epyx fastload cart. --
>                       LOAD 26 seconds
>         with Mike J. Henry's "fastboot v2": 26 seconds

Doesn't it seem unusual that a C128 fast serial READ, Epyx FastLoad, and Mike
Henry's fastboot all take the same amount of time (26 secs)?  [This really
isn't intended to sound like a flame.]  Here's what you wrote earlier in the
article:
>All tests were done with a 98 block file consisting of the main
>body of a BBS program. It was just the handiest program that I had
>available on both SFD and 1541 formats. I put it onto a blank 1541
>disk, to prevent fragmentation. It was already first on the SFD disk

From the numbers listed above, I would guess that you copied from the
SFD-1001 to a _1571_.  Unless you set the interleave yourself (using
"U0>S"+chr$(interleave#)), the 1571 saves using a 6 sector interleave, even
when it's in 1541 mode.  The C128 burst mode can easily keep up with a 6
sector interleave, but Mike Henry's fastboot needs at least 8 sectors to
decode and transfer, and Epyx FastLoad needs at least 10 (the 1541
standard).  On a fresh disk, the program would be saved near the directory
track.  On this part of the disk, I think the sectors/track is 18.  Since
FastLoad and fastboot V2 (and the C128's byte-at-a-time fast serial READ)
can't keep up with an interleave of 6 sectors, they miss the next sector and
are forced to wait an extra full revolution for it, or 18+6 = 24 sectors
per block!  The interleave forces a slowdown of 24/6 = 4 times!  Of course,
in a 98 block file, not all sectors can be stored exactly 6 apart, so this
is just a GOOD approximation.  It is reasonably close to the 26 sec / 8 sec
ratio (3.25) given by the above numbers.
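
The arithmetic above can be sketched in a few lines of Python.  This is only
a rough model, not anything the drives or cartridges actually run: the
function names are illustrative, the 18 sectors/track figure is the estimate
used above, and it assumes that a loader which misses its sector always
waits whole extra revolutions.

```python
import math


def sectors_per_block(sectors_per_track, interleave, needed):
    """Sector times that pass under the head for each block read.

    interleave: spacing (in sectors) between consecutive blocks of the file.
    needed:     sector times the loader needs to decode and transfer one
                block before it is ready for the next (8 and 10 are the
                figures quoted in the post for fastboot and FastLoad).
    """
    if needed <= interleave:
        # The loader is ready in time: the next block arrives on schedule.
        return interleave
    # The next block has already gone by; every miss costs a full revolution.
    missed_revs = math.ceil((needed - interleave) / sectors_per_track)
    return interleave + missed_revs * sectors_per_track


def slowdown(sectors_per_track, interleave, needed):
    """Slowdown relative to a loader that keeps up with the interleave."""
    return sectors_per_block(sectors_per_track, interleave, needed) / interleave


if __name__ == "__main__":
    for name, needed in (("burst mode", 6), ("fastboot", 8), ("FastLoad", 10)):
        print(name, slowdown(18, 6, needed))
```

With 18 sectors/track and a 6 sector interleave, any loader that needs more
than 6 sector times lands on the same 24 sectors per block, which is why the
products that can't keep up all end up with the same time.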

Weren't we talking about SEQ file speed up? :-) What I'm getting at is that
1541/71 and SFD-1001 aren't good drives to use when studying byte-at-a-time
transfer overhead.  That's because the DOS in those drives only buffers a
sector at a time, which forces it to use an interleave scheme.  It's
possible to get around this (and Super Snapshot V4 does, for LOAD only), but I
haven't seen an implementation yet that attempts to do this for SEQ file
accesses.  And since the transfer times involved in this performance range
(about 4-5 secs for 100 blocks, far beyond standard CBM IEEE) are less than the
overhead for byte-at-a-time transfers, you wouldn't be able to get close to
LOAD speedup for SEQ accesses.  Here's my speedup summary:
(by "State of the Art" I mean 1541 interleave INDEPENDENT serial fast
loaders and C128 burst mode with optimal interleave, NOT IEEE.)

                                std blocks      non-std blocks
A. "State of the Art"  LOAD     12-15x          20-25x
B. "State of the Art"  READ     6-7x (guess)    8-9x (guess)
   (not yet attempted)
C. Classical fast I/O  LOAD     5-6x            n.a.
   e.g. Epyx FastLoad
   (interleave = 10)
D. Classical fast I/O  READ     3-4x (guess)    n.a.
E. Standard     LOAD & READ     1x              n.a.
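
To put the speedup factors into seconds, here is a small Python sketch.  It
assumes the 60 second standard LOAD of the 98 block file (64 mode, 1571,
from the quoted timings) is the 1x baseline; the names are illustrative and
only the rows that aren't guesses are included.

```python
BASELINE_SECS = 60.0  # assumption: quoted 64 mode LOAD time for the 98 block file

# (low, high) speedup factors for standard blocks, taken from the table.
SPEEDUPS = {
    'A. "State of the Art" LOAD': (12, 15),
    "C. Classical fast I/O LOAD": (5, 6),
    "E. Standard LOAD & READ": (1, 1),
}


def time_range(low, high, baseline=BASELINE_SECS):
    """Return (fastest, slowest) estimated seconds for a speedup range."""
    return (baseline / high, baseline / low)


if __name__ == "__main__":
    for row, (low, high) in SPEEDUPS.items():
        fastest, slowest = time_range(low, high)
        print(f"{row}: {fastest:.0f}-{slowest:.0f} secs")
```

Row A works out to 4-5 secs, which agrees with the "about 4-5 secs for 100
blocks" figure mentioned above.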

The IEEE interfaces that various people have discussed so far probably fit
in with "C".  This is only because they use the standard load routine
with a faster ACPTR (to get any further speedup they would have to SAVE with
a tighter interleave or execute custom LOAD routines within the IEEE device,
but I doubt that any of the IEEE owners on the net would want to have
anything to do with this :-) ).

A good way to see byte-at-a-time overhead is to use RAMDOS or a 1581.  The
1581 buffers half a logical track at a time (one physical track, i.e. one
side of the disk, not a full logical track).

>Unfortunately I couldn't see if the Super Snapshot was faster than the
>Epyx or fastboot product. My brother sold ours because it was ...
>... "it wasn't any faster than the fastload cartridge" (his words,

SS V1 and SS V2 were "classical" fast loader implementations, so the speed
was only marginally faster than Epyx FastLoad (5.5x vs. 5x).  The actual
transfer routines were significantly faster, but the 10 sector interleave of
the 1541 limited all these products to the same speed range.  The marginal
speedup came from significantly faster head stepping routines.  You could
SAVE at a different interleave to get some extra speedup, but it wasn't
usually worth the trouble.

SS V3 and SS V4 use a MUCH faster interleave independent technique.  The
speedup over Epyx FastLoad and similar products is very noticeable.

>Some trivia: the main difference between LOAD'ing (burst mode) and
>READ'ing (fastmode) on the 1571 is that fast mode negotiates a
>transaction for each byte, while burst mode negotiates on a per-block
>basis. Burst mode is unique in that manner -- even the IEEE drives
>negotiate on a per-byte basis (probably why they're slower than burst
>mode, despite fairly equivalent hardware). 

I agree, per-block is the only way to get great speed.  You have probably
noticed that the Burst mode examples in the 1571 user's manual avoid using
subroutine calls to get each byte as it arrives.  With transfer rates in the
range used by Burst mode, this could slow you down.  However, it turns out
that there's a fair bit of time to waste at the bit rate that CBM decided to
use for Burst mode.

>Some other trivia: Using ACPTR should be faster than using GETIN, if
>subroutine overhead is as big a problem as some hint. GETIN has to
>do all sorts of testing to see where to dispatch to -- is it keyboard,
>or is it disk? This overhead should be noticible when compared to
>LOAD, which calls ACPTR directly. But for both the IEEE drives and the
>1541, there was no significant difference between LOAD and GETIN
>times, implying that transfer speed, and not internal Kernal overhead,
>was the limitation. 

Again, this just shows how slow the standard ACPTR routine is, and how
important interleave limitations are no matter how fast the transfer
protocol is.  Once you have overcome the limitations of interleave, either
by buffering whole tracks or by doing other nasty manipulations :-), the
real speed of the transfer protocol can shine through.  After all, IEEE
interfaces should be capable of much faster transfers.

For those who don't believe that interleave is as important as I've said,
try the following:  Create a file on a 1541, then compare the time required
to LOAD it using a classical fastloader like Epyx FastLoad with the time
required to SCRATCH the file.  You should get the same results from a C128
with a 1571 (using burst mode vs. SCRATCH).  SCRATCH has to follow the chain
of sectors that are used in the file.  Since the only transfers involved in
SCRATCH are internal to the drive, essentially all of the time goes to
following the 10 sector interleaved chain (6 if you're using a 1571).
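
For a rough idea of what to expect from the SCRATCH test, here is a Python
sketch of the time spent just following the chain.  It rests on assumptions
that aren't stated in this posting: a nominal 300 rpm spindle speed, the 18
sectors/track estimate used earlier, every block sitting exactly one
interleave step after the previous one, and no time for head stepping or
for updating the directory and BAM.  The function name is illustrative.

```python
RPM = 300  # assumption: nominal 1541/1571 spindle speed


def chain_follow_secs(blocks, interleave, sectors_per_track, rpm=RPM):
    """Estimate seconds needed to walk a file's sector chain.

    Each block costs `interleave` sector times of rotation; nothing else
    (stepping, retries, directory writes) is counted.
    """
    rev_secs = 60.0 / rpm                      # one revolution
    sector_secs = rev_secs / sectors_per_track  # one sector passing the head
    return blocks * interleave * sector_secs


if __name__ == "__main__":
    print("1541, interleave 10:", round(chain_follow_secs(98, 10, 18), 1))
    print("1571, interleave  6:", round(chain_follow_secs(98, 6, 18), 1))
```

For the 98 block file this gives roughly 10.9 secs at interleave 10 and 6.5
secs at interleave 6, before any stepping time is added.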

I think this posting was already too long about half way through! :-)

Marcel A. LeBlanc	  | University of Toronto -- Toronto, Canada
leblanc@eecg.toronto.edu  | also: LMS Technologies Ltd, Fredericton, NB, Canada
-------------------------------------------------------------------------------
UUCP:	uunet!utai!eecg!leblanc    BITNET: leblanc@eecg.utoronto (may work)
ARPA:	leblanc%eecg.toronto.edu@relay.cs.net  CDNNET: <...>.toronto.cdn