Path: utzoo!mnetor!motto!ecijmm!eci386!clewis
From: clewis@eci386.UUCP
Newsgroups: comp.unix.i386
Subject: Re: ESDI controller recommendations
Message-ID: <1989Aug29.230048.19130@eci386.uucp>
Date: 29 Aug 89 23:00:48 GMT
References: <121@mdi386.UUCP> <1474@wb3ffv.ampr.org> <4843@looking.on.ca>
Reply-To: clewis@eci386.UUCP (Chris Lewis)
Organization: R. H. Lathwell Associates: Elegant Communications, Inc.
Lines: 110

In article <4843@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>While cylinder or track caching is an eminently sensible idea that I
>have been waiting to see for a long time, what is the point in the
>controller or drive storing more than that?
>
>Surely it makes more sense for the OS to do all other cache duties.
>Why put the 512K in your drive when you can put it in your system and
>bump your cache there? Other than the CPU overhead of maintaining the
>cache within the OS, I mean. I would assume the benefit from having
>the cache maintained by software that knows a bit about what's going
>on would outweigh this.

I've had quite a bit of exposure to the DPT caching disk controllers so
I'll outline some of the interesting points. Some of these pertain to
DPT controllers generally, some only to the models I was playing with
(ESDI and ST506 disk interface versions with SCSI host interface), and
some to caching controllers in general.

1) Write-after caching: Most systems do their swapping and/or paging
raw. Thus they must *wait* for a write operation to complete before
reusing the memory. Eg: avg 28 ms with an ST506 drive. With write-after,
you can reuse memory in .5 ms no matter how slow your drive is (unless
the cache really fills). I installed one of these suckers on a Tower 600
with 4Mb running Oracle. We were able to immediately double the number
of users using Oracle, from 4 to 8 simultaneous actions, with
considerably better response for all 8. (Oracle 4.1.4 is a pig! So was
the host adapter at the time - 3-6 ms to transfer 512 bytes!)
A look at the controller statistics showed that the system was swapping
like mad, but virtually *no* physical disk I/Os actually occurred. Eg:
blocks were being read back so fast that the controller never needed to
write them out. Of course, the same can be done by adding physical
memory to the system; however, DPT memory is cheaper than Tower
memory...

2) Host memory limitations - how does 16Mb of main memory almost
exclusively for use by programs and 12Mb of buffer cache strike you?
(Think AT-style system limitations.) Otherwise there are lots of tricky
trade-offs. On the other hand, when faced with lots of physical memory
on the host, it makes far more sense to use it for program memory than
for a RAM swap disk.

3) If your kernel panics, the controller gets a chance to flush its
buffers - handy particularly if you make the kernel buffers small. It
was sort of scary to see, for the first time, a Tower 400 woof its
cookies (so I'm not a perfect device driver writer ;-) and see the disk
stay active for another 30 seconds...

4) If you have a power failure, having the cache on the controller is a
bad idea, because the kernel does make some assumptions about the order
in which I/O occurs. With the models I was using it made economic sense
to place a UPS only on the controller and disk subsystem. I don't know
whether this is possible on the AT versions, but there it's cheaper to
get a whole-system UPS anyway.

5) DPT read-ahead can be cancelled by subsequent read requests.

6) The DPT's algorithms (eg: replacement policy, lock regions,
write-after delay times, dirty buffer high-water, cache allocation
amongst multiple drives etc.) can be tuned. Most kernels cannot be
tuned much.

7) Now we get into the hazy stuff - I'm convinced from the testing that
I did with the DPT lashups I built, plus experience inside other
kernels, that the DPT has far better caching than most UNIX kernels.
Generally speaking, except for look-ahead (which the DPT supports as
well), kernels use no special knowledge of the disk *other* than the
inherent efficiency of file system layout (eg: Fast File System
structures) and free list sorting (dump/mkfs/restore anyone?). For
example, except for free-list sorting and other mkfs-style tuning,
fio.c and bio.c (the file I/O and block I/O portions of the kernel)
don't know diddly squat about the real disk. The DPT, on the other hand,
knows it intimately - sectors per track, rotational latency etc. The
DPT uses the elevator algorithm and apparently a better LRU (page
replacement) algorithm, has sector and cylinder skewing and so on.

Unfortunately, I no longer have a copy of the report. Further, most of
the measurements I was making were reasonably representative technical
measures of performance, but they don't give an overall feel for
performance. However, one that I remember may be of interest - kernel
relinks on the Tower usually took close to 3 minutes. With the DPT,
they went to a little over 2 minutes. Big hairy deal... However, further
examination of "time" results showed that the I/O component
*completely* disappeared. Like wow. Some other simple benchmarks showed
overall performance increases of up to a factor of 15!

The only way to make the DPT system work better would be to make some
major changes to fio.c/bio.c and a couple of minor mods to the DPT. For
example, multiple lower-priority look-ahead threads based upon file
block ordering. Explicitly cancellable I/Os or look-aheads. More, but I
forget now.

The DPT also has some other niceties: automatic bad-block sparing,
single command format/bad blocking, statistics retrieval, and in my
case, compatibility with dumb SCSI controllers except for the
additional features - the NCR Tower SCSI driver has this neat "issue
this chunk of memory as a SCSI command and give me the result" ioctl.

Neat stuff, the DPT.

[No, I don't work for, nor have I ever worked for DPT. Hi Tom!]
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425
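
A minimal sketch of the write-after effect from point 1, in Python. It
is a toy model, not DPT's firmware: the class and function names, the
cache size, the working-set size and the plain LRU eviction are all
assumptions made for illustration, and the write-after delay timer that
a real controller would have is left out. Only the 28 ms and .5 ms
figures come from the post. It shows why swap traffic that fits in the
cache can generate host writes without any physical disk I/O.

```python
from collections import OrderedDict

RAW_WRITE_MS = 28.0     # average ST506 write wait quoted in point 1
CACHED_WRITE_MS = 0.5   # write-after acknowledgement time quoted in point 1


class WriteAfterCache:
    """Hypothetical write-after block cache in front of a slow disk."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks   # assumed to be at least 1
        self.cache = OrderedDict()        # blkno -> (data, dirty), LRU order
        self.disk = {}                    # stands in for the physical drive
        self.physical_reads = 0
        self.physical_writes = 0

    def write(self, blkno, data):
        # Acknowledge at once; the physical write is deferred.
        self.cache[blkno] = (data, True)
        self.cache.move_to_end(blkno)
        self._evict()

    def read(self, blkno):
        if blkno in self.cache:
            self.cache.move_to_end(blkno)
            return self.cache[blkno][0]
        # Miss: go to the physical drive and keep a clean copy.
        self.physical_reads += 1
        data = self.disk.get(blkno)
        self.cache[blkno] = (data, False)
        self._evict()
        return data

    def flush(self):
        # Write out every dirty block, e.g. after a kernel panic (point 3).
        for blkno, (data, dirty) in list(self.cache.items()):
            if dirty:
                self._physical_write(blkno, data)
                self.cache[blkno] = (data, False)

    def _evict(self):
        # Drop least recently used blocks, writing dirty ones first.
        while len(self.cache) > self.capacity:
            blkno, (data, dirty) = self.cache.popitem(last=False)
            if dirty:
                self._physical_write(blkno, data)

    def _physical_write(self, blkno, data):
        self.physical_writes += 1
        self.disk[blkno] = data


def simulate_swapping(rounds=100, working_set=8, cache_blocks=16):
    """Swap the same few blocks out and back in, over and over."""
    cache = WriteAfterCache(cache_blocks)
    host_writes = 0
    for r in range(rounds):
        for blkno in range(working_set):
            cache.write(blkno, ("page", r, blkno))   # swap out
            host_writes += 1
        for blkno in range(working_set):
            cache.read(blkno)                        # swap back in
    return cache, host_writes


if __name__ == "__main__":
    cache, host_writes = simulate_swapping()
    print("host writes:", host_writes)
    print("physical writes before flush:", cache.physical_writes)
    print("host wait, raw (ms):", host_writes * RAW_WRITE_MS)
    print("host wait, write-after (ms):", host_writes * CACHED_WRITE_MS)
```

With these made-up parameters the working set never leaves the cache,
so each block is overwritten in place and the physical write count
stays at zero until flush() is called. Once the working set outgrows
the cache, dirty blocks get evicted and the physical writes come back,
which is the "unless the cache really fills" caveat in point 1.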