Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!zaphod.mps.ohio-state.edu!mips!winchester!mash From: mash@mips.COM (John Mashey) Newsgroups: comp.arch Subject: Re: 64 bits--why stop there? (really, bitfields, LONG) Message-ID: <41233@mips.mips.COM> Date: 31 Aug 90 08:19:31 GMT References: <6106@vanuata.cs.glasgow.ac.uk> <2437@crdos1.crd.ge.COM> <631@array.UUCP> <225@csinc.UUCP> <1372@svin02.info.win.tue.nl> <41188@mips.mips.COM> Sender: news@mips.COM Reply-To: mash@mips.COM (John Mashey) Organization: MIPS Computer Systems, Inc. Lines: 154 In article Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes: >Yes, it would break a _few_ programs. (e.g. incrementing a pointer cast to >an int or otherwise directly twiddling pointer bits) Adding a new type C >type called "bit" (or boolean or logical etc.) should be orthagonal to >existing correctly written C code, except for the added keyword. (void *)s >and (char *)s _can maintain the old semantics_ and old correct code need >only be recompiled after a character substition, if necessary. I think >this would fall under your category "b". >If implemented in hardware, bit addressing could provide faster bit access >(bit read vs. byte read + mask + shift), and (at last!) a syntatically and >semantically consistant way to access multi-word bit arrays. Let me try the exercise again: start with an architecture, modify it for bit-addressing, add instructions, see how it fits with the languauges, and see what happens. The above has at least some parts of such a proposal, so let's try it and look at the other pieces. (THis is long, but has a short moral at the end). The comments seem to propose addition of 1 or more instructions that load a bit-field into a register. Let's look at that, and consider adding onto any of the architectures similar to H&P's DLX (but if necesary, I'll use an R3000 a a concrete example). I'll assume that one really wants the following 2 instructions: LFS: load-bitfield-signed register,address,fieldsize LFU: load-bitfield-unsigned register,address,fieldsize and that within a single word, they are the logical equivalents of: load-word register,address shift-left register,register,32-offset shift-right-X register,register,32-fieldsize X = arithmetic (LFS) or logical (LFU) OR, on machines that have one: load-word register,address extract-field register,register, offset, fieldsize Of course, this is ignoring any setup required if everything is dynamic, i.e., it assumes that a compiler is extracting a bitfield from something whose byte address is known. If the bitfield is 1-32 bits, and crosses a word boundary, then longer sequences are needed, including two loads, and then some shifting & masking. (on an R3000, one might use LWLeft, LWRight to get the right 4-bytes back together). (working thru all these cases is left as an exercise :-) Now, consider some low-level consequences: 1) The format of LFS/LFU doesn't fit into the typical 32-bit RISC instruction, because it has an extra 5-bit operand ("fieldsize"). Where do you encode it? a) If you have base+index addressing, use the index register as the fieldsize. b) Steal the high-order 5 bits of the displacement. c) Steal the target register field, so that you always load the result into an implicitly-chosen register. Now, compiler writers will gag on c) (which ends up requiring you to move the result to the desired register some of the time anyway). MIPS would have to use b), some others might use c). The compiler writers must deal with the fact that these instructions have different displacement-limits (b) or are non-indexing, unlike other load instructions. Either b) or c) mean that you've almost certainly created a new format: More hardware to decode the instruction (maybe, maybe not) More hardware to execute the instruction (almost certainly), and this might (or might not) be on the critical path. It is unlikely that either these is a big deal, but you never know. (Given existing isntruction formats, I'd guess HP PA might accomdate the new instruction format easiest.) 2) Adding a bitfield extraction and arbitrary shift amount to the load instruction may well increase the critical path (since load is usually a critical path item). Most existing machines fetch a word, then run it through a network that: extracts the relevant byte, halfword, (and in MIPS case, tribytes), sign-extends or zerofills to 32-bits, and then loads it int othe register. (Except for MIPS LWL/LWR, which are more complicated.) Either: a) You can add the complete variable bit-field extract into the load-path, without time penalty (bu unlikely to be without chip space penalty), Or b) You can't. What do you do with case b)? b1) Lengthen the cycle time b2) Add an additional cycle of load-latency to whatever was there, for b2a) these instructions b2b) all load instructions None of these are attractive, but it is hard to tell which is least bad. b2b is probably awful. b2a may cause additional irregularities in the pipeline design. As usual, serious simulation is needed to figure out which might be best. Note that if you add a cycle of load-latency, you've lost back a cycle of your improvement whenever you can't fill the load delay slots, that is: load (stall) assuming 1-cycle delay extract use bitfield consumes 4 cycles, as does: LFU (stall, stall) 3) Alignment. If you let LFU/LFS cross word boundaries in memory (which is needed to achieve the express goal of convenient access to arbitrary bitfields), then you for sure have instructions that cross boundaries, i.e., you're back with the unaligned-data problem that most of the RISC machines don't want. Solutions: 3a) Bite the bullet, and let everything be unaligned, with know issues of complexity and possible critical path problems & more complex MMUs and maybe exception-handling. 3b) use the IBM RS/6000 approach, and trap if it crosses a cache-line boundary, only. 3c) Add 4 more instructions, a la R3000: LFULeft, LFURight, LFSLeft, LFSRight that put the pieces together via 2 separate fetches. Assume this does not run you out of opcodes.... 4) Stores: we've done the easy part, but it seems like anything that worked this way at all, would want bitfield stores. Here, it is easier: there's only 1 store (SF), or maybe 3 (SF, plus SFLeft, SFRight) if 3c above is used. On the other hand..... Perhaps some of hardware folks out there would like to show typical implementations of: a) Byte-oriented memory with parity b) Byte-oriented memory with ECC versus the bitfield-oriented versions of these things, i.e., ones that permit the processor to issue bitfield stores. Extra signal pins? Extra places where write turns into read-modify-write? (Many possible configurations: best bet is to restrict these instructions to cached memory, use on-chip write-back caches, and never let anything else see it; that way, you only pay the read-modify-write price into the on-chip data cache (assuming byte parity on the cache words). Of course, if you restrict these to cached memory, you'd better be careful what sort of code your compiler generates to handle bitfields in structures .... that might be i/o device descriptions, and thus not cached... MORAL OF THIS TRIVIAL EXAMPLE: What sounds simple (have some instructions to access bitfields) can have all sorts of ramifications, and the only way to figure them out is to track down ALL the details. Simple things can surprise you in cycle time or latency hits. COMMENTS? (esp. maybe some HP PA person would talk about critical-path timing for load+extract, for example) -- -john mashey DISCLAIMER: UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086