Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 64 bits--why stop there? (really, bitfields, LONG)
Message-ID: <41233@mips.mips.COM>
Date: 31 Aug 90 08:19:31 GMT
References: <6106@vanuata.cs.glasgow.ac.uk> <2437@crdos1.crd.ge.COM> <631@array.UUCP> <225@csinc.UUCP> <1372@svin02.info.win.tue.nl> <41188@mips.mips.COM> <CHUCK.PHILLIPS.90Aug30120102@halley.FtCollins.NCR.COM>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 154

In article <CHUCK.PHILLIPS.90Aug30120102@halley.FtCollins.NCR.COM> Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:
>Yes, it would break a _few_ programs.  (e.g. incrementing a pointer cast to
>an int or otherwise directly twiddling pointer bits) Adding a new type C
>type called "bit" (or boolean or logical etc.) should be orthagonal to
>existing correctly written C code, except for the added keyword.  (void *)s
>and (char *)s _can maintain the old semantics_ and old correct code need
>only be recompiled after a character substition, if necessary.  I think
>this would fall under your category "b".

>If implemented in hardware, bit addressing could provide faster bit access
>(bit read vs. byte read + mask + shift), and (at last!) a syntatically and
>semantically consistant way to access multi-word bit arrays.

Let me try the exercise again:  start with an architecture, modify it for
bit-addressing, add instructions, see how it fits with the languauges,
and see what happens.  The above has at least some parts of such a proposal,
so let's try it and look at the other pieces.
(THis is long, but has a short moral at the end).

The comments seem to propose addition of 1 or more instructions
that load a bit-field into a register.  Let's look at that, and consider
adding onto any of the architectures similar to H&P's DLX (but if necesary,
I'll use an R3000 a a concrete example).
I'll assume that one really wants the following 2 instructions:
	LFS:	load-bitfield-signed	register,address,fieldsize
	LFU:	load-bitfield-unsigned	register,address,fieldsize
and that within a single word, they are the logical equivalents of:
	load-word	register,address
	shift-left	register,register,32-offset
	shift-right-X	register,register,32-fieldsize
			X = arithmetic (LFS) or logical (LFU)
OR, on machines that have one:
	load-word	register,address
	extract-field	register,register, offset, fieldsize

Of course, this is ignoring any setup required if everything is dynamic,
i.e., it assumes that a compiler is extracting a bitfield from something
whose byte address is known.  If the bitfield is 1-32 bits, and crosses
a word boundary, then longer sequences are needed, including two
loads, and then some shifting & masking.  (on an R3000, one might use
LWLeft, LWRight to get the right 4-bytes back together).
(working thru all these cases is left as an exercise :-)

Now, consider some low-level consequences:

1) The format of LFS/LFU doesn't fit into the typical 32-bit RISC
instruction, because it has an extra 5-bit operand ("fieldsize").
Where do you encode it?
	a) If you have base+index addressing, use the index register
	as the fieldsize.
	b) Steal the high-order 5 bits of the displacement.
	c) Steal the target register field, so that you always load the
	result into an implicitly-chosen register.
Now, compiler writers will gag on c) (which ends up requiring you to move
the result to the desired register some of the time anyway).
MIPS would have to use b), some others might use c).
The compiler writers must deal with the fact that these instructions
have different displacement-limits (b) or are non-indexing, unlike
other load instructions.
Either b) or c) mean that you've almost certainly created a new format:
	More hardware to decode the instruction (maybe, maybe not)
	More hardware to execute the instruction (almost certainly),
		and this might (or might not) be on the critical path.
It is unlikely that either these is a big deal, but you never know.
(Given existing isntruction formats, I'd guess HP PA might accomdate
the new instruction format easiest.)

2) Adding a bitfield extraction and arbitrary shift amount to the
load instruction may well increase the critical path (since load is usually
a critical path item).
	Most existing machines fetch a word, then run it through a network
	that: extracts the relevant byte, halfword, (and in MIPS case,
	tribytes), sign-extends or zerofills to 32-bits, and then
	loads it int othe register.  (Except for MIPS LWL/LWR, which
	are more complicated.)
Either:
	a) You can add the complete variable bit-field extract into
	the load-path, without time penalty (bu unlikely to be without
	chip space penalty),
Or
	b) You can't.

What do you do with case b)?
	b1) Lengthen the cycle time
	b2) Add an additional cycle of load-latency to whatever was there,
	for   b2a) these instructions
	      b2b) all load instructions
None of these are attractive, but it is hard to tell which is least bad.
b2b is probably awful.  b2a may cause additional irregularities in the
pipeline design.  As usual, serious simulation is needed to figure out
which might be best.   Note that if you add a cycle of load-latency,
you've lost back a cycle of your improvement whenever you can't fill
the load delay slots, that is:
	load
	(stall)	assuming 1-cycle delay
	extract
	use bitfield

consumes 4 cycles, as does:
	LFU
	(stall, stall)

3) Alignment.  If you let LFU/LFS cross word boundaries in memory
(which is needed to achieve the express goal of convenient access to
arbitrary bitfields), then you for sure have instructions that cross
boundaries, i.e., you're back with the unaligned-data problem that
most of the RISC machines don't want.  Solutions:
	3a) Bite the bullet, and let everything be unaligned,
	with know issues of complexity and possible critical path
	problems & more complex MMUs and maybe exception-handling.
	3b) use the IBM RS/6000 approach, and trap if it crosses
	a cache-line boundary, only.
	3c) Add 4 more instructions, a la R3000:
		LFULeft, LFURight, LFSLeft, LFSRight
		that put the pieces together via 2 separate fetches.
		Assume this does not run you out of opcodes....

4) Stores: we've done the easy part, but it seems like anything that
worked this way at all, would want bitfield stores.
Here, it is easier: there's only 1 store (SF), or maybe 3
(SF, plus SFLeft, SFRight) if 3c above is used.
On the other hand.....
	Perhaps some of hardware folks out there would like to
	show typical implementations of:
	a) Byte-oriented memory with parity
	b) Byte-oriented memory with ECC
	versus the bitfield-oriented versions of these things,
	i.e., ones that permit the processor to issue bitfield
	stores.
Extra signal pins?
Extra places where write turns into read-modify-write?
(Many possible configurations: best bet is to restrict these instructions
to cached memory, use on-chip write-back caches, and never let anything
else see it; that way, you only pay the read-modify-write price into
the on-chip data cache (assuming byte parity on the cache words).
Of course, if you restrict these to cached memory, you'd better be
careful what sort of code your compiler generates to handle bitfields
in structures .... that might be i/o device descriptions, and thus
not cached...

MORAL OF THIS TRIVIAL EXAMPLE:

What sounds simple (have some instructions to access bitfields)
can have all sorts of ramifications, and the only way to figure them out
is to track down ALL the details.  Simple things can surprise you in
cycle time or latency hits.  

COMMENTS?  (esp. maybe some HP PA person would talk about critical-path
timing for load+extract, for example)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086