Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!mcsun!ukc!acorn!armltd!abaum
From: abaum (Allen Baum)
Newsgroups: comp.arch
Subject: Re: endian etc
Message-ID: <178@armltd.uucp>
Date: 20 May 91 11:01:23 GMT
References: <199@titccy.cc.titech.ac.jp>
Sender: abaum@armltd.uucp
Distribution: comp
Organization: A.R.M. Ltd, Swaffham Bulbeck, Cambs, UK
Lines: 62

In article <199@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:

>In article <173@armltd.uucp> abaum (Allen Baum) writes:
>
>>>>As a sweeping generalization, the path from cache to registers/forwarding
>>>>path is THE critical path
>
>First, I have noticed that R3000 has LWL and LWR instructions, which means
>a MUX for arbitrary flipping of bytes already exists on your "THE critical
>path".

I agree. They do. They can't be avoided if you have load_byte instructions.
That was my point. On the other hand you shouldn't do anything to make them
worse either.

>>>No. THE critical path is on register/ALU loop, which determines the maximum
>>>clock speed (see Jouppi).
>>So, another way of saying what I meant to say is: if both ALU and Cache access
>>are going to fit into a cycle, the cache access will be the limiting case.

>I think cache access is much slower than ALU operation.

That is my point!!!!!!!

>>Perhaps not true with a trivial cache, but true for sizes that are useful.

>Are you sure?
>Don't you know that access time increases with the size increase of cache?

Of course I know! Again, that was my point!!!

>It should also be noted that for a large cache to be useful (fast context
>switch etc.), it should be tagged with physical address. So, you must also
>do TLB lookup.
>For example, on R4000, an ALU operation takes only 1 internal cycle
>(10ns) but cache access takes three internal cycles including tag check.

Which proves my point (and yours, I guess).

>>Again, assuming that you intend everything to run in a single cycle, anything
>>that you put into that path will therefore lengthen your minimum cycle time.
>>Which was the point I was trying to make about sign extension.

>With today's technology, it is a bad idea to access cache in a single cycle.
                                  ^^^ ^^^^    ^^^^^^ ^^^^^      ^^^^^^ ^^^^^
Ah, now this is a very interesting position, and not one that I can dismiss out
of hand, either. All my arguments were based on single-cycle cache access
(not including address formation, so most of today's RISCs have cache
accesses of 2 internal cycles, by your definition).

In order for this approach (taken by the R4000) to be effective,
you have to be able to schedule loads sufficiently far in advance so they
don't stall. With an N-cycle cache access, you need to schedule each load
N-1 cycles ahead of its first use. The larger N gets, the harder it is to
find that many independent instructions, so there is a law of diminishing
returns there. Furthermore, you can't block on a cache access in progress,
so your cache has to be pipelined.
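
To put the scheduling argument in concrete terms, here is a minimal sketch
in Python. It assumes a simple in-order machine where a load followed too
closely by its first use just waits. The function name and the sample
gaps are made up for illustration; they are not measurements of any real
machine.

```python
def load_stall_cycles(gaps, cache_cycles):
    """Total stall cycles caused by loads on a hypothetical in-order pipeline.

    gaps: for each load, the number of independent instructions the
          scheduler managed to place between the load and its first use.
    cache_cycles: N, the number of cycles a cache access takes.

    An N-cycle access needs N-1 instructions of separation, so each load
    stalls for whatever part of that separation is missing.
    """
    needed = cache_cycles - 1
    return sum(max(0, needed - gap) for gap in gaps)


# Illustrative gaps for five loads (assumed, not measured).
gaps = [0, 1, 2, 0, 3]
for n in (1, 2, 3):
    print(n, load_stall_cycles(gaps, n))
```

With these assumed gaps a single-cycle access never stalls, a 2-cycle
access stalls 2 cycles, and a 3-cycle access stalls 5: each extra cycle of
latency costs more than the one before unless the scheduler can find more
independent work.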

None of this is impossible; witness the R4000: they did it. They feel that the
gain in cycle time outweighs the extra stalls, and the extra design complexity.
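
The trade they are making can be sketched as below. This is only a
back-of-envelope model assuming one instruction per cycle plus stalls;
the function names and all the numbers are hypothetical, chosen to show
the shape of the trade-off rather than to describe the R4000.

```python
def run_time_ns(instructions, stall_cycles, cycle_ns):
    """Run time assuming one instruction issues per cycle, plus stalls."""
    return (instructions + stall_cycles) * cycle_ns


def breakeven_stalls_per_instruction(slow_cycle_ns, fast_cycle_ns):
    """Stall cycles per instruction the faster clock can absorb before it
    is no quicker than the slower design running with no stalls."""
    return slow_cycle_ns / fast_cycle_ns - 1.0


# Hypothetical comparison: a 20 ns cycle with single-cycle cache access
# versus a 10 ns cycle with a multi-cycle, pipelined cache that stalls.
single_cycle = run_time_ns(1000, 0, 20)
multi_cycle = run_time_ns(1000, 300, 10)
print(single_cycle, multi_cycle)
print(breakeven_stalls_per_instruction(20, 10))
```

Under these assumed figures the faster clock wins as long as it adds
fewer than one stall cycle per instruction. The model ignores the design
complexity cost entirely.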

So it's probably time for a new thread - costs of multi-cycle cache access in
design complexity and number of stalls vs. the benefits. Any takers?