Path: utzoo!attcan!uunet!ogicse!caesar.cs.montana.edu!uakari.primate.wisc.edu!zaphod.mps.ohio-state.edu!usc!snorkelwacker!bloom-beacon!eru!luth!sunic!mcsun!ukc!acorn!rwilson
From: RWilson@acorn.co.uk
Newsgroups: comp.arch
Subject: Official comments on recent ARM postings
Message-ID: <1713@acorn.co.uk>
Date: 22 Feb 90 11:06:12 GMT
Sender: rwilson@acorn.co.uk
Lines: 90

ARM surfaces again on comp.arch - wonders will never cease :-)

]A related observation: the Acorn ARM has pre/post-increment/decrement (all 4
]combinations). You don't have to pay a penalty in cycles for using them.
]The ARM is a RISC.
]Ge' Weijers

You do pay a slight penalty - instruction code density vs address offset.
The extra bits of control for pre/post & inc/dec use up some bits that might
have been used for a bigger address offset. Studies showed....

}This is indeed addressing modes, but as any load can be done in one cycle,
}Torben Mogensen (torbenm@diku.dk)

Actually, due to all sorts of memory pipeline breaks, all loads in ARM are 3
cycles and all stores are 2 cycles. (For the interested: ARM has a pipelined
memory bus. The address in cycle a refers to the data transferred at the
start of cycle b, at which time ARM is already presenting a new address for
cycle b. Since a load has to look at the target, get the value and then get
the next instruction, 3 cycles of the bus are spent on the load. And it was
simpler to make this 3 cycles internally, avoiding the extra ports which
might have been needed for pre/post & inc/dec, since we weren't thinking of
building ARM3 at the time (1984) (a good guess on the part of Randell
Jesup). Stores can be done in 2 cycles since the processor needn't wait
while the data whizzes about. Looking at ARM address traces on a logic
analyser can be very exciting :)).

>correction - (single register) loads on an ARM take three or four cycles,
>depending on the alignment of the next instruction in the pipeline.
>MEMC (the memory controller in the ARM chipset) loads 4 32-bit words at a
>time, so that relatively slow memories may be used, however this means
>that any out of sequence loads require an extra cycle to be inserted in
>order to reload.
>Alasdair McIntyre (aiadrmi@uk.ac.edinburgh.castle)

As stated above, as far as ARM is concerned load is always 3 cycles. The
memory controller transfers data to and from DRAMs using page mode to speed
up many cycles. But the column address is also used for late translation of
the virtual address, which limits the number of page mode cycles in a row to
3, after which we have to do a new row address. Page mode cycles are sequential
"S" cycles, non-page mode are non-sequential "N" cycles. ARM provides an
advance warning signal to the memory controller when it breaks
sequentiality. Clearly LDR breaks, so it takes N, N, S. Clever logic in MEMC
rescues the second N, turning it into an S where possible. STR always does
N, N - coincidentally the same time as LDR in current systems where N=2*S,
but for different reasons.
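
To make the LDR/STR coincidence concrete, here is a small sketch of the bus timing described above. It is only an illustrative model: the function names are made up for this example, and times are expressed in units of one S cycle (so N=2 when N=2*S).

```python
# Illustrative model of the bus cycles described above.
# Function names are hypothetical; times are in units of one S cycle.

def ldr_time(n, s, rescued=True):
    """Bus time for one LDR: N, N, S, or N, S, S when MEMC rescues the second N."""
    second = s if rescued else n
    return n + second + s

def str_time(n):
    """Bus time for one STR: always N, N."""
    return 2 * n

# With N = 2*S, a rescued LDR and an STR happen to take the same bus time.
print(ldr_time(n=2, s=1, rescued=True))   # 4
print(ldr_time(n=2, s=1, rescued=False))  # 5
print(str_time(n=2))                      # 4
```

The equality only holds for N=2*S; with any other ratio the two instructions would differ, which is the "different reasons" point.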

As Alasdair states, the multiple transfer instructions have the same timing
as the single transfer instructions for just one register. Each additional
value transferred then takes an extra S cycle. LDM instructions are thus
able to operate at nearly peak memory bandwidth.
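
A rough sketch of that rule, again in units of S with N=2*S: take the single-transfer timing and add one S per extra register. The names are invented for illustration, and the model deliberately ignores the MEMC limit on consecutive page mode cycles, so real figures would be somewhat worse.

```python
# Simplified model: multiple transfers cost the single-transfer time
# plus one S cycle per additional register. Names are hypothetical, and
# the MEMC page mode run limit is NOT modelled here.

def ldm_time(regs, n, s, rescued=True):
    """Bus time for LDM of `regs` registers (LDR timing + one S per extra)."""
    single = n + (s if rescued else n) + s
    return single + (regs - 1) * s

def stm_time(regs, n, s):
    """Bus time for STM of `regs` registers (STR timing + one S per extra)."""
    return 2 * n + (regs - 1) * s

for regs in (1, 2, 4, 8):
    t = ldm_time(regs, n=2, s=1)
    print(regs, t, t / regs)  # registers, total S cycles, S cycles per word
```

Under these assumptions an 8 register LDM costs 11 S cycles, or 1.375 per word, against 4 for a lone LDR.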

>Using this technique the maximum sustained data rate achievable for a block
>move operation is over 11 Mbytes/sec on an 8MHz ARM for a word aligned transfer
>Alasdair McIntyre (aiadrmi@uk.ac.edinburgh.castle)

Now current Acorn volume market machines have just about the slowest DRAM
obtainable: N=250ns, S=125ns. Bus bandwidth is maximised by N+3S repeated,
coming in at 25.6MBytes per second. And LDM/STM of (say) 8 registers gets
very close to this, so an Acorn A3000 (649 pounds) can pick up memory *and
put it down again* at around the speed mentioned - provided the video
system isn't after the memory bandwidth as well!

But these aren't ARM limits (or even MEMC limits). I'm using a research
prototype machine with N=166ns, S=83ns, giving a bus bandwidth of 38.4MBytes
per second. And its DRAMs are just a bit less affordable :-).
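
For anyone who wants to check the arithmetic, the figures above can be reproduced with a short sketch. The function names are invented for this example. The block move estimate assumes the simple LDM/STM model (single-transfer timing plus one S per extra register), ignores loop overhead, the page mode run limit and video DMA, and counts a MByte as 10^6 bytes. The 12MHz clock used for the prototype is an assumption too: the quoted 166/83 are taken to be rounded values of 1000/6 and 1000/12 ns.

```python
# Back-of-envelope check of the bandwidth figures. All names are
# hypothetical; MBytes are taken as 10^6 bytes.

def peak_bandwidth(n_ns, s_ns, word_bytes=4):
    """MBytes/s for repeated N+3S bursts, one 32-bit word per cycle."""
    burst_bytes = 4 * word_bytes
    burst_ns = n_ns + 3 * s_ns
    return burst_bytes / burst_ns * 1000.0  # bytes/ns -> MBytes/s

def block_move_rate(n_ns, s_ns, regs=8, word_bytes=4):
    """MBytes/s moved by one LDM plus one STM of `regs` registers.

    Assumes LDM = N + 2S + (regs-1)S and STM = 2N + (regs-1)S, with no
    loop overhead and no page mode run limit (a simplification).
    """
    ldm_ns = n_ns + 2 * s_ns + (regs - 1) * s_ns
    stm_ns = 2 * n_ns + (regs - 1) * s_ns
    return regs * word_bytes / (ldm_ns + stm_ns) * 1000.0

print(round(peak_bandwidth(250, 125), 1))             # 25.6
print(round(block_move_rate(250, 125), 1))            # 11.6
print(round(peak_bandwidth(1000 / 6, 1000 / 12), 1))  # 38.4
```

Feeding in the rounded 166/83 instead gives about 38.6, so the quoted 38.4 fits the unrounded values better.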

Naturally ARM3 refills cache lines using the N+3S transfer method, thus
speeding up LDR instructions to related addresses. Third party ARM3
upgrades to Acorn machines are circa 500 pounds.... (and falling)

=Remember, the guiding principle of architecture design is not "is it RISC",
=it's "is it faster" (with a little "is it cheaper", "is it easier to
=program", "is it reliable", etc thrown in.)
=Randell Jesup (jesup@cbmvax.cbm.commodore.com)

ARM and its related chip set were designed on a "total system design"
principle. With a great deal of "is it cheaper?" thrown in, since our
objective was to make a true 32 bit computer with performance, hardware
memory management, colour bit mapped screens and flexible IO, as cheap as
possible. RISC just happened to fit the bill on the processor end of things
(68020s cost as much as the whole of the rest of the system put together in
1984), but many other considerations and technologies were valuable. Like we
also wanted to make the assembly language easy to write in, the overall
system readily manufacturable, the amount of people time to design the
chips tractable.... So we ended up with (strange?) things like MEMC with no
data bus, VIDC with no address bus. And zillions of lines of ARM Assembler
(it turned out so easy to write...).

--Roger Wilson