Path: utzoo!attcan!uunet!ogicse!caesar.cs.montana.edu!uakari.primate.wisc.edu!zaphod.mps.ohio-state.edu!usc!snorkelwacker!bloom-beacon!eru!luth!sunic!mcsun!ukc!acorn!rwilson
From: RWilson@acorn.co.uk
Newsgroups: comp.arch
Subject: Official comments on recent ARM postings
Message-ID: <1713@acorn.co.uk>
Date: 22 Feb 90 11:06:12 GMT
Sender: rwilson@acorn.co.uk
Lines: 90

ARM surfaces again on comp.arch - wonders will never cease :-)

]A related observation: the Acorn ARM has pre/post-increment/decrement (all 4
]combinations). You don't have to pay a penalty in cycles for using them.
]The ARM is a RISC.
]Ge' Weijers

You do pay a slight penalty - instruction code density vs address offset. The
extra bits of control for pre/post & inc/dec use up some bits that might have
been used for a bigger address offset. Studies showed....

}This is indeed addressing modes, but as any load can be done in one cycle,
}Torben Mogensen (torbenm@diku.dk)

Actually, due to all sorts of memory pipeline breaks, all loads in ARM are 3
cycles, all stores are 2 cycles. (for the interested, ARM has a pipelined
memory bus: the address in cycle a refers to the data xferred at the start of
cycle b - at which time ARM is already presenting a new address for cycle b.
This - since load requires one to look at the target, get the value and then
get the next instruction - means 3 cycles of the bus are spent on the load.
And it was simpler to make this 3 cycles internally - avoiding the extra ports
which might have been needed for pre/post & inc/dec - since we weren't
thinking of building ARM3 at the time (1984) (a good guess on the part of
Randell Jesup). Stores can be done in 2 cycles since the processor needn't
wait while the data whizzes about. Looking at ARM address traces on a logic
analyser can be very exciting :)).

>correction - (single register) loads on an ARM take three or four cycles,
>depending on the alignment of the next instruction in the pipeline.
>MEMC (the memory controller in the ARM chipset) loads 4 32-bit words at a
>time, so that relatively slow memories may be used, however this means
>that any out of sequence loads require an extra cycle to be inserted in
>order to reload.
>Alasdair McIntyre (aiadrmi@uk.ac.edinburgh.castle)

As stated above, as far as ARM is concerned load is always 3 cycles. The
memory controller transfers data to and from DRAMs using page mode to speed up
many cycles. But the column address is also used for late translation of the
virtual address, which limits the number of page mode cycles in a row to 3,
after which we have to do a new row address. Page mode cycles are sequential
"S" cycles, non-page mode are non-sequential "N" cycles. ARM provides an
advance warning signal to the memory controller when it breaks sequentiality.
Clearly LDR breaks, so it takes N, N, S. Clever logic in MEMC rescues the
second N, turning it into an S where possible. STR always does N, N -
coincidentally the same time as LDR in current systems where N=2*S, but for
different reasons.

As Alasdair states, the multiple transfer instructions have the same timing as
the single transfer instructions for just one register. Each additional value
transferred then takes an extra S cycle. LDM instructions are thus able to
operate at nearly peak memory bandwidth.

>Using this technique the maximum sustained data rate achievable for a block
>move operation is over 11 Mbytes/sec on an 8MHz ARM for a word aligned transfer
>Alasdair McIntyre (aiadrmi@uk.ac.edinburgh.castle)

Now current Acorn volume market machines have just about the slowest DRAM
obtainable. N=250nS, S=125nS. Bus bandwidth is maximised by N+3S repeated,
coming in at 25.6MBytes per second. And LDM/STM of (say) 8 registers gets very
close to this, so an Acorn A3000 (649 pounds) can pick up memory *and put it
down again* at around the speed mentioned - provided the video system isn't
after the memory bandwidth as well!
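
For anyone who wants to check the arithmetic, here is a rough sketch of the
sums above. The function names are made up for illustration, and it assumes
32-bit (4 byte) words, a 4-word N+3S burst and MBytes meaning 10^6 bytes. It
only models the bus timings quoted in this posting, nothing inside MEMC.

```python
# Sketch only: the names are hypothetical, the timings are those quoted above.

def burst_bandwidth_mb_s(n_ns, s_ns, words_per_burst=4, bytes_per_word=4):
    """Peak bus bandwidth, in 10^6 bytes/sec, for N+3S repeated."""
    burst_ns = n_ns + (words_per_burst - 1) * s_ns    # one N, then 3 S cycles
    burst_bytes = words_per_burst * bytes_per_word    # 16 bytes per burst
    return burst_bytes * 1000.0 / burst_ns            # bytes/ns -> MBytes/sec


def ldr_ns(n_ns, s_ns, rescued=True):
    """LDR is N, N, S; MEMC turns the second N into an S where possible."""
    second_cycle = s_ns if rescued else n_ns
    return n_ns + second_cycle + s_ns


def str_ns(n_ns):
    """STR always does N, N."""
    return 2 * n_ns


if __name__ == "__main__":
    n, s = 250, 125                      # the slow-DRAM figures quoted above
    print(burst_bandwidth_mb_s(n, s))    # N+3S = 625 ns for 16 bytes
    print(ldr_ns(n, s), str_ns(n))       # equal whenever N = 2*S
```

With N=250 and S=125 the burst takes 625nS for 16 bytes, which is where the
25.6 figure comes from, and the rescued LDR (N, S, S) comes out the same
length as STR (N, N) only because N happens to be 2*S.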

But these aren't ARM limits (or even MEMC limits). I'm using a research
prototype machine with N=166nS, S=83nS. A bus bandwidth of 38.4MBytes per
second. And its DRAMs are just a bit less affordable :-).

Naturally ARM3 refills cache lines using the N+3S transfer method, thus
speeding up LDR instructions to related addresses. Third party ARM3 upgrades
to Acorn machines are circa 500 pounds.... (and falling)

=Remember, the guiding principle of architecture design is not "is it RISC",
=it's "is it faster" (with a little "is it cheaper", "is it easier to
=program", "is it reliable", etc thrown in.)
=Randell Jesup (jesup@cbmvax.cbm.commodore.com)

ARM and its related chip set were designed on a "total system design"
principle. With a great deal of "is it cheaper?" thrown in, since our
objective was to make a true 32 bit computer with performance, hardware memory
management, colour bit mapped screens and flexible IO, as cheap as possible.
RISC just happened to fit the bill on the processor end of things (68020's
cost as much as the whole of the rest of the system put together in 1984), but
many other considerations and technologies were valuable. Like we also wanted
to make the assembly language easy to write in, the overall system readily
manufacturable, the amount of people time to design the chips tractable....

So we ended up with (strange?) things like MEMC with no data bus, VIDC with no
address bus. And zillions of lines of ARM Assembler (it turned out so easy to
write...).

--Roger Wilson