Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!think!ames!pioneer!lamaster
From: lamaster@pioneer.UUCP
Newsgroups: comp.arch
Subject: Re: Life with TLB and no PT
Message-ID: <1399@ames.UUCP>
Date: Mon, 27-Apr-87 22:11:00 EDT
Article-I.D.: ames.1399
Posted: Mon Apr 27 22:11:00 1987
Date-Received: Wed, 29-Apr-87 01:31:23 EDT
References: <3027@sdcsvax.UCSD.EDU> <338@dumbo.UUCP> <1366@ames.UUCP> <27302@rochester.ARPA>
Sender: usenet@ames.UUCP
Reply-To: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Organization: NASA Ames Research Center, Moffett Field, Calif.
Lines: 62

In article <27302@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
:
>
>You are talking eight megabytes of data for what is presumably a small part
>of any real program.  Unless you have a lot of physical memory, this code
>looks like one page fault per statement.  The TLB miss is insignificant.

Well, a medium-sized part, actually.  If I had a 4 MIPS machine with 16MB of
memory, typical these days for a Sun and similar machines, I would probably
dimension my arrays at about 700x700.  On the other hand, if I could get 32MB,
1Kx1K would be good.  I think 16-64MB is a very good guess as to the memory
size I would expect on a new architecture in this speed range.  I don't
consider 8MB a lot of memory by today's standards.
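
As a rough check on these sizes, here is a sketch in C.  The 4-byte word and
the two-array layout (source a plus destination b) are assumptions made for
illustration; they are consistent with the eight-megabyte figure quoted above
for the 1Kx1K case.  The name footprint_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for `arrays` square n-by-n arrays of `word_bytes`-byte
   elements.  All three parameters are illustrative assumptions, not
   figures taken from the thread. */
size_t footprint_bytes(size_t n, size_t word_bytes, size_t arrays)
{
    assert(n > 0 && word_bytes > 0 && arrays > 0);
    return n * n * word_bytes * arrays;
}
```

Under those assumptions, footprint_bytes(1024, 4, 2) is 8388608 bytes (8MB),
and footprint_bytes(700, 4, 2) is 3920000 bytes, a little under 4MB.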

>
>Let's assume you have the physical memory and won't page fault.  I will make
>a guess that each iteration of the inner loop takes roughly 18 cycles which
  
18 cycles sounds like a lot to me for a single memory-to-memory word-sized
move, overhead included.

>
>Let's reexamine the code.  What you are really doing is b = transpose( a ).
>I feel free to recode the same function in a more efficient manner.  The
>approach is to transpose square submatrices instead of rows (or columns).

:
(Description of better transpose deleted)
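
For readers without the earlier article, a minimal sketch in C of the two
approaches under discussion.  This is not the deleted code: the row-major
layout, the float element type, the function names, and the block size
parameter bs are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward transpose: b = transpose(a), both n-by-n, row-major.
   Reads of a are sequential; writes to b stride by n elements. */
void transpose_naive(const float *a, float *b, size_t n)
{
    assert(a != NULL && b != NULL && a != b);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

/* Blocked transpose: handle one bs-by-bs submatrix at a time, so the
   parts of a and b touched by a block are reused before moving on.
   bs is a hypothetical tuning parameter; edge blocks are clipped to n. */
void transpose_blocked(const float *a, float *b, size_t n, size_t bs)
{
    assert(a != NULL && b != NULL && a != b && bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = (ii + bs < n) ? ii + bs : n;
        for (size_t jj = 0; jj < n; jj += bs) {
            size_t j_end = (jj + bs < n) ? jj + bs : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    b[j * n + i] = a[i * n + j];
        }
    }
}
```

Both routines produce the same b; they differ only in the order in which
memory is referenced.  A good value of bs depends on the page size, cache,
and TLB of the particular machine.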

Indeed, there are better ways to transpose, given that you know something
about the machine.  However, the algorithm proposed is also very bad for some
vector machines that I know of.  It is very difficult to optimize code for
more than one machine at a time.  A classic example is the BLAS library,
defined some time ago.  BLAS was expected to be at the right level for
optimization on many machines.  Ironically, at about the same time, people
figured out how to use the fast vector machines that were then coming out,
and it turned out that BLAS was designed around too small a unit of work to
allow the best optimization on those machines.  Dongarra, who was involved in
the BLAS effort, soon proposed an organization for linear algebra problems
that was 3 times faster, in Fortran, than the best assembly language versions
based on BLAS.  Whenever you write code optimized for a specific machine, you
risk premature optimization.
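
One way to read "too small a unit of work", sketched in C under stated
assumptions.  The original BLAS routines operate on vectors, so a
matrix-vector product built from them makes one call per column, and every
call reads and writes the whole result vector.  When the whole product is
the unit of work, several columns can be combined per pass.  The names axpy,
matvec_by_columns, and matvec_two_columns are hypothetical stand-ins, not the
BLAS interface, and this is an illustration rather than Dongarra's code.

```c
#include <assert.h>
#include <stddef.h>

/* Vector-level unit of work: y += alpha * x (hypothetical helper). */
void axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* y += A * x built from n vector-level calls; A is n-by-n, column-major.
   Each call loads and stores all of y. */
void matvec_by_columns(size_t n, const double *a, const double *x, double *y)
{
    for (size_t j = 0; j < n; j++)
        axpy(n, x[j], &a[j * n], y);
}

/* Same product with the matrix-vector operation as the unit of work:
   two columns are combined per pass, so y is loaded and stored about
   half as often. */
void matvec_two_columns(size_t n, const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    size_t j = 0;
    for (; j + 1 < n; j += 2) {
        const double *c0 = &a[j * n];
        const double *c1 = &a[(j + 1) * n];
        for (size_t i = 0; i < n; i++)
            y[i] += x[j] * c0[i] + x[j + 1] * c1[i];
    }
    if (j < n)                      /* odd n: one column left over */
        for (size_t i = 0; i < n; i++)
            y[i] += x[j] * a[j * n + i];
}
```

The two versions agree up to floating-point rounding, since the additions
are grouped differently.  Whether the second is actually faster depends on
the machine and compiler, which is the point being made above.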

But the point of my original posting was that graphics, numerical modeling,
and other engineering and scientific number crunching codes often
reference large amounts of memory in a pseudo-random fashion.  It is not safe
to assume that both code and data have strong locality in reference patterns.
Often, only the code does.
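
The simplest non-local pattern in this thread is the strided access of the
transpose.  A small C sketch of how many distinct pages a run of references
touches; the function name and the page size used in the example are
assumptions for illustration, not properties of any machine discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct pages touched by `refs` consecutive references that
   start at byte offset 0 and advance by `stride_bytes` each time.
   Offsets never decrease, so counting page changes counts distinct pages.
   page_bytes is an assumed page size. */
size_t pages_touched(size_t refs, size_t stride_bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    if (refs == 0)
        return 0;
    size_t count = 1;
    size_t last_page = 0;
    for (size_t k = 1; k < refs; k++) {
        size_t page = (k * stride_bytes) / page_bytes;
        if (page != last_page) {
            count++;
            last_page = page;
        }
    }
    return count;
}
```

Assuming 4-byte words and 4096-byte pages, walking along a row of a 1Kx1K
array gives pages_touched(64, 4, 4096) == 1, while walking down a column
(stride 4096 bytes) gives pages_touched(64, 4096, 4096) == 64: every
reference lands on a different page.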


  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")