Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!think!ames!pioneer!lamaster
From: lamaster@pioneer.UUCP
Newsgroups: comp.arch
Subject: Re: Life with TLB and no PT
Message-ID: <1399@ames.UUCP>
Date: Mon, 27-Apr-87 22:11:00 EDT
Article-I.D.: ames.1399
Posted: Mon Apr 27 22:11:00 1987
Date-Received: Wed, 29-Apr-87 01:31:23 EDT
References: <3027@sdcsvax.UCSD.EDU> <338@dumbo.UUCP> <1366@ames.UUCP> <27302@rochester.ARPA>
Sender: usenet@ames.UUCP
Reply-To: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Organization: NASA Ames Research Center, Moffett Field, Calif.
Lines: 62

In article <27302@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
:
>
>You are talking eight megabytes of data for what is presumably a small part
>of any real program.  Unless you have a lot of physical memory, this code
>looks like one page fault per statement.  The TLB miss is insignificant.

Well, a medium part actually.  If I had a 4 MIPS machine with 16MB of
memory, typical these days for a Sun, etc., I would probably dimension my
arrays about 700x700.  On the other hand, if I could get 32MB, 1Kx1K would
be good.  I think 16-64MB is a very good guess as to the memory size I
would expect on a new architecture in this speed range.  I don't consider
8MB a lot of memory by today's standards.

>
>Let's assume you have the physical memory and won't page fault.  I will make
>a guess that each iteration of the inner loop takes roughly 18 cycles which

18 cycles sounds like a lot to me for a single memory to memory word-sized
move, overhead included.

>
>Let's reexamine the code.  What you are really doing is b = transpose( a ).
>I feel free to recode the same function in a more efficient manner.  The
>approach is to transpose square submatricies instead of rows (or columns).
:
(Description of better transpose deleted)

Indeed, there are better ways to transpose, given that you know something
about the machine.
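Since the description was deleted, here is a minimal sketch of what
transposing by square submatrices might look like.  It is not Crowl's
code: the function name transpose_blocked, the tile size BLOCK, and the
row-major float layout are all assumptions made for illustration, and the
right tile size depends on the cache and TLB of the machine at hand.

```c
#include <stddef.h>

/* Hypothetical tile edge (an assumption, not from the original post).
   Pick it so that a source tile and a destination tile together fit in
   the cache, or within the pages the TLB can map at once. */
#define BLOCK 32

/* Transpose the n x n row-major matrix a into b, one square tile at a
   time, so that each tile touches only a small set of pages. */
void transpose_blocked(size_t n, const float *a, float *b)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp the tile at the matrix edge when n is not a
               multiple of BLOCK. */
            size_t imax = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t jmax = (jj + BLOCK < n) ? jj + BLOCK : n;

            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    b[j * n + i] = a[i * n + j];
        }
    }
}
```

The result is the same as the plain row-by-row loop; only the order of
the memory references changes.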
However, the algorithm proposed is also very bad for some vector
machines that I know of.  It is very difficult to optimize code for more
than one machine at a time.  A classic example is the BLAS library
delineated some time ago.  BLAS was expected to be at the correct level
for optimization on many machines.  Strangely enough, at about the same
time, people figured out how to use the fast vector machines that were
coming out then.  It turned out that BLAS was designed around too small a
unit of work to allow the best optimization on vector machines.  Dongarra,
who was in on the BLAS effort, soon proposed an organization for linear
algebra problems that was 3 times faster, in Fortran, than the best
assembly language versions based on BLAS.  Whenever you write specifically
optimized code, you are risking premature optimization.

But the point of my original posting was that graphics, numerical
modeling, and other engineering and scientific number crunching codes
often reference large amounts of memory in a pseudo-random fashion.  It is
not safe to assume that both code and data have strong locality in
reference patterns.  Often, only the code does.


   Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
   NASA Ames Research Center                ames!pioneer!lamaster
   Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
   Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")