Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rochester!crowl
From: crowl@rochester.ARPA (Lawrence Crowl)
Newsgroups: comp.arch
Subject: Re: Life with TLB and no PT
Message-ID: <27302@rochester.ARPA>
Date: Sat, 25-Apr-87 11:32:18 EDT
Article-I.D.: rocheste.27302
Posted: Sat Apr 25 11:32:18 1987
Date-Received: Sun, 26-Apr-87 20:27:51 EDT
References: <3027@sdcsvax.UCSD.EDU> <338@dumbo.UUCP> <1366@ames.UUCP>
Reply-To: crowl@rochester.UUCP (Lawrence Crowl)
Organization: U of Rochester, CS Dept, Rochester, NY
Lines: 58

In article <1366@ames.UUCP> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>In article <338@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>>The MIPS R2000 processor already does [software TLB refill], and with a 4k
>>page size, a 64-entry, fully-associative TLB, and 15-19 cycles to refill the
>>TLB in software from a simple page table, large benchmarks typically spend
>>about 1% of their execution time in the software TLB refill handler.
>
>I don't doubt that the MIPS does well running ls, cat, troff, and awk.  But
>what about this Fortran program:
>
>      real a(1000,1000),b(1000,1000)
>      do 10 i=1,1000
>      do 5  j=1,1000
>      b(i,j) = a(j,i)
>    5 continue
>   10 continue
>
>That looks like one TLB load per statement to me.  (By the way, this is not
>a far fetched example at all.)  Code references are usually local.  Data
>references are frequently far apart.

You are talking eight megabytes of data for what is presumably a small part
of any real program.  Unless you have a lot of physical memory, this code
looks like one page fault per statement.  The TLB miss is insignificant.

Let's assume you have the physical memory and won't page fault.  I will make
a guess that each iteration of the inner loop takes roughly 18 cycles which
is rougly the same as a TLB refill.  This means that the code above spends
about 50% of its time in TLB refill.

Let's reexamine the code.  What you are really doing is b = transpose( a ).
I feel free to recode the same function in a more efficient manner.  The
approach is to transpose square submatricies instead of rows (or columns). 

      real a(1000,1000),b(1000,1000)
      do 40 i1=1,951,50              -- select submatrix
      do 30 j1=1,951,50
      do 20 i2=i1,i1+50              -- transpose submatrix
      do 10 j2=j1,j1+50
      b(i2,j2) = a(j2,i2)
   10 continue
   20 continue
   30 continue
   40 continue

This looks to be about 100 TLB misses in each submatrix transposition, one
one for each row of a and b touched.  This is 0.04 misses per iteration of
the inner loop.  This means that the revised code spends roughly 4% of its
time in TLB refill.  While this is worse than the 1% cited by Craig Hansen,
it is certainly not the 50% implied by the first code.

Is the revised code unreasonable?  Certainly not if it is part of a library.
Given the potential paging problem, programmers may be forced into this type
of approach anyway.
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627