Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!mnetor!seismo!rochester!crowl From: crowl@rochester.ARPA (Lawrence Crowl) Newsgroups: comp.arch Subject: Re: Life with TLB and no PT Message-ID: <27302@rochester.ARPA> Date: Sat, 25-Apr-87 11:32:18 EDT Article-I.D.: rocheste.27302 Posted: Sat Apr 25 11:32:18 1987 Date-Received: Sun, 26-Apr-87 20:27:51 EDT References: <3027@sdcsvax.UCSD.EDU> <338@dumbo.UUCP> <1366@ames.UUCP> Reply-To: crowl@rochester.UUCP (Lawrence Crowl) Organization: U of Rochester, CS Dept, Rochester, NY Lines: 58 In article <1366@ames.UUCP> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes: >In article <338@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes: >>The MIPS R2000 processor already does [software TLB refill], and with a 4k >>page size, a 64-entry, fully-associative TLB, and 15-19 cycles to refill the >>TLB in software from a simple page table, large benchmarks typically spend >>about 1% of their execution time in the software TLB refill handler. > >I don't doubt that the MIPS does well running ls, cat, troff, and awk. But >what about this Fortran program: > > real a(1000,1000),b(1000,1000) > do 10 i=1,1000 > do 5 j=1,1000 > b(i,j) = a(j,i) > 5 continue > 10 continue > >That looks like one TLB load per statement to me. (By the way, this is not >a far fetched example at all.) Code references are usually local. Data >references are frequently far apart. You are talking eight megabytes of data for what is presumably a small part of any real program. Unless you have a lot of physical memory, this code looks like one page fault per statement. The TLB miss is insignificant. Let's assume you have the physical memory and won't page fault. I will make a guess that each iteration of the inner loop takes roughly 18 cycles which is rougly the same as a TLB refill. This means that the code above spends about 50% of its time in TLB refill. Let's reexamine the code. What you are really doing is b = transpose( a ). I feel free to recode the same function in a more efficient manner. The approach is to transpose square submatricies instead of rows (or columns). real a(1000,1000),b(1000,1000) do 40 i1=1,951,50 -- select submatrix do 30 j1=1,951,50 do 20 i2=i1,i1+50 -- transpose submatrix do 10 j2=j1,j1+50 b(i2,j2) = a(j2,i2) 10 continue 20 continue 30 continue 40 continue This looks to be about 100 TLB misses in each submatrix transposition, one one for each row of a and b touched. This is 0.04 misses per iteration of the inner loop. This means that the revised code spends roughly 4% of its time in TLB refill. While this is worse than the 1% cited by Craig Hansen, it is certainly not the 50% implied by the first code. Is the revised code unreasonable? Certainly not if it is part of a library. Given the potential paging problem, programmers may be forced into this type of approach anyway. -- Lawrence Crowl 716-275-5766 University of Rochester crowl@rochester.arpa Computer Science Department ...!{allegra,decvax,seismo}!rochester!crowl Rochester, New York, 14627