Path: utzoo!attcan!uunet!cs.utexas.edu!samsung!uakari.primate.wisc.edu!uflorida!mephisto!udel!udccvax1!mccalpin
From: mccalpin@vax1.acs.udel.EDU (John D Mccalpin)
Newsgroups: comp.sys.sgi
Subject: Re: Processor efficiency
Message-ID: <6604@vax1.acs.udel.EDU>
Date: 15 Jun 90 14:30:12 GMT
References: <9006150334.AA03405@physics.phy.duke.edu>
Reply-To: mccalpin@vax1.udel.edu (John D Mccalpin)
Organization: College of Marine Studies, Univ. of Delaware
Lines: 64

In article <9006150334.AA03405@physics.phy.duke.edu> rgb@PHY.DUKE.EDU ("Robert G. Brown") writes:
>
>We have a Power Series 220S [....]
>[....] small jobs [...] run at around 3.5 MFLOPS (as advertised).
>[...] if one takes these jobs (typically a loop containing just one
>equation with a multiply, a divide, an add, and a subtract) and
>scales them up by making the loop set every element of a vector and
>increasing the size of the vector and the loop, there is a point
>(which I have not yet tried to precisely pinpoint) where the speed
>degrades substantially -- by more than a factor of two.

This degradation is a bit larger than is typical, but it is exactly
what one expects to find with many algorithms on a cached machine.
On my 4D/25 I typically see 25% slowdowns on dense linear algebra
benchmarks when the cache size is exceeded.

(Side note: It is unfortunate that SGI put a 32kB data cache in the
4D/25 as it is just a bit too small to handle the 100x100 LINPACK
benchmark case. The rated performance is 1.6 MFLOPS for the 64-bit
case, while the Sparcstation I is rated at 2.6 MFLOPS. Despite these
ratings, the 4D/25 is faster than the Sparcstation on almost every
realistic FP benchmark that I have run. Also, re-arranging the
LINPACK test case to run in block mode produces performance of up to
3.1 MFLOPS for the same test case.)

(Side Question: Does anyone at SGI want to tell me what the cache line
size and refill delays are for the 4D/25? Thanks for any info!)

>My current hypothesis is that this phenomenon is caused by saturation
>of some internal cache on the R3000. Has anyone else noticed or
>documented this?
>
>Dr. Robert G. Brown rgb@phy.duke.edu

Here are some numbers from the port of LAPACK that I have been
playing with on my 4D/25 (32 kB data cache). These use hand-coded
BLAS routines from earl@mips.com.

 size     factor       solve       total      mflops
------------------------------------------------------
   32   0.000E+00   9.398E-03   9.398E-03   2.542E+00
   50   2.819E-02   0.000E+00   2.819E-02   3.133E+00
  100   1.692E-01   0.000E+00   1.692E-01   4.059E+00
  150   6.296E-01   1.880E-02   6.484E-01   3.539E+00
  200   1.626E+00   2.819E-02   1.654E+00   3.273E+00
  250   3.411E+00   4.699E-02   3.458E+00   3.048E+00
  300   6.137E+00   6.578E-02   6.202E+00   2.931E+00
  500   2.904E+01   1.692E-01   2.921E+01   2.870E+00

I get a bit more than 25% degradation going to the larger problems.

So what does one do about it? Mostly it depends on the problem.
If you are doing problems that make extensive use of reduction
operations (sums and dot products) then you should be able to improve
the cache locality by unrolling the outer loops. This is roughly
equivalent to the block-mode algorithms used in LAPACK.

If your operations are vector<-vector+vector, then you are basically
out of luck and your problem will be memory bandwidth-limited.....

Please let me know if I have not made myself clear!
-- 
John D. McCalpin                      mccalpin@vax1.udel.edu
Assistant Professor                   mccalpin@delocn.udel.edu
College of Marine Studies, U. Del.    mccalpin@scri1.scri.fsu.edu
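
To make the outer-loop unrolling suggestion for reduction-heavy code
concrete, here is a minimal sketch in C. It is not taken from the
LAPACK port or the hand-coded BLAS mentioned above. The function names
matvec_plain and matvec_unrolled4, the row-major storage, and the
unroll factor of 4 are all assumptions chosen for illustration. The
operation is y = A*x, computed as one dot product per row. In the
unrolled version each x[j] is loaded once and used in four running
sums, so x is swept a quarter as many times.

```c
#include <stddef.h>

/* Baseline: y = A*x as n independent dot products.
 * A is n x n, stored row-major (an assumption of this sketch).
 * Every row sweeps the whole of x[] again. */
void matvec_plain(size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Outer loop unrolled by 4 (the factor is illustrative, not tuned):
 * each x[j] is loaded once and feeds four running sums. */
void matvec_unrolled4(size_t n, const double *a, const double *x, double *y)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        const double *r0 = a + i * n;   /* four consecutive rows of A */
        const double *r1 = r0 + n;
        const double *r2 = r1 + n;
        const double *r3 = r2 + n;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

        for (size_t j = 0; j < n; j++) {
            double xj = x[j];           /* one load, four uses */
            s0 += r0[j] * xj;
            s1 += r1[j] * xj;
            s2 += r2[j] * xj;
            s3 += r3[j] * xj;
        }
        y[i]     = s0;
        y[i + 1] = s1;
        y[i + 2] = s2;
        y[i + 3] = s3;
    }

    /* Leftover rows when n is not a multiple of 4. */
    for (; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

Both routines add the terms of each row in the same order, so they
should give identical results. Whether the unrolled form is actually
faster depends on the cache size, the line size, and the compiler, so
it needs to be timed on the machine in question. No such trick helps
the vector<-vector+vector case, since each operand there is used only
once.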