On Wed, 09 May 2001 00:33:49 PDT, Crispin Cowan said:

> For the most part, they show very little degradation due to LSM.

Good.

> The curious metric is main memory latency: it shows considerable degradation,
> when this is one metric I would expect to be unaffected by LSM. I don't have a
> decent conjecture of what would cause this.

OK.. Speculating on the *weird* causes first. ;)

I have to wonder if there's some sort of second-order effect here. For
instance, without the patch the benchmark might exhibit "nice" cache
behavior, but adding the patch uses an extra 15 or 20 cache lines, so
instead of touching (for example) 255K of a 256K cache, it now touches
257K, causing additional cache misses.

I've also seen (on older machines with smaller caches) cases where odd
timing constraints would cause odd results - there was a tight loop that
took almost exactly one timer tick per iteration. One version of the
program would end up doing this:

    <syscall just before entering the loop - caused lockstep to timer>
    Loop:
        <read a big chunk of data, flushing the L1/L2 cache>
        <crunch numbers>
        <syscall, and take a timer interrupt while in there - cache flush>
    repeat

So effectively, the cache got totally wiped once per iteration. A very
small change removed the initial syscall, which caused the timer interrupt
to pop at a different point in the cycle, causing an effective 2 flushes
per iteration...

I could even bring up Prof. Eytan Baruch's problem with the Cornell Theory
Center's IBM 3090-600J supercomputer, where he got good results when
running at night but program crashes during the day. During the day, more
interrupts happened - and although the machine saved the vector registers,
it truncated the guard digits, and the resulting loss of precision from
(effectively) 132 bits down to 128 bits caused numerical instability....
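The "255K fits, 257K doesn't" conjecture above can be sketched with a toy
direct-mapped cache simulator (hypothetical sizes, not a model of any
particular CPU or of lmbench itself): nudging the working set just 1K past
the cache size turns a zero-miss steady state into conflict misses on every
pass.

```python
# Toy direct-mapped cache: same sequential scan, working set nudged just
# past the cache size. Sizes are illustrative assumptions.

LINE = 32                 # bytes per cache line
CACHE = 256 * 1024        # 256K direct-mapped cache
NSLOTS = CACHE // LINE

def misses_per_pass(working_set, passes=3):
    """Scan `working_set` bytes sequentially `passes` times; return the
    miss count of the final pass (i.e., after warm-up)."""
    tags = [None] * NSLOTS            # which line each cache slot holds
    misses = 0
    for _ in range(passes):
        misses = 0
        for addr in range(0, working_set, LINE):
            line = addr // LINE
            slot = line % NSLOTS      # direct-mapped: one slot per line
            if tags[slot] != line:    # cold or conflict miss
                tags[slot] = line
                misses += 1
    return misses

print(misses_per_pass(255 * 1024))  # 0  - fits, no steady-state misses
print(misses_per_pass(257 * 1024))  # 64 - the last 1K evicts the first 1K
```

The 257K case misses even though it exceeds the cache by under 0.4%: the
tail of the scan maps onto the same slots as the head, so both ends thrash
each other on every pass - exactly the kind of cliff that could make a
small patch look like a memory-latency regression.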
Or it might just be poor benchmarking methodology ;)

It might be productive to walk through the code and do a 'digits of
precision' analysis - I strongly suspect that although a lot of places in
the benchmark output show 4 or 5 digits, there's only 1 or 2 digits of
real accuracy (for instance, look at the 'Page Fault' column - 3.0000 all
the way down. *Immediately* suspect as a 1-digit value).

Hand-checking the .0 and .4 data files, it looks like the differing memory
latency values are the root cause - for all strides, the two files show
near-identical numbers up to the 1.0 value, and then at 1.5 and higher
there's a sudden dip for the -lsm kernel. I *really* have to question
whether there are as many significant digits in the data as it's
reporting.

Time to go can-opener the lmbench code and start looking, I guess....

-- 
Valdis Kletnieks
Operating Systems Analyst
Virginia Tech
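[The 'digits of precision' check above could be mechanized along these
lines - a sketch only; the paired readings below are made-up illustrative
numbers, not values from the actual lmbench output files:]

```python
# Given the same metric from two runs, count how many leading significant
# digits actually agree - a rough proxy for real measurement accuracy.
import math

def agreeing_digits(a, b):
    """Significant digits on which a and b agree (capped at double
    precision when the values are identical)."""
    if a == b:
        return 15
    rel = abs(a - b) / max(abs(a), abs(b))
    return max(0, int(-math.log10(rel)))

# Hypothetical memory-latency readings printed to 4-5 digits:
print(agreeing_digits(179.3412, 179.3399))  # 5 - the digits are real
print(agreeing_digits(180.1023, 184.7710))  # 1 - mostly noise past digit 1
```

If run-to-run pairs look like the second case, the benchmark is printing
several digits of noise, and a "considerable degradation" in the fourth
digit means nothing.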
This archive was generated by hypermail 2b30 : Wed May 09 2001 - 07:46:50 PDT