Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!vsi1!daver!mips!mark
From: mark@mips.COM (Mark G. Johnson)
Newsgroups: comp.lsi
Subject: Arbiter / Synchronizer failure; MTBF
Summary: sometimes the required mtbf is thousands of years
Message-ID: <27581@obiwan.mips.COM>
Date: 15 Sep 89 01:28:18 GMT
References: <26811@obiwan.mips.COM> <2280011@hpsal2.HP.COM>
Reply-To: mark@mips.COM (Mark G. Johnson)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 79

In the previous posting <<26811@obiwan.mips.COM>> I used an approximation
formula to simplify the probability expressions, without explicitly
calling attention to the approximation. Article <2280011@hpsal2.HP.COM>
by saxena@hpsal2.HP.COM (Nirmal Saxena) pointed out the inexactitude;
unfortunately, his modification was incorrect.

Sticklers-for-mathematical-precision might perhaps be interested in the
exact expressions, without using approximation formulae. They appear
below. Engineering approximations were given in <26811@obiwan.mips.COM>.
I recommend the engineering approach; among other advantages, it provides
expressions that are far easier to invert.

A single part whose Mean Time Between Failures is "m" units of time:

***************************************************************************
*  Prob of a failure between time 0 and T is  P(fail) = 1 - exp(-T/m)     *
*  Prob of not-failure is                     P(not-fail) = exp(-T/m)     *
***************************************************************************

To compute the probability that one or more units out of a population of
50,000 will fail within 5 years, we simply compute the probability that
zero units will fail, and then realize that P(one or more failures) is
equal to 1.0 - P(no fails).

The probability of 0 failures among 50,000 units is just the probability
that the first one doesn't fail, times the probability the second one
doesn't fail, times..... (i.e. P(no-fail) to the 50,000 power).
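As a rough illustration of the two boxed formulas and the product rule
just described, here is a minimal Python sketch. The function names are
invented for this example, and it assumes the parts fail independently.
It works with the logarithm of P(no fails), because raising exp(-T/m) to
the 50,000th power can underflow ordinary double-precision arithmetic.

```python
import math

def p_fail(T, m):
    """P(a single part with MTBF m fails between time 0 and T)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-T / m)

def p_no_fail(T, m):
    """P(a single part with MTBF m survives from time 0 to T)."""
    return math.exp(-T / m)

def log_p_zero_failures(n, T, m):
    """Natural log of P(no failures among n independent parts)."""
    # [exp(-T/m)] ** n == exp(-n*T/m), so the log is simply -n*T/m.
    return -n * T / m

def p_one_or_more(n, T, m):
    """P(one or more failures among n parts) = 1 - P(no fails)."""
    return -math.expm1(log_p_zero_failures(n, T, m))
```

For instance, log_p_zero_failures(50000, 5, 100) evaluates the case of
50,000 units over 5 years with a sample MTBF of 100 years.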
If the MTBF is 100 years and we want to find the prob of 0 failures
after 5 years:

    P(0 failures in 50,000 units) = [exp(-5/100)] ** 50000 == exp(-2500)

So the probability that there are one or more failures in the 50,000
units is one minus P(no-fails); that is, [1.0 - exp(-2500)]. (very
nearly 1).

In general we want to know the probability of (fewer than K failures)
over a specified time interval. The original article stipulated that
the Big Boss would fire the engineer if, during the 5-year product
lifetime, there were 100 or more failures out of 50,000 installations in
the field. Thus the engineer wanted to have a large probability of
(fewer than 100 failures). In the example we solved for the MTBF that
gave a probability of (100 or more failures) equal to 0.33; that is,
the probability of (fewer than 100 failures) was 0.67.

If each of N identical parts has an MTBF equal to "m" units of time,

*****************************************************************************
*                                                                           *
* P(out of N parts, fewer than K failures from time 0 to time T) =          *
*                                                                           *
* Sum from i=0 to i=(K-1) {C(N,i) * (1 - exp(-T/m))^i * (exp(-T/m))^(N-i)}  *
*                                                                           *
*****************************************************************************

where the binomial coefficient C(N,i) is N! / (i! (N-i)!) and C(N,0)==1

So, in our example we set the probability equal to 0.67 and solve for m.
{Now you see why the engineering approximation is sometimes useful;
solving for m in the exact expression above is messy}.

Utilizing a numerical solution method, we find that m = 2619.6 years is
the required MTBF to give a 0.67 probability of (fewer than 100 failures
over 5 years among 50,000 parts).

Recall that the chip vendors proudly boast "1 century MTBF". So, using
the exact formula we find that this MTBF is 26 times too small; the Big
Boss will fire the design engineer.
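The post does not say which numerical method was used. One way to set up
such a solution, sketched here under stated assumptions, is to evaluate
the exact binomial sum in log space and then bisect on m, relying on the
fact that P(fewer than K failures) rises as the MTBF rises. The function
names, the bracketing interval, and the tolerance are all choices made
for this illustration.

```python
import math

def p_fewer_than(k, n, T, m):
    """P(out of n parts, fewer than k fail between time 0 and T)."""
    log_q = -T / m                           # log P(not-fail)
    log_p = math.log(-math.expm1(log_q))     # log P(fail)
    terms = []
    for i in range(k):
        # log of C(n,i) via log-gamma, to avoid enormous factorials
        log_c = (math.lgamma(n + 1) - math.lgamma(i + 1)
                 - math.lgamma(n - i + 1))
        terms.append(math.exp(log_c + i * log_p + (n - i) * log_q))
    # fsum limits rounding error; clamp in case the sum lands just above 1
    return min(1.0, math.fsum(terms))

def solve_mtbf(target, k, n, T, lo, hi, tol=1e-3):
    """Bisect for the MTBF m at which p_fewer_than(k, n, T, m) == target.

    Assumes (hypothetical bracket) that the probability is below target
    at m = lo and above target at m = hi.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_fewer_than(k, n, T, mid) < target:
            lo = mid      # MTBF too small: too many failures expected
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Fewer than 100 failures among 50,000 parts over 5 years, P = 0.67
    m = solve_mtbf(0.67, 100, 50000, 5.0, lo=100.0, hi=1.0e5)
    print("required MTBF (years):", m)
```

The result depends on how the 0.67 target was rounded in the original
calculation, so small differences from the figure quoted above should
not be read as a disagreement.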
The engineering solution agreed; it was a bit more conservative,
dictating an MTBF of 75.7 centuries to achieve fewer than 100 failures
among 50,000 parts over 5 years.
--
 -- Mark Johnson
    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
    (408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}