Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!sri-spam!ames!aurora!labrea!decwrl!pyramid!weitek!wyse!bigboy!mikew
From: mikew@bigboy.UUCP
Newsgroups: comp.arch
Subject: Re: Double-bit errors and ECC memory
Message-ID: <172@bigboy.UUCP>
Date: Thu, 24-Sep-87 12:31:24 EDT
Article-I.D.: bigboy.172
Posted: Thu Sep 24 12:31:24 1987
Date-Received: Sat, 26-Sep-87 13:44:51 EDT
References: <686@obiwan.UUCP> <8637@utzoo.UUCP> <8638@utzoo.UUCP>
Reply-To: mikew@bigboy.UUCP (Mike Wexler)
Organization: Wyse Technology - 3571 Corp
Lines: 23

In article <8638@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>...Has anybody thought
>about doing the correction (as opposed to detection) part of ECC in software?
>Clearly this is viable only if ECC's purpose is to handle infrequent soft
>errors and provide fail-soft behavior in the presence of newly-arrived hard
>errors; ... Given that restriction on its domain of
>application, though, it seems like it might work.

I was thinking about this a few days ago, and I came up with some
interesting techniques for implementing it. For single-bit errors you
could just use the ECC bits to correct them; the advantage comes when you
have multiple-bit errors. The first step is to see whether the page is
dirty (different from the copy on the paging device). If it isn't, just
page it in again. This is very likely to work, since a lot of pages never
get changed (executable code) and a lot more are changed only infrequently.
If this fails and the error was in the data space of a user process, just
terminate the user process. If all else fails, and the error is in the
code space of the kernel, you can always generate a panic (or the
equivalent on your OS).

Does anybody implement a scheme like this? It would seem to greatly
reduce the problems caused by memory errors.
--
Mike Wexler
UUCP: wyse!mike
ATT: (408)433-1000 x 1330
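
The order of recovery steps described in the post could be sketched roughly as below. This is only an illustration of the decision order, not code from any real kernel: the names `Fault`, `Action`, and `choose_recovery` are hypothetical, and sending every unrecoverable kernel fault to a panic is an assumption (the post only mentions kernel code space).

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    CORRECT_WITH_ECC = auto()  # single-bit error: repair it from the ECC bits
    RELOAD_PAGE = auto()       # clean page: fetch a fresh copy from the paging device
    KILL_PROCESS = auto()      # dirty user page: terminate the owning process
    PANIC = auto()             # nothing else applies: panic (or the OS equivalent)


@dataclass
class Fault:
    """Hypothetical description of a detected memory error (bad_bits >= 1)."""
    bad_bits: int     # number of bits the ECC logic reports as bad
    page_dirty: bool  # True if the page differs from its copy on the paging device
    in_kernel: bool   # True if the page belongs to the kernel


def choose_recovery(fault: Fault) -> Action:
    """Pick the least drastic recovery, in the order the post suggests."""
    if fault.bad_bits == 1:
        return Action.CORRECT_WITH_ECC
    if not fault.page_dirty:
        # Multiple-bit error, but an intact copy still exists on the paging device.
        return Action.RELOAD_PAGE
    if not fault.in_kernel:
        # The only good copy is gone, but the damage is confined to one process.
        return Action.KILL_PROCESS
    # Assumption: any kernel page that cannot be reloaded leads to a panic.
    return Action.PANIC
```

For example, `choose_recovery(Fault(bad_bits=2, page_dirty=False, in_kernel=False))` selects the page reload, which is the case the post expects to be the most common for multiple-bit errors.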