Red Hat Knowledgebase
What is the Error Detection and Correction (EDAC) Support that is available in Red Hat Enterprise Linux 4 Update 4?

Article ID: 3897 - Created on: Mar 7, 2006 6:00 PM - Last Modified:  Aug 5, 2009 7:48 AM

NOTE:

EDAC support is available from Red Hat Enterprise Linux 4 Update 4 onward, although it was announced with the release of Red Hat Enterprise Linux 4 Update 3. Users who expect this feature need at least Red Hat Enterprise Linux 4 Update 4.


Background

 

Memory error checking on a memory module used to be accomplished with a parity checking bit that was attached to each byte of memory. The parity bit was calculated when each byte of memory was written, and then verified when each byte of memory was read. If the stored parity bit didn't match the calculated parity bit on a read, that byte of memory was known to have changed. Parity checking is known to be a reasonably effective method for detecting a single bit change in a byte of memory.
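
The parity scheme described above can be illustrated with a minimal sketch. It assumes even parity (the article does not say which convention the hardware used), and the function names are hypothetical, chosen only for this example.

```python
def parity_bit(byte: int) -> int:
    """Return the even-parity bit for one byte (0..255)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    # The parity bit is 1 when the byte has an odd number of 1 bits,
    # so that byte plus parity always carries an even number of 1 bits.
    return bin(byte).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """Return True if the byte still matches the parity stored at write time."""
    return parity_bit(byte) == stored_parity


original = 0b10110010                 # byte as written
stored = parity_bit(original)         # parity recorded on write
single_flip = original ^ 0b00000100   # one bit changed in memory
double_flip = original ^ 0b00000110   # two bits changed in memory

print(check(original, stored))        # unchanged byte passes the check
print(check(single_flip, stored))     # a single-bit change is detected
print(check(double_flip, stored))     # a two-bit change goes unnoticed
```

The last call shows the limit of parity checking: two flipped bits in the same byte cancel out, which is why parity is only described as effective for single-bit changes.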

 

EDAC Support in Red Hat Enterprise Linux 4 Update 4

 

x86 and x86_64 only

 

Red Hat Enterprise Linux 4 Update 4 includes error detection and correction (EDAC) functionality for x86 and x86-64 systems; in recent releases the EDAC modules are loaded by default. Users who want the kernel to panic on an unrecovered NMI can enable this by toggling the proc interface: /proc/sys/kernel/panic_on_unrecovered_nmi
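
As a rough sketch, the proc interface can be read and toggled from a script. The path is the one given above; the helper names are hypothetical, and the sketch assumes the usual sysctl convention that writing 1 enables the behaviour and 0 disables it. Writing requires root privileges.

```python
from pathlib import Path

# Path taken from the article text.
NMI_PANIC = Path("/proc/sys/kernel/panic_on_unrecovered_nmi")


def read_flag(path: Path = NMI_PANIC) -> int:
    """Return the current integer value of a proc/sysfs flag file."""
    return int(path.read_text().strip())


def write_flag(enabled: bool, path: Path = NMI_PANIC) -> None:
    """Write 1 or 0 to the flag file (assumed convention: 1 = on, 0 = off)."""
    path.write_text("1\n" if enabled else "0\n")


if __name__ == "__main__":
    if NMI_PANIC.exists():
        print("panic_on_unrecovered_nmi =", read_flag())
    else:
        print(NMI_PANIC, "is not present on this kernel")
```

A value written this way does not survive a reboot, so a persistent setting would have to be applied again at boot time.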

 

On supported chipsets, the kernel can now detect and report ECC single-bit errors, and report and panic on multi-bit errors. This update does not affect IA-64, which already handles this functionality in a different way.

 

Kernel Reaction to Types of Error Reported

 

The kernel response to an ECC or parity error is determined by the chip support in the kernel and the motherboard support. There are three possible configurations:

 

  1. motherboard with EDAC supported chipset;

  2. motherboard with no EDAC support, but has on-board NMI support;

  3. motherboard with no EDAC support, and no on-board NMI support.

 

The responses can be categorized in the following ways:

 

Table 1: Kernel reaction to types of error reported

 

                              ECC Error                             PCI Parity Error

EDAC supported chipset        Detailed diagnostic +                 Detailed diagnostic
                              Uncorrectable Err Panic (optional)

Non-supported chipset,        Error message                         "Unknown NMI error" error message
board with NMI reporting

Non-supported chipset,        No action                             "Unknown NMI error" error message
no on-board NMI support

 

All detailed diagnostic cases also include optional support for causing a kernel panic.

ECC error detection and recovery requires the use of suitable memory modules and may also require board level support.

PCI parity error detection is not supported in Red Hat Enterprise Linux 4 or 5.

 

EDAC Supported Chipsets

 

Table 2: EDAC Supported Chipset*

 

Vendor     Chipsets

AMD        AMD 76x
           K8 (Opteron)

Intel      7520, 7525, 7320
           7500, 7501, 7205
           i82860
           i82875p

Radisys    82600

 

* "Chipset" here refers to what is also known as the northbridge on the motherboard, which is usually a group of chips. The EDAC implementation specifically targets various low-level errors that are reported by the CPU or support chipsets: memory errors, cache errors, PCI bus errors, thermal throttling, and so on.

 

Configuration

 

  • The behaviour of the kernel is controlled by the "/sys/devices/system/edac/mc*" files. The default value of panic_on_ue (panic on uncorrected error) is "1" (on), so by default the kernel panics on a double-bit ECC error.

  • By default, PCI parity errors are not scanned for.

  • If the EDAC kernel modules are loaded, the kernel provides control interfaces via /sys/devices/system/edac/mc that allow debugging and logging to be controlled at runtime, and the /sys/devices/system/edac/mc directory also provides statistics.
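
To see which control files are present and what they currently contain, something like the following sketch could be used. The directory is the one named above; the exact files beneath it depend on the kernel version, so the code only lists what it finds. The function name is hypothetical.

```python
from pathlib import Path

# Directory named in the article; the files below it vary by kernel version.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")


def read_controls(directory: Path = EDAC_MC_DIR) -> dict:
    """Return {file name: contents} for readable regular files in directory."""
    values = {}
    if not directory.is_dir():
        return values
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            # Some sysfs entries are write-only or restricted; skip them.
            continue
    return values


if __name__ == "__main__":
    controls = read_controls()
    if not controls:
        print("no EDAC control files found; are the EDAC modules loaded?")
    for name, value in controls.items():
        print(f"{name} = {value}")
```

On a system with the default configuration described above, the output would be expected to include a panic_on_ue entry.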

 

Diagnostic Output

 

1. Positive Error Report

 

When the EDAC core monitoring module and a supported chipset driver are loaded, any detected error is logged in the syslog message file. For example, a single-bit error is reported in the syslog as follows:

 "Non-Fatal Error DRAM Controller
 Test row 3 Table 0 255 2 255 4 255 6 255
 Test computed row 8
 MC0: row 3 not found in remap table
 MC0: CE page 0xc6397, offset 0x0, grain 4096, syndrome 0xfc1, row 4,
 channel 0, label "": e752x CE"
 

 

To decode this message:

    • "Non-Fatal Error" - Recoverable error

    • "DRAM Controller" - In memory controller module

    • "MC0: row 3 not found in remap table" - it's not a page magically mapped over 4GB

         "MC0: CE page 0xc6397, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0,
        label "": e752x CE":

       

    • MC0 - Memory controller 0

    • CE - Correctable Error

    • Page - in memory page 0xc6397

    • Offset - offset into that page at 0x0

    • Grain - accuracy of reporting

    • Syndrome - error bits from controller (specific)

    • Channel - which memory channel (often only channel 0 on machines)

    • Row - which DIMM row. How that maps to a chip is vendor specific, but it is often as simple as row 0/1 -> DIMM0, row 2/3 -> DIMM1, etc.

    • label "" - description of this DIMM (NULL string in U3)

    • e752x - chip type (e.g. e7501, AMD76x)
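
The field breakdown above can be turned into a small parser. This is only a sketch: the regular expression is written against the single sample line shown in this article and may not match output from other drivers or kernel versions. The function name and the dimm_guess field are hypothetical, and the DIMM number relies on the "two rows per DIMM" rule of thumb, which is vendor specific.

```python
import re

# Pattern written against the sample CE line in this article only.
CE_LINE = re.compile(
    r'MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), '
    r'offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), '
    r'syndrome (?P<syndrome>0x[0-9a-fA-F]+), row (?P<row>\d+),\s+'
    r'channel (?P<channel>\d+), label "(?P<label>[^"]*)": (?P<driver>\S+)'
)


def parse_ce(message: str) -> dict:
    """Extract the fields of a correctable-error record from a log message."""
    match = CE_LINE.search(message)
    if match is None:
        raise ValueError("no EDAC CE record found")
    fields = match.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)
    # Assumption: two rows per DIMM (row 0/1 -> DIMM0, row 2/3 -> DIMM1).
    fields["dimm_guess"] = fields["row"] // 2
    return fields


sample = ('MC0: CE page 0xc6397, offset 0x0, grain 4096, syndrome 0xfc1, row 4,\n'
          ' channel 0, label "": e752x CE')
record = parse_ce(sample)
print(record["mc"], hex(record["page"]), record["row"],
      record["dimm_guess"], record["driver"])
```

For the sample message, the parser reports row 4 on memory controller 0, which under the two-rows-per-DIMM assumption would point at DIMM2.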

 

Footnote: If the system only has NMI support, without EDAC support, the system will panic when an error has been detected. Please see "Kernel Reaction to Types of Error Reported" section for more detailed behavior.

 

2. False Positive Error Report

 

In the case of a false positive report, the system will panic with an error message. Note that the example below shows an uncorrected error, which may result from either a false positive or a genuine error. False positives can also occur for correctable errors:

 Panic: MC0: Uncorrected Error
 

This can be caused by either a buggy BIOS or faulty hardware. To work around it, boot the system into single user mode and add or modify the following lines in the /etc/modprobe.conf file:

 alias e752x_edac /dev/null
 alias edac_mc /dev/null
 options edac_mc panic_on_ue=0
 

Reboot the system. The system should boot up properly.

Tags: rhel4