As we know, ECC (Error-Correcting Code) memory is a great feature for handling soft errors in RAM without causing the OS to fail. However, ECC only protects us from a single soft error in a memory block at a time. If there are multiple soft errors in a single memory block, or hard errors on one or more cells in main memory, this will typically cause the OS kernel to panic. The result is longer downtime: either going through a variety of steps to identify and replace the faulty module, or (in the case of multiple soft errors) resetting the OS along with all its application services. In a virtualised environment with VMware ESXi or any other hypervisor, there will be multiple VMs running, and they all go down along with the hypervisor.
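To make the single-error limit of ECC concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code. It uses a toy 8-bit extended Hamming code protecting 4 data bits, which is an assumption chosen for readability; real ECC DIMMs commonly protect 64 data bits with 8 check bits, and the function names below are purely illustrative.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming positions
    code[3] = (nibble >> 0) & 1         # data bits live at positions 3, 5, 6, 7
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity enables double-error detection
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    odd_parity = sum(code) % 2          # 1 means an odd number of bits flipped

    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        code[syndrome] ^= 1             # single flip: syndrome is its position (0 = parity bit)
        status = "corrected"
    else:
        return "uncorrectable", None    # two flips: detected, but cannot be repaired

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1                        # a single soft error
print(decode(one_flip))

two_flips = list(word)
two_flips[2] ^= 1                       # two soft errors in the same block
two_flips[6] ^= 1
print(decode(two_flips))
```

The single flip is repaired silently, while the double flip can only be reported. That second case is the kind of uncorrectable error that ends in a kernel panic.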
This is where Reliable Memory Technology (RMT) comes into the picture. It is a hardware feature that works together with a supported OS (such as ESXi 5.5 or higher). If a multi-bit soft error or a hard error occurs in a DIMM during ongoing operations, RMT detects it and takes corrective action so that the error does not trigger an OS kernel panic.
For example, let’s say there is a hard error in one of the DIMMs. The system detects it and marks the faulty cell, plus some cells around it, as unusable, so current OS operations continue. After the next reboot, the OS will not see those faulty cells at all, because the hardware no longer presents them.
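The example above can be pictured with a toy model. This is only a sketch of the idea, not how any firmware actually implements it: the cell numbering, the guard size and both function names are hypothetical.

```python
def retire_cells(total_cells, faulty_cell, guard=2):
    """Return the cells to hide: the faulty one plus `guard` neighbours on each side."""
    first = max(0, faulty_cell - guard)
    last = min(total_cells - 1, faulty_cell + guard)
    return set(range(first, last + 1))


def usable_after_reboot(total_cells, retired):
    """Cells the OS would see after reboot, once the hardware stops presenting retired ones."""
    return [cell for cell in range(total_cells) if cell not in retired]


retired = retire_cells(total_cells=10, faulty_cell=4, guard=1)
print(sorted(retired))                    # cells marked as unusable
print(usable_after_reboot(10, retired))   # what the OS is offered on the next boot
```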
RMT proves to be really useful for minimising downtime caused by memory-fault-related kernel panics, and for avoiding the replacement of a whole memory module because of hard errors in a DIMM.
If you have vSphere 5.5 or higher with an Enterprise or Enterprise Plus edition license, and your hardware supports RMT, you will be able to see the reliable memory on your ESXi host using the following command:
esxcli hardware memory get
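If you want to check the result from a script, a minimal Python sketch like the one below could parse the command's output. The sample text is an assumption about the usual "Name: value" layout, with made-up values, so compare it against what your own host prints; `parse_memory_get` is a hypothetical helper, not part of any VMware tool.

```python
import re

# Illustrative output only; the values are invented for this example.
SAMPLE_OUTPUT = """\
   Physical Memory: 34358128640 Bytes
   Reliable Memory: 0 Bytes
   NUMA Node Count: 2
"""


def parse_memory_get(text):
    """Turn 'Name: <number> ...' lines into a dict of integers."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+)", line)
        if match:
            fields[match.group(1).strip()] = int(match.group(2))
    return fields


info = parse_memory_get(SAMPLE_OUTPUT)
reliable = info.get("Reliable Memory", 0)

if reliable > 0:
    print("Reliable memory reported:", reliable, "bytes")
else:
    print("No reliable memory reported by this host")
```

A value of 0 bytes for Reliable Memory would suggest that the hardware is not exposing a reliable memory region to ESXi.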