RAM failures: How to detect and fix
If your system breaks in unexpected ways and you can't understand why, the problem might be your RAM. In this article I will try to explain, without going too deep into details, why RAM fails, how to detect it, and how to fix it.
After years of using my laptop, I started running into segfaults and other problems more and more often. It was not critical, but it was annoying, and it turned out that the cause was a broken RAM module. I tested it, confirmed that it was broken, and removed it completely. That left me with only 2GiB of RAM, but I came to the conclusion that 2GiB is enough for my work if I apply some optimizations to my GNU+Linux system.
What RAM is used for
RAM holds a program's instructions and data while it runs. When you run a binary, its CPU instructions are loaded into RAM and the CPU reads them from there as it executes them. While the program is running, it also keeps its intermediate results, such as variables and data structures, in RAM. When the program finishes, the memory that held its instructions and intermediate results is released.
Why RAM can be damaged
I can't say exactly why RAM gets damaged; overheating is probably one of the factors. There is also a chance that the RAM itself is not damaged at all. It actually seems to break rarely. What can be "damaged" is the contacts of the RAM module, and that is easy to fix: take out the RAM modules and clean their contacts with an eraser. But how do you find out that you have a problem in the first place?
How to understand that RAM is broken
The most obvious sign of failed RAM is the BIOS signaling it. It should beep a special pattern through the PC speaker, and your motherboard manual explains what each pattern means. This usually happens when a RAM module is "completely" broken, and the computer won't start at all.
If the system boots just fine but you experience problems along the way, such as random segfaults, program crashes, and kernel panics, you might have broken segments of RAM. To detect such segments you can use the programs listed below. Checking RAM is usually not a fast process, so you will probably need to leave your device running for several hours.
Memtest86+
Memtest86+ runs from the GRUB menu, before your OS starts. It has to run this way because it needs the whole range of RAM, and a running system is already using part of that range. It runs many different tests over every segment of your RAM. While checking, it lists the broken segments, which you can write down.
You can install it using your GNU+Linux package manager, such as apt. The package is usually called memtest86+. But there is a small caveat: if you use an old version, it won't work on UEFI systems.
If it doesn't work, you can download a newer Memtest86+ release, write it to a USB stick, and boot Memtest86+ from that. It should work on both UEFI and BIOS systems. It can be downloaded from the official website.
Memtester
Memtester has the same purpose as Memtest86+, but it runs while your system is running, so it can't check the whole RAM range, only the amount of free RAM you specify, which is limited by what is available at the moment you run the program. It can be installed with your package manager of choice using memtester as the package name.
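Since memtester can only test memory that is free at the moment, it helps to check how much is available before choosing a size. Below is a minimal sketch, assuming a Linux system that exposes MemAvailable in /proc/meminfo. The function names and the 256 MiB headroom are my own choices for illustration, not something memtester prescribes.

```python
import re

def available_mib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo content, in MiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) // 1024

def memtester_command(meminfo_text, headroom_mib=256, loops=1):
    """Build a memtester command line that leaves some memory for the system.

    headroom_mib is an assumed safety margin, not a memtester requirement.
    """
    size = available_mib(meminfo_text) - headroom_mib
    if size <= 0:
        raise ValueError("not enough free memory to test")
    # memtester takes the amount of memory to test and a number of loops
    return ["memtester", f"{size}M", str(loops)]

# Example with hand-written sample data instead of the real file
sample = "MemTotal:  2048000 kB\nMemAvailable:  1048576 kB\n"
print(" ".join(memtester_command(sample)))
```

On a real system you would pass the content of /proc/meminfo instead of the sample text. memtester normally has to run as root (for example through sudo) so that it can lock the memory it tests.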
How to fix broken RAM
First of all, if neither memtest86+ nor memtester shows any errors, congratulations! You most likely don't have any problems with your RAM.
If it shows a small number of errors, like one or two, you can tell the Linux kernel to ignore those segments of RAM, so programs never use them and keep working reliably. To do that, use the memmap kernel argument in your GRUB configuration. For example:
memmap=0x100000$762ce9c38420,0x100000$34e03060,0x100000$87fce060,0x100000$23c63060,0x100000$87b6c060
There is also a GRUB config option called GRUB_BADRAM, but it looks like it is deprecated and memmap is preferred.
For more details about blacklisting bad segments of RAM read this comprehensive Stack Overflow answer.
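Writing a memmap argument by hand gets tedious when there are several bad addresses. Here is a minimal sketch that builds one from a list of failing addresses. The function name and the strategy of rounding every address down to the start of a 1 MiB region are my assumptions, and the addresses are written with a 0x prefix, which is how the kernel documentation writes hexadecimal values. Check the result against your own memtest output before using it.

```python
def memmap_argument(bad_addresses, region_size=0x100000):
    """Build a memmap= kernel argument that reserves a region around each bad address.

    region_size defaults to 1 MiB; this is an illustrative choice.
    """
    # Round each address down to the start of its region and drop duplicates,
    # so two errors in the same region produce only one entry
    starts = sorted({addr - (addr % region_size) for addr in bad_addresses})
    # size$start marks the region as reserved, so the kernel does not use it
    return "memmap=" + ",".join(f"{region_size:#x}${start:#x}" for start in starts)

# Example: two of the three addresses fall into the same 1 MiB region
print(memmap_argument([0x34E03060, 0x34E03FFF, 0x23C63060]))
```

Note that the `$` character usually has to be escaped when the argument goes into the GRUB configuration file. The exact escaping depends on your setup, so verify the final kernel command line in /proc/cmdline after rebooting.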
If it shows a large number of errors, like many thousands, one of your sticks of RAM is probably broken. To find out which one, you can look at the failing addresses, or run the test again with only specific stick(s) installed and see whether the errors are gone.
Be aware that if you are left with one RAM stick, there is a chance that the system will only boot with it in a specific slot. Read your motherboard manual if something doesn't work.
If you have a RAM stick with tons of errors, you can try to repair it. I can't tell you exactly how it is done or why it is done that way. You can find videos on fixing RAM sticks on YouTube and other resources. Here is a link to one such video.
RAM Optimizations of GNU+Linux system
If your RAM was broken and you are left with much less memory than you expected, don't rush out to buy new RAM sticks. There is a chance that the system will work completely fine even with less RAM. Linux is pretty good at running on low-end machines, and it has several ways to handle a lack of memory. In the modern world people often own devices far more powerful than their tasks require, like a gaming laptop with a very powerful CPU and GPU that is mostly used to render text in a text editor.
Swap
Swap is a partition (or a file) on your disk that is used when there is no free RAM left. It is used for other reasons too, and having swap is recommended on most GNU+Linux systems.
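To see which swap areas your system currently has and how full they are, you can read /proc/swaps. The sketch below parses its content. The function name and the returned dictionary keys are my own, and the sample text is hand-written rather than taken from a real machine.

```python
def parse_swaps(text):
    """Parse the content of /proc/swaps into a list of dictionaries."""
    areas = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        name, kind, size_kib, used_kib, priority = fields
        areas.append({
            "name": name,
            "type": kind,
            "size_mib": int(size_kib) // 1024,  # the file reports KiB
            "used_mib": int(used_kib) // 1024,
            "priority": int(priority),
        })
    return areas

# Hand-written sample in the same column layout as /proc/swaps
sample = (
    "Filename\tType\tSize\tUsed\tPriority\n"
    "/dev/sda2\tpartition\t16777212\t1048576\t-2\n"
)
for area in parse_swaps(sample):
    print(area)
```

On a real system you would pass the content of /proc/swaps instead of the sample.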
You can configure how readily the Linux system uses swap by changing swappiness. You can read about changing that setting and learn about swap in general in the link below.
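As a small illustration, the current swappiness value can be read from /proc/sys/vm/swappiness, and a persistent setting is usually a single line in a sysctl configuration file. The sketch below assumes that layout. The helper names are mine, and the accepted range of 0 to 200 applies to recent kernels (older ones stop at 100).

```python
from pathlib import Path

def current_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the current vm.swappiness value from procfs."""
    return int(Path(path).read_text().strip())

def sysctl_line(value):
    """Return the sysctl configuration line for the given swappiness value."""
    # Recent kernels accept 0..200; older kernels only accept 0..100
    if not 0 <= value <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    return f"vm.swappiness = {value}"

# A lower value makes the kernel less eager to swap
print(sysctl_line(10))
```

The printed line can go into a file under /etc/sysctl.d/ (the exact file name is up to you), and `sysctl --system` reloads the settings.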
Zram
Zram sits between RAM and swap in terms of performance: it keeps swapped-out data compressed in RAM instead of writing it to disk. It helps your system stay responsive if it swaps a lot, at the cost of higher CPU usage. I use Zram on a machine with 2GiB of RAM and 16GiB of swap, and it works great even with many programs open at the same time (text editor, browser, Docker container, messenger).
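To check whether Zram is actually paying off, you can look at how well it compresses your data. The kernel exposes statistics for each device in /sys/block/zram0/mm_stat (zram0 being the first device), where the first two fields are the original and the compressed data size in bytes. Here is a minimal sketch with a function name of my own and sample numbers that are made up for illustration.

```python
def zram_compression_ratio(mm_stat_text):
    """Return original size divided by compressed size, or None if zram is empty."""
    fields = [int(value) for value in mm_stat_text.split()]
    # First field: uncompressed size of stored data; second: compressed size
    orig_data_size, compr_data_size = fields[0], fields[1]
    if compr_data_size == 0:
        return None  # nothing has been stored in zram yet
    return orig_data_size / compr_data_size

# Made-up sample in the layout of mm_stat
sample = "4096000 1024000 1100000 0 1100000 10 0 0"
print(zram_compression_ratio(sample))
```

A ratio well above 1 means the same amount of physical RAM holds noticeably more swapped-out data than it would uncompressed.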
Less bloat software
As an alternative, you can simply use less bloated software, so you don't need that much RAM in the first place. In many cases good software doesn't require a lot of RAM, while bad software tends to leak memory, so you need many GiBs of RAM to use it properly. The most bloated software is usually a web browser, such as Chromium or Firefox, and browser-based apps built on Electron, such as Slack, VSCode, and other proprietary products.
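To find out which programs are the heaviest on your own machine, you can compare the resident memory (VmRSS) of running processes. The sketch below reads it from /proc/<pid>/status, assuming a Linux procfs. The function names are mine, and per-process numbers count shared pages more than once, so treat the result as a rough ranking rather than an exact measurement.

```python
import re
from pathlib import Path

def parse_status(text):
    """Extract (name, resident KiB) from /proc/<pid>/status content."""
    name = re.search(r"^Name:\s+(.+)$", text, re.MULTILINE)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", text, re.MULTILINE)
    if name is None or rss is None:
        return None  # kernel threads have no VmRSS line
    return name.group(1), int(rss.group(1))

def top_memory_users(proc_root="/proc", limit=5):
    """Return the processes with the largest resident memory, biggest first."""
    usage = []
    for status in Path(proc_root).glob("[0-9]*/status"):
        try:
            parsed = parse_status(status.read_text())
        except OSError:
            continue  # the process exited or is not readable
        if parsed is not None:
            usage.append(parsed)
    return sorted(usage, key=lambda item: item[1], reverse=True)[:limit]
```

Calling `top_memory_users()` returns a list of (name, KiB) pairs. A browser usually shows up as several separate processes, so it can be worth adding up entries with the same name.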
Conclusions
Now you have directions about what to do when you suspect RAM failure. That knowledge can also be used for testing when you buy used memory sticks from someone else.