Google: Your RAM is Out to Get You!

Your computer just crashed, taking with it that document you’ve been working on for the past hour. Naturally, you cry out in anguish over all those terrible software bugs that conspire to crash your computer and drive you nuts. But new evidence suggests that something else could be the cause: bad memory.
Google just announced the results of a new study based on the thousands of servers they manage in their data centers, and the results are surprising. Memory error rates are thousands of times higher than anyone previously believed. (CNET has a good summary of the research.)
For software testers, this represents a significant new wrinkle in identifying bugs. Wonky behavior and computer crashes could really be a hardware problem, and these sorts of hardware issues are way more common than we previously thought.
So what does that mean for your testing? Here are a few thoughts that could help.
Memory errors aren’t always a problem.
For most people, a memory error isn’t a real issue. An error usually manifests itself as a single bit flipping when it shouldn’t, and out of the many gigabytes of memory on a typical computer, it’s incredibly unlikely that a single bit is all that important. More likely, that memory space contains things like image data, multimedia, or even nothing at all.
Of course, the likelihood that any given section of memory contains something of value goes up as you run more applications at once, or when those applications are working with critical data. Servers have the most to fear, but a “power user” could just as easily run into issues.
Memory errors are usually related.
Conventional wisdom is that most memory errors are “soft” errors caused by things like cosmic radiation, stray neutrons, or some other kind of unpredictable event. “Hard” errors caused by faulty hardware, on the other hand, are supposedly the least common failures. Google’s study suggests the opposite: memory errors occur in clusters and are highly correlated on the same DIMM. That points to bad DIMMs being more common than previously believed. If your computer reports any kind of memory problem, then replace the affected DIMMs as soon as possible.
ECC memory is great – when you can afford it.
ECC memory catches and corrects certain kinds of errors before they become a problem. The downside is that it’s expensive and power hungry. But if you need reliability, then Google’s study makes a strong case that it’s worth it. For testers on desktop systems, ECC memory could be a valuable addition.
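To make “catches and corrects” a little more concrete, here is a toy sketch of the idea using a Hamming(7,4) code in Python. This is an illustration only, not how any particular DIMM works: real ECC memory typically protects much wider words (for example, 64 data bits with 8 check bits) in hardware. The function names encode, correct, and decode are made up for this example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(word):
    """Return (corrected word, position of the flipped bit or 0 if none)."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the bad position
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1  # flip the bad bit back
    return fixed, position


def decode(word):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [word[2], word[4], word[5], word[6]]


if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    damaged = list(stored)
    damaged[4] ^= 1  # simulate a single bit flipping in memory
    repaired, position = correct(damaged)
    print("flipped position:", position)
    print("recovered data:", decode(repaired))
```

Note the limits, which match the “certain kinds of errors” caveat: this toy code repairs any single flipped bit in a word, but two flipped bits in the same word will be miscorrected. Practical ECC schemes add an extra check bit so that double-bit errors are at least detected.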
Good hardware is critical.
Another outcome of the study is that memory failures are correlated with motherboard types. That means that some motherboards are more likely to cause a failure than others. Why that’s true remains uncertain, but it may be related to internal electromagnetic interference. Determining the most stable motherboard to use for software testing is pretty tough at this point, but the best advice is to pay very close attention to reviews and reports about system uptime and performance for each brand under consideration.
For laptops and mobile devices, your work may be even harder because vendors will change hardware suppliers at the drop of a hat. Look at the stability of the device as a whole, and be wary of devices with a large number of reported failures.
Reproduce, reproduce, reproduce.
With a study like this, it’s now even more important for testers to try to reproduce bugs they identify. A one-time failure or event is less valuable than something that happens over and over again. For intermittent bugs, that means taking good notes about the crashes and failures you experience and then looking for trends and patterns.
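As a rough sketch of what “looking for trends and patterns” could mean in practice, the Python snippet below sorts failure notes two ways. It rests on a heuristic that is my assumption, not a finding of the study: a failure seen on several machines is probably a software bug, while a machine that fails in several unrelated places that nobody else can reproduce deserves a memory test. The function name classify_failures and the (machine, test) note format are hypothetical.

```python
from collections import defaultdict


def classify_failures(notes):
    """Split failure notes into likely software bugs and suspect machines.

    notes: iterable of (machine, test) pairs, one per observed failure.
    Returns (likely_software, suspect_hardware) as sorted lists.
    """
    machines_per_test = defaultdict(set)
    tests_per_machine = defaultdict(set)
    for machine, test in notes:
        machines_per_test[test].add(machine)
        tests_per_machine[machine].add(test)

    # A failure reproduced on more than one machine points at the software.
    likely_software = sorted(
        test for test, machines in machines_per_test.items() if len(machines) > 1
    )
    # A machine failing several tests that fail nowhere else is suspicious.
    suspect_hardware = sorted(
        machine
        for machine, tests in tests_per_machine.items()
        if len(tests) > 1 and all(len(machines_per_test[t]) == 1 for t in tests)
    )
    return likely_software, suspect_hardware


if __name__ == "__main__":
    notes = [
        ("pc-a", "login"),
        ("pc-b", "login"),
        ("pc-c", "export"),
        ("pc-c", "print"),
        ("pc-c", "search"),
    ]
    print(classify_failures(notes))
```

A result like this is only a prompt for further checking. A machine flagged as suspect still needs a proper memory diagnostic before anyone blames the DIMMs.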
What else?
I am very interested in learning what the community thinks. In many ways, this information changes very little because we’ve lived with this problem all along. On the other hand, it creates a new wrinkle in the world of software testing and reliability when it comes to tracking down the root cause of a problem. What do you think?

Would be interesting to know where these failing components were manufactured... am betting overseas so I can rant about the evils of outsourcing/off-shoring.
This is where the hardware guys step up and do burn-ins of new equipment. And where dev teams step up and do real load testing on systems, especially when changes are made to the underlying infrastructure. And speaking of changes, this is where companies have a good change management program and IT groups are communicating with each other.