Cosmic ray threat?

I was intrigued by the post on the Ksplice blog titled Attack of the Cosmic Rays. In it, the author tracked an occasional segfault in the expr program down to a single randomly flipped bit in the RAM cache of the program. The author guesses it was due to a cosmic ray hit and offers the ZDNet article DRAM error rates: Nightmare on DIMM street as a clue. It also states that such errors are quite common, 1 bit per 4GB per day, a figure cited from On the need to use error-correcting memory, written in 2010.

That sounded strange to me. At this rate, with today's memory sizes, such errors should be happening regularly. Do they not, or do I just not notice them? There is no lack of claims on the internet about the consequences of errors induced by cosmic rays. The problem with sudden acceleration in Toyota cars might be linked to cosmic rays. Cosmic rays are suspected as a possible cause for two dives of an Airbus A330 in 2008. Even the Voyager 2 communication problems in 2010 have been attributed to a single memory location whose value has been changed from a 0 to a 1 "unexpectedly". The Mars Express orbiter got hit once, leaving a stripe on one of the images, which sparked a conspiracy theory. However, none of these sources is original research.

How often does it really happen, say, on the Earth's surface? Where did the rate of 1 bit per 4GB per day come from? Can I plausibly blame a cosmic ray for some bug I made in software? :) I decided to dive into Google Scholar and browse for some research papers. I again noticed a big problem in double-checking the facts: a lot of research is not free, and I could check neither the results nor the methodology. We really need more open science.

Review

There were only a handful of freely available papers that I could review. Different papers use different units of measure for bit upsets per amount of memory per amount of time. I will convert all of them into bits/GB/day so that they are consistent and easy to use in subsequent back-of-the-envelope calculations.

Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design [pdf], a 2012 paper by Hwang, Stefanovici and Schroeder, reports high values, but doesn't attribute them all directly to cosmic rays. Since the vast majority of all memory errors (65% - 82%) were hard errors, the paper stops researching soft errors and we're left only with clues. The assumption was that repeat errors at the same location are statistically likely to be hard errors, because it is unlikely that the same location would be hit twice by a cosmic ray or other random noise. The soft error rates, among which there must have been cosmic ray events, were at most 3.72 and 11.98 upset bits per GB per day. That is the data from two different supercomputers, IBM Blue Gene/L and IBM Blue Gene/P, respectively. There's also data from Google, but it is limited to the machines that have experienced memory errors, so the results from that system don't apply here. I suspect that the data gathered from the other two systems also favored machines where errors occurred, since the soft error rate is intuitively too high. At the Blue Gene/L rate of 3.72 bits/GB/day, a reference system with 8GB of RAM would see a soft error roughly every 48 minutes. This was an interesting paper to read, by the way. Do read it if you can spare the time (or read a good recap here: DRAM errors soft and hard from storagemojo.com), but I don't think it is a reliable source for learning the cosmic ray incidence rate.

Single Event Upset at Ground Level [pdf], a 1996 paper by Eugene Normand, measured approximately 0.04 bits/GB/day. For an upset to happen, someone with 8GB of fully occupied RAM would have to wait a little more than 3 days.

Radiation-Induced Soft Errors in Advanced Semiconductor Technologies [pdf], a 2005 paper by Robert C. Baumann, states 1.23 upset bits per GB per day. For the setup above, wait about 3 hours for this one to strike.

Soft Errors in Advanced Computer Systems [pdf], another 2005 paper by Robert Baumann, cites 0.27 bits per GB per day for a single chip. With 8GB, one would need to wait about half a day for an error to occur.

A 2007 paper, A Memory Soft Error Measurement on Production Systems [html] by Li et al., summarizes some other papers and cites error rates between 0.039 bits/GB/day and 0.98 bits/GB/day (from about 3 days to about 3 hours of waiting for the first event), but their own measurement yielded a really low soft error rate of 0.0001 bits/GB/day. That is a few years of waiting for someone with 8GB of RAM.

Li et al.'s research is cited by the 2009 paper DRAM Errors in the Wild: A Large-Scale Field Study [pdf] by Schroeder, Pinheiro and Weber (the Toronto study), which measured a correctable error rate of 4.91 bits/GB/day as a lower bound! That is about half an hour until the first error occurs. They put the upper bound at an unbelievable 14.74 bits/GB/day. They do state, though, that the difference in measurements is partly caused by them measuring all correctable errors, that is, both hard and soft errors.

The test

If soft errors are as common as one every few days, this should be fairly easy to reproduce, even in environments where cosmic rays are not at their maximum. I ran the test on a workstation with 4 x 2GB PC3-10600U RAM chips without ECC, inside a building, at about 300 meters elevation and 46° latitude. I concurrently ran two separate experimental programs, written in C. Each of them reserved a gigabyte of RAM; one then set all memory bits to 1 and the other set all bits to 0. I turned off the pagefile in Windows, to keep the memory from being stored on disk instead of on the RAM chips. I ran the experiment for about 6 weeks during the summer.


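// ZERO
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <windows.h> // Sleep()

int main(void)
{
    const size_t size = 1024u * 1024u * 1024u; // one gigabyte
    unsigned char *raw = malloc(size);
    if (raw == NULL) {
        fprintf(stderr, "Could not allocate memory\n");
        return 1;
    }
    memset(raw, 0, size);
    // volatile makes the compiler re-read RAM on every check
    volatile unsigned char *mem = raw;
    size_t i = 0;
    for (;;) {
        if (mem[i] != 0) {
            time_t seconds = time(NULL);
            printf("\nCosmic ray attack!!\n");
            printf("Time:  %lld\n", (long long)seconds);
            printf("At:    %p\n", (void *)&mem[i]);
            printf("Value: %x\n", mem[i]);
            mem[i] = 0; // reset the byte so a later flip is caught too
        }
        ++i;
        if (i >= size) {
            printf(".");
            fflush(stdout); // show the progress mark right away
            i = 0;
            Sleep(15 * 60 * 1000); // wait 15 minutes, no hurry
        }
    }
}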
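// ONE
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <windows.h> // Sleep()

int main(void)
{
    const size_t size = 1024u * 1024u * 1024u; // one gigabyte
    unsigned char *raw = malloc(size);
    if (raw == NULL) {
        fprintf(stderr, "Could not allocate memory\n");
        return 1;
    }
    memset(raw, UCHAR_MAX, size);
    // volatile makes the compiler re-read RAM on every check
    volatile unsigned char *mem = raw;
    size_t i = 0;
    for (;;) {
        if (mem[i] != UCHAR_MAX) {
            time_t seconds = time(NULL);
            printf("\nCosmic ray attack!!\n");
            printf("Time:  %lld\n", (long long)seconds);
            printf("At:    %p\n", (void *)&mem[i]);
            printf("Value: %x\n", mem[i]);
            mem[i] = UCHAR_MAX; // reset the byte so a later flip is caught too
        }
        ++i;
        if (i >= size) {
            printf("|");
            fflush(stdout); // show the progress mark right away
            i = 0;
            Sleep(15 * 60 * 1000); // wait 15 minutes, no hurry
        }
    }
}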

Nothing happened. I didn't detect a single error.

