
Attempt to find modulo bias in .NET randomization

Dice five (Photo credit: @Doug88888)

Modulo bias is an enemy of all randomized algorithms: it can easily introduce systematic bias where we expect none. Is there modulo bias in .NET randomization? I tried to find it in the Random::Next(int) function. This function takes a range from which it returns a random number, and if the initial random number is narrowed down to the specified range by the modulus operator, maybe I could show some modulo bias signal. I was testing .NET Framework version 3.5 SP1.

tl;dr: There is a clear bias toward some numbers, and its order of magnitude suggests a modulo bias effect, but the distribution does not conform to the one you would expect from modulo bias. Maybe it was accounted for, but the fix failed to do it correctly. It is still pretty safe to use in applications such as poker or roulette games, where only a small number of different values are considered.

edit: After publishing this article, some comments appeared below and on reddit pointing out that I could have used the dotNetPeek tool to look into mscorlib.dll and actually see the algorithm without lengthy analysis. The explanation is that the original random number is a double in the range [0,1), and it is this value that is later multiplied by the maxValue given by the user of the Random::Next(max) function. The bias I have shown is not modulo bias, but the effect of imprecision in the representation of floating point numbers. Read on only if you're interested in an explanation of modulo bias or in the methodology I employed to find it. And, of course, pretty diagrams.
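For reference, here is a minimal sketch of what the commenters describe: a simplification of the idea, not the actual mscorlib source.

static int NextLikeDotNet(Random r, int maxValue)
{
	double sample = r.NextDouble(); // uniform-ish double in [0, 1)
	return (int)(sample * maxValue); // scaled out to [0, maxValue), no modulus involved
}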

Introduction to modulo bias

This bias is caused by the modulus operator trying to squeeze a big random number, or the result of a pseudorandom number generator, into a smaller set of numbers. If we want to generate a random number that simulates a dice throw, our only possible outcomes are 1-6. Random number generators oftentimes don't accept a limit on the maximum acceptable number. For example, the C++ standard library has a rand() function that returns a random number between 0 and RAND_MAX (which is defined in the library itself). Such a number can be squeezed into one of the numbers on a dice by the modulus: number % 6. This, however, will not produce a uniform distribution of random numbers.

Now say RAND_MAX is 10, for illustration purposes. That leaves us with two possible events in which the dice results in 2, 2 % 6 = 2 and 8 % 6 = 2, but only one possible event for the dice to result in 5, 5 % 6 = 5 (11 % 6 = 5 would too, but 11 is beyond RAND_MAX and can't be produced). This program will obviously favor smaller numbers, and do so systematically.

It matters. Quicksort, for instance, can efficiently sort an array, given that the correct pivot is chosen. A straightforward implementation of quicksort will take the first item of the array, declare it to be the pivot element, and partition around it. We would hope there is a great chance for this pivot element to land somewhere near the middle of the sorted array, since only in that case is quicksort any more efficient than a naive attempt. In practice, however, lots of arrays are almost sorted to begin with. (Think of a list of web site members sorted by date of sign-up, for instance.) An already sorted array results in the worst-case scenario for quicksort. We might try to avoid this by choosing a random pivot, but the bias will favor the start of the array anyway. We might try to shuffle it, but because of the bias, indices near the start of the array are more likely to be chosen for swapping. All these factors will work against us and degrade quicksort's efficiency in reality.
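To make this concrete, here is a minimal sketch (my own illustration, not code from any real library) of a Fisher-Yates shuffle whose index selection goes through a biased "rand % n" reduction, with a deliberately tiny RAND_MAX of 10 standing in for C's rand():

static void BiasedShuffleDemo()
{
	Random rng = new Random();
	const int RandMax = 10; // tiny on purpose, so the bias is obvious
	int[] firstSlotCounts = new int[6];
	for (int trial = 0; trial < 1000000; trial++)
	{
		int[] a = { 0, 1, 2, 3, 4, 5 };
		for (int i = a.Length - 1; i > 0; i--)
		{
			int smallRand = rng.Next(RandMax + 1); // 0..10 inclusive, like rand()
			int j = smallRand % (i + 1); // modulo-biased index: favors small j
			int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
		}
		++firstSlotCounts[a[0]]; // track which value ended up in the first slot
	}
	for (int v = 0; v < 6; v++)
		Console.WriteLine("{0}\t{1}", v, firstSlotCounts[v]); // ~166667 each if the shuffle were fair
}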

Demonstration

Let’s test it on a simpler case, by throwing dice, as is the habit in attempts to explain some statistical matters.

 

static void ModuloDemo(int trials, int max_random, int mod)
{
  Random r = new Random();
  int[] count_num_mod = new int[mod]; // one bucket per possible outcome
  for (int i = 0; i < trials; i++)
  {
    int num = r.Next(max_random); // "big" random number in [0, max_random)
    ++count_num_mod[num % mod];   // squeeze it into [0, mod) by modulus
  }
  using (System.IO.StreamWriter file =
    new System.IO.StreamWriter("C:\\matej\\modulo_demo.txt"))
  {
    for (int i = 0; i < mod; i++)
    {
      file.WriteLine("{0}\t{1}", i, count_num_mod[i]);
    }
  }
}

 

20 million dice throws, just because it is that easy, with RAND_MAX = 10 and modulus 6: numbers 4 and 5 appeared half as many times as the other numbers did.

 

0 3998026
1 4001105
2 3999043
3 4000268
4 2000263
5 2001295

 

If we didn't know what the modulus is (but let's say we know there is one), how could we discover it, so that maybe we could try to exploit it? As users of a function that generates random numbers, we can surely control two variables: the number of trials and the mod limit. The modulus that was set should be easy to discover. Let's do some more tests, this time changing the maximum acceptable number, that is, the modulus, while RAND_MAX remains the same at 10 (a driver sketch follows the table). Rows represent event results; columns are the different moduli used to get random numbers in the specified range:

 

result\mod       5        6        7        8        9       10
0          1998554  1999419  2001234  1998785  1999650   998746
1          2000138  1999411  2001294  1999674  1000746  1001346
2          2001390  2000699  1997880  1001123  1000804   999706
3          2001194  2001232   999332   999972   999518   999412
4          1998724  1000272   999619  1000236   999531   999639
5                    998967   999951   999449  1000266   999213
6                            1000690  1000155   998846   998720
7                                     1000606  1000447  1001008
8                                              1000192  1001706
9                                                       1000504
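For completeness, a driver loop like this sketch reproduces the table (note that ModuloDemo as listed overwrites the same output file each run, so in practice you would vary the filename per modulus):

for (int mod = 5; mod <= 10; mod++)
{
	ModuloDemo(20000000, 10, mod); // 20 million throws per modulus, RAND_MAX = 10
}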

 

We see that by repeating trials we can't exactly pinpoint RAND_MAX, but at least we can get a number in the same residue class. The data also suggests a clear, intuitive formula for calculating one such number: to the mod we selected, add the number of experiment outcomes that are statistically significantly bigger than the others. Take mod 7, for instance: 3 results are twice as large, and 7 + 3 = 10 is congruent (mod 7) to the real RAND_MAX, which is also 10. If no event stands out, this can mean one of three things: our mod happens to divide the generator's range evenly, the modulo bias has been accounted for, or the algorithm doesn't work with a modulus at all.
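As a code sketch (my own helper, nothing from .NET), the rule reads like this; it assumes System.Linq is available:

// Buckets that stand noticeably above the rest are the "doubled" ones;
// mod + (number of elevated buckets) lands in the same residue class
// modulo our mod as the hidden RAND_MAX.
static int InferCongruentToRandMax(long[] bucketCounts, int mod)
{
	long min = bucketCounts.Min();
	int elevated = bucketCounts.Count(c => c > min * 1.5); // crude 1.5x cutoff
	return mod + elevated; // e.g. mod 7 with 3 elevated buckets -> 10
}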

Detecting modulo bias

If the algorithm does a modulus and the modulo bias is not corrected, there is a great chance we will discover this. However, in the table above such a dramatic difference is visible only because RAND_MAX is small relative to the selected modulus. If RAND_MAX is big, as it probably will be in a real application, the modulo bias gets diluted and hard to notice. If we were to squeeze a range of 101 values into mod 5, there would be 20 fair chances to select the first number and only 1 extra chance. In other words, 20/21 of the count for the first number would be fair; only 1/21 would be the modulo bias signal. This means that if I want to discover this signal in a real-life library, I have to select a mod large enough for the signal to be noticeable. How large is large enough? I'd say about 3/4 of RAND_MAX, so that the second half of the range overflows across half of the mod; but I can't aim for that directly, since I don't know what RAND_MAX is anyway.
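The expected size of the signal is easy to write down. Here is a small helper sketch (mine, and I will refer back to it in the results) that computes the ratio a pure modulo bias would produce:

// For a generator producing `range` equally likely values reduced by % mod,
// the favored low results get one more chance than the rest; the ratio of
// the two counts is the pure modulo bias signal.
static double ExpectedFirstOverLast(long range, long mod)
{
	long fair = range / mod; // chances for the unfavored results
	long favored = fair + (range % mod == 0 ? 0 : 1); // low results get one extra
	return (double)favored / fair; // e.g. range=101, mod=5 -> 21.0/20 = 1.05
}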

Choosing the mod is also part of what the test will have to try. Of course, on the off chance that RAND_MAX is really, really big, the test doesn't have to store all random results as the code above did to illustrate the problem. Comparing the first and the last possible outcome should suffice: if there is a modulo bias effect, it will show on the first number and it will not show on the last. That makes the test all the easier, because we don't have to search for the place in the distribution where the bias signal stops and counts fall to the fair level, just to see whether there is a measurable effect at all. Also, the O(n) buckets needed for storing all random numbers could quickly exceed available memory when a big modulus is tested. For each possible outcome (bucket) there should be, say, 1000 hits to really see the percentage difference; I tried with only a hundred, but then the noise is too big to see the signal. That means the number of trials will have to be about 1000 times the modulus.

The next question was how to increment the test modulus. Each successive modulus should be half again as big as the previous one (multiplied by 1.5). This scheme should find the bias in any case, because if the previous test just missed RAND_MAX, the following one should overflow it by half (this follows from my 3/4 intuition mentioned earlier, though it didn't matter up to this point).

 

static void ModuloTest( Random r, ulong trials, int max_random)
{
	int num = 0;
	int first = 0;
	int last = 0;
	int last_number = max_random - 1; // to avoid billions of same calculations
	for (ulong m = 0; m < trials; m++)
	{
		num = r.Next(max_random);
		if (num == 0)
			++first;
		else if (num == last_number)
			++last;
	}
	using (System.IO.StreamWriter file =
		new System.IO.StreamWriter("C:\\matej\\modulo_test.txt", true))
	{
		file.WriteLine("{0}\t{1}\t{2}\t{3}", max_random, first, last, (float)first / (float)last);
	}
	Console.WriteLine("{0}\t{1}\t{2}\t{3}", max_random, first, last, (float)first / (float)last);
}

static void Main(string[] args)
{
	Random r = new Random(); // one generator shared across all test runs
	for (int i = 2; i < (int.MaxValue / 4) && i >= 0; i = (int)(i * 1.5)) // + half -> my 3/4 rule
	{
		ulong population = 1000 * (ulong)i;
		ModuloTest(r, population, i);
	}
}

 

It didn't show any clear signs of modulo bias up to a modulus of a few million, and it was also really slow. I didn't expect it to anyway: since the maximum random number .NET can return is the maximum value of int, I figured the initial random number must be at least that big. I stopped this test in favor of one that could work from an estimate. The second time around, I tried to predict the RAND_MAX of .NET and, based on that, pick a prime that is significantly large in comparison. Then I would spend billions of random numbers on acquiring precision at that single modulus, instead of hunting for the most representative modulus as in the previous test. This is the code for my second test:

 

static void ModuloTest2(Random r, int max_random)
{
	// this should converge to small modulo bias with
	// larger and larger population of random numbers
	// control experiment is last and pre-last numbers
	int num = 0;
	int first = 0;
	int last = 0;
	int prelast = 0;
	int last_number = max_random - 1; // to avoid billions of same calculations
	int prelast_number = max_random - 2; // to see inherent variation; that which is not modulo bias, noise
	for (ulong m = 0; m < ulong.MaxValue; m++)
	{
		num = r.Next(max_random);
		if (num == 0)
			++first;
		else if (num == prelast_number)
			++prelast;
		else if (num == last_number)
			++last;
		if ((m % 1000000000)==0) // for each N attempts, print some info
		{
			using (System.IO.StreamWriter file =
		new System.IO.StreamWriter("modulo_test_another.txt", true))
			{
				file.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}\t{5}", m, first, prelast, last, (float)first / (float)last, (float)prelast / (float)last);
			}
			Console.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}\t{5}", m, first, prelast, last, (float)first / (float)last, (float)prelast / (float)last);
		}
	}
}

 

The first attempt was to run it with a modulus of 15,485,863, which is a prime and a large number compared to my previous test, which didn't even reach 10 million. I estimated that the maximum possible integer value divided by this number should give the first number an advantage of approximately .7%, which, if the population of random numbers is sufficiently large, should be visible as a modulo bias signal.

After 3.5×10¹² randomly generated numbers, the statistics emerged: the first number was generated about .8% more often than the last; the first/last ratio was 1.008. There was a caveat though: as a control experiment I also compared the next-to-last and last numbers. I expected those counts to be about the same and the ratio approximately 1. But they weren't: the next-to-last/last ratio was 1.006. The first number was still consistently larger than the next-to-last, so there was obviously some bias toward the smaller numbers, as shown in the diagram below, but the control ratio would have to be much smaller before I could proclaim this a signal of modulo bias.

 

Ratio of results from .NET randomization. First and last are with regard to the range of possible return values of the Random::Next function. The diagram shows how the buckets for specific numbers accumulated over repeated trials, consistently in favor of the first number. The last result would have been just as good a reference, since there is no history in independent events, but there is never enough of shiny diagrams.

 

I ran the test again. The modulus used this time was 217,645,177, which is already about 1/10 of the largest int value in C#. This should give a strong signal of about 10% bias toward the first number, and if the noise in the randomization is big enough to cover a .7% signal, this should answer the question. The test needed a lot of random numbers to fill this many buckets, but I had time. I left four instances running overnight on a workstation at work, just because it has an 8-core CPU. Here are the results:

 

Pop*10⁹  First  Pre-last  Last   First/Last  Pre-last/Last
2434     11475  11129     10119  1.134005    1.099812
2434     11169  11237     10049  1.111454    1.118221
2429     11233  11116     10123  1.109651    1.098094
2434     11354  11304     10258  1.106843    1.101969

 

I don't know what to make of it. The order of magnitude of the bias is right for both moduli, in all 5 tests. However, the bias shows up consistently only at the last number, which is undercounted, and that is not the way you would predict from modulo bias. To me this smells of some off-by-one error. Maybe they were trying to correct for the modulo bias, but didn't cut the range off at the right index? If I am right, then all numbers but the last should be distributed uniformly. So far I had only checked the next-to-last number as a control group; I didn't check other buckets, for the sake of simplicity and to maximize the chance of comparing the last number against numbers that are also unaffected by modulo bias.

In the next test I checked the distribution at six places: the same three as before, plus one at each quarter of the range. For illustration, if the modulus were 1000, I would check 0, 250, 500, 750, 998 and 999. Just because I can, I also chose a different prime for the mod, about the same size as before: 256,203,161. Now I expected all buckets to be about the same, with just a drop in the distribution at the last one.

Just when I thought I had caught Microsoft in an off-by-one error, there was this surprise:

 

             Population  First  Q1     Q2     Q3     Pre-last  Last
Experiment 1 3.139E+12   13205  13327  11428  13159  11821     11801
Experiment 2 3.157E+12   13092  13228  11547  13162  11800     11787
Experiment 3 3.140E+12   13122  13056  11645  13161  11593     11768
Experiment 4 3.136E+12   13001  12987  11508  13149  11555     11624

Distribution of .NET randomization. Four experiments with mod=256203161; Q1=64050790, Q2=128101580, Q3=192152370; population about 3×10^12.

 

Now the “consistent” bias toward only the last number is gone; a drop is also present right in the middle of the range. The amount of bias is right for modulo bias (1 - Q2/Q1 [11.54%] is about the same as mod/int.MaxValue [11.93%]), but the locations are all wrong. The distribution should show bias toward a certain run of numbers at the beginning, and after that bias has dropped off, it shouldn't just reappear, as it does at point Q3. Hey, maybe it was just a mistake that I will have to explain somehow later (a seed problem, something else), but for the time being I decided to simply repeat the experiment.

 

             Population  First  Q1     Q2     Q3     Pre-last  Last
Experiment 1 3.982E+12   16819  16497  14828  16674  14649     14786
Experiment 2 3.981E+12   16807  16880  14930  16496  14894     14661
Experiment 3 3.998E+12   16967  16773  15017  16728  14674     14621
Experiment 4 4.010E+12   16879  16814  14932  16669  14993     15098

Distribution of .NET randomization. The same four experiments with mod=256203161; the population is this time somewhat greater, about 4×10^12.

 

Same experiment, even with a larger population of random numbers, and the same pattern emerged, with the same bias ratios. The second batch was run a day later, so even if the generator takes a timestamp as its default seed, the seed couldn't have been the same. Although this doesn't conclusively prove the seed isn't to blame, I just don't believe it and will not dig into that area now.

Maybe Q2 plummets because it is right in the middle? Would Q2-1 or Q2+1 be different? Are there any other such plummets in the distribution? Would an increased mod still produce the distribution bias predicted by modulo bias?

I tried another experiment to answer those questions. This time I ran fewer instances of the same test, because the previous experiments showed no significant variation in distribution across 8 identical runs, so there was no need for four of the same at once. Because of that I could afford two different moduli, two runs each, but with twice as many checkpoints plus -2, -1, +1, +2 offsets from the middle element. One modulus is the same as before, as a control; the other is a mod so big it should give a 50% bias, which should be a clear sign of modulo bias, even if in the wrong places. The big modulus for this next experiment: 982,451,653, about half the max int.

 

Checkpoint buckets:

mod = 256203161:
  first = 0, Q1 = 32025395, Q2 = 64050790, Q3 = 96076185,
  Q4-2 = 128101578, Q4-1 = 128101579, Q4 = 128101580, Q4+1 = 128101581, Q4+2 = 128101582,
  Q5 = 160126975, Q6 = 192152370, Q7 = 224177765,
  prelast = 256203159, last = 256203160

mod = 982451653:
  first = 0, Q1 = 122806456, Q2 = 245612912, Q3 = 368419368,
  Q4-2 = 491225822, Q4-1 = 491225823, Q4 = 491225824, Q4+1 = 491225825, Q4+2 = 491225826,
  Q5 = 614032280, Q6 = 736838736, Q7 = 859645192,
  prelast = 982451651, last = 982451652

 

static void ModuloTest2(Random r, int max_random)
{
	// this should converge to small modulo bias with
	// larger and larger population of random numbers
	// control experiment is last and pre-last numbers
	int num = 0;
	int first = 0;
	int q1 = 0;
	int q2 = 0;
	int q3 = 0;
	int q4_2before = 0;
	int q4_1before = 0;
	int q4 = 0;
	int q4_1after = 0;
	int q4_2after = 0;
	int q5 = 0;
	int q6 = 0;
	int q7 = 0;
	int last = 0;
	int prelast = 0;

	int q1_number = 1 * (max_random/8);
	int q2_number = 2 * (max_random/8);
	int q3_number = 3 * (max_random/8);
	int q4_number = 4 * (max_random / 8);
	int q4_2before_number = q4_number - 2;
	int q4_1before_number = q4_number - 1;
	int q4_1after_number = q4_number + 1;
	int q4_2after_number = q4_number + 2;
	int q5_number = 5 * (max_random / 8);
	int q6_number = 6 * (max_random / 8);
	int q7_number = 7 * (max_random / 8);
	int last_number = max_random - 1; // to avoid billions of same calculations
	int prelast_number = max_random - 2; // to see inherent variation; that which is not modulo bias, noise
	for (ulong m = 0; m < ulong.MaxValue; m++)
	{
		num = r.Next(max_random);
		if( num == 0 )
			++first;
		else if( num == q1_number )
			++q1;
		else if( num == q2_number )
			++q2;
		else if( num == q3_number )
			++q3;
		else if( num == q4_number )
			++q4;
		else if( num == q4_2before_number )
			++q4_2before;
		else if( num == q4_1before_number )
			++q4_1before;
		else if( num == q4_1after_number )
			++q4_1after;
		else if( num == q4_2after_number )
			++q4_2after;
		else if( num == q5_number )
			++q5;
		else if( num == q6_number )
			++q6;
		else if( num == q7_number )
			++q7;
		else if( num == prelast_number )
			++prelast;
		else if( num == last_number )
			++last;
		if ((m % 1000000000)==0) // for each N attempts, print some info
		{
			using (System.IO.StreamWriter file = new System.IO.StreamWriter("modulo_test_another.txt", true))
			{
				file.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}\t{8}\t{9}\t{10}\t{11}\t{12}\t{13}\t{14}",
					m, first, q1, q2, q3, q4_2before, q4_1before, q4, q4_1after, q4_2after, q5, q6, q7, prelast, last);
			}
			Console.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}\t{8}\t{9}\t{10}\t{11}\t{12}\t{13}\t{14}",
				m, first, q1, q2, q3, q4_2before, q4_1before, q4, q4_1after, q4_2after, q5, q6, q7, prelast, last);
		}
	}
}

 

I left the code running over and over again, for three days and three nights (no, really: Friday and over the weekend), about 10^13 iterations, and I present to you: the data. Yes, I know, a confusing table, but I am just not much of a designer. The purpose, however, is to write down what all these Qs mean: they are the buckets for which I measured how many times that particular number was selected among all the possible values from 0 to the modulus.

 

mod = 256203161:
Population  First  Q1     Q2     Q3     Q4-2   Q4-1   Q4     Q4+1   Q4+2   Q5     Q6     Q7     Prelast  Last
1.15E+13    47728  48113  48388  43049  43014  48217  42894  48537  42899  42772  48079  48170  42958    43064
1.19E+13    49765  49693  50168  44352  44317  49996  44247  49842  44755  44102  49809  49504  44213    44360

Distribution of .NET randomization. Two experiments with mod=256203161.

mod = 982451653:
Population  First  Q1     Q2     Q3     Q4-2   Q4-1   Q4     Q4+1   Q4+2   Q5     Q6     Q7     Prelast  Last
1.19E+13    16753  10984  11243  11128  11154  16817  10867  10920  11079  11051  11035  11127  11302    11293
1.18E+13    16501  10905  11038  10848  10927  16351  11003  10896  10973  10818  10986  11113  10895    10974

Distribution of .NET randomization. Two experiments with mod=982451653.

 

Results

Experiments with both moduli showed a clear bias of the magnitude you would expect from modulo bias, although the results from the experiment with the bigger mod are slightly more biased than you would expect from modulo bias alone. That could be explained by the initial random number being somewhat bigger than max int, for instance. In any case, it is of that order of magnitude. And mind you, here the population of randomly selected numbers was about 10^13, so the distribution is no coincidence.
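A quick worked check with the hypothetical ExpectedFirstOverLast helper sketched earlier, assuming the initial range really is 2^31 values:

Console.WriteLine(ExpectedFirstOverLast(2147483648L, 256203161)); // 1.125: favored buckets get 9 chances, the rest 8
Console.WriteLine(ExpectedFirstOverLast(2147483648L, 982451653)); // 1.5: favored buckets get 3 chances, the rest 2

The observed first/Q4 ratios in the tables above come out at roughly 1.11-1.12 and 1.50-1.54 respectively: the right order of magnitude, with the bigger mod indeed slightly more biased than the pure prediction.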

The pattern, however, is clear: there is a definite bias toward some numbers, and its order of magnitude suggests a modulo bias effect, but the distribution does not conform to the one you would expect from modulo bias. Maybe it was accounted for, but the fix failed to do it correctly. I will not try to guess what the cause or the possible bug is; the post is long enough already. I also can't tell exactly what RAND_MAX would have to be, because of the scattered distribution of the bias.

Just one question is left to answer: is it safe to implement a poker or roulette game with this randomizer? Probably yes. As the last experiment shows, with such a small modulus the bias is so small that it dissolves into chance, so it is still pretty safe to use in applications such as games, where only a small number of different values are considered. I ran the last experiment right up until the buckets overflowed into negative values, since I foolishly made the buckets an array of integers, and I didn't bother to repeat it. The data from before the overflow are still good: the distribution with a small modulus is uniform. This is just another thing that speaks in favour of the conclusion that there is a modulo-bias-like effect.

 

Distribution of .NET randomization.

Hundreds of billions of random numbers will distribute uniformly among 100 buckets.

 



  • Chris

Just look into the implementation of Random in mscorlib.dll using dotNetPeek and you'll see that Random.Next(max) is not implemented using modulo, but by converting an int in the range [0,2147483647) into a double in the range [0,1), then multiplying that out to the range [0,max). This introduces slight bias for numerical-analysis reasons for some values of max, but it is not modulo bias.

    • http://blog.matejzavrsnik.com/ Matej Zavrsnik

I didn't know about dotNetPeek before. Thank you for pointing that out, along with the type of algorithm behind the tested function. I will add an edit section with this information somewhere near the top, to save people some reading :)