How do you mean?

When asked to calculate the average of some values, most people and most programmers would probably calculate the arithmetic mean, i.e. the sum of the values divided by the number of values:

$$\bar{x}_n = \frac{x_1+x_2+...+x_n}{n} \qquad{(1)}$$

Certainly, that’s what I’ve been doing until recently. And indeed, this is how a highly rated online JavaScript course solves one of its first exercises:

function mean(xs) {
    let sum = 0;
    for (let x of xs) sum += x;
    return sum / xs.length;
}

More concisely in Haskell:

import Data.List (genericLength)

mean xs = sum xs / genericLength xs

Mathematically, this is correct. In the imperfect world of computers, there is a better way, as I recently discovered in an online C course which provided an algorithm suggested by Donald Knuth:

double mean(double *xs, size_t size) {
    double mean = 0.0;
    for (size_t i = 0; i < size; i++)
        /* update the running mean with the (i+1)-th value */
        mean += (xs[i] - mean) / (i + 1);
    return mean;
}

This algorithm never adds up all the values. Instead it keeps a running mean (also known as a cumulative average): with each new value, it computes a new mean from the difference between that value and the previous mean.
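
For example, tracing the loop above over the values 1, 2 and 3, with the mean starting at 0:

$$m_1 = 0 + \frac{1-0}{1} = 1, \qquad m_2 = 1 + \frac{2-1}{2} = 1.5, \qquad m_3 = 1.5 + \frac{3-1.5}{3} = 2$$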

So how does it work (mathematically) and why is it better (computationally)?

The mathematics of a running mean

Looking back at eq. 1, which gives us the mean of $n$ values, we can use it for the case where we have $n+1$ values:

$$\bar{x}_{n+1} = \frac{x_1+x_2+...+x_n+x_{n+1}}{n+1} \qquad{(2)}$$

Multiplying each side of eq. 1 by $n$ to clear the fraction gives

$$n\bar{x}_n = x_1+x_2+...+x_n \qquad{(3)}$$

Now we see that the right-hand side of eq. 3 appears in eq. 2, so we can substitute to give

$$\bar{x}_{n+1} = \frac{n\bar{x}_n+x_{n+1}}{n+1}$$

Simplifying this further requires a little trick, often referred to as adding zero, which is to rewrite the $n$ that appears in the numerator as $n+1-1$:

$$\bar{x}_{n+1} = \frac{(n+1-1)\bar{x}_n+x_{n+1}}{n+1}$$

Re-arranging

$$\bar{x}_{n+1} = \frac{(n+1)\bar{x}_n-\bar{x}_n+x_{n+1}}{n+1}$$

$$\bar{x}_{n+1} = \frac{(n+1)\bar{x}_n+x_{n+1}-\bar{x}_n}{n+1}$$

$$\bar{x}_{n+1} = \frac{(n+1)\bar{x}_n}{n+1}+\frac{x_{n+1}-\bar{x}_n}{n+1}$$

$$\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1}-\bar{x}_n}{n+1} \qquad{(4)}$$

Now we have a formula for the mean of $n+1$ values, based solely on the mean of the previous $n$ values, the new value $x_{n+1}$, and the new number of values $n+1$. If you look at the C function above, you will see that it implements eq. 4.

Here is a JavaScript version of the running mean:

function rmean(xs) {
    // with no initial value, reduce seeds the accumulator with xs[0],
    // which is exactly the mean of the first value
    return xs.reduce((mean, x, i) => mean + (x - mean)/(i + 1));
}

Overflow and loss of precision

Mighty as they are, computers are finite. They cannot possibly represent all the numbers that exist in mathematics. Not only are there infinitely many integers (whole numbers), but between each pair of integers there are infinitely many rational numbers (fractions, i.e. finite or recurring decimals). And between any two rational numbers, there is an infinity of irrational numbers (infinite, non-repeating decimals). If you could throw an infinitely fine dart at a number line, your chance of hitting a rational number would be zero, because there are so many more irrational numbers.

In mathematics, all the above numbers (integers, rationals and irrationals) put together are called the real numbers.

To represent real numbers (which are not only infinitely big but also infinitely precise) in a finite amount of memory, computers have to choose a finite sample of rational numbers (irrational numbers cannot be represented at all). This sample must have a maximum and minimum. And for the sample to be most useful, the numbers are not evenly distributed: we choose to have the most around zero, and the density of available numbers goes down as we go towards the maximum and minimum. So although the numbers you are used to seeing in a computer may seem to be very precise, there is in fact a gap between each number that a computer can represent, and this gap gets bigger and bigger as we get closer to the extremes¹.
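
You can see these gaps directly in a JavaScript console. $2^{53}$ is the point beyond which double-precision numbers start skipping integers, and near the maximum the gap between neighbouring representable numbers is astronomically large:

> 2**53 + 1 === 2**53
true
> Number.MAX_VALUE + 1e291 === Number.MAX_VALUE
true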

Coming back to our two ways of computing a mean: the one that adds up all the values first and then divides is more likely to overflow (go beyond the maximum or minimum) or to suffer loss of precision (by working with numbers where the gaps become significant) than the one that keeps a running mean and never adds up all the values.

To see this in action, let’s compare the two JavaScript functions. First, although we have proved that they are equivalent mathematically, let’s try them out on a few ordinary values:

> mean([1,2,3])
2
> rmean([1,2,3])
2
> mean([-1, 3.5, 12, 56.987, 0])
14.2974
> rmean([-1, 3.5, 12, 56.987, 0])
14.2974

Now, to show the superiority of the running mean, we will use some large numbers (1e308 is 1 followed by 308 zeroes), and in order to be sure of what the result should be, we will compute the mean of two identical numbers (which should of course give back the same number):

> mean([1e308, 1e308])
Infinity
> rmean([1e308, 1e308])
1e+308

This shows that the first mean function overflows, whereas the running mean function gives the correct value.

Although it is hard to come up with a simple example to demonstrate it, if you use the arithmetic mean function with lots of data or large values, long before overflowing it will start producing values that are less accurate than the running mean would produce. This happens when the sum of all the numbers reaches values where the gaps between the numbers that the computer can represent become significant. And whereas with overflow you at least know that a problem has occurred, there is no way to detect a loss of precision.
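
Here is one attempt at such an example. $2^{53}-1$ is the largest odd integer a double can represent exactly; summing ten copies of it pushes the running total into territory where the representable numbers are 8 and then 16 apart, so the intermediate sums get rounded:

> mean(Array(10).fill(2**53 - 1))
9007199254740990
> rmean(Array(10).fill(2**53 - 1))
9007199254740991

The running mean stays exact because the difference x - mean is zero at every step, while the sum-then-divide version ends up off by one.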

Wrapping up

If, like me, you adopt this new way of calculating averages, it is worth having a pure function which calculates the running mean given the previous mean and the new value, but knows nothing about where the values are stored or where they come from. That way, you can use the same function whatever data structure you are using, or even without having all the data in memory (you could be using a real-time data stream, or a data lake that is too large to fit in memory).

For example, in JavaScript I have it coded so that it works with Array.reduce:

// reducer arguments: previous mean, next value, and its index
function rmean(mean, x, i) {
    return mean + (x - mean)/(i + 1);
}

// use it on an array of values
let values = [1.5, 5, 12];
let average = values.reduce(rmean);

In Haskell our running mean function could be as simple as:

rmean :: Fractional a => a -> a -> a -> a
rmean mean n x = mean + (x-mean)/(n+1)

However, it is worth creating an algebraic data type for a running mean, and taking a bit more care to make the function polymorphic.

data Mean a b = Mean a b deriving Show

rmean :: (Fractional a, Integral b) => a -> Mean a b -> Mean a b
rmean x (Mean m n) = Mean (m + diff/fracn') n'
                     where n'     = n+1
                           fracn' = fromIntegral n'
                           diff   = x - m

The Mean type combines the value of the mean and the number of values used to compute it. Now the running mean function is type safe and easy to use, e.g. with foldr to compute the mean of the values in a list:

ghci> foldr rmean (Mean 0 0) [1..3]
Mean 2.0 3
ghci> foldr rmean (Mean 0 0) [-1, 3.5, 12, 56.987, 0]
Mean 14.297400000000001 5
ghci> foldr rmean (Mean 0 0) [1e308,1e308]
Mean 1.0e308 2
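
One caveat: a lazy right fold builds up the whole chain of updates before evaluating any of them, so on a very long list it can exhaust the stack. A strict left fold, combined with strict fields on Mean (e.g. data Mean a b = Mean !a !b), processes the list front to back in constant space. A sketch, where the wrapper name meanOf is my own:

import Data.List (foldl')

-- assumes the strict-field variant: data Mean a b = Mean !a !b
meanOf :: (Fractional a, Integral b) => [a] -> Mean a b
meanOf = foldl' (flip rmean) (Mean 0 0)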

  1. For a more detailed and very clear explanation of the gaps between floating-point numbers, see this article.