In this article, we are going to derive and implement an iterative algorithm to compute the variance and mean of an incoming continuous stream of data.
Recently, a friend of mine was asked this question in an interview:
Suppose you are creating an IoT sensor device that emits output as a continuous stream of integer values. While most of the data is transmitted to a central server for logging, the local node needs to maintain some statistics about the data. This data is modelled by a random variable \(X\), and it is assumed that the values for this random variable are drawn from a uniform distribution.
As a designer of the IoT device, you are required to maintain the variance and mean of the incoming stream of data on the device. Conservation of memory is paramount. Design an algorithm that can achieve this.
Now, as a designer, let us for a moment forget all about random variables and distributions. The problem is this: we want to store the mean and variance of all the data that has ever been sensed, and find a way to update them as and when new data arrives. So, we are looking for some kind of recursive formula that we can implement in code. Here, we'll don our statistics cap and find one such formula.
Deriving the Recurrence Relation
Assume that the data is represented by a sequence \(\{x_1, x_2, \dots, x_{n-1}, x_n, x_{n+1}, \dots\}\). Further, suppose that we are looking at this sequence at instant \(n\), so that we have received \(n\) data samples so far.
Mean of \(n\) data samples
\(\mathbb{E}[X] = \overline{X_n} = \frac{x_1+x_2+\dots +x_{n-1}+ x_n}{n} = \frac{\sum_{i=1}^{n}x_i}{n}\)
We can break this down as follows:
\(\overline{X_n} = \frac{x_1+x_2+\dots +x_{n-1}}{n} + \frac{x_n}{n} = \frac{\sum_{i=1}^{n-1}x_i}{n}+ \frac{x_n}{n} \)
Multiplying and dividing the first term by \((n-1)\) and simplifying, we can reduce this expression to an elegant recursive formula:
\(\overline{X_n} = \frac{n-1}{n} \frac{\sum_{i=1}^{n-1}x_i}{n-1}+ \frac{x_n}{n} = \frac{n-1}{n} {\overline{X}}_{n-1} + \frac{x_n}{n}\)
Thus, our recursive formula can be written as:
\(\overline{X_{n+1}} = \frac{n}{n+1}\overline{X_n} + \frac{x_{n+1}}{n+1}\)
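As a quick sanity check on this relation, consider the tiny stream \(\{2, 4, 6\}\). Starting from \(\overline{X_1} = 2\), the recurrence gives \(\overline{X_2} = \frac{1}{2}\cdot 2 + \frac{4}{2} = 3\) and \(\overline{X_3} = \frac{2}{3}\cdot 3 + \frac{6}{3} = 4\), which matches the direct computation \(\frac{2+4+6}{3} = 4\).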
Similarly, for the variance, we can derive a recurrence relation starting from the standard formula:
\(\mathrm{var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\)
Now, we've already found the recurrence relation for \(\mathbb{E}[X]\). If we can find another similar kind of recurrence relation for \(\mathbb{E}[X^2]\), we'll have found our gold.
So, \(\mathbb{E}[X^2] = \sum_{i=1}^{n}{x_i}^2\,P(X=x_i)\). Here, \(P(X=x_i)\) is the probability of the random variable \(X\) taking the value \(x_i\). Since it is a uniform distribution, all values are equally likely, and hence this factor may be taken out of the summation as \(\frac{1}{n}\). Note that we did this step implicitly in the previous derivation.
Then, \(\mathbb{E}[X^2] = \frac{\sum_{i=1}^{n} {x_i}^2}{n} = \frac{n-1}{n}\cdot\frac{\sum_{i=1}^{n-1} {x_i}^2}{n-1} + \frac{{x_n}^2}{n}\), or, shifting the index just as we did for the mean, \(\mathbb{E}_{n+1}[X^2] = \frac{n}{n+1}\mathbb{E}_n[X^2] + \frac{{x_{n+1}}^2}{n+1}\).
With this, we have our desired recurrence relation for the variance:
\((\mathrm{var}(X))_{n+1} = \left(\frac{n}{n+1}\mathbb{E}_n[X^2] + \frac{{x_{n+1}}^2}{n+1}\right) - \left(\frac{n}{n+1}\mathbb{E}_n[X] + \frac{x_{n+1}}{n+1}\right)^2\)
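Continuing the sanity check with the same stream \(\{2, 4, 6\}\): the second-moment recurrence gives \(\mathbb{E}_1[X^2] = 4\), \(\mathbb{E}_2[X^2] = \frac{1}{2}\cdot 4 + \frac{16}{2} = 10\) and \(\mathbb{E}_3[X^2] = \frac{2}{3}\cdot 10 + \frac{36}{3} = \frac{56}{3}\), so the variance after three samples is \(\frac{56}{3} - 4^2 = \frac{8}{3}\), which is exactly the population variance of \(\{2, 4, 6\}\).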
So, now that we have our recurrence relations, we can store away a minimal amount of data and still be able to compute the mean and variance. Note that we are not actually storing the variance itself, but the second moment of the data. From this second moment \(\mathbb{E}[X^2]\) and the first moment (or mean) \(\mathbb{E}[X]\) (i.e., \(\overline{X}\)), we can calculate the variance of the data up to the time instant \(n\).
All in all, we are keeping three variables, E_x, E_x_squared and n, and by recursively updating these values every time a new data value arrives, we satisfy the requirement of maintaining the mean and variance without using too much memory. This kind of iterative approach, where we learn and update the values on the fly, is often called online learning or sequential learning in a machine learning context. (The other version is batch learning, where we feed our algorithm chunks or batches of data and operate on them collectively.)
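To make the bookkeeping concrete, here is a minimal sketch of such an online updater in MATLAB/Octave. The function name stream_update and its state struct are our own choices for illustration, not a standard API; saved as stream_update.m, it performs one update step:

function state = stream_update(state, x_new)
    % Fold the new sample x_new into the running first and second
    % moments using the recurrences derived above.
    n = state.n;                 % number of samples seen so far
    state.E_x         = (n/(n+1))*state.E_x         + x_new/(n+1);
    state.E_x_squared = (n/(n+1))*state.E_x_squared + x_new^2/(n+1);
    state.n           = n + 1;
end

A short usage example, reusing the tiny stream from the sanity checks above:

state = struct('E_x', 0, 'E_x_squared', 0, 'n', 0);
for x = [2 4 6]
    state = stream_update(state, x);
end
variance = state.E_x_squared - state.E_x^2;   % 8/3 for this stream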
Now, let's put down the statistics hat and wear the programmer's hat. Can we validate these relations? Let us write a short piece of code to see if they work.
Sample Implementation
Presented below is a short MATLAB/Octave script in which we simulate the incoming sequence (pre-generating it) and feed it iteratively into the computational loop. We stop at random points to check whether the computed mean and variance are indeed what they should be.
%% Variables
N = 10^6;                          % length of the simulated stream
stream = rand([N,1]);              % pre-generated uniform samples in [0,1)
pause_at = sort(randperm(N, 10));  % 10 random checkpoints for validation
E_x = 0;                           % running mean
E_x_squared = 0;                   % running second moment

for n = 0:N-1
    % Recursive updates: sample x_{n+1} arrives after n samples have been seen
    E_x         = (n/(n+1))*E_x         + stream(n+1)/(n+1);
    E_x_squared = (n/(n+1))*E_x_squared + stream(n+1)^2/(n+1);
    if any(pause_at == n)
        fprintf('\n Computed: N=%d. Mean=%f, Variance = %f', n, E_x, ...
            E_x_squared - E_x^2);
        fprintf('\n Analytical: Mean=%f, Variance = %f', ...
            mean(stream(1:n+1)), var(stream(1:n+1)));
    end
end
Here's a sample output:
Computed: N=22504. Mean=0.500859, Variance = 0.083856
Analytical: Mean=0.500859, Variance = 0.083860
Computed: N=60197. Mean=0.499909, Variance = 0.083266
Analytical: Mean=0.499909, Variance = 0.083268
Computed: N=111533. Mean=0.499907, Variance = 0.083466
Analytical: Mean=0.499907, Variance = 0.083467
Computed: N=121321. Mean=0.500081, Variance = 0.083466
Analytical: Mean=0.500081, Variance = 0.083467
Computed: N=123488. Mean=0.499915, Variance = 0.083462
Analytical: Mean=0.499915, Variance = 0.083463
Computed: N=352519. Mean=0.499968, Variance = 0.083501
Analytical: Mean=0.499968, Variance = 0.083501
Computed: N=392207. Mean=0.499799, Variance = 0.083529
Analytical: Mean=0.499799, Variance = 0.083529
Computed: N=479770. Mean=0.499807, Variance = 0.083528
Analytical: Mean=0.499807, Variance = 0.083528
Computed: N=480705. Mean=0.499824, Variance = 0.083528
Analytical: Mean=0.499824, Variance = 0.083528
Computed: N=903026. Mean=0.500124, Variance = 0.083363
Analytical: Mean=0.500124, Variance = 0.083363
Yep, we're doing fine! The tiny mismatches in the variance at smaller \(N\) are expected: MATLAB's var normalizes by \(n-1\) by default, while our recurrence computes the population variance, which normalizes by \(n\); the difference vanishes as \(n\) grows. Easy!