Exponential Moving Average (EMA) Rates, part 1
I had been thinking about how to measure the average rate at which some event occurs over time. For example, you might like to measure how much traffic flows through a street throughout the day. Reporting the time that every single car goes by is very accurate, but not very useful. You might bin traffic into hours starting on every hour, but if there is a sudden spike in the middle of an hour you might miss its significance. So, you'd like to see a graph that's smooth like an average, but with more detail in time.
One approach is similar to binning, but slides the hour-long window across the data a minute at a time. Doing this requires keeping the data around and using each data point repeatedly: if you have a surge of one million cars in a few minutes, you need to use those million points in your calculations 60 times.
This behavior is similar to the Simple Moving Average (SMA). An SMA can easily be transformed into an Exponential Moving Average (EMA), which requires only the previous EMA and the new data point to calculate the new EMA. So, I decided to create an Exponential Moving Average Rate (EMAR).
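For reference, a single EMA update needs only the previous estimate and the newest value; everything older can be thrown away. Here's a minimal sketch in R (the helper name and the smoothing factor alpha are just illustrative, not part of the EMAR developed later):
ema.update <- function (prev.ema, new.value, alpha) {
  # alpha is a smoothing factor in (0, 1]; larger values weight new data more
  alpha * new.value + (1 - alpha) * prev.ema
}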
Simulating the data
Let's say that we're measuring traffic on a road that gets about 10 vehicles passing every minute. The inter-arrival times of Poisson events are exponentially distributed, here with a mean of 0.1 minutes between arrivals. We'll take measurements for about 8 hours (480 minutes), so we expect to see about 4800 vehicles pass in that time.
inter.arrival.times <- rexp(4800, rate=10)
The arrival times themselves are simply the cumulative sums of the inter-arrival times:
arrival.times <- cumsum(inter.arrival.times)
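As a quick sanity check, the last arrival should land near the 480-minute mark:
max(arrival.times)  # should be roughly 480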
We'll group the arrival times into bins and count arrivals in each bin. A bin size of half a minute will be big enough to have a variety of counts in the bins, but small enough that the distribution of counts doesn't start to look too Gaussian:
binify <- function (x, minutes) {
  last.bin <- floor(max(x) / minutes)
  sapply(0:last.bin, function (i) {
    a <- minutes*i
    b <- minutes*(i + 1)
    sum(a <= x & x < b)
  })
}

x <- binify(arrival.times, 0.5)
At this point, \(x\) should follow the Poisson distribution. We can check that visually:
hist(x)
Numerically, Poisson data should have mean and variance equal. We expect them to be the rate-per-minute times the bin size, \(10 \cdot 0.5 = 5\):
mean(x)
4.89795918367347
var(x)
5.28273331804632
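Another compact check is the index of dispersion, the variance divided by the mean, which should be close to 1 for Poisson data (about 1.08 for the values above):
var(x) / mean(x)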
Now that we've verified that the simulated data has the expected properties, let's try to measure the arrival rate.
Sliding windows
We'll start by allowing a window to slide over the arrivals, instead of binning them into rigid slots. This is essentially a simple moving average. We could explicitly discard data points as they expire from the window, but for ease of implementation, we'll calculate the window on the fly from the full data set.
We'll sample the current estimate of the rate every 0.1 minutes. Using a sliding window is a lagging technique, so we start at \(t = 60\):
sliding.window <- function (x) {
  t <- seq(60, 480, by=0.1)
  r <- sapply(t, function (t) {
    # count the arrivals in the trailing 60-minute window ending at t
    sum((t - 60) < x & x <= t)
  })
  data.frame(time=t, rate=r)
}

d <- sliding.window(arrival.times)
plot(d, type="l", xlim=c(0, 480), ylim=c(0, 1000))
abline(h=600, lty="dashed")
We can see that the sliding window reports an average rate that stays close to what we expect: 600 vehicles passing per hour.
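To put a number on how close, we can summarize the sampled rates:
summary(d$rate)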
Calculating without keeping data around
What we'd really like is to be able to discard each data point after we've used it to update our estimate, instead of retaining an arbitrary amount of data just to keep our window full. I'll discuss how to do just that in my next post.