Prediction: First look
Suppose we observe \(N\) outcomes \(y_i\), \(i=1,\ldots,N\), of some data generating process (DGP). We wish to predict the next observation. What would be a good way to guess what the next outcome from this DGP will be?
One approach is to choose a prediction, \(\hat y\), that would have been a good guess for the outcomes we have already seen. The question is then: what do we mean by a “good guess”? Clearly we have to define some measure of how good a prediction would have been. There are many ways we might choose to define this, but a very common approach is to say that we want to minimize the Mean Squared Error (MSE):
\[
\text{MSE}(\hat y) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat y)^2. \tag{8}
\]
The MSE is just the average of the squared errors, where each error is the difference between an observation \(y_i\) and the corresponding prediction \(\hat y\). (In this case the prediction for every observation is the same, since \(\hat y\) doesn’t vary by \(i\), but later we’ll think about ways to make the prediction vary by observation.)
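As a minimal sketch of this definition, the MSE of a single constant prediction can be computed directly. The function name `mse` and the observations in `ys` are made up for illustration and do not come from the text.

```python
def mse(ys, y_hat):
    """Mean squared error of one constant prediction y_hat for all observations."""
    return sum((y - y_hat) ** 2 for y in ys) / len(ys)


# Hypothetical observations, chosen only for illustration.
ys = [2.0, 4.0, 4.0, 6.0]

print(mse(ys, 3.0))  # errors -1, 1, 1, 3 -> squares 1, 1, 1, 9 -> 3.0
print(mse(ys, 4.0))  # errors -2, 0, 0, 2 -> squares 4, 0, 0, 4 -> 2.0
```

In this example the guess 4.0 has a lower MSE than the guess 3.0, so by this measure it would have been the better prediction.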
By defining the error term as the squared difference between the observation and the prediction, we are implicitly assuming that large differences matter disproportionately more than small ones: an error twice as large is penalized four times as heavily. This assumption may or may not reflect how much we care about errors in different settings. We could define the error differently, say as the absolute value of the difference:
\[
\text{MAE}(\hat y) = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat y|.
\]
This is just one of many possible alternatives. But the MSE has a lot of desirable characteristics, not the least of which is that it is a smooth function with continuous derivatives and therefore amenable to being optimized using calculus.
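To make the difference between the two error definitions concrete, the sketch below compares how much a single large error contributes to the total under each one. The error values are hypothetical and chosen only for illustration.

```python
# Hypothetical prediction errors: two small misses and one large miss.
errors = [1.0, 1.0, 10.0]

squared = [e ** 2 for e in errors]    # 1, 1, 100
absolute = [abs(e) for e in errors]   # 1, 1, 10

# Share of the total error that comes from the single large miss.
share_sq = squared[-1] / sum(squared)
share_abs = absolute[-1] / sum(absolute)

print(f"squared:  {share_sq:.2f}")   # 0.98
print(f"absolute: {share_abs:.2f}")  # 0.83
```

Under squared error the large miss accounts for nearly all of the total, while under absolute error the small misses still carry some weight.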
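To solve for the value of \(\hat y\) that minimizes the MSE, we set the derivative with respect to \(\hat y\) equal to zero. The derivative is
\[
\frac{d\,\text{MSE}}{d\hat y} = -\frac{2}{N}\sum_{i=1}^{N} (y_i - \hat y).
\]
Setting this equal to zero and rearranging, we have
\[
\sum_{i=1}^{N} y_i = N \hat y,
\]
implying that
\[
\hat y = \frac{1}{N}\sum_{i=1}^{N} y_i.
\]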
That is, the best prediction is simply the average or mean of the \(y_i\) that we have observed.
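This result can be checked numerically. The sketch below evaluates the MSE on a grid of candidate predictions and confirms that the best candidate coincides with the sample mean. The data, the grid, and the helper `mse` are illustrative assumptions, not part of the text.

```python
from statistics import mean


def mse(ys, y_hat):
    """Mean squared error of one constant prediction y_hat."""
    return sum((y - y_hat) ** 2 for y in ys) / len(ys)


# Hypothetical observations, chosen only for illustration.
ys = [2.0, 4.0, 4.0, 6.0]
y_bar = mean(ys)

# Candidate predictions 0.0, 0.5, ..., 8.0 (an arbitrary grid that contains the mean).
candidates = [k * 0.5 for k in range(17)]
best = min(candidates, key=lambda c: mse(ys, c))

print(y_bar)  # 4.0
print(best)   # 4.0
```

A grid search only finds the best value among the candidates it is given, so this is a sanity check rather than a proof. The calculus argument is what establishes the result in general.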
Note
The second derivative of the MSE function is positive everywhere, so we know we have found the global minimum. The MSE is a convex function, meaning it is shaped like a multidimensional bowl with a single point at the bottom. Convex functions are much easier to minimize than non-convex functions, as we can use simple algorithms that take advantage of calculus to find the global minimum.
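One example of such a simple algorithm is gradient descent, which repeatedly steps in the direction opposite to the derivative. The following is a minimal sketch, assuming an arbitrary starting guess, a fixed step size of 0.1, and made-up data. It is not an algorithm taken from the text.

```python
def mse_gradient(ys, y_hat):
    """Derivative of the MSE with respect to y_hat: -(2/N) * sum(y_i - y_hat)."""
    return -2.0 * sum(y - y_hat for y in ys) / len(ys)


# Hypothetical observations, chosen only for illustration; their mean is 4.0.
ys = [2.0, 4.0, 4.0, 6.0]

y_hat = 0.0      # arbitrary starting guess (assumption)
step_size = 0.1  # fixed step size (assumption)

for _ in range(200):
    # Move against the gradient; each step shrinks the distance to the mean.
    y_hat -= step_size * mse_gradient(ys, y_hat)

print(y_hat)  # very close to 4.0, the sample mean
```

Because the MSE is convex, the starting guess does not matter here: any start leads to the same minimum, provided the step size is small enough.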
Following convention, we will usually denote the mean of the values \(y_i\) by \(\bar y\) (“y-bar”). If we substitute \(\bar y\) as the predictor in (8) we have
\[
\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar y)^2.
\]
This expression is the variance of the \(y_i\). It provides a summary of how spread out the individual values of \(y_i\) are around the mean \(\bar y\).
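As a quick check, the MSE evaluated at the mean can be compared with the population variance from Python's standard `statistics` module. The data below are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical observations, chosen only for illustration.
ys = [2.0, 4.0, 4.0, 6.0]
y_bar = mean(ys)

# MSE with the mean used as the prediction for every observation.
mse_at_mean = sum((y - y_bar) ** 2 for y in ys) / len(ys)

print(mse_at_mean)    # 2.0
print(pvariance(ys))  # 2.0
```

Note that `statistics.pvariance` divides by \(N\), matching the expression above, whereas `statistics.variance` divides by \(N-1\) and would give a slightly larger value.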
Extra credit
It might appear at this point that the MSE and variance are the same. This happens to be true in this case, but will not be true in general. The MSE is actually the sum of the variance and another term, the squared bias, which in this particular case is zero. The famous formula
\[
\text{MSE}(\hat \theta) = \text{Var}(\hat \theta) + \text{Bias}(\hat \theta, \theta)^2
\]
formalizes this relation for an estimate \(\hat \theta\) of a parameter \(\theta\). See here for additional discussion.
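The decomposition can be illustrated numerically. In the sketch below, expectations are replaced by averages over a small, made-up list of estimates, and the "true" parameter value is likewise an assumption chosen for illustration.

```python
from statistics import mean, pvariance

theta = 5.0                       # hypothetical true parameter
estimates = [4.0, 5.0, 6.0, 7.0]  # hypothetical estimates from repeated samples

# Average squared distance between the estimates and the true parameter.
mse = mean([(est - theta) ** 2 for est in estimates])

# Spread of the estimates around their own mean.
variance = pvariance(estimates)

# Systematic offset of the estimates from the true parameter.
bias = mean(estimates) - theta

print(mse)                   # 1.5
print(variance + bias ** 2)  # 1.25 + 0.25 = 1.5
```

Here the estimates are centered on 5.5 rather than 5.0, so the bias is nonzero and the MSE exceeds the variance by exactly the squared bias.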