The bias-versus-variance dilemma is almost universal across all kinds of data modelling methods. As per Wikipedia, the
variance of a random variable is the expected value of the squared deviation of that variable from its expected value. In other words, it measures how much the values of the variable spread around their mean, with each value weighted by its probability. The
bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. A very good article on this topic is available
here. Without repeating much of its content, I will simply highlight the key points that make this topic easier to understand.
- Var(x) = E(x^2) - [E(x)]^2, where E(.) is the expectation value. This can be rewritten as
E(x^2) = Var(x) + [E(x)]^2
If we replace x by e (the approximation error of an estimator), we can rewrite the above equation as
E(e^2) = Var(e) + [E(e)]^2
Since E(e^2) is the mean square error (MSE) and E(e) is the bias of the estimator, this becomes
MSE = Var(e) + Bias^2
Hence we can see that for a given mean square error, there is a trade-off between the variance and the bias: if one increases, the other must decrease. The short numerical check below illustrates this decomposition.
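As a quick sanity check (my own sketch, not something from the article linked above), the following Python snippet estimates the mean of a normal distribution with a deliberately shrunk, and hence biased, estimator and verifies numerically that MSE = Var(e) + Bias^2. The sample size, shrinkage factor, and distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0
n_trials, n_samples = 100_000, 20

# Shrinkage estimator: 0.8 * sample mean (biased on purpose).
samples = rng.normal(true_mean, 1.0, size=(n_trials, n_samples))
estimates = 0.8 * samples.mean(axis=1)
errors = estimates - true_mean           # e = approximation error

mse = np.mean(errors ** 2)               # E(e^2)
var = np.var(errors)                     # Var(e)
bias_sq = np.mean(errors) ** 2           # [E(e)]^2

print(f"MSE          = {mse:.5f}")
print(f"Var + Bias^2 = {var + bias_sq:.5f}")  # agrees with MSE up to floating-point error
```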
- The complexity of an estimator model is usually expressed in terms of its number of parameters. The effect of complexity on the bias and the variance of the estimator is explained here. In brief (see the polynomial-fit sketch after this list):
- A low-complexity model leads to low variance but large bias.
- A highly complex model leads to low bias but large variance.
- A large variance implies that the estimator is too sensitive to the particular data set, so the model has low generalization capability: excellent performance on the design (or training) data, but poor performance on test data. This is the case of overfitting. Large variance is typically observed for complex models with a large number of parameters (over-parameterization). Note that, because of its high complexity, such a model fits the design data set very well; the error at individual points in the data set is very low, and hence the model has a low bias.
- A large bias implies that the model is too simple, so very few data points lie on the regression curve. The errors at individual points in a given data set are therefore high, leading to a large MSE. But since so few points participate in forming the model, its performance does not differ much across different test data sets; hence it has a low variance, and the performance remains essentially the same over design and test data sets.
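The polynomial-fit sketch below (again my own illustration, under assumed data and degrees rather than anything from the article) contrasts an under-parameterized and an over-parameterized fit on the same noisy sine data. The degree-1 model shows large bias and low variance; the degree-12 model fits the training points almost exactly (low bias) but generalizes poorly to the test grid (large variance), i.e. it overfits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy training data drawn from a sine curve (illustrative choice).
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)

# Dense noise-free test grid from the same underlying function.
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 12):
    # Fit a polynomial of the given degree (complexity = number of parameters).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The simple model's error is similar on training and test data (consistent but biased), whereas the complex model's training error is far lower than its test error, which is exactly the overfitting behaviour described above.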