Calculating model fit with R(2)/Coefficient of Determination in SQL

bigdatamarkIn this post, I’ll explain how to convert the R2_1 coefficient of determination equation into SQL for calculating model fit.

The coefficient of determination, also known as the R2 (“R square”), is a useful value to calculate when evaluating a regression model because it represents the proportion of the total variation of an observed value explained by the model and it can be represented as a percentage that is easy to explain to a stakeholder.

The formula can be represented as the following, where a dataset has y_{i} observed values, \bar{y} represents the mean of the observed values, and f_{i} represents a modeled value.

R^{2}=1-\frac{\sum_{i=1}^{n} (y_{i}-f_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}

When converting this formula into an algorithm in SQL, the column named rating represents my observed values, and the column named prediction represents the modeled values.

SELECT AVG(rating) FROM model_applied;

We’ll plug the average into the next query:

SELECT 1-(SUM(POWER(rating - prediction,2)))/(SUM(POWER(rating - [average],2))) FROM model_applied

If your values are INTEGER, you may need to convert them into FLOAT to get a meaningful result, because the R2 is a number between 0 and 1, where 0 indicates that the regression does not represent the data, and 1 is a perfect fit.

If my R2 is 0.222787592231293, then 22% of the variation can be explained by the regression, and the other 82% of the variation is unexplained. Basically, this says that the model is 22% more accurate than using a random guess (minimum error by using the mean).