# Calculating model fit with R(2)/Coefficient of Determination in SQL

In this post, I’ll explain how to convert the R2_1 coefficient of determination equation into SQL for calculating model fit.

The coefficient of determination, also known as the R2 (“R square”), is a useful value to calculate when evaluating a regression model because it represents the proportion of the total variation of an observed value explained by the model and it can be represented as a percentage that is easy to explain to a stakeholder.

The formula can be represented as the following, where a dataset has $y_{i}$ observed values, $\bar{y}$ represents the mean of the observed values, and $f_{i}$ represents a modeled value.

$R^{2}=1-\frac{\sum_{i=1}^{n} (y_{i}-f_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}$

When converting this formula into an algorithm in SQL, the column named rating represents my observed values, and the column named prediction represents the modeled values.

SELECT AVG(rating) FROM model_applied;

We’ll plug the average into the next query:

SELECT 1-(SUM(POWER(rating - prediction,2)))/(SUM(POWER(rating - [average],2))) FROM model_applied

If your values are INTEGER, you may need to convert them into FLOAT to get a meaningful result, because the R2 is a number between 0 and 1, where 0 indicates that the regression does not represent the data, and 1 is a perfect fit.

If my R2 is 0.222787592231293, then 22% of the variation can be explained by the regression, and the other 82% of the variation is unexplained. Basically, this says that the model is 22% more accurate than using a random guess (minimum error by using the mean).

## 2 Replies to “Calculating model fit with R(2)/Coefficient of Determination in SQL”

1. Fred says:

Hi,

Nice Post. There is a question, if the fi has a very bad prediction. So the variance between the original value and the predicted value could be very large. Won’t it be possible that the R2 value could be negative in that case?

2. It is possible to have a negative R2, which means that your model fits worse than a straight linear regression.