The coefficient of determination, also known as the R2 (“R square”), is a useful value to calculate when evaluating a regression model because it represents the proportion of the total variation of an observed value explained by the model and it can be represented as a percentage that is easy to explain to a stakeholder.
The formula can be represented as the following, where a dataset has observed values, represents the mean of the observed values, and represents a modeled value.
When converting this formula into an algorithm in SQL, the column named rating represents my observed values, and the column named prediction represents the modeled values.
SELECT AVG(rating) FROM model_applied;
We’ll plug the average into the next query:
SELECT 1-(SUM(POWER(rating - prediction,2)))/(SUM(POWER(rating - [average],2))) FROM model_applied
If your values are INTEGER, you may need to convert them into FLOAT to get a meaningful result, because the R2 is a number between 0 and 1, where 0 indicates that the regression does not represent the data, and 1 is a perfect fit.
If my R2 is 0.222787592231293, then 22% of the variation can be explained by the regression, and the other 82% of the variation is unexplained. Basically, this says that the model is 22% more accurate than using a random guess (minimum error by using the mean).