Cross-validation partitioning in SQL the easy way

bigdatamarkWhen doing cross-validation of your computer model, you can split your dataset dynamically using SQL, preventing the need to physically split the data and also making it easy to perform multiple rounds of cross-validation.

A common way to partition your dataset for cross-validation involves copying your dataset, then going over each line and randomly deleting the line from your training dataset or from your validation/testing dataset, usually leaving the majority of the data in your training dataset.

The SQL for this would normally look like this:

CREATE TABLE training_set AS
SELECT * FROM data_set;

CREATE TABLE validation_set AS
SELECT * FROM data_set;

DECLARE myrecord RECORD;
BEGIN
  FOR myrecord IN SELECT * FROM data_set LOOP
  IF RANDOM()...

END LOOP

This is complex, takes additional storage, and takes additional time.

Not to mention, Amazon Redshift doesn’t support procedural language at the time that I write this article.

Instead, I use the same dataset, and partition it dynamically by using the same RANDOM() function, but by setting a seed first.

My training dataset query looks like this:

SET SEED TO 1;
SELECT * FROM data_set
WHERE RANDOM() > .9;

My validation dataset query looks like this:

SET SEED TO 1;
SELECT * FROM data_set
WHERE RANDOM() <=.9;

To do another round of cross-validation, I just change the seed.

Leave a Reply

Your email address will not be published. Required fields are marked *