In this article, I will explain what causes an ERROR 1023 (Serializable isolation violation) in Redshift and how I dealt with it in a novel way.
Redshift’s use cases
Redshift’s design targets fast reads of flattened data. Historical data is typically loaded during off-peak hours, stored in a flattened (denormalized) form, and queried for near-real-time display.
Redshift is not optimized for writes, due to limitations in its scheduler, its lack of dynamic scaling, its lack of separation between storage and compute resources, and, more fundamentally, because it’s a columnar store without row-level locking. As a result, an optimal use case involves loading data in batches during a daily or nightly maintenance window and only storing data through the previous day.
If you try to do intraday updates, you’ll have to deal with periods of high CPU and I/O from the writes slowing down your queries. If you store your data in a star or snowflake schema, you’ll also run into Redshift’s serializable isolation limitations whenever multiple jobs normalize data to the same dimension tables.
When using an ODBC connector, those errors will look similar to this:
```
System.Data.Odbc.OdbcException: ERROR [HY000] [Amazon][RedShift ODBC] (30) Error occurred while trying to execute a query:
ERROR: 1023
DETAIL: Serializable isolation violation on table - 1256109, transactions forming the cycle are: 15749004, 15748971, 15748625, 15748970, 15749497 (pid:8238)
   at System.Data.Odbc.OdbcConnection.HandleError(OdbcHandle hrHandle, RetCode retcode)
```
If you’re using Microsoft SQL Server Integration Services (SSIS) your messages may look closer to this:
```
Executed as user: NT Service\SQLSERVERAGENT. Microsoft (R) SQL Server Execute Package Utility Version 11.0.5058.0 for 64-bit
Copyright (C) Microsoft Corporation. All rights reserved.
Started: 7:01:01 AM
Error: 2017-03-14 07:30:23.84
  Code: 0xC002F304
  Source: Populate fact tables Populate fact tables
  Description: An error occurred with the following error message: "ERROR: XX000: 1023".
End Error
DTExec: The package execution returned DTSER_FAILURE (1).
Started: 7:01:01 AM  Finished: 8:41:03 AM  Elapsed: 6002.32 seconds.
The package execution failed. The step failed.
```
ERROR: 1023 DETAIL: Serializable isolation violation on table
Basically, a serializable isolation violation in Redshift means that two jobs were writing to the same table concurrently.
I should add that this would be okay as long as the jobs produced the same results regardless of the order in which they ran (i.e., they were serializable). However, that isn’t really possible when concurrent jobs create new rows in a dimension table with an auto-incrementing key.
Although one could argue that normalization shouldn’t be performed on Redshift at all, if you’re required to do it, you need to prevent two jobs from inserting new rows into the same dimension table at the same time.
If you’re using a star schema, it’s quite normal for many fact tables to reference a single dimension table. As a result, the ETL jobs that load these fact tables all need to insert new keys into the same dimension table. If these jobs ever run concurrently, you’ll end up with serializable isolation violation errors.
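To illustrate, a typical “insert missing dimension rows” step looks something like the sketch below (prod.dimtable, staging.facts, and the column names are hypothetical). When two ETL jobs run a statement like this against the same dimension table at the same time, Redshift can detect a cycle between the transactions and abort one of them with ERROR 1023.

```sql
-- Hypothetical dimension-load step; each fact-table ETL job runs
-- a statement like this before loading its own fact table.
INSERT INTO prod.dimtable (attribute_value)
SELECT DISTINCT s.attribute_value
FROM staging.facts s
LEFT JOIN prod.dimtable d
  ON d.attribute_value = s.attribute_value
WHERE d.attribute_value IS NULL;  -- only values not already in the dimension
```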
In the Redshift world, you’ll want the ETL jobs for multiple fact tables that normalize to the same dimension tables to be scheduled serially. This is how you would load data in the once-a-day maintenance window scenario.
If you’re forced to schedule these jobs independently where run times may overlap, or if you’re intentionally trying to run these jobs somewhat in parallel, you may consider an alternative solution.
Redshift’s documentation recommends the following:
Force serialization by having each session lock all tables
You can use the LOCK command to force operations to occur serially by blocking operations to tables that would result in serializable isolation errors.
To force serialization while avoiding deadlock, make sure to do the following:
Lock all tables affected by the transaction, including those affected by read-only SELECT statements inside the transaction.
Lock tables in the same order regardless of the order of operations to be performed.
Lock all tables at the beginning of the transaction, before performing any operations.
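Following that advice, each ETL transaction would look roughly like the sketch below (the table names are hypothetical). Every table the transaction touches, including ones it only reads, is locked up front, in the same fixed order in every job:

```sql
BEGIN;
-- Lock every table this transaction touches, in the same order in every job,
-- before performing any reads or writes (avoids deadlock between jobs).
LOCK prod.dimtable, prod.facttable_a, prod.facttable_b;
-- ... SELECTs against prod.dimtable, INSERTs into the fact tables ...
COMMIT;
```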
However, this seems like a horrible idea on a production system. You don’t really want to lock a dimension table against reads and writes for the entire time your ETL job is loading data.
The semaphore solution (my recommendation)
I was thinking: if only there were a way to pass a semaphore, I could force the jobs to take turns writing to the dimension table without having to lock the entire dimension table against reads and writes.
I decided to use a shared table and table locking to act as a semaphore whenever writing to a dimension table. Locking a shared “dummy” table prevents needlessly locking the dimension table itself.
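The lock table only needs to exist; it never stores data. A one-time setup might look like this (the name and column are arbitrary, and one lock table per dimension table keeps unrelated jobs from blocking each other):

```sql
-- One-time setup: an empty "dummy" table used purely as a semaphore.
CREATE TABLE public.dimtable_lock (dummy INT);
```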
For example, for my table named “dimtable” in the “prod” schema, I will use this snippet of SQL anytime I write to the table:
```sql
BEGIN;
LOCK public.dimtable_lock;
-- Write to prod.dimtable
END;
```
This allows us to prevent concurrent writes to the dimension table without blocking reads of it. It isn’t perfect: if one transaction is writing to the dimension table while another queries it expecting to see the new records, that is, in theory, still a serializable isolation violation. In my case, though, it was a better trade-off than locking the dimension table completely against reads.
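Putting the pieces together, a full dimension-load step under the semaphore looks roughly like this sketch (prod.dimtable, staging.facts, and the column names are hypothetical):

```sql
BEGIN;
-- Take the semaphore: other ETL jobs block here until we commit,
-- but readers of prod.dimtable are unaffected.
LOCK public.dimtable_lock;
-- Insert only the dimension values we don't have yet.
INSERT INTO prod.dimtable (attribute_value)
SELECT DISTINCT s.attribute_value
FROM staging.facts s
LEFT JOIN prod.dimtable d ON d.attribute_value = s.attribute_value
WHERE d.attribute_value IS NULL;
END;  -- commit releases the lock
```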
Troubleshooting serializable isolation issues
I’ve found that this is the best query to dig into the cause of the 1023 error.
If your error message is like this:
```
Serializable isolation violation on table - 1809923, transactions forming the cycle are: 41329759, 41329414 (pid:11276)
```
Then take the transaction xids above (41329759, 41329414) and paste them into this query:
```sql
WITH pids AS (
    SELECT DISTINCT pid
    FROM svl_qlog
    WHERE xid IN (41329759, 41329414)
)
SELECT a.*
FROM svl_qlog a
INNER JOIN pids b ON (a.pid = b.pid)
ORDER BY starttime;
```
The pid identifies the session. You’ll see which queries overlapped between the sessions, which ones were aborted, and which transaction was undone.