How to localize daily aggregate data in Redshift to support multiple timezones

bigdatamarkWhether you’re currently storing your daily aggregate data with a local timezone or in GMT/UTC, you’ll eventually need to display that data in a different timezone. There is an optimal way to store reporting data in Redshift to support different locales.

If reporting data is currently being aggregated at the day grain, the database size is going to explode for each timezone supported. However, this shouldn’t be a performance issue as long as the sortkey is set up correctly, although it will impact data storage.

This may be obvious, but you won’t be able to transition to localized data unless you have a date field with the time of day, but you can translate to most timezones if you at least have data aggregated at the hour.

I recommend a table that looks like this:

datetimezonedaylightmeasure1measure2
2015-01-01US/Easterntrue41
2015-01-01US/Easternfalse51
2015-01-01GBtrue51
2015-01-01GBfalse41
CREATE TABLE mytable
(
 datekey DATE,
 timezone VARCHAR(32),
 daylight BOOLEAN,
 measure1 INTEGER,
 measure2 INTEGER
)
DISTSTYLE EVEN
SORTKEY
(
 datekey,
 timezone,
 daylight
);

If Redshift is also your data processing environment, you can use the CONVERT_TIMEZONE() function in Redshift. For example if your raw data is stored in UTC:

INSERT INTO mytable (
SELECT CONVERT_TIMEZONE('UTC', 'US/Eastern', datetimekey)
, 'US/Eastern'
, TRUE
, SUM(measure1)
, SUM(measure2)
FROM rawtable
GROUP BY 1,2,3
);

In Redshift, if you use the full timezone name, you’ll get dates adjusted for daylight savings time, appropriately switching between Eastern Standard Time and Eastern Daylight Time depending on the day. If you want Eastern Standard Time always, you would process your non-daylight records like this:

INSERT INTO mytable (
SELECT CONVERT_TIMEZONE('UTC', 'EST', datetimekey)
, 'US/Eastern'
, FALSE
, SUM(measure1)
, SUM(measure2)
FROM rawtable
GROUP BY 1,2,3
);

You may not need the daylight flag if you’re not offering this view to your clients.

You would then query your aggregate data for reporting in a localized time by using a query like this:

SELECT datekey
, SUM(measure1)
, SUM(measure2)
FROM mytable
WHERE datekey BETWEEN $startdate AND $enddate
AND timezone=$timezone
AND $daylight
GROUP BY 1;

In theory, one could have a measure column for each timezone, or a date column for each timezone and still only have one record per day, taking advantage of the columnar storage of Redshift and reducing the amount of data for table scans. However, having to change the schema to add new timezones is not maintenance that I look forward too. Having just a regular sort key should be almost as fast. The interleaved sort keys released in May of 2015 shouldn’t even be necessary. Just don’t forget to vacuum and analyze these fact tables when necessary.

Leave a Reply

Your email address will not be published. Required fields are marked *