overlap – BigDataMark

I recently had to fix someone else’s data before I could analyze it. They had a slowly changing dimension where they had duplicate and overlapping records, so a join using a single point in time resulted in an accidental cartesian join.

This is annoying but easy to fix. The following example query will merge the duplicate and overlapping records. Just run it continuously until it doesn’t return any results, as it only merges adjacent rows. If you have multiple records in a set that overlap, it’ll only handle the first one in each set.

This uses window functions and the OVERLAPS() function, so hopefully you’re using PostgreSQL.

BEGIN;
CREATE TEMPORARY TABLE newrows AS

SELECT DISTINCT LEAST(a.id, b.id) AS id
, a.company_id
, a.office_id
, a.plan_type
, LEAST(a.date_start, b.date_start) AS date_start
, GREATEST(a.date_end, b.date_end) AS date_end
, a.is_removed
FROM subscriptions a
, subscriptions b
WHERE a.id <> b.id
AND a.company_id = b.company_id
AND a.office_id = b.office_id
AND a.plan_type = b.plan_type
AND a.is_removed = b.is_removed
AND (a.date_start, a.date_end) OVERLAPS (b.date_start, b.date_end);

DELETE FROM subscriptions a
USING subscriptions b
WHERE a.id <> b.id
AND a.company_id = b.company_id
AND a.office_id = b.office_id
AND a.plan_type = b.plan_type
AND a.is_removed = b.is_removed
AND (a.date_start, a.date_end) OVERLAPS (b.date_start, b.date_end);

INSERT INTO subscriptions
SELECT * FROM newrows;

SELECT * FROM newrows;

DROP TABLE newrows;
END;

Tag: overlap

How to merge duplicate/overlapping timeseries records in your slowly changing dimension (SCD)

Share this: