The Agile Data Warehouse on Amazon Redshift

Evolution through Continuous Deployment

As with any good improvement model, some goals, guidelines, and a basic framework for success must be established. In the case of business intelligence projects like this, the purpose is ultimately to support accelerating organizational change with better insights. Delivering valuable analysis quickly becomes the goal, in an environment with an ever-growing set of data sources to integrate.

The guidelines for this work align with the goals of DevOps.

What Redshift Isn’t

Truly, the hardest lesson in adopting Amazon Redshift is accepting its strengths and limitations.

Redshift is a columnar datastore with relational query capabilities, but it is not a relational database in the traditional sense. Foreign key constraints are not enforced; foreign keys, when specified, are used only for query planning and optimization.
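A minimal sketch of what that means in practice, using hypothetical tables: the constraint is accepted, but an orphaned row loads without complaint.

CREATE TABLE dim_customer (
	customer_id INTEGER PRIMARY KEY, -- also informational only, not enforced
	name        VARCHAR(64)
);

CREATE TABLE fact_order (
	order_id    INTEGER,
	customer_id INTEGER REFERENCES dim_customer (customer_id) -- used for planning, not enforced
);

-- No customer 999 exists, yet this insert succeeds; duplicate keys would be accepted too.
INSERT INTO fact_order VALUES (1, 999);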

While Redshift behaves more-or-less compatibly with PostgreSQL in many ways, it lacks many of the modern capabilities that PostgreSQL provides. Notable among these differences are Redshift’s lack of support for materialized views (detailed below), its very primitive support for JSON, and frankly poor performance on anything that isn’t purely tabular.

In many respects, it’s appropriate to view Redshift as a SQL query layer built over distributed storage, much like other map-reduce analytics solutions. There are distribution and sort keys, which make perfect sense for partitioning and sequencing à la map-reduce. It performs profoundly well on large datasets where aggregation can be aligned to the distribution key structure (i.e. siloed within a segment of the data, to minimize network cost).
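As a hedged sketch of that alignment (the events table and its columns here are hypothetical): distributing on customer_id keeps each customer’s rows on one slice, so a per-customer aggregation can be computed within each slice before results are combined.

CREATE TABLE events (
	event_time  TIMESTAMP,
	customer_id BIGINT,
	event_type  VARCHAR(32)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (event_time);

-- Aggregation aligned to the distribution key: no cross-node shuffle of raw rows is needed.
SELECT customer_id, COUNT(*) AS event_count
FROM events
GROUP BY customer_id;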

The DIY Data Warehouse

Much of the missing functionality that my clients find essential centers on automation of ETL, data modeling, and reporting workflows. Where Redshift lacks features, the rest of the Amazon platform provides many resources to implement that functionality according to your business case.

This is where the DevOps paradigms become essential: infrastructure as code, microservices, continuous integration, and continuous deployment are the foundation of successful automation in this category. In general, the implementations follow this model:

A few important considerations:

Don’t Flake Out

Because Redshift is columnar, it is also well suited to a relatively high degree of denormalization, rather than fully snowflaked structures, since the columnar storage lends itself to compression. Redshift is also best suited to batch processing of large datasets, as frequent loading of small changes tends to be inefficient.

With these in mind, I’ve found Redshift to perform better when modeling time-series data, with slowly-changing dimensionality, under something resembling a pure Type-6 fact table. In general, the process follows this model:

-- Stage the new time-series slice in a scratch schema, then append it to the target.
CREATE SCHEMA IF NOT EXISTS temp;

-- "timestamp" is a reserved word in Redshift, so it must be quoted as an identifier.
CREATE TABLE temp.ts_target$temp
	INTERLEAVED SORTKEY ("timestamp", id)
	AS (
		SELECT
			SYSDATE AS "timestamp",
			id,
			other_fields_etc
		FROM
			source_table NATURAL JOIN other_source
	);

-- Create the target on first run, then move the staged rows in and clean up.
CREATE TABLE IF NOT EXISTS ts_target (LIKE temp.ts_target$temp);
ALTER TABLE ts_target APPEND FROM temp.ts_target$temp;
DROP TABLE IF EXISTS temp.ts_target$temp;

Historically this was implemented using Redshift’s support for SELECT INTO queries, but Amazon’s relatively recent addition of ALTER TABLE APPEND shows significant performance improvements.

The resulting materialized views include some level of denormalized records. In effect, Redshift’s columnar compression takes the place of an explicit reference-key structure: dimension values are stored alongside each record and decompressed at read time, rather than joined in.

Essentially the same approach applies to materializing snapshots of views, independent of any time-series model; the primary difference is either truncating or replacing the target table.
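A minimal sketch of that snapshot variant, assuming a hypothetical snapshot_target table that already exists with a matching structure:

-- Stage the full snapshot, then swap it in by truncate-and-reload.
-- Note that TRUNCATE commits immediately in Redshift.
CREATE TABLE temp.snapshot_target$temp AS (
	SELECT id, other_fields_etc
	FROM source_table NATURAL JOIN other_source
);
TRUNCATE snapshot_target;
INSERT INTO snapshot_target SELECT * FROM temp.snapshot_target$temp;
DROP TABLE temp.snapshot_target$temp;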

The Brittleness of JSON

Redshift has some JSON support. Yep.

In some cases, where organizations treat Redshift as a data lake, they may want to store raw stream data as JSON directly in Redshift, with the intent of making it easily queryable in the future. Two major limitations of Redshift pose significant challenges to this approach.

In my observation, it’s almost always preferable to avoid putting JSON in Redshift. While it’s possible to mitigate some of the issues below, doing so requires some sort of up-front validation process on the JSON stream before it reaches Redshift.

And if you are going to touch the stream before load anyway, simply replacing that validation step with a preliminary extraction process offers a greater advantage.

JSON Correctness

Redshift requires that every record containing JSON be 100% valid, UTF-8 encoded JSON. If any record deviates, it will block the entire query from returning any results: one malformed record in millions can prevent access to all of them.

In large-scale, evolving software environments, errors will occasionally occur, where one of the many systems sending JSON data somehow sends an incomplete or malformed record.

Two approaches seem sensible for mitigating this problem within Redshift; however, neither offers a long-term, holistic advantage:

  1. Perform some kind of validation in Redshift, either in SQL (i.e. a validation view, even materialized) or as a Python user-defined function (see the sketch after this list).
  2. Build other assurance into your data pipeline, such that invalid JSON never reaches Redshift.
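As an illustration of the first approach, here is a minimal sketch of a validation UDF (the function name is my own), which can be used to filter out malformed records before any extraction functions run:

CREATE OR REPLACE FUNCTION is_valid_json (j VARCHAR(MAX))
RETURNS BOOLEAN IMMUTABLE
AS $$
	import json
	try:
		json.loads(j)
		return True
	except (ValueError, TypeError):
		return False
$$ LANGUAGE plpythonu;

-- Example usage: surface only well-formed records (or, inverted, audit the bad ones).
SELECT id, jsontext
FROM my_base_table_with_json_field
WHERE is_valid_json(jsontext);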

There are certainly other approaches. In either case above, though, the performance cost at scale has been simply untenable, even when 100% valid JSON can be assured.

JSON Performance

For the description below, please consider a view such as this:

CREATE VIEW my_json_extractor AS (
	SELECT
		id,
		jsontext,
		-- Each json_extract_path_text call re-parses the same jsontext value.
		json_extract_path_text(jsontext, 'triangle', 'base') AS t_base,
		json_extract_path_text(jsontext, 'triangle', 'height') AS t_height,
		json_extract_path_text(jsontext, 'area') AS area
	FROM
		my_base_table_with_json_field
);

In Redshift, JSON is stored as plain text and is never treated as an “object” until one of Redshift’s JSON functions attempts to read from it. Further, from what I observe, it appears that when a query processes the JSON text, it parses it separately for each function call.

In the example above, it would create 3 separate object instances for each record. For 100,000 records with 3 properties extracted from one JSON field, it would create 300,000 separate JSON objects.

This becomes inefficient quickly.

The problem is compounded by the evolving nature of software and the pace of modern development: naming conventions change, default values change, and so on. Over data aggregated from years of operation, across multiple independently developed but otherwise related systems, this often forces queries to extract additional fields just to infer the correct state. The effect is a multiplicative performance overhead.
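For example, a hypothetical case of schema drift: suppose one upstream system renamed the 'height' property to 'h'. The view now has to probe both paths and coalesce the results, roughly doubling the parsing work for that one value.

SELECT
	id,
	COALESCE(
		NULLIF(json_extract_path_text(jsontext, 'triangle', 'height'), ''),
		json_extract_path_text(jsontext, 'triangle', 'h')
	) AS t_height
FROM
	my_base_table_with_json_field;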

Don’t use JSON

In the end, our conclusion is to avoid JSON in Redshift altogether. A better approach, when dealing with JSON data, is to model it into tabular form before it ever reaches Redshift. I’ve seen great results from capturing events in DynamoDB, streaming them through a Lambda job, and storing the flattened output elsewhere for Redshift to pick up. S3 offers a good medium for retaining longer-term snapshots, with fairly simple COPY and UNLOAD commands to move data directly into and out of Redshift.
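A minimal sketch of that last step, assuming the Lambda job writes flattened, pipe-delimited files to a hypothetical S3 prefix and that an IAM role with access to it is attached to the cluster:

-- Load the flattened records into a (hypothetical) tabular target.
COPY shapes_flat (id, t_base, t_height, area)
FROM 's3://my-bucket/flattened/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
DELIMITER '|';

-- And the reverse direction, for retaining longer-term snapshots in S3.
UNLOAD ('SELECT * FROM shapes_flat')
TO 's3://my-bucket/snapshots/shapes_flat_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader';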

Avoiding Lock-in

Another pain point of working with Redshift, especially when supporting integration from third-party data services, is column locking by views.

When you query a view in Redshift, the database does not re-parse the text of the SQL definition “on the fly” every time. Instead, when you create the view, it “binds” to the underlying columns (by internal identifier, not by name) as a way to speed up query planning.

This has some benefits for database operation, but its side-effects on human workflow (even when automated) become troublesome. In particular:

Because column types (including the number of characters allowed in a text field) are effectively locked in and cannot be changed directly, the basic approach to structural adjustments requires the following (sketched after the list):

  1. Creating a new column,
  2. Copying in the existing data,
  3. Dropping the old column, and
  4. Renaming the new column to replace the old.
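A minimal sketch of that sequence, widening a text column on a hypothetical vendor_data table; note that dropping the old column will fail (or require CASCADE) if any view still references it, which leads to the larger problem described next.

-- 1. Create a new, wider column.
ALTER TABLE vendor_data ADD COLUMN note_wide VARCHAR(256);
-- 2. Copy in the existing data.
UPDATE vendor_data SET note_wide = note;
-- 3. Drop the old column (fails if a view still depends on it).
ALTER TABLE vendor_data DROP COLUMN note;
-- 4. Rename the new column to replace the old.
ALTER TABLE vendor_data RENAME COLUMN note_wide TO note;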

The net effect is that even the smallest of structural changes requires completely dropping related views, applying the desired changes, and then re-creating the views.

This might be fine for some workflows, but when integrating data from multiple third-party vendors, across schemata they manage (and therefore occasionally change), it becomes a strong argument against using views at all in those cases.

Getting Material

There’s an alternative approach that resolves this limitation, though it may at first seem unintuitive. The automation cited above (in The DIY Data Warehouse) makes it practical.

When dealing with schemata that are likely to change in this manner, views should be treated as ephemeral, never persisting against the underlying tables. The automation layer, performing scheduled jobs, should basically do the following (a sketch follows the list):

  1. Create (temporarily) all of the views required to model your data.
  2. Populate materialized tables from the integrated views.
  3. Remove the temporary views.
  4. Manipulate the materialized tables accordingly (APPEND for time-series, etc.)
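Here is a minimal sketch of such a job, using hypothetical vendor tables and reusing the staging pattern from earlier:

-- 1. Model the vendor's data with a view that exists only for the duration of the job.
CREATE VIEW temp.vendor_model_v AS (
	SELECT a.id, a.created_at, b.status
	FROM vendor_raw a
	JOIN vendor_status b ON b.id = a.id
);

-- 2. Materialize the integrated result into a staging table.
CREATE TABLE temp.vendor_model$temp AS (
	SELECT * FROM temp.vendor_model_v
);

-- 3. Remove the view, so no dependency persists against the vendor-managed tables.
DROP VIEW temp.vendor_model_v;

-- 4. Manipulate the materialized table accordingly (APPEND for time-series, etc.).
CREATE TABLE IF NOT EXISTS vendor_model (LIKE temp.vendor_model$temp);
ALTER TABLE vendor_model APPEND FROM temp.vendor_model$temp;
DROP TABLE temp.vendor_model$temp;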

A few more steps and considerations may be appropriate, as views are often a suitable abstraction layer to maintain for exposing data to business users. To provide that, the central point is to ensure those views depend only on materialized tables whose schemata are managed by the BI development team(s).
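For instance, a hedged sketch of such a business-facing view (the schema and names are hypothetical): it depends only on the BI-managed vendor_model table, so it never needs to be dropped when a vendor changes their schema.

-- Assumes a reporting schema already exists for business-user access.
CREATE VIEW reporting.vendor_daily_status AS (
	SELECT
		DATE_TRUNC('day', created_at) AS day,
		status,
		COUNT(*) AS records
	FROM
		vendor_model
	GROUP BY
		1, 2
);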

Realized Benefits

The general principles in Evolution through Continuous Deployment, and the techniques outlined to address these specific challenges, have proven their worth many times.

  1. When new data streams begin loading, we have workloads and models (i.e. views) built against them, tuned for initial roll-out to business users, within 30 minutes.
  2. Maintenance is mostly automatic, operating scheduled jobs with logging and notifications directly in the Continuous Integration environment.
  3. The high degree of automation accelerates workflow, allowing us to deliver real business value much faster. From the time when work (e.g. schema adjustment) is done, committed into the source-code repository, it usually reaches the production environment in less than 10 minutes.
  4. The “plumbing” enables us to innovate, instead of consuming focus and costing us time.

Conclusions

Redshift is a great OLAP database environment, though its limitations may surprise you. Many of its benefits derive from its simplicity, a simplicity these limitations reflect. Provided these techniques and design considerations can be applied, Redshift offers the flexibility needed for growing, evolving organizations to leverage their data at massive scale. Overcoming its limitations requires a tooling ecosystem surrounding Redshift, while Redshift itself must be respected for the simple, efficient, and yet robustly queryable datastore that it is.