- Evolution through Continuous Deployment
- What Redshift Isn’t
- The DIY Data Warehouse
- Don’t Flake Out
- The Brittleness of JSON
- Avoiding Lock-in
- Realized Benefits
- Conclusions
Evolution through Continuous Deployment
As with any good improvement model, some goals, guidelines, and a basic framework for success must be established. In the case of business intelligence projects like this, the purpose is ultimately to support accelerating organizational change with better insights. Delivering valuable analysis quickly becomes the goal, in an environment with an ever-growing set of data sources to integrate.
The guidelines for this work align to the goals of DevOps:
- Automate everything, driven by version-managed configuration, especially routine maintenance
- Minimize complexity of work to make small evolutionary changes to configuration, schemata and data models
- Reduce lead-time for data model changes and new data source integration
- Provide sensible utilities to accelerate (preferably automate) data discovery
- Provide robust logging of all deployments and data modeling workloads (e.g. ETL)
- Provide snapshots of both source and modeled data as build artifacts
- Provide assurances à la integration testing for all schema definition (DDL) and data modeling processes
What Redshift Isn’t
The hardest lesson in adopting Amazon Redshift is truly accepting its strengths and limitations.
Redshift is a columnar datastore, with relational query capabilities, but it is not a relational database in the traditional sense. There are no foreign key constraints; foreign keys, when specified, are used only for query planning and optimization.
While Redshift behaves more-or-less compatibly with PostgreSQL in many ways, it lacks many of the modern capabilities that PostgreSQL provides. Notable among these differences is Redshift’s lack of support for materialized views (detailed below), very primitive support for JSON, and basically terrible performance on anything that isn’t purely tabular.
In many respects, it’s appropriate to view Redshift as a SQL query layer built over distributed storage, much like other map-reduce analytics solutions. There are distribution and sort keys, which make perfect sense for partitioning and sequencing à la map-reduce. It performs profoundly well on large datasets, where aggregation can be aligned to the distribution key structure (i.e. siloed within a segment of the data, to minimize network cost).
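To make that concrete, here is a minimal sketch (table and column names are my own, illustrative choices) of a table whose distribution and sort keys align with the aggregations run against it:

CREATE TABLE page_events (
    account_id  BIGINT,
    event_time  TIMESTAMP,
    page_views  INTEGER
)
DISTKEY (account_id)
COMPOUND SORTKEY (account_id, event_time);

-- Grouping on the distribution key keeps each aggregation local to its slice,
-- so nodes do not shuffle rows across the network before reducing them
SELECT account_id, DATE_TRUNC('day', event_time) AS event_day, SUM(page_views)
FROM page_events
GROUP BY account_id, DATE_TRUNC('day', event_time);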
The DIY Data Warehouse
Much of the missing functionality my clients find essential centers on automation of ETL, data modeling, and reporting workflows. Where Redshift lacks features, the rest of the Amazon platform provides many resources to implement that functionality according to your business case.
This is where the DevOps paradigms become essential: infrastructure as code, microservices, continuous integration, and continuous deployment are the foundation of successful automation in this area. In general, the implementations follow this model:
- Everything is managed in Git including schema definition, all configuration, and process logic (as code)
- A CI environment, such as GitLab CI, is extended to fully automate deployment of all resources, including infrastructure, data schema, and static assets
- Regularly scheduled workloads integrate with CI to configure applications based on currently deployed schema state, etc.
- When a configuration, service or data schema changes in Git, within minutes it’s tested against sample data, and deployed to production.
A few important considerations:
- Testing must always rely on data sampled from production data sources. This can be automated (relatively) simply with nightly UNLOAD and COPY queries targeting S3 (see the sketch after this list).
- Some assurances, which we’d expect through unit- and integration-testing in other software environments, can be supported by relying on the database engine to handle transactions. To safely test these, automation should spin up new Redshift instances for an isolated, disposable testing environment, using Amazon’s APIs.
- The CI environment will therefore need access to the AWS API libraries and suitable credentials.
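As a rough illustration of the nightly sampling step, something like the following works, assuming an S3 bucket and IAM role of your own (the bucket, prefix, role, table, and column names here are placeholders):

-- Export yesterday's slice of a production table to S3
UNLOAD ('SELECT * FROM source_table WHERE loaded_at >= DATEADD(day, -1, SYSDATE)')
TO 's3://example-bi-artifacts/samples/source_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-unload'
GZIP ALLOWOVERWRITE;

-- Load the sample into the disposable test cluster created for the CI run
COPY source_table
FROM 's3://example-bi-artifacts/samples/source_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
GZIP;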
Don’t Flake Out
Because Redshift is columnar, it’s also well-suited to a relatively high degree of denormalized data, rather than fully snow-flaked structures, as the columnar storage lends itself to compression. Redshift is also best suited to batch-processing large datasets, as frequent loading of small changes tends to be inefficient.
With these in mind, I’ve found Redshift to perform better when modeling time-series data with slowly-changing dimensionality, under something resembling a pure Type-6 fact table. In general, the process follows this model:
-- Staging schema for intermediate build artifacts
CREATE SCHEMA IF NOT EXISTS temp;

-- Build the new time-series slice in a staging table
CREATE TABLE temp.ts_target$temp
    INTERLEAVED SORTKEY ("timestamp", id)
AS (
    SELECT
        SYSDATE AS "timestamp",  -- "timestamp" is a reserved word, so it must be quoted
        id,
        other_fields_etc
    FROM
        source_table NATURAL JOIN other_source
);

-- Ensure the target exists, then append the staged rows and clean up
CREATE TABLE IF NOT EXISTS ts_target (LIKE temp.ts_target$temp);
ALTER TABLE ts_target APPEND FROM temp.ts_target$temp;
DROP TABLE IF EXISTS temp.ts_target$temp;
Historically this was implemented using Redshift’s support for SELECT INTO queries, but Amazon’s relatively recent addition of ALTER TABLE APPEND shows significant performance improvements.
The resulting materialized views include some level of denormalized records. In effect, Redshift’s columnar storage relies on decompression to provide the (effective) joining of dimension values to each record, rather than an explicit reference key structure.
Basically the same approach applies to materializing snapshots of views, without respect to any time-series model. The primary difference would be either truncating or replacing the target table.
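For a non-time-series snapshot, the append step gives way to a truncate-and-reload, roughly like this (the table and view names are illustrative):

-- Stage a fresh snapshot of the view's output
CREATE TABLE temp.snapshot_target$temp AS (
    SELECT * FROM some_view_to_materialize
);

-- Replace the contents of the materialized table
-- (note: TRUNCATE commits immediately in Redshift; use DELETE inside a
-- transaction if readers must never observe an empty table)
CREATE TABLE IF NOT EXISTS snapshot_target (LIKE temp.snapshot_target$temp);
TRUNCATE snapshot_target;
INSERT INTO snapshot_target SELECT * FROM temp.snapshot_target$temp;

DROP TABLE IF EXISTS temp.snapshot_target$temp;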
The Brittleness of JSON
Redshift has some JSON support. Yep.
In some cases, where organizations treat Redshift as a data lake, they may want to store raw stream data as JSON directly in Redshift, with the intent of making it easily queryable in the future. Two major limitations of Redshift pose significant challenges to this approach.
In my observation, it’s almost always preferable to avoid putting JSON in Redshift. While it’s possible to mitigate some of the issues below, doing so requires some sort of up-front validation of the JSON stream before it reaches Redshift, and simply replacing that validation with a preliminary extraction process, before load, offers greater advantage.
JSON Correctness
Redshift requires that every record containing JSON be 100% valid, UTF-8 encoded JSON. If any record deviates, the entire query fails to return results. A single malformed record among millions can prevent access to all of them.
In large-scale, evolving software environments, errors will occasionally occur, where one of the many systems sending JSON data somehow sends an incomplete or malformed record.
Two approaches seem sensible for mitigating this problem within Redshift; however, neither offers a long-term, holistic advantage:
- Perform some kind of validation in Redshift, either in SQL (i.e. a validation view, even materialized) or as a Python user-defined function (a minimal UDF is sketched below).
- Build other assurance into your data pipeline, such that invalid JSON never reaches Redshift.
There are certainly other approaches. In either case above, the performance cost at scale has been simply untenable, even if you can assure 100% valid JSON.
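For reference, a validation UDF along these lines is straightforward to write; the function name and signature here are my own illustration, not a built-in:

-- A per-row JSON validity check (illustrative name); it runs Python for every
-- record, which is exactly the kind of per-record cost discussed above
CREATE OR REPLACE FUNCTION is_valid_json (j VARCHAR(MAX))
RETURNS BOOLEAN
STABLE
AS $$
    import json
    if j is None:
        return None
    try:
        json.loads(j)
        return True
    except ValueError:
        return False
$$ LANGUAGE plpythonu;

-- e.g. SELECT ... FROM my_base_table_with_json_field WHERE is_valid_json(jsontext);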
JSON Performance
For the description below, please consider a view such as this:
CREATE VIEW my_json_extractor AS (
    SELECT
        id,
        jsontext,
        -- each call below re-reads and re-parses the full JSON text
        json_extract_path_text(jsontext, 'triangle', 'base') AS t_base,
        json_extract_path_text(jsontext, 'triangle', 'height') AS t_height,
        json_extract_path_text(jsontext, 'area') AS area
    FROM
        my_base_table_with_json_field
);
In Redshift, JSON is stored as pure text, and never treated as an “object” until one of Redshift’s functions attempts to read from it. Further, from what I observe, it seems when a query processes the JSON text, it does so separately for each function call.
In the example above, it would create 3 separate object instances for each record. For 100,000 records with 3 properties extracted from one JSON field, it would create 300,000 separate JSON objects.
This becomes inefficient quickly.
The problem is compounded by the evolving nature of software and the pace of modern development: naming conventions change, default values change, and so on. Over data aggregated from years of operation across multiple, independently developed but otherwise related systems, this often forces queries to extract additional fields just to infer the correct state. The effect is a multiplicative performance overhead.
Don’t use JSON
In the end, our conclusion is to prefer avoiding JSON in Redshift altogether. A better approach, when dealing with JSON data, is to model it into tabular form before it ever reaches Redshift. I’ve seen great results from pairing DynamoDB with a Lambda job that streams and flattens the records, storing the output elsewhere for Redshift to pick up. S3 offers a good medium for retaining longer-term snapshots, with fairly simple COPY and UNLOAD options directly into Redshift.
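Once the records have been flattened upstream, pulling them into Redshift is a single statement; the target table, bucket path, and IAM role below are placeholders:

-- Load pre-extracted, tabular records from S3 (no JSON parsing inside Redshift)
COPY triangles (id, t_base, t_height, area)
FROM 's3://example-bi-lake/flattened/triangles/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
FORMAT AS CSV
GZIP;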
Avoiding Lock-in
Another pain point of working with Redshift, especially when supporting integration from third-party data services, is column locking by views.
When you query a view in Redshift, it does not re-evaluate the text of the SQL query “on the fly” every time. Instead, when you create the view, it “binds” to the columns (by node, not by name) as a way to improve performance in the query planning process.
This has some benefits to the database operation, but its side-effects on human workflow (even when automated) become troublesome. In particular:
- After renaming a column that a view references, the view will behave as though it had been updated to use the new name. (That’s sort of OK.)
- A linked column cannot be removed from its table (or replaced) without first dropping the view that references it, as the sketch below shows.
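A toy example (the table and view names are mine) makes the binding behavior concrete:

CREATE VIEW v_events AS (SELECT id, status FROM vendor_events);

-- Renaming is tolerated: the view silently follows the renamed column
ALTER TABLE vendor_events RENAME COLUMN status TO state;
SELECT * FROM v_events;

-- Dropping (or replacing) the bound column is blocked until the view is dropped
ALTER TABLE vendor_events DROP COLUMN state;  -- fails with a dependency error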
Because columns’ types (including the number of characters allowed in a text field) are effectively locked-in and cannot be changed directly, the basic approach to structural adjustments requires:
- Creating a new column,
- Copying in the existing data,
- Dropping the old column, and
- Renaming the new column to replace the old.
The net effect is that even the smallest of structural changes requires completely dropping related views, applying the desired changes, and then re-creating the views.
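Sketched as SQL, widening a text column under these constraints looks something like this (the table, column, and view names are illustrative):

-- Dependent views must go first
DROP VIEW IF EXISTS my_dependent_view;

-- Widen a VARCHAR by add, copy, drop, rename
ALTER TABLE vendor_data ADD COLUMN description_new VARCHAR(1024);
UPDATE vendor_data SET description_new = description;
ALTER TABLE vendor_data DROP COLUMN description;
ALTER TABLE vendor_data RENAME COLUMN description_new TO description;

-- Finally, re-create the dropped views from their version-controlled definitions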
This might be fine for some workflows, but when integrating data from multiple third-party vendors, across schemata they manage (and therefore occasionally change), it becomes a strong argument against using views at all in those cases.
Getting Material
There’s an alternative approach that resolves this limitation, though it may at first seem unintuitive. The automation cited above (in The DIY Data Warehouse) makes it practical.
When dealing with schemata that are likely to change in this manner, views should be treated as ephemeral, never persisting against the underlying tables. The automation layer, performing scheduled jobs, should basically (as sketched after this list):
- Create (temporarily) all of the views required to model your data.
- Populate materialized tables from the integrated views.
- Remove the temporary views.
- Manipulate the materialized tables accordingly (APPEND for time-series, etc.)
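Put together, one scheduled pass might look roughly like this (the temp schema comes from the earlier example; the vendor table, view, and column names are placeholders):

-- 1. Model the vendor data with a short-lived view
CREATE VIEW temp.vendor_model$tmp AS (
    SELECT v.id, v.updated_at, v.status, o.region
    FROM vendor_schema.events v
    JOIN vendor_schema.orgs o ON o.id = v.org_id
);

-- 2. Materialize the view's output into a staging table
CREATE TABLE temp.vendor_model$stage AS (SELECT * FROM temp.vendor_model$tmp);

-- 3. Drop the view, so nothing stays bound to the vendor's columns
DROP VIEW temp.vendor_model$tmp;

-- 4. Fold the staged rows into the long-lived materialized table
CREATE TABLE IF NOT EXISTS vendor_model (LIKE temp.vendor_model$stage);
ALTER TABLE vendor_model APPEND FROM temp.vendor_model$stage;
DROP TABLE IF EXISTS temp.vendor_model$stage;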
A few more steps and considerations may be appropriate, as views are often a suitable abstraction layer to maintain for exposing data to business users. To provide that, the central point is to isolate those views’ dependencies onto materialized tables whose schemata are managed by the BI development team(s).
Realized Benefits
The general principles in Evolution through Continuous Deployment, and the techniques outlined to address these specific challenges, have proven their worth many times.
- When new data streams begin loading, we have workloads and models (i.e. views) built against them, tuned for initial roll-out to business users, within 30 minutes.
- Maintenance is mostly automatic, running as scheduled jobs with logging and notifications directly in the Continuous Integration environment.
- The high degree of automation accelerates workflow, allowing us to deliver real business value much faster. From the time work (e.g. a schema adjustment) is done and committed to the source-code repository, it usually reaches the production environment in less than 10 minutes.
- The “plumbing” enables us to innovate, instead of consuming focus and costing us time.
Conclusions
Redshift is a great OLAP database environment, though its limitations may surprise you. Many of its benefits derive from its simplicity, and these limitations are the flip side of that simplicity. Provided these techniques and design considerations can be applied, Redshift offers the flexibility needed for growing, evolving organizations to leverage their data at massive scale. Overcoming its limitations requires a tooling ecosystem surrounding Redshift, while respecting Redshift for the simple, efficient, and yet robustly queryable datastore that it is.