Lessons from application development with databases

I recently finished up the main development effort of a new project. It incorporates a lot of database techniques I'd used in previous projects, and several that I hadn't.

Description

The application is written in Django, but uses SQLAlchemy to the nearly complete exclusion of Django's own ORM. The only thing that touches Django's ORM is the session module.

The view-handling functions perform first-level security checks, input scrubbing and coercion from strings to other types, and reporting errors to the user. These functions then call into logic functions, which perform most of the interaction with the database. Database interactions use SQLAlchemy as basically a wrapper over the DBAPI, using strings for queries and passing parameters positionally.

The database is PostgreSQL (which has an excellent set of features). Database views and database functions do all the heavy lifting; the logic functions themselves mostly use their own parameters to filter or sort view results (at the database), do a simple join or two, or perform data updates. The views do a lot of aggregation, including statistics, modeling, and forecasts; much of the aggregation uses window functions.

A migration system handles all schema updates, which include: data structure (CREATE/ALTER/DROP TABLE), data (UPDATE), views (CREATE/DROP VIEW), functions (CREATE/DROP FUNCTION), and grants (GRANT/REVOKE).

Migrations are managed by a small set of simple bash scripts called pg-migrate. Migrations have a single total order, with individual migrations identified by contiguous positive ordinals. A migration script is a single SQL file, named with its ordinal, a hyphen, and some brief descriptive text allowed in filenames, and a .sql extension. The bash scripts automatically wrap individual migrations in transaction blocks during execution.

What went well

Using database views is a huge win for abstraction.

Views are the first level of named abstraction for queries, like functions are for code in general-purpose languages. By giving names to parts of complicated queries, they become much easier to understand.

It's very easy to play with queries at psql's REPL until they're right, then freeze them into views. If you decide that a query's grouping, partitioning, or aggregating must be changed, it's often easier to modify a query than it would be to modify the equivalent nested and sequential loops in non-declarative code.

With views, program logic becomes very simple—just selecting from a view, maybe filtering and ordering, but very little post-processing.

Views don't lose optimization. The query planner essentially expands all views used in a query to derive one base-level query, then applies its optimization heuristics over that. Thus, a database can optimize across named-view boundaries. Typically, a procedural language cannot optimize across calls between different compilation units, and an object-oriented language cannot even optimize across any dynamic dispatch calls (i.e. invoking a method on an object). What might require looping over the data dozens of times in a typically-factored object-oriented solution, when translated into equivalent database views, could very well be accomplished in a single pass over the data.

Migrations were also a big success. Deploys were quick—I could do several a day. Many deploys happened with zero downtime, without using any kind of database or application fail-over.

Migrations were designed to stop at useful points if a human needed to update data before continuing.

What went poorly

I put site-specific data into some early migrations. Major mistake! I had to fix-up early migrations in order to cleanly build databases for new sites.

Modifying existing views is sometimes difficult. Changing the output columns or column-types of a view means you must DROP it first, then CREATE it over again. Otherwise, you can get away with CREATE OR REPLACE. In either case, you must pull up the definition from the last migration (the view's definition in a database dump is annotated to show type information and scope-resolution, but this is too clumsy to edit as a primary source).

If a view must be DROP'd for an update, every view that references it must also be DROP'd then CREATE'd again. Because of this, a migration which changes a single view can get very large by pulling in every dependent view.

In this project, editing views was different from editing the program. The source files for a program are a snapshot of the entire program—to change the program, change the snapshot, then reload the program. Editing the views, on the other hand, required applying changes through a diff (the migration). The snapshot could not be edited in place. Diffs naturally have a different language than snapshots, as they express how to modify something that already exists while maintaining integrity, whereas snapshots express the structure that will exist, ex nihilo.

I also lacked any sort of rigor in defining views near the beginning of the project. Some views unnecessarily included all of the underlying tables' fields. When the structure of those tables changed, the views had to change as well.

Queries in Python are painful. They're all strings, so they don't wrap like code, and can't bind directly with variables.

Paths to improvement

A migration should be named with only a number. Comments at the beginning of the file can provide explanatory text. The migration scripts would be augmented to parse these leading comments and display them to the user, when requested.

Split schema definition into two parts, tentatively called structure and world. Structure would be expressed in migration files, and include operations to alter domains, tables, table constraints, or to update existing data to accommodate changing structure or constraints. It would also include any data that is not site-specific (site-specific data being banned from migrations). World would be a snapshot description of everything else—views, functions, etc.

To perform a migration, the migration scripts would drop everything in world, apply any pending migration scripts, then reload everything in world. Consideration would have to be given to reloading world in the event that a migration fails. Dropping world would also be disadvantageous to migrations that would benefit from defined views in order to update existing data.

The system should allow world to be split into multiple files, for some modicum of modularity. The system would have an option to discover a working load-order and save that order to a file, so that the modules may be reloaded quickly and without trial-and-error, for production migrations.

Views should be classified by type. I have discovered three basic types:

1. An extension view, which extends a table with some extra, easy-to-calculate or easy-to-join fields. Such views seldom have aggregations, and joins are usually one-to-one. Extension views should include every field of the base table. Code will usually query an extension view instead of its underlying table, to get the extension fields in addition to the base fields. Extension views are named after the underlying table, with _ext appended.
2. A processing view. These views may perform significant aggregation and joining over several tables, to provide some well-defined, narrowly-scoped, output. Processing views include only the applicable primary keys (or candidate keys) from the underlying tables, plus their own contributions. Processing views are named according to their purpose.
3. A summary view. Summary views are based on a single table or single extension view, but join information from related processing views. They include most or all fields from all data sources. Summary views are named with an appended _summary.

This scheme maximally decouples processing views, being the most complex, from changes in underlying tables. Extension and summary views have simple definitions, often using * in their select lists, and so are not so annoying to update to match participating tables.

I would also consider using SQLAlchemy's table-definition features. However, it can't be the definitive source from which the database is built (because you can't express migrations there; it's a snapshot language, not a diff language). This would allow better code-formatting and comprehension by IDEs of table and field names and such. This would violate the DRY principle (don't repeat yourself), though.