
A crazy fast bit-counting technique

Let's say that you've got a lot of numbers that represent bitmasks of some kind, and you want to count how many times each bit is on or off across the entire set.  Maybe you're analyzing game positions represented as bitboards for an AI, or trying to find certain types of weaknesses in random-number generators, as happened with Forge (a successor to crypto-js) and Cryptocat (read the great write-up at Sophos).

So, you write some straightforward code to count the bits.  It grabs one bitmask.  If the lowest-order bit is set, it increments the counter for that bit position.  Then, it right-shifts the bitmask and moves to the counter for the next bit.  Repeat that for each bit in the mask, then repeat that for each bitmask:

#define N 1000000   // Number of bitmasks. (A #define, so the array size is a constant expression in C.)
unsigned long x[N]; // Assuming sizeof(unsigned long) == 8, or 64 bits.
int counts[64] = {0};

void count_simple(void) {
    for(int i = 0; i < N; i++) {
        int j = 0;
        while(x[i] != 0) {          // Stop early once no set bits remain.
            counts[j] += x[i] & 1;  // Add the lowest-order bit to its counter.
            x[i] >>= 1;             // Shift the next bit down (clobbers x[i]).
            ++j;                    // Advance to the next bit position.
        }
    }
}

You run your program, and it works correctly, but it's too slow.  I'll show you how to speed this up.  The technique, which applies to languages like Python or JavaScript as well as to C, is both crazy and crazy fast!


Lessons from application development with databases

I recently finished up the main development effort of a new project. It incorporates a lot of database techniques I'd used in previous projects, and several that I hadn't.

Description

The application is written in Django, but uses SQLAlchemy to the nearly complete exclusion of Django's own ORM. The only thing that touches Django's ORM is the session module.

The view-handling functions perform first-level security checks, scrub input and coerce it from strings to other types, and report errors to the user. These functions then call into logic functions, which perform most of the interaction with the database. Database interactions use SQLAlchemy as basically a wrapper over the DBAPI, using strings for queries and passing parameters positionally.

The database is PostgreSQL (which has an excellent set of features). Database views and database functions do all the heavy lifting; the logic functions themselves mostly use their own parameters to filter or sort view results (at the database), do a simple join or two, or perform data updates. The views do a lot of aggregation, including statistics, modeling, and forecasts; much of the aggregation uses window functions.
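
For a flavor of what these views look like, here is a minimal sketch of the window-function style of aggregation. The table and column names are invented for illustration, not taken from the project:

-- Hypothetical example: a per-account running balance.
CREATE VIEW account_running_balance AS
SELECT account_id,
       entry_date,
       amount,
       SUM(amount) OVER (PARTITION BY account_id
                         ORDER BY entry_date) AS balance
FROM ledger_entries;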

A migration system handles all schema updates, which include: data structure (CREATE/ALTER/DROP TABLE), data (UPDATE), views (CREATE/DROP VIEW), functions (CREATE/DROP FUNCTION), and grants (GRANT/REVOKE).

Migrations are managed by a small set of simple bash scripts called pg-migrate. Migrations have a single total order, with individual migrations identified by contiguous positive ordinals. A migration script is a single SQL file, named with its ordinal, a hyphen, some brief descriptive text (restricted to filename-safe characters), and a .sql extension. The bash scripts automatically wrap each migration in a transaction block during execution.
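
As an illustration (the filename and statements are invented), migration 42 might look like the following; pg-migrate would execute it inside BEGIN/COMMIT automatically:

-- 42-add-archived-flag.sql
ALTER TABLE projects ADD COLUMN archived boolean NOT NULL DEFAULT false;
UPDATE projects SET archived = true WHERE status = 'closed';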

What went well

Using database views is a huge win for abstraction.

Views are the first level of named abstraction for queries, like functions are for code in general-purpose languages. By giving names to parts of complicated queries, they become much easier to understand.

It's very easy to play with queries at psql's REPL until they're right, then freeze them into views. If you decide that a query's grouping, partitioning, or aggregating must be changed, it's often easier to modify a query than it would be to modify the equivalent nested and sequential loops in non-declarative code.
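
For instance, once a query behaves correctly at the REPL, freezing it is a one-statement wrapper (names invented for the example):

CREATE VIEW overdue_tasks AS
SELECT id, title, due_date
FROM tasks
WHERE due_date < CURRENT_DATE
  AND completed_at IS NULL;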

With views, program logic becomes very simple—just selecting from a view, maybe filtering and ordering, but very little post-processing.

Views don't lose optimization. The query planner essentially expands all views used in a query to derive one base-level query, then applies its optimization heuristics over that. Thus, a database can optimize across named-view boundaries. Typically, a procedural language cannot optimize across calls between different compilation units, and an object-oriented language cannot even optimize across any dynamic dispatch calls (i.e. invoking a method on an object). What might require looping over the data dozens of times in a typically-factored object-oriented solution, when translated into equivalent database views, could very well be accomplished in a single pass over the data.
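
A sketch of what this buys, with invented names: stacking one view on another does not force two passes over the data, because the planner expands both definitions before optimizing.

CREATE VIEW order_totals AS
SELECT order_id, SUM(quantity * price) AS total
FROM order_lines
GROUP BY order_id;

CREATE VIEW large_orders AS
SELECT * FROM order_totals WHERE total > 1000;

-- EXPLAIN SELECT * FROM large_orders; should show a single
-- aggregation pass over order_lines, with the filter applied
-- above the aggregate rather than a second scan.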

Migrations were also a big success. Deploys were quick—I could do several a day. Many deploys happened with zero downtime, without using any kind of database or application fail-over.

Migrations were designed to stop at useful points if a human needed to update data before continuing.

What went poorly

I put site-specific data into some early migrations. Major mistake! I had to fix up early migrations in order to cleanly build databases for new sites.

Modifying existing views is sometimes difficult. Changing the output columns or column-types of a view means you must DROP it first, then CREATE it over again. Otherwise, you can get away with CREATE OR REPLACE. In either case, you must pull up the definition from the last migration (the view's definition in a database dump is annotated to show type information and scope-resolution, but this is too clumsy to edit as a primary source).

If a view must be DROP'd for an update, every view that references it must also be DROP'd then CREATE'd again. Because of this, a migration which changes a single view can get very large by pulling in every dependent view.
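
A sketch of the ritual, with invented names. Widening a column in base_view means rebuilding its dependent too, even though the dependent's definition does not change:

DROP VIEW summary_view;  -- depends on base_view, so it must go first
DROP VIEW base_view;

CREATE VIEW base_view AS
SELECT id, amount::bigint AS amount
FROM payments;

CREATE VIEW summary_view AS  -- the old definition, repeated verbatim
SELECT SUM(amount) AS total
FROM base_view;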

In this project, editing views was different from editing the program. The source files for a program are a snapshot of the entire program—to change the program, change the snapshot, then reload the program. Editing the views, on the other hand, required applying changes through a diff (the migration). The snapshot could not be edited in place. Diffs naturally have a different language than snapshots, as they express how to modify something that already exists while maintaining integrity, whereas snapshots express the structure that will exist, ex nihilo.

I also lacked any sort of rigor in defining views near the beginning of the project. Some views unnecessarily included all of the underlying tables' fields. When the structure of those tables changed, the views had to change as well.

Queries in Python are painful. They're all strings, so they don't wrap like code, and can't bind directly with variables.

Paths to improvement

A migration should be named with only a number. Comments at the beginning of the file can provide explanatory text. The migration scripts would be augmented to parse these leading comments and display them to the user when requested.
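
Under this scheme, a file named simply 43.sql might begin like this (contents invented); the scripts would parse the leading comments and show them on request:

-- Widen ledger amounts to bigint.
-- Existing values convert in place; no backfill required.
ALTER TABLE ledger_entries ALTER COLUMN amount TYPE bigint;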

Split schema definition into two parts, tentatively called structure and world. Structure would be expressed in migration files, and include operations to alter domains, tables, table constraints, or to update existing data to accommodate changing structure or constraints. It would also include any data that is not site-specific (site-specific data being banned from migrations). World would be a snapshot description of everything else—views, functions, etc.

To perform a migration, the migration scripts would drop everything in world, apply any pending migration scripts, then reload everything in world. Consideration would have to be given to reloading world in the event that a migration fails. Dropping world would also hurt migrations that need existing views in order to update data.

The system should allow world to be split into multiple files, for some modicum of modularity. The system would have an option to discover a working load-order and save that order to a file, so that the modules may be reloaded quickly and without trial-and-error, for production migrations.

Views should be classified by type. I have discovered three basic types (each sketched below):

  1. An extension view, which extends a table with some extra, easy-to-calculate or easy-to-join fields. Such views seldom have aggregations, and joins are usually one-to-one. Extension views should include every field of the base table. Code will usually query an extension view instead of its underlying table, to get the extension fields in addition to the base fields. Extension views are named after the underlying table, with _ext appended.
  2. A processing view. These views may perform significant aggregation and joining over several tables, to provide some well-defined, narrowly-scoped, output. Processing views include only the applicable primary keys (or candidate keys) from the underlying tables, plus their own contributions. Processing views are named according to their purpose.
  3. A summary view. Summary views are based on a single table or single extension view, but join information from related processing views. They include most or all fields from all data sources. Summary views are named with an appended _summary.
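
Sketches of all three types, with invented tables and columns:

-- Extension view: every base field, plus cheap extras.
CREATE VIEW orders_ext AS
SELECT o.*, c.name AS customer_name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- Processing view: keys plus its own aggregated contribution.
CREATE VIEW order_revenue AS
SELECT order_id, SUM(quantity * price) AS revenue
FROM order_lines
GROUP BY order_id;

-- Summary view: a base or extension view joined to processing views.
CREATE VIEW orders_summary AS
SELECT e.*, r.revenue
FROM orders_ext e
LEFT JOIN order_revenue r ON r.order_id = e.id;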

This scheme maximally decouples processing views, being the most complex, from changes in underlying tables. Extension and summary views have simple definitions, often using * in their select lists, and so are less annoying to update to match participating tables.

I would also consider using SQLAlchemy's table-definition features, which would allow better code formatting and let IDEs comprehend table and field names and such. However, those definitions can't be the definitive source from which the database is built (you can't express migrations there; it's a snapshot language, not a diff language), so keeping them alongside the migrations would violate the DRY principle (don't repeat yourself).

Perceived shortcomings of style checkers (and a potential solution)

Style-checkers check code for certain statically-determinable properties. These properties may affect dynamic behavior (e.g. all references are definitely initialized before use), or they may be purely textual (e.g. code is indented in a particular way).

Teams use style-checkers to ensure some typical dynamic and textual standard in the code base. The idea is that common errors can be prevented, and that a common textual style will aid readability, especially between code from different authors.1

In reality, style-checking is far from ideal, especially regarding textual consistency. Some section of code may have notably worse readability if wrapped to the project’s standard of 80 columns. Some object allocation may not have a matching deällocation because it is live for the program’s entire run.

No matter the reason, exceptions to the rules always come up, and this is where style-checking becomes really difficult: how do you handle the exceptions? If you do nothing, exceptions will accumulate until the style-checker’s report is overwhelming. At best, the report will be ignored; at worst, it will become demoralizing to developers. In such a situation, style-checking may be worse than no style-checking at all.

To keep the report positively motivating, you must make a change somewhere:

  • You could alter the code to make it compliant with the style rules. This may directly compromise either the design integrity or actual readability of the code.
  • You could alter the style rules so that the case is no longer exceptional. But the new rule will probably be more complicated, have more edge cases, and still have other (and new) exceptions. Besides, which is better: writing rules, or writing code?
  • You could mark the exception in the code with some kind of meta-comment, so that the style-checker ignores it on future runs; but, this adds clutter to the code and runs counter to the goal of readability.
  • You could mark the exception somewhere else, in a database kept apart from the code. This keeps the code clean, and the developers in control, but the database would have to be intelligent enough to keep up with code motion as the project develops. It should also be version-controlled with the source that it annotates.

This last approach would yield a “perfect” solution: it would only ever mark newly-introduced problems in the code. A “conservative,” approximate solution (one that marks every new problem, and might mark a few old) is possible if we make the following assumptions:

  • Code is under version-control;
  • The style-checker is run before or as part of the process of committing code to the main branch; and,
  • Genuine problems, when reported, will actually be addressed.

Given these constraints, a style-checker could report only on errors associated with code features that are new or changed in the current version, relative to the latest main-line version. It would assume that any style exception already in the main line is there intentionally, and should not be reported again (it would have been reported when it was first committed), so it would not need to track exceptions in a separate database. I call this “differential style-checking.”

The code features that the style-checker tracks could be either lines of text, or nodes of parse-trees. The latter might be more accurate, but the former might be easier to implement and good enough in most cases.

Here is a proof-of-concept. It works on only one Python file at a time, and the output is messy, but it demonstrates the essential principles. The style-checker is pylint version 0.19.0, as packaged in Ubuntu 10.04, with which I had no experience prior to writing this.

#!/bin/bash

usage() {
    echo "usage: $0 trunk_dir proposed_dir file"
    exit 1
}


comment() {
    # $1 the file to comment
    # $2 the line number
    # $3 the comment to append
    head -n $(( $2 - 1 )) $1 >/tmp/tmp.py
    local target_line=$(head -n $2 $1 |tail -n1)
    echo "$target_line #{$3}" >>/tmp/tmp.py
    tail -n +$(( $2 + 1 )) $1 >>/tmp/tmp.py
    mv /tmp/tmp.py $1
}


process() {
    local type
    local line
    local msg
    local old_ifs="$IFS"
    IFS=":"
    while read type line msg
    do
        [ "$msg" ] && comment $1 $line "$type:$msg"
    done
    IFS="$old_ifs"
}


TRUNK=$1; shift
PROPOSED=$1; shift
FILE=$1; shift

[ -d "$TRUNK" -a -d "$PROPOSED" -a "$FILE" ] || usage
[ "$1" ] && usage

STRIPPED=${FILE%.py}
MODULE=${STRIPPED//\//.}

pushd $TRUNK >/dev/null
cp $FILE /tmp/left.py
pylint -rn $MODULE 2>/dev/null |process /tmp/left.py
popd >/dev/null

pushd $PROPOSED >/dev/null
cp $FILE /tmp/right.py
pylint -rn $MODULE 2>/dev/null |process /tmp/right.py
popd >/dev/null

diff -u /tmp/{left,right}.py |egrep '^[+@]'

For a sample run, I will use the Trac project, which I greatly respect. At revision 10114, the PostgreSQL backend received a change, which we will analyze.


cd /tmp
svn co -r10113 http://svn.edgewall.org/repos/trac/branches/0.12-stable 10113
svn co -r10114 http://svn.edgewall.org/repos/trac/branches/0.12-stable 10114
diffstyle 10113 10114 trac/db/postgres_backend.py

The (wrapped) output is:

+++ /tmp/right.py       2011-04-09 09:59:48.274189942 -0600
@@ -237,6 +237,11 @@
+    def update_sequence(self, cursor, table, column='id'): #{C:PostgreSQLConnection.update_sequence: Missing docstring} #{R:PostgreSQLConnection.update_sequence: Method could be a function}
+        cursor.execute("""
+            SELECT setval('%s_%s_seq', (SELECT MAX(id) FROM %s))
+            """ % (table, column, table))
+

We see that, according to pylint’s default checks, two new (potential) problems were introduced: the new method doesn’t have a docstring, and the new method could be a function outside of a class.2 Running pylint normally on postgres_backend.py revision 10114 produces 47 warnings. (The entire Trac package yields 10065.) Those counts come from the following commands:

pylint -rn trac.db.postgres_backend |egrep '^[A-Z]' |wc -l
pylint -rn trac |egrep '^[A-Z]' |wc -l
End notes
  1. I believe that style-checkers can effectively guard against some types of runtime errors, just as compilers can. I don’t believe that style-checkers can be an effective judge of textual style.
  2. In this case, the style-checker got it wrong. This must be a method, because it is invoked polymorphically; although, it need not be an instance method—it could be a class method.