Skip to content

An abstraction gone wrong

A given abstraction can be helpful in one situation and harmful in another. Often, we use abstractions out of habit without thinking critically about their benefit. Sometimes, an abstraction is harmful because it distances us from other features of the language, as I wrote in Abstractions.

Here, I give an example of such an abstraction, which happens to be very common in the domain of parsing, but which comes up in many other places. I also begin refactoring the code to use a more helpful abstraction.

The example

Let's tokenize a small language that consists of numbers and quoted strings. The strings may use "\" to escape characters. Here's an example, with an illegal "x" on the end:

123 "This is 1 string." 456 "A 2nd \"string\" with \\-escapes." x

The output we want looks something like this:

INT(123)
STRING(This is 1 string.)
INT(456)
STRING(A 2nd "string" with \-escapes.)
Unexpected: character 'x'

Suppose we wanted to use a finite state-machine. A very traditional approach might look something like this:

import java.io.FileReader;

class tok {

    class Unexpected extends Exception {
        Unexpected(char ch, int state) {
            super("Unexpected: character '" + ch + "' in state " + state);
        }
    }

    enum ChType {
        DIGIT, QUOTE, SPACE, BSLASH, OTHER
    }

    void tokenize() throws Exception {
        int state = 0;
        String image = "";
        FileReader reader = new FileReader("demo.txt");

        while (true) {
            int cp = reader.read();
            if (cp == -1)
                break;
            char ch = (char) cp;

            ChType type;
            if (Character.isDigit(ch)) {
                type = ChType.DIGIT;
            } else if (ch == '"') {
                type = ChType.QUOTE;
            } else if (ch == ' ' || ch == '\n') {
                type = ChType.SPACE;
            } else if (ch == '\\') {
                type = ChType.BSLASH;
            } else {
                type = ChType.OTHER;
            }

            switch (state) {
            case 0:
                switch (type) {
                case DIGIT:
                    image += ch;
                    state = 1;
                    continue;
                case QUOTE:
                    state = 2;
                    continue;
                case SPACE:
                    continue;
                }
                break;

            case 1:
                switch (type) {
                case DIGIT:
                    image += ch;
                    state = 1;
                    continue;
                case SPACE:
                    System.out.println("INT(" + image + ")");
                    image = "";
                    state = 0;
                    continue;
                }
                break;

            case 2:
                switch (type) {
                case QUOTE:
                    System.out.println("STRING(" + image + ")");
                    image = "";
                    state = 3;
                    continue;
                case BSLASH:
                    state = 4;
                    continue;
                default:
                    image += ch;
                    continue;
                }

            case 3:
                switch (type) {
                case SPACE:
                    state = 0;
                    continue;
                }
                break;

            case 4:
                image += ch;
                state = 2;
                continue;
            }

            throw new Unexpected(ch, state);
        }
    }

    public static void main(String[] args) throws Exception {
        new tok().tokenize();
    }
}

Some might not call this an abstract state machine because the transition rules aren't parameterized—they're right in the code. However, the state is represented by a plain integer the same as in a fully abstact state machine, and the transitions are implemented in a way that gives us very little language support; the language doesn't understand which states or transitions are legal. One small typo could introduce a bug in the machine, and the language wouldn't know anything about it.

I'll call this style semi-abstract.

We're going to refactor this code by using better abstractions from the language, so that the language can give us better support. In a weird twist, the state-machine will become more concrete as we introduce better abstractions, because the language will understand which states and transitions are legal and limit our code to only those.

The first refactor

The first step to is to use symbols instead of integers to represent states. This helps prevent some forms of typo.

Let's define symbols to alias the current integers:

int INITIAL = 0, IN_NUMBER = 1, IN_STRING = 2, AFTER_STRING = 3, ESCAPING = 4;

We'll substitute those symbols everywhere, and also switch to if-statements:

if (state == INITIAL) {
    switch (type) {
    case DIGIT:
        image += ch;
        state = IN_NUMBER;
        continue;
    case QUOTE:
        state = IN_STRING;
        continue;
    case SPACE:
        continue;
    }
} else if (state == IN_NUMBER) {
    // etc...

This is only a partial step, though. The language still permits us to use literal integers for states, so we could miss an instance.

To complete the step, we'll introduce an empty class, and use instances of that instead of integers:

class State {
}

State INITIAL = new State(), // etc...

void tokenize() throws Exception {
    State state = INITIAL;
    // etc...

We'll also update the constructor for the Unexpected class.

It's now not possible to typo a state name (unless we happen upon the name of another defined state).

In many languages, we could have used an enum for this as well. However, using instances of classes will allow us to do more interesting refactorings later on. I'll write about those in future posts.

The code so far

import java.io.FileReader;

class tok {

    class Unexpected extends Exception {
        Unexpected(char ch, State state) {
            super("Unexpected: character '" + ch + "' in state " + state);
        }
    }

    enum ChType {
        DIGIT, QUOTE, SPACE, BSLASH, OTHER
    }

    class State {
    }

    State INITIAL = new State(),
        IN_NUMBER = new State(),
        IN_STRING = new State(),
        AFTER_STRING = new State(),
        ESCAPING = new State();

    void tokenize() throws Exception {
        State state = INITIAL;
        String image = "";
        FileReader reader = new FileReader("demo.txt");

        while (true) {
            int cp = reader.read();
            if (cp == -1)
                break;
            char ch = (char) cp;

            ChType type;
            if (Character.isDigit(ch)) {
                type = ChType.DIGIT;
            } else if (ch == '"') {
                type = ChType.QUOTE;
            } else if (ch == ' ' || ch == '\n') {
                type = ChType.SPACE;
            } else if (ch == '\\') {
                type = ChType.BSLASH;
            } else {
                type = ChType.OTHER;
            }

            if (state == INITIAL) {
                switch (type) {
                case DIGIT:
                    image += ch;
                    state = IN_NUMBER;
                    continue;
                case QUOTE:
                    state = IN_STRING;
                    continue;
                case SPACE:
                    continue;
                }
            } else if (state == IN_NUMBER) {
                switch (type) {
                case DIGIT:
                    image += ch;
                    state = IN_NUMBER;
                    continue;
                case SPACE:
                    System.out.println("INT(" + image + ")");
                    image = "";
                    state = INITIAL;
                    continue;
                }
            } else if (state == IN_STRING) {
                switch (type) {
                case QUOTE:
                    System.out.println("STRING(" + image + ")");
                    image = "";
                    state = AFTER_STRING;
                    continue;
                case BSLASH:
                    state = ESCAPING;
                    continue;
                default:
                    image += ch;
                    continue;
                }
            } else if (state == AFTER_STRING) {
                switch (type) {
                case SPACE:
                    state = INITIAL;
                    continue;
                }
            } else if (state == ESCAPING) {
                image += ch;
                state = IN_STRING;
                continue;
            }

            throw new Unexpected(ch, state);
        }
    }

    public static void main(String[] args) throws Exception {
        new tok().tokenize();
    }
}

Trackbacks

No Trackbacks

Comments

Display comments as Linear | Threaded

No comments

Add Comment

E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
To leave a comment you must approve it via e-mail, which will be sent to your address after submission.
Enclosing asterisks marks text as bold (*word*), underscore are made via _word_.
Form options