Concept and Design
Thinc is built on a fairly simple conceptual model that’s a little bit different from other neural network libraries. On this page, we build up the library from first principles, so you can see how everything fits together. This page assumes some conceptual familiarity with backpropagation, but you should be able to follow along even if you’re hazy on some of the details.
The model composition problem
The central problem for a neural network implementation is this: during the forward pass, you compute results that will later be useful during the backward pass. How do you keep track of this arbitrary state, while making sure that layers can be cleanly composed?
Instead of starting with the problem directly, let’s start with a simple and obvious approach, so that we can run into the problem more naturally. The most obvious idea is that we have some thing called a model, and this thing holds some parameters (“weights”) and has a method to predict from some inputs to some outputs using the current weights. So far so good. But we also need a way to update the weights. The most obvious API for this is to add an update method, which will take a batch of inputs and a batch of correct labels, and compute the weight update.
class UncomposableModel:
    def __init__(self, W):
        self.W = W

    def predict(self, inputs):
        return inputs @ self.W.T

    def update(self, inputs, targets, learn_rate=0.001):
        guesses = self.predict(inputs)
        d_guesses = (guesses - targets) / targets.shape[0]  # gradient of loss w.r.t. output
        # The @ is newish Python syntax for matrix multiplication
        d_inputs = d_guesses @ self.W
        dW = d_guesses.T @ inputs  # gradient of parameters
        self.W -= learn_rate * dW  # update weights
        return d_inputs
This API design works in itself, but the update() method only works as the outer-level API. You wouldn’t be able to put another layer with the same API after this one and backpropagate through both of them. Let’s look at the steps for backpropagating through two matrix multiplications:
def backprop_two_layers(W1, W2, inputs, targets):
    hiddens = inputs @ W1.T
    guesses = hiddens @ W2.T
    d_guesses = (guesses - targets) / targets.shape[0]  # gradient of loss w.r.t. output
    dW2 = d_guesses.T @ hiddens   # gradient of second layer's parameters
    d_hiddens = d_guesses @ W2    # gradient of first layer's output
    dW1 = d_hiddens.T @ inputs    # gradient of first layer's parameters
    d_inputs = d_hiddens @ W1     # gradient of the input
    return dW1, dW2, d_inputs
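As a quick sanity check, every gradient should have the same shape as the value it corresponds to. Here’s a small sketch, assuming a uniform(shape) helper that returns a random array of the given shape (the same convention used in the examples below); the sizes are made up:

W1 = uniform((8, 4))        # (n_hidden, n_in)
W2 = uniform((2, 8))        # (n_out, n_hidden)
inputs = uniform((16, 4))   # (batch_size, n_in)
targets = uniform((16, 2))  # (batch_size, n_out)
dW1, dW2, d_inputs = backprop_two_layers(W1, W2, inputs, targets)
assert dW1.shape == W1.shape
assert dW2.shape == W2.shape
assert d_inputs.shape == inputs.shape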
In order to update the first layer, we need to know the gradient with respect to its output. We can’t calculate that value until we’ve finished the full forward pass, calculated the gradient of the loss, and then backpropagated through the second layer. This is why the UncomposableModel is uncomposable: the update method expects the input and the target to both be available. That only works for the outermost API – the same API can’t work for intermediate layers.
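To make the problem concrete, here’s a hypothetical attempt to stack two UncomposableModel instances (a sketch, not working code):

layer1 = UncomposableModel(W1)
layer2 = UncomposableModel(W2)
hiddens = layer1.predict(inputs)
# layer2 can be updated, because it sees both its inputs and the targets:
d_hiddens = layer2.update(hiddens, targets)
# But now we're stuck: layer1 needs the gradient d_hiddens to update itself,
# while its API only accepts (inputs, targets) – and there are no "targets"
# for the hidden layer.
# layer1.update(inputs, ???)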
Although nobody thinks of it this way, reverse-mode auto-differentiation (as supported by PyTorch, TensorFlow, etc.) can be seen as a solution to this API problem. The solution is to base the API around the predict method, which doesn’t have the same composition problem: there’s no problem with writing model3.predict(model2.predict(model1.predict(X))), or model3.predict(model2.predict(X) + model1.predict(X)), etc. We can easily build a larger model from smaller functions when we’re programming the forward computations, and that’s exactly the API that reverse-mode auto-differentiation was invented to offer.
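For instance, in PyTorch you only write the forward composition, and the backward pass is derived from the recorded graph (a minimal sketch, assuming torch is installed):

import torch

model1 = torch.nn.Linear(10, 8)
model2 = torch.nn.Linear(8, 6)
X = torch.randn(2, 10)
Y = model2(model1(X))  # composing forward calls is easy
Y.sum().backward()     # gradients for both layers are computed automatically
assert model1.weight.grad is not None and model2.weight.grad is not None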
The key idea behind Thinc is that it’s possible to just fix the API problem directly, so that models can be composed cleanly both forwards and backwards. This results in a noticeably different developer experience: the code is far more explicit, and there are very few framework details to keep in mind. You potentially gain flexibility, but you also potentially lose some performance, and there are sometimes more opportunities to make mistakes.
We don’t want to suggest that Thinc’s approach is uniformly better than a high-performance computational graph engine such as PyTorch or TensorFlow. It isn’t. The trick is to use them together: you can use PyTorch, TensorFlow or some other library to do almost all of the actual computation, while doing almost all of your programming with a much simpler, more transparent and more flexible system. Here’s how it works.
No (explicit) computational graph – just higher-order functions
The API design problem we’re facing here is actually pretty basic. We’re trying to compute two values, but before we can compute the second one, we need to pass control back to the caller, so they can use the first value to give us an extra input. The general solution to this type of problem is a callback, and in fact a callback is exactly what we need here.
Specifically, we need our model functions to return two things: the result, and a callback that takes the gradient of the outputs and computes the corresponding gradient of the inputs.
def forward(X: InT) -> Tuple[OutT, Callable[[OutT], InT]]:
    Y: OutT = _do_whatever_computation(X)

    def backward(dY: OutT) -> InT:
        dX: InT = _do_whatever_backprop(dY, X)
        return dX

    return Y, backward
To make this less abstract, here are two layers following this signature. For now, we’ll stick to layers that don’t introduce any trainable weights, to keep things simple.
reduce_sum layer

def reduce_sum(X: Floats3d) -> Tuple[Floats2d, Callable[[Floats2d], Floats3d]]:
    Y = X.sum(axis=1)
    X_shape = X.shape

    def backprop_reduce_sum(dY: Floats2d) -> Floats3d:
        dX = zeros(X_shape)
        dX += dY.reshape((dY.shape[0], 1, dY.shape[1]))
        return dX

    return Y, backprop_reduce_sum
Relu layer

def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
    mask = inputs >= 0

    def backprop_relu(d_outputs: Floats2d) -> Floats2d:
        return d_outputs * mask

    return inputs * mask, backprop_relu
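Calling one of these layers gives you back the output and the callback, which you hold on to until you have a gradient to feed back in. A quick usage sketch, again assuming a uniform(shape) helper as used in the examples below:

X = uniform((4, 8))
Y, backprop_relu = relu(X)
dY = uniform((4, 8))   # pretend gradient of some loss w.r.t. Y
dX = backprop_relu(dY)
assert dX.shape == X.shape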
Notice that the reduce_sum layer’s output is a different shape from its input. The forward pass runs from input to output, while the backward pass runs from gradient-of-output to gradient-of-input. This means that we’ll always have two matching pairs: (input_to_forward, output_of_backprop) and (output_of_forward, input_of_backprop). These pairs must match in type. If our functions obey this invariant, we’ll be able to write combinator functions that can wire together layers in standard ways.
The most basic way we’ll want to combine layers is a feed-forward relationship. We call this combinator chain, after the chain rule:
Chain combinator

def chain(layer1, layer2):
    def forward_chain(X):
        Y, get_dX = layer1(X)
        Z, get_dY = layer2(Y)

        def backprop_chain(dZ):
            dY = get_dY(dZ)
            dX = get_dX(dY)
            return dX

        return Z, backprop_chain

    return forward_chain
We can use the chain combinator to build a function that runs our reduce_sum and relu layers in succession:
chained = chain(reduce_sum, relu)
X = uniform((2, 10, 6)) # (batch_size, sequence_length, width)
dZ = uniform((2, 6)) # (batch_size, width)
Z, get_dX = chained(X)
dX = get_dX(dZ)
assert dX.shape == X.shape
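Because the chained function has exactly the same signature as a single layer, the output of chain can itself be chained further – composition works at every level. A small sketch continuing the example above:

deep = chain(chain(reduce_sum, relu), relu)
Z, get_dX = deep(X)           # X is the (2, 10, 6) array from above
dX = get_dX(uniform((2, 6)))
assert dX.shape == X.shape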
Our chain combinator works easily because our layers return callbacks. The callbacks ensure that there is no distinction in API between the outermost layer and a layer that’s part of a larger network. We can see this clearly by imagining the alternative, where the function expects the gradient with respect to the output along with its input:
Problem without callbacks

def reduce_sum_no_callback(X, dY):
    Y = X.sum(axis=1)
    X_shape = X.shape
    dX = zeros(X_shape)
    dX += dY.reshape((dY.shape[0], 1, dY.shape[1]))
    return Y, dX


def relu_no_callback(inputs, d_outputs):
    mask = inputs >= 0
    outputs = inputs * mask
    d_inputs = d_outputs * mask
    return outputs, d_inputs


def chain_no_callback(layer1, layer2):
    def chain_forward_no_callback(X, dZ):
        # How do we call layer1? We can't, because its signature expects dY
        # as part of its input – but we don't know dY yet! We can only
        # compute dY once we have Y. That's why layers must return callbacks.
        raise CannotBeImplementedError()
The reduce_sum and relu layers are easy to work with, because they don’t introduce any parameters. But networks that don’t have any parameters aren’t very useful. So how should we handle the parameters? We can’t just say that parameters are just another type of input variable, because that’s not how we want to use the network. We want the parameters of a layer to be an internal detail – we don’t want to have to pass in the parameters on each input.
Parameters need to be handled differently from input variables, because we want to specify them at different times. We’d like to specify the parameters once when we create the function, and then have them be an internal detail that doesn’t affect the function’s signature. The most direct approach is to introduce another layer of closures, and make the parameters and their gradients arguments to the outer layer. The gradients can then be incremented during the backward pass:
def Linear(W, b, dW, db):
    def forward_linear(X):
        def backward_linear(dY):
            # Accumulate into the gradient arrays passed in from outside
            nonlocal dW, db
            dW += dY.T @ X
            db += dY.sum(axis=0)
            return dY @ W

        return X @ W.T + b, backward_linear

    return forward_linear
n_batch = 128
n_in = 16
n_out = 32
W = uniform((n_out, n_in))
b = uniform((n_out,))
dW = zeros(W.shape)
db = zeros(b.shape)
X = uniform((n_batch, n_in))
Y_true = uniform((n_batch, n_out))
linear = Linear(W, b, dW, db)
Y_out, get_dX = linear(X)
# Now we could calculate a loss and backpropagate
dY = (Y_out - Y_true) / Y_true.shape[0]
dX = get_dX(dY)
# Now we could do an optimization step like
W -= 0.001 * dW
b -= 0.001 * db
dW.fill(0.0)
db.fill(0.0)
While the above approach would work, handling the parameters and their gradients explicitly will quickly get unmanageable. To make things easier, we need to introduce a Model class, so that we can keep track of the parameters, gradients, dimensions and other attributes that each layer might require.
The most obvious thing to do at this point would be to introduce one class per layer type, with the forward pass implemented as a method on the class. While this approach would work reasonably well, we’ve preferred a slightly different implementation that relies on composition rather than inheritance. The implementation of the Linear layer provides a good example.
Instead of defining a subclass of thinc.model.Model, the layer provides a function Linear that constructs a Model instance, passing in the function forward defined in thinc.layers.linear:
def forward(model: Model, X: InT, is_train: bool) -> Tuple[OutT, Callable]:
The function receives a Model instance as its first argument, which gives you access to the dimensions, parameters, gradients, attributes and layers. The second argument is the input data, and the third argument is a boolean that lets layers run differently during training and prediction – an important requirement for layers like dropout and batch normalization.
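For example, a forward function for a linear layer following this signature might look roughly like the sketch below. This is a simplified version in the spirit of thinc.layers.linear, not the library’s exact implementation; get_param and inc_grad are methods on Thinc’s Model:

def forward(model: Model, X: Floats2d, is_train: bool):
    W = model.get_param("W")
    b = model.get_param("b")
    Y = X @ W.T + b

    def backprop(dY: Floats2d):
        model.inc_grad("W", dY.T @ X)       # accumulate parameter gradients
        model.inc_grad("b", dY.sum(axis=0))
        return dY @ W                        # gradient of the input

    return Y, backprop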
As well as the forward function, the Model also lets you pass in a function init, allowing us to support shape inference.
Linear

model = Model(
    "linear",
    forward,
    init=init,
    dims={"nO": nO, "nI": nI},
    params={"W": None, "b": None},
)
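For reference, here’s roughly how such a layer is constructed and used through the public API (a small usage sketch; Linear, initialize and the (X, is_train) call signature are all part of Thinc’s API):

from thinc.api import Linear
import numpy

model = Linear(nO=4, nI=2)
X = numpy.zeros((3, 2), dtype="f")
model.initialize(X=X)                    # allocates W and b from the dimensions
Y, backprop = model(X, is_train=True)
dX = backprop(numpy.ones(Y.shape, dtype="f"))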
We want to be able to define complex networks concisely, passing in only genuine configuration: we shouldn’t have to pass in a lot of variables whose values are dictated by the rest of the network. The more redundant the configuration, the more ways the values we pass in can be invalid. In the example above, there are many different ways for the inputs to Linear to be invalid: the W and dW variables could be different shapes, the size of b could fail to match the first dimension of W, the second dimension of W could fail to match the second dimension of the input, etc. With inputs like these, there’s no way we can expect functions to validate their inputs reliably, leading to unpredictable logic errors that make the calling code difficult to debug.
In a network with two Linear layers, only one dimension is an actual hyperparameter. The input size to the first layer and the output size of the second layer are both determined by the shape of the data. The only choice to make is the number of “hidden units”, which will determine the output size of the first layer and the input size of the second layer. So we want to be able to write something like this:
model = chain(Linear(nO=n_hidden), Linear())
… and have the missing dimensions inferred later, based on the input and output data. In order to make this work, we need to specify initialization logic for each layer we define. For example, here’s the initialization logic for the Linear and chain layers:
Initialization logic

from typing import Optional
from thinc.api import Model, glorot_uniform_init
from thinc.types import Floats2d
from thinc.util import get_width


def init(model: Model, X: Optional[Floats2d] = None, Y: Optional[Floats2d] = None) -> None:
    # Infer missing dimensions from the sample data, if available
    if X is not None:
        model.set_dim("nI", get_width(X))
    if Y is not None:
        model.set_dim("nO", get_width(Y))
    # Allocate and initialize the parameters
    W = model.ops.alloc2f(model.get_dim("nO"), model.get_dim("nI"))
    b = model.ops.alloc1f(model.get_dim("nO"))
    W = glorot_uniform_init(model.ops, W.shape)
    model.set_param("W", W)
    model.set_param("b", b)