Types & Dataclasses
Type annotations, data structures and more

Types: Custom type annotations for input/output types available in Thinc.
Dataclasses: Data structures for efficient processing, especially for sequence data.

Types

Floats1d, Floats2d, Floats3d, Floats4d, FloatsXd: 1d, 2d, 3d, 4d and any-d arrays of floats (DTypesFloat).
Ints1d, Ints2d, Ints3d, Ints4d, IntsXd: 1d, 2d, 3d, 4d and any-d arrays of ints (DTypesInt).
Array1d, Array2d, Array3d, Array4d, ArrayXd: 1d, 2d, 3d, 4d and any-d arrays of floats or ints.
List1d, List2d, List3d, List4d, ListXd: Lists of 1d, 2d, 3d, 4d and any-d arrays (with same-type elements).
DTypesFloat: Float data types: "f" or "float32".
DTypesInt: Integer data types: "i", "int32", "int64", "uint32", "uint64".
DTypes: Union of DTypesFloat and DTypesInt.
Shape: An array shape. Equivalent to Tuple[int, ...].
Xp: numpy on CPU or cupy on GPU. Equivalent to Union[numpy, cupy].
Generator: Custom type for generators / iterators for better config validation.
Batchable: Union[Pairs, Ragged, Padded, ArrayXd, List, Tuple].

Dataclasses

A dataclass is a lightweight data structure, similar in spirit to a named tuple, defined using the @dataclass decorator introduced in Python 3.7 (and backported to 3.6). Thinc uses dataclasses for many situations that would otherwise be written with nested Python containers. Dataclasses work better with the type system, and often result in code that’s easier to read and test.
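
To make the comparison with nested containers concrete, here is a minimal sketch that writes the same batch once as a nested tuple and once as a dataclass. The `Batch` class and its field names are hypothetical and are not part of Thinc.

```python
from dataclasses import dataclass
from typing import List

# Nested containers: the meaning of each position is implicit.
batch_tuple = ([[1, 2, 3], [4, 5]], [3, 2])


@dataclass
class Batch:
    """Hypothetical example type, not a Thinc class."""

    sequences: List[List[int]]
    lengths: List[int]


batch = Batch(sequences=[[1, 2, 3], [4, 5]], lengths=[3, 2])

# Fields are named and typed, and equality is generated automatically.
assert batch.lengths == batch_tuple[1]
assert batch == Batch([[1, 2, 3], [4, 5]], [3, 2])
```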

Ragged dataclass

A batch of concatenated sequences that vary in the size of their first dimension. Ragged allows variable-length sequence data to be contiguous in memory, without padding. Indexing into Ragged works like indexing into the lengths array, except that it returns a Ragged object with the accompanying sequence data. For instance, you can write ragged[1:4] to get a Ragged object with sequences 1, 2 and 3. Internally, the input data is reshaped into a two-dimensional array so that routines can operate on it consistently. The original data shape is stored, and the data in its original shape is accessible via the dataXd property.

| Member | Type | Description |
| --- | --- | --- |
| data | Array2d | The data array. |
| dataXd | ArrayXd | The data array with the original shape. |
| data_shape | Shape | The original data shape, with -1 for the first dimension. |
| lengths | Ints1d | The sequence lengths. |
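
The layout described above can be sketched in plain Python. `RaggedSketch` is an illustrative stand-in written for this example, not `thinc.types.Ragged`: it uses nested lists instead of arrays and supports only slice indexing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RaggedSketch:
    """Illustrative stand-in, not thinc.types.Ragged."""

    data: List[List[float]]  # all rows of all sequences, concatenated
    lengths: List[int]  # number of rows in each sequence

    def __getitem__(self, index: slice) -> "RaggedSketch":
        # Slice the lengths, then take the rows that belong to those sequences.
        starts = [sum(self.lengths[:i]) for i in range(len(self.lengths))]
        picked = range(len(self.lengths))[index]
        rows: List[List[float]] = []
        for i in picked:
            rows.extend(self.data[starts[i] : starts[i] + self.lengths[i]])
        return RaggedSketch(rows, [self.lengths[i] for i in picked])


# Five sequences with lengths 2, 1, 3, 2 and 1 (9 rows in total).
rows = [[float(i)] for i in range(9)]
ragged = RaggedSketch(rows, [2, 1, 3, 2, 1])

sub = ragged[1:4]  # sequences 1, 2 and 3
assert sub.lengths == [1, 3, 2]
assert sub.data == [[2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
```

The slice is applied to the lengths first, and the matching rows are then copied out of the contiguous data, so no padding is ever stored.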

Padded dataclass

A batch of padded sequences, sorted by decreasing length. The auxiliary array size_at_t indicates the length of the batch at each timestep, so you can do data[:, :size_at_t[t]] to shrink the batch. For instance, let’s say you have a batch of four documents, of lengths [6, 5, 2, 1]. The size_at_t will be [4, 3, 2, 2, 2, 1], because entry t counts the sequences that are longer than t. The lengths array indicates the length of each row, and the indices array indicates the original ordering.

| Member | Type | Description |
| --- | --- | --- |
| data | Floats3d | A three-dimensional array, sorted by decreasing sequence length. The dimensions are timestep, batch item, row data. |
| size_at_t | Ints1d | An array indicating how the batch can be truncated at different sequence lengths. You can do data[:, :size_at_t[t]] to get an unpadded batch. |
| lengths | Ints1d | The sequence lengths. Applies to the reordered sequences, not the original ordering, so the lengths are decreasing. |
| indices | Ints1d | An array of indices indicating how to put the items back into their original order. |
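
The relationship between lengths, indices and size_at_t can be worked through with a small sketch. The helper `padded_layout` is hypothetical and only computes the bookkeeping arrays, not the padded data. It assumes that `indices[i]` holds the original position of the i-th sorted item, which is a convention chosen for this example.

```python
from typing import List, Tuple


def padded_layout(lengths: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Return (sorted_lengths, indices, size_at_t) for a batch of lengths."""
    # Original positions of the items once sorted by decreasing length.
    indices = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    sorted_lengths = [lengths[i] for i in indices]
    # At timestep t, only sequences longer than t are still active.
    max_len = sorted_lengths[0] if sorted_lengths else 0
    size_at_t = [sum(1 for n in sorted_lengths if n > t) for t in range(max_len)]
    return sorted_lengths, indices, size_at_t


sorted_lengths, indices, size_at_t = padded_layout([2, 6, 1, 5])
assert sorted_lengths == [6, 5, 2, 1]
assert indices == [1, 3, 0, 2]
assert size_at_t == [4, 3, 2, 2, 2, 1]
# Every non-padding position is counted exactly once.
assert sum(size_at_t) == sum(sorted_lengths)
```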

Pairs dataclass

A batch of paired data, for instance images and their captions, or pairs of texts to compare. Indexing operations are performed as though the data were transposed to make the batch the outer dimension. For instance, pairs[:3] will return Pairs(pairs.one[:3], pairs.two[:3]), i.e. a slice of the batch with the first three items, as a new Pairs object.

Example:

    from thinc.types import Pairs

    pairs = Pairs([1, 2, 3, 4], [5, 6, 7, 8])
    assert pairs.one == [1, 2, 3, 4]
    assert pairs[2] == Pairs(3, 7)
    assert pairs[2:4] == Pairs([3, 4], [7, 8])

| Member | Type | Description |
| --- | --- | --- |
| one | Sequence | The first sequence. |
| two | Sequence | The second sequence. |

SizedGenerator dataclass

A custom dataclass for a generator that has a __len__ and can repeatedly call the generator function. This is especially useful for batching (see Ops.minibatch) where you know the length of the data upfront, but still want to batch it as a stream and return a generator. Exposing a __len__ method also makes it work seamlessly with progress bars like tqdm and similar tools.

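Example:

    from tqdm import tqdm
    from thinc.types import SizedGenerator

    # Assumes model, train_X and train_Y are already defined.
    train_data = model.ops.multibatch(128, train_X, train_Y, shuffle=True)
    assert isinstance(train_data, SizedGenerator)
    for i in range(10):
        for X, Y in tqdm(train_data, leave=False):
            Yh, backprop = model.begin_update(X)
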
| Member | Type | Description |
| --- | --- | --- |
| get_items | Callable[[], Generator] | The generator function. Available via the __iter__ method. |
| length | int | The length of the data. Available via the __len__ method. |
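
The idea of a generator with a known length that can be iterated more than once can be sketched as follows. `SizedGeneratorSketch` and `make_batches` are illustrative stand-ins, not Thinc's implementation. In this sketch the length counts batches, which is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class SizedGeneratorSketch:
    """Illustrative stand-in, not thinc.types.SizedGenerator."""

    get_items: Callable[[], Iterator[List[int]]]
    length: int

    def __len__(self) -> int:
        return self.length

    def __iter__(self) -> Iterator[List[int]]:
        # Call the generator function again on every pass.
        yield from self.get_items()


def make_batches(data: List[int], size: int) -> SizedGeneratorSketch:
    """Hypothetical helper that batches a list lazily."""

    def get_items() -> Iterator[List[int]]:
        for start in range(0, len(data), size):
            yield data[start : start + size]

    n_batches = (len(data) + size - 1) // size
    return SizedGeneratorSketch(get_items, n_batches)


batches = make_batches(list(range(5)), size=2)
assert len(batches) == 3
first_pass = list(batches)
second_pass = list(batches)
assert first_pass == second_pass == [[0, 1], [2, 3], [4]]
```

Because `__iter__` calls `get_items` afresh each time, the same object can be reused across epochs, which a plain generator would not allow.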

ArgsKwargs dataclass

A tuple of (args, kwargs) that can be spread into some function f: f(*args, **kwargs). Makes it easier to handle positional and keyword arguments that get passed around, especially for integrating custom models via a Shim.

| Member | Type | Description |
| --- | --- | --- |
| args | Tuple[Any, ...] | The positional arguments. Can be passed into a function as *ArgsKwargs.args. |
| kwargs | Dict[str, Any] | The keyword arguments. Can be passed into a function as **ArgsKwargs.kwargs. |
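
Spreading the two members into a call can be illustrated with a minimal sketch. `ArgsKwargsSketch` is a stand-in with the same two fields, not `thinc.types.ArgsKwargs`, and `scale` is a hypothetical function standing in for a wrapped model's forward call.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class ArgsKwargsSketch:
    """Illustrative stand-in, not thinc.types.ArgsKwargs."""

    args: Tuple[Any, ...]
    kwargs: Dict[str, Any]


def scale(x: float, factor: float = 1.0, offset: float = 0.0) -> float:
    # Hypothetical function used only to show the call pattern.
    return x * factor + offset


ak = ArgsKwargsSketch(args=(2.0,), kwargs={"factor": 3.0, "offset": 1.0})
result = scale(*ak.args, **ak.kwargs)
assert result == 7.0
```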

ArgsKwargs.from_items classmethod

Create an ArgsKwargs object from a sequence of (key, value) tuples, such as produced by ArgsKwargs.items. Each key should be either a string or an integer. Items with integer keys are added to the args, and items with string keys are added to the kwargs. The args are determined by sequence order, not the value of the integer.

Example:

    from thinc.types import ArgsKwargs

    items = [(0, "value"), ("key", "other value"), (1, 15), ("foo", True)]
    ak = ArgsKwargs.from_items(items)
    assert ak.args == ("value", 15)
    assert ak.kwargs == {"key": "other value", "foo": True}

| Argument | Type | Description |
| --- | --- | --- |
| items | Sequence[Tuple[Union[int, str], Any]] | The items. |
| RETURNS | ArgsKwargs | The ArgsKwargs dataclass. |

ArgsKwargs.keys method

Yield indices from ArgsKwargs.args, followed by keys from ArgsKwargs.kwargs.

| Argument | Type | Description |
| --- | --- | --- |
| YIELDS | Union[int, str] | The keys, args followed by kwargs. |

ArgsKwargs.values method

Yield values from ArgsKwargs.args, followed by values from ArgsKwargs.kwargs.

| Argument | Type | Description |
| --- | --- | --- |
| YIELDS | Any | The values, args followed by kwargs. |

ArgsKwargs.items method

Yield enumerate(ArgsKwargs.args), followed by ArgsKwargs.kwargs.items().

| Argument | Type | Description |
| --- | --- | --- |
| YIELDS | Tuple[Union[int, str], Any] | The (key, value) pairs, args followed by kwargs. |
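
The ordering rules for keys, values and items, and the way from_items reverses them, can be sketched with two plain functions. `iter_items` and `split_items` are hypothetical helpers written for this example. They mirror the behavior described above but are not Thinc's code.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple, Union

Key = Union[int, str]


def iter_items(args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> Iterator[Tuple[Key, Any]]:
    # Positional arguments first, keyed by index, then keyword arguments.
    yield from enumerate(args)
    yield from kwargs.items()


def split_items(items: Sequence[Tuple[Key, Any]]) -> Tuple[Tuple[Any, ...], Dict[str, Any]]:
    # Integer keys go to args in sequence order; string keys go to kwargs.
    args = tuple(value for key, value in items if isinstance(key, int))
    kwargs = {key: value for key, value in items if isinstance(key, str)}
    return args, kwargs


args, kwargs = ("value", 15), {"key": "other value", "foo": True}
items = list(iter_items(args, kwargs))
assert items == [(0, "value"), (1, 15), ("key", "other value"), ("foo", True)]
assert [k for k, _ in items] == [0, 1, "key", "foo"]  # keys
assert [v for _, v in items] == ["value", 15, "other value", True]  # values
assert split_items(items) == (args, kwargs)  # round trip
```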