A dataclass is a lightweight data structure, similar in spirit to a named
tuple, defined using the @dataclass decorator introduced in Python 3.7 (and
backported to 3.6). Thinc uses
dataclasses for many situations that would otherwise be written with nested
Python containers. Dataclasses work better with the type system, and often
result in code that’s easier to read and test.
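For instance, a minimal dataclass looks like this (the class and field names are illustrative, not part of Thinc):

```python
from dataclasses import dataclass

@dataclass
class SpanLabel:
    # A hypothetical annotation: start/end offsets plus a label.
    start: int
    end: int
    label: str

span = SpanLabel(start=0, end=5, label="PERSON")
print(span)  # SpanLabel(start=0, end=5, label='PERSON')
```

The decorator generates `__init__`, `__repr__` and `__eq__` automatically from the annotated fields, which is what makes dataclasses so convenient for typed container-like objects.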
A batch of concatenated sequences that vary in the size of their first
dimension. Ragged allows variable-length sequence data to be contiguous in
memory, without padding. Indexing into Ragged is just like indexing into the
lengths array, except it returns a Ragged object with the accompanying
sequence data. For instance, you can write ragged[1:4] to get a Ragged
object with sequences 1, 2 and 3. Internally, the input data is reshaped
into a two-dimensional array, to allow routines to operate on it consistently.
The original data shape is stored, and the reshaped data is accessible via the
dataXd property.
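A simplified sketch of the idea in plain NumPy (an illustrative stand-in, not Thinc's actual Ragged class) might look like:

```python
from dataclasses import dataclass
import numpy

@dataclass
class MiniRagged:
    # Simplified stand-in for thinc.types.Ragged: all sequence rows are
    # stored contiguously in one 2d array, with per-sequence lengths.
    data: numpy.ndarray     # shape (sum(lengths), width)
    lengths: numpy.ndarray  # shape (n_sequences,)

    def __getitem__(self, index: slice) -> "MiniRagged":
        # Index like the lengths array, but carry the matching rows along.
        # Only simple forward slices are handled in this sketch.
        starts = numpy.concatenate(([0], numpy.cumsum(self.lengths)))
        sub_lengths = self.lengths[index]
        start = starts[index.start or 0]
        end = start + sub_lengths.sum()
        return MiniRagged(self.data[start:end], sub_lengths)

lengths = numpy.array([2, 3, 1])
data = numpy.arange(6 * 4).reshape(6, 4)  # 6 rows total, width 4
ragged = MiniRagged(data, lengths)
middle = ragged[1:3]  # sequences 1 and 2: 3 + 1 = 4 rows
```

Slicing returns a new object whose `lengths` is the slice of the original lengths, with the corresponding contiguous rows of `data`.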
A batch of padded sequences, sorted by decreasing length. The auxiliary array
size_at_t indicates the length of the batch at each timestep, so you can do
data[:, :size_at_t[t]] to shrink the batch. For instance, let’s say you have a
batch of four documents, of lengths [6, 5, 2, 1]. The size_at_t will be
[4, 3, 2, 2, 2, 1]. The lengths array indicates the length of each row, and
the indices array indicates the original ordering.
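The relationship between the lengths and size_at_t can be sketched as follows (an illustrative helper, not Thinc's implementation): at timestep t, the batch contains every sequence longer than t.

```python
import numpy

def size_at_t(lengths: numpy.ndarray) -> numpy.ndarray:
    # With lengths sorted in decreasing order, the batch size at
    # timestep t is simply the number of sequences longer than t.
    max_len = int(lengths.max())
    return numpy.array([(lengths > t).sum() for t in range(max_len)])

print(size_at_t(numpy.array([6, 5, 2, 1])))  # [4 3 2 2 2 1]
```

As a sanity check, the entries of size_at_t sum to the total number of timesteps across the batch: 6 + 5 + 2 + 1 = 14.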
A batch of paired data, for instance images and their captions, or pairs of
texts to compare. Indexing operations are performed as though the data were
transposed to make the batch the outer dimension. For instance, pairs[:3] will
return Pairs(pairs.one[:3], pairs.two[:3]), i.e. a slice of the batch with the
first three items, as a new Pairs object.
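A minimal sketch of this lockstep indexing (an illustrative stand-in, not Thinc's Pairs class):

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

One = TypeVar("One")
Two = TypeVar("Two")

@dataclass
class MiniPairs(Generic[One, Two]):
    # Simplified stand-in for thinc.types.Pairs: indexing slices both
    # halves in lockstep, so the batch behaves as the outer dimension.
    one: One
    two: Two

    def __getitem__(self, index):
        return MiniPairs(self.one[index], self.two[index])

pairs = MiniPairs(["img0", "img1", "img2", "img3"],
                  ["cap0", "cap1", "cap2", "cap3"])
first_three = pairs[:3]
```

Here `first_three` holds the first three items of both `one` and `two`, as a new pairs object.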
A custom dataclass for a generator that has a __len__ and can repeatedly call
the generator function. This is especially useful for batching (see
Ops.minibatch) where you know the length of
the data upfront, but still want to batch it as a stream and return a generator.
Exposing a __len__ also makes it work seamlessly with progress bars like tqdm
and similar tools.
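A simplified sketch of the idea (an illustrative stand-in, not Thinc's SizedGenerator):

```python
from typing import Callable, Iterator

class MiniSizedGenerator:
    # Simplified stand-in for a sized generator: wraps a zero-argument
    # function returning a fresh generator, plus a known length, so the
    # stream can be iterated repeatedly and reported via len().
    def __init__(self, get_items: Callable[[], Iterator], length: int):
        self.get_items = get_items
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __iter__(self) -> Iterator:
        # Re-calling the function restarts the stream each epoch.
        yield from self.get_items()

batches = MiniSizedGenerator(lambda: iter([[1, 2], [3, 4]]), length=2)
```

Unlike a plain generator, which is exhausted after one pass and has no length, this object can be looped over every epoch and still report `len(batches)` to a progress bar.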
Example:

```python
train_data = model.ops.multibatch(128, train_X, train_Y, shuffle=True)
assert isinstance(train_data, SizedGenerator)
for i in range(10):
    for X, Y in tqdm(train_data, leave=False):
        Yh, backprop = model.begin_update(X)
```
A tuple of (args, kwargs) that can be spread into some function f:
f(*args, **kwargs). Makes it easier to handle positional and keyword arguments
that get passed around, especially when integrating custom models.
The positional arguments. Can be passed into a function as *ArgsKwargs.args.
The keyword arguments. Can be passed into a function as **ArgsKwargs.kwargs.
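A sketch of the spreading pattern, using an illustrative stand-in class rather than Thinc's ArgsKwargs:

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

@dataclass
class MiniArgsKwargs:
    # Simplified stand-in for an (args, kwargs) bundle.
    args: Tuple[Any, ...]
    kwargs: Dict[str, Any]

def f(x, y, scale=1.0):
    # Hypothetical target function for the spread call.
    return (x + y) * scale

bundle = MiniArgsKwargs(args=(2, 3), kwargs={"scale": 10.0})
result = f(*bundle.args, **bundle.kwargs)  # (2 + 3) * 10.0 == 50.0
```

The bundle can be stored and passed around as a single value, then spread into the call at the last moment.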
Create an ArgsKwargs object from a sequence of (key, value) tuples, such as
produced by ArgsKwargs.items. Each key should be either a string or an
integer. Items with integer keys are added to the args, and items with string
keys are added to the kwargs. The order of the args is determined by the
sequence order of the items, not by the values of the integer keys.
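The keying rule can be sketched as follows (an illustrative function, not Thinc's implementation):

```python
def from_items(items):
    # Integer keys contribute to args in sequence order; string keys
    # become kwargs. The integer values themselves are ignored.
    args = []
    kwargs = {}
    for key, value in items:
        if isinstance(key, int):
            args.append(value)
        else:
            kwargs[key] = value
    return tuple(args), kwargs

args, kwargs = from_items([(9, "first"), (0, "second"), ("name", "third")])
print(args, kwargs)  # ('first', 'second') {'name': 'third'}
```

Note that "first" precedes "second" in the result even though its integer key (9) is larger: only the order of the items matters.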