Type Checking
By this point you’ve probably seen that Thinc uses the new Python 3.6+ syntax for type hints or “type annotations”. The whole code base is type-annotated and it is recommended that you add at least some types to your own code, too. Type annotations can make your numeric code much more explicit, making it easier to come back to later. Type annotations also allow your editor (and many other tools) to perform type checks before executing your code. They also power autocompletion. For example, if you try to add a str and an int, your editor will probably warn you that it is an invalid operation, without having to wait until you run the invalid code. It may also tell you that a function expects a float, so you don’t pass it an invalid type. If your layer is typed as Model[Floats2d, Ints1d], Thinc can tell you if its inputs and outputs are incompatible with the rest of your network.
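As a minimal sketch of the kind of mistake described above, the hypothetical functions below (the names are illustrative and not part of Thinc) add a str and an int. A static checker such as mypy should flag the marked line before the code runs; without one, the problem only appears as a TypeError when the line executes.

```python
def describe_count(count: int) -> str:
    # A type checker should flag this line: adding a str and an int is invalid.
    return "items: " + count


def describe_count_fixed(count: int) -> str:
    # Converting explicitly makes the operation valid for the checker and for Python.
    return "items: " + str(count)


def fails_at_runtime() -> bool:
    # Without static checking, the mistake only surfaces when the line executes.
    try:
        describe_count(3)
    except TypeError:
        return True
    return False
```

Calling `fails_at_runtime()` confirms that the unchecked version raises, while `describe_count_fixed(3)` returns a string as declared.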
Thinc’s type-system won’t catch every error. It has no representation for the sizes of your dimensions, so a lot of invalid operations can’t be detected until runtime. Sometimes the syntax gets quite ugly, and the error messages are often frustratingly opaque. Nevertheless, we do recommend you try it out, especially for your model definitions and the functions you’re registering for the config system.
Installation and setup
mypy is the “standard” type checker for Python. In fact, that’s where these new Python type hints were born. You can install mypy from pip or conda. If you use a virtual environment for your project, make sure that you install it in the same environment.
pip
pip install mypy
conda
conda install -c conda-forge mypy
Thinc comes with a mypy plugin that extends the normal functionality to perform additional type checks in code using Thinc. If you installed Thinc, you already have the plugin. To enable the Thinc plugin for mypy, you just have to create a file mypy.ini within your project folder. This will tell mypy to use the plugin in the module thinc.mypy. If you use pydantic for advanced configuration, you can also enable pydantic’s plugin. If you’re using Thinc as part of your Python package, you can also add the [mypy] section to your package’s setup.cfg.
mypy.ini
[mypy]
plugins = thinc.mypy
mypy.ini
[mypy]
plugins = thinc.mypy, pydantic.mypy
To type check a file or directory, you can now use the mypy command:
mypy my_file.py
Setting up linting in your editor
Real-time linting is especially powerful, as it lets you type-check your code as it leaves your fingers. This often lets you catch errors in their original context, when they’re least confusing. It can also save you trips to the documentation.
Visual Studio Code | If you use Visual Studio Code, make sure you install the Python extension. Then select the appropriate environment in your editor. If you installed mypy in the same environment and select it in your editor, after adding the mypy.ini file (as described above) everything should work. |
PyCharm | If you use PyCharm, make sure you configure the Python Interpreter for your project. Then install the “Mypy” plugin. You may also want to install the “Mypy (Official)” plugin. If you installed mypy in the same environment/interpreter, after adding the mypy.ini file (as described above) and installing the plugin, everything should work. |
Other editors | See the mypy docs for instructions for other editors like Vim, Emacs, Sublime Text and Atom. |
Static type checking
“Static type checking” means that your editor (or other tools) will check the code using the declared types before running it. Because it is done before running the code, it’s called “static”. The contrary would be “dynamic” type checking, where checks are performed at runtime, while the program is running and the code is being executed. (Thinc also does runtime validation by the way!) As editors and similar tools can’t just randomly run your code to verify that it’s correct, we have these type annotations to help editors check the code and provide autocompletion.
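To make the distinction concrete, here is a small sketch using only the standard library. Python itself does not enforce annotations at runtime, so the hypothetical `double` function accepts a str without complaint. A static checker should reject that call without running it, whereas the hand-written `double_checked` performs a dynamic check while the program runs. Both function names are assumptions made for illustration.

```python
def double(x: int) -> int:
    # Annotations are not enforced by the interpreter.
    return x * 2


def double_checked(x: int) -> int:
    # A dynamic (runtime) check: it only triggers when this line executes.
    if not isinstance(x, int):
        raise TypeError(f"expected int, got {type(x).__name__}")
    return x * 2


# A static checker should reject this call, but Python runs it anyway
# and repeats the string instead of doubling a number.
unchecked_result = double("ab")
```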
Even if you never run a type-checker, adding type-annotations to your code can greatly improve its readability. Multi-dimensional array libraries like numpy make it easy to write terse, fairly general code – but when you revisit the code later, it’s often very hard to figure out what’s happening without executing the code and debugging.
No types
def do_things(A, B):
    A = A.reshape(A.shape + (B.shape[-1],))
    A = A.sum(axis=-1)
    # Is this last line an error? Maybe they wanted axis=-1?
    return (B * A).sum()
Types
from thinc.types import Floats2d, Floats3d

def do_things(A: Floats2d, B: Floats3d) -> float:
    A = A.reshape(A.shape + (B.shape[-1],)).sum(axis=-1)
    # Ah, the function says it returns float --- so this all makes sense.
    return (B * A).sum()
Type annotations provide a relatively concise way to document some of the most important information about your code. The same information can be provided in comments, but unless you use a consistent syntax, your type comments will probably be much longer and more distracting than the equivalent annotations.
Another advantage of type annotations as documentation is that they can be queried for more detail, while with comments, you have to choose the level of detail to provide up-front. Thinc’s type annotations take into account numpy’s tricky indexing system, and also the semantics of the different reduction operations as different arguments are passed in. This makes it much easier to follow along with steps that might have felt obvious to the author of the code.
Array shape types
from thinc.types import Floats3d, Ints1d

def numpy_shapes_pop_quiz(arr1: Floats3d, indices: Ints1d):
    # How many dimensions do each of these arrays have?
    q1 = arr1[0]
    q2 = arr1.mean()
    q3 = arr1[1:, 0]
    q4 = arr1[1:, :-1]
    q5 = arr1.sum(axis=0)
    q6 = arr1[1:, ..., :-1]
    q7 = arr1.sum(axis=(0, 1), keepdims=True)
    q8 = arr1[indices].cumsum()
    q9 = arr1[indices[indices]].ptp(axis=(-2, -1))
    # Run mypy over the snippet to find out your score!
    reveal_type(q1)
    reveal_type(q2)
    reveal_type(q3)
    reveal_type(q4)
    reveal_type(q5)
    reveal_type(q6)
    reveal_type(q7)
    reveal_type(q8)
    reveal_type(q9)
Using Thinc’s custom types in your code
Array types
Thinc relies heavily on the numpy.ndarray interface, which was not designed with type checking or type annotations in mind. The numpy API is extremely polymorphic, with most common operations returning a variety of output types depending on what combination of arguments is provided. Retrofitting a type system to the interface will always involve some compromise between “correctness” (whether the type-system approves all and only valid numpy code) and “type sanity”: whether the type-system is able to infer useful types, so you can catch more bugs with less detailed annotations.
While official type-annotations for numpy will likely have to lean towards correctness, Thinc has the luxury of leaning heavily towards type-sanity. We accept a few usage limitations, offer type-specific options for a few common operations (array allocation, reshaping and conversion), and declare a few usage patterns off-limits (such as passing dtype into many methods or functions).
Floats1d, Floats2d, Floats3d, Floats4d, FloatsXd | 1d, 2d, 3d, 4d and any-d arrays of floats. |
Ints1d, Ints2d, Ints3d, Ints4d, IntsXd | 1d, 2d, 3d, 4d and any-d arrays of ints. |
Array1d, Array2d, Array3d, Array4d, ArrayXd | 1d, 2d, 3d, 4d and any-d arrays of floats or ints. |
We also compromise on how much detail the type-system can be expected to represent, settling on two useful distinctions: broad data type (ints vs. floats), and number of dimensions (1, 2, 3, 4, and many) – so we have 10 array subtypes in total. Notably, our type-system does not specify actual array shapes. Representing the shapes as well would be fantastic, but it would make the system far more verbose, complex and difficult to work with.
Generic Model types
Thinc also makes use of type-annotations for its model class, by making it a generic with two type parameters, representing the layer’s input and output types. Generics let you tell the type-system a little bit more information on a per-instance basis. For instance, the typing.List class is a generic: you can write List[int] to denote a list of integers, and List[str] to denote a list of strings. This helps you declare your interfaces more precisely, and lets the type-system infer types when you later loop over the list, all without having to declare your own subclass.
Type-parameters for generics are written between square brackets, and are comma-delimited like function arguments. So to specify that a model takes a list of strings as input and produces a two-dimensional array of integers as output, you would write Model[List[str], Ints2d]. You can also under-specify either or both arguments, by writing e.g. Model[Any, Any], Model[Any, Ints2d], etc. If you specify simply Model, that is read as syntactic sugar for Model[Any, Any].
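The same mechanism works for your own classes. The sketch below defines a hypothetical `Layer` class (it is not Thinc’s `Model`, only an illustration) that is generic over an input type and an output type, using `typing.Generic` and `typing.TypeVar` from the standard library. The helper `count_chars` is likewise an assumption made for the example.

```python
from typing import Callable, Generic, List, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")


class Layer(Generic[InT, OutT]):
    """Hypothetical stand-in for a model that maps InT to OutT."""

    def __init__(self, func: Callable[[InT], OutT]) -> None:
        self.func = func

    def predict(self, X: InT) -> OutT:
        return self.func(X)


def count_chars(words: List[str]) -> List[int]:
    return [len(word) for word in words]


# The annotation records that this layer takes a list of strings
# and produces a list of integers.
lengths: Layer[List[str], List[int]] = Layer(count_chars)
result = lengths.predict(["type", "hints"])
```

With the two type parameters filled in, a type checker can infer that `result` is a `List[int]` without any further annotation.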
A common problem at first is that it feels natural to write simply Model for code that should be agnostic to input and output types. This generally works as an input annotation, but if you use it as a return type you’ll often have problems.
Sane but wrong
from thinc.api import Model

def pop_layer(model: Model) -> Model:
    model.layers.pop(0)
    return model
Thanks, I hate it
from typing import TypeVar
from thinc.api import Model

_T = TypeVar("_T", bound=Model)

def pop_layer(model: _T) -> _T:
    model.layers.pop(0)
    return model
The problem is that you need a way to note that the model your function is returning is of the same type as your function’s argument. The typing.TypeVar class provides a (syntactically awkward) solution: if you use the same TypeVar in your method definition, that will denote that the two variables must have the same type. However, the scoping behavior is very surprising: you can use the same TypeVar in different function definitions, and it won’t bind the types between them. The behavior with respect to classes and generics is also quite subtle, and mostly left as an exercise to the reader by the official documentation.
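The scoping rule is easier to see in a small standard-library sketch. The classes `Base` and `Sub` and both functions are hypothetical names chosen for illustration. The same `_T` appears in two function signatures, yet each call binds it independently.

```python
from typing import List, TypeVar


class Base:
    """Hypothetical base class standing in for a model type."""


class Sub(Base):
    def extra(self) -> str:
        return "only on Sub"


_T = TypeVar("_T", bound=Base)


def passthrough(obj: _T) -> _T:
    # Within this signature, _T links the argument type to the return type.
    return obj


def first(items: List[_T]) -> _T:
    # Reusing _T here is fine: it is bound afresh for each function.
    return items[0]


sub = passthrough(Sub())  # inferred as Sub, so .extra() type-checks
base = first([Base(), Base()])  # inferred as Base, independent of the call above
message = sub.extra()
```

Had `passthrough` been annotated as `(obj: Base) -> Base`, a checker would treat `sub` as a plain `Base` and reject the call to `sub.extra()`.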
Although TypeVars will make your days a little bit worse, they are often necessary, and we suggest making peace with having to use them. However, in some situations you can instead use the @overload decorator. This alternative will work whenever you can enumerate a small number of specific types. The @overload decorator lets you describe multiple input-type to return-type mappings for your function, without actually changing the return-type dispatch. It’s also more flexible than TypeVars in many situations, as you can express more subtle relationships. We use @overload extensively in our array definitions, to represent numpy’s polymorphic behavior, so you can find some more complex examples of @overload in the thinc.types module.
Type logic with overload
from typing import Union, overload

@overload
def toggle_types(hello: str) -> int:
    ...

@overload
def toggle_types(hello: int) -> str:
    ...

def toggle_types(hello: Union[str, int]) -> Union[str, int]:
    return 1 if isinstance(hello, str) else "hello"
Tips, tricks & best practices
- If you’re just starting out with type annotations, try just using them in your model-building functions. The type annotations can help you plan your model “top down”, so you can see when things aren’t lining up even without running the type checker.
- If you’re using the @registry decorator to register functions for the config system, you probably want to add type annotations to your declaration. This lets the config system validate the arguments, which can catch a lot of errors.
- Writing type-checked code invites slightly different conventions. You’ll often want to split interfaces up into several functions, so that the types are a bit more specific. For instance, we made separate alloc2f, alloc2i, alloc3f etc. methods on the Ops object, because these return specific types. We only use the generic one for generic code, where the desired type is unknown.
- Instead of nested Python containers, try using the @dataclass decorator to create small, behavior-less struct-like classes. You would have to create a name for the nested format anyway, so it may as well be a useful type.
- It sometimes helps to think of the type-checker as trying to falsify your code, rather than trying to validate it. For instance, if you’re looking at a particular piece of code, you might see that the variable passed into your function is of a certain type, and the type-checker knows that – so why does it raise a problem in the function? The answer is that the checker is thinking about any possible input value, according to the type declarations.
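To illustrate the tip about replacing nested containers, here is a minimal sketch with a hypothetical `TrainingExample` dataclass. The class name, its fields and the `longest` helper are assumptions made for illustration only and do not come from Thinc.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Nested containers: the reader has to remember what each position means.
NestedExample = Tuple[str, Dict[str, List[int]]]


@dataclass
class TrainingExample:
    """Hypothetical struct-like class replacing the nested format above."""

    text: str
    token_ids: List[int]
    label: int


def longest(examples: List[TrainingExample]) -> TrainingExample:
    # Fields are accessed by name, so a type checker can catch typos.
    return max(examples, key=lambda eg: len(eg.token_ids))


examples = [
    TrainingExample(text="hello world", token_ids=[1, 2], label=0),
    TrainingExample(text="type hints are useful", token_ids=[3, 4, 5, 6], label=1),
]
best = longest(examples)
```

The dataclass costs a few extra lines, but every function that receives a `TrainingExample` now documents exactly which fields it can rely on.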