Configuration is a huge problem for machine-learning code because you may want
to expose almost any detail of any function as a hyperparameter. The setting you
want to expose might be arbitrarily far down in your call stack, so it might
need to pass all the way through the CLI or REST API and through any number of
intermediate functions, affecting the interface of everything along the way.
And once those settings are added, they become hard to remove later.
Default values also become hard to change without breaking backwards
compatibility.
To solve this problem, Thinc leverages confection, a config system that lets
you easily describe arbitrary trees of objects. The objects can be
created via function calls you register using a simple decorator syntax. You
can even version the functions you create, allowing you to make improvements
without breaking backwards compatibility. The most similar config system we’re
aware of is Gin, which uses a similar
syntax, and also allows you to link the configuration system to functions in
your code using a decorator. Thinc’s config system is simpler and emphasizes a
different workflow via a subset of Gin’s functionality.
```ini
[training]
patience = 10
dropout = 0.2
use_vectors = false

[training.logging]
level = "INFO"

[nlp]
# This uses the value of training.use_vectors
use_vectors = ${training.use_vectors}
lang = "en"
```
The config is divided into sections, with the section name in square brackets –
for example, [training]. Within the sections, config values can be assigned to
keys using =. Values can also be referenced from other sections using the dot
notation and placeholders indicated by the dollar sign and curly braces. For
example, ${training.use_vectors} will receive the value of use_vectors in
the training block. This is useful for settings that are shared across
components.
The config format has three main differences from Python's built-in
configparser:

1. JSON-formatted values. Thinc passes all values through json.loads to
   interpret them. You can use atomic values like strings, floats, integers or
   booleans, or you can use complex objects such as lists or maps (see the
   example after this list).
2. Structured sections. Thinc uses a dot notation to build nested sections. If
   you have a section named [section.subsection], Thinc will parse that into a
   nested structure, placing subsection within section.
3. References to registry functions. If a key starts with @, Thinc will
   interpret its value as the name of a function registry, load the function
   registered for that name and pass in the rest of the block as arguments. If
   type hints are available on the function, the argument values (and the
   return value of the function) will be validated against them. This lets you
   express complex configurations, like a training pipeline where batch_size
   is populated by a function that yields floats (see schedules). Also see the
   section on registry integration for more details.
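For example, the following sketch shows JSON-formatted values and a dotted
section name that produces a nested structure (section names and values here
are illustrative):

```ini
[section]
size = 16
use_gpu = false
names = ["foo", "bar"]
mapping = {"en": 1, "de": 2}

[section.subsection]
seed = 0
```

When parsed, subsection (with its seed value) ends up nested within section.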
There’s no pre-defined scheme you have to follow; how you set up the top-level
sections is up to you. At the end of it, you’ll receive a dictionary with the
values that you can use in your script – whether it’s fully initialized
functions or just basic settings. For examples that show Thinc’s
config system in action, check out the following tutorials:
- Intro to Thinc · Everything you need to know to get started. Composing and training a model on the MNIST data, using config files, registering custom functions and wrapping PyTorch, TensorFlow and MXNet models.
- Basic CNN part-of-speech tagger · Implementing and training a basic CNN part-of-speech tagger without external dependencies, using different levels of Thinc's configuration system.
Thinc’s registry system lets you map string keys
to functions. For instance, let’s say you want to define a new optimizer. You
would define a function that constructs it and add it to the right registry,
like so:
```python
from typing import Union, Iterable
import thinc

@thinc.registry.optimizers.register("my_cool_optimizer.v1")
def make_my_optimizer(learn_rate: Union[float, Iterable[float]], gamma: float):
    return MyCoolOptimizer(learn_rate, gamma)

# Later you can retrieve your function by name:
create_optimizer = thinc.registry.optimizers.get("my_cool_optimizer.v1")
```
The registry lets you refer to your function by string name, which is often more
convenient than passing around the function itself. This is especially useful
for configuration files: you can provide the name of your function and the
arguments in the config file, and you’ll have everything you need to rebuild
the object.
Since this is a common workflow, the registry system provides a shortcut for it,
the registry.resolve function. If a
section contains a key beginning with @, it will be interpreted as the name of
a function registry – e.g. @optimizers refers to a function registered in the
optimizers registry. The value will be interpreted as the name to look up and
the rest of the block will be passed into the function as arguments. Here’s a
simple example:
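```ini
[optimizer]
@optimizers = "my_cool_optimizer.v1"
learn_rate = 0.001
gamma = 1e-8
```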
Under the hood, Thinc will look up the "my_cool_optimizer.v1" function in the
"optimizers" registry and then call it with the arguments learn_rate and
gamma. If the function has type annotations, it will also validate the
input. For instance, if learn_rate is annotated as a float and the config
defines a string, Thinc will raise an error.
```python
optimizer_func = thinc.registry.get("optimizers", "my_cool_optimizer.v1")
optimizer = optimizer_func(learn_rate=0.001, gamma=1e-8)
```
The function registry integration becomes even more powerful when used to build
recursive structures. Let’s say you want to use a learning rate schedule and
pass in a schedule as the learn_rate argument. Here’s an example of a function
that yields an infinite series of decaying values, following the schedule
base_rate * 1 / (1 + decay * t). It’s also available in Thinc as
schedules.decaying. The decorator registers
the function "my_cool_decaying_schedule.v1" in the registry schedules:
When Thinc resolves the config, it will first look up
"my_cool_decaying_schedule.v1" and call it with its arguments. Both arguments
will be validated against the type annotations (float). The return
value will then be passed to the optimizer function as the learn_rate
argument. If type annotations are available for the return value and it’s a type
that can be evaluated, the return value of the function will be validated as
well.
Under the hood:

```python
learn_rate_func = thinc.registry.get("schedules", "my_cool_decaying_schedule.v1")
learn_rate = learn_rate_func(base_rate=0.001, decay=1e-4)
optimizer_func = thinc.registry.get("optimizers", "my_cool_optimizer.v1")
optimizer = optimizer_func(learn_rate=learn_rate, gamma=1e-8)
```
After resolving the config and filling in the values, registry.resolve will
return the resolved output. Here, that’s a dict with one key, "optimizer",
mapped to an instance of the custom optimizer, initialized with the arguments
defined in the config. (To get a filled config with defaults added but
references to registered functions left intact, use registry.fill, described
later in this section.)
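For instance, assuming the config above is stored in a string CONFIG (a
hypothetical variable), resolving it looks like this:

```python
from thinc.api import Config, registry

config = Config().from_str(CONFIG)   # parse the config string
resolved = registry.resolve(config)  # look up and call the registered functions
optimizer = resolved["optimizer"]
```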
If you’re setting function arguments in a config block, Thinc will expect the
function to have an argument of that same name. For instance,
base_rate = 0.001 means that the function will be called with
base_rate=0.001. This works fine, since Python allows function arguments to be
supplied as positional arguments or as keyword arguments. Where possible, named
arguments are recommended, since they make your code and config more explicit.
However, in some situations, your registered function may accept variable
positional arguments. In your config, you can then use * to define a list of
values:
```python
import thinc
from thinc.schedules import Schedule

@thinc.registry.schedules("my_cool_schedule.v1")
def step_values(*steps: float, final: float = 1.0) -> Schedule[float]:
    return Schedule(
        "step_values",
        _step_values_schedule,
        attrs={"steps": list(steps), "final": final},
    )

def _step_values_schedule(schedule: Schedule, step: int, **kwargs) -> float:
    steps = schedule.attrs["steps"]
    final = schedule.attrs["final"]
    return steps[step] if step < len(steps) else final
```
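In the config, the variable positional values are then supplied under the *
key, a sketch:

```ini
[schedule]
@schedules = "my_cool_schedule.v1"
* = [0.05, 0.1, 0.25, 0.75, 0.9]
final = 1.0
```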
You can also use the * placeholder in nested configs to populate positional
arguments from function registries. This is useful for combinators like
chain that take a variable number of layers as
arguments. The following config will create two Relu
layers, pass them to chain and return a
combined model:
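A sketch, using the chain.v1 and Relu.v1 names registered in Thinc's layers
registry (layer sizes here are illustrative):

```ini
[model]
@layers = "chain.v1"

[model.*.relu1]
@layers = "Relu.v1"
nO = 512
dropout = 0.2

[model.*.relu2]
@layers = "Relu.v1"
nO = 256
dropout = 0.1
```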
For hyperparameters and other settings that need to be used in different places
across your config, you can define a separate block once and then reference the
values using the
extended interpolation.
For example, ${hyper_params.dropout} will insert the value of dropout from
the section hyper_params.
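For example (section contents are illustrative):

```ini
[hyper_params]
dropout = 0.2

[model]
@layers = "Relu.v1"
nO = 512
dropout = ${hyper_params.dropout}
```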
Thinc’s registry includes
several pre-defined registries that are also used
for its built-in functions. You can also use the
registry.create method to add your own
registries that you can then reference in config files. The following will
create a registry visualizers and let you use the
@thinc.registry.visualizers decorator, as well as the @visualizers key in
config files.
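A minimal sketch (MyCoolVisualizer is a hypothetical class):

```python
import thinc

class MyCoolVisualizer:
    # Hypothetical visualizer used for illustration
    def __init__(self, format: str):
        self.format = format

thinc.registry.create("visualizers")

@thinc.registry.visualizers("my_cool_visualizer.v1")
def my_cool_visualizer(format: str = "jpg") -> MyCoolVisualizer:
    return MyCoolVisualizer(format)
```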
pydantic is a modern Python library for data parsing and validation using type
hints. Thinc uses it to validate configuration files, and you can also use it
in your model and component definitions to enforce stricter and more
fine-grained validation. If type annotations only define basic types like
str, int or bool, the validation will accept all values that can be cast to
this type. For instance, 0 is considered valid for bool, since bool(0) is a
valid cast. If you need
stricter validation, you can use
strict types
instead. This example defines an optimizer that only accepts a float, a positive
integer and a constrained string matching the given regular expression:
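A sketch, assuming pydantic v1's StrictFloat, PositiveInt and constr types
(MyCoolOptimizer is the hypothetical class from earlier):

```python
import thinc
from pydantic import StrictFloat, PositiveInt, constr

# Redefines the example optimizer with stricter argument types
@thinc.registry.optimizers("my_cool_optimizer.v1")
def make_my_optimizer(
    learn_rate: StrictFloat,
    steps: PositiveInt = 10,
    log_level: constr(regex="(DEBUG|INFO|WARNING|ERROR)") = "ERROR",
):
    return MyCoolOptimizer(learn_rate, steps, log_level)
```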
If your config defines a value that’s not compatible with the type annotations –
for instance, a negative integer for steps – Thinc will raise an error:
```
Config validation error

steps   ensure this value is greater than 0

{'@optimizers': 'my_cool_optimizer.v1', 'learn_rate': 0.001, 'steps': -1, 'log_level': 'DEBUG'}
```
Argument annotations can also define
pydantic models. This is
useful if your function takes dictionaries as arguments. The data is then passed
to the model and is parsed and validated. pydantic models are classes that
inherit from the pydantic.BaseModel class and define fields with type hints
and optional defaults as attributes:
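A sketch (the field names and the optimizer are illustrative):

```python
import thinc
from pydantic import BaseModel

class LoggingConfig(BaseModel):
    name: str
    level: int = 0

@thinc.registry.optimizers("my_cool_optimizer.v1")
def make_my_optimizer(learn_rate: float, logging_config: LoggingConfig):
    return MyCoolOptimizer(learn_rate, logging_config)
```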
In the config file, logging_config can now become its own section,
[optimizer.logging_config]. Its values will be validated against the
LoggingConfig schema:
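For example (values illustrative):

```ini
[optimizer]
@optimizers = "my_cool_optimizer.v1"
learn_rate = 0.001

[optimizer.logging_config]
name = "my_logger"
level = 20
```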
For even more flexible validation of values and relationships between them, you
can define validators
that apply to one or more attributes and return the parsed attribute. In this
example, the validator checks that the value of name doesn’t contain spaces
and returns its lowercase form:
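A sketch using pydantic v1's validator decorator:

```python
from pydantic import BaseModel, validator

class LoggingConfig(BaseModel):
    name: str
    level: int = 0

    @validator("name")
    def name_must_not_contain_spaces(cls, v):
        # Reject names with spaces and normalize to lowercase
        if " " in v:
            raise ValueError("name can't contain spaces")
        return v.lower()
```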
If a config file specifies registered functions, their argument values will be
validated against the type annotations of the function. For all other values,
you can pass a schema to
registry.resolve, a
pydantic model used to
parse and validate the data. Models can also be nested to describe nested
objects.
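For instance, a base schema for the [training] block from the start of this
section might look like the following sketch (field types inferred from the
example values; CONFIG is a hypothetical config string):

```python
from pydantic import BaseModel
from thinc.api import Config, registry

class TrainingSchema(BaseModel):
    patience: int
    dropout: float
    use_vectors: bool = False

class ConfigBaseSchema(BaseModel):
    training: TrainingSchema

config = Config().from_str(CONFIG)
resolved = registry.resolve(config, schema=ConfigBaseSchema)
```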
Setting extra = "forbid" in the
Config means that
validation will fail if the object contains additional properties – for
instance, another top-level section that’s not training. The default value,
"ignore", means that additional properties will be ignored and filtered out.
Setting extra = "allow" means any extra values will be passed through without
validation.
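On a pydantic v1 model, this is expressed via the nested Config class, a
sketch reusing the schema from above:

```python
class ConfigBaseSchema(BaseModel):
    training: TrainingSchema

    class Config:
        extra = "forbid"  # fail on unexpected top-level sections
```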
The main motivation for Thinc’s configuration system was to eliminate hidden
defaults and ensure that config settings are passed around consistently. This
also means that config files should always define all available settings.
The registry.fill method also resolves the
config, but it leaves references to registered functions intact and doesn’t
replace them with their return values. If type annotations and/or a base schema
are available, they will be used to parse the config and fill in any missing
values and defaults to create an up-to-date “master config”.
Let’s say you’ve updated your schema and scripts to use two additional optional
settings. These settings should also be reflected in your config files so they
accurately represent the available settings (and don’t assume any hidden
defaults).
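A sketch, adding two hypothetical optional settings, use_tok2vec and
max_epochs, to the schema:

```python
class TrainingSchema(BaseModel):
    patience: int
    dropout: float
    use_vectors: bool = False
    # Two new optional settings with defaults:
    use_tok2vec: bool = False
    max_epochs: int = 100
```

Running registry.fill with this schema adds use_tok2vec = false and
max_epochs = 100 to the [training] section, so the file stays in sync with
the code.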
The same also works for config blocks that reference registry functions. If your
function arguments change, you can run registry.fill to get your config up
to date with the new defaults. For instance, let’s say the optimizer now allows
a new setting, gamma, that defaults to 1e-8:
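```python
from typing import Union, Iterable
import thinc

# The optimizer from earlier, now with a default for gamma
@thinc.registry.optimizers.register("my_cool_optimizer.v1")
def make_my_optimizer(learn_rate: Union[float, Iterable[float]], gamma: float = 1e-8):
    return MyCoolOptimizer(learn_rate, gamma)
```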
The config file should now also reflect this new setting and the default value
that’s being passed in – otherwise, you’ll lose that piece of information.
Running registry.fill solves this and returns a new Config with the complete
set of available settings:
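A sketch of the before and after:

```python
from thinc.api import Config, registry

config = Config().from_str("""
[optimizer]
@optimizers = "my_cool_optimizer.v1"
learn_rate = 0.001
""")

filled = registry.fill(config)
# The filled config now also records the default:
#
# [optimizer]
# @optimizers = "my_cool_optimizer.v1"
# learn_rate = 0.001
# gamma = 1e-8
```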