# YAML configuration

This page is a reference for version 2 of the YAML configuration file for
Git repositories handled by Arkindex. Version 1 is not supported.

The configuration file is always named `.arkindex.yml` and should be found at
the root of the repository.

## Required attributes

The following attributes are required in every `.arkindex.yml` file:

`version`
: Version of the configuration file in use. An error will occur if the version
  number is not set to `2`.

### Example configuration

```yaml
---
version: 2
workers:
  - workers/config.yml
```

This configuration loads the single worker described in `workers/config.yml`, a
path resolved from the root of the repository.

## Worker repository attributes

The `workers` attribute is a list of the following:

- Paths to a YAML file holding the configuration for a single worker
- Unix-style patterns matching paths to YAML files holding the configuration
  for a single worker
- The configuration of a single worker embedded directly into the file

### Single worker configuration

The following describes the attributes of a YAML file configuring one worker, or
of the configuration embedded directly in the `.arkindex.yml` file.

All attributes are optional unless explicitly specified.

`name`
: Mandatory. Name of the worker, for display purposes.

`slug`
: Mandatory. Slug of this worker. The slug must be unique across the repository and may only contain alphanumeric characters, underscores or hyphens.

`type`
: Mandatory. Type of the worker, for display purposes only. Some common values
include:

    - `classifier`
    - `recognizer`
    - `ner`
    - `dla`
    - `word-segmenter`
    - `paragraph-creator`

`gpu_usage`
: Whether or not this worker requires or supports GPUs. Defaults to `disabled`. May take one of the following values:

    `required`
    : This worker requires a GPU, and will only be run on Ponos agents whose hosts have a GPU.

    `supported`
    : This worker supports using a GPU, but may run on any available host, including those without GPUs.

    `disabled`
    : This worker does not support GPUs. It may run on a host that has a GPU, but it will ignore it.

`model_usage`
: Whether or not this worker requires a model version to run. Defaults to `disabled`. May take one of the following values:

    `required`
    : This worker requires a model version, and will only be run on processes with a model.

    `supported`
    : This worker supports a model version, but may run on any process, including those without a model.

    `disabled`
    : This worker does not support model versions. It may run on a process that has a model, but it will ignore it.

`docker`
: Groups Docker-related configuration attributes:

    `build`
    : Path to a Dockerfile used to build this worker, relative to the root of the repository. Defaults to `Dockerfile`.

    `command`
    : Custom command line to be used when launching the Docker container for this worker. By default, the command specified in the Dockerfile is used.

    `shm_size`
    : Size of the shared memory available in `/dev/shm`. The default value is `64M`, but training machine learning models may require more. The value must be either an integer, or an integer followed by a unit (`b` for bytes, `k` for kilobytes, `m` for megabytes and `g` for gigabytes). If no unit is specified, the value is read as bytes. See the [Docker documentation](https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources).

    `environment`
    : Mapping of string keys to string values defining the environment variables set when the Docker image runs.

`configuration`
: Mapping holding any string keys and values that can be later accessed in the
worker's Python code. Can be used to define settings on your own worker, such as
a file's location.

`user_configuration`
: Mapping defining settings on your worker that can be modified by users. [See below](#setting-up-user-configurable-parameters) for details.

`secrets`
: List of the names of the secrets required by this worker. For more information on using secrets in workers, see the official Arkindex [documentation](https://doc.arkindex.org/secrets).
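
As a rough sketch of how these attributes fit together, here is what a standalone worker file might look like. It assumes that such a file carries the attributes described above at its top level. The worker name, slug, environment variable, configuration key and secret name are all hypothetical.

```yaml
---
# Hypothetical worker file, referenced from the `workers` list of .arkindex.yml
name: Line recognizer
slug: line_recognizer
type: recognizer
# May run on any host, and uses a GPU when one is available
gpu_usage: supported
# Only runs on processes that have a model
model_usage: required
docker:
  build: Dockerfile
  # 2 gigabytes of shared memory instead of the 64M default
  shm_size: 2g
  environment:
    # Environment values are strings, so the number is quoted
    BATCH_SIZE: "8"
configuration:
  language: fr
secrets:
  - recognizer/credentials
```

Quoting `"8"` keeps the value a string once the YAML is parsed, which matches the string-to-string mapping expected by `environment`.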

### Setting up user-configurable parameters

The YAML file can define parameters that users will be able to change when they use this worker in a process on Arkindex. These parameters are listed in a `user_configuration` attribute.

A parameter is defined using the following settings:

`title`
: Mandatory. The parameter's title.

`type`
: Mandatory. A value type. The supported types are:

    - `int`
    - `bool`
    - `float`
    - `string`
    - `enum`
    - `list`
    - `dict`
    - `model`

`default`
: Optional. A default value for the parameter. Must be of the defined parameter `type`.

`required`
: Optional. A boolean; when `true`, users cannot leave the parameter blank. Defaults to `false`.

`choices`
: Optional. A list of options for `enum` type parameters.

`subtype`
: Optional. The type of the elements of `list` type parameters.

This definition allows for both validation of the input and the display of a form to make configuring workers easy for Arkindex users.

![User configuration](user_configuration/configuration_form.png "User configuration form on Arkindex")

#### String parameters

String-type parameters must be defined using a `title` and the `string` `type`. You can also set a `default` value for this parameter, which must be a string, as well as make it a `required` parameter, which prevents users from leaving it blank.

For example, a string-type parameter can be defined like this:

```yaml
subfolder_name:
  title: Created Subfolder Name
  type: string
  default: My Neat Subfolder
```

Which will result in the following display for the user:

![String-type parameter](user_configuration/string_config.png "Example string-type parameter.")

#### Integer parameters

Integer-type parameters must be defined using a `title` and the `int` `type`. You can also set a `default` value for this parameter, which must be an integer, as well as make it a `required` parameter, which prevents users from leaving it blank.

For example, an integer-type parameter can be defined like this:

```yaml
input_size:
  title: Input Size
  type: int
  default: 768
  required: True
```

Which will result in the following display for the user:

![integer-type parameter](user_configuration/integer_config.png "Example integer-type parameter.")

#### Float parameters

Float-type parameters must be defined using a `title` and the `float` `type`. You can also set a `default` value for this parameter, which must be a float, as well as make it a `required` parameter, which prevents users from leaving it blank.

For example, a float-type parameter can be defined like this:

```yaml
wip:
  title: Word Insertion Penalty
  type: float
  required: True
```

Which will result in the following display for the user:

![Float-type parameter](user_configuration/float_config.png "Example float-type parameter.")

#### Boolean parameters

Boolean-type parameters must be defined using a `title` and the `bool` `type`. You can also set a `default` value for this parameter, which must be a boolean, as well as make it a `required` parameter, which prevents users from leaving it blank.

In the configuration form, boolean parameters are displayed as toggles.

For example, a boolean-type parameter can be defined like this:

```yaml
score:
  title: Run Worker in Evaluation Mode
  type: bool
  default: False
```

Which will result in the following display for the user:

![Boolean-type parameter](user_configuration/bool_config.png "Example boolean-type parameter.")

#### Enum (choices) parameters

Enum-type parameters must be defined using a `title`, the `enum` `type` and at least two `choices`. You cannot define an enum-type parameter without `choices`. You can also set a `default` value for this parameter, which must be one of the available `choices`, as well as make it a `required` parameter, which prevents users from leaving it blank. Enum-type parameters should be used when you want to limit the users to a given set of options.

In the configuration form, enum parameters are displayed as selects.

For example, an enum-type parameter can be defined like this:

```yaml
parent_type:
  title: Target Parent Element Type
  type: enum
  default: paragraph
  choices:
    - paragraph
    - text_zone
    - page
```

Which will result in the following display for the user:

![Enum-type parameter](user_configuration/enum_config.png "Example enum-type parameter.")

#### List parameters

List-type parameters must be defined using a `title`, the `list` `type` and a `subtype` for the elements inside the list. You can also set a `default` value for this parameter, which must be a list containing elements of the given `subtype`, as well as make it a `required` parameter, which prevents users from leaving it blank.

The allowed `subtype`s are `int`, `float` and `string`.

In the configuration form, list parameters are displayed as rows of input fields.

For example, a list-type parameter can be defined like this:

```yaml
a_list:
  title: A List of Values
  type: list
  subtype: int
  default: [4, 3, 12]
```

Which will result in the following display for the user:

![List-type parameter](user_configuration/list_config.png "Example list-type parameter.")
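
A list can also hold strings. The sketch below uses a hypothetical `languages` parameter with the `string` subtype, and writes the default in YAML block style, which is equivalent to the bracketed form used above.

```yaml
languages:
  title: Languages to Detect
  type: list
  subtype: string
  required: true
  # Block-style list, equivalent to [fr, en]
  default:
    - fr
    - en
```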

#### Dictionary parameters

Dictionary-type parameters must be defined using a `title` and the `dict` `type`. You can also set a `default` value for this parameter, which must be a dictionary, as well as make it a `required` parameter, which prevents users from leaving it blank. You can use dictionary parameters for example to specify a correspondence between the classes that are predicted by a worker and the elements that are created on Arkindex from these predictions.

Dictionary-type parameters only accept strings as values.

In the configuration form, dictionary parameters are displayed as a table with one column for keys and one column for values.

For example, a dictionary-type parameter can be defined like this:

```yaml
classes:
  title: Output Classes to Elements Correspondence
  type: dict
  default:
    a: page
    b: text_line
```

Which will result in the following display for the user:

![Dictionary-type parameter](user_configuration/dict_config.png "Example dictionary-type parameter.")
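
Since dictionary values must be strings, values that look like numbers or booleans are best quoted: a YAML parser reads an unquoted `0.5` as a float. This page does not say how Arkindex handles non-string values, so quoting is the cautious choice. The parameter name and values below are hypothetical.

```yaml
class_thresholds:
  title: Minimum Confidence per Class
  type: dict
  default:
    # Quoted so that the values stay strings after parsing
    page: "0.5"
    text_line: "0.75"
```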

#### Model parameters

Model-type parameters must be defined using a `title` and the `model` `type`. You can also set a `default` value for this parameter, which must be the UUID of an existing Model, as well as make it a `required` parameter, which prevents users from leaving it blank. You can use a model parameter to specify which Model the Model Version created by a training process will be attached to.

Model-type parameters only accept Model UUIDs as values.

In the configuration form, model parameters are displayed as an input field. Users can select a model from a list of available Models: what they type into the input field filters that list, allowing them to search for a model using its name or UUID.

For example, a model-type parameter can be defined like this:

```yaml
model_param:
  title: Training Model
  type: model
```

Which will result in the following display for the user:

![Model-type parameter](user_configuration/model_config.png "Example model-type parameter.")
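
A model parameter with a default and the `required` flag might look like the sketch below. The UUID is a placeholder, not a real Model, and must be replaced with the UUID of a Model that exists on your Arkindex instance.

```yaml
model_param:
  title: Training Model
  type: model
  required: true
  # Placeholder UUID, to be replaced with the UUID of an existing Model
  default: "00000000-0000-0000-0000-000000000000"
```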

#### Example user_configuration

```yaml
user_configuration:
  vertical_padding:
    type: int
    default: 0
    title: Vertical Padding
  element_base_name:
    type: string
    required: true
    title: Element Base Name
  create_confidence_metadata:
    type: bool
    default: false
    title: Create confidence metadata on elements
  some_other_parameter:
    type: enum
    required: true
    default: 23
    choices:
      - 12
      - 23
      - 56
    title: Another Parameter
  a_model_parameter:
    type: model
    title: Model to train
```

#### Fallback to free JSON input

If you have defined user-configurable parameters using these specifications, Arkindex users can switch between the form and the free JSON input field using the **JSON** toggle. If the defined `user_configuration` contains unsupported parameter types, the frontend automatically falls back to the free JSON input field. The same applies if you have not defined any user-configurable parameters using these specifications.

### Example configuration

```yaml
---
version: 2

workers:
  # Path to a single YAML file
  - path/to/worker.yml
  # Pattern matching any YAML file in the configuration folder
  # or in its sub-directories
  - configuration/**/*.yml
  # Configuration embedded directly into this file
  - name: Book of hours
    slug: book_of_hours
    type: classifier
    docker:
      build: project/Dockerfile
      image: hub.docker.com/project/image:tag
      command: python mysuperscript.py --blabla
      shm_size: 128m
      environment:
        TOKEN: deadBeefToken
    configuration:
      model: path/to/model
      anyKey: anyValue
      classes: [X, Y, Z]
    user_configuration:
      vertical_padding:
        type: int
        default: 0
        title: Vertical Padding
    secrets:
      - path/to/secret.json
```