Showing
with 494 additions and 105 deletions
docs/contents/workers/user_configuration/list_config.png

91.2 KiB

docs/contents/workers/user_configuration/string_config.png

30 KiB

# YAML configuration
This page is a reference for version 2 of the YAML configuration file for
Git repositories handled by Arkindex. Version 1 is not supported.
The configuration file is always named `.arkindex.yml` and should be found at
the root of the repository.
## Required attributes
The following attributes are required in every `.arkindex.yml` file:
`version`
: Version of the configuration file in use. An error will occur if the version
number is not set to `2`.
`type`
: Type of the repository. Has to be set to `worker` for a repository holding Arkindex
workers.
### Example configuration
```yaml
---
version: 2
type: worker
workers:
- workers/config.yml
```
This would match `workers/config.yml` starting at the root of
the repository.
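The two required attributes can be checked with a few lines of Python. This is only a minimal sketch: the function name `check_required_attributes` is hypothetical and not part of Arkindex, and it assumes the YAML file has already been parsed into a plain mapping.

```python
def check_required_attributes(config: dict) -> list:
    """Return a list of problems found in the top-level attributes."""
    errors = []
    # `version` must be 2; version 1 is not supported.
    if config.get("version") != 2:
        errors.append("version must be set to 2")
    # `type` must be `worker` for a repository holding Arkindex workers.
    if config.get("type") != "worker":
        errors.append("type must be set to 'worker'")
    return errors


# The example configuration, as the mapping a YAML parser would produce.
example = {"version": 2, "type": "worker", "workers": ["workers/config.yml"]}
print(check_required_attributes(example))
```

An empty list means both attributes are set as described in this section.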
## Worker repository attributes
When the `type` is set to `worker`, the `workers` attribute is mandatory.
The `workers` attribute is a list of the following:
- Paths to a YAML file holding the configuration for a single worker
- Unix-style patterns matching paths to YAML files holding the configuration
for a single worker
- The configuration of a single worker embedded directly into the file
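The three kinds of entries listed above could be told apart as in the sketch below. The helper `resolve_worker_entries` is hypothetical, and treating any string that contains `*`, `?` or `[` as a pattern is an assumption made for illustration, not necessarily how Arkindex decides.

```python
from pathlib import Path


def resolve_worker_entries(entries: list, root: Path) -> list:
    """Expand the `workers` list into embedded mappings and YAML file paths."""
    resolved = []
    for entry in entries:
        if isinstance(entry, dict):
            # Configuration embedded directly in .arkindex.yml
            resolved.append(entry)
        elif any(char in entry for char in "*?["):
            # Unix-style pattern (assumed detection rule): expand it
            # relative to the root of the repository.
            resolved.extend(sorted(root.glob(entry)))
        else:
            # Plain path to a single worker configuration file
            resolved.append(root / entry)
    return resolved
```

For instance, `resolve_worker_entries(["path/to/worker.yml"], Path("."))` returns a single path, while a pattern entry returns every matching file found under the given root.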
### Single worker configuration
The following describes the attributes of a YAML file configuring one worker, or
of the configuration embedded directly in the `.arkindex.yml` file.
All attributes are optional unless explicitly specified.
`name`
: Mandatory. Name of the worker, for display purposes.
`slug`
: Mandatory. Slug of this worker. The slug must be unique across the repository and must only hold alphanumerical characters, underscores or dashes.
`type`
: Mandatory. Type of the worker, for display purposes only. Some common values
include:
    - `classifier`
    - `recognizer`
    - `ner`
    - `dla`
    - `word-segmenter`
    - `paragraph-creator`
`gpu_usage`
: Whether or not this worker requires or supports GPUs. Defaults to `disabled`. May take one of the following values:
    `required`
    : This worker requires a GPU, and will only be run on Ponos agents whose hosts have a GPU.

    `supported`
    : This worker supports using a GPU, but may run on any available host, including those without GPUs.

    `disabled`
    : This worker does not support GPUs. It may run on a host that has a GPU, but it will ignore it.
`model_usage`
: Boolean. Whether or not this worker requires a model version to run. Defaults to `false`.
`docker`
: Groups the Docker-related configuration attributes:
<!--
TODO: Make the path relative to the YAML file itself, in the case of a
separate file for a single worker?
https://gitlab.com/teklia/arkindex/tasks/-/issues/95
-->
<!--
TODO: Implement this!
https://gitlab.com/teklia/arkindex/tasks/-/issues/93
`image`: Tag of an existing Docker image to use for this worker instead of building a
custom image from a Dockerfile.
-->
- `build`
: Path towards a Dockerfile used to build this worker, relative to the root of
the repository. Defaults to `Dockerfile`.
- `command`
: Custom command line to be used when launching the Docker container for
this Worker. By default, the command specified in the Dockerfile will be used.
- `shm_size`: Size of the available shared memory in `/dev/shm`. The default value is `64M`, but when training machine learning models an increase might be necessary. The given value must be either an integer, or an integer followed by a unit (`b` for bytes, `k` for kilobytes, `m` for megabytes and `g` for gigabytes). If no unit is specified, the default unit is `bytes`. See the [Docker documentation](https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources).
- `environment`
: Mapping of string keys and string values to define environment variables to be
set when the Docker image runs.
`configuration`
: Mapping holding any string keys and values that can be later accessed in the
worker's Python code. Can be used to define settings on your own worker, such as
a file's location.
`user_configuration`
: Mapping defining settings on your worker that can be modified by users. [See below](#setting-up-user-configurable-parameters) for details.
`secrets`
: List of required secret names for that specific worker. For more information, learn how to use secrets in workers on the official Arkindex [documentation](https://doc.arkindex.org/secrets).
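Several of the rules above (mandatory attributes, slug characters, `gpu_usage` values, `shm_size` format) can be checked mechanically. The following is a minimal sketch, not the validation Arkindex actually runs: `check_worker` is a hypothetical name, "alphanumerical" is assumed to mean ASCII letters and digits, and units are assumed to be case-insensitive because the text quotes both `64M` and `m`.

```python
import re

# Assumption: "alphanumerical" means ASCII letters and digits.
SLUG_PATTERN = re.compile(r"[A-Za-z0-9_-]+")
GPU_USAGE_VALUES = {"required", "supported", "disabled"}
# Assumption: units are matched case-insensitively.
SHM_SIZE_PATTERN = re.compile(r"[0-9]+[bkmg]?", re.IGNORECASE)


def check_worker(worker: dict) -> list:
    """Return a list of problems found in a single worker configuration."""
    errors = []
    for key in ("name", "slug", "type"):
        if not worker.get(key):
            errors.append(f"{key} is mandatory")
    slug = worker.get("slug")
    if slug and not SLUG_PATTERN.fullmatch(str(slug)):
        errors.append("slug may only hold alphanumerical characters, underscores or dashes")
    # gpu_usage defaults to `disabled` when it is not set.
    if worker.get("gpu_usage", "disabled") not in GPU_USAGE_VALUES:
        errors.append("gpu_usage must be required, supported or disabled")
    shm_size = (worker.get("docker") or {}).get("shm_size")
    if shm_size is not None and not SHM_SIZE_PATTERN.fullmatch(str(shm_size)):
        errors.append("shm_size must be an integer, optionally followed by b, k, m or g")
    return errors
```

Slug uniqueness across the repository is not covered here, since it requires looking at every worker at once.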
### Setting up user-configurable parameters
The YAML file can define parameters that users will be able to change when they use this worker in a process on Arkindex. These parameters are listed in a `user_configuration` attribute.
A parameter is defined using the following settings:
`title`
: Mandatory. The parameter's title.
`type`
: Mandatory. A value type. The supported types are:
- `int`
- `bool`
- `float`
- `string`
- `enum`
- `list`
- `dict`
`default`
: Optional. A default value for the parameter. Must be of the defined parameter `type`.
`required`
: Optional. A boolean, defaults to `false`.
`choices`
: Optional. A list of options for `enum` type parameters.
`subtype`
: Optional. The type of the elements of `list` type parameters.
This definition allows for both validation of the input and the display of a form to make configuring workers easy for Arkindex users.
![User configuration](user_configuration/configuration_form.png "User configuration form on Arkindex")
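The settings above translate into a small validation routine. The sketch below is illustrative only: `check_parameter` is a hypothetical helper, and rejecting booleans as `int` or `float` defaults is an assumption (Python treats `bool` as a subclass of `int`). Checks specific to `enum`, `list` and `dict` parameters are left out.

```python
# Python types expected for a `default` value, per parameter type.
# `enum` accepts any value here; its default is checked against `choices`.
SUPPORTED_TYPES = {
    "int": int,
    "bool": bool,
    "float": float,
    "string": str,
    "enum": object,
    "list": list,
    "dict": dict,
}


def check_parameter(name: str, definition: dict) -> list:
    """Return a list of problems found in one user_configuration entry."""
    errors = []
    if not definition.get("title"):
        errors.append(f"{name}: title is mandatory")
    type_name = definition.get("type")
    if type_name not in SUPPORTED_TYPES:
        errors.append(f"{name}: unsupported type {type_name!r}")
        return errors
    if "default" in definition:
        default = definition["default"]
        if type_name in ("int", "float") and isinstance(default, bool):
            # Assumption: a boolean is not a valid numeric default.
            valid = False
        else:
            valid = isinstance(default, SUPPORTED_TYPES[type_name])
        if not valid:
            errors.append(f"{name}: default must be of type {type_name}")
    return errors
```

Calling `check_parameter("input_size", {"title": "Input Size", "type": "int", "default": 768})` returns an empty list.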
#### String parameters
String-type parameters must be defined using a `title` and the `string` `type`. You can also set a `default` value for this parameter, which must be a string, as well as make it a `required` parameter, which prevents users from leaving it blank.
For example, a string-type parameter can be defined like this:
```yaml
subfolder_name:
  title: Created Subfolder Name
  type: string
  default: My Neat Subfolder
```
Which will result in the following display for the user:
![String-type parameter](user_configuration/string_config.png "Example string-type parameter.")
#### Integer parameters
Integer-type parameters must be defined using a `title` and the `int` `type`. You can also set a `default` value for this parameter, which must be an integer, as well as make it a `required` parameter, which prevents users from leaving it blank.
For example, an integer-type parameter can be defined like this:
```yaml
input_size:
  title: Input Size
  type: int
  default: 768
  required: True
```
Which will result in the following display for the user:
![integer-type parameter](user_configuration/integer_config.png "Example integer-type parameter.")
#### Float parameters
Float-type parameters must be defined using a `title` and the `float` `type`. You can also set a `default` value for this parameter, which must be a float, as well as make it a `required` parameter, which prevents users from leaving it blank.
For example, a float-type parameter can be defined like this:
```yaml
wip:
  title: Word Insertion Penalty
  type: float
  required: True
```
Which will result in the following display for the user:
![Float-type parameter](user_configuration/float_config.png "Example float-type parameter.")
#### Boolean parameters
Boolean-type parameters must be defined using a `title` and the `bool` `type`. You can also set a `default` value for this parameter, which must be a boolean, as well as make it a `required` parameter, which prevents users from leaving it blank.
In the configuration form, boolean parameters are displayed as toggles.
For example, a boolean-type parameter can be defined like this:
```yaml
score:
  title: Run Worker in Evaluation Mode
  type: bool
  default: False
```
Which will result in the following display for the user:
![Boolean-type parameter](user_configuration/bool_config.png "Example boolean-type parameter.")
#### Enum (choices) parameters
Enum-type parameters must be defined using a `title`, the `enum` `type` and at least two `choices`. You cannot define an enum-type parameter without `choices`. You can also set a `default` value for this parameter, which must be one of the available `choices`, as well as make it a `required` parameter, which prevents users from leaving it blank. Enum-type parameters should be used when you want to limit the users to a given set of options.
In the configuration form, enum parameters are displayed as selects.
For example, an enum-type parameter can be defined like this:
```yaml
parent_type:
  title: Target Parent Element Type
  type: enum
  default: paragraph
  choices:
    - paragraph
    - text_zone
    - page
```
Which will result in the following display for the user:
![Enum-type parameter](user_configuration/enum_config.png "Example enum-type parameter.")
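The enum rules (at least two `choices`, and a `default` taken from them) can be expressed as follows. This is a sketch under the assumption that the definition is already parsed into a mapping; `check_enum` is a hypothetical name.

```python
def check_enum(definition: dict) -> list:
    """Return a list of problems found in an enum-type parameter."""
    errors = []
    choices = definition.get("choices") or []
    if len(choices) < 2:
        errors.append("an enum parameter needs at least two choices")
    if "default" in definition and definition["default"] not in choices:
        errors.append("default must be one of the available choices")
    return errors


parent_type = {
    "title": "Target Parent Element Type",
    "type": "enum",
    "default": "paragraph",
    "choices": ["paragraph", "text_zone", "page"],
}
print(check_enum(parent_type))
```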
#### List parameters
List-type parameters must be defined using a `title`, the `list` `type` and a `subtype` for the elements inside the list. You can also set a `default` value for this parameter, which must be a list containing elements of the given `subtype`, as well as make it a `required` parameter, which prevents users from leaving it blank.
The allowed `subtype`s are `int`, `float` and `string`.
In the configuration form, list parameters are displayed as rows of input fields.
For example, a list-type parameter can be defined like this:
```yaml
a_list:
  title: A List of Values
  type: list
  subtype: int
  default: [4, 3, 12]
```
Which will result in the following display for the user:
![List-type parameter](user_configuration/list_config.png "Example list-type parameter.")
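A list parameter's `subtype` and `default` can be checked together, as in this hedged sketch. `check_list` is a hypothetical helper, and the exact-type comparison (which, for example, rejects `True` in a list of `int`) is an assumption about how strict the validation should be.

```python
# Allowed subtypes and the Python type each element must have.
LIST_SUBTYPES = {"int": int, "float": float, "string": str}


def check_list(definition: dict) -> list:
    """Return a list of problems found in a list-type parameter."""
    errors = []
    subtype = definition.get("subtype")
    if subtype not in LIST_SUBTYPES:
        errors.append("subtype must be int, float or string")
        return errors
    default = definition.get("default", [])
    if not isinstance(default, list):
        errors.append("default must be a list")
    elif any(type(item) is not LIST_SUBTYPES[subtype] for item in default):
        # Assumption: elements must match the subtype exactly.
        errors.append(f"every element of default must be of type {subtype}")
    return errors
```

With the `a_list` definition shown earlier, parsed into a mapping, `check_list` returns an empty list.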
#### Dictionary parameters
Dictionary-type parameters must be defined using a `title` and the `dict` `type`. You can also set a `default` value for this parameter, which must be a dictionary, as well as make it a `required` parameter, which prevents users from leaving it blank. For example, you can use dictionary parameters to specify a correspondence between the classes predicted by a worker and the elements created on Arkindex from these predictions.
Dictionary-type parameters only accept strings as values.
In the configuration form, dictionary parameters are displayed as a table with one column for keys and one column for values.
For example, a dictionary-type parameter can be defined like this:
```yaml
classes:
  title: Output Classes to Elements Correspondence
  type: dict
  default:
    a: page
    b: text_line
```
Which will result in the following display for the user:
![Dictionary-type parameter](user_configuration/dict_config.png "Example dictionary-type parameter.")
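Since dictionary parameters only accept strings as values, a default can be screened as in the sketch below. `check_dict_default` is a hypothetical helper written for illustration.

```python
def check_dict_default(default) -> list:
    """Return a list of problems found in the default of a dict-type parameter."""
    if not isinstance(default, dict):
        return ["default must be a dictionary"]
    # Only string values are accepted.
    return [
        f"value for key {key!r} must be a string"
        for key, value in default.items()
        if not isinstance(value, str)
    ]


print(check_dict_default({"a": "page", "b": "text_line"}))
```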
#### Example user_configuration
```yaml
user_configuration:
  vertical_padding:
    type: int
    default: 0
    title: Vertical Padding
  element_base_name:
    type: string
    required: true
    title: Element Base Name
  create_confidence_metadata:
    type: bool
    default: false
    title: Create confidence metadata on elements
  some_other_parameter:
    type: enum
    required: true
    default: 23
    choices:
      - 12
      - 23
      - 56
    title: Another Parameter
```
#### Fallback to free JSON input
If you have defined user-configurable parameters using these specifications, Arkindex users can choose between using the form or the free JSON input field by toggling the **JSON** toggle. If there are unsupported parameter types in the defined `user_configuration`, the frontend will automatically fall back to the free JSON input field. The same is true if you have not defined user-configurable parameters using these specifications.
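The fallback rule can be summarised in a few lines. This Python sketch only mirrors the behaviour described above; the real decision is made by the Arkindex frontend, and the function name `input_mode` and its return values are hypothetical.

```python
# Parameter types the configuration form is described as supporting.
FORM_TYPES = {"int", "bool", "float", "string", "enum", "list", "dict"}


def input_mode(user_configuration) -> str:
    """Return which input is shown first: the form or the free JSON field."""
    if not user_configuration:
        # No user-configurable parameters were defined.
        return "json"
    if any(param.get("type") not in FORM_TYPES for param in user_configuration.values()):
        # At least one parameter uses an unsupported type.
        return "json"
    # Users may still switch to JSON themselves with the JSON toggle.
    return "form"
```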
### Example configuration
```yaml
---
version: 2
type: worker
workers:
  # Path to a single YAML file
  - path/to/worker.yml
  # Pattern matching any YAML file in the configuration folder
  # or in its sub-directories
  - configuration/**/*.yml
  # Configuration embedded directly into this file
  - name: Book of hours
    slug: book_of_hours
    type: classifier
    docker:
      build: project/Dockerfile
      image: hub.docker.com/project/image:tag
      command: python mysuperscript.py --blabla
      shm_size: 128m
      environment:
        TOKEN: deadBeefToken
    configuration:
      model: path/to/model
      anyKey: anyValue
      classes: [X, Y, Z]
    user_configuration:
      vertical_padding:
        type: int
        default: 0
        title: Vertical Padding
    secrets:
      - path/to/secret.json
```
......@@ -4,54 +4,16 @@
Add the `docs` extra when installing `arkindex-base-worker`:
``` sh
```sh
# In a clone of the Git repository
pip install .[docs]
```
Build the documentation using `make html`.
Build the documentation using `mkdocs serve -v`. The documentation should then be available at http://localhost:8000
## Writing documentation
This documentation uses [Sphinx](http://www.sphinx-doc.org/) and has been
configured to allow using both [CommonMark][1] and reStructuredText. There are
some special considerations when using Markdown, because its syntax is too
simple to normally allow using Sphinx features.
You can use [reStructuredText directives][2] in Markdown docs by using fenced
code blocks:
~~~ markdown
``` autoclass:: arkindex_worker.worker.base.BaseWorker
:members:
```
~~~
For evaluating whole chunks of reST syntax and inserting them in your file,
you can use the `eval_rst` info string in your code blocks:
~~~ markdown
``` eval_rst
Here's a :cls:`link to a class <arkindex_worker.worker.base.BaseWorker>`
.. important:: This an important notice!
```
~~~
Another important consideration is the `Contents` section: a section with this
name should hold a list of links to other files, to be used as the main table
of contents:
``` markdown
## Contents
* [A file](file1)
* [Another file](file2)
```
This list will become the table of contents in the sidebar.
See `recommonmark`'s [AutoStructify documentation][3] for more information.
This documentation uses [Sphinx](http://www.sphinx-doc.org/) and was generated using [mkdocs](https://mkdocs.org/) and [mkdocstrings](https://mkdocstrings.github.io/).
## Linting
......@@ -59,9 +21,6 @@ This documentation is subject to linting using `doc8`, integrated into
`pre-commit`. You can run it locally by typing `pre-commit run`. You should use
`pre-commit install` to have the pre-commit hook run before each Git commit.
The linting rules that `doc8` applies can be found on [its documentation][4].
The linting rules that `doc8` applies can be found on [its documentation][1].
[1]: https://commonmark.org/help/
[2]: https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
[3]: https://recommonmark.readthedocs.io/en/latest/auto_structify.html
[4]: https://doc8.readthedocs.io/en/latest/readme.html#usage
[1]: https://doc8.readthedocs.io/en/latest/readme.html#usage
# Welcome to the Arkindex Workers documentation!
[Arkindex](https://doc.arkindex.org/), [Teklia](https://teklia.com)'s document processing platform, uses **workers** to run any Machine Learning tools on millions of documents.
Workers are **Docker** images that run on powerful servers and communicate with Arkindex instances through their REST API.
To build these workers efficiently, we maintain a library named **Base Worker**, which is documented here. It greatly simplifies the development of new workers and provides many helpers for using the Arkindex API.
Welcome to the Arkindex Base Worker documentation!
==================================================
Python API
----------
.. autosummary::
:toctree: generated
:caption: Python API
:recursive:
arkindex_worker.cache
arkindex_worker.git
arkindex_worker.image
arkindex_worker.models
arkindex_worker.reporting
arkindex_worker.utils
arkindex_worker.worker
.. toctree::
:maxdepth: 2
:caption: Development
dev/build_docs
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
# Classification
::: arkindex_worker.worker.classification
    options:
      members: no
::: arkindex_worker.worker.classification.ClassificationMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
# Element
::: arkindex_worker.worker.element
    options:
      members:
        - ElementType
        - MissingTypeError
    options:
      show_category_heading: no
::: arkindex_worker.worker.element.ElementMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
\ No newline at end of file
# Entity
::: arkindex_worker.worker.entity
    options:
      members:
        - EntityType
    options:
      show_category_heading: no
::: arkindex_worker.worker.entity.EntityMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
\ No newline at end of file
# Arkindex API integration
Workers are bound to make requests to an Arkindex API instance while processing elements.
Helper methods were designed to help developers keep up with the latest developments on Arkindex. Many endpoints are implemented to facilitate:
- [classification](classification.md) operations
- [element](element.md) operations
- [entity](entity.md) operations
- [metadata](metadata.md) operations
- [Machine Learning models training](training.md)
- [transcription](transcription.md) operations
- [worker_version](worker_version.md) operations
# Metadata
::: arkindex_worker.worker.metadata
    options:
      members:
        - MetaType
    options:
      show_category_heading: no
::: arkindex_worker.worker.metadata.MetaDataMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
# Training
::: arkindex_worker.worker.training
    options:
      members:
        - DirPath
        - Hash
        - FileSize
        - create_archive
    options:
      show_category_heading: no
::: arkindex_worker.worker.training.TrainingMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
# Transcription
::: arkindex_worker.worker.transcription
    options:
      members:
        - TextOrientation
    options:
      show_category_heading: no
::: arkindex_worker.worker.transcription.TranscriptionMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
# WorkerVersion
::: arkindex_worker.worker.version
    options:
      members: no
    options:
      show_category_heading: no
::: arkindex_worker.worker.version.WorkerVersionMixin
    options:
      show_bases: false
      show_category_heading: no
      show_root_heading: no
\ No newline at end of file
# Base Worker
::: arkindex_worker.worker.base
# Cache
::: arkindex_worker.cache
# Elements Worker
::: arkindex_worker.worker
# Git & Gitlab support
::: arkindex_worker.git
# Image utilities
::: arkindex_worker.image