Commit 48b94af2 authored by Yoann Schneider, committed by Bastien Abadie

configuration documentation

parent 22e775ef
# Configuration
When the worker is running over elements, be it locally or on Arkindex, the first step before actually doing
anything is configuration. This process is implemented in the `configure` method that a worker
inherits from [ElementsWorker][elements_worker] and [BaseWorker][base_worker].
This method can also be overridden if the worker needs additional configuration steps.
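
As a minimal sketch of such an additional configuration step, the example below uses a stand-in base class rather than the real worker classes, so that it runs without any Arkindex dependency. `StandInWorker`, `ThresholdWorker` and the `threshold` parameter are hypothetical names chosen for illustration only.

```python
class StandInWorker:
    """Hypothetical stand-in for the library's worker base class."""

    def configure(self):
        # The real method gathers configuration from Arkindex or the CLI;
        # here we only fake an already parsed configuration dictionary.
        self.config = {"threshold": "0.75"}


class ThresholdWorker(StandInWorker):
    def configure(self):
        # Run the inherited configuration first so that self.config exists.
        super().configure()
        # Additional step: convert and validate a worker-specific parameter.
        self.threshold = float(self.config.get("threshold", 0.5))
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")


worker = ThresholdWorker()
worker.configure()
print(worker.threshold)
```

Calling the parent `configure` first keeps the inherited behaviour intact, and the extra step only adds to it.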
The developer mode was designed to help worker developers reproduce and test how their worker
would behave on Arkindex. This is why the configuration process in this mode mirrors the operations done on Arkindex while
replacing configuration API calls with CLI arguments.
The developer mode is enabled when at least one of the following is true:
- the `--dev` CLI argument is used,
- the `ARKINDEX_WORKER_RUN_ID` variable was not set in the environment.
Neither of these happens when running on Arkindex.
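
The decision can be sketched as a small function. This is an illustration, not the library's code: `is_developer_mode` is a hypothetical helper, and it assumes that "not set" means the variable is absent from the environment mapping.

```python
def is_developer_mode(dev_flag, environ):
    """Return True when the worker should run in developer mode.

    dev_flag: whether the --dev CLI argument was passed.
    environ: a mapping of environment variables (for example os.environ).
    """
    # Either condition on its own is enough to switch to developer mode.
    return bool(dev_flag) or "ARKINDEX_WORKER_RUN_ID" not in environ


# Local run without any worker run ID: developer mode.
print(is_developer_mode(False, {}))
# Run with a worker run ID and without --dev: Arkindex mode.
print(is_developer_mode(False, {"ARKINDEX_WORKER_RUN_ID": "some-id"}))
```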
## Arkindex mode
The details of a worker execution (what is called a **WorkerRun**) on Arkindex are stored in the backend. The first step of the configuration is to retrieve this information using the Arkindex API. The [RetrieveWorkerRun](https://demo.arkindex.org/api-docs/#tag/process/operation/RetrieveWorkerRun) endpoint gives information about:
- the running process,
- the configuration parameters that the user may have added from the frontend,
- the worker used,
- the version of this worker,
- the configuration stored in this version,
- the model version used in this worker execution,
- the configuration stored in this model version.
This step shows that there are a lot of sources for the actual configuration that the worker will use. The principle is that the user always has the last word. Any parameter the user chooses will override whatever was previously set in other configurations.
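
The precedence rule can be illustrated with plain dictionaries. This is only a sketch: the parameter names and values are made up, and the relative order of the worker version and model version configurations is an assumption, since the text only guarantees that the user's choices come last.

```python
from collections import ChainMap

# Hypothetical configurations coming from the different sources.
worker_version_config = {"model": "base", "batch_size": 8, "threshold": 0.5}
model_version_config = {"model": "fine-tuned"}
user_config = {"threshold": 0.9}

# ChainMap looks keys up from left to right, so the user's values win
# over the model version's, which win over the worker version's.
config = dict(ChainMap(user_config, model_version_config, worker_version_config))

print(config["threshold"])
print(config["model"])
print(config["batch_size"])
```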
The worker configuration may specify default values for some parameters (see [this section](../workers/yaml.md#setting-up-user-configurable-parameters) for more details about worker configuration). These default values are stored in the `user_configuration` dictionary attribute.
This is also when the secrets (see [this section](../secrets/usage.md#declaring-secrets-in-workers) to learn more about secrets) are actually downloaded. They are stored in the `secrets` dictionary attribute.
An Arkindex-mode exclusive step is done after all that: the cache setup. Some workers benefit a lot, performance-wise, from having a SQLite cache artifact from previous workers. This is mostly used in processes with multiple workers with dependencies, where the second worker needs the results of the first one to work. The database is initialized, the tables created and its version checked as it must match the one supported by the Arkindex instances. The database is then merged with any other database generated by previous worker runs.
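
The merge step can be pictured with the standard `sqlite3` module. The sketch below is not the library's implementation: the `elements` table, its columns and the helper names are hypothetical, and the version check is left out.

```python
import os
import sqlite3
import tempfile

# Hypothetical, simplified schema for the cache.
SCHEMA = "CREATE TABLE IF NOT EXISTS elements (id TEXT PRIMARY KEY, type TEXT)"


def create_cache(path, rows=()):
    """Initialize a cache database and insert some rows."""
    con = sqlite3.connect(path)
    con.execute(SCHEMA)
    con.executemany("INSERT OR IGNORE INTO elements VALUES (?, ?)", rows)
    con.commit()
    con.close()


def merge_cache(target, previous):
    """Copy rows from a previous worker's cache, skipping known IDs."""
    con = sqlite3.connect(target)
    con.execute("ATTACH DATABASE ? AS prev_cache", (previous,))
    con.execute(
        "INSERT OR IGNORE INTO elements SELECT id, type FROM prev_cache.elements"
    )
    con.commit()
    con.execute("DETACH DATABASE prev_cache")
    con.close()


def count_elements(path):
    con = sqlite3.connect(path)
    (total,) = con.execute("SELECT COUNT(*) FROM elements").fetchone()
    con.close()
    return total


with tempfile.TemporaryDirectory() as tmp:
    own = os.path.join(tmp, "db.sqlite")
    previous = os.path.join(tmp, "previous.sqlite")
    create_cache(own, [("a", "page")])
    create_cache(previous, [("a", "page"), ("b", "text_line")])
    merge_cache(own, previous)
    merged_total = count_elements(own)

print(merged_total)
```

`INSERT OR IGNORE` keeps the merge idempotent here: rows already present in the target are left untouched.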
## Developer mode
In developer mode, the worker execution is not linked to anything on Arkindex. Therefore, the only configuration the worker can use is the one provided via the `--config` CLI argument. It accepts a YAML-formatted file whose content should be similar to the `configuration` section of the [worker configuration file](../workers/yaml/#single-worker-configuration), without the `user_configuration` details. More details about how to create the local worker configuration are available in [this section](../workers/run-local/).
The multiple configuration sources of the Arkindex mode are merged into a single one here. The configuration parameters are parsed, as well as the list of required secrets. The secrets are loaded using a local Arkindex client. Again, see the [section about local execution](../workers/run-local/) for more details.
One piece of information cannot be retrieved directly from the configuration file and is required in some cases: the ID of the Arkindex corpus to which the processed elements belong. It is retrieved via the `ARKINDEX_CORPUS_ID` environment variable.
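
To make the developer-mode inputs concrete, here is a sketch that gathers them with `argparse` and an environment mapping. It is a stand-in parser, not the worker's actual one: `parse_dev_settings` is hypothetical, and the way `--element` accepts several values is an assumption.

```python
import argparse


def parse_dev_settings(argv, environ):
    """Collect the inputs a locally run worker relies on."""
    parser = argparse.ArgumentParser(description="Stand-in worker CLI")
    parser.add_argument("--dev", action="store_true", help="Enable developer mode")
    parser.add_argument("--config", help="Path to a YAML configuration file")
    # Assumption: several element IDs may follow a single --element flag.
    parser.add_argument("--element", nargs="+", default=[], help="Element IDs")
    args = parser.parse_args(argv)
    # The corpus ID is not part of the configuration file.
    corpus_id = environ.get("ARKINDEX_CORPUS_ID")
    return args, corpus_id


args, corpus_id = parse_dev_settings(
    ["--dev", "--config", "worker.yml", "--element", "elt-1", "elt-2"],
    {"ARKINDEX_CORPUS_ID": "corpus-1234"},
)
print(args.dev, args.config, args.element, corpus_id)
```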
## Worker reporter
At the end of a worker execution, a report about the publications done by the worker is generated in JSON format. This lists:
- the starting time,
- the number of elements created, grouped by type,
- the number of transcriptions created,
- the number of classifications created, grouped by class,
- the number of entities created,
- the number of entities created on transcriptions,
- the number of metadata created,
- the encountered errors' logs.
This is done by the many helpers described in the [reporting module](../../ref/reporting.md). They use the `report` attribute initialized at the configuration stage.
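
The kind of bookkeeping involved can be sketched with a small counter-based class. `SketchReport`, its methods and the JSON keys are all hypothetical and only cover a few of the items listed above; the real report format is defined by the reporting module.

```python
import json
from collections import Counter
from datetime import datetime, timezone


class SketchReport:
    """Hypothetical, reduced stand-in for a worker report."""

    def __init__(self):
        self.started = datetime.now(timezone.utc).isoformat()
        self.elements = Counter()  # created elements, grouped by type
        self.transcriptions = 0
        self.errors = []

    def add_element(self, element_type):
        self.elements[element_type] += 1

    def add_transcription(self, count=1):
        self.transcriptions += count

    def error(self, message):
        self.errors.append(message)

    def to_json(self):
        return json.dumps(
            {
                "started": self.started,
                "elements": dict(self.elements),
                "transcriptions": self.transcriptions,
                "errors": self.errors,
            },
            indent=2,
        )


report = SketchReport()
report.add_element("text_line")
report.add_element("text_line")
report.add_element("page")
report.add_transcription()
report.error("Could not process element elt-3")
print(report.to_json())
```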
## Setting Debug logging level
There are three ways to activate the debug mode:
- the `--verbose` CLI argument,
- setting the `ARKINDEX_DEBUG` environment variable to `True`,
- setting `"debug": True` in the worker's configuration via any configuration source.
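
The three switches can be combined as in the sketch below. `resolve_log_level` is a hypothetical helper; comparing the environment variable to the string `true` case-insensitively and falling back to `INFO` are both assumptions.

```python
import logging


def resolve_log_level(verbose, environ, config):
    """Return logging.DEBUG if any of the three switches is on."""
    # Assumption: the variable is compared as a case-insensitive string.
    env_debug = environ.get("ARKINDEX_DEBUG", "").strip().lower() == "true"
    if verbose or env_debug or bool(config.get("debug")):
        return logging.DEBUG
    # Assumption: INFO is used when debug mode is not requested.
    return logging.INFO


level = resolve_log_level(False, {"ARKINDEX_DEBUG": "True"}, {})
logging.basicConfig(level=level)
logging.getLogger("worker-sketch").debug("Debug logging is active")
```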
## Important class attributes
Many attributes are set on the worker during the configuration stage. Here is a *non-exhaustive* list with some details about their source and their usage.
`api_client`
: The Arkindex API client used by the worker to make requests. One should not rely on this attribute to make API calls but use the many helpers available instead. The exception is for endpoints where no helper is available.
`args`
: The arguments passed via the CLI. This is used to trigger the Developer mode via `--dev`, to specify the configuration file via `--config` and to list elements to process via `--element`.
`config`
: A dictionary with the worker's configuration. This is filled by the worker run's configuration, the worker version's and the model version's if there is any.
`corpus_id`
: The ID of the corpus linked to the current process. This is mostly needed when publishing objects linked to a corpus like `Entities`. You may set it in developer mode via the `ARKINDEX_CORPUS_ID` environment variable.
`is_read_only`
: This is the computed property that determines which mode should be used. The Developer mode prevents any actual publication on Arkindex, hence the name `read_only`.
`model_configuration`
: The parsed configuration as stored in the `ModelVersion` object on Arkindex.
`process_information`
: The details about the process parent to this worker execution. Only set in Arkindex mode.
`reporter`
: The `Reporter` instance that will generate the `ml_report.json` artifact, which sums up the publications done during this execution and the errors encountered.
`secrets`
: A dictionary mapping each secret name to its parsed content.
`use_cache`
: Whether the cache optimization is available or not.
`user_configuration`
: The parsed configuration as the user entered it via the Arkindex frontend. Any parameter not specified will be filled with its default value if there is one.
`worker_details`
: The details of the worker used in this execution.
`worker_run_id`
: The ID of the corresponding `WorkerRun` object on the Arkindex instance. In Arkindex mode, this is used in the `RetrieveWorkerRun` API call to retrieve the configuration and other necessary information. In developer mode, this is neither set nor used.
[elements_worker]: /../../../ref/elements_worker#elements-worker
[base_worker]: /../../ref/base_worker#base-worker
[user-config]: /../workers/yaml.md#setting-up-user-configurable-parameters
`worker_version_id`
: The ID of the `WorkerVersion` object linked to the current `WorkerRun`. Like the `worker_run_id` attribute, this is neither set nor used in developer mode.