Commit 3af756ad authored by Yoann Schneider, committed by Bastien Abadie

Configure method documentation

parent da41732f
1 merge request: !218 Configure method documentation
Pipeline #80110 passed
......@@ -30,7 +30,7 @@ repos:
- id: trailing-whitespace
- id: check-yaml
args: [--allow-multiple-documents]
exclude: "^worker-{{cookiecutter.slug}}/.arkindex.yml$"
exclude: "^worker-{{cookiecutter.slug}}/.arkindex.yml$|^mkdocs.yml$"
- id: mixed-line-ending
- id: name-tests-test
args: ['--django']
......
# Configuration
When the worker is running over elements, be it locally or on Arkindex, the first step before actually doing
anything is configuration. This process is implemented in the `configure` method.
This method can also be overridden if the worker needs additional configuration steps.
The developer mode was designed to help worker developers reproduce and test how their worker
would behave on Arkindex. This is why the configuration process in this mode mirrors the operations done on Arkindex while
replacing configuration API calls with CLI arguments.
The developer mode (or `read-only` mode) is enabled when at least one of the following is true:
- the `--dev` CLI argument is used,
- the `ARKINDEX_WORKER_RUN_ID` variable was not set in the environment.
Neither of these happens when running on Arkindex.
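
The rule above can be sketched as a small predicate. This is only an illustration and not the library's implementation: the function name mirrors the `is_read_only` attribute described later on this page, but its signature and the `dev` parameter are assumptions.

```python
import os


def is_read_only(dev: bool, environ=os.environ) -> bool:
    """Return True when the worker should run in developer (read-only) mode."""
    # Developer mode is enabled when --dev was passed on the CLI,
    # or when no worker run ID is available in the environment.
    return dev or "ARKINDEX_WORKER_RUN_ID" not in environ


# Locally, without the variable set, the worker falls back to developer mode.
print(is_read_only(dev=False, environ={}))
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.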
## Parallel between both modes
```mermaid
flowchart TB
subgraph configure[Configuration step]
argument_parsing[CLI argument parsing]
end
argument_parsing --> is_read_only{IsReadOnly?}
is_read_only -- Yes --> devMode
is_read_only -- No --> arkindexMode
subgraph arkindexMode[Arkindex mode]
direction TB
subgraph workerConfiguration[Worker configuration]
direction TB
retrieveWorkerRun["API call to RetrieveWorkerRun"] --> userconfig_defaults[Initialize user configuration with default values]
userconfig_defaults --> load_secrets_API["Load Secrets using API calls to RetrieveSecret"]
load_secrets_API --> load_user_config[Override user configuration by values set by user]
load_user_config --> load_model_config["Load model configuration"]
end
workerConfiguration --> cacheConfiguration
subgraph cacheConfiguration[Base worker cache setup]
direction TB
get_paths_from_parent_tasks["Retrieve paths of parent tasks' cache databases"] --> initialize_db[Create cache database and its tables]
initialize_db --> merge_parent_databases[Merge parents databases]
end
end
subgraph devMode[Developer mode]
direction TB
subgraph devWorkerConfiguration[Worker configuration]
direction TB
configuration_parsing[CLI config argument parsing] --> corpus_id[Read Corpus ID from environment]
corpus_id --> load_secrets[Load secret in local developer storage]
end
end
classDef pyMeth font-style:italic
```
## Arkindex mode
The details of a worker execution (what is called a **WorkerRun**) on Arkindex are stored in the backend. The first step of the configuration is to retrieve this information using the Arkindex API. The [RetrieveWorkerRun](https://demo.arkindex.org/api-docs/#tag/process/operation/RetrieveWorkerRun) endpoint gives information about:
- the running process,
- the configuration parameters that the user may have added from the frontend,
- the worker used,
- the version of this worker,
- the configuration stored in this version,
- the model version used in this worker execution,
- the configuration stored in this model version.
This step shows that the configuration a worker can use comes from many sources. Nothing is overridden by default: the worker has to do it in its own overridden version of the `configure` method. In the end, any parameter set by the user **must** be applied over other known configurations.
!!! warning
    The convention is to always give the final word to the user. This means that when the user configuration is filled, its values must be the last to override the worker's `config` attribute. If a model configuration was set, its values must override this attribute before the user configuration's.
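
The precedence described in this warning can be sketched with plain dictionaries. The helper name `merge_configuration` and the parameter names in the example are hypothetical; a real worker would apply the same order inside its own `configure` override.

```python
def merge_configuration(worker_config, model_config=None, user_config=None):
    """Merge configuration sources, later sources overriding earlier ones."""
    config = dict(worker_config)
    # The model configuration overrides the worker version's values...
    config.update(model_config or {})
    # ...and the user configuration always has the final word.
    config.update(user_config or {})
    return config


config = merge_configuration(
    {"threshold": 0.5, "batch_size": 8},
    model_config={"threshold": 0.7},
    user_config={"batch_size": 2},
)
print(config)  # {'threshold': 0.7, 'batch_size': 2}
```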
The worker configuration may specify default values for some parameters (see [this section](../workers/yaml.md#setting-up-user-configurable-parameters) for more details about worker configuration). These default values are stored in the `user_configuration` dictionary attribute.
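
As an illustration of how default values could end up in such a dictionary, here is a minimal sketch. It assumes that each declared parameter is a mapping that may carry a `default` key; the helper name and the sample parameters are hypothetical and not taken from the library.

```python
def build_user_configuration(declared_parameters, user_values=None):
    """Start from the declared defaults, then apply the values set by the user."""
    # Only parameters that declare a default get an initial value.
    configuration = {
        name: spec["default"]
        for name, spec in declared_parameters.items()
        if "default" in spec
    }
    # Values entered by the user override the defaults.
    configuration.update(user_values or {})
    return configuration


declared = {
    "padding": {"type": "int", "default": 0},
    "language": {"type": "string"},  # no default: absent unless the user sets it
}
print(build_user_configuration(declared, {"language": "fr"}))
```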
This is also when the secrets (see [this section](../secrets/usage.md#declaring-secrets-in-workers) to learn more about secrets) are actually downloaded. They are stored in the `secrets` dictionary attribute.
An Arkindex-mode exclusive step is done after all that: the cache setup. Some workers benefit a lot, performance-wise, from having a SQLite cache artifact from previous workers. This is mostly used in processes with multiple workers with dependencies, where the second worker needs the results of the first one to work. The database is initialized, the tables created and its version checked as it must match the one supported by the Arkindex instances. The database is then merged with any other database generated by previous worker runs.
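
The create-then-merge sequence can be sketched with the standard `sqlite3` module. The `elements` table and both function names are hypothetical, and the version check mentioned above is left out; the real cache defines its own schema and merge logic.

```python
import sqlite3
from contextlib import closing

# Hypothetical schema, used only for this sketch.
SCHEMA = "CREATE TABLE IF NOT EXISTS elements (id TEXT PRIMARY KEY, type TEXT NOT NULL)"


def create_cache(path):
    """Create the cache database and its tables."""
    with closing(sqlite3.connect(path)) as db:
        db.execute(SCHEMA)
        db.commit()


def merge_parent_cache(cache_path, parent_path):
    """Copy rows from a parent task's database into the current cache."""
    with closing(sqlite3.connect(cache_path)) as db:
        db.execute("ATTACH DATABASE ? AS parent", (str(parent_path),))
        # Rows already present in the current cache are kept as they are.
        db.execute("INSERT OR IGNORE INTO elements SELECT id, type FROM parent.elements")
        db.commit()
        db.execute("DETACH DATABASE parent")
```

Calling `merge_parent_cache` once per parent database path reproduces the "merge parents databases" step of the diagram.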
## Developer mode
In developer mode, the worker execution is not linked to anything on Arkindex. Therefore, the only configuration the worker can use is provided via the `--config` CLI argument. It accepts a YAML-formatted file, which should be similar to the `configuration` section of the [worker configuration file](../workers/yaml/#single-worker-configuration), without the `user_configuration` details. More details about how to create the local worker configuration are available in [this section](../workers/run-local/).
The multiple configuration sources from the Arkindex mode are merged into a single one here. The configuration parameters are parsed as well as the list of required secrets. The secrets are loaded using a local Arkindex client. Again, see the [section about local execution](../workers/run-local/) for more details.
One piece of information cannot be retrieved directly from the configuration file and is required in some cases: the ID of the Arkindex corpus to which the processed elements belong. It is read from the `ARKINDEX_CORPUS_ID` environment variable.
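
The inputs used in developer mode can be gathered as in the sketch below. The `--dev` and `--config` arguments and the `ARKINDEX_CORPUS_ID` variable come from this page; the parser itself and the `parse_dev_arguments` helper are illustrative assumptions, not the library's own parser.

```python
import argparse
import os
from pathlib import Path


def parse_dev_arguments(argv, environ=os.environ):
    """Parse the CLI arguments and read the corpus ID from the environment."""
    parser = argparse.ArgumentParser(description="Illustrative worker CLI")
    parser.add_argument("--dev", action="store_true", help="Force developer mode")
    parser.add_argument("--config", type=Path, help="Path to a YAML configuration file")
    args = parser.parse_args(argv)
    # The corpus ID is not part of the configuration file; it may be missing.
    corpus_id = environ.get("ARKINDEX_CORPUS_ID")
    return args, corpus_id


args, corpus_id = parse_dev_arguments(
    ["--dev", "--config", "worker.yml"],
    environ={"ARKINDEX_CORPUS_ID": "my-corpus-id"},
)
print(args.dev, args.config, corpus_id)
```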
## Worker reporter
At the end of a worker execution, a report about the publication done by the worker is generated in JSON format. It lists:
- the starting time,
- the number of elements created, grouped by type,
- the number of transcriptions created,
- the number of classifications created, grouped by class,
- the number of entities created,
- the number of entities created on transcriptions,
- the number of metadata created,
- the encountered errors' logs.
This is done by the many helpers described in the [reporting module](../../ref/reporting.md). They use the `report` attribute initialized at the configuration stage.
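
To give an idea of what such a report could contain, here is a minimal sketch covering a subset of the counters listed above. The `SketchReport` class and its field names are hypothetical; they do not describe the actual structure of the generated file.

```python
import json
from collections import Counter
from datetime import datetime, timezone


class SketchReport:
    """Collect a few publication counters and serialize them as JSON."""

    def __init__(self):
        self.started = datetime.now(timezone.utc).isoformat()
        self.elements = Counter()  # number of elements created, grouped by type
        self.transcriptions = 0
        self.errors = []

    def add_element(self, element_type):
        self.elements[element_type] += 1

    def add_transcription(self, count=1):
        self.transcriptions += count

    def add_error(self, message):
        self.errors.append(message)

    def to_json(self):
        payload = {
            "started": self.started,
            "elements": dict(self.elements),
            "transcriptions": self.transcriptions,
            "errors": self.errors,
        }
        return json.dumps(payload, indent=2)


report = SketchReport()
report.add_element("page")
report.add_element("text_line")
report.add_element("text_line")
report.add_transcription(2)
report.add_error("Element 1234 could not be processed")
print(report.to_json())
```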
## Setting Debug logging level
There are three ways to activate the debug mode:
- the `--verbose` CLI argument,
- setting the `ARKINDEX_DEBUG` environment variable to `True`,
- setting `"debug": True` in the worker's configuration via any configuration source.
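
The three switches can be combined as in the sketch below. The helper names are hypothetical, and comparing the environment variable to the exact string `"True"` is an assumption about how the value is interpreted.

```python
import logging


def debug_enabled(verbose, environ, config):
    """Return True if any of the three debug switches is active."""
    return bool(
        verbose
        or environ.get("ARKINDEX_DEBUG") == "True"
        or config.get("debug")
    )


def configure_logging(logger, verbose, environ, config):
    """Lower the logger's level to DEBUG when debug mode is requested."""
    if debug_enabled(verbose, environ, config):
        logger.setLevel(logging.DEBUG)
    return logger


logger = configure_logging(
    logging.getLogger("sketch_worker"),
    verbose=False,
    environ={"ARKINDEX_DEBUG": "True"},
    config={},
)
print(logger.level == logging.DEBUG)
```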
## Important class attributes
Many attributes are set on the worker during the configuration stage. Here is a *non-exhaustive* list with some details about their source and their usage.
`api_client`
: The Arkindex API client used by the worker to make requests. One should not rely on this attribute to make API calls but use the many helpers available. The exception is for endpoints where no helper is available.
`args`
: The arguments passed via the CLI. This is used to trigger the Developer mode via `--dev`, to specify the configuration file via `--config` and to list elements to process via `--element`.
`config`
: A dictionary with the worker's configuration. This is filled by the worker run's configuration, the worker version's and the model version's if there is any.
`corpus_id`
: The ID of the corpus linked to the current process. This is mostly needed when publishing objects linked to a corpus like `Entities`. You may set it in developer mode via the `ARKINDEX_CORPUS_ID` environment variable.
`is_read_only`
: This is the computed property that determines which mode should be used. The Developer mode prevents any actual publication on Arkindex, hence the name `read_only`.
`model_configuration`
: The parsed configuration as stored in the `ModelVersion` object on Arkindex.
`process_information`
: The details about the process parent to this worker execution. Only set in Arkindex mode.
`reporter`
: The `Reporter` instance that will generate the `ml_report.json` artifact, which sums up the publication done during this execution and the errors encountered.
`secrets`
: A dictionary mapping each secret name to its parsed content.
`use_cache`
: Whether the cache optimization is available or not.
`user_configuration`
: The parsed configuration as the user entered it via the Arkindex frontend. Any parameter not specified will be filled with its default value if there is one.
`worker_details`
: The details of the worker used in this execution.
`worker_run_id`
: The ID of the corresponding `WorkerRun` object on the Arkindex instance. In Arkindex mode, this is used in the `RetrieveWorkerRun` API call to retrieve the configuration and other necessary information. In developer mode, this is neither set nor used.
`worker_version_id`
: The ID of the `WorkerVersion` object linked to the current `WorkerRun`. Like the `worker_run_id` attribute, this is neither set nor used in developer mode.
# Worker Implementation
This section presents:
- the different stages happening during a worker execution:
    - the initialization
    - the [configuration](./configure.md)
    - the execution
- the conception of a worker:
    - the architecture
    - additional configuration steps
    - element processing
The following graph describes what happens when running the worker, either on Arkindex or locally. Words in italic font are actual method calls in the worker.
```mermaid
flowchart LR
subgraph all[Worker execution]
direction LR
subgraph id1[Worker initialization]
init
end
run -.-> configure
subgraph id2[Inference]
direction TB
configure --> list_elements
list_elements --> element_processing
subgraph id3[Loop over each element]
element_processing --> element_processing
end
element_processing -- Save ML report to disk --> reporting
end
init --> run
end
classDef pyMeth font-style:italic
class init,run,configure,list_elements pyMeth
```
The following graph gives more details about the `element_processing` step.
```mermaid
flowchart LR
subgraph all[Element processing]
direction LR
subgraph id1[Element details retrieval]
retrieve_element
end
retrieve_element --> update_activity_started
subgraph id2[Processing]
direction LR
update_activity_started[update_activity] --> process_element -- No errors --> update_activity_processed
update_activity_started -- to Started --> update_activity_started
update_activity_processed[update_activity] -- to Processed --> update_activity_processed
update_activity_error[update_activity] -- to Error --> update_activity_error
end
process_element -- Errors found --> update_activity_error
end
classDef pyMeth font-style:italic
class process_element,update_activity_started,update_activity_error,update_activity_processed pyMeth
```
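
The element processing loop shown in this graph can be sketched as follows. The method names `process_element` and `update_activity` come from the diagram; the `SketchWorker` class, the state strings and the in-memory activity store are assumptions made so that the example is self-contained.

```python
class SketchWorker:
    """Stand-in worker that records activity states in memory."""

    def __init__(self):
        self.activity = {}

    def update_activity(self, element_id, state):
        self.activity[element_id] = state

    def process_element(self, element_id):
        # Hypothetical failure condition, only there to exercise the error path.
        if element_id.startswith("bad"):
            raise ValueError(f"Cannot process {element_id}")

    def run(self, element_ids):
        failed = 0
        for element_id in element_ids:
            self.update_activity(element_id, "started")
            try:
                self.process_element(element_id)
            except Exception:
                # An error on one element does not stop the whole loop.
                self.update_activity(element_id, "error")
                failed += 1
            else:
                self.update_activity(element_id, "processed")
        return failed


worker = SketchWorker()
failed = worker.run(["elt-1", "bad-elt", "elt-2"])
print(failed, worker.activity)
```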
......@@ -70,6 +70,9 @@ nav:
- Using secrets in workers:
- contents/secrets/index.md
- Usage: contents/secrets/usage.md
- Worker Implementation:
- contents/implem/index.md
- Configuration: contents/implem/configure.md
- Python Reference:
- Base Worker: ref/base_worker.md
- Elements Worker: ref/elements_worker.md
......@@ -101,7 +104,11 @@ markdown_extensions:
- admonition # syntax coloration in code blocks
- codehilite
- pymdownx.details
- pymdownx.superfences
- pymdownx.superfences:
custom_fences:
- name: mermaid
class: mermaid
format: !!python/name:pymdownx.superfences.fence_code_format # yamllint disable-line
copyright: Copyright © Teklia
......