Configure method documentation

Commit 3af756ad authored 2 years ago by Yoann Schneider Committed by Bastien Abadie 2 years ago
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -30,7 +30,7 @@ repos:
      - id: trailing-whitespace
      - id: check-yaml
        args: [--allow-multiple-documents]
-        exclude: "^worker-{{cookiecutter.slug}}/.arkindex.yml$"
+        exclude: "^worker-{{cookiecutter.slug}}/.arkindex.yml$|^mkdocs.yml$"
      - id: mixed-line-ending
      - id: name-tests-test
        args: ['--django']

--- a/docs/contents/implem/configure.md
+++ b/docs/contents/implem/configure.md
+# Configuration
+
+When the worker is running over elements, be it locally or on Arkindex, the first step before actually doing
+anything is configuration. This process is implemented in the `configure` method.
+This method can also be overridden if the worker needs additional configuration steps.
+
+The developer mode was designed to help worker developers reproduce and test how their worker
+would behave on Arkindex. This is why the configuration process in this mode mirrors the operations done on Arkindex while
+replacing configuration API calls with CLI arguments.
+
+The developer mode (or `read-only` mode) is enabled when at least one of the following is true:
+
+- the `--dev` CLI argument is used,
+- the `ARKINDEX_WORKER_RUN_ID` variable was not set in the environment.
+
+Neither of these happens when running on Arkindex.
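The two conditions listed above can be sketched as a small predicate. This is a minimal sketch and not the library's implementation: the name `is_read_only` mirrors the attribute documented further down, but the function signature and the `environ` parameter are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def is_read_only(dev_flag: bool, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the worker should run in developer (read-only) mode.

    Hypothetical helper: developer mode is on when --dev was passed,
    or when ARKINDEX_WORKER_RUN_ID is absent from the environment.
    """
    env = os.environ if environ is None else environ
    return dev_flag or "ARKINDEX_WORKER_RUN_ID" not in env
```

Passing the environment as a parameter is only a convenience for testing; a real worker would presumably read `os.environ` directly.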
+
+## Parallel between both modes
+
+```mermaid
+flowchart TB
+    subgraph configure[Configuration step]
+        argument_parsing[CLI argument parsing]
+    end
+    argument_parsing --> is_read_only{IsReadOnly?}
+    is_read_only -- Yes --> devMode
+    is_read_only -- No --> arkindexMode
+    subgraph arkindexMode[Arkindex mode]
+        direction TB
+        subgraph workerConfiguration[Worker configuration]
+            direction TB
+            retrieveWorkerRun["API call to RetrieveWorkerRun"] --> userconfig_defaults[Initialize user configuration with default values]
+            userconfig_defaults --> load_secrets_API["Load Secrets using API calls to RetrieveSecret"]
+            load_secrets_API --> load_user_config[Override user configuration by values set by user]
+            load_user_config --> load_model_config["Load model configuration"]
+        end
+        workerConfiguration --> cacheConfiguration
+        subgraph cacheConfiguration[Base worker cache setup]
+            direction TB
+            get_paths_from_parent_tasks["Retrieve paths of parent tasks' cache databases"] --> initialize_db[Create cache database and its tables]
+            initialize_db --> merge_parent_databases[Merge parent databases]
+        end
+    end
+
+    subgraph devMode[Developer mode]
+        direction TB
+        subgraph devWorkerConfiguration[Worker configuration]
+            direction TB
+            configuration_parsing[CLI config argument parsing] --> corpus_id[Read Corpus ID from environment]
+            corpus_id --> load_secrets[Load secret in local developer storage]
+        end
+    end
+    classDef pyMeth font-style:italic
+```
+
+## Arkindex mode
+The details of a worker execution (what is called a **WorkerRun**) on Arkindex are stored in the backend. The first step of the configuration is to retrieve this information using the Arkindex API. The [RetrieveWorkerRun](https://demo.arkindex.org/api-docs/#tag/process/operation/RetrieveWorkerRun) endpoint gives information about:
+
+- the running process,
+- the configuration parameters that the user may have added from the frontend,
+- the worker used,
+- the version of this worker,
+- the configuration stored in this version,
+- the model version used in this worker execution,
+- the configuration stored in this model version.
+
+This step shows that there are a lot of sources for the actual configuration that the worker can use. Nothing is overridden by default; the worker has to do it in its overridden version of the `configure` method. In the end, any parameter set by the user **must** be applied over other known configurations.
+
+!!! warning
+
+    The convention is to always give the final word to the user. This means that when the user configuration is filled, its values must be the last to override the worker's `config` attribute. If a model configuration was set, its values must override this attribute before the user configuration's.
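The precedence described in the warning above can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions: `merge_configuration` and its parameter names are hypothetical, and a real worker would apply the same ordering inside its own `configure` override.

```python
from typing import Any, Dict, Optional


def merge_configuration(
    worker_config: Dict[str, Any],
    model_config: Optional[Dict[str, Any]] = None,
    user_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: combine the configuration sources in priority order."""
    merged = dict(worker_config)        # lowest priority: worker version configuration
    merged.update(model_config or {})   # model configuration overrides the worker's
    merged.update(user_config or {})    # the user always has the final word
    return merged
```

With this ordering, a key set in all three sources ends up with the user's value, and a key set only by the model keeps the model's value.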
+
+The worker configuration may specify default values for some parameters (see [this section](../workers/yaml.md#setting-up-user-configurable-parameters) for more details about worker configuration). These default values are stored in the `user_configuration` dictionary attribute.
+
+This is also when the secrets (see [this section](../secrets/usage.md#declaring-secrets-in-workers) to learn more about secrets) are actually downloaded. They are stored in the `secrets` dictionary attribute.
+
+An Arkindex-mode exclusive step is done after all that: the cache setup. Some workers benefit a lot, performance-wise, from having a SQLite cache artifact from previous workers. This is mostly used in processes with multiple workers with dependencies, where the second worker needs the results of the first one to work. The database is initialized, the tables created and its version checked as it must match the one supported by the Arkindex instances. The database is then merged with any other database generated by previous worker runs.
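The cache setup described above (create the database and its tables, then merge the parents' databases) can be sketched with the standard `sqlite3` module. The single `elements` table, the function names and the merge strategy are assumptions made for illustration, not the actual cache schema, and the version check is left out.

```python
import sqlite3
from pathlib import Path
from typing import Iterable

# Hypothetical one-table schema; the real cache defines its own tables.
SCHEMA = "CREATE TABLE IF NOT EXISTS elements (id TEXT PRIMARY KEY, type TEXT NOT NULL)"


def init_cache(path: Path) -> None:
    """Create the cache database and its table if they do not exist yet."""
    db = sqlite3.connect(str(path))
    try:
        db.execute(SCHEMA)
        db.commit()
    finally:
        db.close()


def merge_parent_caches(path: Path, parents: Iterable[Path]) -> None:
    """Copy the rows of each parent database into the cache at `path`."""
    db = sqlite3.connect(str(path))
    try:
        for parent in parents:
            db.execute("ATTACH DATABASE ? AS parent_db", (str(parent),))
            # Rows whose ID is already present are skipped.
            db.execute(
                "INSERT OR IGNORE INTO elements SELECT id, type FROM parent_db.elements"
            )
            db.commit()  # DETACH is not allowed inside an open transaction
            db.execute("DETACH DATABASE parent_db")
    finally:
        db.close()
```

`INSERT OR IGNORE` is one possible way to handle elements present in several parents; whether the real merge keeps the first or the last copy is not stated in the text.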
+
+## Developer mode
+In the developer mode, the worker execution is not linked to anything on Arkindex. Therefore, the only configuration the worker can use is provided via the `--config` CLI argument. It accepts a YAML-formatted file whose content should be similar to the `configuration` section of the [worker configuration file](../workers/yaml/#single-worker-configuration), without the `user_configuration` details. More details about how to create the local worker configuration are available in [this section](../workers/run-local/).
+
+The multiple configuration sources from the Arkindex mode are merged into a single one here. The configuration parameters are parsed, as well as the list of required secrets. The secrets are loaded using a local Arkindex client. Again, see the [section about local execution](../workers/run-local/) for more details.
+
+One piece of information cannot be retrieved directly from the configuration file and is required in some cases: the ID of the Arkindex corpus to which the processed elements belong. It is retrieved via the `ARKINDEX_CORPUS_ID` environment variable.
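The developer-mode inputs described above can be gathered with `argparse` and the environment. This is a minimal sketch: only the flag names and the `ARKINDEX_CORPUS_ID` variable come from the text, while the function name, the argument types and the choice to return `None` for a missing corpus ID are assumptions. Reading the YAML file itself is left out because it would need a third-party parser.

```python
import argparse
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence, Tuple


def parse_dev_inputs(
    argv: Sequence[str], environ: Optional[Mapping[str, str]] = None
) -> Tuple[argparse.Namespace, Optional[str]]:
    """Hypothetical helper: collect CLI arguments and the corpus ID."""
    parser = argparse.ArgumentParser(description="Sketch of a worker CLI")
    parser.add_argument("--dev", action="store_true", help="force developer mode")
    parser.add_argument("--config", type=Path, default=None, help="YAML configuration file")
    parser.add_argument("--element", nargs="+", default=[], help="IDs of elements to process")
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    args = parser.parse_args(list(argv))

    env = os.environ if environ is None else environ
    # The corpus ID is not part of the configuration file.
    corpus_id = env.get("ARKINDEX_CORPUS_ID")
    return args, corpus_id
```

For example, `parse_dev_inputs(["--dev", "--config", "worker.yml"])` yields a namespace with `dev=True` and `config=Path("worker.yml")`, plus whatever corpus ID the environment provides.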
+
+## Worker reporter
+At the end of a worker execution, a report about the publications made by the worker is generated in JSON format. It lists:
+
+- the starting time,
+- the number of elements created, grouped by type,
+- the number of transcriptions created,
+- the number of classifications created, grouped by class,
+- the number of entities created,
+- the number of entities created on transcriptions,
+- the number of metadata created,
+- the encountered errors' logs.
+
+This is done by the many helpers described in the [reporting module](../../ref/reporting.md). They use the `report` attribute initialized at the configuration stage.
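To give an idea of the kind of data such a report aggregates, here is a minimal sketch of a counter-based reporter. `ReportSketch`, its method names and the JSON keys are all hypothetical and cover only part of the list above; the actual helpers and output format are those of the reporting module.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import List


class ReportSketch:
    """Hypothetical, simplified stand-in for the worker's reporter."""

    def __init__(self) -> None:
        self.started = datetime.now(timezone.utc).isoformat()
        self.elements: Counter = Counter()          # created elements, by type
        self.transcriptions = 0                     # created transcriptions
        self.classifications: Counter = Counter()   # created classifications, by class
        self.errors: List[str] = []                 # logs of encountered errors

    def add_element(self, type_slug: str) -> None:
        self.elements[type_slug] += 1

    def add_transcription(self) -> None:
        self.transcriptions += 1

    def add_classification(self, class_name: str) -> None:
        self.classifications[class_name] += 1

    def add_error(self, message: str) -> None:
        self.errors.append(message)

    def to_json(self) -> str:
        """Serialize the counters as a JSON document."""
        return json.dumps(
            {
                "started": self.started,
                "elements": dict(self.elements),
                "transcriptions": self.transcriptions,
                "classifications": dict(self.classifications),
                "errors": self.errors,
            },
            indent=2,
        )
```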
+
+## Setting Debug logging level
+There are three ways to activate the debug mode:
+
+- the `--verbose` CLI argument,
+- setting the `ARKINDEX_DEBUG` environment variable to `True`,
+- setting `"debug": True` in the worker's configuration via any configuration source.
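The three triggers listed above can be combined into a single check. This is a minimal sketch: the function names are hypothetical, and the case-insensitive comparison of `ARKINDEX_DEBUG` against `"true"` is an assumption about how the variable might be parsed.

```python
import logging
import os
from typing import Any, Mapping, Optional


def debug_enabled(
    verbose: bool,
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical helper: True if any of the three debug triggers is set."""
    env = os.environ if environ is None else environ
    env_flag = env.get("ARKINDEX_DEBUG", "").strip().lower() == "true"
    return verbose or env_flag or bool(config.get("debug", False))


def apply_log_level(debug: bool) -> int:
    """Set the root logger level and return the level that was applied."""
    level = logging.DEBUG if debug else logging.INFO
    logging.getLogger().setLevel(level)
    return level
```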
+
+## Important class attributes
+Many attributes are set on the worker during the configuration stage. Here is a *non-exhaustive* list with some details about their source and their usage.
+
+
+`api_client`
+: The Arkindex API client used by the worker to make requests. One should not rely on this attribute to make API calls but use the many helpers available. The exception is for endpoints where no helper is available.
+
+`args`
+: The arguments passed via the CLI. This is used to trigger the Developer mode via `--dev`, to specify the configuration file via `--config` and to list elements to process via `--element`.
+
+`config`
+: A dictionary with the worker's configuration. It is filled from the worker run's configuration, the worker version's configuration and, if there is one, the model version's configuration.
+
+`corpus_id`
+: The ID of the corpus linked to the current process. This is mostly needed when publishing objects linked to a corpus like `Entities`. You may set it in developer mode via the `ARKINDEX_CORPUS_ID` environment variable.
+
+`is_read_only`
+: This is the computed property that determines which mode should be used. The Developer mode prevents any actual publication on Arkindex, hence the name `read_only`.
+
+`model_configuration`
+: The parsed configuration as stored in the `ModelVersion` object on Arkindex.
+
+`process_information`
+: The details about the process parent to this worker execution. Only set in Arkindex mode.
+
+`reporter`
+: The `Reporter` instance that will generate the `ml_report.json` artifact, which sums up the publications made during this execution and the errors encountered.
+
+`secrets`
+: A dictionary mapping each secret name to its parsed content.
+
+`use_cache`
+: Whether the cache optimization is available or not.
+
+`user_configuration`
+: The parsed configuration as the user entered it via the Arkindex frontend. Any parameter not specified will be filled with its default value if there is one.
+
+`worker_details`
+: The details of the worker used in this execution.
+
+`worker_run_id`
+: The ID of the corresponding `WorkerRun` object on the Arkindex instance. In Arkindex mode, this is used in the `RetrieveWorkerRun` API call to retrieve the configuration and other necessary information. In developer mode, this is neither set nor used.
+
+`worker_version_id`
+: The ID of the `WorkerVersion` object linked to the current `WorkerRun`. Like the `worker_run_id` attribute, this is neither set nor used in developer mode.
--- a/docs/contents/implem/index.md
+++ b/docs/contents/implem/index.md
+# Worker Implementation
+
+This section presents
+
+- the different stages happening during a worker execution:
+    - the initialization
+    - the [configuration](./configure.md)
+    - the execution
+- the design of a worker:
+    - the architecture
+    - additional configuration steps
+    - element processing
+
+The following graph describes what happens when running the worker, either on Arkindex or locally. Words in italic font are actual method calls in the worker.
+
+```mermaid
+flowchart LR
+    subgraph all[Worker execution]
+        direction LR
+        subgraph id1[Worker initialization]
+            init
+        end
+        run -.-> configure
+        subgraph id2[Inference]
+            direction TB
+            configure --> list_elements
+            list_elements --> element_processing
+            subgraph id3[Loop over each element]
+                element_processing --> element_processing
+            end
+            element_processing -- Save ML report to disk --> reporting
+        end
+        init --> run
+    end
+    classDef pyMeth font-style:italic
+    class init,run,configure,list_elements pyMeth
+```
+
+The following graph gives more details about the `element_processing` step.
+
+```mermaid
+flowchart LR
+    subgraph all[Element processing]
+        direction LR
+        subgraph id1[Element details retrieval]
+            retrieve_element
+        end
+        retrieve_element --> update_activity_started
+        subgraph id2[Processing]
+            direction LR
+            update_activity_started[update_activity] --> process_element -- No errors --> update_activity_processed
+            update_activity_started -- to Started --> update_activity_started
+            update_activity_processed[update_activity] -- to Processed --> update_activity_processed
+            update_activity_error[update_activity] -- to Error --> update_activity_error
+        end
+        process_element -- Errors found --> update_activity_error
+    end
+    classDef pyMeth font-style:italic
+    class process_element,update_activity_started,update_activity_error,update_activity_processed pyMeth
+```
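The activity transitions in the graph above can be sketched as a plain loop. This is a minimal sketch under stated assumptions: `ActivityState`, `run_elements` and the callback signatures are hypothetical, the element retrieval step is omitted, and only the Started, Processed and Error transitions come from the diagram.

```python
from enum import Enum
from typing import Callable, Iterable


class ActivityState(Enum):
    """Hypothetical enumeration of the states shown in the diagram."""
    Started = "started"
    Processed = "processed"
    Error = "error"


def run_elements(
    element_ids: Iterable[str],
    process_element: Callable[[str], None],
    update_activity: Callable[[str, ActivityState], None],
) -> int:
    """Process each element and return the number of failures."""
    failed = 0
    for element_id in element_ids:
        update_activity(element_id, ActivityState.Started)
        try:
            process_element(element_id)
        except Exception:
            # Errors found: mark the element as failed and move on to the next one.
            failed += 1
            update_activity(element_id, ActivityState.Error)
        else:
            # No errors: mark the element as processed.
            update_activity(element_id, ActivityState.Processed)
    return failed
```

Catching the exception per element keeps one failing element from stopping the whole run, which matches the separate Error branch in the graph.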
+
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -70,6 +70,9 @@ nav:
  - Using secrets in workers:
    - contents/secrets/index.md
    - Usage: contents/secrets/usage.md
+  - Worker Implementation:
+      - contents/implem/index.md
+      - Configuration: contents/implem/configure.md
  - Python Reference:
      - Base Worker: ref/base_worker.md
      - Elements Worker: ref/elements_worker.md
@@ -101,7 +104,11 @@ markdown_extensions:
    - admonition # syntax coloration in code blocks
    - codehilite
    - pymdownx.details
-    - pymdownx.superfences
+    - pymdownx.superfences:
+        custom_fences:
+          - name: mermaid
+            class: mermaid
+            format: !!python/name:pymdownx.superfences.fence_code_format # yamllint disable-line

 copyright:  Copyright &copy; Teklia