Workers processes can be started with unavailable WorkerVersions

Sentry Issue: ARKINDEX-BACKEND-1DB

ValueError: badly formed hexadecimal UUID string
  File "django/db/models/fields/__init__.py", line 2649, in to_python
    return uuid.UUID(**{input_form: value})
  File "uuid.py", line 171, in __init__
    raise ValueError('badly formed hexadecimal UUID string')

ValidationError: ['“None” is not a valid UUID.']
(23 additional frame(s) were not displayed)
...
  File "django/db/models/lookups.py", line 27, in __init__
    self.rhs = self.get_prep_lookup()
  File "django/db/models/lookups.py", line 341, in get_prep_lookup
    return super().get_prep_lookup()
  File "django/db/models/lookups.py", line 85, in get_prep_lookup
    return self.lhs.output_field.get_prep_value(self.rhs)
  File "django/db/models/fields/__init__.py", line 2633, in get_prep_value
    return self.to_python(value)
  File "django/db/models/fields/__init__.py", line 2651, in to_python
    raise exceptions.ValidationError(

Starting this process in demo (with more details in the Django admin) causes an HTTP 500 error because Ponos tries to find an artifact with an ID of None.

5 out of the 6 WorkerVersions on this process are marked as WorkerVersionState.Error and do not have a Docker image artifact ID, due to their artifacts getting deleted by the cleanup after the fix in #1402 (closed).

The YAML recipe includes artifact: None because both WorkerRun.build_task_recipe (for Workers processes) and Process.build_workflow (for Training processes) use 'artifact': str(version.docker_image_id) without checking that the ID is a UUID. str(None) produces the string 'None', which YAML reads as a plain string, unlike null.
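The failure mode can be reproduced outside Django with a minimal sketch. `artifact_field` and `is_valid_uuid` are hypothetical helpers written for this illustration, not code from the backend or from Ponos; only the `str(...)` expression is taken from the description above.

```python
import uuid


def artifact_field(docker_image_id):
    # Mirrors the expression quoted above: str() is applied unconditionally,
    # so a missing ID silently becomes the four-character string 'None'.
    return {"artifact": str(docker_image_id)}


def is_valid_uuid(value):
    # Hypothetical helper: the kind of check a consumer of the recipe
    # could run before using the value in a database lookup.
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


recipe_ok = artifact_field(uuid.UUID("12345678-1234-5678-1234-567812345678"))
recipe_bad = artifact_field(None)

print(recipe_bad)                             # {'artifact': 'None'}
print(is_valid_uuid(recipe_ok["artifact"]))   # True
print(is_valid_uuid(recipe_bad["artifact"]))  # False
```

The last call fails with the same `ValueError: badly formed hexadecimal UUID string` that appears at the top of the traceback.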

We could have had at least three layers of defense against this:

  • Ponos should validate that the artifact ID is a UUID (ponos#132)
  • Simple assertions in both recipe building methods could have checked that the relevant version is Available and has a Docker image artifact. This would still have caused an HTTP 500, but a more explicit one, and one that we can unit test.
  • Starting this process should have resulted in an HTTP 400 stating that some versions are not available.
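The second and third layers could look roughly like the sketch below. It is only an illustration under stated assumptions: `StubWorkerVersion`, `build_artifact_entry` and `unusable_versions` are hypothetical names, the enum lists only the two states mentioned in this issue, and the real models and API views are not shown.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerVersionState(Enum):
    # Only the two states named in this issue; the real enum may have more.
    Available = "available"
    Error = "error"


@dataclass
class StubWorkerVersion:
    # Hypothetical stand-in for the WorkerVersion model.
    id: uuid.UUID
    state: WorkerVersionState
    docker_image_id: Optional[uuid.UUID] = None


def build_artifact_entry(version):
    # Layer 2: fail loudly while building the recipe instead of emitting 'None'.
    assert version.state is WorkerVersionState.Available, (
        f"Worker version {version.id} is not available"
    )
    assert version.docker_image_id is not None, (
        f"Worker version {version.id} has no Docker image artifact"
    )
    return {"artifact": str(version.docker_image_id)}


def unusable_versions(versions):
    # Layer 3: collect the offending versions so that the endpoint starting
    # the process can answer with an HTTP 400 listing them.
    return [
        v for v in versions
        if v.state is not WorkerVersionState.Available or v.docker_image_id is None
    ]


good = StubWorkerVersion(uuid.uuid4(), WorkerVersionState.Available, uuid.uuid4())
bad = StubWorkerVersion(uuid.uuid4(), WorkerVersionState.Error)

print(build_artifact_entry(good))
print([v.id for v in unusable_versions([good, bad])])
```

Calling `build_artifact_entry(bad)` raises an `AssertionError` naming the version. Plain `assert` statements are skipped when Python runs with `-O`, so an explicit exception may be preferable if the check has to hold in every environment.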

Note that we have a database check constraint preventing versions from being available without a Docker image, but that constraint cannot prevent a process from being linked to versions that are unavailable or without images. We also already have a check on CreateWorkerRun to prevent adding an unavailable worker version to a process, but that will not detect when a worker version becomes unavailable after it has been added to a process.