Workers processes can be started with unavailable WorkerVersions
Sentry Issue: ARKINDEX-BACKEND-1DB
ValueError: badly formed hexadecimal UUID string
File "django/db/models/fields/__init__.py", line 2649, in to_python
return uuid.UUID(**{input_form: value})
File "uuid.py", line 171, in __init__
raise ValueError('badly formed hexadecimal UUID string')
ValidationError: ['“None” is not a valid UUID.']
(23 additional frame(s) were not displayed)
...
File "django/db/models/lookups.py", line 27, in __init__
self.rhs = self.get_prep_lookup()
File "django/db/models/lookups.py", line 341, in get_prep_lookup
return super().get_prep_lookup()
File "django/db/models/lookups.py", line 85, in get_prep_lookup
return self.lhs.output_field.get_prep_value(self.rhs)
File "django/db/models/fields/__init__.py", line 2633, in get_prep_value
return self.to_python(value)
File "django/db/models/fields/__init__.py", line 2651, in to_python
raise exceptions.ValidationError(
Starting this process in demo (with more details in the Django admin) causes an HTTP 500 error, because Ponos tries to find an artifact with an ID of None.
5 out of the 6 WorkerVersions on this process are marked as WorkerVersionState.Error and do not have a Docker image artifact ID, because their artifacts were deleted by the cleanup after the fix in #1402 (closed).
The YAML recipe includes artifact: None because both WorkerRun.build_task_recipe (for Workers processes) and Process.build_workflow (for Training processes) use 'artifact': str(version.docker_image_id) without checking whether the ID is a UUID. str(None) produces the text None, which YAML interprets as the string 'None', unlike null.
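The failure mode can be reproduced with the standard library alone. This is a minimal sketch: `docker_image_id` stands in for the field on a version whose artifact was deleted, and `recipe_entry` and `is_valid_uuid` are hypothetical names used for illustration, not the actual recipe structure or an existing helper.

```python
import uuid

# Stand-in for version.docker_image_id on a version without an artifact
docker_image_id = None

# Same expression as in the recipe builders: str() turns None into text
recipe_entry = {"artifact": str(docker_image_id)}


def is_valid_uuid(value):
    """Return True if value parses as a UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except (ValueError, TypeError, AttributeError):
        return False
    return True
```

Here `recipe_entry["artifact"]` is the four-character string `None`, which fails UUID parsing with the same "badly formed hexadecimal UUID string" error seen in the traceback.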
We could have had at least three layers of defense against this:
- Ponos should validate that the artifact ID is a UUID (ponos#132)
- Simple assertions in both recipe building methods could have checked that the relevant version is both Available and has a Docker image artifact. This would still have caused an HTTP 500, but a more explicit one, and we can unit test that.
- Starting this process should have resulted in an HTTP 400 stating that some versions are not available.
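The second layer, assertions in the recipe building methods, might look like the sketch below. It is only an illustration: `FakeWorkerVersion` and `artifact_for_recipe` are hypothetical stand-ins, not the real model or an existing method, and the enum lists only the two states mentioned in this issue.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerVersionState(Enum):
    # Only the states named in this issue; the real enum may have more
    Available = "available"
    Error = "error"


@dataclass
class FakeWorkerVersion:
    # Hypothetical stand-in for the Django model
    state: WorkerVersionState
    docker_image_id: Optional[uuid.UUID] = None


def artifact_for_recipe(version):
    """Return the artifact ID as a string, or fail with an explicit message."""
    assert version.state == WorkerVersionState.Available, (
        f"Worker version is not available (state: {version.state.name})"
    )
    assert version.docker_image_id is not None, (
        "Worker version has no Docker image artifact"
    )
    return str(version.docker_image_id)
```

Both recipe builders could call such a helper instead of `str(version.docker_image_id)` directly. Note that `assert` statements are skipped when Python runs with `-O`, so an explicit exception would be needed if that flag is used in production.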
Note that we have a database check constraint preventing versions from being available without a Docker image, but that constraint cannot prevent a process from being linked to versions that are unavailable or without images. We also already have a check on CreateWorkerRun to prevent adding an unavailable worker version to a process, but that will not detect when a worker version becomes unavailable after it got added to a process.
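A check at process start time would cover versions that became unavailable after being added. The sketch below assumes a simplified view of the data: `VersionInfo` and `start_errors` are hypothetical names, and the state is modelled as a plain string rather than the real enum.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class VersionInfo:
    # Hypothetical stand-in for a WorkerVersion row
    id: str
    state: str
    docker_image_id: Optional[str] = None


def start_errors(versions: Iterable[VersionInfo]) -> List[str]:
    """List one error message per version that cannot be run."""
    errors = []
    for version in versions:
        if version.state != "available" or version.docker_image_id is None:
            errors.append(f"Worker version {version.id} is not available.")
    return errors
```

If the returned list is not empty, the endpoint that starts the process could respond with an HTTP 400 carrying these messages instead of building the recipe. Assuming that endpoint is a Django REST Framework view, raising its `ValidationError` would be one way to do so.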