Handle entities starting with a number in `XMLEntity` dataclass

I encountered an issue while trying to extract the LOC-digirati dataset with `teklia-dan dataset extract`:

Extracting data from form (002d17cf-0f7f-45f6-9da5-c4852f051043) for split (train):   0%|          | 0/395 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/users/ebardou/atr/dan/dan_venv/bin/teklia-dan", line 8, in <module>
    sys.exit(main())
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/cli.py", line 26, in main
    status = args.pop("func")(**args)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 350, in run
    ).run()
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 301, in run
    self.process_parent(
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 181, in process_parent
    self.process_element(dataset_parent, parent)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 150, in process_element
    text = self.extract_transcription(element)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 124, in extract_transcription
    entities_to_xml(
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/utils.py", line 316, in entities_to_xml
    entity.insert(root)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/utils.py", line 227, in insert
    e = SubElement(parent, slugify(self.type))
  File "src/lxml/etree.pyx", line 3156, in lxml.etree.SubElement
  File "src/lxml/apihelpers.pxi", line 179, in lxml.etree._makeSubElement
  File "src/lxml/apihelpers.pxi", line 1754, in lxml.etree._tagValidOrRaise
ValueError: Invalid tag name '2_copy_date'

Extract from the tokens.yml file:

---
1_copy_date:
  start: 
  end: ''
2_copy_date:
  start: 
  end: ''
[...]

The extraction failed on this element, which holds a `2_copy_date` entity. XML element names cannot start with a digit, so lxml rejects `2_copy_date` as a tag name.
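
One possible workaround, sketched below, is to sanitize the slugified entity type before passing it to `SubElement`. `safe_tag_name` and its `prefix` parameter are hypothetical names, not part of teklia-dan, and the sketch only covers the ASCII subset of valid XML names. It assumes that prefixing an underscore is acceptable for the dataset.

```python
import re

# Hypothetical helper, not part of teklia-dan.
# Conservative ASCII subset of the characters allowed in an XML name.
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_.-]")


def safe_tag_name(name: str, prefix: str = "_") -> str:
    """Return `name` altered so that it can be used as an XML element name."""
    # Replace every character outside the allowed subset with an underscore.
    cleaned = _INVALID_CHARS.sub("_", name)
    # An XML name must start with a letter or an underscore,
    # never with a digit, "-" or ".".
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = prefix + cleaned
    return cleaned


print(safe_tag_name("2_copy_date"))  # _2_copy_date
print(safe_tag_name("copy_date"))    # copy_date
```

With this approach the tag name no longer matches the entity type exactly, so any code that reads the XML back would need to undo the prefix, and a name such as `_2_copy_date` could collide with an entity type that already has that name.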