Handle entities starting with a number in `XMLEntity` dataclass
I encountered an issue while trying to extract the LOC-digirati dataset with `teklia-dan dataset extract`:
Extracting data from form (002d17cf-0f7f-45f6-9da5-c4852f051043) for split (train): 0%| | 0/395 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/users/ebardou/atr/dan/dan_venv/bin/teklia-dan", line 8, in <module>
    sys.exit(main())
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/cli.py", line 26, in main
    status = args.pop("func")(**args)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 350, in run
    ).run()
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 301, in run
    self.process_parent(
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 181, in process_parent
    self.process_element(dataset_parent, parent)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 150, in process_element
    text = self.extract_transcription(element)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/arkindex.py", line 124, in extract_transcription
    entities_to_xml(
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/utils.py", line 316, in entities_to_xml
    entity.insert(root)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/dan/datasets/extract/utils.py", line 227, in insert
    e = SubElement(parent, slugify(self.type))
  File "src/lxml/etree.pyx", line 3156, in lxml.etree.SubElement
  File "src/lxml/apihelpers.pxi", line 179, in lxml.etree._makeSubElement
  File "src/lxml/apihelpers.pxi", line 1754, in lxml.etree._tagValidOrRaise
ValueError: Invalid tag name '2_copy_date'
Extract from the `tokens.yml` file:
---
1_copy_date:
  start: Ⓐ
  end: ''
2_copy_date:
  start: Ⓑ
  end: ''
[...]
The extraction failed on this element, which holds a `2_copy_date` entity. XML tag names cannot start with a digit, so `SubElement` rejects the slugified entity type.
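
One possible direction, as a minimal sketch only: sanitize the entity type before it is used as a tag name. The helper `safe_tag_name`, the `VALID_TAG` pattern and the `_` prefix are all hypothetical and not part of DAN. The pattern is a simplified, ASCII-only approximation of the XML naming rules.

```python
import re

# Simplified, ASCII-only approximation of an XML tag name (assumption):
# the first character is a letter or underscore, later characters may
# also be digits, dots or hyphens.
VALID_TAG = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def safe_tag_name(entity_type: str, prefix: str = "_") -> str:
    """Return a name usable as an XML tag (hypothetical helper).

    Valid names are returned unchanged. Otherwise `prefix` is prepended,
    which is enough for names that only fail because of a leading digit.
    """
    if VALID_TAG.fullmatch(entity_type):
        return entity_type
    candidate = prefix + entity_type
    if not VALID_TAG.fullmatch(candidate):
        raise ValueError(f"Cannot build a valid tag name from {entity_type!r}")
    return candidate


print(safe_tag_name("2_copy_date"))
print(safe_tag_name("copy_date"))
```

With such an approach, the prefix would also have to be stripped when the XML is converted back into tokens, so that `_2_copy_date` still maps to the `2_copy_date` entry of `tokens.yml`.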