Build UTF-8 script payloads
Retrieve raw scripts list with glyphs ranges
curl https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt | grep -v '^#' | grep -v '^$' > scripts.txt
Then build a JSON file with those glyphs:
import re
import json
import collections
REGEX = re.compile(r'^([A-F0-9]+)(?:\.\.([A-F0-9]+))?\s+; ([\w_]+)')
def parse(path):
scripts = collections.defaultdict(list)
with open(path) as f:
lines = f.readlines()
for line in lines:
output = REGEX.search(line)
assert output is not None, f"Invalid parsing on: {line}"
start, end, name = output.groups()
if end is not None:
scripts[name].append((start, end))
else:
scripts[name].append(start)
return scripts
if __name__ == '__main__':
print(json.dumps(parse('scripts.txt'), indent=4))
Tasks:
- cleanup python script
- support raw file without grep cleanup (avoid comment and enmpty lines)
- generate one clean JSON file per script in a
assets/scripts/
folder in this repo
Expected output for the Script Nyiakeng_Puachue_Hmong
:
- a file named
assets/scripts/Nyiakeng_Puachue_Hmong.json
- with that structure:
{
"name": "Nyiakeng Puachue Hmong",
"characters": [
{
"code": "000A",
"description": "A nice character",
},
...
]
}
A second pass is necessary to retrieve the glyph names from https://www.unicode.org/Public/13.0.0/ucd/extracted/DerivedName.txt
Edited by Bastien Abadie