Skip to content

Build UTF-8 script payloads

Retrieve raw scripts list with glyphs ranges

curl https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt | grep -v '^#' | grep -v '^$' > scripts.txt

Then build a JSON file with those glyphs:

import re
import json
import collections

REGEX = re.compile(r'^([A-F0-9]+)(?:\.\.([A-F0-9]+))?\s+; ([\w_]+)')

def parse(path):
    scripts = collections.defaultdict(list)
    with open(path) as f:
        lines = f.readlines()

    for line in lines:
        output = REGEX.search(line)
        assert output is not None, f"Invalid parsing on: {line}"

        start, end, name = output.groups()
        if end is not None:
            scripts[name].append((start, end))
        else:
            scripts[name].append(start)

    return scripts

if __name__ == '__main__':

    print(json.dumps(parse('scripts.txt'), indent=4))

Tasks:

  • cleanup python script
  • support raw file without grep cleanup (avoid comment and enmpty lines)
  • generate one clean JSON file per script in a assets/scripts/ folder in this repo

Expected output for the Script Nyiakeng_Puachue_Hmong:

  • a file named assets/scripts/Nyiakeng_Puachue_Hmong.json
  • with that structure:
{
  "name": "Nyiakeng Puachue Hmong",
  "characters": [
    {
      "code": "000A",
      "description": "A nice character",
    },
  ...
  ]
}

A second pass is necessary to retrieve the glyph names from https://www.unicode.org/Public/13.0.0/ucd/extracted/DerivedName.txt

Edited by Bastien Abadie