Automatic Text Recognition / DAN, merge request !282
Charset should only include training characters
Merged: training-charset into main
Overview 18 · Commits 5 · Pipelines 0 · Changes 2
Manon Blanco requested to merge training-charset into main 1 year ago
Closes #190 (closed)
Edited 1 year ago by Manon Blanco
Viewing commit 6e62ab4d: Add unknown token in charset
Manon Blanco authored 1 year ago
2 files, +6 −6
dan/datasets/extract/extract.py (+3 −3)
@@ -277,7 +277,7 @@
         if self.unknown_token in text:
             raise UnknownTokenInText(element_id=element.id)
         image_path = Path(self.output, IMAGES_DIR, split, element.id).with_suffix(
             self.image_extension
         )
@@ -293,7 +293,7 @@
             }
         )
-        self.data[split][str(image_path)] = self.format_text(
+        text = self.format_text(
             text,
             # Do not replace unknown characters in train split
             charset=self.charset if split != TRAIN_NAME else None,
         )
-        if split == TRAIN_NAME:
-            self.charset = self.charset.union(set(text))
+        self.data[split][str(image_path)] = text
+        self.charset = self.charset.union(set(text))
     def process_parent(
         self,
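The effect of this change can be illustrated with a minimal, standalone sketch. It is not the DAN implementation: `format_text` here is a simplified stand-in that only does character-level replacement, `build_charset` and the `samples` layout are hypothetical helpers, and the values of `TRAIN_NAME` and `UNKNOWN_TOKEN` are assumptions. It also assumes the train split is processed before the other splits, so that the charset is already populated when validation and test texts are formatted.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

TRAIN_NAME = "train"   # assumed split name
UNKNOWN_TOKEN = "⁇"    # assumed placeholder; the real token is configurable


def format_text(text: str, charset: Optional[Set[str]] = None) -> str:
    """Simplified stand-in: replace characters missing from `charset`.

    When `charset` is None (train split), the text is returned unchanged.
    """
    if charset is None:
        return text
    return "".join(char if char in charset else UNKNOWN_TOKEN for char in text)


def build_charset(
    samples: Iterable[Tuple[str, str, str]],
) -> Tuple[Dict[str, Dict[str, str]], Set[str]]:
    """Hypothetical helper mirroring the logic after the change.

    `samples` yields (split, key, text) tuples, train split first.
    """
    charset: Set[str] = set()
    data: Dict[str, Dict[str, str]] = {}
    for split, key, text in samples:
        # Do not replace unknown characters in the train split
        text = format_text(text, charset=charset if split != TRAIN_NAME else None)
        data.setdefault(split, {})[key] = text
        # The charset is updated from the *formatted* text for every split,
        # so the unknown token is added as soon as a replacement happens.
        charset = charset.union(set(text))
    return data, charset


samples = [("train", "img1", "abc"), ("val", "img2", "abd")]
data, charset = build_charset(samples)
```

In this sketch the validation text `"abd"` contains `d`, which never appears in training, so it is stored with the unknown token in place of `d`, and that token ends up in the charset while `d` does not.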