Named Entity Recognition / BIO Parser
Commit 9d5df0ab: Correct char label between entities
Authored 1 year ago by Yoann Schneider
Parent: 253bd2f7
Merge request: !5 (Correct char label between entities)
Pipeline #154690: passed 1 year ago (stages: test, build, deploy)

Showing 2 changed files with 27 additions and 2 deletions:
bio_parser/parse/document.py: 8 additions, 2 deletions
tests/parse/test_document.py: 19 additions, 0 deletions
bio_parser/parse/document.py (+8 -2)

@@ -315,7 +315,7 @@ class Document:
     def char_labels(self) -> list[str]:
         r"""
         Character-level IOB labels.

-        Spaces between two tokens with the same label get the same label, others get 'O'.
+        Spaces between two tokens part of the same entities with the same label get the same label, others get 'O'.

         Examples:
             The space between 'I' and 'run' is tagged as 'I-Animal', because it's the same named entity label.
...
@@ -325,12 +325,18 @@ class Document:
             The space between 'run' and 'fast' is tagged as 'O', because it's not the same label.
             >>> Document(bio_repr="run B-Animal\nfast O").char_labels
             ['B-Animal', 'I-Animal', 'I-Animal', 'O', 'O', 'O', 'O', 'O']
+
+            The space between 'dog' and 'cat' is tagged as 'O', because it's not the same entity.
+            >>> Document(bio_repr="run B-Animal\ncat B-Animal").char_labels
+            ['B-Animal', 'I-Animal', 'I-Animal', 'O', 'B-Animal', 'I-Animal', 'I-Animal']
         """
         tags = []
         for token, next_token in pairwise(self.tokens + [None]):
             # Add token tags
             tags.extend(token.labels)
-            if next_token and token.label == next_token.label:
+            if next_token and (
+                token.label == next_token.label and not next_token.tag == Tag.BEGINNING
+            ):
                 tags.append(next_token.iob_label)
             elif next_token:
                 tags.append(Tag.OUTSIDE.value)
...
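The rule changed in this hunk can be tried out in isolation. The following is a minimal sketch, not the library's code: `SimpleToken`, `parse` and `char_labels` are hypothetical stand-ins written for illustration, and `zip(tokens, tokens[1:] + [None])` plays the role of `pairwise(self.tokens + [None])`. It assumes one `text TAG` pair per line and single-space separators.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleToken:
    """Hypothetical stand-in for the library's token class."""

    text: str
    tag: str  # "B", "I" or "O"
    label: Optional[str]  # entity label, None for "O" tokens

    @property
    def char_labels(self) -> List[str]:
        if self.tag == "O":
            return ["O"] * len(self.text)
        # The first character keeps the token's tag, the rest continue the entity.
        return [f"{self.tag}-{self.label}"] + [f"I-{self.label}"] * (len(self.text) - 1)


def parse(bio_repr: str) -> List[SimpleToken]:
    """Parse lines such as 'dog B-Animal' or 'fast O' (simplified, no validation)."""
    tokens = []
    for line in bio_repr.splitlines():
        text, iob = line.rsplit(" ", 1)
        tag, _, label = iob.partition("-")
        tokens.append(SimpleToken(text, tag, label or None))
    return tokens


def char_labels(tokens: List[SimpleToken]) -> List[str]:
    """Character-level labels, including the space between consecutive tokens."""
    labels: List[str] = []
    for token, next_token in zip(tokens, tokens[1:] + [None]):
        labels.extend(token.char_labels)
        if next_token is None:
            continue  # no separator after the last token
        # The separator continues the entity only if the next token has the
        # same label AND does not open a new entity (tag "B").
        same_entity = (
            token.label is not None
            and token.label == next_token.label
            and next_token.tag != "B"
        )
        labels.append(f"I-{token.label}" if same_entity else "O")
    return labels


print(char_labels(parse("dog B-Animal\ncat B-Animal")))
```

With the `next_token.tag != "B"` check in place, the space between two consecutive `B-Animal` tokens is labelled `O`, while the space inside a `B-Animal` / `I-Animal` pair is still labelled `I-Animal`.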
tests/parse/test_document.py (+19 -0)

@@ -189,6 +189,25 @@ def test_parse_token(document: Document):
     assert token.chars == ["r", "o", "b", "o", "t", "s"]


+def test_consecutive_entities():
+    # BIO FILE
+    # dog B-Animal
+    # cat B-Animal
+    document = Document("dog B-Animal\ncat B-Animal")
+
+    assert document.chars == ["d", "o", "g", " ", "c", "a", "t"]
+    assert document.char_labels == [
+        "B-Animal",
+        "I-Animal",
+        "I-Animal",
+        "O",  # Character between two new entities should be set to O
+        "B-Animal",
+        "I-Animal",
+        "I-Animal",
+    ]
+
+
 @pytest.mark.parametrize(
     "annotation",
     ["Something something", "Something A-GPE", "Something GPE-A", "Something A"],
...
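To see why the test insists on `O` for the separator, it can help to decode character labels back into entities. The sketch below is an illustration only: `entity_spans` is a hypothetical helper, not part of BIO Parser, and the second label list is a made-up counterexample rather than the output of the code before this commit.

```python
from typing import List, Optional, Tuple


def entity_spans(chars: List[str], labels: List[str]) -> List[Tuple[Optional[str], str]]:
    """Group characters into (label, text) entities from character-level IOB labels."""
    spans = []
    current_label, current_chars = None, []
    for char, iob in zip(chars, labels):
        tag, _, label = iob.partition("-")
        # "B" always opens an entity; an "I" with a different label is treated as one too.
        starts_new = tag == "B" or (tag == "I" and label != current_label)
        if tag == "O" or starts_new:
            if current_chars:
                spans.append((current_label, "".join(current_chars)))
            current_label, current_chars = (label, [char]) if tag != "O" else (None, [])
        else:
            current_chars.append(char)
    if current_chars:
        spans.append((current_label, "".join(current_chars)))
    return spans


chars = ["d", "o", "g", " ", "c", "a", "t"]

# Labels expected by test_consecutive_entities: the separator is "O".
expected = ["B-Animal", "I-Animal", "I-Animal", "O", "B-Animal", "I-Animal", "I-Animal"]
print(entity_spans(chars, expected))

# Hypothetical counterexample: the separator carries an entity label.
leaky = ["B-Animal", "I-Animal", "I-Animal", "I-Animal", "B-Animal", "I-Animal", "I-Animal"]
print(entity_spans(chars, leaky))
```

With the expected labels, decoding yields the two entities `dog` and `cat`. In the counterexample the first entity absorbs the separator and becomes `dog ` with a trailing space.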