Profiterole / SRCMF-UD
Commit 6f78640e, authored Oct 29, 2021 by Loïc Grobol
improve stats
parent 38ca324d
Pipeline #8626 passed with stages in 1 minute and 56 seconds
tools/get_statistics.py @ 6f78640e
...
...
@@ -30,13 +30,18 @@ def main(corpus_path: pathlib.Path):
    trees: Dict[str, Tuple[str, conllu.TokenList]] = dict()
    print("Per file:")
    print("file\ttokens\ttrees")
    total_tokens = 0
    total_trees = 0
    for f in corpus_path.glob("*.conllu"):
        with open(f) as in_stream:
            file_trees = list(conllu.parse_incr(in_stream))
        trees.update((t.metadata["sent_id"], (f.stem, t)) for t in file_trees)
        n_trees = len(file_trees)
        n_tokens = sum(len(t) for t in file_trees)
        n_trees = len(file_trees)
        total_tokens += n_tokens
        total_trees += n_trees
        print(f"{f.stem}\t{n_tokens}\t{n_trees}")
    print(f"total\t{total_tokens}\t{total_trees}")
    print()
    print("Tokens repartition:")
    with open(corpus_path / "split.json") as in_stream:
...
...
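For readers who want to try the per-file and total counts from this hunk without the `conllu` package, here is a minimal standard-library sketch. It is not the code from this commit: `count_conllu`, `format_stats`, and the `SAMPLE` data are hypothetical names and made-up inputs, and the sketch assumes that every non-comment, non-blank line counts as one token and that blank lines separate trees.

```python
import io
from typing import Dict, Iterable, Tuple


def count_conllu(stream: Iterable[str]) -> Tuple[int, int]:
    """Return (n_tokens, n_trees) for one CoNLL-U-like stream.

    Assumption: each non-comment, non-blank line is one token, and a
    blank line ends the current tree.
    """
    n_tokens = 0
    n_trees = 0
    in_tree = False
    for line in stream:
        if not line.strip():
            # A blank line closes the current sentence block.
            in_tree = False
            continue
        if line.startswith("#"):
            # Metadata comments such as "# sent_id = ..." are not tokens.
            continue
        if not in_tree:
            # First token line of a new block: count a new tree.
            n_trees += 1
            in_tree = True
        n_tokens += 1
    return n_tokens, n_trees


def format_stats(per_file: Dict[str, Tuple[int, int]]) -> str:
    """Build the tab-separated report, ending with a running-total row."""
    rows = ["file\ttokens\ttrees"]
    total_tokens = 0
    total_trees = 0
    for name, (n_tokens, n_trees) in per_file.items():
        total_tokens += n_tokens
        total_trees += n_trees
        rows.append(f"{name}\t{n_tokens}\t{n_trees}")
    rows.append(f"total\t{total_tokens}\t{total_trees}")
    return "\n".join(rows)


# Made-up sample inputs, for illustration only (not corpus data).
SAMPLE = {
    "fileA": "# sent_id = a1\n1\tw1\n2\tw2\n\n# sent_id = a2\n1\tw3\n\n",
    "fileB": "# sent_id = b1\n1\tw4\n2\tw5\n3\tw6\n\n",
}

per_file = {name: count_conllu(io.StringIO(text)) for name, text in SAMPLE.items()}
report = format_stats(per_file)
print(report)
```

Keeping the counting and the report formatting in separate functions makes the totals easy to check in isolation; the actual script prints as it iterates over the files instead.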