# Cleaning features
# Cleaning options
In addition to segmenting pages into articles and encoding them into XML-TEI, `soprano` has several options to clean their contents.
## `charWidth`
## `lineHeight`
## `noise`
This option (`-n` or `--noise`) expects as argument the path to a UTF-8-encoded file containing one instance of each character that you know cannot occur in the corpus and must therefore be OCR noise. Put only those characters in the file: do not add newlines or spaces. Although spaces and newlines are seldom represented directly in an ALTO file (they instead mark the end of a `<String/>` or a `<TextLine/>`), any occurrence of them in the text would be deleted.
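As an illustration of the file format described above, here is a minimal Python sketch. It is not `soprano`'s own code (which is written in Haskell). The helper names `write_noise_file`, `load_noise` and `strip_noise` are hypothetical, as are the sample noise characters. It writes each noise character once, with no trailing newline, then shows how every character of such a file would act as a filter.

```python
from pathlib import Path
import tempfile


def write_noise_file(path, noise_chars):
    """Write each distinct character once, UTF-8 encoded, with no trailing newline."""
    unique = "".join(dict.fromkeys(noise_chars))  # deduplicate, keep first-seen order
    Path(path).write_bytes(unique.encode("utf-8"))


def load_noise(path):
    """Every character found in the file counts as noise (a stray newline would too)."""
    return set(Path(path).read_text(encoding="utf-8"))


def strip_noise(text, noise):
    """Drop every character of `text` that belongs to the noise set."""
    return "".join(c for c in text if c not in noise)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        noise_path = Path(tmp) / "noise.txt"
        write_noise_file(noise_path, "¤§¤")  # hypothetical noise characters
        print(strip_noise("enc¤yclo§pédie", load_noise(noise_path)))
```

Writing the file from a script rather than a text editor is one way to avoid the trailing newline that many editors append silently.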
## `scoriae`
This option (`-s` or `--scoriae`) expects as argument the path to a CSV file containing a single column with the IDs of the `<String/>` elements to remove, as exported by [chaoui](https://gitlab.huma-num.fr/alicebrenon/chaoui). By convention the column header is `ID` in the files generated by `chaoui`, but the first line is dropped during loading, so you can give it any name you wish. Make sure there is a header line though, or the first `ID` will simply be ignored.
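The loading behaviour described above can be mimicked with a short Python sketch. It only reproduces the documented behaviour (first line dropped, one ID per remaining line) and is not `soprano`'s actual loader. The function name `load_scoriae` and the sample IDs are hypothetical.

```python
import csv
import io


def load_scoriae(lines):
    """Collect the IDs of a single-column CSV, unconditionally dropping the first line."""
    reader = csv.reader(lines)
    next(reader, None)  # the first line is skipped, whatever it contains
    return {row[0] for row in reader if row}


if __name__ == "__main__":
    # Hypothetical IDs, only meant to show the shape of the file.
    with_header = io.StringIO("ID\nPAG_1_ST000012\nPAG_1_ST000047\n")
    without_header = io.StringIO("PAG_1_ST000012\nPAG_1_ST000047\n")
    print(sorted(load_scoriae(with_header)))
    print(sorted(load_scoriae(without_header)))  # the first ID is lost
```

The second call shows why a header line matters: without one, the first real ID is consumed as if it were the header.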
## `scoriaThreshold`
# Cleaning steps
When processing files, `soprano` applies a number of fixes to them. The [relevant code](https://gitlab.huma-num.fr/disco-lge/soprano/-/blob/main/lib/Step.hs#L33) is of course the most exact reference, but here is the big picture of what happens, in order.

All the "interesting" content for our purpose is located under the `<PrintSpace/>` element, so this is where everything that follows happens.
## Skimming
The `<TextBlock/>` and `<TextLine/>` subtrees of `<PrintSpace/>` are traversed recursively.

Space elements `<SP/>` are removed outright (in the LGE corpus, they contain no relevant information: their `HPOS`, `VPOS` and `WIDTH` attributes are all set to `0`).

Word elements `<String/>` are filtered: those whose `ID` attribute is listed in the provided [`scoriae`](#scoriae) CSV file are deleted; the content of all the others is filtered to remove the characters appearing in the [`noise`](#noise) file, and they are deleted too if they become empty in the process.

All the other kinds of elements (in the LGE corpus, this amounts only to `<Illustration/>`) are left untouched. The traversed `<TextBlock/>` and `<TextLine/>` elements are then removed when they have no content left after applying those skimming rules.
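The skimming rules above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is an illustration, not a port of `soprano`'s Haskell code. It assumes that the text of a `<String/>` lives in its `CONTENT` attribute (as in ALTO) and that "no content left" means "no child element left". The function names, the sample IDs and the sample tree are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a possible '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def skim(node, scoriae, noise):
    """Apply the skimming rules, in place, to the children of `node`."""
    for child in list(node):  # iterate over a copy, since children get removed
        name = local(child.tag)
        if name == "SP":
            node.remove(child)
        elif name == "String":
            if child.get("ID") in scoriae:
                node.remove(child)
                continue
            content = "".join(c for c in child.get("CONTENT", "") if c not in noise)
            if content:
                child.set("CONTENT", content)
            else:
                node.remove(child)  # nothing but noise: drop the word
        elif name in ("TextBlock", "TextLine"):
            skim(child, scoriae, noise)
            if len(child) == 0:  # assumption: "no content" means no child left
                node.remove(child)
        # any other element (e.g. Illustration) is left untouched


SAMPLE = """<PrintSpace>
  <TextBlock ID="B1">
    <TextLine ID="L1">
      <String ID="S1" CONTENT="enc¤yclopédie"/><SP/><String ID="S2" CONTENT="§"/>
    </TextLine>
    <TextLine ID="L2"><String ID="S3" CONTENT="tache"/></TextLine>
  </TextBlock>
  <Illustration ID="I1"/>
</PrintSpace>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    skim(root, scoriae={"S3"}, noise={"¤", "§"})
    print(ET.tostring(root, encoding="unicode"))
```

In this sample, `S2` disappears because it contains only noise, `S3` because it is listed as a scoria, and `L2` because it ends up empty, while the `<Illustration/>` element is kept as is.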
## Geometry checks
## OLR fixes