# Cleaning options

In addition to [segmenting pages into articles and encoding them](Encoding) into XML-TEI, `soprano` has several options to clean their contents.

## `charWidth`

## `keep`

The comma-separated list of [block types](Home#output-format) to keep at the end of the [filtering](#block-characterization-and-filtering) step when the [fixes](https://gitlab.huma-num.fr/disco-lge/soprano/-/blob/main/README.md#command-line-arguments) are enabled. All blocks are kept by default, so when this option isn't set its value is `Header,Footer,Text,Caption,Special`.

## `lineHeight`

The expected height of a single line, in (document) pixels. This value is used at the end of the [block characterization](#block-characterization-and-filtering) step when checking whether a gap between blocks is caused by a single-column or a full-page figure. The default value is 50 pixels.

## `noise`

This option (`-n` or `--noise`) expects as argument the path to a UTF-8-encoded file containing one instance of each character that you know cannot occur in the corpus and must therefore be OCR noise. The file must contain only those characters: don't add newlines or spaces. Although these are seldom represented directly in an ALTO file (they instead end a `<String/>` or a `<TextLine/>`), any occurrence of them in the text would be deleted.
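The effect of the noise file can be sketched in a few lines of Python. This is only an illustration, not `soprano`'s own code: the helper names `load_noise` and `strip_noise` are hypothetical, and the example call uses an in-memory set of characters instead of a real file.

```python
from pathlib import Path


def load_noise(path):
    """Read the noise file: every character in it is treated as noise."""
    return set(Path(path).read_text(encoding="utf-8"))


def strip_noise(text, noise):
    """Drop every character of `text` that belongs to the noise set."""
    return "".join(ch for ch in text if ch not in noise)


# A stray newline in the file would itself become a noise character,
# which is why the file must hold nothing but the unwanted characters.
noise = set("§¤\n")
cleaned = strip_noise("ABB§AYE\n¤", noise)
```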

## `scoriaThreshold`

During the [block level filtering](#block-characterization-and-filtering), the contents are scanned to automatically remove possible scoriae, in addition to the ones manually identified and referenced in a CSV file with the above [`scoriae`](#scoriae) option. This parameter sets the confidence level below which words are considered to be scoriae. Word confidence scores are real numbers between 0 and 1. Setting this option to 0 effectively disables it, since no word has a word confidence strictly lower than 0. Setting it to any value strictly greater than 1 makes every word be treated as a scoria and hence clears the whole document.
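The two boundary cases follow directly from the strict comparison. A minimal Python sketch, assuming a hypothetical `is_scoria` predicate that is not part of `soprano`:

```python
def is_scoria(word_confidence, threshold):
    """A word is a scoria when its confidence is strictly below the threshold."""
    return word_confidence < threshold


confidences = [0.0, 0.35, 0.9, 1.0]

# Threshold 0: nothing is strictly below 0, so the filter is disabled.
flagged_at_zero = [wc for wc in confidences if is_scoria(wc, 0)]

# Threshold above 1: every possible confidence is flagged.
flagged_above_one = [wc for wc in confidences if is_scoria(wc, 1.5)]
```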

# Cleaning steps

When processing files, `soprano` applies a number of fixes to them. The [relevant code](https://gitlab.huma-num.fr/disco-lge/soprano/-/blob/main/lib/Step.hs#L33) is the most exact reference, but here is the big picture of what happens, in order.

Some other fixes are still applied later on, but devising this layout makes it possible to get the correct reading order for the page: it simply corresponds to the depth-first traversal of the previous tree.
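To make the link between the layout tree and the reading order concrete, here is a small Python sketch. The `Node` type and the sample page are hypothetical stand-ins for illustration; they do not mirror the data structures actually used in `soprano`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical layout node: a leaf block or a container of children."""
    name: str
    children: list = field(default_factory=list)


def reading_order(node):
    """Depth-first traversal, children in order, returning the leaf names."""
    if not node.children:
        return [node.name]
    order = []
    for child in node.children:
        order.extend(reading_order(child))
    return order


# A page made of a header, a two-column body and a footer.
page = Node("page", [
    Node("header"),
    Node("body", [
        Node("left column", [Node("block 1"), Node("block 2")]),
        Node("right column", [Node("block 3")]),
    ]),
    Node("footer"),
])
order = reading_order(page)
```

With this sample page, the left column is exhausted before the right one, which is the expected behaviour for a two-column body.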

## Block characterization and filtering

With the information just inferred about their relative positions, in addition to what is already known about their own contents, each block on the page is given one of the [different possible block types](Home#output-format).

The page is supposed to have a vertical layout by default (that is, to be a column as in the previous example). When this is the case, the best candidate lines to be a header and a footer are identified (each one optional), all blocks in them are deemed to be of the corresponding type (`Header` or `Footer`), all the elements outside these two lines are labeled `Special`, and the ones between them are then processed as the body of the page. When the page doesn't have a vertical layout (be it a horizontal one or simply one element), the elements are processed as the body.

A second scoriae filtering is performed at this step, only on the `Special` rows and on the body (the `Header` and `Footer` rows are exempt). This one is based on the confidence index returned by the OCR (the word confidence `WC` attribute on `<String/>` elements) and the [`scoriaThreshold`](#scoriathreshold) parameter. When all the `String` elements in a `TextBlock` have a word confidence below the `scoriaThreshold`, the whole block is deleted. Requiring every word in a block to have a low confidence avoids false positives: the rule was devised to target spots on the page, which typically yield short blocks with low word confidence scores because they map badly to readable characters. It is quite conservative and won't catch all your real scoriae, but it can be used as a preprocessing step, in conjunction with the `--fixed` step and no `--scoriae` CSV file, to pass a cleaner input to the human annotator.
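The block-level rule can be sketched as follows in Python. The dictionary layout (`id`, `row`, `wc`) and the function name `filter_blocks` are assumptions made for this example only, not `soprano`'s internal representation.

```python
def filter_blocks(blocks, threshold):
    """Drop a block only when every word in it is below the threshold."""
    kept = []
    for block in blocks:
        # Header and Footer rows are exempt from this filtering.
        exempt = block["row"] in ("Header", "Footer")
        # Note: a block without any word is vacuously "all low" in this sketch.
        all_low = all(wc < threshold for wc in block["wc"])
        if exempt or not all_low:
            kept.append(block)
    return kept


blocks = [
    {"id": "title", "row": "Header", "wc": [0.2, 0.1]},      # exempt row
    {"id": "spot", "row": "Body", "wc": [0.1, 0.3]},         # all words low
    {"id": "para", "row": "Body", "wc": [0.2, 0.95, 0.9]},   # one low word only
]
kept_ids = [block["id"] for block in filter_blocks(blocks, 0.5)]
```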

![A scoria at the left bottom of T1 p56](media/scorie.png)

The body of pages generally has a horizontal layout, formed of the two columns of text. But when a figure spans the full width of the page, the layout becomes vertical and the reading order is modified: both columns above the figure must be read before proceeding to the columns underneath, as in this excerpt from the `ABBAYE` article:

![Fac-simile from the ABBAYE article on T1 p72](media/abbaye.png)

When this situation occurs, the layout algorithm explained above groups blocks that way by default. Unfortunately, there are also «regular» columns that get split into a vertical layout because they contain single-column figures that occur side by side:

![Fac-simile from T1 p58](media/normal_2_columns.png)

This is why, before characterizing the blocks in the body, an attempt is made to merge rows in a vertical layout when their bottoms or tops aren't aligned, or when there isn't enough room for a figure in the gap between them. The key parameter that controls this behaviour is [`lineHeight`](#lineheight), which sets the expected size of a line in pixels. A gap of at least 3 rows is necessary for the vertical layout to be kept, and the alignment of blocks is enforced by checking that their vertical offsets remain within the `lineHeight` value.
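The decision described above can be approximated by the following Python sketch. It rests on two assumptions that the text does not confirm: the minimum gap is read as three times `lineHeight`, and alignment is checked on the edges facing the gap. The function name and its parameters are hypothetical.

```python
def keep_vertical_split(upper_bottoms, lower_tops, line_height=50, min_lines=3):
    """Decide whether two stacked rows stay separate (True) or get merged (False).

    upper_bottoms: bottom y-coordinates of the columns in the upper row
    lower_tops:    top y-coordinates of the columns in the lower row
    """
    # Alignment: the edges facing the gap must agree within one line height.
    bottoms_aligned = max(upper_bottoms) - min(upper_bottoms) <= line_height
    tops_aligned = max(lower_tops) - min(lower_tops) <= line_height
    # Room for a figure: assumed here to mean at least `min_lines` line heights.
    gap = min(lower_tops) - max(upper_bottoms)
    return bottoms_aligned and tops_aligned and gap >= min_lines * line_height


# Full-width figure: aligned edges and a large gap, the split is kept.
full_width = keep_vertical_split([1000, 1010], [1400, 1395])
# Side-by-side single-column figures: misaligned edges, the rows are merged.
side_by_side = keep_vertical_split([1000, 1300], [1500, 1800])
# Aligned edges but a gap too small for a figure, the rows are merged.
too_narrow = keep_vertical_split([1000, 1005], [1080, 1085])
```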

Finally, the remaining rows (only one if there was no gap or if the rows got merged) are labeled recursively with a [block type](Home#output-format). Broadly speaking, blocks that fill their parent (aligned left or right for a lone element, to the top or the bottom for a column inside a row) are labeled `Text`, the ones that span across the horizontal middle of the page are labeled `Caption`, and pretty much everything else that doesn't seem to fit properly in the expected layout is labeled `Special`.

The tree-shaped layout is then undone by a depth-first traversal to get the blocks in the expected reading order. When the [`--keep`](#keep) option is used, only blocks of the types passed are kept; the others are simply discarded from the output.
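As an illustration of this last step, the Python sketch below filters an already ordered list of blocks by type. The `(type, text)` pairs and the helpers `parse_keep` and `filter_by_type` are hypothetical; only the default value `Header,Footer,Text,Caption,Special` comes from the option's description.

```python
DEFAULT_KEEP = "Header,Footer,Text,Caption,Special"


def parse_keep(value=DEFAULT_KEEP):
    """Turn the comma-separated option value into a set of block types."""
    return {name.strip() for name in value.split(",") if name.strip()}


def filter_by_type(blocks, keep=DEFAULT_KEEP):
    """Keep only the (type, text) pairs whose type is listed, in reading order."""
    kept_types = parse_keep(keep)
    return [(kind, text) for kind, text in blocks if kind in kept_types]


# Blocks as they would come out of the traversal, in reading order.
blocks = [
    ("Header", "ABB"),
    ("Text", "ABBAYE, s. f."),
    ("Caption", "Fig. 1"),
    ("Special", "§"),
]
body_only = filter_by_type(blocks, "Text,Caption")
everything = filter_by_type(blocks)
```

Filtering after the traversal, rather than before, keeps the relative order of the surviving blocks unchanged.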