# soprano
Soprano is a multipurpose command-line tool that can be used for tasks as varied as scraping text from the articles of an encyclopedia or fixing the reading order of a page. This behaviour is obtained by combining options that enable processing steps and select an output format.

## Processing Steps
`soprano` was designed as the link between the OCRized ALTO files of the encyclopedia and the TEI-encoded articles that are to be morphosyntactically annotated in the processing pipeline of project [DISCO-LGE](https://www.collexpersee.eu/projet/disco-lge/). It can also be used together with [chaoui](https://gitlab.huma-num.fr/alicebrenon/chaoui) to clean up pages and process specific parts of the corpus when working on encoding improvements. To do so, pass one of the following (mutually exclusive) options to tell it at which step to stop:

- `--raw` is the very first processing step and leaves the content of the input file(s) unchanged. It is however not equivalent to a mere `cat` of the document: the XML is completely parsed before being re-encoded for output, so the result may differ if the input file doesn't follow the same conventions for spacing inside XML tags or for newlines. Note that, combined with the [`--text` flag](#output-format), this step is very far from the identity function and can be used to simply extract raw text from the pages. This step is used by default.
- `--fixed` enables `soprano` to [clean](Cleaning) the files, for instance by removing autodetected or manually annotated scoriae, or by reordering the content. Which operations are enabled can be finely tuned by passing the appropriate `--noise`, `--scoriae` or `--scoriaThreshold` options. The output of this step is still whole ALTO pages.
- `--articles` applies the fixes too but then proceeds to cut the flow of pages into separate articles. When used on several pages, make sure they are sorted in the right order; otherwise the article at the end of a page will be merged with a different one at the beginning of the following page.
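Because `--articles` depends on the order of the pages, it can help to sort the input files numerically before handing them over. The following is only a sketch: the file names are hypothetical, and how the sorted list is then passed to `soprano` is not covered here. It shows a natural sort, which avoids the plain alphabetical ordering where `page_10.xml` comes before `page_2.xml`.

```python
import re


def natural_key(name: str):
    """Split a file name into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical page file names, in the order a directory listing might return them.
pages = ["page_10.xml", "page_2.xml", "page_1.xml", "page_11.xml"]

ordered_pages = sorted(pages, key=natural_key)
print(ordered_pages)
# ['page_1.xml', 'page_2.xml', 'page_10.xml', 'page_11.xml']
```

A plain `sorted(pages)` would place `page_10.xml` right after `page_1.xml`, which is exactly the kind of misordering that would join the end of one page to the wrong article.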
## Output format
By default, `soprano` outputs XML. Depending on the step you ask it to reach, the files will be ALTO-XML or XML-TEI. ALTO is the format used to represent whole pages, and is the expected input format. As long as the output is made of pages (the default step `--raw` and the next one, `--fixed`), it will be in ALTO too. The last step, `--articles`, is designed to encode the articles in XML-TEI.

With the `--text` flag, you can instead make `soprano` output text. This can be used to dump pages or articles as raw text. If enabled, all content (page headers and footers, figure captions) will appear in the linear order in which it appears in the document, meaning that if there's a figure in the middle of a column, its caption will occur between two normally contiguous words in the text.
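To illustrate why a linear dump interleaves captions with running text, here is a minimal sketch that is not `soprano`'s implementation. It assumes a simplified, namespace-free ALTO-like fragment (real ALTO files declare a namespace and carry coordinates) and simply joins the `CONTENT` attribute of every `String` element in document order.

```python
import xml.etree.ElementTree as ET

# Simplified ALTO-like fragment (an assumption for illustration only):
# the middle block stands for a figure caption sitting inside a column.
ALTO_SNIPPET = """
<PrintSpace>
  <TextBlock ID="b1">
    <TextLine><String CONTENT="The"/><String CONTENT="river"/></TextLine>
  </TextBlock>
  <TextBlock ID="b2">
    <TextLine><String CONTENT="Fig."/><String CONTENT="3"/></TextLine>
  </TextBlock>
  <TextBlock ID="b3">
    <TextLine><String CONTENT="flows"/><String CONTENT="north."/></TextLine>
  </TextBlock>
</PrintSpace>
"""


def linear_text(alto_xml: str) -> str:
    """Join the CONTENT of every String element, in document order."""
    root = ET.fromstring(alto_xml)
    return " ".join(s.get("CONTENT", "") for s in root.iter("String"))


# The caption lands between two words that belong together.
print(linear_text(ALTO_SNIPPET))  # The river Fig. 3 flows north.
```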

This flag is hence more useful in combination with the `--keep` option, which lets you select only certain types of (ALTO) blocks for the output. ALTO has a geometrical view of pages, grouping words (`String`) into lines (`TextLine`) and lines into «blocks» (`TextBlock`). In its attempt to fix the order of the pages, `soprano` characterizes each of those blocks according to its position and contents (after first dividing some of them to fix OLR errors). All block types are kept by default and the possible labels are:
- Header: peritext elements that appear at the top of pages, usually the page number and the head words of the first and last articles occurring on the page
- Footer: peritext elements that occur periodically for bookbinding reasons, numbering the folded physical sheets of paper used to make the pages
- Text: the «regular» text that columns are made of, carrying the linear content of articles
- Caption: isolated lines occurring underneath figures
- Special: pretty much everything else, including figures, blocks found above the header or below the footer, and any unexpected content that doesn't fit in the detected page layout

By using `--keep` while in `--text` mode, you can for instance retrieve only the text of an article without the captions of the figures that appear within it, or without all the peritext elements that feature in the pages, or only the page headers, etc. Please note that if you remove `Text` blocks, there will be no articles in the output flow, so when used in conjunction with `--articles`, all the output will appear in a single article named `article_1`, with the extension matching the selected output format.
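The effect of selecting block types can be sketched as a simple filter over labelled blocks. This is an illustration only: the `(label, text)` representation, the `keep_blocks` helper and the sample page are hypothetical, and the exact syntax of the `--keep` option is not described here. Only the five label names come from the list above.

```python
from typing import Iterable, List, Set, Tuple

# The five block labels listed above.
LABELS = {"Header", "Footer", "Text", "Caption", "Special"}

Block = Tuple[str, str]  # (label, text content); hypothetical representation


def keep_blocks(blocks: Iterable[Block], keep: Set[str]) -> List[str]:
    """Return the text of the blocks whose label is in `keep`, in document order."""
    unknown = keep - LABELS
    if unknown:
        raise ValueError(f"unknown block label(s): {sorted(unknown)}")
    return [text for label, text in blocks if label in keep]


# Hypothetical page: a caption sits in the middle of the running text.
page = [
    ("Header", "412 RIVER"),
    ("Text", "The river"),
    ("Caption", "Fig. 3"),
    ("Text", "flows north."),
    ("Footer", "27"),
]

print(" ".join(keep_blocks(page, {"Text"})))    # The river flows north.
print(" ".join(keep_blocks(page, {"Header"})))  # 412 RIVER
```

Keeping only `Text` restores the continuity of the sentence that the caption interrupted, while keeping only `Header` yields the page headers alone.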