... | ... | @@ -2,6 +2,8 @@ |
|
|
|
|
|
Soprano is a multipurpose command-line tool that can be used for tasks as varied as scraping text from the articles of an encyclopedia or fixing the reading order of a page. Its behaviour is set by combining options that enable processing steps and select an output format.
|
|
|
|
|
|
It operates on XML-ALTO files, whose paths can be written one per line on its standard input or, for a single file, passed as its only argument. When it is run on several files, the order of the input matters: `soprano` processes the content in exactly that order.
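
Because the input order is significant, it can help to build the list of files explicitly rather than rely on the shell's lexicographic order, which puts `page10.xml` before `page2.xml`. The following is only a minimal sketch of such a helper, not part of `soprano`: the function names, the `*.xml` pattern and the assumption that page numbers appear in the file names are all hypothetical.

```python
import re
import sys
from pathlib import Path


def natural_key(path):
    """Split a file name into text and integer chunks so that page2 sorts before page10."""
    # re.split with a capturing group alternates text and digit chunks,
    # so chunks at the same position always have comparable types.
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", Path(path).name)]


def ordered_inputs(directory, pattern="*.xml"):
    """Return the paths matching `pattern` in `directory`, in natural page order."""
    # Assumption: the ALTO files sit directly in `directory` and end in .xml.
    return sorted((str(p) for p in Path(directory).glob(pattern)), key=natural_key)


if __name__ == "__main__":
    # Print one path per line, the form soprano reads on its standard input.
    for path in ordered_inputs(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

The output of this script can then be piped into `soprano`, together with whichever options the run requires.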
|
|
|
|
|
|
## Processing Steps
|
|
|
|
|
|
`soprano` was designed as the link between the OCRed ALTO files of the encyclopedia and the TEI-encoded articles that are then morphosyntactically annotated in the processing pipeline of the [DISCO-LGE](https://www.collexpersee.eu/projet/disco-lge/) project. It can also be used together with [chaoui](https://gitlab.huma-num.fr/alicebrenon/chaoui) to clean up the pages and to process specific parts of the corpus when working on encoding improvements. To do so, pass one of the following (mutually exclusive) options to tell it at which step to stop:
|
... | ... | |