... | ... | @@ -34,6 +34,12 @@ All the other kinds of elements (in the LGE corpus, this amounts only to `<Illus |
|
|
|
|
|
## Geometry checks
|
|
|
|
|
|
ALTO elements carry geometrical information about their position on the page: `HPOS` and `VPOS` are the Cartesian coordinates of the element's upper-left corner, while `WIDTH` and `HEIGHT` are its horizontal and vertical dimensions, respectively. All positions are expressed in the same coordinate system, whose origin is the upper-left corner of the whole page, with `HPOS` increasing to the right and `VPOS` increasing downwards.
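
To make the coordinate convention concrete, here is a minimal Python sketch that turns the four attributes into the edges of a bounding box. It is only an illustration and not taken from `soprano`: the `Box` type and the `box_of` helper are hypothetical names, and the attribute values are made up.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Box(NamedTuple):
    """Edges of an element, in page coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float


def box_of(element: ET.Element) -> Box:
    """Read HPOS, VPOS, WIDTH and HEIGHT and return the four edges."""
    hpos = float(element.get("HPOS"))
    vpos = float(element.get("VPOS"))
    width = float(element.get("WIDTH"))
    height = float(element.get("HEIGHT"))
    # HPOS grows to the right and VPOS grows downwards, so the opposite
    # corner is obtained by adding the dimensions to the upper-left corner.
    return Box(hpos, vpos, hpos + width, vpos + height)


line = ET.fromstring('<TextLine HPOS="100" VPOS="250" WIDTH="400" HEIGHT="30"/>')
print(box_of(line))  # Box(left=100.0, top=250.0, right=500.0, bottom=280.0)
```

Because `VPOS` increases downwards, the bottom edge has a larger coordinate than the top edge.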
|
|
|
|
|
|
In the LGE corpus for which `soprano` was developed, some elements are not perfectly adjusted to their contents: for instance, a `<TextLine/>` element may have an `HPOS` and a `VPOS` strictly lower than those of all its children, or a `WIDTH` larger than needed to accommodate them.
|
|
|
|
|
|
This issue occurs naturally in the files before any modification, but the previous skimming step changes the content of some blocks, and hence their geometry, which makes further occurrences all the more likely. All the `<TextBlock/>` and `<TextLine/>` elements under the `<PrintSpace/>` in focus are therefore traversed recursively to fit each block to its contents. Note that `<String/>` elements are left untouched: their dimensions are not recomputed, even if some noise is removed from them.
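
The traversal described above can be sketched as follows. This is an illustrative Python version written under assumptions, not the actual `soprano` code: the `fit` and `local_name` helpers are hypothetical, children lacking any of the four attributes are ignored, and containers with no measurable child are left as they are (the source does not say how such cases are handled).

```python
import xml.etree.ElementTree as ET

CONTAINERS = {"TextBlock", "TextLine"}  # elements that get refitted
GEOMETRY = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def local_name(element):
    """Tag name without its XML namespace, if any."""
    return element.tag.rsplit("}", 1)[-1]


def number(value):
    """Write whole values without a trailing '.0'."""
    return str(int(value)) if value.is_integer() else str(value)


def fit(element):
    """Refit containers bottom-up so each one tightly wraps its children."""
    for child in element:
        fit(child)  # children first: a block depends on its fitted lines
    if local_name(element) not in CONTAINERS:
        return  # <String/> and other elements are left untouched
    boxes = [
        [float(child.get(name)) for name in GEOMETRY]
        for child in element
        if all(name in child.attrib for name in GEOMETRY)
    ]
    if not boxes:
        return  # assumption: nothing to measure, keep the geometry as is
    left = min(h for h, v, w, ht in boxes)
    top = min(v for h, v, w, ht in boxes)
    right = max(h + w for h, v, w, ht in boxes)
    bottom = max(v + ht for h, v, w, ht in boxes)
    element.set("HPOS", number(left))
    element.set("VPOS", number(top))
    element.set("WIDTH", number(right - left))
    element.set("HEIGHT", number(bottom - top))


print_space = ET.fromstring("""
<PrintSpace>
  <TextBlock HPOS="0" VPOS="0" WIDTH="1000" HEIGHT="500">
    <TextLine HPOS="90" VPOS="190" WIDTH="500" HEIGHT="60">
      <String HPOS="100" VPOS="200" WIDTH="150" HEIGHT="40" CONTENT="Geometry"/>
      <String HPOS="270" VPOS="205" WIDTH="120" HEIGHT="35" CONTENT="checks"/>
    </TextLine>
  </TextBlock>
</PrintSpace>
""")
fit(print_space)
text_line = print_space.find("TextBlock/TextLine")
print(text_line.attrib)
```

In this made-up example the `<TextLine/>` shrinks to `HPOS="100"`, `VPOS="200"`, `WIDTH="290"`, `HEIGHT="40"`, and the enclosing `<TextBlock/>` then takes the same box since the line is its only child. Processing children before their parent is what lets a single pass fit both levels.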
|
|
|
|
|
|
## OLR fixes
|
|
|
|
|
|
## Layout
|
... | ... | |