  • ecrinum/anthologia/htr_cpgr23
Commits on Source (4)
Showing
with 445 additions and 9089 deletions
# Codex palatinus graecus 23
Codex palatinus graecus 23 - Ground Truth Dataset Medieval Greek Manuscripts
============================================================================
......@@ -7,7 +8,7 @@ Codex palatinus graecus 23 - Ground Truth Dataset Medieval Greek Manuscripts
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.123.svg)](https://doi.org/10.5281/zenodo.123)
-->
Dataset of HTR ground truth for the Codex palatinus graecus 23 (Palatine Anthology), byzantine writing from the X^th^ century.
Dataset of HTR ground truth for the Codex palatinus graecus 23 (Palatine Anthology), Byzantine writing from the X<sup>th</sup> century.
## License
......@@ -15,6 +16,9 @@ This work is licensed under CC BY 4.0. To view a copy of this license, visit htt
## Dataset description
The model was trained from the ground truth produced by the Canada Research Chair on Digital Textualities, as part of the [Anthologia graeca project](https://anthologiagraeca.org/). We focused our ground truth on 50 pages (143-195) and did finetuning on 20 extra pages (196-215).
## Transcription guidelines
This dataset was produced by the Canada Research Chair on Digital Textualities, as part of the [Anthologia graeca project](https://anthologiagraeca.org/).
A first batch of 50 pages (143-195) was initially transcribed to train a transcription model prototype. We then added 20 pages (196-215) to produce the first version of a transcription model for Greek manuscripts. The transcription of these 70 pages can be found in `data/CPgr23`.
......@@ -24,7 +28,35 @@ A first batch of 50 pages (143-195) was initially transcribed to train a transc
<!-- to be completed -->
<!-- remember to illustrate! :) -->
To come.
### Interlinear Addition
Interlinear additions are transcribed when they form complete words, abbreviations or obvious corrections.
### Character Standardization
Sigma is always transcribed "σ" and never "ς". This decision reflects the scribe's practice of using exclusively "σ".
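
As a minimal sketch of this rule, assuming transcriptions are handled as plain Python strings, a text that still contains final sigma could be brought in line with the guideline as follows. The function name `normalize_sigma` is hypothetical and is not part of the dataset's tooling.

```python
FINAL_SIGMA = "\u03c2"   # ς (GREEK SMALL LETTER FINAL SIGMA)
MEDIAL_SIGMA = "\u03c3"  # σ (GREEK SMALL LETTER SIGMA)


def normalize_sigma(text: str) -> str:
    """Replace every final sigma with the medial form, as the guideline requires."""
    return text.replace(FINAL_SIGMA, MEDIAL_SIGMA)


# Illustrative input only; any string containing a final sigma works.
print(normalize_sigma("λόγο" + FINAL_SIGMA))
```

Running the same function over an existing transcription is also a quick way to check that no "ς" slipped in: the output should be identical to the input.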
### Symbols
Every significant symbol was assigned a Unicode symbol using a correspondence table (see [table_CPgr23.csv](./data/CPgr23/table_CPgr23.csv) for every character used in our transcription).
These symbols have been chosen to be as similar as possible to the symbols used in the manuscript.
| Context                                | Symbol | Unicode hexadecimal |
|:--------------------------------------:|:------:|:-------------------:|
| Beginning of an epigram                | ⁛      | U+205B              |
| End of an epigram                      | ⋇      | U+22C7              |
| Beginning of a scholium in the margin  | ∻      | U+223B              |
| Beginning of a scholium in main text   | ※      | U+203B              |
| End of a scholium                      | \~     | U+301C              |
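
The code points in the table can be checked against the Unicode character database. The sketch below is only an illustration: the dictionary keys and the `describe` helper are hypothetical names, and the values are the hexadecimal code points listed in the table (the last row is taken at its listed value, U+301C).

```python
import unicodedata

# Code points copied from the table above; the keys are illustrative labels.
MARKERS = {
    "epigram_start": 0x205B,
    "epigram_end": 0x22C7,
    "scholium_margin_start": 0x223B,
    "scholium_main_start": 0x203B,
    "scholium_end": 0x301C,
}


def describe(code_point: int) -> str:
    """Return the code point, the character and its official Unicode name."""
    char = chr(code_point)
    return f"U+{code_point:04X} {char} {unicodedata.name(char)}"


for context, code_point in MARKERS.items():
    print(f"{context}: {describe(code_point)}")
```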
### Punctuation
Every punctuation sign has been simplified and transcribed as an interpunct (·).
### Abbreviations
All abbreviations have been transcribed in expanded form: for example, "ȣ" is transcribed "ου" and "ϗ" is transcribed "καί".
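
A minimal sketch of this expansion, assuming the two signs quoted above are the only ones of interest. The mapping is deliberately limited to those two examples, and `expand_abbreviations` is a hypothetical helper, not a function shipped with the dataset.

```python
# Only the two examples given in the guidelines; the manuscript uses more signs.
ABBREVIATIONS = {
    "ȣ": "ου",
    "ϗ": "καί",
}


def expand_abbreviations(text: str) -> str:
    """Replace each abbreviation sign with its expanded form."""
    for sign, expansion in ABBREVIATIONS.items():
        text = text.replace(sign, expansion)
    return text


print(expand_abbreviations("τ" + "ȣ"))
```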
## Model description
......@@ -35,14 +67,20 @@ A transcription model for Greek manuscripts was trained using this dataset. It c
This ground truth is based on images of the codex palatinus graecus 23 digitized by the Universitätsbibliothek Heidelberg (where the first part of the manuscript is kept -- the second one being in the BNF, as Supplementum graecum 384), and then uploaded to eScriptorium using IIIF. Find the manuscript [here](https://doi.org/10.11588/diglit.3449).
## Sources
## Sources
| Place | Library | Signature | Date | Pages transcribed | IIIF Manifest URL |
|-------|---------|-----------|------|-------------------|-------------------|
|Universitätsbibliothek Heidelberg | Bibliotheca Palatina | Cod. Pal. gr. 23 | X<sup>th</sup> century | p. 143-215 | https://digi.ub.uni-heidelberg.de/diglit/iiif3/cpgraec23/manifest |
| Place | Library | Signature | Date | Pages transcribed | IIIF Manifest URL |
|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| Universitätsbibliothek Heidelberg | Bibliotheca Palatina | Cod. Pal. gr. 23 | X<sup>th</sup> century | p. 143-215 | https://digi.ub.uni-heidelberg.de/diglit/iiif3/cpgraec23/manifest |
The training has been done with images of the codex palatinus graecus 23 digitized by the Universitätsbibliothek Heidelberg (where the first part of the manuscript is kept -- the second one being in the BNF, as Supplementum graecum 384), and then uploaded to eScriptorium using IIIF. Find the manuscript [here](https://doi.org/10.11588/diglit.3449).
## Segmentation
The [SegmOnto](https://segmonto.github.io/) ontology was used to classify regions and lines of the manuscript.
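
In the ALTO files, region and line types are declared once as `OtherTag` elements and referenced from each `TextBlock` and `TextLine` through `TAGREFS`. The following sketch, using only Python's standard library, shows one way to resolve those references. The embedded sample is a shortened excerpt modelled on the dataset's files, and `labelled_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Shortened sample modelled on the dataset's ALTO files (geometry omitted).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT11389" LABEL="MarginTextZone"/>
    <OtherTag ID="LT4582" LABEL="DefaultLine"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="eSc_textblock_c37cd51c" TAGREFS="BT11389">
      <TextLine ID="eSc_line_e9aa573c" TAGREFS="LT4582">
        <String CONTENT="τοῦ ἀυτοῦ"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def labelled_lines(alto_xml: str):
    """Return (zone label, line label, text) for every line in an ALTO document."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs such as "BT11389" to their labels.
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind("a:Tags/a:OtherTag", NS)
    }
    rows = []
    for block in root.iterfind(".//a:TextBlock", NS):
        zone = labels.get(block.get("TAGREFS"))
        for line in block.iterfind("a:TextLine", NS):
            line_type = labels.get(line.get("TAGREFS"))
            text = " ".join(
                s.get("CONTENT", "") for s in line.iterfind("a:String", NS)
            )
            rows.append((zone, line_type, text))
    return rows


for row in labelled_lines(SAMPLE):
    print(row)
```

Blocks without any `TextLine` children, which occur in the dataset, simply contribute no rows.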
## How to cite
<!-- copyright related info should go in "Licence"-->
<!--
......@@ -50,7 +88,7 @@ The training has been done with images of the codex palatinus graecus 23 digitiz
This dataset was built and is maintained by Maxime Guénette (@mguenette), Mathilde Verstraete (@mverstraete), Alix Chagué (@achague), Marcello Vitali-Rosati (@marviro). The digitization is not copyright-free, but the transcription is. However, properly annotating a corpus takes time and is a task that should be recognized. If you use any item from this corpus of ground truth, cite the dataset using the following information:
- [ ] Add the Zenodo reference.
- \[ \] Add the Zenodo reference.
-->
### Cite the Model
......
This diff is collapsed.
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.loc.gov/standards/alto/ns-v4#"
xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>174_10005_default.jpg</fileName>
</sourceImageInformation>
</Description>
<Tags>
<OtherTag ID="BT11389" LABEL="MarginTextZone" DESCRIPTION="block type MarginTextZone"/><OtherTag ID="BT11388" LABEL="MainZone" DESCRIPTION="block type MainZone"/><OtherTag ID="BT4" LABEL="Illustration" DESCRIPTION="block type Illustration"/><OtherTag ID="BT3" LABEL="Commentary" DESCRIPTION="block type Commentary"/><OtherTag ID="BT2" LABEL="Main" DESCRIPTION="block type Main"/><OtherTag ID="BT1" LABEL="Title" DESCRIPTION="block type Title"/>
<OtherTag ID="LT4582" LABEL="DefaultLine" DESCRIPTION="line type DefaultLine"/><OtherTag ID="LT4581" LABEL="InterlinearLine" DESCRIPTION="line type InterlinearLine"/>
</Tags>
<Layout>
<Page WIDTH="3269"
HEIGHT="5080"
PHYSICAL_IMG_NR="0"
ID="eSc_dummypage_">
<PrintSpace HPOS="0"
VPOS="0"
WIDTH="3269"
HEIGHT="5080">
<TextBlock HPOS="2300.0"
VPOS="722.0"
WIDTH="483.0"
HEIGHT="167.0"
ID="eSc_textblock_0717adf1"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2783 778 2749 889 2312 889 2300 790 2345 733 2707 722"/></Shape>
<TextLine ID="eSc_line_48cccc6a"
TAGREFS="LT4582"
BASELINE="2317 798 2769 798"
HPOS="2317.0"
VPOS="728.0"
WIDTH="452.0"
HEIGHT="98.0">
<Shape><Polygon POINTS="2732 742 2447 728 2430 730 2402 730 2374 730 2357 730 2340 733 2317 764 2317 798 2317 826 2769 826 2769 798 2760 762 2741 747"/></Shape>
<String CONTENT="τοῦ αὐτοῦ εἰσ τ αυτ"
HPOS="2317.0"
VPOS="728.0"
WIDTH="452.0"
HEIGHT="98.0"></String>
</TextLine>
<TextLine ID="eSc_line_cbad6eb1"
TAGREFS="LT4582"
BASELINE="2323 863 2724 860"
HPOS="2322.0"
VPOS="821.0"
WIDTH="402.0"
HEIGHT="110.0">
<Shape><Polygon POINTS="2323 905 2362 922 2365 922 2368 922 2427 911 2455 908 2487 917 2523 925 2526 925 2529 925 2532 925 2546 920 2571 908 2597 922 2614 931 2616 931 2619 931 2724 928 2724 860 2718 824 2322 821 2323 863"/></Shape>
<String CONTENT="εἰσ τὸ αὐτό : ~"
HPOS="2322.0"
VPOS="821.0"
WIDTH="402.0"
HEIGHT="110.0"></String>
</TextLine>
</TextBlock>
<TextBlock HPOS="2289.0"
VPOS="1128.0"
WIDTH="370.0"
HEIGHT="167.0"
ID="eSc_textblock_c37cd51c"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2596 1128 2659 1174 2659 1250 2625 1295 2312 1295 2289 1185 2345 1128"/></Shape>
<TextLine ID="eSc_line_e9aa573c"
TAGREFS="LT4582"
BASELINE="2303 1199 2616 1193"
HPOS="2300.0"
VPOS="1140.0"
WIDTH="316.0"
HEIGHT="82.0">
<Shape><Polygon POINTS="2300 1168 2303 1199 2303 1222 2614 1216 2616 1193 2611 1148 2460 1147 2328 1140"/></Shape>
<String CONTENT="τοῦ ἀυτοῦ"
HPOS="2300.0"
VPOS="1140.0"
WIDTH="316.0"
HEIGHT="82.0"></String>
</TextLine>
<TextLine ID="eSc_line_6ca59e3c"
TAGREFS="LT4582"
BASELINE="2306 1261 2625 1255"
HPOS="2303.0"
VPOS="1216.0"
WIDTH="324.0"
HEIGHT="107.0">
<Shape><Polygon POINTS="2306 1323 2391 1323 2391 1320 2393 1320 2396 1320 2405 1312 2430 1292 2475 1303 2503 1309 2506 1309 2625 1286 2625 1255 2627 1219 2303 1216 2306 1261"/></Shape>
<String CONTENT="εἰσ τὸ αὐτό"
HPOS="2303.0"
VPOS="1216.0"
WIDTH="324.0"
HEIGHT="107.0"></String>
</TextLine>
</TextBlock>
<TextBlock HPOS="2289.0"
VPOS="1738.0"
WIDTH="370.0"
HEIGHT="223.0"
ID="eSc_textblock_b6d347f4"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2614 1749 2659 1938 2300 1961 2289 1772 2334 1738"/></Shape>
<TextLine ID="eSc_line_2030b96f"
TAGREFS="LT4582"
BASELINE="2297 1792 2614 1797"
HPOS="2295.0"
VPOS="1741.0"
WIDTH="319.0"
HEIGHT="87.0">
<Shape><Polygon POINTS="2608 1755 2580 1746 2551 1746 2523 1744 2492 1752 2489 1752 2464 1741 2441 1741 2430 1749 2424 1752 2317 1746 2297 1763 2297 1792 2295 1809 2608 1828 2614 1797 2614 1772 2608 1755"/></Shape>
<String CONTENT="ἀνάθημα"
HPOS="2295.0"
VPOS="1741.0"
WIDTH="319.0"
HEIGHT="87.0"></String>
</TextLine>
<TextLine ID="eSc_line_53fc7c41"
TAGREFS="LT4582"
BASELINE="2292 1862 2619 1857"
HPOS="2289.0"
VPOS="1814.0"
WIDTH="330.0"
HEIGHT="82.0">
<Shape><Polygon POINTS="2292 1896 2619 1888 2619 1857 2614 1814 2549 1814 2468 1818 2405 1820 2393 1820 2349 1817 2289 1820 2292 1862"/></Shape>
<String CONTENT="τῶ πανὶ πα"
HPOS="2289.0"
VPOS="1814.0"
WIDTH="330.0"
HEIGHT="82.0"></String>
</TextLine>
<TextLine ID="eSc_line_a36f4dfe"
TAGREFS="LT4582"
BASELINE="2306 1921 2647 1916"
HPOS="2303.0"
VPOS="1879.0"
WIDTH="344.0"
HEIGHT="90.0">
<Shape><Polygon POINTS="2628 1879 2303 1882 2306 1921 2306 1969 2647 1958 2647 1916 2645 1910 2642 1899"/></Shape>
<String CONTENT="ρὰ κηπουροῦ"
HPOS="2303.0"
VPOS="1879.0"
WIDTH="344.0"
HEIGHT="90.0"></String>
</TextLine>
</TextBlock>
<TextBlock HPOS="2278.0"
VPOS="2302.0"
WIDTH="403.0"
HEIGHT="122.0"
ID="eSc_textblock_9fee235a"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2625 2314 2681 2390 2602 2424 2300 2413 2278 2336 2345 2302"/></Shape>
<TextLine ID="eSc_line_c65a083e"
TAGREFS="LT4582"
BASELINE="2306 2376 2731 2386"
HPOS="2299.0"
VPOS="2302.0"
WIDTH="427.0"
HEIGHT="115.0">
<Shape><Polygon POINTS="2546 2327 2481 2307 2472 2307 2447 2302 2422 2302 2388 2302 2345 2302 2303 2319 2303 2374 2299 2404 2718 2417 2726 2383 2722 2356 2681 2346 2642 2345"/></Shape>
<String CONTENT="γρ διψῶσαν: ~"
HPOS="2299.0"
VPOS="2302.0"
WIDTH="427.0"
HEIGHT="115.0"></String>
</TextLine>
</TextBlock>
<TextBlock HPOS="1937.0"
VPOS="3046.0"
WIDTH="1125.0"
HEIGHT="267.0"
ID="eSc_textblock_cad1d8cd"
TAGREFS="BT11389">
<Shape><Polygon POINTS="1937 3046 1937 3212 3062 3313 3062 3061"/></Shape>
<TextLine ID="eSc_line_ee4bb94a"
TAGREFS="LT4582"
BASELINE="2003 3190 2566 3237 3030 3297"
HPOS="2001.0"
VPOS="3169.0"
WIDTH="1028.0"
HEIGHT="146.0">
<Shape><Polygon POINTS="2152 3181 2106 3173 2067 3169 2004 3169 2002 3191 2001 3212 2313 3246 3022 3315 3027 3293 3029 3239 2380 3191 2284 3183 2236 3180"/></Shape>
<String CONTENT="λυκό φρ στόρθυξ δεδουπ τὸν κτανόντημυνα:"
HPOS="2001.0"
VPOS="3169.0"
WIDTH="1028.0"
HEIGHT="146.0"></String>
</TextLine>
</TextBlock>
<TextBlock HPOS="2279.0"
VPOS="2733.0"
WIDTH="486.0"
HEIGHT="328.0"
ID="eSc_textblock_70371604"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2602 2743 2636 2909 2681 2946 2670 3033 2765 3061 2279 3061 2279 2733 2491 2739"/></Shape>
</TextBlock>
<TextBlock HPOS="2269.0"
VPOS="3293.0"
WIDTH="392.0"
HEIGHT="260.0"
ID="eSc_textblock_34ef6bd9"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2661 3303 2580 3553 2272 3541 2269 3293"/></Shape>
</TextBlock>
<TextBlock HPOS="730.0"
VPOS="3872.0"
WIDTH="1393.0"
HEIGHT="305.0"
ID="eSc_textblock_2b32c777"
TAGREFS="BT11389">
<Shape><Polygon POINTS="730 3872 730 4177 2123 4171 2123 3872"/></Shape>
</TextBlock>
<TextBlock HPOS="233.0"
VPOS="556.0"
WIDTH="2000.0"
HEIGHT="3310.0"
ID="eSc_textblock_2e55f63b"
TAGREFS="BT11388">
<Shape><Polygon POINTS="361 556 320 1130 233 3276 359 3834 2203 3866 2199 3264 2059 3229 1969 3200 1921 3086 1992 2990 2233 2995 2181 584"/></Shape>
</TextBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.loc.gov/standards/alto/ns-v4#"
xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>176_e2406_default.jpg</fileName>
</sourceImageInformation>
</Description>
<Tags>
<OtherTag ID="BT11389" LABEL="MarginTextZone" DESCRIPTION="block type MarginTextZone"/><OtherTag ID="BT11388" LABEL="MainZone" DESCRIPTION="block type MainZone"/><OtherTag ID="BT4" LABEL="Illustration" DESCRIPTION="block type Illustration"/><OtherTag ID="BT3" LABEL="Commentary" DESCRIPTION="block type Commentary"/><OtherTag ID="BT2" LABEL="Main" DESCRIPTION="block type Main"/><OtherTag ID="BT1" LABEL="Title" DESCRIPTION="block type Title"/>
<OtherTag ID="LT4582" LABEL="DefaultLine" DESCRIPTION="line type DefaultLine"/><OtherTag ID="LT4581" LABEL="InterlinearLine" DESCRIPTION="line type InterlinearLine"/>
</Tags>
<Layout>
<Page WIDTH="3269"
HEIGHT="5080"
PHYSICAL_IMG_NR="3"
ID="eSc_dummypage_">
<PrintSpace HPOS="0"
VPOS="0"
WIDTH="3269"
HEIGHT="5080">
<TextBlock HPOS="2304.0"
VPOS="919.0"
WIDTH="612.0"
HEIGHT="331.0"
ID="eSc_textblock_028e5282"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2312 1216 2304 1195 2304 1107 2357 959 2760 959 2884 919 2884 1205 2916 1205 2862 1250 2304 1228 2304 1216"/></Shape>
</TextBlock>
<TextBlock HPOS="2297.0"
VPOS="2411.0"
WIDTH="526.0"
HEIGHT="207.0"
ID="eSc_textblock_c8ee3994"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2297 2618 2297 2411 2823 2411 2823 2618"/></Shape>
</TextBlock>
<TextBlock HPOS="2264.0"
VPOS="2802.0"
WIDTH="665.0"
HEIGHT="292.0"
ID="eSc_textblock_afde6cf5"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2264 3094 2264 2802 2929 2802 2929 3094"/></Shape>
</TextBlock>
<TextBlock HPOS="2236.0"
VPOS="3658.0"
WIDTH="386.0"
HEIGHT="175.0"
ID="eSc_textblock_2916cf2d"
TAGREFS="BT11389">
<Shape><Polygon POINTS="2236 3833 2236 3658 2622 3658 2622 3833"/></Shape>
</TextBlock>
<TextBlock HPOS="327.0"
VPOS="1006.0"
WIDTH="136.0"
HEIGHT="86.0"
ID="eSc_textblock_3bf33a3e"
TAGREFS="BT11389">
<Shape><Polygon POINTS="327 1006 327 1092 463 1092 463 1006"/></Shape>
</TextBlock>
<TextBlock HPOS="272.0"
VPOS="580.0"
WIDTH="2022.0"
HEIGHT="3287.0"
ID="eSc_textblock_4b3c476c"
TAGREFS="BT11388">
<Shape><Polygon POINTS="419 580 334 930 337 995 497 994 482 1102 366 1130 362 1811 283 1928 272 3830 2212 3867 2194 3687 2294 3643 2204 3147 2237 2454 1829 2430 1826 2309 2179 2301 2230 614"/></Shape>
</TextBlock>
<TextBlock HPOS="1838.0"
VPOS="2312.0"
WIDTH="568.0"
HEIGHT="112.0"
ID="eSc_textblock_334ce63e"
TAGREFS="BT11389">
<Shape><Polygon POINTS="1838 2404 1846 2312 2406 2312 2403 2424"/></Shape>
</TextBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
......@@ -307,7 +307,7 @@
<TextLine ID="eSc_line_95db443f"
TAGREFS="LT4582"
BASELINE="857 1497 1052 1501 2571 1479"
HPOS="852.0"
VPOS="1423.0"
......@@ -834,18 +834,18 @@
<TextLine ID="eSc_line_419340e1"
BASELINE="373 3255 901 3278"
HPOS="370.0"
VPOS="3208.0"
WIDTH="531.0"
HEIGHT="120.0">
<Shape><Polygon POINTS="370 3278 421 3280 423 3283 457 3314 460 3314 463 3314 465 3314 468 3314 471 3314 496 3292 505 3286 550 3286 558 3297 564 3306 566 3306 566 3309 569 3309 572 3309 575 3309 614 3303 659 3295 679 3309 698 3325 701 3325 704 3325 895 3328 901 3278 901 3219 578 3216 519 3208 428 3232 371 3228 373 3255"/></Shape>
TAGREFS="LT4582"
BASELINE="373 3250 901 3278"
HPOS="366.0"
VPOS="3182.0"
WIDTH="530.0"
HEIGHT="148.0">
<Shape><Polygon POINTS="454 3182 412 3182 370 3182 370 3245 366 3292 888 3330 896 3275 896 3203 454 3182 454 3182"/></Shape>
<String CONTENT="γυναικῶ ἀρχίου:"
HPOS="370.0"
VPOS="3208.0"
WIDTH="531.0"
HEIGHT="120.0"></String>
HPOS="366.0"
VPOS="3182.0"
WIDTH="530.0"
HEIGHT="148.0"></String>
</TextLine>
......@@ -1031,18 +1031,18 @@
<TextLine ID="eSc_line_0a72882e"
BASELINE="366 2381 841 2418"
TAGREFS="LT4582"
BASELINE="366 2381 841 2413"
HPOS="357.0"
VPOS="2348.0"
WIDTH="504.0"
HEIGHT="110.0">
<Shape><Polygon POINTS="357 2429 829 2458 837 2416 861 2379 364 2348 362 2378"/></Shape>
VPOS="2319.0"
WIDTH="480.0"
HEIGHT="139.0">
<Shape><Polygon POINTS="362 2378 357 2429 829 2458 837 2412 837 2353 362 2319 362 2378"/></Shape>
<String CONTENT="ἀμυντίχου ἀλιέ"
HPOS="357.0"
VPOS="2348.0"
WIDTH="504.0"
HEIGHT="110.0"></String>
VPOS="2319.0"
WIDTH="480.0"
HEIGHT="139.0"></String>
</TextLine>
......