RCP: X.X, add XTZ import module
See full specification (listing three other tickets related to XTZ): https://groupes.renater.fr/wiki/txm-info/public/import_xtz
- copy XML/w+CSV import to XTZ+CSV
- entry menu
- scripts in scripts/import
- add management of the new source directory sub-directories
- 'dtd' sub-directory contains the dtd files to use with XSLs.
- See http://docs.oracle.com/javase/7/docs/api/javax/xml/stream/XMLInputFactory.html
- 'css' sub-directory contains the css files to use with HTML pages in editions
- MD: the pager must declare the css files in each HTML page with a path "css/cssfilename.css"
- MD: the css directory must be copied next to the HTML pages for each edition (Groovy or XSL)
- 'xsl' sub-directory contains different types of XSL sub-directories (if a directory is absent or empty it is not used)
- '1-split-merge' sub-sub-directory containing an XSL stylesheet used to split or merge source files to adapt them to the TXM corpus model (1 text = 1 file)
- this XSL receives a "binary-src-dir-path" parameter with a path to write result files
- the standard XSL output file of this stylesheet is not used
- examples: split-texts.xsl or merge-files.xsl
- '2-front' sub-sub-directory containing the XSL stylesheets to process the sources at the beginning of the import process (replaces the 'front XSL' section mechanism). The XSL stylesheets are applied in the lexicographical order of their file names.
- examples: txm-filter-teip5-xmlw-preserve.xsl
- '3-posttok' sub-sub-directory containing the XSL stylesheets to process the xml-txm representation of the sources after the tokenization phase (all words are encoded). The XSL are applied in the lexicographical order of their file names.
- examples: reduce-caesura.xsl, build-word-ref.xsl
- '4-edition' sub-sub-directory containing the XSL stylesheets to build the HTML edition from the xml-txm representation using the pagination done by the pager. The XSL are applied in the lexicographical order of their file names.
- example: in order, 1-default-html.xsl and 2-default-pager.xsl build the 'default' edition, followed by 3-facs-html.xsl and 4-facs-html.xsl, which build the facsimile edition that holds the images
- all XSL receive the following parameters: "number-words-per-page", "pagination-element", "import-xml-path".
- Note: these XSL parameters are not mandatory (MD: tested)
- The XSL file writes the first word ID of each HTML file produced into a "content" value. If there is no word in the page, then the "content" value is "w_0"
- Their file name is used to name the edition produced
- all sub-directories are copied to the binary corpus
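The 'xsl' sub-directory rules above (stages that are absent or empty are skipped, stylesheets are applied in the lexicographical order of their file names) can be sketched as follows. This is a minimal Python illustration, not the import module's actual Groovy code: the function name `list_stage_stylesheets` and the `.xsl` suffix filter are assumptions, and only the four stage directory names come from this ticket.

```python
from pathlib import Path

# Stage directory names as listed in this ticket, in processing order.
XSL_STAGES = ["1-split-merge", "2-front", "3-posttok", "4-edition"]


def list_stage_stylesheets(source_dir):
    """Return {stage name: [stylesheet paths]} for the stages to run.

    Hypothetical helper: a stage whose directory is absent or contains
    no .xsl file is left out, and the stylesheets of each remaining
    stage are sorted by file name (plain string comparison).
    """
    xsl_root = Path(source_dir) / "xsl"
    plan = {}
    for stage in XSL_STAGES:
        stage_dir = xsl_root / stage
        if not stage_dir.is_dir():
            continue  # absent directory: stage not used
        sheets = sorted(
            (p for p in stage_dir.iterdir() if p.is_file() and p.suffix == ".xsl"),
            key=lambda p: p.name,
        )
        if sheets:  # empty directory: stage not used
            plan[stage] = sheets
    return plan
```

Note that plain lexicographical order sorts "10-extra.xsl" before "2-default-pager.xsl", so numeric prefixes need zero-padding once a stage holds more than nine stylesheets.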
- modify the import form :
- add section "Plans textuels"
- list of tags encoding out-of-text content (neither indexed nor edited) (transform to Regexp)
- MD: added an import parameter "element.ignored.always" (formerly coded in properties files)
- list of tags encoding out-of-text content to be edited (displayed in the edition) (transform to Regexp)
- MD: added an import parameter "element.edited.only" (formerly coded in properties files)
- remove "front XSL" section
- note: "add parameter" is broken
- move "font" section after "editions" and before "commands"
- modify Éditions section
- "Editions" -> "Éditions"
- add 'images' URI declaration (see below)
- add section "Plans textuels"
[x] Build edition
Words per page [500]
Pagination element [pb]
Local directory of facsimile images [...]
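The "(transform to Regexp)" notes in the "Plans textuels" section could work along these lines. This is only a sketch under assumptions: the ticket does not say how the tag list is entered or what the resulting expression looks like, so the comma separator and the function name `tags_to_regex` are hypothetical.

```python
import re


def tags_to_regex(tag_list):
    """Turn a comma-separated list of element names into one regex.

    Assumption: the form field holds names separated by commas.
    Returns None when the field contains no name at all.
    """
    names = [t.strip() for t in tag_list.split(",") if t.strip()]
    if not names:
        return None
    # Escape each name so that characters such as '.' stay literal.
    return re.compile("(?:" + "|".join(re.escape(n) for n in names) + ")")


# Usage: test an element name against the whole pattern.
ignored = tags_to_regex("teiHeader, note")
print(bool(ignored.fullmatch("note")))
print(bool(ignored.fullmatch("notes")))
```

Using `fullmatch` rather than `search` keeps an element such as "notes" from matching a declared "note".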
- Import script update strategy:
- if the import script is missing, it is retrieved from the Groovy files of the Toolbox jar
- if the script exists and its date is older than the Toolbox jar script, then it is replaced
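The update strategy above amounts to a "missing or older" check. The sketch below illustrates it in Python under a simplifying assumption: the Toolbox copy of the script is already available as a plain file (the real module would read it from the jar), and the names `needs_update` and `refresh_script` are hypothetical.

```python
import shutil
from pathlib import Path


def needs_update(installed, bundled):
    """True if the installed script is missing or older than the bundled one."""
    if not installed.exists():
        return True
    return installed.stat().st_mtime < bundled.stat().st_mtime


def refresh_script(installed, bundled):
    """Copy the bundled script over the installed one when needed.

    Returns True if a copy was made. copy2 preserves the modification
    time, so a second call right after a copy does nothing.
    """
    installed, bundled = Path(installed), Path(bundled)
    if not needs_update(installed, bundled):
        return False
    installed.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(bundled, installed)
    return True
```

With this rule a locally modified script is kept as long as its date is not older than the Toolbox version.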
- transfer the edition macros into the XTZ import module
Later
Integrate the XMLText2MetadataCSV macro content to pull metadata from teiHeaders directly.
Edited by Matthieu Decorde