TBX: 0.7.9, build word IDs if not present in w tags for back-to-text when not tokenizing

Currently, when the ‘Tokenization’ import option is unchecked, no word ID management is performed. As a result, back-to-text, URS unit highlighting and other w ID related functionalities do not work with default text editions. This is a problem: word properties may be imported for various reasons, but doing so should never break back-to-text. The w@id attribute has a special status.

Logic

Decide on a w ID management policy (the decision can be exposed as a new import parameter or become a new TXM behavior):

a0) foreign IDs (coming from the sources) must be compatible with TXM w ID related functionalities (all IDs present, right pattern, etc.); otherwise the import must abort

a1) foreign IDs can be mixed with TXM built w IDs, especially to support back-to-text -> add IDs to the w elements that have none; all w ID related functionalities, like back-to-text, must be able to use those IDs

or a2) don’t mix foreign IDs with TXM built IDs

* a2.1) force w IDs to TXM built IDs

* a2.2.1) rename foreign IDs to ‘txm:foreign-id’ and build TXM w IDs in the ‘xml:id’ attribute, even if not tokenizing

* a2.2.2) build TXM w IDs with an identifier specific to the corpus, and use that identifier instead of ‘id’ in all w ID related functionalities, like back-to-text, even if not tokenizing

* a2.2.3) use the ‘txmid’ word property name (and later ‘txm:id’) to force and use TXM private IDs, even when foreign IDs are present and even if not tokenizing
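To make the difference between a0 and a1 concrete, here is a minimal Python sketch. It is not TXM code: the ID pattern `w_<text>_<n>`, the uniqueness check, the helper names, and the choice to read a foreign ID from `xml:id` or else from `id` are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Hypothetical pattern for IDs usable by back-to-text (assumption).
ID_PATTERN = re.compile(r"w_[A-Za-z0-9.-]+_\d+")


class IncompatibleWordIds(Exception):
    """Raised when the source IDs cannot be used as-is (policy a0)."""


def words(root):
    """Return all <w> elements, whatever their namespace."""
    return [el for el in root.iter()
            if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "w"]


def get_id(w):
    """Read the word ID from xml:id, falling back to id (assumption)."""
    return w.get(XML_ID) or w.get("id")


def check_foreign_ids(root):
    """Policy a0: abort unless every <w> has a unique, well-formed ID."""
    seen = set()
    for n, w in enumerate(words(root), start=1):
        wid = get_id(w)
        if wid is None:
            raise IncompatibleWordIds(f"word #{n} has no ID")
        if not ID_PATTERN.fullmatch(wid):
            raise IncompatibleWordIds(f"word #{n}: unexpected ID {wid!r}")
        if wid in seen:
            raise IncompatibleWordIds(f"word #{n}: duplicate ID {wid!r}")
        seen.add(wid)


def fill_missing_ids(root, text_id):
    """Policy a1: keep foreign IDs, build IDs only for words lacking one."""
    ws = words(root)
    used = {get_id(w) for w in ws} - {None}
    counter = 0
    added = 0
    for w in ws:
        if get_id(w) is None:
            counter += 1
            # Skip numbers already taken by a foreign ID.
            while f"w_{text_id}_{counter}" in used:
                counter += 1
            new_id = f"w_{text_id}_{counter}"
            w.set(XML_ID, new_id)
            used.add(new_id)
            added += 1
    return added
```

Under a0, `check_foreign_ids` would run once per text and stop the import on the first problem. Under a1, `fill_missing_ids` would run instead, leaving existing IDs untouched.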

Solution

Whether tokenizing or not, apply the a2.2.1 policy on import (and on load if possible) and in all ID related functionalities.
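The a2.2.1 policy can be sketched as follows. This is an illustration only, not the TXM implementation: the namespace URI bound to the `txm:` prefix, the `w_<text>_<n>` ID pattern, the function name `apply_a221`, and the assumption that the foreign ID sits in `xml:id` are all hypothetical.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
TXM_NS = "http://textometrie.org/1.0"  # assumed URI for the txm: prefix
XML_ID = f"{{{XML_NS}}}id"
FOREIGN_ID = f"{{{TXM_NS}}}foreign-id"

ET.register_namespace("txm", TXM_NS)  # so serialization uses the txm: prefix


def apply_a221(root, text_id):
    """Move any source xml:id on <w> to txm:foreign-id, then number every <w>.

    Returns the number of words processed.
    """
    count = 0
    for el in root.iter():
        # Match <w> by local name so namespaced sources also work.
        if not isinstance(el.tag, str) or el.tag.rsplit("}", 1)[-1] != "w":
            continue
        foreign = el.attrib.pop(XML_ID, None)
        if foreign is not None:
            el.set(FOREIGN_ID, foreign)  # the foreign ID is kept, not lost
        count += 1
        el.set(XML_ID, f"w_{text_id}_{count}")  # hypothetical TXM ID pattern
    return count


# Usage: one word carries a foreign ID, the other has none.
root = ET.fromstring('<text><w xml:id="a">Hello</w><w>world</w></text>')
apply_a221(root, "t1")
print(ET.tostring(root, encoding="unicode"))
```

Because every word receives a TXM built ID regardless of what the source contains, ID related functionalities never have to deal with foreign ID formats, while the original value remains available as a separate attribute.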

UI

Before

Lexical segmentation
- Re-segment pre-encoded words
- Build word identifiers
- Separator characters
  - Elision characters
  - End-of-sentence characters

After

Lexical segmentation (section)
- Segment words
- Re-segment pre-encoded words

  Separator characters
  - Elision characters
  - End-of-sentence characters
Build word identifiers

Add paranoid tests:

-> if, at the end of the import module, the words have no 'id' property, display a high-level error message (i.e. in a dialog box) stating that back-to-text will not be available for this corpus because the words have no identifiers
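The paranoid test could look like the following sketch. It assumes a hypothetical representation where each word is a dict of properties, and it only builds the message; showing it in a dialog box is left to the UI layer. The function name and message wording are illustrative.

```python
def back_to_text_warning(words, corpus_name):
    """Return an error message if some words lack an 'id', else None.

    `words` is a list of dicts mapping property names to values
    (hypothetical representation of the imported words).
    """
    missing = sum(1 for props in words if not props.get("id"))
    if missing == 0:
        return None
    return (
        f"Back-to-text will not be available for corpus {corpus_name}: "
        f"{missing} of {len(words)} words have no identifier."
    )


# Usage: run once at the end of the import module.
message = back_to_text_warning([{"id": "w_t1_1"}, {"pos": "NOM"}], "MYCORPUS")
if message is not None:
    print(message)  # in TXM this would be shown in a dialog box
```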

(from redmine: issue id 2364, created on 2018/04/10 by Serge Heiden)

Relations:
- relates #1636 (closed)
- relates #2160 (closed)

Edited Oct 16, 2025 by Serge Heiden