RCP: X.X, words not highlighted in editions
For some texts, words are not highlighted in editions.
The IDs of the words that are not highlighted contain characters that break the CSS ID syntax rules (e.g. " “, ”(" and more).
Discussion
Word IDs are either built as <text identifier + number> or come from the sources.
If we build the word IDs in the import modules, we must normalize/reduce text names to a text identifier at the corpus level.
Three strategies:
- a) normalize/reduce characters or morphemes
- b) escape characters
- c) hash the identifier
b) requires escaping according to the syntax in which the identifier is read, for example CSS syntax. Different escape algorithms may therefore be needed depending on the context. See the XXX Java library, which escapes for many different syntaxes.
c) requires using the hash in various contexts, e.g. concordance references.
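To make the trade-off between b) and c) concrete, here is a minimal Java sketch. It is not the XXX library mentioned above and not existing TXM code: the class WordIdStrategies, its method names and the "w_" prefix are hypothetical, and the escaping is a simplified hand-written version of CSS identifier escaping (every character outside ASCII letters, digits, '_' and '-' becomes a hex escape).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only (not part of TXM).
final class WordIdStrategies {
    private WordIdStrategies() {}

    /** Strategy b): escape a raw ID for use after '#' in a CSS selector. */
    static String escapeForCss(String rawId) {
        StringBuilder sb = new StringBuilder();
        int[] cps = rawId.codePoints().toArray();
        for (int i = 0; i < cps.length; i++) {
            int cp = cps[i];
            boolean letter = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || cp == '_';
            boolean digitOrHyphen = (cp >= '0' && cp <= '9') || cp == '-';
            // Digits and hyphens are kept as-is except in first position.
            if (letter || (digitOrHyphen && i > 0)) {
                sb.append((char) cp);
            } else {
                // Hex escape; the trailing space terminates the escape sequence.
                sb.append('\\').append(Integer.toHexString(cp)).append(' ');
            }
        }
        return sb.toString();
    }

    /** Strategy c): replace the raw ID by a fixed-length SHA-1 hex digest. */
    static String hashId(String rawId) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(rawId.getBytes(StandardCharsets.UTF_8));
            // Assumed prefix so that the result never starts with a digit.
            StringBuilder sb = new StringBuilder("w_");
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

With b) the stored ID stays unchanged but every consumer has to apply the escaping that matches its own syntax. With c) the ID is safe everywhere but no longer readable, which matters wherever IDs are shown to users, such as concordance references.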
Solution
Define the simplest common syntax that is compatible with both CSS ID syntax and CQL syntax.
Do a): fix the XMLw to XML-TXM step of the import modules, in the XML2Ana class:
- normalize/reduce the word ID to the CSS ID syntax (= the syntax of xml:id)
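As an illustration of what such a normalization could look like, here is a minimal Java sketch. It is not the actual XML2Ana code: the class WordIdNormalizer, the '_' replacement character and the "w_" prefix are assumptions. The allowed alphabet (ASCII letters, digits, '_' and '-') is deliberately narrower than what xml:id permits and is assumed here to be safe for CSS selectors and CQL alike.

```java
// Hypothetical helper, for illustration only (not the XML2Ana implementation).
final class WordIdNormalizer {
    private WordIdNormalizer() {}

    /** Reduces a raw word ID to ASCII letters, digits, '_' and '-'. */
    static String normalize(String rawId) {
        StringBuilder sb = new StringBuilder(rawId.length() + 2);
        rawId.codePoints().forEach(cp -> {
            boolean safe = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
                    || (cp >= '0' && cp <= '9') || cp == '_' || cp == '-';
            // Each unsafe code point is replaced by one underscore.
            sb.append(safe ? (char) cp : '_');
        });
        // An XML name must start with a letter or '_': add an assumed prefix otherwise.
        boolean validStart = sb.length() > 0
                && (sb.charAt(0) == '_'
                    || (sb.charAt(0) >= 'a' && sb.charAt(0) <= 'z')
                    || (sb.charAt(0) >= 'A' && sb.charAt(0) <= 'Z'));
        if (!validStart) {
            sb.insert(0, "w_");
        }
        return sb.toString();
    }
}
```

This reduction is lossy: two distinct raw IDs such as "a b" and "a_b" map to the same result, so a real implementation would also need to check uniqueness within the corpus, for example by appending the word number.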
Solution 2 (not done, see #2364)
- add a new import option “force word id generation” for corpora that already have word IDs.
- add a new load option “force word id generation” for corpora that already have word IDs.
(from redmine: issue id 2160, created on 2017/04/19 by Matthieu Decorde)
- Relations:
- relates #2353 (closed)
- relates #2354 (closed)
- relates #2364