cwb-encode fails to process UTF-8 text with combining diacritics (accents encoded after the base character)

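When an accent is encoded as a separate combining character after its base letter (decomposed form, e.g. "e" followed by U+0301) instead of as a single precomposed character, cwb-encode fails. This issue has been observed only on Mac.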

Normalize the accents when writing the CQP corpus source files.

At the beginning of the importer steps of the import modules, normalize each string to NFC (composed form) with: String s2 = java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFC);

This must be done before any other import step (tokenization, annotation, etc.).
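
A minimal sketch of the normalization step, to show what the NFC call changes. The class name AccentNormalizer and the helper toNfc are hypothetical and are not part of the import modules. Only java.text.Normalizer is a real API here.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only (not the actual importer code).
class AccentNormalizer {

    // Returns the NFC (composed) form of s.
    // Null input and already-normalized text are returned unchanged.
    static String toNfc(String s) {
        if (s == null || Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "été" written in decomposed form: each "e" is followed by
        // the combining acute accent U+0301.
        String decomposed = "e\u0301te\u0301";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length());                 // 5
        System.out.println(composed.length());                   // 3
        System.out.println(composed.equals("\u00e9t\u00e9"));    // true
    }
}
```

The isNormalized check only avoids an unnecessary copy when the text is already in NFC. Calling Normalizer.normalize directly, as in the line above, gives the same result.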

Edited Apr 24, 2025 by Matthieu Decorde