Commit 35659a30 authored by Marcello Vitali-Rosati

Finalise lemmatisation

parent 9c9ee5b4
Branches master
%% Cell type:markdown id:356a391b-319e-4ae9-b705-0f4193a105c9 tags:
# Lemmatising the AP (Palatine Anthology) with stanza
Inspired by https://github.com/OdysseusPolymetis/ia_et_shs/blob/main/nlp_greek_latin.ipynb
Thanks to Marianne Reboul
%% Cell type:code id:280f1e75-6516-4d66-8c10-c0657b615e81 tags:
``` python
import stanza

# Download the Ancient Greek "perseus" models. The other model is worth
# trying; it may work better for the more recent epigrams.
stanza.download('grc', package="perseus")
nlp_stanza = stanza.Pipeline(lang='grc', package="perseus", processors='tokenize,pos,lemma,depparse')
```
%% Output
2025-01-16 07:09:16 INFO: Downloaded file to /home/marcello/stanza_resources/resources.json
2025-01-16 07:09:16 INFO: Downloading these customized packages for language: grc (Ancient_Greek)...
================================
| Processor | Package          |
--------------------------------
| tokenize  | perseus          |
| pos       | perseus_nocharlm |
| lemma     | perseus_nocharlm |
| depparse  | perseus_nocharlm |
| pretrain  | conll17          |
================================
2025-01-16 07:09:16 INFO: File exists: /home/marcello/stanza_resources/grc/tokenize/perseus.pt
2025-01-16 07:09:16 INFO: File exists: /home/marcello/stanza_resources/grc/pos/perseus_nocharlm.pt
2025-01-16 07:09:16 INFO: File exists: /home/marcello/stanza_resources/grc/lemma/perseus_nocharlm.pt
2025-01-16 07:09:17 INFO: File exists: /home/marcello/stanza_resources/grc/depparse/perseus_nocharlm.pt
2025-01-16 07:09:17 INFO: File exists: /home/marcello/stanza_resources/grc/pretrain/conll17.pt
2025-01-16 07:09:17 INFO: Finished downloading models and saved to /home/marcello/stanza_resources
2025-01-16 07:09:17 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
2025-01-16 07:09:17 INFO: Downloaded file to /home/marcello/stanza_resources/resources.json
2025-01-16 07:09:18 INFO: Loading these models for language: grc (Ancient_Greek):
================================
| Processor | Package          |
--------------------------------
| tokenize  | perseus          |
| pos       | perseus_nocharlm |
| lemma     | perseus_nocharlm |
| depparse  | perseus_nocharlm |
================================
2025-01-16 07:09:18 INFO: Using device: cpu
2025-01-16 07:09:18 INFO: Loading: tokenize
2025-01-16 07:09:21 INFO: Loading: pos
2025-01-16 07:09:21 INFO: Loading: lemma
2025-01-16 07:09:23 INFO: Loading: depparse
2025-01-16 07:09:23 INFO: Done loading processors!
%% Cell type:code id:0b181bc3-547f-45da-890d-fe810941b725 tags:
``` python
import pandas
df = pandas.read_csv('../../DONNEES/DataIn/corpus_grc.csv')
```
%% Cell type:code id:f4a4b439-85ca-4f3b-8b72-e5289870230b tags:
``` python
def lemma(row):
    """Return the lemmas of a row's text, joined by spaces."""
    result = []
    # Run the stanza pipeline on the text, then collect one lemma per word.
    for sentence in nlp_stanza(row['text']).sentences:
        for word in sentence.words:
            result.append(word.lemma)
    return ' '.join(result)
```
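The `lemma` function above expects every row to hold a string in `text` and every word to come back with a lemma. The following is a minimal sketch of a more defensive variant, under the assumption (not confirmed by the notebook) that some rows may be empty or that some words may have no lemma. The names `lemmatize_text` and `fake_nlp` are hypothetical; `fake_nlp` only imitates the shape of a stanza result so that the example runs without downloading any model.

```python
from types import SimpleNamespace


def lemmatize_text(text, nlp):
    """Return the space-joined lemmas of `text`, or '' for missing input.

    `nlp` is any callable returning an object with `.sentences`, each
    holding `.words` that carry `.text` and `.lemma`.
    """
    # Missing values (None, or NaN as read by pandas) are not strings.
    if not isinstance(text, str) or not text.strip():
        return ''
    lemmas = []
    for sentence in nlp(text.strip()).sentences:
        for word in sentence.words:
            # Fall back to the surface form when no lemma was produced.
            lemmas.append(word.lemma if word.lemma else word.text)
    return ' '.join(lemmas)


def fake_nlp(text):
    """Hypothetical stand-in for a pipeline: the 'lemma' is the lowercased
    token, and commas deliberately get no lemma at all."""
    words = [
        SimpleNamespace(text=t, lemma=None if t == ',' else t.lower())
        for t in text.split()
    ]
    return SimpleNamespace(sentences=[SimpleNamespace(words=words)])


demo = lemmatize_text('\n   Alpha , Beta', fake_nlp)
```

With the real pipeline, the same function could be applied to the column directly, for instance `df['text'].apply(lemmatize_text, nlp=nlp_stanza)`, since `Series.apply` forwards extra keyword arguments to the function.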
%% Cell type:code id:c24dae5a-7330-4686-a1ca-539e7e08c867 tags:
``` python
# Lemmatise every row once and store the result in a new column.
df['text_lemmatized'] = df.apply(lemma, axis=1)
```
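Passing every text through the pipeline is likely the slowest step here, so it may be worth making sure that the same text is never processed twice, for example when the cell is run again in the same session. The sketch below wraps a lemmatising function in a cache; `make_cached_lemmatizer` and `slow_lemmatize` are hypothetical names, and `slow_lemmatize` merely stands in for the stanza call so that the example is runnable on its own.

```python
from functools import lru_cache


def make_cached_lemmatizer(lemmatize):
    """Wrap a text -> lemmas function so each distinct text is processed once."""
    @lru_cache(maxsize=None)
    def cached(text):
        return lemmatize(text)
    return cached


# Hypothetical stand-in for the slow pipeline call; it records every real run.
calls = []


def slow_lemmatize(text):
    calls.append(text)
    return text.lower()


cached_lemmatize = make_cached_lemmatizer(slow_lemmatize)
results = [cached_lemmatize(t) for t in ['A B', 'C', 'A B']]
# 'A B' appears twice in the input but reaches slow_lemmatize only once.
```

The cache lives in memory for the duration of the session only; it does not persist between kernel restarts.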
%% Cell type:code id:6af10a3c-fff0-4537-8f15-3a82d9a2a3d0 tags:
``` python
df.head()
```
%% Output
   book  fragment     author  \
0     1         1  anonymous
1     1         2  anonymous
2     1         3  anonymous
3     1         4  anonymous
4     1         5  anonymous

                            text  \
0  \n ἃς οἱ πλάνοι καθεῖλον ...
1  \n θεῖος Ἰουστῖνος, Σοφίη...
2  \n ὁ πρὶν Ἰουστῖνος περικα...
3  \n τοῦτον Ἰωάννῃ, Χριστοῦ...
4  \n τόνδε Θεῷ κάμες οἶκον, ...

                                        keyword_code  \
0  ['https://anthologiagraeca.org/api/keywords/11...
1  ['https://anthologiagraeca.org/api/keywords/11...
2  ['https://anthologiagraeca.org/api/keywords/11...
3  ['https://anthologiagraeca.org/api/keywords/11...
4  ['https://anthologiagraeca.org/api/keywords/11...

                                     text_lemmatized
0  ὅς ὁ πλάσσω καθαιρέω ἔνθατος εἰκών ἄναξ στηλόω...
1  θεῖος Ἰουστῖνος , Σοφίης πόσις , ὅς πόρω Χριστ...
2  ὁ πρίν Ἰουστῖνος περικαλλής δέομαι ναός οὗτος ...
3  οὗτος Ἰωννεύς , Χριστός μέγας θεράπων , Στούδι...
4  ὅδε Θεός κάμνω οἶκος , Ἀμάντιος , μεσσόθι πόντ...
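
In the output above, the `keyword_code` values are displayed as text beginning with `['https://…`, which suggests that the lists were stored in the CSV as strings. The following sketch turns such a string back into a list, assuming the cells are Python-style list literals; this is an inference from the truncated display, not something the notebook states. `parse_keywords` is a hypothetical helper and the URLs in `example` are made up.

```python
import ast


def parse_keywords(cell):
    """Turn a string like "['a', 'b']" into a list; return [] if it cannot be parsed."""
    if not isinstance(cell, str):
        return []
    try:
        # literal_eval only evaluates literals, so it does not execute code.
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


# Made-up value with the same shape as the truncated cells shown above.
example = "['https://example.org/keywords/1/', 'https://example.org/keywords/2/']"
parsed = parse_keywords(example)
```

Applied to the dataframe, this could look like `df['keyword_code'].apply(parse_keywords)`.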
%% Cell type:code id:9d7e049d-9bf1-44cc-b528-421deeec48cc tags:
``` python
df.to_csv('../../DONNEES/DataIn/corpus_grc_tokenized.csv')
```