Implement "upsert" on data import
Context
Currently, the import bot skips entities (items or properties) that already exist, based on their business-level IDs (e.g. SCOB ID or DFIH ID, not the Wikibase property/item IDs). It does not update them if they have been modified since the previous import, even when the ID stays the same.
Consequence: an already imported item (e.g. a SCOB company) will be skipped during a subsequent import. To actually update it, a pragmatic solution is to delete it and re-import it, and this can be applied to the entire database.
A more satisfying solution would be to adopt an "upsert" pattern (which means "update or insert if it does not exist").
Upsert can be done at 3 levels:
- the item level: create the item if it does not exist, otherwise update it, meaning create the missing properties
- the simple property level: create a property if it does not exist, otherwise update it (e.g. company name)
- the inner property level: create a sub-property if it does not exist, otherwise update it (e.g. address / street name)
A SPARQL query would be done when the bot starts. Updating would mean: compare the value of the SPARQL result to the value in the imported data.
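As a minimal sketch of this comparison step, assuming the start-up query returns the standard SPARQL JSON results format: the bindings can be indexed by business ID and then compared to an imported value. The variable names (`item`, `dfihId`, `name`), the `example.org` entity URI and both helper functions are hypothetical and not part of the bot.

```python
def index_by_business_id(sparql_json, id_var="dfihId", value_var="name"):
    """Group SPARQL JSON bindings by business ID.

    Returns {business_id: {"item": entity URI, "values": set of values}}.
    A property may be multi-valued, so values are collected in a set.
    """
    index = {}
    for binding in sparql_json["results"]["bindings"]:
        business_id = binding[id_var]["value"]
        entry = index.setdefault(
            business_id, {"item": binding["item"]["value"], "values": set()}
        )
        # Unbound OPTIONAL variables are simply absent from a binding.
        if value_var in binding:
            entry["values"].add(binding[value_var]["value"])
    return index


def compare(index, business_id, imported_value):
    """Decide what an import would have to do for one imported value."""
    if business_id not in index:
        return "create_item"
    values = index[business_id]["values"]
    if not values:
        return "create_property"
    if imported_value in values:
        return "unchanged"
    # The value differs; for multi-valued properties this is ambiguous.
    return "update_property"


# Hypothetical query result for one company (assumed variable names).
SAMPLE = {
    "head": {"vars": ["item", "dfihId", "name"]},
    "results": {"bindings": [
        {
            "item": {"type": "uri", "value": "http://example.org/entity/Q151"},
            "dfihId": {"type": "literal", "value": "104"},
            "name": {"type": "literal", "value": "BANQUE DE FRANCE"},
        },
    ]},
}

index = index_by_business_id(SAMPLE)
```

Building the index once at start-up keeps the import to a single query, at the cost of holding all existing values in memory.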
Caveats
- how to match properties when they don't have a stable ID? e.g. street name in address
Examples
```
(Q151:BANQUE DE FRANCE) -(P17:instance of)-> (Q1:company)
(Q151:BANQUE DE FRANCE) -(P40:DFIH corporation ID)-> [(104, qualifiers, references)]
(Q151:BANQUE DE FRANCE) -(P23:name)-> [(value, qualifiers, references)]
```
The process is to import all DFIH corporations.
To do upsert at the item level:
- query all items having P17 == Q1 (instance of == company)
- for each company of `dfih_companies.csv`, create it if missing or skip
To do upsert at the simple property level:
- query all items having P17 == Q1 (instance of == company)
- for each company of `dfih_companies.csv`
  - create it if missing
  - else for each CSV column of the company corresponding to a property
    - find the property in the SPARQL result
    - create it if missing or skip (or update, but we have to find the right value among multi-valued values, using qualifiers)
To do upsert at the inner property level:
- query all items having P17 == Q1 (instance of == company)
- for each company of `dfih_companies.csv`
  - create it if missing
  - else for each CSV column of the company corresponding to a property (e.g. value is of type `Item` like `legal status`)
    - find the property in the SPARQL result
    - create it if missing
    - else open the linked item and reiterate at the item level
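The item-level and simple-property-level procedures above can be sketched as a planning function that only decides which actions to take, without calling Wikibase. This is a sketch under assumptions: the CSV column names (`dfih_id`, `name`), the column-to-property mapping, the made-up rows 105 and 106, and the shape of `existing` (the start-up query result keyed by DFIH ID) are all hypothetical.

```python
import csv
import io

# Hypothetical mapping from CSV columns to Wikibase property IDs.
COLUMN_TO_PROPERTY = {"name": "P23"}


def plan_upsert(rows, existing, id_column="dfih_id"):
    """Return a list of (action, business_id, property_id, value) tuples.

    `existing` maps a business ID to {property_id: set of current values}.
    """
    actions = []
    for row in rows:
        business_id = row[id_column]
        # Item level: create the item if it is missing.
        if business_id not in existing:
            actions.append(("create_item", business_id, None, None))
            continue
        # Simple property level: inspect each mapped CSV column.
        current = existing[business_id]
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = row.get(column)
            if not value:
                continue  # empty cell: nothing to import
            if prop not in current:
                actions.append(("create_property", business_id, prop, value))
            elif value not in current[prop]:
                # Which statement to update is ambiguous for multi-valued
                # properties, so the change is only flagged here.
                actions.append(("update_candidate", business_id, prop, value))
    return actions


# Illustrative stand-in for dfih_companies.csv (rows 105 and 106 are made up).
CSV_TEXT = """dfih_id,name
104,BANQUE DE FRANCE
105,EXAMPLE COMPANY A
106,EXAMPLE COMPANY B
"""

existing = {
    "104": {"P23": {"BANQUE DE FRANCE"}},  # already up to date
    "106": {},                             # item exists, name missing
}

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
plan = plan_upsert(rows, existing)
```

Separating planning from execution makes it possible to review the list of actions (a dry run) before any write is sent to Wikibase.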
Tasks
- explore several bot libraries (Wikidata Integrator, ...) to see if (and how) they implement upsert
  - upsert seems to be done at the item level only, not the property level
- use it or implement it in the bot