Execution model — processing order & update triggers
For a single kgsteward run: the order things happen in, and what makes a
dataset be (re)processed. Logic lives in
yamlconfig.py (parsing) and
kgsteward.py (main()).
Lifecycle
Datasets are processed in declaration order. There is no run-time
topological sort; instead a parse-time rule guarantees that order is valid: a
parent: must be declared earlier in the file (parent: "*" = all datasets so
far). So every parent precedes its children, which is what lets status
propagation work in one forward pass.
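The parse-time rule can be sketched in a few lines of Python. This is a minimal illustration, not the code in yamlconfig.py: the function name `check_declaration_order` and the list-of-dicts config shape are assumptions made for the example.

```python
def check_declaration_order(datasets):
    """Resolve parents in one forward pass; fail on a parent declared later.

    `datasets` is a list of dicts (hypothetical shape) with a "name" and an
    optional "parent" list, in declaration order.
    Returns {name: [resolved parent names]}.
    """
    seen = []        # names declared so far, in order
    resolved = {}
    for ds in datasets:
        parents = []
        for p in ds.get("parent", []):
            if p == "*":
                # "*" expands to every dataset declared so far
                parents.extend(seen)
            elif p in seen:
                parents.append(p)
            else:
                raise ValueError(
                    f"dataset {ds['name']!r}: parent {p!r} must be declared earlier"
                )
        # drop duplicates while keeping declaration order
        resolved[ds["name"]] = list(dict.fromkeys(parents))
        seen.append(ds["name"])
    return resolved


config = [
    {"name": "ontology"},
    {"name": "data", "parent": ["ontology"]},
    {"name": "stats", "parent": ["*"]},
]
resolved = check_declaration_order(config)
print(resolved["stats"])  # ['ontology', 'data']
```

Because a parent is always resolved before its children, no sorting step is needed afterwards: the input order is already a valid processing order.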
flowchart TD
A[Read CLI args<br/>-F expands to -I + -D] --> B[Parse YAML config<br/>declaration order;<br/>parents before children]
B --> C[Connect to backend<br/>graphdb / rdf4j / fuseki / qlever]
C --> D{-I given?}
D -- yes --> E[Rewrite repository<br/>ERASE all RDF data]
D -- no --> F
E --> F[Select the UPDATE SET<br/>-D/-F: all · -d: named · -C: status-driven · none: report-only]
F --> G[plan_index_scope<br/>qlever: restrict rebuild<br/>to dependency closure]
G --> H[Per-dataset loop, in declaration order<br/>system → url → file → update → special]
H --> I[Post-processing:<br/>prefixes, -U restamp,<br/>-V validate, -Q queries, dumps]
I --> J[Recompute status + refine_status]
J --> K[Show current status table]
K --> L[ensure server running<br/>+ SPARQL-log summary]
Within a selected dataset the clauses always run in this single fixed order:
system→url→file→update→special, then the dataset’s metadata is persisted.
system typically produces the data the later clauses load; update SPARQL
statements run in file order then document order; special emits
kgsteward-generated triples (void / prefix / query descriptions). replace is
not a stage — it is the string-substitution map applied to the update text
before it runs. Any clause may be absent. This order is not configurable: to
run steps in any other order (e.g. an update before a file load), split the
work across two datasets and declare the second with the first as its
parent: — the dependency forces the first to be processed in full before the
second.
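The fixed clause order, and the two-dataset workaround, can be illustrated with a small planner. This is a sketch under simplifying assumptions: `plan_dataset` is a hypothetical helper, update entries hold SPARQL text directly (rather than file paths), and replace is modeled as plain literal substitution.

```python
CLAUSE_ORDER = ("system", "url", "file", "update", "special")


def plan_dataset(dataset):
    """Return the ordered (clause, item) steps for one dataset (hypothetical helper)."""
    replace = dataset.get("replace", {})
    steps = []
    for clause in CLAUSE_ORDER:
        # an absent clause simply contributes no steps
        for item in dataset.get(clause, []):
            if clause == "update":
                # replace is not a stage: it rewrites the update text before it runs
                for old, new in replace.items():
                    item = item.replace(old, new)
            steps.append((clause, item))
    return steps


# The key order inside the dict does not matter; CLAUSE_ORDER decides.
dataset = {
    "name": "demo",
    "update": ["INSERT { ?s a ${CLASS} } WHERE { ?s ?p ?o }"],
    "file": ["data.ttl"],
    "system": ["make data.ttl"],
    "replace": {"${CLASS}": "ex:Thing"},
}
steps = plan_dataset(dataset)

# To run an update BEFORE a file load, split the work across two datasets.
first = {"name": "prep", "update": ["DELETE WHERE { ?s ?p ?o }"]}
second = {"name": "load", "parent": ["prep"], "file": ["data.ttl"]}
combined = [step for ds in (first, second) for step in plan_dataset(ds)]

print([clause for clause, _ in steps])     # ['system', 'file', 'update']
print([clause for clause, _ in combined])  # ['update', 'file']
```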
What triggers a rebuild
A dataset’s target checksum (get_sha256) is compared to the checksum stored
from its last load (kgsteward:checksum). The checksum covers the dataset’s
inputs:
| Hashed | Not hashed |
|---|---|
| context IRI; system command strings; file byte content; url string + HTTP HEAD (Last-Modified/ETag); stamp (HEAD or content); replace pairs; update file text; special keys | parent content (only parent names are hashed, see below); frozen status |
So a rebuild is triggered by an edited input file, a changed remote resource, an
edited update/system/replace/url/stamp entry, or a forcing flag.
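As an illustration of what "covers the dataset's inputs" means, here is a simplified checksum sketch. It is not the actual get_sha256: the function name `dataset_checksum`, the injected `read_bytes` and `head` callables, and the exact serialization are assumptions chosen so the example runs without disk or network access.

```python
import hashlib


def dataset_checksum(dataset, read_bytes, head):
    """SHA-256 over a dataset's hashed inputs (simplified, hypothetical).

    read_bytes(path) -> bytes and head(url) -> str are injected stand-ins
    for file reads and HTTP HEAD requests.
    """
    h = hashlib.sha256()

    def feed(label, value):
        h.update(label.encode())
        h.update(b"\0")
        h.update(value if isinstance(value, bytes) else str(value).encode())
        h.update(b"\0")

    feed("context", dataset.get("context", ""))
    for name in dataset.get("parent", []):
        feed("parent", name)                # parent NAME only, never its content
    for cmd in dataset.get("system", []):
        feed("system", cmd)                 # command text, not its output
    for path in dataset.get("file", []):
        feed("file", read_bytes(path))      # byte content
    for url in dataset.get("url", []):
        feed("url", url)
        feed("url-head", head(url))         # e.g. Last-Modified / ETag
    for stamp in dataset.get("stamp", []):
        feed("stamp", head(stamp))          # HEAD variant only, for brevity
    for old, new in sorted(dataset.get("replace", {}).items()):
        feed("replace", f"{old}={new}")
    for path in dataset.get("update", []):
        feed("update", read_bytes(path))    # update file text
    for key in dataset.get("special", []):
        feed("special", key)
    # "frozen" is deliberately never fed to the hash
    return h.hexdigest()


files = {"data.ttl": b"<a> <b> <c> .", "fix.rq": b"DELETE WHERE { ?s ?p ?o }"}
ds = {
    "context": "http://example.org/g",
    "parent": ["onto"],
    "file": ["data.ttl"],
    "update": ["fix.rq"],
    "frozen": False,
}
before = dataset_checksum(ds, files.__getitem__, lambda url: "")
files["data.ttl"] += b"\n<d> <e> <f> ."
after_edit = dataset_checksum(ds, files.__getitem__, lambda url: "")
ds["frozen"] = True
after_freeze = dataset_checksum(ds, files.__getitem__, lambda url: "")

print(before != after_edit)        # True: an edited input file changes the checksum
print(after_edit == after_freeze)  # True: frozen status is not hashed
```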
Each dataset resolves to one status: forced (-d/-D/-F) → UPDATE;
else checksum matches → ok; else frozen → FROZEN (skipped); else
→ UPDATE. Finally, a not-frozen dataset with a parent that is EMPTY/UPDATE/
PROPAGATE becomes PROPAGATE.
Under -C, every dataset ending EMPTY / UPDATE / PROPAGATE is reprocessed.
Because datasets are evaluated in declaration order, a parent marked UPDATE
flips its not-frozen children to PROPAGATE in the same pass, cascading
downward; a frozen dataset never auto-marks and stops the cascade (refresh it
with -d or --force_unfreeze).
Parent content is not in the child checksum; only parent names are. A change in a parent’s data rebuilds the child through
PROPAGATE (when the parent is in the update set), not through the checksum.
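The status rules above can be sketched as a single forward pass. This is an illustration only: `resolve_statuses` and its arguments are hypothetical names, the rules are applied literally as stated, and the sketch never assigns EMPTY because the text does not say how that status arises.

```python
def resolve_statuses(datasets, stored, target, forced=()):
    """One forward pass over datasets in declaration order (hypothetical helper).

    stored/target map dataset name -> checksum; a name missing from
    `stored` never matches.
    """
    status = {}
    for ds in datasets:
        name = ds["name"]
        frozen = ds.get("frozen", False)
        if name in forced:
            s = "UPDATE"
        elif stored.get(name) == target[name]:
            s = "ok"
        elif frozen:
            s = "FROZEN"
        else:
            s = "UPDATE"
        # parents were declared earlier, so their status is already known
        dirty_parent = any(
            status[p] in ("EMPTY", "UPDATE", "PROPAGATE")
            for p in ds.get("parent", [])
        )
        if not frozen and dirty_parent:
            s = "PROPAGATE"
        status[name] = s
    return status


datasets = [
    {"name": "onto"},
    {"name": "data", "parent": ["onto"]},
    {"name": "legacy", "parent": ["data"], "frozen": True},
    {"name": "stats", "parent": ["legacy"]},
]
stored = {"onto": "h1", "data": "h1", "legacy": "h1", "stats": "h1"}
target = {"onto": "h2", "data": "h1", "legacy": "h1", "stats": "h1"}

cascade = resolve_statuses(datasets, stored, target)
forced = resolve_statuses(datasets, stored, stored, forced={"legacy"})

print(cascade)  # {'onto': 'UPDATE', 'data': 'PROPAGATE', 'legacy': 'ok', 'stats': 'ok'}
print(forced)   # {'onto': 'ok', 'data': 'ok', 'legacy': 'UPDATE', 'stats': 'PROPAGATE'}
```

In the first call the frozen dataset absorbs the cascade, so `stats` stays ok; in the second, forcing the frozen dataset restarts propagation below it.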
The three passes over a dataset
A dataset definition is traversed up to three times per run. The two checksum
passes (1 and 3) read the same hashed inputs; the execution pass (2) runs only
the executable clauses. The store is modified only in the middle pass;
stamp (and context, parent, replace) feed the checksum passes only.
Passes 1 and 3 treat their inputs as an unordered set (the order shown in the diagram carries no meaning); only pass 2 is a true ordered sequence.
flowchart TB
subgraph S1["1 — compute checksum (selection, before system) → rebuild decision"]
a1[context] & a2[parent] & a3[system] & a4[file] & a5[url] & a6[stamp] & a7[replace] & a8[update] & a9[special] --> AH(["SHA-256"])
end
subgraph S2["2 — modify the store (execution, ordered)"]
direction LR
b3[system] --> b5[url] --> b4[file] --> b8["replace+update"] --> b9[special]
end
subgraph S3["3 — recompute checksum (persist, after system) → stored value"]
c1[context] & c2[parent] & c3[system] & c4[file] & c5[url] & c6[stamp] & c7[replace] & c8[update] & c9[special] --> CH(["SHA-256"])
end
S1 --> S2 --> S3
Pass 1 runs only under -C (-d/-D force the set and skip it). In passes 1
and 3 each clause is hashed (e.g. system = its command text); in pass 2 the
clauses are executed (e.g. system = the command runs) — and only there does
url actually precede file. Because the store is modified (pass 2) after
the deciding checksum (pass 1), a dataset’s system: cannot trigger its own
rebuild in the same run — its effect is captured by pass 3 and seen only by the
next run. So point stamp/url/file at the upstream source whose change
should trigger a rebuild, never at a file your own system: produces.
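A toy simulation makes the consequence concrete. Everything here is hypothetical (the `run` and `checksum` helpers, the in-memory `files` dict, the counter standing in for a system command); it only mirrors the pass ordering described above.

```python
import hashlib


def checksum(ds, files):
    h = hashlib.sha256()
    h.update(ds["system"].encode())          # system = its command text
    h.update(files.get(ds["file"], b""))     # file = its byte content
    return h.hexdigest()


def run(ds, files, stored, counter):
    """One -C style run (toy model). Returns True if the dataset was rebuilt."""
    decided = checksum(ds, files)            # pass 1: before system
    if stored.get(ds["name"]) == decided:
        return False                         # ok, skipped
    counter[0] += 1                          # pass 2: "system" regenerates the file
    files[ds["file"]] = f"build {counter[0]}".encode()
    stored[ds["name"]] = checksum(ds, files)  # pass 3: after system
    return True


ds = {"name": "demo", "system": "make out.ttl", "file": "out.ttl"}
files, stored, counter = {}, {}, [0]

first = run(ds, files, stored, counter)    # nothing stored yet: rebuilt
second = run(ds, files, stored, counter)   # pass 1 now equals the stored pass 3 value

print(first, second, counter[0])  # True False 1
```

The second run is skipped because the generated file is already folded into the stored checksum. In this model, only a change to something the dataset does not produce itself (the command text, or an upstream input) can make pass 1 differ from the stored value.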
Reference
| Situation (under -C) | Result |
|---|---|
| input file / remote resource / update text changed | UPDATE |
| a parent is being rebuilt | child PROPAGATE (unless frozen) |
| nothing changed | ok (skipped) |
| changed but frozen: true | FROZEN, skipped (use -d / --force_unfreeze) |
| parent content changed, child inputs unchanged, parent not in set | child stays ok; rebuild the parent, or use -d |
| -d name / -D / -F | forced UPDATE / all |