path: root/src/wikidata.go
Age | Commit message | Author | Files | Lines
17 hours | feat: extract actors, directors, screenwriters from Wikipedia API | dev | 1 | -1/+40
- Extract directors from infobox 'Directed by' field/list
- Extract screenwriters from infobox 'Screenplay by' list
- Extract actors from Cast section list (first link = person name)
- Upsert into people table, link via who table (profession: actor=1, director=2, screenwriter=3)
- Track processed entries with has_people flag column
- Consumer inserts people and marks has_people=1 on success
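The "first link = person name" rule for the Cast section can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: it assumes the Cast section arrives as wikitext with `[[Target|Label]]` links, and `firstLinkTargets` and the `Prof*` constant names are hypothetical. Only the numeric profession codes come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Profession codes as listed in the commit message.
// The constant names themselves are hypothetical.
const (
	ProfActor        = 1
	ProfDirector     = 2
	ProfScreenwriter = 3
)

// wikiLink matches [[Target]] or [[Target|Label]] and captures Target.
var wikiLink = regexp.MustCompile(`\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]`)

// firstLinkTargets returns the target of the first wiki link on each
// bulleted line of a Cast section. Lines without a link are skipped.
func firstLinkTargets(castSection string) []string {
	var names []string
	for _, line := range strings.Split(castSection, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "*") {
			continue // not a list item
		}
		if m := wikiLink.FindStringSubmatch(line); m != nil {
			names = append(names, strings.TrimSpace(m[1]))
		}
	}
	return names
}

func main() {
	cast := "* [[Tom Hanks]] as Forrest Gump\n" +
		"* [[Robin Wright]] as [[Jenny Curran|Jenny]]\n" +
		"* Unlinked Extra as Bystander"
	for _, name := range firstLinkTargets(cast) {
		fmt.Println(name, ProfActor)
	}
}
```

Taking only the first link per line avoids picking up the character name when it is also linked, at the cost of skipping cast members who have no link at all.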
26 hours | fix: prevent dropped wiki entries when channel fills | dev | 1 | -13/+4
- Remove non-blocking select/default that silently dropped entries
- Channel sized to hold all pending entries (existing + SPARQL)
- Blocking send backpressures SPARQL if consumer is slow
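The difference between the removed pattern and the fix can be illustrated with a small, self-contained sketch. The function names (`sendLossy`, `sendAll`, `drain`) are hypothetical and the entries are plain strings. The point is only that `select` with a `default` branch drops values once the buffer is full, while a plain send waits for the consumer.

```go
package main

import "fmt"

// sendLossy mimics the removed pattern: when the channel buffer is full,
// the default branch runs and the entry is silently lost.
func sendLossy(entries []string, out chan<- string) (dropped int) {
	for _, e := range entries {
		select {
		case out <- e:
		default:
			dropped++
		}
	}
	close(out)
	return dropped
}

// sendAll uses a blocking send: a full buffer makes the producer wait,
// which applies backpressure instead of losing entries.
func sendAll(entries []string, out chan<- string) {
	for _, e := range entries {
		out <- e
	}
	close(out)
}

// drain counts everything received until the channel is closed.
func drain(in <-chan string) int {
	n := 0
	for range in {
		n++
	}
	return n
}

func main() {
	entries := []string{"a", "b", "c", "d", "e"}

	lossy := make(chan string, 2)
	dropped := sendLossy(entries, lossy) // no consumer is running yet
	fmt.Println("lossy: dropped", dropped, "received", drain(lossy))

	safe := make(chan string, 1)
	go sendAll(entries, safe)
	fmt.Println("blocking: received", drain(safe))
}
```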
27 hours | feat: add -wiki-only flag to rerun only wiki data extraction | dev | 1 | -0/+25
- fetchWikiArticlesData is standalone again (re-extracted from consumer)
- -wiki-only flag skips SPARQL pipeline, runs only wiki data fetch
- Default behavior: full pipeline (SPARQL + wiki data in parallel)
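A boolean switch like `-wiki-only` could be wired up with the standard `flag` package roughly as below. This is a sketch under assumptions: the flag-set name `importer`, the help text, and `parseWikiOnly` are hypothetical, and the two branches only print what the real program would run.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseWikiOnly parses the given arguments and reports whether
// -wiki-only was set. It uses its own FlagSet so it can be tested.
func parseWikiOnly(args []string) (bool, error) {
	fs := flag.NewFlagSet("importer", flag.ContinueOnError)
	wikiOnly := fs.Bool("wiki-only", false, "skip the SPARQL pipeline and only fetch wiki data")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *wikiOnly, nil
}

func main() {
	wikiOnly, err := parseWikiOnly(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if wikiOnly {
		fmt.Println("running wiki data fetch only")
		return
	}
	fmt.Println("running full pipeline (SPARQL + wiki data)")
}
```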
27 hours | refactor: pipeline SPARQL and wiki data in parallel | dev | 1 | -40/+169
- Merge fetchWikiArticles + fetchWikiArticlesData into one pipeline
- SPARQL producer fetches batches, commits each to DB, forwards resolved articles
- Wiki data consumer runs concurrently, fetching at 2s/request
- Each SPARQL batch commits independently (no global transaction)
- Rate limits respected for both Wikidata SPARQL and wiki server
- No parallel requests to either endpoint
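The producer/consumer shape described in this commit might look roughly like the sketch below. It is not the commit's code: `runPipeline`, `demoInterval`, and the `fetch` callback are hypothetical, the database commit is reduced to a comment, and the interval is shortened from the 2 s per request mentioned above so the example finishes quickly. A single consumer goroutine keeps requests to the wiki server strictly sequential.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoInterval stands in for the 2s/request pacing; shortened for the demo.
const demoInterval = 10 * time.Millisecond

// runPipeline forwards every title from the producer to one consumer and
// returns how many titles the consumer processed. interval must be > 0.
func runPipeline(batches [][]string, interval time.Duration, fetch func(string)) int {
	total := 0
	for _, b := range batches {
		total += len(b)
	}
	ch := make(chan string, total) // large enough to hold all pending titles

	var wg sync.WaitGroup
	processed := 0
	wg.Add(1)
	go func() { // consumer: one request at a time, paced by the ticker
		defer wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for title := range ch {
			fetch(title)
			processed++
			<-ticker.C // wait before the next request
		}
	}()

	for _, batch := range batches { // producer
		// A real producer would commit this batch to the DB here,
		// independently of other batches.
		for _, title := range batch {
			ch <- title
		}
	}
	close(ch)
	wg.Wait()
	return processed
}

func main() {
	batches := [][]string{{"Alien", "Heat"}, {"Ran"}}
	n := runPipeline(batches, demoInterval, func(t string) { fmt.Println("fetch", t) })
	fmt.Println("processed", n)
}
```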
29 hours | fix: decode wiki article names for clean storage | dev | 1 | -1/+1
- wikidata.go: url.PathUnescape SPARQL titles before storing
- wikiarticle.go: PathUnescape on read, PathEscape on send
- DB holds decoded names; URLs are always freshly encoded
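The decode-on-store, encode-on-send convention can be shown with the two `net/url` calls named in the commit. The helper names `decodeTitle` and `articleURL` and the `en.wikipedia.org` base URL are assumptions made for this illustration. The commit only says that titles are unescaped before storage and escaped again when a URL is built.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeTitle turns a percent-encoded title (as it may appear in a
// SPARQL result URL) into the plain form kept in the database.
func decodeTitle(raw string) (string, error) {
	return url.PathUnescape(raw)
}

// articleURL builds a request URL from a stored, decoded title.
// The base URL is a placeholder chosen for this sketch.
func articleURL(title string) string {
	return "https://en.wikipedia.org/wiki/" + url.PathEscape(title)
}

func main() {
	title, err := decodeTitle("Am%C3%A9lie")
	if err != nil {
		fmt.Println("bad title:", err)
		return
	}
	fmt.Println(title)             // Amélie
	fmt.Println(articleURL(title)) // https://en.wikipedia.org/wiki/Am%C3%A9lie
}
```

Storing the decoded form means a title is never escaped twice. Escaping an already-escaped title would turn `%C3` into `%25C3`.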
34 hours | fix: skip already-classified entries in wikidata query | dev | 1 | -1/+1
Add has_no_wiki_article = 0 filter so entries previously marked as having no Wikipedia article are not re-queried on subsequent runs.
34 hours | feat: set has_no_wiki_article flag for entries without Wikipedia article | dev | 1 | -13/+34
- Mark entries as has_no_wiki_article=1 when Wikidata returns no result
- Also mark entries in batches that failed with HTTP errors
- Re-run populated 2705 wiki articles, 592 marked as no wiki
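The marking rule above comes down to a set difference between the IDs sent in a batch and the IDs that came back with an article. The sketch below assumes string IDs and a map of resolved titles. `missingIDs` is a hypothetical helper, and the database update that would set `has_no_wiki_article=1` is left out.

```go
package main

import "fmt"

// missingIDs returns the IDs in batch that have no entry in resolved.
// Passing a nil map (for example after an HTTP error for the whole
// batch) yields every ID, matching the "also mark failed batches" rule.
func missingIDs(batch []string, resolved map[string]string) []string {
	var missing []string
	for _, id := range batch {
		if _, ok := resolved[id]; !ok {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	batch := []string{"id1", "id2", "id3"}

	// Normal case: only id2 resolved to an article title.
	fmt.Println(missingIDs(batch, map[string]string{"id2": "Some Article"}))

	// Failed batch: nothing resolved, so every ID would be marked.
	fmt.Println(missingIDs(batch, nil))
}
```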
35 hours | feat: fetch Wikipedia article titles via Wikidata SPARQL | dev | 1 | -0/+239
- Query Wikidata SPARQL in batches of 30 for entries missing wiki_article
- Store wiki_article title in imdb table
- Respect rate limits with configurable delay and retry on 5xx/429
- Skip entries that have no Wikipedia article
- Removed unique constraint on wiki_article (multiple entries can share one)
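Two pieces of this commit lend themselves to a small sketch: splitting pending IDs into batches of 30, and deciding which HTTP statuses are worth retrying. Both helpers (`chunk`, `shouldRetry`) are hypothetical names. The SPARQL request itself, the configurable delay, and the retry loop are omitted.

```go
package main

import (
	"fmt"
	"net/http"
)

// batchSize matches the batch size mentioned in the commit message.
const batchSize = 30

// chunk splits ids into consecutive slices of at most size elements.
// A non-positive size yields no batches.
func chunk(ids []string, size int) [][]string {
	var out [][]string
	for size > 0 && len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		out = append(out, ids[:n])
		ids = ids[n:]
	}
	return out
}

// shouldRetry reports whether a response status is transient:
// 429 Too Many Requests or any 5xx server error.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || (status >= 500 && status <= 599)
}

func main() {
	ids := make([]string, 65)
	for i := range ids {
		ids[i] = fmt.Sprintf("id%d", i)
	}
	for i, b := range chunk(ids, batchSize) {
		fmt.Printf("batch %d: %d ids\n", i, len(b))
	}
	fmt.Println(shouldRetry(429), shouldRetry(503), shouldRetry(404))
}
```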