| Age | Commit message | Author | Files | Lines |
|
- fetchWikiArticlesData is standalone again (re-extracted from consumer)
- -wiki-only flag skips SPARQL pipeline, runs only wiki data fetch
- Default behavior: full pipeline (SPARQL + wiki data in parallel)
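
The flag handling described above might look roughly like the sketch below. Only the flag name `-wiki-only` comes from the commit message; the function name `parseWikiOnly`, the flag set name, and the help text are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseWikiOnly reports whether -wiki-only was passed.
// Hypothetical helper: the real program may parse its flags differently.
func parseWikiOnly(args []string) (bool, error) {
	fs := flag.NewFlagSet("fetch", flag.ContinueOnError)
	wikiOnly := fs.Bool("wiki-only", false, "skip the SPARQL pipeline and only fetch wiki data")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *wikiOnly, nil
}

func main() {
	wikiOnly, err := parseWikiOnly(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if wikiOnly {
		fmt.Println("mode: wiki data fetch only")
		return
	}
	// Default: full pipeline, SPARQL and wiki data side by side.
	fmt.Println("mode: full pipeline")
}
```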
|
|
- Merge fetchWikiArticles + fetchWikiArticlesData into one pipeline
- SPARQL producer fetches batches, commits each to DB, forwards resolved articles
- Wiki data consumer runs concurrently, issuing one request every 2s
- Each SPARQL batch commits independently (no global transaction)
- Rate limits respected for both Wikidata SPARQL and wiki server
- No parallel requests to either endpoint
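
A minimal sketch of the producer/consumer shape described above, assuming a channel carries resolved titles from the SPARQL side to the wiki data side. All names here (`produce`, `consume`, `runPipeline`) and the callback signatures are hypothetical; the real code performs HTTP requests and DB commits where this sketch uses callbacks.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// produce handles batches one at a time: commit the batch, then forward
// its titles. Each batch is committed on its own, with no global transaction.
func produce(batches [][]string, commit func([]string) error, out chan<- string) error {
	defer close(out)
	for _, batch := range batches {
		if err := commit(batch); err != nil {
			return err
		}
		for _, title := range batch {
			out <- title
		}
	}
	return nil
}

// consume fetches titles sequentially, pausing for interval between
// consecutive fetches so requests never run in parallel.
func consume(in <-chan string, interval time.Duration, fetch func(string)) {
	first := true
	for title := range in {
		if !first {
			time.Sleep(interval)
		}
		first = false
		fetch(title)
	}
}

// runPipeline runs the consumer concurrently with the producer and waits
// for both to finish.
func runPipeline(batches [][]string, interval time.Duration,
	commit func([]string) error, fetch func(string)) error {
	titles := make(chan string, 64)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		consume(titles, interval, fetch)
	}()
	err := produce(batches, commit, titles)
	wg.Wait()
	return err
}

func main() {
	batches := [][]string{{"Alien", "Heat"}, {"Ran"}}
	commit := func(b []string) error {
		fmt.Println("committed batch of", len(b))
		return nil
	}
	fetch := func(title string) { fmt.Println("fetching", title) }
	// The commit message mentions 2s per request; a short interval keeps the demo quick.
	if err := runPipeline(batches, 10*time.Millisecond, commit, fetch); err != nil {
		fmt.Println("error:", err)
	}
}
```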
|
|
- wikidata.go: url.PathUnescape SPARQL titles before storing
- wikiarticle.go: PathUnescape on read, PathEscape on send
- DB holds decoded names; URLs are always freshly encoded
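
The decode-on-store, encode-on-send rule can be illustrated with the standard `net/url` functions named in the commit. The helper names `decodeTitle` and `articleURL` and the base URL are assumptions for this sketch, not taken from the repository.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeTitle converts a percent-encoded title into the decoded form
// that would be stored in the database.
func decodeTitle(encoded string) (string, error) {
	return url.PathUnescape(encoded)
}

// articleURL builds a request URL from a decoded title, encoding it
// freshly every time. The base URL is a hypothetical example.
func articleURL(base, title string) string {
	return base + url.PathEscape(title)
}

func main() {
	title, err := decodeTitle("Am%C3%A9lie")
	if err != nil {
		panic(err)
	}
	fmt.Println(title)                                               // Amélie
	fmt.Println(articleURL("https://en.wikipedia.org/wiki/", title)) // .../wiki/Am%C3%A9lie
}
```

Storing only the decoded form avoids double encoding: a title that was already escaped and is escaped again would turn `%` into `%25` and produce a broken URL.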
|
|
Add has_no_wiki_article = 0 filter so entries previously marked as
having no Wikipedia article are not re-queried on subsequent runs.
|
|
- Mark entries as has_no_wiki_article=1 when Wikidata returns no result
- Also mark entries in batches that failed with HTTP errors
- Re-run populated 2705 wiki articles; 592 entries marked as having no wiki article
|
|
- Query Wikidata SPARQL in batches of 30 for entries missing wiki_article
- Store wiki_article title in imdb table
- Respect rate limits with configurable delay and retry on 5xx/429
- Skip entries that have no Wikipedia article
- Removed unique constraint on wiki_article (multiple entries can share one)
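
The batching and retry rules above could be sketched as follows. The batch size of 30 and the retry-on-5xx/429 rule come from the commit message; the helper names `chunk` and `shouldRetry` are hypothetical, and the configurable delay and the actual HTTP calls are left out.

```go
package main

import (
	"fmt"
	"net/http"
)

// batchSize matches the batch size stated in the commit message.
const batchSize = 30

// chunk splits ids into consecutive slices of at most size elements.
func chunk(ids []string, size int) [][]string {
	var out [][]string
	for size > 0 && len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		out = append(out, ids[:n])
		ids = ids[n:]
	}
	return out
}

// shouldRetry reports whether a response status warrants another attempt:
// 429 (rate limited) or any 5xx server error.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || (status >= 500 && status <= 599)
}

func main() {
	ids := make([]string, 65)
	for i := range ids {
		ids[i] = fmt.Sprintf("tt%07d", i+1)
	}
	for i, b := range chunk(ids, batchSize) {
		fmt.Printf("batch %d: %d entries\n", i+1, len(b))
	}
	fmt.Println(shouldRetry(http.StatusServiceUnavailable), shouldRetry(http.StatusNotFound))
}
```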
|