| Age | Commit message | Author | Files | Lines |
|
- Merge fetchWikiArticles + fetchWikiArticlesData into one pipeline
- SPARQL producer fetches batches, commits each to DB, forwards resolved articles
- Wiki data consumer runs concurrently, fetching at one request every 2s
- Each SPARQL batch commits independently (no global transaction)
- Rate limits respected for both Wikidata SPARQL and wiki server
- No parallel requests to either endpoint
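
A minimal sketch of the producer/consumer shape described above, assuming a buffered channel between the two stages. The `article` type, `runPipeline`, and the callback signatures are hypothetical names for illustration, not the project's actual code; the per-batch DB commit is only marked by a comment.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// article is a hypothetical stand-in for a resolved wiki_article row.
type article struct {
	ImdbID string
	Title  string
}

// runPipeline runs one producer (the calling goroutine) and one consumer
// goroutine joined by a channel. fetchBatch returns the next SPARQL batch,
// or an empty slice when nothing is left. fetchData is called serially,
// with delay between consecutive calls, so neither endpoint ever sees
// parallel requests. It returns the number of articles consumed.
func runPipeline(fetchBatch func() []article, fetchData func(article), delay time.Duration) int {
	ch := make(chan article, 64)
	var wg sync.WaitGroup
	processed := 0

	wg.Add(1)
	go func() { // consumer
		defer wg.Done()
		first := true
		for a := range ch {
			if !first {
				time.Sleep(delay) // no delay before the very first request
			}
			first = false
			fetchData(a)
			processed++
		}
	}()

	for { // producer
		batch := fetchBatch()
		if len(batch) == 0 {
			break
		}
		// A real implementation would commit this batch to the DB here,
		// independently of all other batches.
		for _, a := range batch {
			ch <- a
		}
	}
	close(ch)
	wg.Wait()
	return processed
}

func main() {
	batches := [][]article{
		{{"tt0000001", "First"}, {"tt0000002", "Second"}},
		{{"tt0000003", "Third"}},
	}
	next := 0
	fetchBatch := func() []article {
		if next >= len(batches) {
			return nil
		}
		b := batches[next]
		next++
		return b
	}
	n := runPipeline(fetchBatch, func(a article) {
		fmt.Println("fetching wiki data for", a.Title)
	}, 10*time.Millisecond)
	fmt.Println("processed", n)
}
```

Closing the channel once the producer is done lets the consumer drain what is left and exit on its own.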
|
- One-shot migration decoded 187 percent-encoded rows
- Removed decode-on-read from wikiarticle.go (no longer needed)
- wikidata.go still decodes SPARQL URLs before storing (for future inserts)
- wikiarticle.go encodes on send via url.PathEscape
|
|
- wikidata.go: url.PathUnescape SPARQL titles before storing
- wikiarticle.go: PathUnescape on read, PathEscape on send
- DB holds decoded names; URLs are always freshly encoded
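
The decode-on-store, encode-on-send rule can be sketched as below. `decodeTitle`, `articleURL`, and the server URL are hypothetical; only `url.PathUnescape` and `url.PathEscape` come from the commit message.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeTitle converts a percent-encoded title (as found in a SPARQL
// result URL) into the decoded form kept in the DB.
func decodeTitle(encoded string) (string, error) {
	return url.PathUnescape(encoded)
}

// articleURL builds a request URL from a decoded title, encoding it
// freshly at send time. The server layout is an assumption.
func articleURL(server, title string) string {
	return server + "/" + url.PathEscape(title)
}

func main() {
	title, err := decodeTitle("Am%C3%A9lie")
	if err != nil {
		fmt.Println("bad title:", err)
		return
	}
	fmt.Println(title)
	fmt.Println(articleURL("https://wiki.example.org/article", title))
}
```

Storing only decoded names means a value is never escaped twice, whichever code path sends it.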
|
|
- wiki_article values are already URL-encoded in the DB
- Build query URL manually instead of url.Values.Encode()
- Only escape username (not pre-encoded)
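
A short illustration of why this commit builds the URL by hand: passing an already-encoded value through `url.Values.Encode()` escapes its `%` signs a second time. The parameter names `article` and `username` and the server URL are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery assumes encodedArticle is already percent-encoded (as it is
// in the DB at this commit) and escapes only the username.
func buildQuery(server, encodedArticle, username string) string {
	return server + "?article=" + encodedArticle + "&username=" + url.QueryEscape(username)
}

// viaValues shows what url.Values.Encode() does to a pre-encoded value.
func viaValues(encodedArticle string) string {
	return url.Values{"article": {encodedArticle}}.Encode()
}

func main() {
	fmt.Println(buildQuery("https://wiki.example.org/query", "Am%C3%A9lie", "some user"))
	fmt.Println(viaValues("Am%C3%A9lie")) // the % signs come out as %25
}
```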
|
|
- queryWikiArticle returns HTTP status code alongside entry data
- Always record wiki_status_code for every request (success or failure)
- Skip entries with wiki_status_code = 404 in future runs
- Only update data fields on HTTP 200; non-200 only records status
- Log line shows updated vs skipped (non-200) counts
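
The status rules above reduce to two small predicates plus a tally for the log line. This is a sketch with hypothetical function names; the actual DB writes are omitted.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldUpdateData reports whether data fields may be written:
// only on HTTP 200. The status code itself is recorded in every case.
func shouldUpdateData(status int) bool {
	return status == http.StatusOK
}

// skipInFutureRuns reports whether an entry with this recorded
// wiki_status_code is excluded from later runs.
func skipInFutureRuns(status int) bool {
	return status == http.StatusNotFound
}

// tally counts updated vs skipped (non-200) responses for the log line.
func tally(statuses []int) (updated, skipped int) {
	for _, s := range statuses {
		if shouldUpdateData(s) {
			updated++
		} else {
			skipped++
		}
	}
	return updated, skipped
}

func main() {
	updated, skipped := tally([]int{200, 404, 200, 429})
	fmt.Printf("updated=%d skipped(non-200)=%d\n", updated, skipped)
}
```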
|
|
- Retry up to 5 times on HTTP 429 with 2s/4s/8s/16s backoff
- Move inter-request delay before each request (was after)
- Increase base delay from 1s to 2s between requests
- Fix: skip the delay before the first request; sleep only before subsequent ones
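
One possible reading of the retry rule, sketched below: five attempts in total, with waits of 2s, 4s, 8s, and 16s between them. The `request` callback, the injectable `sleep` function, and the `recorder` helper are hypothetical and exist only so the sketch runs without real waiting.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const maxAttempts = 5 // assumed: 5 attempts, hence 4 waits

// backoff returns the wait before retry n (1-based): 2s, 4s, 8s, 16s.
func backoff(retry int) time.Duration {
	return time.Duration(1<<retry) * time.Second
}

// doWithRetry calls request until it returns something other than
// HTTP 429, an error occurs, or maxAttempts is reached.
func doWithRetry(request func() (int, error), sleep func(time.Duration)) (int, error) {
	var status int
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, err = request()
		if err != nil || status != http.StatusTooManyRequests {
			return status, err
		}
		if attempt < maxAttempts {
			sleep(backoff(attempt))
		}
	}
	return status, err
}

// recorder collects requested waits instead of sleeping (demo only).
type recorder struct{ waits []time.Duration }

func (r *recorder) sleep(d time.Duration) { r.waits = append(r.waits, d) }

func main() {
	calls := 0
	request := func() (int, error) {
		calls++
		if calls < 3 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusOK, nil
	}
	rec := &recorder{}
	status, err := doWithRetry(request, rec.sleep)
	fmt.Println(status, err, rec.waits)
}
```

In production code `time.Sleep` would be passed as the `sleep` argument.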
|
|
- Add wiki_server and wiki_username config fields
- Query custom server for each wiki_article entry
- Extract description, synopsis (Plot), year, poster_url, license,
license_url, num_accolades from structured JSON response
- Serial processing with 1 req/s rate limit
- Update only entries missing at least one target column
|
|
Add has_no_wiki_article = 0 filter so entries previously marked as
having no Wikipedia article are not re-queried on subsequent runs.
|
|
- Mark entries as has_no_wiki_article=1 when Wikidata returns no result
- Also mark entries in batches that failed with HTTP errors
- Re-run populated 2705 wiki articles; 592 entries marked as having no wiki article
|
|
- Query Wikidata SPARQL in batches of 30 for entries missing wiki_article
- Store wiki_article title in imdb table
- Respect rate limits with configurable delay and retry on 5xx/429
- Skip entries that have no Wikipedia article
- Removed unique constraint on wiki_article (multiple entries can share one)
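
Batching by 30 can be sketched as a chunking helper plus a SPARQL `VALUES` clause per batch. The variable name `?imdb` and both function names are assumptions; the rest of the query is not shown because the commit message does not give it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chunk splits ids into consecutive slices of at most size elements.
func chunk(ids []string, size int) [][]string {
	var out [][]string
	for size > 0 && len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		out = append(out, ids[:n])
		ids = ids[n:]
	}
	return out
}

// valuesClause renders one batch as a SPARQL VALUES clause.
// strconv.Quote is sufficient here because IMDb IDs are plain ASCII.
func valuesClause(ids []string) string {
	quoted := make([]string, len(ids))
	for i, id := range ids {
		quoted[i] = strconv.Quote(id)
	}
	return "VALUES ?imdb { " + strings.Join(quoted, " ") + " }"
}

func main() {
	ids := []string{"tt0000001", "tt0000002", "tt0000003"}
	for _, batch := range chunk(ids, 2) {
		fmt.Println(valuesClause(batch))
	}
}
```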
|
|
The previous run left partial data after a mid-transaction rollback.
INSERT IGNORE makes the junction table insert idempotent.
|
|
- genre table: (id, name) with unique name constraint
- imdb_genre table: (id, imdb_id, genre_id) junction table
- Upsert genres via INSERT ... ON DUPLICATE KEY UPDATE
- Link via imdb_genre using LAST_INSERT_ID
- Check missing genres via LEFT JOIN imdb_genre
|
|
- Parse genres field (rec[8]) from title.basics.tsv, split by comma
- Insert into genre table via SELECT to resolve imdb.id from imdb_id
- Update fetchAndUpdateImdbData to check for missing genres too
- Skip download if TSV already exists (supports stubbed downloadFile)
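
Splitting the genres column might look like the sketch below. `parseGenres` is a hypothetical helper; the `\N` check reflects the marker the IMDb dataset files use for missing values.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGenres splits the genres column (rec[8]) of title.basics.tsv
// into individual names. A missing value yields no genres.
func parseGenres(field string) []string {
	if field == "" || field == `\N` {
		return nil
	}
	var out []string
	for _, g := range strings.Split(field, ",") {
		if g = strings.TrimSpace(g); g != "" {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	fmt.Println(parseGenres("Comedy,Drama,Romance"))
	fmt.Println(len(parseGenres(`\N`)))
}
```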
|
|
- Replace csv.Reader with bufio.Scanner to avoid quote-parsing issues
that skipped ~355 entries (e.g. tt1853728 was on line 4.8M and got
lost when csv.Reader encountered malformed quoted fields earlier)
- Fix column indices: startYear=rec[5], runtimeMinutes=rec[7]
(was rec[4]/rec[5] which mapped to isAdult/startYear)
- Update basics for ALL imdb entries, not just those missing ratings
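
A sketch of the scanner-based approach: each line is split on tabs with no quote handling, so a stray `"` in a title cannot swallow later rows. The `basics` type and function names are hypothetical, and the sample ID and values are made up; the column indices follow the fix above.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// basics holds the two numeric columns this sketch extracts.
type basics struct {
	ID             string
	StartYear      int // 0 when the column is \N or unparsable
	RuntimeMinutes int // 0 when the column is \N or unparsable
}

func atoiOrZero(s string) int {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0
	}
	return n
}

// parseBasicsLine splits one TSV line on tabs only.
// startYear is rec[5] and runtimeMinutes is rec[7].
func parseBasicsLine(line string) (basics, bool) {
	rec := strings.Split(line, "\t")
	if len(rec) < 9 {
		return basics{}, false
	}
	return basics{ID: rec[0], StartYear: atoiOrZero(rec[5]), RuntimeMinutes: atoiOrZero(rec[7])}, true
}

// parseBasics scans r line by line and keeps only wanted IDs.
func parseBasics(r io.Reader, wanted map[string]bool) (map[string]basics, error) {
	out := make(map[string]basics)
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long lines
	for sc.Scan() {
		if b, ok := parseBasicsLine(sc.Text()); ok && wanted[b.ID] {
			out[b.ID] = b
		}
	}
	return out, sc.Err()
}

func main() {
	data := "tt0000001\tmovie\tAn \"Unclosed quote\tOriginal\t0\t2010\t\\N\t148\tAction\n" +
		"tt0000002\tmovie\tSecond\tSecond\t0\t2012\t\\N\t\\N\tDrama\n"
	got, err := parseBasics(strings.NewReader(data), map[string]bool{"tt0000002": true})
	fmt.Println(got, err)
}
```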
|
|
- Check for imdb entries with NULL average_rating
- Download title.basics.tsv.gz and title.ratings.tsv.gz to imdbdata/
- Decompress alongside gzip originals
- Parse only rows matching our imdb_ids (memory-efficient)
- Update: average_rating, num_votes, title_type, primary_title,
original_title, start_year, runtime_minutes
- Results: 3394 ratings, 3093 basics updated out of 3448 entries
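
The memory-efficient filtering can be sketched as a streaming read that keeps only rows whose ID is in a lookup set. Unlike the commit, which decompresses the files to disk first, this sketch reads straight from the gzip stream; all function and type names are hypothetical, and the sample rows are made up.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// rating holds one row of title.ratings.tsv (tconst, averageRating, numVotes).
type rating struct {
	Average float64
	Votes   int
}

// readRatings streams a gzip-compressed TSV and keeps only wanted IDs,
// so memory use depends on the size of wanted, not of the file.
func readRatings(r io.Reader, wanted map[string]bool) (map[string]rating, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	out := make(map[string]rating, len(wanted))
	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		rec := strings.Split(sc.Text(), "\t")
		if len(rec) < 3 || !wanted[rec[0]] {
			continue // also skips the header row
		}
		avg, err1 := strconv.ParseFloat(rec[1], 64)
		votes, err2 := strconv.Atoi(rec[2])
		if err1 != nil || err2 != nil {
			continue
		}
		out[rec[0]] = rating{Average: avg, Votes: votes}
	}
	return out, sc.Err()
}

// gzipString compresses s in memory (demo helper).
func gzipString(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

// ratingsFromBytes is a convenience wrapper around readRatings.
func ratingsFromBytes(data []byte, wanted map[string]bool) (map[string]rating, error) {
	return readRatings(bytes.NewReader(data), wanted)
}

func main() {
	data := gzipString("tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2100\ntt0000002\t5.6\t280\n")
	got, err := ratingsFromBytes(data, map[string]bool{"tt0000002": true})
	fmt.Println(got, err)
}
```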
|
|
- Extract distinct IMDb title IDs from links.param (host=imdb.com)
- Skip IDs already in imdb table and non-title params (nm, ls, etc.)
- Insert 3448 unique title IDs into imdb.imdb_id
|
|
- Query links table for IMDb title URLs (field=1, host=imdb.com)
- Extract ttIDs via regex and batch-update links.param
- 5662 rows updated successfully
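
The ID extraction step might be sketched as follows. The exact regular expression used in the project is not given, so the pattern here is an assumption; it accepts only `/title/tt…` URLs, which leaves out name (`nm`) and list (`ls`) links.

```go
package main

import (
	"fmt"
	"regexp"
)

// titleIDPattern matches IMDb title URLs and captures the tt ID.
var titleIDPattern = regexp.MustCompile(`/title/(tt\d+)`)

// extractTitleID returns the tt ID from a URL, if it has one.
func extractTitleID(rawURL string) (string, bool) {
	m := titleIDPattern.FindStringSubmatch(rawURL)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// distinctTitleIDs returns each title ID once, in first-seen order.
func distinctTitleIDs(urls []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, u := range urls {
		if id, ok := extractTitleID(u); ok && !seen[id] {
			seen[id] = true
			out = append(out, id)
		}
	}
	return out
}

func main() {
	fmt.Println(distinctTitleIDs([]string{
		"https://www.imdb.com/title/tt0111161/",
		"https://www.imdb.com/name/nm0000151/",
		"https://www.imdb.com/title/tt0111161/reviews",
	}))
}
```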
|
|
serialization)
|
- Replace Viper-based config with encoding/json (config.go)
- Add config.json with sensible defaults (gitignored)
- Add config.json.example with empty values as reference
- Initialize go module (go.mod)
- Update main.go to use LoadConfig()
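
A minimal sketch of a JSON-based config loader using only `encoding/json`. The field names `db_host` and `db_name` are hypothetical, and `LoadConfig` is shown taking a path argument, which is an assumption about its signature; `wiki_server` and `wiki_username` are the fields a later commit in this log adds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config mirrors config.json. Field names are partly assumed.
type Config struct {
	DBHost       string `json:"db_host"` // hypothetical
	DBName       string `json:"db_name"` // hypothetical
	WikiServer   string `json:"wiki_server"`
	WikiUsername string `json:"wiki_username"`
}

// parseConfig decodes raw JSON into a Config.
func parseConfig(data []byte) (*Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	return &cfg, nil
}

// LoadConfig reads and parses the file at path.
func LoadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read config: %w", err)
	}
	return parseConfig(data)
}

func main() {
	cfg, err := parseConfig([]byte(`{"db_host": "localhost", "wiki_server": "https://wiki.example.org"}`))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cfg.DBHost, cfg.WikiServer)
}
```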
|
|