hnimdbbot - Pure Cinema!

Age	Commit message	Author	Files	Lines
11 hours	feat: add license_short from Wikipedia license identifier [main]	dev	2	-3/+8
	- Extract license.identifier (e.g. CC-BY-SA-4.0) into new license_short column - Warn if a license array has more than 1 entry (none seen yet) - Include license_short IS NULL in getExistingWikiArticles query
12 hours	fix: only count awards tables in num_accolades	dev	1	-0/+18
	extractAccolades was summing rows from all tables (including episode lists), producing inflated counts (e.g. 708 for Unreported_World which has 0 actual awards). Now it filters to tables whose headers contain 'Award'.
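The header filter described in this fix could look roughly like the sketch below. The `Table` type and both function names are hypothetical stand-ins; the log does not show the real `extractAccolades` code or its data structures.

```go
package main

import (
	"fmt"
	"strings"
)

// Table is a hypothetical, simplified stand-in for a parsed wiki table.
type Table struct {
	Headers []string
	Rows    [][]string
}

// isAwardsTable reports whether any header cell contains "Award".
func isAwardsTable(t Table) bool {
	for _, h := range t.Headers {
		if strings.Contains(h, "Award") {
			return true
		}
	}
	return false
}

// countAccolades sums the rows of awards tables only, so episode
// lists and other tables no longer inflate the count.
func countAccolades(tables []Table) int {
	n := 0
	for _, t := range tables {
		if isAwardsTable(t) {
			n += len(t.Rows)
		}
	}
	return n
}

func main() {
	tables := []Table{
		{Headers: []string{"Episode", "Air date"}, Rows: [][]string{{"1", "2000"}, {"2", "2001"}}},
		{Headers: []string{"Year", "Award", "Result"}, Rows: [][]string{{"2010", "Example Award", "Nominated"}}},
	}
	fmt.Println(countAccolades(tables)) // prints 1
}
```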
16 hours	feat: add three-level logging with per-request debug output	dev	5	-47/+135
	- New --log-level flag with three levels: debug, info (default), silent. debug: every API request logged (method, URL, status, duration); info: normal events (batch progress, entry counts, summaries); silent: only warnings and fatal errors - Replaced all log.Printf/Fatalf calls with level-gated helpers - API request timing added to queryWikiArticle, queryWikidataBatch, downloadFile - Retries and backoff logged in debug mode
17 hours	fix: add nil checks in extractPeople for missing infobox/section data	dev	1	-14/+19
	- Guard section.has_parts type assertion in extractPeople - Guard Cast section has_parts iteration with ok check
17 hours	feat: extract actors, directors, screenwriters from Wikipedia API	dev	2	-7/+184
	- Extract directors from infobox 'Directed by' field/list - Extract screenwriters from infobox 'Screenplay by' list - Extract actors from Cast section list (first link = person name) - Upsert into people table, link via who table (profession: actor=1, director=2, screenwriter=3) - Track processed entries with has_people flag column - Consumer inserts people and marks has_people=1 on success
26 hours	fix: prevent dropped wiki entries when channel fills	dev	1	-13/+4
	- Remove non-blocking select/default that silently dropped entries - Channel sized to hold all pending entries (existing + SPARQL) - Blocking send backpressures SPARQL if consumer is slow
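To illustrate the difference between the removed pattern and the fix, here is a small sketch. `trySend` and `forwardAll` are hypothetical names; the real producer and consumer are not shown in the log.

```go
package main

import "fmt"

// trySend is the kind of non-blocking send the fix removed:
// when the channel is full, the default branch runs and the
// entry is lost without any error.
func trySend(ch chan<- string, entry string) bool {
	select {
	case ch <- entry:
		return true
	default:
		return false
	}
}

// forwardAll sizes the channel for all pending entries and uses a
// plain blocking send, so no entry can be dropped.
func forwardAll(entries []string) <-chan string {
	ch := make(chan string, len(entries))
	go func() {
		defer close(ch)
		for _, e := range entries {
			ch <- e // waits if the buffer is ever full
		}
	}()
	return ch
}

func main() {
	small := make(chan string, 1)
	fmt.Println(trySend(small, "a"), trySend(small, "b")) // prints: true false

	n := 0
	for range forwardAll([]string{"a", "b", "c"}) {
		n++
	}
	fmt.Println(n) // prints 3
}
```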
26 hours	feat: add -wiki-only flag to rerun only wiki data extraction	dev	2	-11/+46
	- fetchWikiArticlesData is standalone again (re-extracted from consumer) - -wiki-only flag skips SPARQL pipeline, runs only wiki data fetch - Default behavior: full pipeline (SPARQL + wiki data in parallel)
27 hours	refactor: pipeline SPARQL and wiki data in parallel	dev	3	-166/+169
	- Merge fetchWikiArticles + fetchWikiArticlesData into one pipeline - SPARQL producer fetches batches, commits each to DB, forwards resolved articles - Wiki data consumer runs concurrently, fetching at 2s/request - Each SPARQL batch commits independently (no global transaction) - Rate limits respected for both Wikidata SPARQL and wiki server - No parallel requests to either endpoint
27 hours	.	dev	1	-8/+0

28 hours	refactor: decode wiki_article names once in DB, encode on send	dev	1	-4/+0
	- One-shot migration decoded 187 percent-encoded rows - Removed decode-on-read from wikiarticle.go (no longer needed) - wikidata.go still decodes SPARQL URLs before storing (for future inserts) - wikiarticle.go encodes on send via url.PathEscape
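The decode-on-store, encode-on-send scheme can be sketched with `url.PathUnescape` and `url.PathEscape` from the standard library. The helper names `storeName` and `requestPath` are hypothetical; the last line shows the double-encoding that results when an already-encoded name is escaped again.

```go
package main

import (
	"fmt"
	"net/url"
)

// storeName decodes a percent-encoded title before it is written to the DB.
func storeName(encoded string) (string, error) {
	return url.PathUnescape(encoded)
}

// requestPath encodes a decoded DB value freshly for each request.
func requestPath(name string) string {
	return url.PathEscape(name)
}

func main() {
	name, err := storeName("Am%C3%A9lie")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(name)                           // Amélie
	fmt.Println(requestPath(name))              // Am%C3%A9lie
	fmt.Println(requestPath(requestPath(name))) // Am%25C3%25A9lie (double-encoded)
}
```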
28 hours	fix: decode wiki article names for clean storage	dev	2	-4/+7
	- wikidata.go: url.PathUnescape SPARQL titles before storing - wikiarticle.go: PathUnescape on read, PathEscape on send - DB holds decoded names; URLs are always freshly encoded
28 hours	fix: avoid double URL-encoding of wiki article names	dev	1	-4/+11
	- wiki_article values are already URL-encoded in the DB - Build query URL manually instead of url.Values.Encode() - Only escape username (not pre-encoded)
29 hours	feat: track wiki_status_code and skip 404 entries on rerun	dev	1	-23/+47
	- queryWikiArticle returns HTTP status code alongside entry data - Always record wiki_status_code for every request (success or failure) - Skip entries with wiki_status_code = 404 in future runs - Only update data fields on HTTP 200; non-200 only records status - Log line shows updated vs skipped (non-200) counts
33 hours	fix: add 429 retry with exponential backoff and increase rate limit delay	dev	1	-9/+32
	- Retry up to 5 times on HTTP 429 with 2s/4s/8s/16s backoff - Move inter-request delay before each request (was after) - Increase base delay from 1s to 2s between requests - Fix: only sleep after first request (skip delay on first call)
33 hours	feat: fetch missing wiki data from custom server and populate imdb table	dev	3	-0/+291
	- Add wiki_server and wiki_username config fields - Query custom server for each wiki_article entry - Extract description, synopsis (Plot), year, poster_url, license, license_url, num_accolades from structured JSON response - Serial processing with 1 req/s rate limit - Update only entries missing at least one target column
33 hours	fix: skip already-classified entries in wikidata query	dev	1	-1/+1
	Add has_no_wiki_article = 0 filter so entries previously marked as having no Wikipedia article are not re-queried on subsequent runs.
33 hours	feat: set has_no_wiki_article flag for entries without Wikipedia article	dev	1	-13/+34
	- Mark entries as has_no_wiki_article=1 when Wikidata returns no result - Also mark entries in batches that failed with HTTP errors - Re-run populated 2705 wiki articles, 592 marked as no wiki
34 hours	feat: fetch Wikipedia article titles via Wikidata SPARQL	dev	2	-0/+243
	- Query Wikidata SPARQL in batches of 30 for entries missing wiki_article - Store wiki_article title in imdb table - Respect rate limits with configurable delay and retry on 5xx/429 - Skip entries that have no Wikipedia article - Removed unique constraint on wiki_article (multiple entries can share one)
3 days	fix: use INSERT IGNORE for imdb_genre to handle re-runs	dev	1	-1/+1
	The previous run left partial data after a mid-transaction rollback. INSERT IGNORE makes the junction table insert idempotent.
3 days	feat: adapt genre code for n:m relation via imdb_genre	dev	1	-9/+29
	- genre table: (id, name) with unique name constraint - imdb_genre table: (id, imdb_id, genre_id) junction table - Upsert genres via INSERT ... ON DUPLICATE KEY UPDATE - Link via imdb_genre using LAST_INSERT_ID - Check missing genres via LEFT JOIN imdb_genre
3 days	feat: populate genre table from title.basics.tsv	dev	1	-11/+47
	- Parse genres field (rec[8]) from title.basics.tsv, split by comma - Insert into genre table via SELECT to resolve imdb.id from imdb_id - Update fetchAndUpdateImdbData to check for missing genres too - Skip download if TSV already exists (supports stubbed downloadFile)
3 days	fix: correct TSV parsing — use line-by-line reader and proper column indices	dev	1	-30/+57
	- Replace csv.Reader with bufio.Scanner to avoid quote-parsing issues that skipped ~355 entries (e.g. tt1853728 was on line 4.8M and got lost when csv.Reader encountered malformed quoted fields earlier) - Fix column indices: startYear=rec[5], runtimeMinutes=rec[7] (was rec[4]/rec[5] which mapped to isAdult/startYear) - Update basics for ALL imdb entries, not just those missing ratings
3 days	chore: delete .gz files after extracting in downloadImdbDatasets	dev	1	-0/+3

3 days	move download path	dev	1	-1/+1

3 days	feat: fetchAndUpdateImdbData — download IMDB datasets and populate imdb table	dev	2	-0/+352
	- Check for imdb entries with NULL average_rating - Download title.basics.tsv.gz and title.ratings.tsv.gz to imdbdata/ - Decompress alongside gzip originals - Parse only rows matching our imdb_ids (memory-efficient) - Update: average_rating, num_votes, title_type, primary_title, original_title, start_year, runtime_minutes - Results: 3394 ratings, 3093 basics updated out of 3448 entries
3 days	feat: populate imdb table with unique title IDs from links	dev	1	-0/+91
	- Extract distinct IMDb title IDs from links.param (host=imdb.com) - Skip IDs already in imdb table and non-title params (nm, ls, etc.) - Insert 3448 unique title IDs into imdb.imdb_id
3 days	feat: extract IMDB title IDs from links URLs into param field	dev	3	-15/+87
	- Query links table for IMDB title URLs (field=1, host=imdb.com) - Extract ttIDs via regex and batch-update links.param - 5662 rows updated successfully
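A sketch of the ID extraction step. The pattern and the `extractTitleID` helper are assumptions; the commit only says that ttIDs are extracted via regex.

```go
package main

import (
	"fmt"
	"regexp"
)

// titleID matches the tt... identifier in an IMDb title URL path.
// The exact pattern used in the repository is not shown; this one is a guess.
var titleID = regexp.MustCompile(`/title/(tt\d+)`)

// extractTitleID returns the title ID, or "" if the URL is not a title URL.
func extractTitleID(rawURL string) string {
	m := titleID.FindStringSubmatch(rawURL)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Println(extractTitleID("https://www.imdb.com/title/tt1853728/")) // tt1853728
	fmt.Println(extractTitleID("https://www.imdb.com/name/nm0000001/") == "") // true
}
```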
3 days	feat: add AccessToken back to Config struct (json:"-" to exclude from serialization)	dev	1	-0/+1

3 days	chore: remove access_token from config (calculated by program)	dev	2	-4/+0

3 days	feat: switch config to JSON; add go.mod and config.json.example	dev	4	-88/+57
	- Replace Viper-based config with encoding/json (config.go) - Add config.json with sensible defaults (gitignored) - Add config.json.example with empty values as reference - Initialize go module (go.mod) - Update main.go to use LoadConfig()
3 days	chore: commit existing config.go changes	dev	1	-1/+2

3 days	Initial commit	dev	2	-0/+143