All Lessons (206)

Lesson 012: Bayesian Beta-Binomial Smoothing

The Artemis vote system shows 50 random images per ballot and asks voters to pick 5 favorites. With 500 ballots across 12,217 images, most images are shown only 1-2 times. A raw selection rate of "1 out of 1 shown = 100%" is meaningless — it tells you nothing about whether the image is actually pref...

Artemis 2026-05-24 algorithms
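
A minimal sketch of the smoothing idea in Lesson 012, assuming the standard Beta-Binomial posterior mean. The prior `Beta(1, 9)` is an illustrative assumption (its mean of 0.10 matches picking 5 of 50 shown); the lesson's actual prior is not stated in this summary.

```python
def smoothed_rate(selected, shown, prior_alpha=1.0, prior_beta=9.0):
    """Posterior mean selection rate under a Beta(prior_alpha, prior_beta) prior.

    The prior values are hypothetical defaults, not taken from the lesson.
    """
    if selected < 0 or shown < selected:
        raise ValueError("need 0 <= selected <= shown")
    return (selected + prior_alpha) / (shown + prior_alpha + prior_beta)


# "1 out of 1 shown" is pulled toward the prior mean instead of reading as 100%.
one_of_one = smoothed_rate(1, 1)      # (1 + 1) / (1 + 10)
# With more exposure, the observed data dominates the prior.
ten_of_forty = smoothed_rate(10, 40)  # (10 + 1) / (40 + 10)
never_shown = smoothed_rate(0, 0)     # falls back to the prior mean
```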

Lesson 013: Elo Rating for Image Comparison

The Artemis pairwise voting mode shows two images side by side and asks "which is better?" This produces binary outcomes (winner / loser) for specific pairs, not absolute ratings. We need to convert these relative comparisons into a single continuous strength score per image that can be combined wit...

Artemis 2026-05-24 algorithms

database statistics
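
As a rough illustration of how Lesson 013's winner/loser outcomes can become a continuous score, here is the conventional Elo update. The K-factor of 32, the 400-point scale, and the 1500 starting rating are customary defaults assumed for this sketch, not values confirmed by the lesson.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner_rating, loser_rating, k=32.0):
    """Return the new (winner, loser) ratings after one pairwise vote."""
    surprise = 1.0 - expected_score(winner_rating, loser_rating)
    delta = k * surprise
    return winner_rating + delta, loser_rating - delta


# Two unrated images start level, so the winner gains exactly half of K.
new_winner, new_loser = update_elo(1500.0, 1500.0)
```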

Lesson 014: Bradley-Terry-Luce and When to Skip It

We have pairwise comparison data (image A beats image B) and want the best possible strength estimates. Bradley-Terry-Luce (BTL) is the textbook model for this — it's more principled than Elo. But with 2,000 comparisons across 12,217 images, we chose to skip BTL entirely. This lesson explains what B...

Artemis 2026-05-24 algorithms

data-modeling statistics

Lesson 015: Borda Count for Ranked Voting

The Artemis category voting mode asks voters to rank their top 3 images within a category. We need to convert these partial rankings into numeric scores that can be aggregated across voters and combined with batch and pairwise preference signals.

Artemis 2026-05-24 algorithms
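
The conversion described in Lesson 015 can be sketched as follows, assuming the common 3-2-1 point scheme for a top-3 ranking. The point values and the image IDs are illustrative assumptions; unranked images simply receive no points.

```python
from collections import defaultdict


def borda_scores(rankings, depth=3):
    """Sum Borda points across voters: first place earns `depth`, last earns 1."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, image_id in enumerate(ranking[:depth]):
            scores[image_id] += depth - position
    return dict(scores)


# Hypothetical ballots; the third voter ranked only two images.
ballots = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_a", "img_d"],
]
totals = borda_scores(ballots)
```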

Lesson 016: Krippendorff's Alpha for Sparse Agreement

We want to measure whether voters agree on which images are good. With 100 voters and 12,217 images, the voter-image matrix is >98% missing — most voters never saw most images. Standard agreement metrics require complete matrices. We need a reliability measure that handles extreme sparsity.

Artemis 2026-05-24 algorithms

Lesson 017: Composite Scoring with Heterogeneous Signals

We have three different types of preference data — batch selection rates, Elo ratings from pairwise comparisons, and Borda scores from category rankings. Each covers a different subset of images, uses a different scale, and captures a different aspect of preference. Most images have data from only o...

Artemis 2026-05-24 algorithms

Lesson 018: Run-ID Partitioned Scoring

The scoring pipeline will be re-run as new vote data arrives, as scoring methods are tuned, or as bugs are fixed. Each run produces a full set of scores for all 12,217 images. If each run overwrites the previous scores, we lose the ability to compare methods, audit changes, or roll back to a known-g...

Artemis 2026-05-24 algorithms

Lesson 019: NULL as Honest Missing Data

Several columns in the scoring output have no meaningful value for most images. Only ~200 images have Elo scores (from 2,000 pairwise votes). Only ~150 images have Borda scores (from 250 category rankings). The BTL model wasn't run at all. Fleiss' kappa and Kendall's W can't be computed with incompl...

Artemis 2026-05-24 implementation

database data-modeling python statistics

Lesson 020: DuckDB executemany Hangs — Use PyArrow Bulk Insert

DuckDB's `executemany` with parameterized INSERT statements can hang indefinitely at scale (10K+ rows). Replacing it with a PyArrow table and `INSERT INTO ... SELECT * FROM tbl` completes the same work in under a second. When DuckDB is your warehouse, bulk writes should go through its columnar inges...

Artemis 2026-05-24 data-engineering

database pipeline python statistics

Lesson 021: Calendar as Portfolio Optimization, Not Top-N Ranking

When selecting a fixed-size collection where the items must work together (a calendar, a playlist, a portfolio, a menu), the problem is constrained set optimization — not top-N ranking. The best collection often contains none of the individually top-ranked items, because collection-level properties...

Artemis 2026-05-24 implementation

ai statistics

Lesson 022: Heuristic Month-Fit Scoring Without Text Metadata

When images lack text metadata (titles, descriptions, captions), month or season suitability can still be approximated from visual features alone — color temperature, brightness, contrast, and content flags. The signal is coarse (3-4 seasonal buckets, not 13 distinct months) but sufficient to preven...

Artemis 2026-05-24 algorithms

Lesson 023: Maximum Marginal Relevance for Diversity-Aware Selection

Greedy Maximum Marginal Relevance (MMR) is the practical default for selecting a diverse, high-quality subset from a large pool. At each step, it picks the item that maximizes quality minus similarity to already-selected items. It runs in O(K × N) time, requires no optimization library, and naturall...

Artemis 2026-05-24 algorithms

serverless python ai
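
A small sketch of the greedy step from Lesson 023. It assumes a precomputed similarity matrix with values in [0, 1] and a `trade_off` weight between quality and similarity; both are assumptions of this example. Caching each item's highest similarity to the selected set keeps the similarity work at roughly K × N lookups.

```python
def mmr_select(quality, similarity, k, trade_off=0.5):
    """Greedy MMR: repeatedly take the item with the best quality-minus-similarity."""
    remaining = list(range(len(quality)))
    max_sim = [0.0] * len(quality)  # highest similarity to anything already selected
    selected = []
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: trade_off * quality[i] - (1.0 - trade_off) * max_sim[i],
        )
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            max_sim[i] = max(max_sim[i], similarity[i][best])
    return selected


# Items 0 and 1 are near-duplicates; item 2 is weaker but different.
quality = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = mmr_select(quality, similarity, k=2)
```

In this toy input the second pick is item 2 rather than the higher-quality near-duplicate, which is the behavior the lesson describes.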

Lesson 024: Hungarian Algorithm for Optimal Assignment

When you need to assign N items to N slots where each item-slot pair has a fitness score, the Hungarian algorithm gives the provably optimal assignment in O(N^3) time. For small N (≤50), it runs in microseconds and eliminates the need for greedy heuristics, manual tuning, or iterative search. Use sc...

Artemis 2026-05-24 algorithms

Lesson 025: Multiple Selection Methods as Baselines

When building an optimizer, always generate multiple candidate solutions using different methods — including at least one naive baseline. The baseline proves the optimizer adds value. The alternatives expose the trade-off frontier. Without baselines, you can't distinguish "good optimization" from "e...

Artemis 2026-05-24 implementation

Lesson 026: Formalizing De Facto Dependencies

A dependency that's imported in production code but missing from the package manifest is a time bomb. It works on the developer's machine (where the package was installed for something else) and fails on fresh installs, CI, or new team members. Audit imports against declared dependencies whenever ad...

Artemis 2026-05-24 implementation

testing python

Lesson 027: Migration Ordering and Apply-on-Use Gaps

When a database migration creates a table that new code writes to, the migration must be applied before the code runs — not just before the next CLI invocation. If the code path that triggers the write doesn't call `apply_migrations()`, the table won't exist at runtime, even though the migration fil...

Artemis 2026-05-24 data-engineering

database deployment data-modeling pipeline

Lesson 028: Chi-Squared Tests for Bias Detection at Small Scale

We planted known biases in synthetic vote data — 10% of voters had position bias (preferring earlier-displayed images), 20% had visual-drama bias (preferring dramatic images). We need statistical tests that can detect these biases with only 100 voters and 500 ballots, without requiring heavy statist...

Artemis 2026-05-24 algorithms

Lesson 029: Ground-Truth Recovery as Optimizer Validation

We have a calendar optimizer that selects 13 images from 12,217 using a weighted objective function (popularity, diversity, month-fit, cover-fit, redundancy penalty). The optimizer reports an objective score, but a high score doesn't prove the optimizer is selecting the *right* images — it could be...

Artemis 2026-05-24 testing

Lesson 030: Reliability Delta as Noise Measurement

We know that 20% of synthetic voters are intentionally noisy (10% position-biased, 10% random). We compute Krippendorff's alpha on all voters and get a moderate value (~0.52). But how much of the low agreement is caused by these noisy voters vs. genuine preference diversity among neutral voters? We...

Artemis 2026-05-24 implementation

database pipeline statistics

Lesson 031: Read-Only DB Connections for Web Layers

When an embedded database (DuckDB, SQLite) serves both a batch pipeline and an interactive web app, the web layer should open the database in read-only mode. This avoids writer-lock conflicts entirely and makes the architecture self-documenting: the web app *cannot* mutate the warehouse, by construc...

Artemis 2026-05-24 data-engineering
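
To illustrate Lesson 031 with only the standard library, the sketch below uses SQLite's `mode=ro` URI flag; the table and file names are hypothetical. For DuckDB, the analogous option is `duckdb.connect(path, read_only=True)`.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path):
    """Open a SQLite file so that any write on this connection is rejected."""
    return sqlite3.connect(Path(db_path).as_uri() + "?mode=ro", uri=True)


# Batch pipeline side: the only writer.
db_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE scores (image_id TEXT, score REAL)")
writer.execute("INSERT INTO scores VALUES ('img_a', 0.42)")
writer.commit()
writer.close()

# Web layer side: reads work, writes fail by construction.
reader = open_read_only(db_path)
rows = reader.execute("SELECT image_id, score FROM scores").fetchall()
try:
    reader.execute("DELETE FROM scores")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
reader.close()
```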

Lesson 032: Startup Cache for Interactive Scoring

When an interactive web app needs sub-100ms responses from a scoring function that depends on large lookup tables, load those tables into memory at startup rather than querying the database per request. The cache size is bounded (you know exactly what's in the warehouse), startup cost is a one-time...

Artemis 2026-05-24 algorithms

database pipeline python ai

Lesson 033: Vanilla JS SPA Without a Build Step

A hash-routed single-page application built with vanilla JavaScript, ES modules, and dynamic `import()` can deliver a functional multi-page experience — navigation, pagination, filtering, modals, live API calls — with zero build toolchain. For internal tools and single-user apps, this eliminates npm...

Artemis 2026-05-24 frontend

database security pipeline python javascript

Lesson 034: Reusing Query Modules Across CLI and Web

When a CLI pipeline and a web API need the same data, import the query functions directly rather than duplicating SQL. Add the serialization layer (Pydantic models, JSON responses) at the API boundary, not in the query module. The query module returns plain Python objects (dataclasses, dicts, tuples...

Artemis 2026-05-24 data-engineering

database data-modeling pipeline python

Lesson 035: Design System Portability via Tokens

A design system built on CSS custom properties (design tokens) can be shared across completely independent frontends — static HTML pages, vanilla JS SPAs, embedded widgets — by copying two files. The tokens provide visual consistency without requiring a shared component library, a build system, or a...

Artemis 2026-05-24 architecture

python

Lesson 036: Linter Rules vs. Framework Idioms

When a linter rule flags code that follows a framework's official pattern, suppress the rule per-line with `noqa` rather than restructuring the code. Linter rules encode general best practices; framework idioms encode domain-specific patterns that intentionally violate those practices. Restructuring...

Artemis 2026-05-24 implementation

security pipeline python

Lesson 037: Static Site Generation via Fetch Shim

A FastAPI + JavaScript SPA can be deployed to GitHub Pages without rewriting frontend code by using a **fetch shim** — a small JavaScript interceptor injected into `index.html` that redirects API calls to pre-generated JSON files and handles filtering, sorting, and pagination client-side. The build...

Artemis 2026-05-24 frontend

database security deployment api frontend

Lesson 038: CI Path Portability and Release Artifacts

When a project develops on Windows but deploys via CI on Linux, hardcoded paths like `D:/artemis/warehouse.duckdb` will fail silently or crash. Every path that differs between dev and CI must be configurable via environment variable. Similarly, large binary dependencies (databases, model weights) sh...

Artemis 2026-05-24 deployment

database deployment pipeline

Lesson 039: Mock Tagger Pattern for Vision Pipeline Testing

The vision tagging pipeline uses Qwen2.5-VL (a 7B-parameter vision-language model) to classify image attributes. Running the real model requires a GPU, takes seconds per image, and produces non-deterministic outputs. The full pipeline — config loading, tagging, derived label computation, DB persiste...

Artemis 2026-05-24 data-engineering

testing api pipeline ai statistics

Lesson 040: Controlled Vocabulary as Schema Contract

The vision tagging pipeline needs a consistent set of image attributes shared across five components: the vision model prompt, the attribute parser/validator, the database schema, the voting block config, and the cluster labeling engine. If any component uses an attribute code the others don't recog...

Artemis 2026-05-24 architecture

database data-modeling pipeline python

Lesson 041: Utility Function Design for Synthetic Voting Bias

Synthetic vote generation needs to produce votes that exhibit detectable attribute-based bias while remaining statistically plausible. A biased voter block that always votes for images with specific attributes produces trivially detectable (and unrealistic) bias. A block with too much noise produces...

Artemis 2026-05-24 implementation

data-modeling pipeline statistics

Lesson 042: Lift as the Primary Bias Detection Metric

Block-aware statistics need a metric that answers: "does this voting block select images with attribute X more than expected?" Raw selection counts don't work because blocks have different sizes. Rate differences (block rate - global rate) are hard to interpret when base rates vary widely. The metri...

Artemis 2026-05-24 implementation
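
The summary of Lesson 042 is cut off before it defines the metric, so this sketch assumes the conventional definition of lift: the block's selection rate for an attribute divided by the global rate. The function and parameter names are hypothetical.

```python
def lift(block_with_attr, block_total, global_with_attr, global_total):
    """Ratio of a block's attribute rate to the global attribute rate.

    A value above 1.0 means the block selects the attribute more often than
    voters overall. Returns None when the global rate is zero (lift undefined).
    """
    if block_total <= 0 or global_total <= 0:
        raise ValueError("totals must be positive")
    global_rate = global_with_attr / global_total
    if global_rate == 0:
        return None
    return (block_with_attr / block_total) / global_rate


# 30% of this block's selections have attribute X versus 15% globally.
example_lift = lift(30, 100, 150, 1000)
```

Because it is a ratio of rates, the value stays comparable across blocks of different sizes and across attributes with different base rates.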

Lesson 043: PII Sanitization in Static JSON Exports

The static site serves pre-built JSON files from a public URL. The warehouse database contains voter surrogate keys (`voter_sk`), hashed voter IDs (`voter_public_hash`), random seeds, config hashes, and local file paths. None of these should appear in public-facing JSON. The sanitization must be rel...

Artemis 2026-05-24 security

database security data-modeling

Lesson 044: Acceptance Tests as Executable Specifications

The biased voting blocks pipeline spans six components: config validation, vote generation, attribute analysis, cluster analysis, score/calendar impact, and static export. Unit tests cover each component in isolation, but the interesting behaviors — "does a biased block produce detectable lift in th...

Artemis 2026-05-24 testing

database testing data-modeling pipeline python

Lesson 045: Embedding-Based Deduplication for Image Collection Curation

When working with a large image collection from an automated source, assume near-duplicates dominate the pool until proven otherwise. Embedding cosine similarity with connected-component grouping reduces a collection to its unique members in minutes, but the threshold choice dramatically affects the...

Artemis 2026-05-24 algorithms

Lesson 046: Lazy Imports for Deployment Compatibility

Import heavy dependencies inside the function that uses them, not at module scope. A module-level `import numpy` means every consumer of that module — including lightweight build scripts, CI pipelines, and serverless functions — must have numpy installed, even if they never call the code path that n...

Artemis 2026-05-24 deployment

serverless database deployment pipeline python
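
A minimal sketch of the pattern in Lesson 046, with hypothetical function names. Importing this module does not require numpy; only calling `column_means` does.

```python
def row_count(rows):
    """Lightweight helper: usable by build scripts with no numpy installed."""
    return len(rows)


def column_means(rows):
    """Heavy path: numpy is imported only when this function is called."""
    import numpy as np  # deferred import, resolved on first call

    return np.asarray(rows, dtype=float).mean(axis=0).tolist()


# A consumer that only needs the light path never triggers the import.
n_rows = row_count([[1, 2], [3, 4]])
```

The trade-off is that a missing dependency surfaces at call time rather than at import time, so the heavy path still needs its own test coverage.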

Lesson 047: CLIP Zero-Shot as a Database Column Factory

A single CLIP model, used for zero-shot classification against descriptive text prompts, functions as a general-purpose column generator for structured databases. Each new prompt produces a new confidence column — no training, no fine-tuning, no labeled data. The cost of adding a column is one forwa...

Artemis 2026-05-24 implementation

database deployment data-modeling pipeline

Lesson 048: Greedy Max-Min Diversity Selection

To select k items that maximally represent the diversity within a group, iteratively pick the item most distant from all already-selected items. This greedy max-min approach is O(n×k), produces near-optimal diversity in practice, and sidesteps solving the NP-hard max-dispersion problem exactly.

Artemis 2026-05-24 algorithms

serverless ai
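
The selection loop from Lesson 048 can be sketched as below, assuming items are points compared by Euclidean distance and that the first pick is supplied by the caller (both are assumptions of this example). Each item's distance to its nearest selected neighbor is cached, which is what keeps the loop at O(n×k).

```python
import math


def greedy_max_min(points, k, start=0):
    """Pick k indices, each time taking the point farthest from the selected set."""
    n = len(points)
    if n == 0 or k <= 0:
        return []
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    min_dist[start] = -1.0  # mark as taken so it is never chosen again
    while len(selected) < min(k, n):
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            if min_dist[i] >= 0:
                min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
        min_dist[nxt] = -1.0
    return selected


points = [(0, 0), (0.1, 0), (10, 0), (5, 0), (10, 0.1)]
chosen = greedy_max_min(points, k=3)
```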

Lesson 049: Drag-and-Drop as the Simplest Viable Interaction

When the user's mental model is "put this thing in that slot," drag-and-drop is less code and more intuitive than alternatives like dropdowns, search dialogs, or multi-step wizards. The key is spatial co-visibility: the source pool and target slots must be on screen simultaneously so the user can se...

Artemis 2026-05-24 frontend

Lesson 050: Connected Components for Transitive Deduplication

When deduplicating by pairwise similarity, use graph connected components to group items, not naive pair-based merging. Pairwise similarity is not transitive in general (A~B and B~C doesn't guarantee A~C), but for near-duplicates transitivity usually holds in practice, and connected components correctly gro...

Artemis 2026-05-24 algorithms

database ai
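
One way to sketch the grouping in Lesson 050 is a union-find over the pairs that passed the similarity threshold. The pair list is assumed to come from an earlier similarity pass, and the function name is hypothetical.

```python
def group_duplicates(n_items, similar_pairs):
    """Group item indices into connected components of the similarity graph."""
    parent = list(range(n_items))

    def find(x):
        # Follow parents to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# 0~1 and 1~2 were matched; 0~2 never was, yet all three land in one group.
components = group_duplicates(6, [(0, 1), (1, 2), (4, 5)])
```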

Lesson 051: Sigmoid Calibration for Domain-Specific CLIP Scores

CLIP logits have domain-specific distributions. Converting them to meaningful [0,1] confidence scores requires a sigmoid transform calibrated to the actual logit range in your image collection. A universal threshold doesn't work — the sigmoid center and scale must be tuned empirically by examining l...

Artemis 2026-05-24 algorithms
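
For Lesson 051, a minimal sketch of the transform, assuming the sigmoid is parameterized by a `center` (the logit mapped to 0.5) and a `scale` (how quickly confidence saturates). The numeric values below are hypothetical placeholders, not calibrated values from the project.

```python
import math


def calibrated_confidence(logit, center, scale):
    """Map a raw logit to (0, 1) with a sigmoid centered on `center`."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return 1.0 / (1.0 + math.exp(-(logit - center) / scale))


# Hypothetical calibration: logits in this collection cluster around 22.
CENTER, SCALE = 22.0, 1.5
at_center = calibrated_confidence(22.0, CENTER, SCALE)
above = calibrated_confidence(25.0, CENTER, SCALE)
below = calibrated_confidence(19.0, CENTER, SCALE)
```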

Lesson 052: Incremental Feature Extraction Over Full Re-runs

When adding new features to an existing collection, delete-and-rewrite only the new columns rather than re-processing everything. The key enabler is tagging each row with its source (model version, label source, attribute code) so that surgical deletes and inserts are possible without touching exist...

Artemis 2026-05-24 implementation

Lesson 053: Audit-First Design

Before writing any code for a new feature, produce a written audit of the existing codebase: what exists, what can be reused, where new code slots in. The audit document prevents reimplementing existing functionality and identifies the exact extension points — saving more time than it costs to write...

Artemis 2026-05-24 process

database pipeline statistics

Lesson 054: Phased Autonomous Execution Plans

Breaking large projects into numbered, independently shippable phases — each with explicit entry criteria, exit criteria, and a commit checkpoint — transforms ambitious multi-session work from a coordination problem into a queue of self-contained tasks. The plan file is both the work instruction and...

Artemis 2026-05-24 process

testing deployment pipeline

Lesson 055: Session Continuity via Documentation Artifacts

When working across multiple AI-assisted sessions, continuity must be encoded in files, not in conversation history. A startup document, a plan file with status tracking, and a project CLAUDE.md that reflects current state eliminate ramp-up overhead and prevent context loss from session clears and c...

Artemis 2026-05-24 implementation

testing data-modeling

Lesson 056: Environment Self-Interference in AI-Assisted Development

An AI coding assistant that launches background processes (dev servers, database connections, build watchers) will fight with its own previous instances over shared resources like ports and file locks. Explicit cleanup before each launch — kill orphan processes, release locks, verify port availabili...

Artemis 2026-05-24 deployment

database deployment python docker

Lesson 057: Test-Gated Commits at Scale

Gate every commit on a passing test suite, not on "the feature looks done." With 1,500+ tests across a project, the suite catches regressions that visual inspection misses — wrong column names, broken imports, type mismatches, off-by-one errors. The test suite is the contract for "this commit is saf...

Artemis 2026-05-24 testing

testing python statistics

Lesson 058: DuckDB Cursor-Per-Request for Concurrent Web Handlers

When serving DuckDB through a multi-threaded web framework (FastAPI/uvicorn), never share a single connection object across concurrent request handlers. Instead, call `conn.cursor()` to create a per-request cursor. DuckDB's Python driver does not support concurrent queries on the same connection fro...

Artemis 2026-05-24 data-engineering

api python

Lesson 059: Derive Metrics from Immutable Fact Tables, Not Mutable State

When a dashboard metric can be computed either from a mutable state flag or from an immutable record table, always derive it from the immutable source. Mutable flags reflect the *current* state, which may not be the state your metric is trying to describe. Immutable fact tables preserve the *histori...

Artemis 2026-05-24 data-engineering

Lesson 060: Context Blocks Turn a Tool Into a Teaching Artifact

Adding a brief "why this page matters" block at the top of every page in a data application transforms it from an internal tool into a self-guided case study. A single sentence of context lets a reviewer understand what they're looking at without reading documentation or having the author present to...

Artemis 2026-05-24 implementation

Lesson 061: Centralize Project Metadata to Prevent Count Drift

When the same project-level number (image count, cluster count, lesson count) appears in multiple frontend modules, centralize it in a single metadata object. Better still, fetch live counts from the API at render time and use the centralized constant only as a fallback. Hardcoded numbers scattered...

Artemis 2026-05-24 implementation

javascript ai

Lesson 062: A Guided Reviewer Path for Portfolio Projects

Add a numbered "review this project in N minutes" path to the homepage of any portfolio project or case study. Without explicit guidance, reviewers wander randomly through pages and miss the strongest parts of the work. A curated path ensures every reviewer sees the same narrative arc, regardless of...

Artemis 2026-05-24 implementation

Lesson 063: Promise.all as an Accidental Concurrency Test

Any frontend page that fires multiple `fetch()` calls via `Promise.all()` is an implicit concurrency test for the backend. If your API endpoints work individually via `curl` but fail when the browser loads a page that hits them simultaneously, you have a shared-state concurrency bug — not a data or...

Artemis 2026-05-24 testing

database testing api python javascript

Lesson 064: Noscript Fallback as SEO Baseline for SPAs

A single-page application rendered entirely in JavaScript is invisible to search engine crawlers that don't execute JS. Adding a `<noscript>` block with the project's core content — title, summary, key links, and attribution — provides a crawlable baseline that costs minutes to implement and ensures...

Artemis 2026-05-24 implementation

testing deployment frontend pipeline javascript

Lesson: Algorithm Selection — sklearn KMeans vs Pillow Quantize for Dominant Colors

The visual feature extraction pipeline ran sklearn's `KMeans` on every thumbnail to find 5 dominant colors, at ~147ms per image. For 12,217 images, that's ~30 minutes of CPU time on dominant color extraction alone, for a feature that contributes a single JSON column to the feature table.

Artemis 2026-05-24 algorithms

testing pipeline python

Lesson: Batch Database Operations — INSERT/UPDATE Patterns at Scale

Multiple pipeline stages in Artemis started with per-row INSERT or UPDATE patterns that worked fine during development (5-10 rows) but became bottlenecks at full scale (12,000+ rows). The per-row pattern appeared in three places:

Artemis 2026-05-24 data-engineering

database security pipeline python ai

Lesson: Building with Synthetic Data Before Real Data Arrives

The Artemis project needed voter preference data to build its statistical models and calendar optimizer. But real vote data from ArtemisTimeline.com wasn't yet available — the vote export hadn't been requested, and the site's API only exposes aggregate leaderboards, not raw ballots.

Artemis 2026-05-24 implementation

data-modeling pipeline python statistics

Lesson: Choosing k for k-Means Clustering in a Constrained Selection Problem

We needed to cluster 12,217 Artemis II mission photos into visually distinct groups. The clusters serve a specific downstream purpose: ensuring the final 13-image calendar selection has visual diversity. The choice of k (number of clusters) directly affects whether the optimization can produce a goo...

Artemis 2026-05-24 algorithms

Lesson: Concurrent HTTP Downloads with Connection Pooling

The original thumbnail downloader processed 12,217 images sequentially. Each download created a new `httpx.Client` instance, which meant a fresh TCP connection and TLS handshake for every single request — all to the same Cloudflare R2 CDN endpoint. At 0.1s rate limiting plus ~50-200ms connection ove...

Artemis 2026-05-24 data-engineering

cloudflare database pipeline python

Lesson: Debugging with Surrogate Key Ranges

When investigating why multimodal clustering produced zero results, the breakthrough came from a simple query:

Artemis 2026-05-24 data-engineering

database data-modeling pipeline ai

Lesson: Disjoint Data Populations Breaking Multimodal Joins

Multimodal clustering required images to have both CLIP image embeddings AND text embeddings. The intersection of these two sets was empty — 0 images qualified. The clustering silently logged a warning and returned 0 results. The pipeline appeared to work, but an entire analysis dimension produced n...

Artemis 2026-05-24 implementation

Lesson: DuckDB Single-Writer Constraint in Concurrent Pipelines

While developing the concurrent thumbnail downloader, the DuckDB warehouse file (`warehouse.duckdb`) became locked by the download process. Any attempt to check progress, run `artemis-pipeline status`, or open a second connection failed with:

Artemis 2026-05-24 data-engineering

database pipeline python ai

Lesson: PEP 8 Compliance in Data Engineering Pipelines

Python scripts in the Artemis project span multiple roles: XML/JSON migration, schema validation, metadata enrichment, test harnesses, and lesson harvesting. Without a consistent style standard, each script drifts toward the author's (or AI assistant's) habits — camelCase here, inconsistent indentat...

Artemis 2026-05-24 data-engineering

database data-modeling pipeline python javascript

Lesson: Per-Record Overhead That Doesn't Matter at 10 Rows Kills You at 12,000

The original thumbnail downloader worked flawlessly on 5 images during development. When scaled to 12,217 images, it was unacceptably slow — not because of network latency, but because of per-image overhead that was invisible at small scale.

Artemis 2026-05-24 implementation

Lesson: Resume-Safe Pipeline Design — Surviving Interrupts and Partial Failures

The thumbnail download process was killed multiple times during development — once to change the rate limit, once to adjust the timeout, once at the user's request. Each time, the question was: how much progress was lost? Can we pick up where we left off?

Artemis 2026-05-24 data-engineering

database data-modeling pipeline python

Astro Plugin Peer Dependency Pinning

In the Astro ecosystem, plugin packages (`@astrojs/*`) release independently of the core framework and frequently break peer dependency compatibility. Pin plugin versions explicitly and test upgrades in isolation rather than accepting latest.

Data Readiness 2026-05-24 implementation

astro

Content-Driven Architecture for Regulatory Frameworks

When multiple domains share identical page structures but differ only in subject matter, model the variation as typed content collections and render everything through shared components. The architecture's value comes from enforcing uniform structure via schemas while allowing unlimited content vari...

Data Readiness 2026-05-24 implementation

data-modeling astro

GitHub Pages Deployment Configuration

GitHub Pages deployment with static site generators has three independently-failing configuration points — workflow file location, CNAME record, and site URL in the build config — and all three must be correct simultaneously. A deploy that "almost works" is usually missing exactly one of these.

Data Readiness 2026-05-24 deployment

cloudflare deployment pipeline astro

Hub Consolidation Over Per-Site Scaffolding

When building a platform that serves N variants of the same structure, start with a single consolidated site that treats variation as data, not as separate projects. Late consolidation — after scaffolding N separate sites — is expensive and produces a massive, risky changeset.

Data Readiness 2026-05-24 implementation

deployment astro

MDX Scoped Styles in Astro

Astro's scoped `<style>` blocks do not penetrate MDX `<Content />` output. Any styles that need to reach MDX-rendered HTML must live in global CSS or use `:global()` selectors. This is a framework-level constraint, not a bug to work around.

Data Readiness 2026-05-24 implementation

security astro

Prototype One Instance Before Scaling to N

When building a system that will serve N instances of the same pattern, build one instance end-to-end first — from scaffold through deployment — before replicating. The prototype surfaces architectural assumptions that only become visible under real content, real routing, and real build constraints.

Data Readiness 2026-05-24 deployment

security deployment data-modeling astro

Relative Link Fragility in Multi-Section Static Sites

Relative links in templated multi-section static sites break silently when page nesting depth varies. Use a systematic link strategy — either always-absolute paths from the site root, or a helper that resolves relative to the current topic — rather than hand-coding relative hrefs in content files.

Data Readiness 2026-05-24 implementation

astro

Astro Site URL vs Base Path Confusion

Astro's `site` and `base` config fields look like they do the same thing but serve completely different purposes. Getting them wrong produces a site that deploys successfully but generates incorrect canonical URLs, broken sitemaps, or broken routing -- and the failure mode changes depending on wheth...

HAx 2026-05-22 implementation

astro configuration deployment github-pages routing

Astro Type Errors as Silent Deploy Blockers

A site can build and run perfectly in dev mode while harboring type errors that fail `astro check` in CI. If CI gates on type checking (and it should), these latent errors become deploy blockers that surface only after pushing -- never during local development.

HAx 2026-05-22 deployment

ci astro typescript preact deployment

GitHub Pages Base Path Pitfall

When migrating a static site to a hosting platform that serves from a subdirectory (e.g., `username.github.io/repo/`), every hardcoded internal link breaks. The migration isn't done when the deploy workflow is green -- it's done when every `href`, asset path, and client-side route has been audited f...

HAx 2026-05-22 deployment

deployment github-pages astro routing

Base Adapter ABC Pattern

When integrating with multiple external APIs that share a common pipeline contract, define an abstract base class that handles cross-cutting concerns (rate limiting, timeouts, credential redaction, error classification) and requires subclasses to implement only the source-specific logic (`fetch` and...

GTMLeads 2026-05-20 architecture

authentication pipeline python

Config-First Development

When building a system that depends on external data sources, templates, or configuration-driven behavior, ship the configuration files before the code that consumes them. This forces you to validate your data model against real requirements before investing in implementation, and it makes each subs...

GTMLeads 2026-05-20 implementation

Four-Level Deduplication Strategy

When deduplicating records from heterogeneous sources with varying ID reliability, use a priority-ordered cascade of match strategies — from strongest (source-native IDs) to weakest (fuzzy metadata). Check each level in order and stop at the first match. This avoids both false negatives (missed dupl...

GTMLeads 2026-05-20 algorithms

database pipeline

FTS5 Integration with SQLite

SQLite's FTS5 extension provides production-quality full-text search without an external service. The key to making it work reliably is sync triggers (not application-level writes), `rowid`-based joins (not column joins), and treating the FTS table as a read-only projection of the main table.

GTMLeads 2026-05-20 implementation

database data-modeling python
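
The three points in the FTS5 lesson (sync triggers, `rowid`-based joins, FTS table as a projection) can be sketched with Python's built-in `sqlite3`, assuming the SQLite build has FTS5 enabled. The `leads` table and its columns are hypothetical; the trigger shape follows the external-content pattern from the SQLite FTS5 documentation.

```python
import sqlite3

SCHEMA = """
CREATE TABLE leads (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE VIRTUAL TABLE leads_fts USING fts5(
    title, body, content='leads', content_rowid='id'
);
-- Triggers keep the index in sync; application code never writes to leads_fts.
CREATE TRIGGER leads_ai AFTER INSERT ON leads BEGIN
    INSERT INTO leads_fts(rowid, title, body)
    VALUES (new.id, new.title, new.body);
END;
CREATE TRIGGER leads_ad AFTER DELETE ON leads BEGIN
    INSERT INTO leads_fts(leads_fts, rowid, title, body)
    VALUES ('delete', old.id, old.title, old.body);
END;
CREATE TRIGGER leads_au AFTER UPDATE ON leads BEGIN
    INSERT INTO leads_fts(leads_fts, rowid, title, body)
    VALUES ('delete', old.id, old.title, old.body);
    INSERT INTO leads_fts(rowid, title, body)
    VALUES (new.id, new.title, new.body);
END;
"""


def search(conn, query):
    """Full-text search, joined back to the main table by rowid."""
    sql = """
        SELECT leads.id, leads.title
        FROM leads_fts
        JOIN leads ON leads.id = leads_fts.rowid
        WHERE leads_fts MATCH ?
        ORDER BY rank
    """
    return conn.execute(sql, (query,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO leads (title, body) VALUES (?, ?)",
             ("Pipeline audit", "rate limiting and retries"))
conn.execute("INSERT INTO leads (title, body) VALUES (?, ?)",
             ("Export notes", "csv export for sales"))
hits_before = search(conn, "retries")
conn.execute("UPDATE leads SET body = ? WHERE id = ?", ("timeouts only", 1))
hits_after = search(conn, "retries")
```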

Live API vs Mock Divergence

Mock-based tests validate your code's logic, not your assumptions about the external API. When an adapter passes all mock tests but fails against the real API, the bug is almost always in the mock — you encoded incorrect assumptions about field names, response structure, or protocol behavior.

GTMLeads 2026-05-20 testing

testing api data-modeling pipeline python

Nine-Phase Sequential Build

For a full-stack application built from scratch, a strict bottom-up phase order — schema, models, data, services, pipeline, API, UI, export — with one commit per phase and a green test suite at each boundary, produces a codebase where every layer is testable in isolation and integration bugs surface...

GTMLeads 2026-05-20 implementation

database testing api data-modeling pipeline

Phased Adapter Expansion

When scaling a plugin architecture, ship configuration and data files first (before any code), tier new plugins by API complexity, and close with registry-level consistency tests. This ordering catches integration mismatches early and keeps each phase independently shippable.

GTMLeads 2026-05-20 architecture

authentication data-modeling pipeline

Scoring Composition

When ranking records from heterogeneous sources, decompose the score into independent components with explicit weights, each normalized to 0.0–1.0. This makes the scoring system auditable (you can explain why a record scored high), tunable (change one weight without affecting others), and extensible...

GTMLeads 2026-05-20 algorithms

database testing statistics
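
As an illustration of the weighted-component idea above, the sketch below scores a record from three components. The component names and weights are hypothetical; the only properties taken from the summary are explicit weights and components normalized to 0.0–1.0.

```python
# Hypothetical component weights; changing one does not touch the others.
WEIGHTS = {"recency": 0.5, "completeness": 0.3, "source_trust": 0.2}


def score(components, weights=WEIGHTS):
    """Return (total, breakdown) where breakdown shows each component's contribution."""
    for name in weights:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component {name!r} must be in [0.0, 1.0], got {value}")
    total_weight = sum(weights.values())
    # Dividing by the weight sum keeps the total in [0.0, 1.0] even if weights are retuned.
    breakdown = {
        name: weights[name] * components[name] / total_weight for name in weights
    }
    return sum(breakdown.values()), breakdown
```

The per-component breakdown is what makes a score explainable: it records how much each signal contributed to the total.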

ETL-Only Branch: Surgical Code Removal

When stripping a codebase down to a subset of its functionality, remove in dependency order — packages first, then CLI registrations, then migrations, then dependencies, then tests, then deployment artifacts. Each commit should leave the system runnable, not just compilable.

AI Benchmark 2026-05-17 data-engineering

database deployment data-modeling pipeline python

Model Slug Extraction: Dictionary Lookup Over Pure Regex

When extracting structured identifiers (model names, product versions, package names) from unstructured text, a dictionary of known values with normalization beats regex-only extraction. Regex handles the general case; the dictionary handles the important cases correctly.

AI Benchmark 2026-05-17 implementation

pipeline javascript
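
One way to combine the two approaches is to try the dictionary first and fall back to a regex. The sketch below is written in Python for illustration; the alias table, the model names in it, and the fallback pattern are all made up.

```python
import re

# Hypothetical alias table: normalized spelling -> canonical slug.
ALIASES = {
    "nova large 2": "nova-large-2",
    "nova l2": "nova-large-2",
}

# Generic fallback: hyphenated lowercase tokens ending in a version number.
FALLBACK = re.compile(r"\b[a-z]+(?:-[a-z0-9]+)*-\d+(?:\.\d+)?\b")


def extract_slug(text):
    """Return a canonical slug from the dictionary, else a regex guess, else None."""
    # Normalize separators so "Nova-Large 2" and "nova_large_2" look the same.
    norm = re.sub(r"[\s_\-]+", " ", text.lower()).strip()
    for alias in sorted(ALIASES, key=len, reverse=True):  # longest alias wins
        pattern = rf"(?<![a-z0-9]){re.escape(alias)}(?![a-z0-9])"
        if re.search(pattern, norm):
            return ALIASES[alias]
    match = FALLBACK.search(text.lower())
    return match.group(0) if match else None
```

The dictionary guarantees a canonical slug for the names that matter, while the regex still picks up unknown names in the general case.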

OAuth Credit Routing for CLI Tools

When a CLI tool supports multiple authentication methods with different billing paths, scripts that invoke it must explicitly select the intended billing path — otherwise, environment variable precedence silently routes charges to the wrong budget.

AI Benchmark 2026-05-17 deployment

authentication deployment pipeline cloud ai

Pipeline Data Quality Remediation: Design Doc First

When a data pipeline has multiple interacting failure modes, writing a design document that catalogs all errors before fixing any of them produces better fixes than addressing errors one at a time. The design doc reveals which failures share root causes and which fixes would conflict.

AI Benchmark 2026-05-17 data-engineering

cloudflare database deployment pipeline

Playwright Browser Lifecycle in Async Pipelines

Shared browser instances in async code need explicit synchronization at creation time and explicit cleanup at shutdown. Without both, you get leaked browser processes from races and resource warnings from incomplete teardown — problems that surface only under concurrent load, not in unit tests.

AI Benchmark 2026-05-17 implementation

cloudflare database testing pipeline python
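
The two requirements above (synchronized creation, explicit cleanup) can be sketched with an `asyncio.Lock`. This is a minimal sketch, not Playwright-specific code: `launch` is a hypothetical async factory that stands in for whatever actually starts the browser, and the returned object is only assumed to have an async `close()` method.

```python
import asyncio


class SharedBrowser:
    """Lazily creates one shared browser and closes it explicitly at shutdown."""

    def __init__(self, launch):
        self._launch = launch        # hypothetical async factory returning a browser
        self._lock = asyncio.Lock()  # serializes creation and teardown
        self._browser = None

    async def get(self):
        # Without the lock, two tasks could both see None and launch two browsers.
        async with self._lock:
            if self._browser is None:
                self._browser = await self._launch()
            return self._browser

    async def close(self):
        # Explicit cleanup; safe to call more than once.
        async with self._lock:
            if self._browser is not None:
                await self._browser.close()
                self._browser = None
```

Creating the `SharedBrowser` inside the running event loop (for example at the top of the pipeline's main coroutine) avoids loop-binding surprises on older Python versions.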

Revert as a Design Signal

A git revert is a signal that the original change had a design gap — not just a bug. When you revert, don't just re-implement the same approach with a fix; use the revert as a forcing function to write down what the original approach missed before trying again.

AI Benchmark 2026-05-17 implementation

Revert-Restore-Reapply: Safe Source Catalog Changes

Never bundle additive changes (new sources) with destructive changes (dropping existing pages) in one commit. If a rollback is needed, you lose the additions along with the removals — and untangling them under pressure wastes time.

AI Benchmark 2026-05-17 implementation

RSS as Fallback Collection Method

When primary collection methods fail due to anti-bot defenses (Cloudflare, JS rendering), Google News RSS feeds provide a reliable fallback that requires no browser automation — but RSS item bodies are often useless title echoes that need enrichment from the actual article pages.

AI Benchmark 2026-05-17 implementation

cloudflare frontend pipeline javascript

SQLite Single-Writer for Async Pipelines

SQLite allows only one writer at a time. When an async pipeline shares a database with a long-running server process, the fix is architectural (serialize writers), not a PRAGMA tweak. WAL mode reduces contention but does not eliminate it.

AI Benchmark 2026-05-17 implementation

cloudflare database deployment pipeline python

Two-Stage Report Generation: Extract Then Synthesize

When building reports that combine deterministic data extraction with LLM synthesis, split them into two explicit stages: a repeatable extraction step that produces a structured intermediate file, and a separate synthesis step that feeds that file to the LLM. This makes each stage independently test...

AI Benchmark 2026-05-17 implementation

database authentication pipeline python ai

AI-Graded Content Validation

Large question banks authored by multiple sources (human or AI) accumulate factual errors that are invisible to structural validation. Using an LLM to independently attempt each question blind — without seeing the answer key — and then comparing its answer to the stored correct answer, surfaces wron...

Certification 2026-05-13 testing

data-modeling ai

Answer Position Bias

When humans author multiple-choice questions, the correct answer tends to cluster in certain positions (often B or C). Test-takers learn this pattern and use it as a guessing heuristic. Randomizing answer positions eliminates this bias and makes the quiz a better learning tool.

Certification 2026-05-13 implementation

python
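
Randomizing positions safely means tracking where the correct answer lands. A minimal sketch, assuming each question stores its choices as a list plus the index of the correct one (the function name and data shape are hypothetical):

```python
import random


def shuffle_choices(choices, correct_index, rng):
    """Shuffle answer choices and return (shuffled_choices, new_correct_index)."""
    order = list(range(len(choices)))
    rng.shuffle(order)  # order[k] is the original index now shown at position k
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_index)


# Usage: pass a seeded Random for reproducible runs, or random.Random() in production.
choices = ["Option one", "Option two", "Option three", "Option four"]
shuffled, new_index = shuffle_choices(choices, 1, random.Random(7))
```

Shuffling an index list rather than the choices themselves keeps the mapping back to the original answer key explicit.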

Building a Codebase Review Skill

A structured review skill turns the ad-hoc "look at this code and tell me what's wrong" request into a repeatable, evidence-based audit that produces the same quality of findings regardless of who runs it or when. The skill's value comes from its taxonomy of problem categories (derived from real iss...

Certification 2026-05-13 security

security deployment data-modeling

Building a Lessons Skill for Claude Code

A Claude Code skill file is a structured prompt that turns a repeatable workflow into a single slash command. The skill's power comes from clearly separating modes (read-only vs write), defining explicit quality contracts for outputs, and providing the AI with enough heuristics to make judgment call...

Certification 2026-05-13 implementation

deployment data-modeling cloud

Building a Phase Execution Skill

A phased plan is only as good as its execution discipline. A `/phase` skill automates the mechanical parts of plan execution — picking the next task, timestamping start/completion, verifying work, committing atomically — so the human (or AI) can focus on doing the actual work rather than maintaining...

Certification 2026-05-13 implementation

deployment data-modeling git architecture

Bulk Metadata Enrichment Scripts

When hundreds of data records need the same type of update (adding titles, categories, tags, or enriched descriptions), writing a dedicated Python script that reads a manifest and patches the data files is orders of magnitude faster and more reliable than manual editing. The script is disposable, bu...

Certification 2026-05-13 implementation

data-modeling python

Client-Side State Persistence with localStorage

localStorage can serve as a full persistence layer for client-side applications when the data is user-specific, the data volume is small, and there is no multi-device sync requirement. The key challenges are key design, migration of storage formats, and graceful handling of storage limits and corrup...

Certification 2026-05-13 frontend

frontend data-modeling

Code Review Driven Remediation

A whole-codebase code review is only as valuable as the remediation that follows it. The review itself produces a findings document. The remediation requires a separate phased plan that prioritizes findings by severity, groups them into shippable phases, and tracks each fix to completion with test v...

Certification 2026-05-13 process

security data-modeling

Content Quality Auditing at Scale

When you have hundreds or thousands of content items authored by different sources at different times, quality varies wildly unless you define measurable thresholds and audit systematically. The audit itself is more valuable than the fixes it produces — it turns "the hints feel thin" into "22 of 33...

Certification 2026-05-13 implementation

data-modeling cloud

Content Security Policy for Static Sites

A Content Security Policy (CSP) is achievable on a static site without server-side headers by using a `<meta>` tag. The challenge is crafting a policy that's strict enough to block XSS but permissive enough to allow legitimate functionality — especially ES module imports from CDNs and inline styles...

Certification 2026-05-13 security

security data-modeling

Design System Migration

Migrating an existing multi-page site to a design system is a page-by-page operation, not a big-bang rewrite. The design system (tokens + components) must be complete and proven on one reference page before touching others. The migration ends with deleting the old stylesheets — if the old CSS files...

Certification 2026-05-13 data-engineering

data-modeling cloud

Design-First Development

Writing a design document and a Physical Design Requirements (PDR) document before coding catches architectural mistakes when they're cheapest to fix. The design doc explores the problem space; the PDR specifies the physical implementation. Skipping either leads to rework: skipping design means buil...

Certification 2026-05-13 process

data-modeling architecture

Equivalence Testing During Format Migration

When migrating a data format (XML to JSON) that feeds a rendering pipeline, the only way to prove the migration is correct is to run both formats through the pipeline and compare the outputs field-by-field. Unit tests of the new loader are necessary but insufficient — they prove the new code works,...

Certification 2026-05-13 data-engineering

database testing data-modeling pipeline javascript

Hint Quality as a Spectrum

A progressive hint system (brief nudge → full explanation → deep-dive knowledge) is more pedagogically effective than a single "show answer" button. But each level must serve a distinct purpose with a measurable quality bar, or they collapse into three versions of the same thin content.

Certification 2026-05-13 implementation

serverless docker cloud

Integration Testing a DOM Application with jsdom

A browser-based application that uses DOM APIs (querySelector, innerHTML, addEventListener) can be integration-tested in Node.js using jsdom, without launching a real browser. This is faster than Playwright/Selenium and simpler to set up, but requires dependency injection to decouple the application...

Certification 2026-05-13 testing

security javascript

Legacy Artifact Removal

After a migration, the old system's artifacts (files, code, tests, scripts) must be actively removed in a deliberate cleanup pass — they don't disappear on their own. The removal is safe only when you can prove the new system is fully operational, and the cleanup itself requires a plan because the o...

Certification 2026-05-13 implementation

database testing data-modeling

Lessons Learned as a Practice

Systematically extracting lessons from project work — and writing them as standalone documents — turns ephemeral experience into a durable knowledge base. The practice is most valuable when it is automated enough to be low-friction (discovery from git history) but requires human judgment for what ac...

Certification 2026-05-13 process

testing deployment data-modeling

Phased Release Planning

Breaking large features into ordered phases — each independently shippable, each ending with a commit — transforms ambitious work into manageable steps with explicit progress tracking. The phase plan is both a work queue and an audit trail.

Certification 2026-05-13 process

testing data-modeling

Provider-Agnostic Plugin Architecture

When a system needs to support multiple "providers" (vendors, brands, data sources) that share the same behavior but differ in branding and content, the architecture should make adding a new provider a data-only operation with minimal code changes. The code that distinguishes providers should be con...

Certification 2026-05-13 architecture

deployment data-modeling cloud

Scaling Content Without Scaling Complexity

Adding 50+ exams across 10 providers to a quiz application required zero changes to the core quiz engine, data loader, or results page. The architecture held because the provider abstraction was clean, the data format was standardized, and provider-specific logic was confined to a single function an...

Certification 2026-05-13 implementation

database cloud

Schema Enforcement at the Data Layer

Adding runtime schema validation to your data loading layer catches entire categories of bugs that would otherwise surface as confusing UI glitches. The cost is a one-time schema definition and a few lines of validation code. The payoff is immediate, clear error messages instead of silent wrong beha...

Certification 2026-05-13 architecture

data-modeling

Schema Variant Consolidation

When multiple people or processes author data files for the same system without a shared schema, variant schemas emerge. The variants look similar enough to pass casual inspection but differ in element names, nesting structure, or attribute naming — causing parser failures on some files but not othe...

Certification 2026-05-13 architecture

Static Site as Application Platform

A full-featured application (quiz engine, progress persistence, scoring, results dashboards, 10 providers, 50+ exams) can be built with vanilla HTML, CSS, and ES6 modules — no framework, no build step, no server. This approach trades developer convenience (hot reload, component abstractions, state m...

Certification 2026-05-13 frontend

database security testing deployment frontend

Testing Provider Detection Logic

When critical logic is embedded in a class that's hard to test (DOM-coupled UI class), developers sometimes copy the logic into the test file and test the copy instead. This creates a dangerous illusion of coverage: the tests pass, but they're not testing the real code. When the real code diverges f...

Certification 2026-05-13 testing

security testing cloud

Verbatim Answer Leakage in Hints

When hints contain the exact text of the correct answer choice, they short-circuit learning. The learner reads the hint, sees the answer verbatim, and selects it without understanding why it's correct. This is a subtle content defect that is invisible in manual review but easy to detect programmatic...

Certification 2026-05-13 implementation

serverless data-modeling cloud
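
A programmatic check for this defect can be as simple as a normalized substring test. The sketch below is one possible approach; the `min_words` threshold is an assumption, added so that very short answers (which appear naturally in explanatory text) are not flagged.

```python
import re


def _normalize(text):
    """Lowercase and reduce punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks_answer(hint, answer, min_words=3):
    """True when the hint contains the full answer text, word for word."""
    norm_answer = _normalize(answer)
    if len(norm_answer.split()) < min_words:
        return False  # too short to call verbatim leakage with confidence
    # Pad with spaces so the match falls on whole-word boundaries.
    return f" {norm_answer} " in f" {_normalize(hint)} "
```

Running this over every hint and correct-answer pair produces a list of hints to rewrite.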

XML Entity Encoding Pitfalls

XML entity encoding bugs (`Q&A` vs `Q&amp;A`) are the most common class of data corruption in XML content pipelines. They're invisible in many editors, they pass casual visual inspection, and they cause parse failures that manifest as "the file won't load" with no useful error message. Any pipeline...

Certification 2026-05-13 security

data-modeling pipeline cloud
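
To make the failure concrete, the sketch below uses Python's standard library: a bare `&` in element text fails to parse, while the escaped form `&amp;` round-trips. The `<question>` element and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_question(text):
    """Escape &, <, and > before placing text inside an element."""
    return f"<question>{escape(text)}</question>"


def parse_or_report(xml_text):
    """Return (text, None) on success or (None, message) with the error position."""
    try:
        return ET.fromstring(xml_text).text, None
    except ET.ParseError as exc:
        line, column = exc.position
        return None, f"XML parse error at line {line}, column {column}: {exc}"
```

Reporting the line and column from `ParseError.position` turns "the file won't load" into something a content author can act on.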

XML to JSON Migration

When migrating a live data format (XML to JSON), the key risk is not the conversion itself — it's proving that the new format produces identical behavior. The migration succeeded because the conversion was treated as a pipeline problem (convert, validate, prove equivalence) rather than a rewrite.

Certification 2026-05-13 data-engineering

testing frontend data-modeling pipeline python

XSS in Trusted-Data Applications

Using `innerHTML` to render content from "your own" data files (XML, JSON, markdown) is an XSS vulnerability even when the data is self-authored today. The threat model changes when the data pipeline changes: content contributions, bulk imports from external sources, or AI-generated content can all...

Certification 2026-05-13 security

security data-modeling pipeline ai

Canonical Model as Single Source of Truth

When a system must produce multiple visual representations of the same architecture, build a single normalized graph model and derive all outputs from it. Renderers that read the same model cannot drift from each other; renderers that maintain their own state always will.

Diagram 2026-05-13 implementation

data-modeling ai

Entity ID Stability Through Format Conventions

When building a knowledge graph that must support regeneration, deduplication, and cross-system references, enforce a structured ID format from day one. An ID like `entity_type.domain.name` is simultaneously human-readable, machine-parseable, and stable across re-extraction — properties that free-fo...

Diagram 2026-05-13 implementation

database data-modeling pipeline python ai

Multi-Renderer Architecture with Shared Style Palettes

When building multiple rendering backends for the same data model, define the visual language (colors, shape semantics, edge style categories) once and map it to each renderer's syntax independently. Visual consistency across output formats is easy to lose when renderers are built in isolation; a sh...

Diagram 2026-05-13 implementation

frontend data-modeling javascript

Phased Plan Execution with One Commit Per Phase

When building a multi-phase system, track progress at the row level within each phase (Open → Started → Completed with timestamps), commit only when an entire phase is green, and never batch multiple phases into one commit. This granularity makes it possible to resume mid-phase, measure velocity, an...

Diagram 2026-05-13 process

database testing api data-modeling pipeline

Preset Configs vs. Custom Configs for Diagram Generation

When building a query or configuration system, provide a registry of named presets for the common cases and a full custom endpoint for everything else. Presets give users instant value without learning the schema; the custom path preserves full flexibility for power users.

Diagram 2026-05-13 implementation

Proxy-Based Frontend-Backend Integration

In a split-stack project (separate frontend and backend processes on different ports), configure the frontend dev server to proxy API requests to the backend rather than hardcoding backend URLs or relying on CORS alone. The proxy eliminates cross-origin issues during development, keeps the frontend...

Diagram 2026-05-13 implementation

deployment api python javascript

Rule-Based Extraction Before LLM Extraction

When building an entity extraction pipeline, implement rule-based heuristics first and defer LLM-assisted extraction until the deterministic baseline is tested and measured. The rule-based layer gives you a reproducible, cost-free, fast foundation that LLM extraction can extend — not replace.

Diagram 2026-05-13 implementation

api pipeline python ai statistics

Base URL Misconfiguration Breaks Subdirectory Deploys

When deploying a static site to a subdirectory path (e.g., `github.io/project/` instead of a custom domain root), every internal link must be prefixed with the base path. Setting the framework's `site` config is not enough — you must also set `base`, and every hardcoded absolute `href` in components...

MoreLessons 2026-05-13 deployment

security deployment astro ai

CI Dependency Lists Drift from pyproject.toml

When CI workflows hand-maintain `pip install` commands that duplicate what `pyproject.toml` already declares, the two lists will drift. New dependencies added to `pyproject.toml` will be missing in CI, causing build failures that can't be reproduced locally. The fix is to use `pip install .` so `pyp...

MoreLessons 2026-05-13 deployment

testing deployment pipeline python javascript

Enable GitHub Pages Before First Deploy

A GitHub Actions workflow that deploys to GitHub Pages will fail on the first run if Pages is not enabled in the repository settings. The workflow will build successfully but the deploy step returns a 404 — "Ensure GitHub Pages has been enabled." This is a configuration prerequisite, not a code bug,...

MoreLessons 2026-05-13 deployment

deployment api pipeline python astro

Preflight Checks Prevent Push-then-Fix Cycles

Running the same lint, format, and test checks locally before pushing catches failures that would otherwise require a push-fix-push cycle through CI. The cost of a local preflight is seconds; the cost of a CI round-trip is minutes plus noise (failed build notifications, red badges, extra commits). A...

MoreLessons 2026-05-13 deployment

security testing deployment python javascript

Structured Code Review as a Phase Gate

Running a systematic, category-driven code review after implementation is complete catches a class of issues that per-phase testing and acceptance criteria miss. Per-phase verification asks "does this phase work?" — a structured review asks "what's wrong across the whole codebase?" The two are compl...

MoreLessons 2026-05-13 process

security testing pipeline python astro

XSS via innerHTML in LLM Chat Interfaces

Any UI that displays LLM-generated text has two untrusted input sources: the user's query and the model's response. Both must be sanitized before DOM insertion. The model's output is especially dangerous because developers intuitively trust "their own backend" — but the LLM's response is no more tru...

MoreLessons 2026-05-13 security

security deployment frontend python javascript

AI Scoring Weights as a Balance Lever

When building an AI opponent for a strategy game, express its decision-making as a single weights table that scores every legal action. This table simultaneously defines AI behavior and serves as a balance tuning surface — changing one number shifts both how the AI plays and how the game feels.

CorpBattleCards 2026-05-11 algorithms

deployment

Counter Wheel as Asymmetric Balance

A circular advantage mechanic (A beats B beats C beats D beats A) creates asymmetric matchups from symmetric starting positions. The modifier can be small (+1/-1) and still be load-bearing if it touches enough systems — attacks, defenses, public effects, SWOT traits, and upgrade synergies.

CorpBattleCards 2026-05-11 implementation

Data-Driven Game Design with a Verb Grammar

Define a small, fixed grammar of mechanical verbs and make every game effect a data declaration using those verbs. This keeps the engine small and testable while allowing card variety to scale independently of code complexity.

CorpBattleCards 2026-05-11 implementation

Phased Plans with Interstitial Phases

When mid-project discoveries require new work that doesn't fit the original phase structure, insert interstitial phases (3.5, 6.5) rather than renumbering downstream phases. This preserves commit history references, plan file anchors, and team communication while accommodating scope changes.

CorpBattleCards 2026-05-11 process

data-modeling git statistics

Simulation as Acceptance Test

For games and complex interactive systems, unit tests verify correctness but batch simulation verifies balance. Run N automated games, record metrics, and treat the first run's numbers as a baseline. Future changes must either match the baseline or explain the deviation.

CorpBattleCards 2026-05-11 testing

testing data-modeling pipeline python statistics

Spreadsheet-to-Code Pipeline for Game Content

When game content is authored by designers in spreadsheets, build a one-way generator script that converts the spreadsheet into schema-validated data files. The spreadsheet stays authoritative; the generated files are artifacts. This separates content authoring from code and catches errors at genera...

CorpBattleCards 2026-05-11 data-engineering

developer-experience configuration localhost multi-project ports

UI State Machine for Turn-Based Games

Model every distinct "what is the UI waiting for?" moment as an explicit state in an enum. The state machine eliminates the most common game UI bugs — wrong input handled at the wrong time, dialogs that don't dismiss, and turn phases that skip or repeat — by making the set of valid transitions expli...

CorpBattleCards 2026-05-11 implementation

architecture ui
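
A minimal sketch of such a state machine, with an enum for the waiting states and an explicit transition table. The specific states and allowed transitions here are hypothetical examples, not the game's actual set.

```python
from enum import Enum, auto


class UIState(Enum):
    # Each member answers "what is the UI waiting for?"
    AWAITING_ACTION = auto()
    AWAITING_TARGET = auto()
    SHOWING_DIALOG = auto()
    OPPONENT_TURN = auto()


# Hypothetical transition table: anything not listed is rejected.
TRANSITIONS = {
    UIState.AWAITING_ACTION: {UIState.AWAITING_TARGET, UIState.SHOWING_DIALOG, UIState.OPPONENT_TURN},
    UIState.AWAITING_TARGET: {UIState.AWAITING_ACTION, UIState.SHOWING_DIALOG},
    UIState.SHOWING_DIALOG: {UIState.AWAITING_ACTION, UIState.OPPONENT_TURN},
    UIState.OPPONENT_TURN: {UIState.AWAITING_ACTION},
}


class TurnUI:
    def __init__(self):
        self.state = UIState.AWAITING_ACTION

    def go(self, new_state):
        """Move to new_state, or raise if the transition is not in the table."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
```

Input handlers can then check `self.state` first and ignore events that do not belong to the current state.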

Dev Port Registry for Multi-Project Work

A centralized port assignment table in shared developer config prevents localhost collisions when running multiple projects simultaneously.

Lessons Hub 2026-05-11 process

Public and Private Repo Harvesting

A multi-repo content pipeline must handle mixed visibility gracefully — token scope, clone failure semantics, and local fallbacks all need explicit design.

Lessons Hub 2026-05-11 architecture

harvesting github authentication private-repos ci-cd error-handling

Tidy Skill for Repo Housekeeping

A structured, behavior-preserving housekeeping pass prevents repo entropy without the risk of accidental refactors.

Lessons Hub 2026-05-11 process

housekeeping maintenance dead-code documentation-drift archiving skills

Acceptance Testing with Playwright

Use breadth-first (BFS) link crawling and smoke tests against live URLs to catch broken navigation and UI regressions before users do.

Lessons Hub 2026-05-10 process

testing playwright acceptance crawling ci automation

Adding a Voice Reader to Lesson Pages

Integrating browser-native text-to-speech into a static site requires handling platform quirks, script timing, and progressive enhancement — the Web Speech API is powerful but fragile across browsers.

Lessons Hub 2026-05-10 frontend

web-speech-api text-to-speech accessibility astro progressive-enhancement browser-quirks

GitHub Pages Custom Domain Setup

Configuring a custom domain for GitHub Pages requires coordinating DNS, repo settings, build tool config, and deployment mode — each can silently break the others.

Lessons Hub 2026-05-10 deployment

github-pages dns deployment custom-domain astro

Live Infrastructure for Integration Testing

When local services are already running, skip mocks and test the real pipeline end-to-end.

Lessons Hub 2026-05-10 process

testing integration ollama chromadb rag mocks

Mock vs Live Testing Trade-offs

A decision framework for when to mock dependencies and when to test against real infrastructure.

Lessons Hub 2026-05-10 process

testing mocks integration decision-framework cloud adapters

Preflight Gates as Local CI

Run the same checks CI will run before pushing to prevent the most common build failure patterns.

Lessons Hub 2026-05-10 process

testing ci preflight lint automation workflow

Skill-Driven Workflow Automation

Composable slash-command skills turn multi-step developer workflows into repeatable single-command operations that enforce guardrails automatically.

Lessons Hub 2026-05-10 process

workflow automation skills ci process claude-code

Test Pyramid for Static Sites

Layer unit, integration, and acceptance tests so each catches what the others cannot in a static site with a backend API.

Lessons Hub 2026-05-10 architecture

testing test-pyramid pytest playwright integration architecture

Testing Cross-Repo Content Pipelines

Validate harvested content spanning multiple repositories with severity levels, slug uniqueness, schema enforcement, and link resolution.

Lessons Hub 2026-05-10 implementation

testing validation pipeline cross-repo content schema

Adapter Pattern for Multi-Cloud Portability

Abstract base classes with minimal interfaces let the same RAG pipeline run on four different cloud providers without conditional logic in business code.

Lessons Hub 2026-05-09 architecture

python adapter-pattern cloud dependency-injection architecture

Code Review as Requirements Source

Systematic triage of code review findings produces a traceable requirements document — turning ad hoc observations into prioritized, implementable work.

Lessons Hub 2026-05-09 process

code-review requirements process traceability prioritization

Five-Stage Design-to-Execution Workflow

A repeatable workflow — Design, PDR, Plan, Execute, Commit — with table-driven task tracking and one-commit-per-phase discipline, applied across 18 project phases.

Lessons Hub 2026-05-09 process

workflow planning phased-execution project-management process

Lazy Imports for Optional Cloud Dependencies

Deferring cloud SDK imports to runtime lets the same codebase run with or without any given SDK installed, and enables testing without real dependencies.

Lessons Hub 2026-05-09 architecture

python imports cloud testing dependency-management
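
Deferring an import to runtime can be done with `importlib`. In this sketch the helper name and the error wording are assumptions; the point is only that the import happens when the feature is used, not when the module is loaded.

```python
import importlib


def load_sdk(module_name, feature):
    """Import an optional dependency on first use, with a readable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Optional dependency '{module_name}' is not installed; "
            f"it is required for {feature}."
        ) from exc


class JsonExporter:
    """Example consumer: the import happens inside the method, not at module load."""

    def export(self, payload):
        json_module = load_sdk("json", "JSON export")  # stdlib stand-in for a cloud SDK
        return json_module.dumps(payload)
```

In tests, `load_sdk` is a single seam to patch, so code paths that depend on an SDK can run against a fake module without the real package installed.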

Phased Multi-Cloud Infrastructure as Code

Three cloud stacks (AWS, Azure, GCP) built in separate phases with OIDC federation, avoiding cross-cloud coupling while sharing a common authentication pattern.

Lessons Hub 2026-05-09 deployment

aws azure gcp oidc infrastructure ci-cd cloudformation bicep

RAG Corpus Chunking Strategy

Splitting documents at H2 headings with stable IDs and content hashes produces predictable, debuggable chunks that support incremental re-indexing.

Lessons Hub 2026-05-09 architecture

rag chunking embeddings vector-search content-hashing
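
A minimal sketch of heading-based chunking, assuming Markdown input. The ID scheme (`doc_id#heading-slug`), the use of SHA-256 for the content hash, and the choice to drop any text before the first H2 are illustrative assumptions.

```python
import hashlib
import re

H2 = re.compile(r"^##\s+(.*)$")  # matches "## Title" but not "### Subtitle"


def chunk_by_h2(doc_id, markdown):
    """Split a document at H2 headings into chunks with stable IDs and content hashes."""
    sections = []  # list of (title, lines); text before the first H2 is dropped
    for line in markdown.splitlines():
        match = H2.match(line)
        if match:
            sections.append((match.group(1).strip(), []))
        elif sections:
            sections[-1][1].append(line)

    chunks = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        chunks.append({
            "id": f"{doc_id}#{slug}",  # stable as long as the heading text is unchanged
            "title": title,
            "text": body,
            "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        })
    return chunks
```

On re-indexing, a chunk whose ID is known and whose hash is unchanged can be skipped, which is what makes the process incremental.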

Rule-Based Gap Detection Without ML

Seven heuristic rules detect when a RAG corpus can't answer a query, without training data or a classifier — transparent and debuggable, with known trade-offs.

Lessons Hub 2026-05-09 architecture

rag gap-detection heuristics nlp quality

Choosing the Right Similarity Algorithm

Before choosing a similarity algorithm, understand whether your data uses binary membership (item has feature or doesn't) or continuous scores (item has every feature at varying levels). Set-based metrics like Jaccard collapse to a constant when every item has every feature — the signal is in the sc...

JobClass 2026-05-08 algorithms

database data-modeling python
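
The collapse is easy to reproduce. In the sketch below, two items have the same three features at different levels (the feature names and scores are made up): Jaccard on feature membership returns 1.0, while cosine similarity on the scores separates them.

```python
import math


def jaccard(a, b):
    """Set overlap on feature membership only; scores are ignored."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(x, y):
    """Cosine similarity on feature scores given as {feature: score} dicts."""
    keys = set(x) | set(y)
    dot = sum(x.get(k, 0.0) * y.get(k, 0.0) for k in keys)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Hypothetical items: same feature set, very different emphasis.
item_a = {"math": 5.0, "writing": 1.0, "lifting": 1.0}
item_b = {"math": 1.0, "writing": 1.0, "lifting": 5.0}
```

Here `jaccard(item_a, item_b)` is 1.0 because both items contain every feature, whereas `cosine(item_a, item_b)` works out to 11/27, roughly 0.41.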

Crosswalk and Taxonomy Evolution

Occupation codes are not stable identifiers across taxonomy revisions. The same SOC code can refer to different occupations in different versions, and naively comparing values across revisions produces misleading results. A crosswalk — an explicit mapping from old codes to new codes with cardinality...

JobClass 2026-05-08 architecture

Data Quality Traps in Government Sources

Government data sources contain artifacts of their internal production processes — temp files in archives, renamed columns between releases, duplicate hierarchical rows, suppressed values that look like nulls but carry legal meaning, and CDN configurations that reject non-browser HTTP clients. Defen...

JobClass 2026-05-08 data-engineering

data-modeling pipeline python

Derived Metrics from Base Observations

Base observations are source truth; derived values are computed artifacts. Mixing them in the same table creates ambiguity about whether a number is a measurement or a calculation. Separating them into distinct tables — with explicit derivation methods and base-metric linkage — makes the distinction...

JobClass 2026-05-08 implementation

Design System Cross-Pollination

When two projects share an author, the stronger design system should inform the weaker one — but adopting visual feel is a different task than adopting architecture. Port the tokens and typography; don't port the rendering pipeline.

JobClass 2026-05-08 architecture

security testing deployment frontend pipeline

Dimensional Modeling for Labor Data

A four-layer warehouse architecture (raw, staging, core, marts) with strict separation of concerns at each layer produces a system where raw data is always recoverable, business meaning is assigned in exactly one place, and analytical queries never need to understand source formats.

JobClass 2026-05-08 architecture

database data-modeling pipeline

Extract Patterns for Government APIs

Federal data sources are not designed for programmatic access. They block bare HTTP requests, publish in heterogeneous formats, embed preamble rows in spreadsheets, and experience periodic outages around major releases. A robust extract layer must handle all of these realities with browser-like head...

JobClass 2026-05-08 implementation

data-modeling pipeline python

Fetch Shim Architecture

A static site can replicate a dynamic API by intercepting JavaScript `fetch()` calls and redirecting them to pre-built JSON files. The key technique is a monkey-patch of the global `fetch` function that routes API URLs to static file paths, with client-side filtering for search and client-side compo...

JobClass 2026-05-08 frontend

database deployment frontend python javascript

Geography Comparison Pitfalls

Geographic wage comparisons are inherently incomplete: nominal gaps do not account for cost-of-living differences, suppressed cells create invisible holes in small-occupation maps, and the same query pattern must work across national, state, and metro levels without separate code paths. A dimension-...

JobClass 2026-05-08 implementation

Idempotent Pipeline Design

Data pipelines fail: downloads time out, parsers hit unexpected formats, database connections drop. Idempotency (running the same operation twice produces the same result as running it once) must be designed into every layer: delete-before-insert for facts, check-before-insert for dimensions, and gr...
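
As a minimal sketch of the delete-before-insert pattern for facts, assuming a SQLite database and a hypothetical `fact_wage` table keyed by a `source_release` column (neither the table, the column names, nor the database engine comes from the lesson), a rerun of the same load leaves the row count unchanged:

```python
import sqlite3


def load_facts(conn, release, rows):
    """Replace every fact row for one release inside a single transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM fact_wage WHERE source_release = ?", (release,))
        conn.executemany(
            "INSERT INTO fact_wage (source_release, soc_code, mean_wage) "
            "VALUES (?, ?, ?)",
            [(release, soc_code, wage) for soc_code, wage in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_wage (source_release TEXT, soc_code TEXT, mean_wage REAL)"
)

# Placeholder codes and values for illustration only, not real wage data.
rows = [("00-0001", 100.0), ("00-0002", 90.0)]

load_facts(conn, "2023", rows)
load_facts(conn, "2023", rows)  # simulated rerun after a failure

row_count = conn.execute("SELECT COUNT(*) FROM fact_wage").fetchone()[0]
```

Wrapping the delete and the insert in one transaction matters here: if the insert fails halfway, the delete is rolled back too, so the table never ends up with a partially loaded release.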

JobClass 2026-05-08 data-engineering

database testing deployment data-modeling pipeline

Inflation Adjustment with CPI

Comparing nominal wages across years is misleading because the dollar's purchasing power changes over time. Converting to constant dollars using CPI-U deflation separates genuine labor market shifts from background price-level changes and is essential for any multi-vintage wage trend analysis.
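
The deflation step can be sketched as follows. The function name and the index values are assumptions made for illustration; the numbers are placeholders, not published CPI-U figures.

```python
def to_constant_dollars(nominal, cpi_year, cpi_base):
    """Express a nominal amount from one year in base-year dollars."""
    return nominal * (cpi_base / cpi_year)


# Placeholder index values and wages for illustration only.
cpi = {2020: 100.0, 2023: 120.0}
nominal_wage = {2020: 50_000.0, 2023: 57_000.0}

base_year = 2023
real_wage = {
    year: to_constant_dollars(wage, cpi[year], cpi[base_year])
    for year, wage in nominal_wage.items()
}

nominal_change = nominal_wage[2023] / nominal_wage[2020] - 1
real_change = real_wage[2023] / real_wage[2020] - 1
```

With these placeholder inputs the nominal wage rises while the constant-dollar wage falls, which is the kind of reversal that a nominal-only comparison hides.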

JobClass 2026-05-08 implementation

database data-modeling pipeline

Multi-Vintage Query Pitfalls

Once a warehouse holds multiple vintages of the same dataset, every query must explicitly decide whether it wants the latest snapshot or all history. Forgetting this decision produces silent data quality bugs — duplicate rows, empty columns, or misleading percentages — that look correct at the SQL l...

JobClass 2026-05-08 data-engineering

database

Ranked Movers and Outlier Interpretation

Percentage changes on small bases are statistically volatile and can dominate ranked lists even when the absolute economic impact is trivial. Any ranked-change display must show both percentage and absolute values so users can distinguish genuine labor market shifts from small-sample noise.
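
A small sketch of a ranked-change list that carries both values, using a hypothetical `rank_movers` helper and placeholder employment counts (none of which come from the lesson):

```python
def rank_movers(before, after):
    """Rank keys by absolute percentage change, keeping the absolute change too."""
    rows = []
    for key, old in before.items():
        if old == 0:
            continue  # percentage change is undefined on a zero base
        new = after[key]
        rows.append(
            {
                "key": key,
                "abs_change": new - old,
                "pct_change": (new - old) / old * 100,
            }
        )
    return sorted(rows, key=lambda r: abs(r["pct_change"]), reverse=True)


# Placeholder counts for illustration only.
before = {"small_occupation": 40, "large_occupation": 1_000_000}
after = {"small_occupation": 80, "large_occupation": 1_050_000}

movers = rank_movers(before, after)
```

Here the small occupation tops the percentage ranking with a gain of 40, while the large occupation ranks second despite a gain of 50,000. Showing both columns lets a reader see that at a glance.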

JobClass 2026-05-08 implementation

database pipeline javascript

Schema Drift Detection

Government data sources change column names, add or remove columns, and retype columns between releases — often without notice. A pipeline that assumes a fixed schema will silently break or load garbage. Proactive drift detection at the staging boundary turns silent corruption into a loud, actionabl...
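
One way to sketch a drift check at the staging boundary is to compare an expected column-to-type mapping against what the new file actually contains and fail loudly on any difference. The `check_schema` function, the `SchemaDriftError` class, and the column names are hypothetical, not taken from the lesson.

```python
class SchemaDriftError(Exception):
    """Raised when an incoming file does not match the expected schema."""


def check_schema(expected, actual):
    """Compare column -> type-name mappings and raise on any difference."""
    missing = sorted(set(expected) - set(actual))
    added = sorted(set(actual) - set(expected))
    retyped = sorted(
        col for col in set(expected) & set(actual) if expected[col] != actual[col]
    )
    if missing or added or retyped:
        raise SchemaDriftError(
            f"missing={missing} added={added} retyped={retyped}"
        )


# Hypothetical schemas: one column renamed, one retyped.
expected = {"occ_code": "str", "mean_wage": "float"}
drifted = {"occupation_code": "str", "mean_wage": "str"}

try:
    check_schema(expected, drifted)
    drift_message = None
except SchemaDriftError as exc:
    drift_message = str(exc)
```

A renamed column shows up as one missing name plus one added name, which is usually enough of a hint for a person to confirm the rename and update the expected schema.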

JobClass 2026-05-08 architecture

data-modeling pipeline python

Static Site Generation

A server-side web application can be deployed to a static hosting platform by pre-rendering every page and API response as files, then injecting a JavaScript fetch shim that transparently redirects API calls to the corresponding JSON files. The application's JavaScript never knows it's running on a...

JobClass 2026-05-08 frontend

security deployment api frontend pipeline

Testing and Deployment

Separating tests by their infrastructure requirements — fixtures-only, in-memory server, real database — lets CI run fast on every push while reserving expensive real-data validation for local runs. The deployment pipeline then layers lint, format, test, build, and deploy into a strict sequence wher...

JobClass 2026-05-08 deployment

database security testing deployment data-modeling

The Federal Labor Data Landscape

When building an analytical warehouse from multiple federal data products, the single most important architectural decision is identifying the stable external key that connects them. For labor data, that key is the Standard Occupational Classification (SOC) code — every design decision flows from tr...

JobClass 2026-05-08 data-engineering

The Multi-Vintage Challenge

When loading multiple vintages of the same dataset, dimension tables must deduplicate on business key alone — not on business key plus source release. Including the source release in dimension lookups gives the same real-world entity different surrogate keys in each vintage, making cross-vintage joi...
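
A minimal sketch of a dimension lookup keyed on the business key alone, assuming SQLite and a hypothetical `dim_occupation` table (the table, column, and function names are illustrative, not the lesson's):

```python
import sqlite3


def get_occupation_key(conn, soc_code, title):
    """Return the surrogate key for a SOC code, inserting the row only if absent."""
    row = conn.execute(
        "SELECT occupation_key FROM dim_occupation WHERE soc_code = ?",
        (soc_code,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO dim_occupation (soc_code, title) VALUES (?, ?)",
        (soc_code, title),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_occupation ("
    "occupation_key INTEGER PRIMARY KEY, soc_code TEXT UNIQUE, title TEXT)"
)

# The same placeholder code arrives while loading two different vintages.
key_first_vintage = get_occupation_key(conn, "00-0001", "Example occupation")
key_second_vintage = get_occupation_key(conn, "00-0001", "Example occupation")

dim_rows = conn.execute("SELECT COUNT(*) FROM dim_occupation").fetchone()[0]
```

Because the lookup ignores the source release, both vintages resolve to the same surrogate key, and fact rows from either vintage can be joined through one dimension row.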

JobClass 2026-05-08 data-engineering

database

Thread-Safe Database Connections

When a web framework dispatches synchronous endpoint handlers to a thread pool, a shared database connection will produce intermittent wrong results — not errors, but silently incorrect data. The fix is per-thread connections via `threading.local()`, with a global override path for test injection.
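
The per-thread pattern might look like the sketch below. It assumes SQLite purely for illustration (the lesson does not say which database is involved), and `get_connection` and `set_override` are hypothetical names.

```python
import sqlite3
import threading

_local = threading.local()  # each thread sees its own attributes on this object
_override = None            # tests can inject a single shared connection here


def set_override(conn):
    """Force get_connection() to return conn (pass None to clear)."""
    global _override
    _override = conn


def get_connection(path=":memory:"):
    """Return the injected connection if set, else this thread's own connection."""
    if _override is not None:
        return _override
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(path)
        _local.conn = conn
    return conn


main_conn = get_connection()

# A worker thread gets a different connection object than the main thread.
shared_with_main = []
worker = threading.Thread(
    target=lambda: shared_with_main.append(get_connection() is main_conn)
)
worker.start()
worker.join()
```

Repeated calls on one thread reuse that thread's connection, so the cost of connecting is paid once per worker rather than once per request.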

JobClass 2026-05-08 data-engineering

database security testing deployment python

Time-Series Normalization

Fact tables store snapshots — single measurements at single points in time. Time-series analysis requires a separate normalization step that aligns snapshots across periods into a conformed schema with explicit metric definitions, and a further separation between base observations and derived series...

JobClass 2026-05-08 implementation