Lessons Hub — Project Walkthrough
Lessons Hub is a full-stack system that collects engineering lessons from multiple GitHub repositories and consolidates them into a single searchable, browsable, AI-queryable library. It combines a static site (Astro + Pagefind) with a RAG-powered chatbot backend (FastAPI + pluggable LLM/vector adapters) and includes corpus gap detection, GitHub project discovery, and multi-cloud deployment infrastructure.
This document walks through the entire system end-to-end: how data flows from source repositories to a deployed site, how the RAG backend answers questions, and how gaps in knowledge trigger automated discovery of new sources.
Table of Contents
- High-Level Architecture
- Repository Layout
- The Source Repository Contract
- The Repo Registry
- The Harvest Pipeline
- Validation
- The RAG Corpus
- The Astro Static Site
- Search (Pagefind)
- The FastAPI Backend
- The Adapter Pattern
- Gap Detection
- GitHub Discovery
- Deployment
- CI/CD
- Testing
- Configuration Reference
- Getting Started
1. High-Level Architecture
The system has two halves that can run independently:
Static site (V1 core): A build-time pipeline clones source repos, extracts lesson markdown, normalizes metadata, generates JSON, builds an Astro static site, and indexes it with Pagefind. Deployed to GitHub Pages. Works with zero backend dependencies.
RAG backend (V2): A FastAPI service that chunks the lesson corpus, embeds it into a vector store, and answers natural-language questions with grounded citations. When the corpus can't answer a question well, the gap detector logs the gap. Gaps can trigger GitHub discovery to find new source repositories.
Source Repos (GitHub)
│
▼
data/repos.yml ──► harvest_lessons.py ──► clone + parse + normalize
│ │
│ ▼
│ src/content/generated/lessons.json
│ src/content/generated/repos.json
│ src/content/generated/tags.json
│ public/exports/lessons-pack.json
│ │
│ ┌───────────────────┤
│ ▼ ▼
│ build_rag_corpus.py Astro build
│ │ │
│ ▼ ▼
│ data/rag-chunks.json dist/ (static HTML)
│ │ │
│ ▼ ▼
│ embed_rag_corpus.py Pagefind index
│ │ │
│ ▼ ▼
│ ChromaDB / cloud GitHub Pages
│ vector store
│ │
│ ▼
│ FastAPI Backend
│ ├── /api/chat (RAG Q&A)
│ ├── /api/retrieve (raw chunk search)
│ ├── /api/gaps (gap records)
│ └── /api/discovery (GitHub search)
│ │
│ ▼
│ Gap Detector ──► GitHub Discovery ──► New source repos
│ │
└─────────────────────────────────────────────────────────┘
(cycle: new repos feed back into harvest)
2. Repository Layout
lessons/
├── data/
│ ├── repos.yml # Source repo registry (the single config file)
│ ├── platform-keywords.json # Keywords for gap type classification
│ ├── rag-chunks.json # Generated: chunked lessons for RAG
│ ├── rag-manifest.json # Generated: corpus metadata
│ ├── embed-hashes.json # Generated: incremental embedding tracking
│ ├── chromadb/ # Generated: local vector store
│ └── gaps/ # Generated: corpus gap records (JSON)
│
├── scripts/
│ ├── harvest_lessons.py # Clone repos, parse lessons, generate JSON
│ ├── validate_lessons.py # Validate generated lesson data
│ ├── build_rag_corpus.py # Chunk lessons by heading for RAG
│ ├── validate_rag_corpus.py # Validate RAG corpus integrity
│ ├── embed_rag_corpus.py # Embed chunks into vector store
│ └── lesson_core.py # Shared harvesting utilities
│
├── backend/
│ ├── Dockerfile # Multi-profile container build
│ ├── pyproject.toml # Python package with optional dependency groups
│ └── app/
│ ├── main.py # FastAPI app, middleware, router registration
│ ├── api/
│ │ ├── health.py # GET /health
│ │ ├── chat.py # POST /api/chat
│ │ ├── retrieve.py # POST /api/retrieve
│ │ ├── gaps.py # CRUD for corpus gap records
│ │ ├── todos.py # Coordination TODO records
│ │ └── github_discovery.py # POST /api/discovery/search
│ ├── rag/
│ │ ├── retriever.py # Embed query → vector search → top-k chunks
│ │ ├── generator.py # Build prompt → LLM chat → extract citations
│ │ ├── gap_detector.py # 7-rule gap detection engine
│ │ ├── gap_store.py # JSON-file gap persistence
│ │ ├── prompt_builder.py # System/user prompt construction
│ │ └── cache.py # TTL cache for retrieval/generation
│ ├── adapters/
│ │ ├── llm/
│ │ │ ├── base.py # Abstract LLMAdapter (embed + chat)
│ │ │ ├── ollama_adapter.py # Local: Ollama
│ │ │ ├── openai_adapter.py # OpenAI API
│ │ │ ├── bedrock_adapter.py # AWS Bedrock
│ │ │ ├── azure_openai_adapter.py # Azure OpenAI
│ │ │ └── vertex_adapter.py # GCP Vertex AI
│ │ └── vector/
│ │ ├── base.py # Abstract VectorAdapter
│ │ ├── chromadb_adapter.py # Local: ChromaDB
│ │ ├── opensearch_adapter.py # AWS OpenSearch
│ │ ├── azure_search_adapter.py # Azure AI Search
│ │ └── vertex_adapter.py # GCP Vertex AI Vector Search
│ ├── discovery/
│ │ ├── github_search.py # GitHub API search client
│ │ ├── candidate_scorer.py # Score/rank discovered repos
│ │ ├── repo_intake.py # Clone candidates locally
│ │ ├── lesson_extractor.py # Extract lessons from candidates
│ │ └── todo_writer.py # Create coordination TODOs
│ ├── models/
│ │ └── schemas.py # Pydantic v2 models for all API types
│ ├── metrics.py # Prometheus metrics (optional)
│ └── logging_config.py # Structured logging setup
│
├── src/
│ ├── pages/
│ │ ├── index.astro # Home page
│ │ ├── ask.astro # Chat interface ("Ask the Lessons")
│ │ ├── gaps.astro # Corpus gap browser
│ │ ├── candidate-lessons.astro # Discovered candidate lessons
│ │ ├── lessons/
│ │ │ ├── index.astro # All lessons listing
│ │ │ └── [id].astro # Individual lesson page
│ │ ├── repos/
│ │ │ ├── index.astro # All repos listing
│ │ │ └── [repo].astro # Lessons filtered by repo
│ │ ├── tags/[tag].astro # Lessons filtered by tag
│ │ ├── phases/[phase].astro # Lessons filtered by phase
│ │ └── types/[type].astro # Lessons filtered by type
│ ├── layouts/ # Astro layout templates
│ ├── components/ # Reusable Astro/TS components
│ └── content/generated/ # Generated JSON (not committed)
│
├── public/exports/ # AI-readable export packs (generated)
├── .github/workflows/ # 8 CI/CD workflows
├── docs/ # Design docs, PDRs, plans
├── tests/ # Python tests for harvest/validation
└── tmp/ # Temporary clone directory
3. The Source Repository Contract
Any GitHub repository can feed lessons into the hub. The contract is simple:
- Store lesson files at a configurable path (default: `docs/lessons/*.md`).
- Each lesson is a standalone markdown file.
- Files may include YAML frontmatter. Missing frontmatter is fine — the harvester normalizes everything.
Recommended frontmatter fields:
| Field | Required | Notes |
|---|---|---|
| `title` | Required (after normalization) | Falls back to first H1 or filename |
| `summary` | Recommended | One-line description |
| `date` | Recommended | ISO date |
| `phase` | Recommended | Project phase (e.g., "design", "implementation") |
| `lesson_type` | Recommended | One of: architecture, implementation, testing, deployment, debugging, data-design, ai-assisted-development, documentation, maintenance, process, other |
| `tags` | Recommended | List of topic tags |
| `status` | Optional | `active` (default), `superseded`, `draft`, `deprecated` |
Minimal valid lesson: a file with either a title in frontmatter, a first-line # Heading, or a filename that can be slugified into a title. The body must not be empty.
4. The Repo Registry
The entire system is driven by one configuration file: data/repos.yml.
repos:
- id: certification # Stable kebab-case identifier
name: Certification # Display name
owner: bonjohen # GitHub owner
repo: certification # GitHub repo name
branch: main # Branch to harvest
lessons_path: docs/lessons # Path to lesson files in the repo
project_url: https://github.com/bonjohen/certification
enabled: true # Optional, defaults to true
- id: artemis
name: Artemis
owner: bonjohen
repo: Artemis
branch: main
lessons_path: docs/lessons
project_url: https://github.com/bonjohen/Artemis
enabled: true
Adding a new source repo is a one-line edit. Add an entry to repos.yml and the next harvest picks it up. No code changes needed. All repos — whether you own them or not — go through the same pipeline.
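The registry-loading step can be sketched in a few lines of Python (an illustration only, not the actual `harvest_lessons.py` code; `enabled_repos` and the disabled `retired-project` entry are made up):

```python
def enabled_repos(registry: dict) -> list[dict]:
    """Return source repo entries that are enabled; the flag defaults to true."""
    return [r for r in registry.get("repos", []) if r.get("enabled", True)]

# Parsed form of a tiny repos.yml (what yaml.safe_load would return)
registry = {
    "repos": [
        {"id": "certification", "owner": "bonjohen", "repo": "certification"},
        {"id": "retired-project", "enabled": False},
    ]
}
print([r["id"] for r in enabled_repos(registry)])  # → ['certification']
```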
5. The Harvest Pipeline
The harvest is the core data pipeline. It runs via npm run harvest (which calls python scripts/harvest_lessons.py).
What it does
- Reads `data/repos.yml` to get the list of source repos.
- Clones each enabled repo into `tmp/repos/` using `git clone --depth 1`. If a `LESSONS_REPO_TOKEN` environment variable is set, it authenticates the clone for private repos.
- Scans each repo's `lessons_path` for `*.md` files.
- Parses each file with `python-frontmatter` to extract YAML frontmatter and markdown body.
- Normalizes metadata:
  - Title: from frontmatter, first H1, or filename.
  - ID: `{repo_id}-{slugified-title}`.
  - Tags: coerced to a list of lowercase strings (handles YAML quirks where scalars appear as lists).
  - Date: parsed and ISO-formatted.
  - Reading time: estimated from word count.
  - Source URL: constructed from GitHub owner/repo/branch/path.
- Generates output files:
  - `src/content/generated/lessons.json` — all lesson records.
  - `src/content/generated/repos.json` — repo metadata.
  - `src/content/generated/tags.json` — tag index.
  - `src/content/generated/phases.json` — phase index.
  - `src/content/generated/lesson_types.json` — type index.
  - `public/exports/lessons-pack.json` — full lesson data for AI consumption.
  - `public/exports/lessons-index.json` — lightweight index.
  - `public/exports/lessons-pack.md` — all lessons in a single markdown file.
Shared utilities (scripts/lesson_core.py)
The harvester and validator share a core library with functions for slug generation, tag normalization, date coercion, title extraction, and validation constants (required fields, ID patterns, controlled vocabularies).
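A rough sketch of what the title-fallback and slug utilities might look like (the function names here are hypothetical illustrations, not the real `lesson_core.py` API):

```python
import re

def slugify(text: str) -> str:
    """Lowercase, collapse non-alphanumeric runs to hyphens, trim edge hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def derive_title(frontmatter: dict, body: str, filename: str) -> str:
    """Frontmatter title, else the first H1 in the body, else the filename."""
    if frontmatter.get("title"):
        return frontmatter["title"]
    for line in body.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return filename.rsplit(".", 1)[0].replace("-", " ").title()

title = derive_title({}, "# Schema Drift\nBody text...", "schema-drift.md")
print(f"certification-{slugify(title)}")  # → certification-schema-drift
```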
6. Validation
Validation runs via npm run validate:lessons (calls python scripts/validate_lessons.py).
It reads the generated lessons.json and checks every lesson record against the schema. Issues are classified by severity:
ERROR (build fails):
- Missing or malformed `repos.yml`.
- Duplicate lesson IDs.
- Empty lesson content.
- Unreadable files.
- Invalid generated JSON.
WARNING (build continues):
- Missing `summary`, `date`, `tags`, `phase`, or `lesson_type`.
- Unknown values in controlled vocabularies (e.g., an unrecognized `lesson_type`).
- Very short content.
A separate validator exists for the RAG corpus (npm run validate:corpus) that checks data/rag-chunks.json for structural integrity — valid chunk IDs, non-empty text, correct lesson references, and consistent metadata.
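The ERROR/WARNING split can be illustrated with a simplified checker (hypothetical function name and a reduced rule set; the real `validate_lessons.py` covers more cases):

```python
def check_lesson(lesson: dict, seen_ids: set[str]) -> list[tuple[str, str]]:
    """Classify findings for one lesson record as ERROR (fails build) or WARNING."""
    findings = []
    if lesson["id"] in seen_ids:
        findings.append(("ERROR", f"duplicate id: {lesson['id']}"))
    if not lesson.get("content", "").strip():
        findings.append(("ERROR", "empty lesson content"))
    for field in ("summary", "date", "tags", "phase", "lesson_type"):
        if not lesson.get(field):
            findings.append(("WARNING", f"missing {field}"))
    return findings

lesson = {"id": "certification-schema-drift", "content": "We caught drift...", "tags": []}
print(sorted({sev for sev, _ in check_lesson(lesson, set())}))  # → ['WARNING']
```

Warnings alone let the build continue; any ERROR finding fails it.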
7. The RAG Corpus
The RAG corpus is built from the harvested lessons via npm run corpus (calls python scripts/build_rag_corpus.py).
Chunking strategy
Lessons are split by H2 headings. Each heading boundary becomes a chunk. Content before the first H2 becomes the "Introduction" chunk (index 0). This preserves semantic structure — each chunk is a coherent section of a lesson.
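A minimal sketch of H2-based chunking (illustrative only; `chunk_by_h2` and the 16-character hash truncation are assumptions, not the actual `build_rag_corpus.py` code):

```python
import hashlib
import re

def chunk_by_h2(lesson_id: str, title: str, markdown: str) -> list[dict]:
    """Split a lesson body on H2 headings; text before the first H2 is chunk 0."""
    parts = re.split(r"^## ", markdown, flags=re.MULTILINE)
    chunks = []
    for index, part in enumerate(parts):
        # Chunk 0 is everything before the first H2; later chunks start with a heading line
        heading, _, text = part.partition("\n") if index else ("Introduction", "", part)
        chunks.append({
            "chunk_id": f"{lesson_id}-{index}",
            "chunk_index": index,
            "heading_path": f"{title} > {heading.strip()}",
            "chunk_text": text.strip(),
            # Assumed: first 16 hex chars of a SHA-256 over the chunk text
            "content_hash": hashlib.sha256(text.strip().encode()).hexdigest()[:16],
        })
    return chunks

demo = chunk_by_h2("certification-schema-drift", "Schema Drift Validation",
                   "We caught drift in production.\n## Context\nMocks passed; prod failed.")
print([c["heading_path"] for c in demo])
# → ['Schema Drift Validation > Introduction', 'Schema Drift Validation > Context']
```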
Each chunk record contains:
{
"chunk_id": "certification-schema-drift-0",
"lesson_id": "certification-schema-drift",
"repo_id": "certification",
"title": "Schema Drift Validation",
"summary": "How we caught schema drift in production",
"lesson_url": "/lessons/certification-schema-drift",
"chunk_index": 0,
"heading_path": "Schema Drift Validation > Introduction",
"chunk_text": "...",
"token_count": 142,
"tags": ["validation", "schema"],
"content_hash": "a1b2c3d4e5f6g7h8"
}
The content_hash enables incremental embedding — only chunks whose content has changed since the last run need re-embedding.
Embedding
npm run embed (calls python scripts/embed_rag_corpus.py) reads rag-chunks.json, embeds each chunk via the configured LLM adapter, and indexes the embeddings into the vector store. Change detection uses data/embed-hashes.json to skip unchanged chunks.
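The skip logic amounts to a per-chunk hash comparison; roughly (hypothetical helper name):

```python
def chunks_needing_embedding(chunks: list[dict], previous_hashes: dict) -> list[dict]:
    """Keep only chunks whose content_hash differs from the last recorded run."""
    return [c for c in chunks if previous_hashes.get(c["chunk_id"]) != c["content_hash"]]

previous_hashes = {"lesson-0": "aaaa", "lesson-1": "bbbb"}  # loaded from data/embed-hashes.json
chunks = [
    {"chunk_id": "lesson-0", "content_hash": "aaaa"},  # unchanged -> skipped
    {"chunk_id": "lesson-1", "content_hash": "cccc"},  # edited -> re-embed
    {"chunk_id": "lesson-2", "content_hash": "dddd"},  # new -> embed
]
print([c["chunk_id"] for c in chunks_needing_embedding(chunks, previous_hashes)])
# → ['lesson-1', 'lesson-2']
```

After embedding, the hash file would be rewritten with the current hashes so the next run starts from this baseline.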
8. The Astro Static Site
The frontend is an Astro static site configured for pure static output (output: 'static' in astro.config.mjs). It generates HTML at build time from the JSON produced by the harvest pipeline.
Pages
| Route | Source | Description |
|---|---|---|
| `/` | `src/pages/index.astro` | Home page with lesson count, repo list, recent lessons |
| `/lessons/` | `src/pages/lessons/index.astro` | All lessons, sortable and filterable |
| `/lessons/[id]` | `src/pages/lessons/[id].astro` | Individual lesson with rendered markdown |
| `/repos/` | `src/pages/repos/index.astro` | All source repositories |
| `/repos/[repo]` | `src/pages/repos/[repo].astro` | Lessons from a single repo |
| `/tags/[tag]` | `src/pages/tags/[tag].astro` | Lessons with a specific tag |
| `/phases/[phase]` | `src/pages/phases/[phase].astro` | Lessons from a project phase |
| `/types/[type]` | `src/pages/types/[type].astro` | Lessons of a specific type |
| `/ask` | `src/pages/ask.astro` | RAG chat interface ("Ask the Lessons") |
| `/gaps` | `src/pages/gaps.astro` | Corpus gap browser |
| `/candidate-lessons` | `src/pages/candidate-lessons.astro` | Discovered candidate lessons |
Build
npm run build runs the Astro build. It reads JSON from src/content/generated/ and produces static HTML in dist/. The site uses marked for markdown rendering and sanitize-html for XSS protection.
9. Search (Pagefind)
After the Astro build, npm run index runs Pagefind against the dist/ directory. Pagefind generates a client-side search index that works entirely in the browser with no backend. Users can search by keyword across all lesson content, titles, tags, and metadata.
Pagefind was chosen because it works with any static host (no server-side search infrastructure) and scales well for documentation-sized corpora.
10. The FastAPI Backend
The backend is a Python FastAPI application at backend/app/main.py. It runs independently of the static site — if the backend is down, the static site still works (the chat panel shows a fallback message).
API Endpoints
All API routes are mounted at both /api/ and /api/v1/ for forward-compatible versioning.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check with component status |
| POST | `/api/chat` | Ask a question, get a grounded answer with citations |
| POST | `/api/retrieve` | Raw chunk retrieval without answer generation |
| GET/POST | `/api/gaps` | List, create, update corpus gap records |
| POST | `/api/discovery/search` | Search GitHub for repos relevant to a gap |
| GET/POST | `/api/todos` | Coordination TODOs for external lesson handling |
| GET | `/metrics` | Prometheus metrics (if `prometheus_client` is installed) |
Chat flow
When a user sends a message to /api/chat:
- Retriever embeds the query using the LLM adapter, queries the vector store for the top-k most similar chunks, and returns ranked results. Results are cached with a TTL.
- Generator takes the retrieved chunks, builds a grounded prompt (system message with instructions + user question + context chunks), sends it to the LLM, and extracts the answer with lesson citations.
- Gap Detector evaluates whether the answer is well-supported by the corpus. If not, it creates or updates a gap record.
The response includes the answer text, a list of relevant lessons with similarity scores and links, and a flag indicating whether a corpus gap was detected.
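The three steps above can be sketched as a small orchestration function with stub components (all class and method names here are illustrative stand-ins, not the real backend code):

```python
class StubRetriever:
    def retrieve(self, query: str) -> list[dict]:
        # Real code: embed the query, run a vector search, return top-k chunks
        return [{"lesson_id": "certification-schema-drift", "score": 0.82}]

class StubGenerator:
    def generate(self, query: str, chunks: list[dict]) -> tuple[str, list[str]]:
        # Real code: build a grounded prompt, call the LLM, extract citations
        return "Grounded answer text", sorted({c["lesson_id"] for c in chunks})

class StubGapDetector:
    def evaluate(self, query: str, chunks: list[dict], answer: str):
        # Real code: run the 7 gap rules; here, only a relevance floor
        if not chunks or max(c["score"] for c in chunks) < 0.3:
            return {"gap_type": "missing_topic"}
        return None

def answer_question(query, retriever, generator, gap_detector) -> dict:
    chunks = retriever.retrieve(query)
    answer, citations = generator.generate(query, chunks)
    gap = gap_detector.evaluate(query, chunks, answer)
    return {"answer": answer, "relevant_lessons": citations, "gap_detected": gap is not None}

resp = answer_question("How do we catch schema drift?",
                       StubRetriever(), StubGenerator(), StubGapDetector())
print(resp["gap_detected"])  # → False
```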
Request/Response Models
All API types are defined as Pydantic v2 models in backend/app/models/schemas.py:
- `ChatRequest`/`ChatResponse` — message, optional filters (repo, tags, lesson_type), answer, relevant lessons, gap info.
- `RetrieveRequest`/`RetrieveResponse` — query, ranked chunk results.
- `GapRecord` — gap ID, trigger query, normalized topic, gap type, status, suggested GitHub queries.
- `CandidateRepo` — discovered repo with score, harvest status, candidate lesson paths.
- `TodoRecord` — coordination TODO with priority, severity, linked gap/project.
Middleware
- CORS: Configurable via the `CORS_ORIGINS` environment variable. Defaults to localhost ports 4321 and 3000.
- Metrics: Request duration and count tracked per path/method/status when `prometheus_client` is installed.
11. The Adapter Pattern
All LLM and vector store operations go through abstract base classes. This is how the system supports local development, multiple cloud providers, and testing with the same application code.
LLM Adapters
The LLMAdapter abstract class defines two methods:
from abc import ABC

class LLMAdapter(ABC):
    def embed(self, texts: list[str]) -> list[list[float]]: ...
    def chat(self, messages: list[dict]) -> str: ...
Implementations:
| Adapter | Provider | Use Case |
|---|---|---|
| `ollama_adapter.py` | Ollama (local) | Local development, no cloud dependencies |
| `openai_adapter.py` | OpenAI API | OpenAI models |
| `bedrock_adapter.py` | AWS Bedrock | AWS deployment |
| `azure_openai_adapter.py` | Azure OpenAI | Azure deployment |
| `vertex_adapter.py` | GCP Vertex AI | GCP deployment |
Vector Adapters
The VectorAdapter abstract class defines:
from abc import ABC

class VectorAdapter(ABC):
    def index_chunks(self, chunks, embeddings) -> int: ...
    def query(self, embedding, top_k, filters) -> list[dict]: ...
    def delete_collection(self) -> None: ...
    def count(self) -> int: ...
Implementations:
| Adapter | Provider | Use Case |
|---|---|---|
| `chromadb_adapter.py` | ChromaDB (local) | Local development |
| `opensearch_adapter.py` | AWS OpenSearch | AWS deployment |
| `azure_search_adapter.py` | Azure AI Search | Azure deployment |
| `vertex_adapter.py` | GCP Vertex AI Vector Search | GCP deployment |
Lazy imports
Cloud adapters import their SDKs (boto3, azure-*, google-cloud-*) inside __init__, not at module level. This means the application starts without requiring any cloud SDK installed — only the adapter you actually use needs its SDK present.
Testing cloud adapters
Tests use sys.modules.setdefault() to inject mock modules before the adapter's __init__ runs. This avoids needing real cloud SDKs installed in the test environment while still testing the adapter logic.
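Both ideas together — lazy SDK import plus mock injection — can be sketched like this (using a made-up module name `lessonscloudsdk` in place of a real SDK such as `boto3`, and a hypothetical adapter class):

```python
import importlib
import sys
import types

class CloudAdapterSketch:
    """Sketch: the cloud SDK import happens in __init__, not at module import time."""
    def __init__(self):
        # App startup never touches the SDK; only instantiating the adapter does.
        self._sdk = importlib.import_module("lessonscloudsdk")

# Test-side trick: inject a fake module before the adapter's __init__ runs.
fake_sdk = types.ModuleType("lessonscloudsdk")
fake_sdk.client = lambda name: f"fake {name} client"
sys.modules.setdefault("lessonscloudsdk", fake_sdk)

adapter = CloudAdapterSketch()
print(adapter._sdk.client("bedrock-runtime"))  # → fake bedrock-runtime client
```

`setdefault` only installs the fake when no real module is present, so the same test code also runs in environments where the SDK happens to be installed.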
12. Gap Detection
The gap detector (backend/app/rag/gap_detector.py) runs after every chat response and evaluates whether the corpus adequately answered the question. It uses 7 rules:
- No results: Zero chunks retrieved.
- Low relevance: All chunk similarity scores below the threshold (default 0.3).
- Related but unanswered: Chunks are somewhat related but not directly answering the question.
- Thin coverage: Fewer than the minimum distinct lessons (default 2) match.
- Missing platform: Query mentions a specific platform/technology not represented in results.
- Weak-answer signals: The LLM response contains phrases like "does not appear to contain," "no relevant lessons," etc.
- General-knowledge answer: The LLM gave a long answer but with weak evidence from the corpus (it's generating from training data, not from lessons).
Each detected gap is classified into a type:
- `missing_topic` — the corpus has no coverage of this area.
- `thin_coverage` — some related lessons exist but not enough depth.
- `missing_platform` — platform-specific gap (e.g., "Azure Container Apps").
- `missing_example`, `missing_deployment_pattern`, `missing_failure_case`, `missing_comparison`, `missing_reference_implementation` — specific knowledge types.
Gap records are stored as JSON files in data/gaps/ and exposed via the /api/gaps endpoint.
13. GitHub Discovery
When a corpus gap is identified, the discovery system can search GitHub for repositories that might fill it.
Flow
- Search (`discovery/github_search.py`): Takes queries derived from the gap record. Searches the GitHub API for repositories, filters by language and star count, respects rate limits.
- Score (`discovery/candidate_scorer.py`): Ranks discovered repos by relevance — does the repo's description, topics, and README match the gap? Does it have lessons-like documentation?
- Intake (`discovery/repo_intake.py`): Clones high-scoring candidates locally.
- Extract (`discovery/lesson_extractor.py`): Applies the lesson extraction workflow to the cloned repo — looks for documentation that could be converted to lessons.
- TODO (`discovery/todo_writer.py`): Creates a coordination TODO so the maintainer can review the candidate lessons and, if appropriate, contact the repo owner before proposing contributions.
Generated candidate lessons always include clear attribution (source project, URL, harvest date, review status). Nothing is treated as authoritative until a human reviews it.
14. Deployment
The project supports multiple deployment targets. The static site and backend can be deployed independently.
Static Site — GitHub Pages
The primary deployment. The build-deploy.yml workflow builds the site and deploys to GitHub Pages on every push to main. The site is available at https://bonjohen.github.io/lessons/.
Backend — Docker
The backend Dockerfile uses a build argument for deployment profile:
FROM python:3.11-slim
WORKDIR /app
COPY . .
ARG PROFILE=local
RUN pip install --no-cache-dir -e ".[${PROFILE}]"
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Build with a specific profile to pull in the right cloud SDK dependencies:
docker build --build-arg PROFILE=aws -t lessons-backend . # AWS
docker build --build-arg PROFILE=azure -t lessons-backend . # Azure
docker build --build-arg PROFILE=gcp -t lessons-backend . # GCP
docker build --build-arg PROFILE=local -t lessons-backend . # Local (default)
Platform-Specific Deployments
| Platform | Workflow | Config |
|---|---|---|
| GitHub Pages | `.github/workflows/build-deploy.yml` | `astro.config.mjs` |
| AWS | `.github/workflows/deploy-aws.yml` | Bedrock + OpenSearch adapters |
| Azure | `.github/workflows/deploy-azure.yml` | Azure OpenAI + AI Search adapters |
| GCP | `.github/workflows/deploy-gcp.yml` | Vertex AI + Vector Search adapters |
| Fly.io | `.github/workflows/deploy-flyio.yml` | `fly.toml` |
| Railway | `.github/workflows/deploy-railway.yml` | `railway.toml` |
| Staging | `.github/workflows/deploy-staging.yml` | Pre-production validation |
| Production | `.github/workflows/deploy-production.yml` | Production promotion |
Cloud deployments use OIDC / Workload Identity Federation — no long-lived credentials stored in GitHub secrets.
15. CI/CD
The main workflow (.github/workflows/build-deploy.yml) runs on every push, PR, and daily at 06:00 UTC (to pick up new lessons from source repos).
Pipeline stages
- Checkout the repo.
- Set up Python 3.11 and Node 22.
- Install Python and Node dependencies.
- Lint the backend with `ruff check` and `ruff format --check`.
- Run project tests (`pytest tests/` — 76 tests covering harvesting, validation, slugs, corpus).
- Run backend tests (`pytest backend/tests/` — 102 tests covering health, chat, adapters, gaps, discovery, cloud).
- Harvest lessons from all source repos.
- Validate the harvested lesson data.
- Build the RAG corpus.
- Validate the RAG corpus.
- Build the Astro static site.
- Index with Pagefind.
- Deploy to GitHub Pages (skipped on PRs).
The daily schedule means the site automatically picks up new lessons added to source repos without manual intervention.
16. Testing
Project tests (tests/)
76 tests covering the harvest and validation pipeline:
- Repo registry loading and validation.
- Frontmatter parsing and normalization.
- Slug generation edge cases.
- Tag coercion (YAML list-valued scalars to strings).
- Date parsing across formats.
- Export file generation.
- RAG corpus chunking.
- Validation rule coverage.
Backend tests (backend/tests/)
102 tests covering the FastAPI backend:
- Health endpoint responses.
- Chat request/response serialization.
- Retriever and generator logic.
- Gap detection rules (all 7).
- Gap store persistence.
- GitHub discovery search, scoring, intake, extraction.
- TODO creation.
- Cloud adapter initialization and method calls (mocked via `sys.modules.setdefault()`).
- Prompt builder output.
- Cache TTL behavior.
- Metrics middleware.
Running tests
# Project tests
python -m pytest tests/
# Backend tests
python -m pytest backend/tests/
# Lint
ruff check backend/
ruff format --check backend/
17. Configuration Reference
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| `LESSONS_REPO_TOKEN` | GitHub token for cloning private repos | None (public repos only) |
| `GITHUB_TOKEN` | GitHub API token for discovery searches | None (anonymous, lower rate limits) |
| `CORS_ORIGINS` | Comma-separated allowed origins for backend | `http://localhost:4321,http://localhost:3000` |
| `GAP_MIN_RELEVANCE_THRESHOLD` | Minimum chunk relevance to avoid gap detection | 0.3 |
| `GAP_MIN_DISTINCT_LESSONS` | Minimum distinct lessons for adequate coverage | 2 |
| `GAP_RELATED_BUT_UNANSWERED_THRESHOLD` | Threshold for rule 3 | 0.5 |
| `GAP_MIN_CHUNKS_FOR_RELATED` | Minimum chunks for related-but-unanswered | 3 |
| `GAP_MIN_ANSWER_LENGTH_FOR_WEAK_EVIDENCE` | Length threshold for general-knowledge detection | 200 |
npm Scripts
| Script | What it does |
|---|---|
| `npm run dev` | Start Astro dev server with hot reload |
| `npm run build` | Build static site to `dist/` |
| `npm run harvest` | Clone source repos and generate JSON |
| `npm run validate:lessons` | Validate generated lesson data |
| `npm run corpus` | Build RAG corpus from `lessons.json` |
| `npm run validate:corpus` | Validate RAG corpus integrity |
| `npm run embed` | Embed chunks into vector store |
| `npm run index` | Run Pagefind indexing on `dist/` |
| `npm run build:full` | Full pipeline: harvest → validate → corpus → validate → build → index |
| `npm run backend` | Start FastAPI backend with hot reload |
18. Getting Started
Prerequisites
- Node.js 22+
- Python 3.11+
- Git
Quick start (static site only)
git clone https://github.com/bonjohen/lessons.git
cd lessons
npm install
pip install -r requirements.txt
npm run build:full
npm run dev # Open http://localhost:4321/lessons/
Full stack (with RAG backend)
# Install backend with local development dependencies
pip install -e "backend[dev]"
# Build the corpus and embed it
npm run harvest
npm run corpus
npm run embed
# Start the backend (in one terminal)
npm run backend # Runs on http://localhost:8000
# Start the frontend (in another terminal)
npm run dev # Open http://localhost:4321/lessons/ask
Adding a new source repository
- Edit `data/repos.yml` — add an entry with `id`, `name`, `owner`, `repo`, `branch`, and `lessons_path`.
- Run `npm run build:full`.
- The new repo's lessons appear on the site.
Writing a lesson
Create a markdown file in your source repo's docs/lessons/ directory:
---
title: What We Learned About Schema Drift
summary: Schema drift caught in production after mock tests passed
date: 2026-01-15
lesson_type: debugging
tags: [validation, schema, production]
phase: implementation
---
## Context
We deployed a database migration that changed column types...
## What Went Wrong
Our mocked tests didn't exercise the actual database...
## What We Changed
We switched to integration tests against a real database...
The next harvest will pick it up automatically.