Merit Engine - Software Build Plan
Engineering Specification - Version 1.0
Engineer: Claude (Sonnet 4.6) acting as Lead Software Engineer for FSP
Client: Fairlawn Strategy Partners / Tonya R. Dawson
Date: June 29, 2026
Overview
This document translates the Merit Engine concept into a concrete software engineering plan. The goal is to build an IRT-adaptive tutoring platform that can be piloted with one public safety agency within 90-120 days of development start.
The stack is chosen for:
- Rapid iteration - we will learn from beta users and adjust
- Minimal ops burden - small team, no full-time DevOps needed
- Psychometric correctness - the IRT math must be right; this is a validated professional tool
Recommended Tech Stack
Core Application
| Layer | Technology | Why |
|---|---|---|
| Frontend (Candidate Portal) | React + TypeScript | Component-based, strong ecosystem, easy to hire for |
| Frontend (Admin Dashboard) | React + Recharts / D3.js | Data visualization for theta curves and heat maps |
| Backend API | Python (FastAPI) | Native Python data science ecosystem; IRT libraries available |
| Database | PostgreSQL | Relational structure suits item banks, user responses, session logs |
| Auth | Auth0 or Supabase Auth | SOC 2 compliant, handles multi-tenant (per-agency) isolation |
| File Storage | AWS S3 or Azure Blob | SOP document ingestion and storage |
| Hosting | AWS or Azure | Required for SOC 2 compliance path |
IRT Engine
| Component | Technology | Notes |
|---|---|---|
| IRT item parameter estimation | Python catsim library or custom implementation | catsim is purpose-built for Computerized Adaptive Testing |
| Spaced repetition scheduler | SM-2 algorithm (SuperMemo) - open source | Battle-tested; the basis of Anki's classic scheduler |
| Document parsing / NLP | LangChain + Claude API (Anthropic) | For SOP ingestion, question generation, and summarization |
| Question generation | Claude API with structured prompts | Generates IRT-calibratable items from ingested SOP text |
| Voice AI simulation | ElevenLabs (TTS) + Whisper (STT) + Claude API | Oral board role-play pipeline |
| Embedding / semantic search | OpenAI text-embedding-3-small or Voyage AI embeddings (the provider Anthropic's documentation recommends; the Claude API has no embeddings endpoint) | For policy retrieval in conversational mode |
| Vector database | Pinecone or pgvector (Postgres extension) | Stores SOP embeddings for retrieval |
Data Model (Core Tables)
Agency
- id
- name
- state
- sop_document_ids[]
- item_bank_id (sovereign per agency)
ItemBank
- id
- agency_id
- items[]
Item (Question)
- id
- item_bank_id
- text
- answer_options (JSON)
- correct_answer
- domain (e.g., "Use of Force", "Search and Seizure", "Budget")
- criticality_weight (1-10, set at ingestion)
- irt_a (discrimination parameter - 2PL/3PL)
- irt_b (difficulty parameter)
- irt_c (pseudo-guessing parameter - 3PL only)
- calibration_status (uncalibrated | provisional | calibrated)
- response_count (how many officer responses used to estimate params)
Candidate
- id
- agency_id
- anonymized_id (for admin dashboard - never PII to command)
- current_theta (float, updates after every session)
- theta_history (JSON array of {timestamp, theta})
- enrollment_date
- exam_date
Session
- id
- candidate_id
- session_type (diagnostic | mastery | simulation | mock_exam)
- start_time
- end_time
- phase (1 | 2 | 3)
Response
- id
- session_id
- item_id
- answer_given
- is_correct
- response_time_ms (for decision latency tracking)
- theta_before
- theta_after
SpacedRepetitionSchedule
- id
- candidate_id
- item_id
- next_review_date
- interval_days
- ease_factor (SM-2 parameter)
- repetition_count
ElearningTrigger
- id
- candidate_id
- domain
- theta_at_trigger
- stage_reached (1=targeted_practice | 2=source_review | 3=elearning_invite)
- trigger_date
- stage2_source_shown (bool - was verified source displayed at Stage 2?)
- module_invited (bool)
- module_enrolled (bool)
- module_completed (bool)
- not_now_count (int - how many times candidate dismissed Stage 3 invite)
SourceVerificationLog
- id
- item_id
- check_timestamp
- check_type (AUTOMATED | AI_CHECK | HUMAN_REVIEW | SCHEDULED)
- result (PASS | FAIL | FLAGGED)
- failure_reason (null if pass; description if fail)
- reviewed_by (null if automated; FSP user id if human)
Item table additions (append to existing Item record):
- source_document (e.g., "Alabama Code Title 13A")
- source_section (e.g., "Section 13A-6-132")
- source_text_excerpt (the specific paragraph the item tests)
- source_url (if document is accessible online or in agency portal)
- source_content_hash (hash of source_text_excerpt at last verification)
- source_verification_status (UNVERIFIED | VERIFIED | SUSPENDED | ARCHIVED)
- source_verified_date (timestamp of last successful verification)
- source_verification_notes (human reviewer comments if manually reviewed)
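A minimal sketch of how the automated source check could use source_content_hash, assuming SHA-256 over whitespace-normalized text. The function names, the normalization rule, and the failure message are illustrative assumptions, not decisions recorded in this plan.

```python
import hashlib


def content_hash(excerpt: str) -> str:
    """Hash a source excerpt after collapsing whitespace (assumed normalization)."""
    normalized = " ".join(excerpt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def verify_source(stored_hash: str, current_excerpt: str) -> dict:
    """Compare the stored hash to the current text; return a SourceVerificationLog-style row."""
    if content_hash(current_excerpt) == stored_hash:
        return {"check_type": "AUTOMATED", "result": "PASS", "failure_reason": None}
    return {
        "check_type": "AUTOMATED",
        "result": "FAIL",
        "failure_reason": "source text changed since last verification",
    }
```

Whether a FAIL moves the item straight to SUSPENDED or only queues it for HUMAN_REVIEW is left open here.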
IRT Implementation Plan
Phase 1: Rasch Model (1PL) - Start here
The simplest IRT model. Every item has only a difficulty parameter (b). Discrimination is assumed equal across all items.
Why start here:
- Item calibration requires less data (100-200 responses can suffice, vs. 400+ for 3PL)
- The math is simpler to validate and explain to clients
- Still a substantial improvement over classical test theory (CTT) scoring
The adaptive algorithm:
1. Start candidate at θ = 0 (average ability)
2. Serve item with b closest to current θ estimate
3. Score response; update θ using maximum likelihood estimation (MLE) or expected a posteriori (EAP)
4. Serve next item closest to updated θ
5. Repeat until standard error of θ falls below threshold (typically SE < 0.30) or item count reaches session cap
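The loop above can be sketched in plain Python. This is a sketch under stated assumptions rather than the production engine: it uses EAP with a standard normal prior on a fixed quadrature grid, represents the item bank as a hypothetical dict of item id to b, and takes the candidate's answers from a caller-supplied function.

```python
import math

# Quadrature grid for EAP; the range and step are assumptions.
GRID = [-4.0 + 0.1 * i for i in range(81)]


def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap_estimate(responses: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return (theta, standard_error) from (b, is_correct) pairs."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior (assumed)
        for b, correct in responses:
            p = p_correct(t, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def next_item(theta: float, bank: dict[str, float], used: set[str]) -> str:
    """Pick the unused item whose b is closest to the current theta."""
    unused = [(item_id, b) for item_id, b in bank.items() if item_id not in used]
    return min(unused, key=lambda pair: abs(pair[1] - theta))[0]


def run_session(bank, answer_fn, se_threshold=0.30, max_items=20):
    """Serve items until SE drops below the threshold or the cap is reached."""
    theta, se = 0.0, float("inf")
    responses, used = [], set()
    while se >= se_threshold and len(used) < min(max_items, len(bank)):
        item_id = next_item(theta, bank, used)
        used.add(item_id)
        responses.append((bank[item_id], answer_fn(item_id)))
        theta, se = eap_estimate(responses)
    return theta, se, len(used)
```

With Rasch items each response carries at most 0.25 units of information, so a 20-item session will usually end at the item cap rather than at SE < 0.30. That is worth keeping in mind when choosing the session cap.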
Phase 2: 2-Parameter Logistic (2PL) - After first 500 item responses
Adds discrimination parameter (a) - how sharply an item separates high-ability from low-ability candidates.
Migration path: Once you have real response data from the pilot agency, run a calibration analysis (for example with the R package mirt, or a Python IRT library) to estimate both the a and b parameters per item. Update the database. The adaptive engine uses the same logic; it just reads one more parameter per item.
Phase 3: 3-Parameter Logistic (3PL) - After first 2,000 item responses
Adds pseudo-guessing parameter (c). Most important for multiple-choice where lucky guessing inflates scores.
This is the “guessing filter” you pitch to Chiefs - the Merit Engine distinguishes true mastery from lucky guessing.
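To make the three phases concrete, here is a minimal sketch of the 3PL response function and its item information, written so that 1PL and 2PL fall out as defaults (a = 1, c = 0). It assumes the plain logistic metric with no 1.7 scaling constant; the function names are illustrative.

```python
import math


def p_correct_3pl(theta: float, b: float, a: float = 1.0, c: float = 0.0) -> float:
    """Probability of a correct response: c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, b: float, a: float = 1.0, c: float = 0.0) -> float:
    """Fisher information of one item at theta; reduces to a^2 * P * Q when c = 0."""
    p = p_correct_3pl(theta, b, a, c)
    return a * a * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)
```

For a four-option item with c = 0.25, a candidate far below the item's difficulty still answers correctly about a quarter of the time, and the model discounts those correct answers accordingly.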
Document Ingestion Pipeline
This is how agency SOPs become quiz items:
1. Agency uploads SOP document (PDF, Word, or URL to their policy portal)
2. Pipeline splits document into sections by policy number / chapter
3. Each section is tagged with:
    - Domain classification (NLP classifier: Use of Force, Search and Seizure, etc.)
    - Criticality weight (rules-based on domain + keyword signals)
4. Claude API generates N candidate quiz items per section using a structured prompt:
    - "Generate 3 multiple-choice questions, one at each difficulty level: easy, medium, hard"
    - "Each question must be answerable from the text provided, no inference required"
    - "Every incorrect answer option must be a plausible distractor"
5. Items enter the item bank with calibration_status = 'uncalibrated'
6. Uncalibrated items are served with a flag - their b parameter is estimated from content difficulty signals (section heading, sentence complexity score) until real response data calibrates them properly
7. After 200+ responses per item, run calibration and update irt_b
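Steps 4 through 6 can be sketched as a validation pass over the model's output. The sketch below assumes the structured prompt asks for a JSON array with text, answer_options, correct_answer, and difficulty keys, and it seeds irt_b from the difficulty label alone. The JSON shape, the SEED_B values, and the function name are all hypothetical; the model call itself is left out.

```python
import json

# Hypothetical provisional b values per difficulty label, replaced at calibration.
SEED_B = {"easy": -1.0, "medium": 0.0, "hard": 1.0}


def to_uncalibrated_items(raw_json: str, domain: str, criticality_weight: int) -> list[dict]:
    """Validate generated items and shape them as uncalibrated Item rows."""
    items = []
    for raw in json.loads(raw_json):
        options = raw.get("answer_options", {})
        # Drop items with no text, an unknown answer key, or an unknown difficulty label.
        if not raw.get("text") or raw.get("correct_answer") not in options:
            continue
        if raw.get("difficulty") not in SEED_B:
            continue
        items.append({
            "text": raw["text"],
            "answer_options": options,
            "correct_answer": raw["correct_answer"],
            "domain": domain,
            "criticality_weight": criticality_weight,
            "irt_b": SEED_B[raw["difficulty"]],
            "calibration_status": "uncalibrated",
            "response_count": 0,
        })
    return items
```

Rejected items are silently dropped in this sketch; a real pipeline would more likely route them to the review queue with a reason attached.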
Candidate Portal: Screen Flow
Login
-> Dashboard (Day X of 60, Today's Session, Theta Curve, Upcoming Schedule)
    -> Today's Session (varies by phase and SM-2 schedule)
        -> Adaptive Quiz (IRT-driven item selection)
        -> Oral Board Simulation (voice-AI)
        -> Mock Exam (full-length, timed)
    -> My Progress
        -> Theta Curve over time
        -> Domain Heat Map (green/yellow/red per competency area)
        -> Missed Items Review
        -> Predicted Score Range
    -> E-Learning (if flagged)
        -> Module catalog
        -> Booking / access link
Admin / HR Dashboard: Key Views
Force Readiness Heat Map - anonymized unit-level theta distribution. Green = ready, Yellow = approaching threshold, Red = needs intervention. Refreshes after every candidate session.
Policy Blind Spot Report - which items have the highest error rate across all candidates. Tells the Chief which General Orders the whole force is misreading. Exportable as PDF for training bulletin.
Bench Strength Summary - for each promotional rank, how many candidates are projected above the passing threshold as of today. Trend line over the 60-day cycle.
Individual Candidate Exports - for HR to attach to a promotional file. Shows θ trajectory, domains mastered, simulation performance, and predicted score range. Formatted as a professional report, not raw data.
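The Policy Blind Spot Report is, at its core, an error-rate ranking over Response rows. A minimal sketch, assuming responses arrive as (item_id, is_correct) pairs; the min_responses floor and top_n cutoff are hypothetical parameters added so that rarely served items do not dominate the list.

```python
from collections import defaultdict


def blind_spots(responses, min_responses=5, top_n=10):
    """Rank items by error rate; return (item_id, error_rate, response_count) tuples."""
    totals = defaultdict(lambda: [0, 0])  # item_id -> [wrong, total]
    for item_id, is_correct in responses:
        totals[item_id][1] += 1
        if not is_correct:
            totals[item_id][0] += 1
    rates = [
        (item_id, wrong / total, total)
        for item_id, (wrong, total) in totals.items()
        if total >= min_responses
    ]
    return sorted(rates, key=lambda row: row[1], reverse=True)[:top_n]
```

In production this would more likely be a SQL aggregate scoped to a single agency_id; the Python version only shows the logic.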
Voice-AI Oral Board Simulation: Architecture
Candidate speaks (or types)
-> Whisper (OpenAI STT) transcribes
-> Claude API receives:
    - Scenario prompt (e.g., "You are a newly promoted Sergeant. Officer Jones has filed a grievance against you for an assignment decision. How do you respond?")
    - Conversation history
    - Agency-specific policy context (retrieved from vector DB)
-> Claude generates evaluator response + internal scoring
-> ElevenLabs TTS reads the response aloud (or text shown)
-> After session ends:
    - Decision Latency (response time per turn) is logged
    - Content accuracy is scored against policy citations
    - Behavioral consistency across scenarios is flagged
    - Transcript saved to candidate profile
Key engineering constraint: The voice loop needs to feel responsive. Target latency from candidate speech end to AI speech start: under 2 seconds. This requires streaming responses from Claude and parallel TTS generation.
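One way to approach the latency target is to cut the streamed text into sentences and start synthesis for each sentence as soon as it completes, instead of waiting for the full reply. The sketch below shows only that pattern with asyncio. The token_stream and synthesize arguments are placeholders for the real streaming and TTS clients, and no vendor API is called.

```python
import asyncio
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences(token_stream):
    """Group a stream of text fragments into complete sentences."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()


async def speak(token_stream, synthesize):
    """Start TTS per sentence while text is still streaming; return audio in order."""
    tasks = []
    async for sentence in sentences(token_stream):
        tasks.append(asyncio.create_task(synthesize(sentence)))
    return [await task for task in tasks]
```

Playback of the first sentence can begin while later sentences are still being generated and synthesized, which is where most of the perceived latency is saved.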
Security and Compliance Requirements
| Requirement | Implementation |
|---|---|
| Data isolation per agency | Row-level security in PostgreSQL (agency_id on all tables); separate S3 buckets per agency for SOP documents |
| No cross-agency training | IRT calibration runs per agency item bank, never pooled |
| Candidate anonymization | Admin dashboard never shows candidate name or badge number - anonymized_id only |
| SOC 2 Type II path | AWS infrastructure with CloudTrail logging; plan for Year 1 audit |
| Data retention policy | Candidate response data retained for 7 years (standard civil service audit window); configurable per agency |
| Candidate data portability | Candidates can export their own full profile as PDF at any time |
Iteration Plan: What Changes as We Learn from Beta Users
The following components are deliberately designed to be easy to change based on what we hear from the first agency pilot:
| Component | What might change | How to keep it flexible |
|---|---|---|
| Session length | 15-minute micro-sessions may be too long or too short for shift workers | Session cap is a configurable parameter per agency, not hardcoded |
| Oral board scenarios | Agency culture varies - SDPD scenarios may not fit a small rural department | Scenario library is seeded by agency at onboarding; agencies can add custom scenarios |
| Criticality weights | Our domain weights are assumptions; an agency’s Training Officer may disagree | Criticality weights are editable by the agency admin post-ingestion |
| E-learning trigger threshold | The theta cutoff that triggers an e-learning upsell needs beta testing to calibrate | Threshold is a configurable setting per agency per domain |
| Pricing tiers | Bronze/Silver/Gold may not map to how agencies actually budget | Track how agencies describe their constraints; adjust bundling accordingly |
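The "configurable, not hardcoded" settings in the table could live in one per-agency settings object. A minimal sketch follows; the class name, field names, and the default threshold of -0.5 are placeholders, since the actual cutoff is meant to come out of beta testing.

```python
from dataclasses import dataclass, field


@dataclass
class AgencySettings:
    """Per-agency tunables; every default here is a placeholder assumption."""
    session_cap_minutes: int = 15
    default_theta_threshold: float = -0.5
    elearning_theta_threshold: dict[str, float] = field(default_factory=dict)
    criticality_overrides: dict[str, int] = field(default_factory=dict)

    def threshold_for(self, domain: str) -> float:
        """Per-domain e-learning threshold, falling back to the agency default."""
        return self.elearning_theta_threshold.get(domain, self.default_theta_threshold)

    def should_trigger_elearning(self, domain: str, theta: float) -> bool:
        """True when the candidate's theta is below the domain threshold."""
        return theta < self.threshold_for(domain)

    def criticality_for(self, domain: str, ingested_weight: int) -> int:
        """Agency admin override wins over the weight assigned at ingestion."""
        return self.criticality_overrides.get(domain, ingested_weight)
```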
Build vs. Buy Decisions
| Component | Build or Buy | Rationale |
|---|---|---|
| IRT adaptive engine | Build (using catsim + custom logic) | Core IP - this is what differentiates Merit Engine from every competitor |
| Document parsing | Buy (Claude API) | Commodity capability; not worth building from scratch |
| Voice AI | Buy (ElevenLabs + Whisper) | Commodity capability; focus dev effort on IRT and analytics |
| Spaced repetition | Build (SM-2 implementation is ~50 lines of code) | Simple enough to own; dependency risk not worth it |
| Auth / user management | Buy (Supabase or Auth0) | Security-critical; not a differentiator |
| Analytics dashboard | Build (React + Recharts) | The specific visualizations (theta curve, heat map) require custom implementation |
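To back up the size estimate for spaced repetition, here is a minimal sketch of one common formulation of the SM-2 update. Field names follow the SpacedRepetitionSchedule table. The 0-5 quality grade is part of SM-2, but how Merit Engine would derive it from is_correct and response_time_ms is not defined in this plan and is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    repetition_count: int = 0
    ease_factor: float = 2.5
    interval_days: int = 0


def sm2_update(s: Schedule, quality: int) -> Schedule:
    """Apply one SM-2 review with a 0-5 quality grade and return the new state."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        # Successful recall: 1 day, then 6 days, then previous interval * ease.
        if s.repetition_count == 0:
            interval = 1
        elif s.repetition_count == 1:
            interval = 6
        else:
            interval = round(s.interval_days * s.ease_factor)
        reps = s.repetition_count + 1
    else:
        # Failed recall: restart the repetition sequence.
        interval, reps = 1, 0
    ease = s.ease_factor + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return Schedule(repetition_count=reps, ease_factor=max(1.3, ease), interval_days=interval)
```

This variant adjusts the ease factor on every review, including failed ones; the original SuperMemo description leaves it unchanged after a failed recall. next_review_date is then the review date plus interval_days.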
First Sprint: 2-Week MVP Scope
To have something real to show a pilot agency in 2 weeks, build only this:
- SOP upload + ingestion - agency uploads one PDF chapter, system parses and generates 20 items, items appear in a review queue
- Static adaptive quiz - candidate answers 20 items, system uses 1PL Rasch scoring with a pre-seeded b-parameter per item, outputs a theta estimate and a domain heat map
- Candidate profile page - shows current theta, domain scores (color coded), and a recommended “focus area” for next session
- One admin view - shows all candidates enrolled, their current theta, and flagged low-theta domains
That is the minimum viable proof of concept. Everything else - voice simulation, spaced repetition, oral boards, the 60-day scheduler - is Phase 2.
Open Engineering Questions (Needs Decision Before Build)
1PL vs. 3PL at launch: Start with Rasch for speed of calibration, or invest in 3PL upfront to lead with the strongest pitch? Recommendation: start with 1PL, market it as “IRT-validated” (accurate), upgrade transparently as data accumulates.
On-premise vs. cloud deployment: Do pilot agencies require on-premise? This changes the architecture significantly. Needs to be answered in discovery calls before any build begins.
Who generates the initial item bank: FSP staff manually review AI-generated items before they go live, or auto-publish with a post-hoc quality flag? Recommendation: manual review for the pilot; scale with confidence scores later.
How does the 60-day scheduler handle a candidate who misses days? Does it compress, extend, or recalibrate? Needs a defined rule before build.
Single sign-on with agency systems: Some departments will want Merit Engine to authenticate against their existing Active Directory. Is that required for the pilot, or can we use standalone auth? Needs discovery call answer.
Document Version 1.0 - June 29, 2026
Fairlawn Strategy Partners, LLC, an affiliate of the Institute for Transformative Change - Engineering Confidential