Merit Engine - Software Build Plan

Published

June 29, 2026

internal confidential

Engineering Specification - Version 1.0

Engineer: Claude (Sonnet 4.6), acting as Lead Software Engineer for FSP
Client: Fairlawn Strategy Partners / Tonya R. Dawson
Date: June 29, 2026

Overview

This document translates the Merit Engine concept into a concrete software engineering plan. The goal is to build an IRT-adaptive tutoring platform that can be piloted with one public safety agency within 90-120 days of development start.

The stack is chosen for:

- Rapid iteration - we will learn from beta users and adjust
- Minimal ops burden - small team, no full-time DevOps needed
- Psychometric correctness - the IRT math must be right; this is a validated professional tool

Recommended Tech Stack

Core Application

Layer	Technology	Why
Frontend (Candidate Portal)	React + TypeScript	Component-based, strong ecosystem, easy to hire for
Frontend (Admin Dashboard)	React + Recharts / D3.js	Data visualization for theta curves and heat maps
Backend API	Python (FastAPI)	Native Python data science ecosystem; IRT libraries available
Database	PostgreSQL	Relational structure suits item banks, user responses, session logs
Auth	Auth0 or Supabase Auth	SOC 2 compliant, handles multi-tenant (per-agency) isolation
File Storage	AWS S3 or Azure Blob	SOP document ingestion and storage
Hosting	AWS or Azure	Required for SOC 2 compliance path

IRT Engine

Component	Technology	Notes
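IRT item parameter estimation	Python `catsim` library or custom implementation	catsim is purpose-built for computerized adaptive testing (item selection, ability estimation, stopping rules); item parameters themselves must be calibrated with a separate tool or custom code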
Spaced repetition scheduler	SM-2 algorithm (SuperMemo) - open source	Battle-tested; Anki's classic scheduler is a modified SM-2
Document parsing / NLP	LangChain + Claude API (Anthropic)	For SOP ingestion, question generation, and summarization
Question generation	Claude API with structured prompts	Generates IRT-calibratable items from ingested SOP text
Voice AI simulation	ElevenLabs (TTS) + Whisper (STT) + Claude API	Oral board role-play pipeline
Embedding / semantic search	OpenAI text-embedding-3-small or Voyage AI embeddings (Anthropic does not offer its own embedding model)	For policy retrieval in conversational mode
Vector database	Pinecone or pgvector (Postgres extension)	Stores SOP embeddings for retrieval

Data Model (Core Tables)

Agency
  - id
  - name
  - state
  - sop_document_ids[]
  - item_bank_id (sovereign per agency)

ItemBank
  - id
  - agency_id
  - items[]

Item (Question)
  - id
  - item_bank_id
  - text
  - answer_options (JSON)
  - correct_answer
  - domain (e.g., "Use of Force", "Search and Seizure", "Budget")
  - criticality_weight (1-10, set at ingestion)
  - irt_a (discrimination parameter - 2PL/3PL)
  - irt_b (difficulty parameter)
  - irt_c (pseudo-guessing parameter - 3PL only)
  - calibration_status (uncalibrated | provisional | calibrated)
  - response_count (how many officer responses used to estimate params)

Candidate
  - id
  - agency_id
  - anonymized_id (for admin dashboard - never PII to command)
  - current_theta (float, updates after every session)
  - theta_history (JSON array of {timestamp, theta})
  - enrollment_date
  - exam_date

Session
  - id
  - candidate_id
  - session_type (diagnostic | mastery | simulation | mock_exam)
  - start_time
  - end_time
  - phase (1 | 2 | 3)

Response
  - id
  - session_id
  - item_id
  - answer_given
  - is_correct
  - response_time_ms (for decision latency tracking)
  - theta_before
  - theta_after

SpacedRepetitionSchedule
  - id
  - candidate_id
  - item_id
  - next_review_date
  - interval_days
  - ease_factor (SM-2 parameter)
  - repetition_count

ElearningTrigger
  - id
  - candidate_id
  - domain
  - theta_at_trigger
  - stage_reached           (1=targeted_practice | 2=source_review | 3=elearning_invite)
  - trigger_date
  - stage2_source_shown     (bool - was verified source displayed at Stage 2?)
  - module_invited          (bool)
  - module_enrolled         (bool)
  - module_completed        (bool)
  - not_now_count           (int - how many times candidate dismissed Stage 3 invite)

SourceVerificationLog
  - id
  - item_id
  - check_timestamp
  - check_type              (AUTOMATED | AI_CHECK | HUMAN_REVIEW | SCHEDULED)
  - result                  (PASS | FAIL | FLAGGED)
  - failure_reason          (null if pass; description if fail)
  - reviewed_by             (null if automated; FSP user id if human)

Item table additions (append to existing Item record):

  - source_document         (e.g., "Alabama Code Title 13A")
  - source_section          (e.g., "Section 13A-6-132")
  - source_text_excerpt     (the specific paragraph the item tests)
  - source_url              (if document is accessible online or in agency portal)
  - source_content_hash     (hash of source_text_excerpt at last verification)
  - source_verification_status  (UNVERIFIED | VERIFIED | SUSPENDED | ARCHIVED)
  - source_verified_date        (timestamp of last successful verification)
  - source_verification_notes   (human reviewer comments if manually reviewed)
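
The source_content_hash field implies an automated check that re-hashes the current source paragraph and compares it with the stored value. A minimal sketch of that check follows. It assumes SHA-256 over whitespace-normalized text; the function names, the normalization rule, and the choice of FLAGGED for a changed source are illustrative assumptions, not decisions made in this plan.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(excerpt: str) -> str:
    """SHA-256 of the excerpt after collapsing whitespace (normalization is an assumption)."""
    normalized = " ".join(excerpt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def verify_source(item: dict, current_excerpt: str) -> dict:
    """Build a SourceVerificationLog-style record for one automated check."""
    unchanged = content_hash(current_excerpt) == item["source_content_hash"]
    return {
        "item_id": item["id"],
        "check_timestamp": datetime.now(timezone.utc).isoformat(),
        "check_type": "AUTOMATED",
        # Assumption: a changed source is FLAGGED for human review rather than failed outright.
        "result": "PASS" if unchanged else "FLAGGED",
        "failure_reason": None if unchanged else "source text changed since last verification",
        "reviewed_by": None,
    }
```

Normalizing whitespace before hashing keeps a re-exported PDF with different line wrapping from flagging every item; whether that is the right tolerance is a question for the pilot.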

IRT Implementation Plan

Phase 1: Rasch Model (1PL) - Start here

The simplest IRT model. Every item has only a difficulty parameter (b). Discrimination is assumed equal across all items.

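Why start here:

- Item calibration requires less data (can work with 100-200 responses vs. 400+ for 3PL)
- The math is simpler to validate and explain to clients
- Still a substantial improvement over classical test theory (CTT), because ability estimates account for the difficulty of the items each candidate saw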

The adaptive algorithm:

1. Start candidate at θ = 0 (average ability)
2. Serve item with b closest to current θ estimate
3. Score response; update θ using maximum likelihood estimation (MLE) or expected a posteriori (EAP)
4. Serve next item closest to updated θ
5. Repeat until standard error of θ falls below threshold (typically SE < 0.30) or item count reaches session cap
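
The adaptive loop above can be sketched in plain Python. This version uses EAP with a standard normal prior on a fixed θ grid, which avoids the undefined MLE when every response so far is correct or incorrect. The grid range, the function names, and the `answer` callback are assumptions for illustration; a production engine would more likely lean on catsim or a vetted custom implementation.

```python
import math

GRID = [g / 10 for g in range(-40, 41)]  # theta grid from -4.0 to 4.0 (assumed range)


def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap_estimate(responses: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return (theta, standard error) by EAP with a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for b, correct in responses:
            p = p_correct(t, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_session(bank: dict[str, float], answer, se_target: float = 0.30, cap: int = 20):
    """bank maps item id -> b; answer(item_id) returns True if the candidate is correct."""
    theta, se = 0.0, float("inf")
    responses: list[tuple[float, bool]] = []
    remaining = dict(bank)
    while remaining and len(responses) < cap and se >= se_target:
        # Step 2/4: serve the unseen item whose difficulty is closest to the current theta.
        item_id = min(remaining, key=lambda i: abs(remaining[i] - theta))
        b = remaining.pop(item_id)
        responses.append((b, answer(item_id)))
        # Step 3: re-estimate theta and its standard error after every response.
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)
```

One consequence worth checking before fixing session lengths: a Rasch item contributes at most 0.25 units of information, so reaching SE < 0.30 takes roughly 40 or more well-targeted items. A 20-item session will usually stop at the item cap rather than at the SE threshold.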

Phase 2: 2-Parameter Logistic (2PL) - After first 500 item responses

Adds discrimination parameter (a) - how sharply an item separates high-ability from low-ability candidates.

Migration path: Once real response data is available from the pilot agency, run a calibration analysis (for example with the R `mirt` package or a Python IRT library) to estimate both a and b parameters per item. Update the database. The adaptive engine uses the same logic; it just queries more parameters.
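
One detail changes once discrimination varies: the item with b closest to θ is no longer guaranteed to be the most informative one. A common selection rule under 2PL is maximum Fisher information, which reduces to "closest b" when all a values are equal. The sketch below shows that rule; the function names and the sample parameters are hypothetical, and this is one reasonable option rather than a rule this plan has committed to.

```python
import math


def p_2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_item(theta: float, items: dict[str, tuple[float, float]]) -> str:
    """items maps item id -> (a, b); returns the most informative item at theta."""
    return max(items, key=lambda i: information(theta, *items[i]))


# Hypothetical parameters: a weakly discriminating item centered on theta,
# and a sharply discriminating item half a logit away.
items = {"near_but_weak": (0.5, 0.0), "off_but_sharp": (2.0, 0.5)}
choice = pick_item(0.0, items)  # selects "off_but_sharp"
```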

Phase 3: 3-Parameter Logistic (3PL) - After first 2,000 item responses

Adds pseudo-guessing parameter (c). Most important for multiple-choice where lucky guessing inflates scores.

This is the “guessing filter” you pitch to Chiefs - the Merit Engine distinguishes true mastery from probability.
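
For reference, the three models are one formula with parameters switched on in stages. The sketch below is the standard 3PL response function; setting c = 0 gives 2PL, and additionally fixing a = 1 gives the Rasch form used in Phase 1. The function name and default values are illustrative.

```python
import math


def p_3pl(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """3PL probability of a correct response.

    c is the lower asymptote: the chance that a very low-ability candidate
    still answers correctly by guessing.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# A four-option item with c = 0.25: a candidate far below the item's difficulty
# still succeeds about a quarter of the time, and the model accounts for that.
floor = p_3pl(theta=-6.0, a=1.0, b=0.0, c=0.25)
```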

Document Ingestion Pipeline

This is how agency SOPs become quiz items:

1. Agency uploads SOP document (PDF, Word, or URL to their policy portal)
2. Pipeline splits document into sections by policy number / chapter
3. Each section is tagged with:
   - Domain classification (NLP classifier: Use of Force, Search and Seizure, etc.)
   - Criticality weight (rules-based on domain + keyword signals)
4. Claude API generates N candidate quiz items per section using structured prompt:
   - "Generate 3 multiple-choice questions at difficulty levels: easy, medium, hard"
   - "Each question must be answerable from the text provided, no inference required"
   - "Make every incorrect answer option a plausible distractor"
5. Items enter the item bank with calibration_status = 'uncalibrated'
6. Uncalibrated items are served with a flag - their b parameter is estimated from
   content difficulty signals (section heading, sentence complexity score)
   until real response data calibrates them properly
7. After 200+ responses per item, run calibration and update irt_b
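
Steps 4 and 5 need a gate between the model's raw output and the item bank. The sketch below assumes the structured prompt asks for a JSON array of objects with the keys shown; that schema, the function name, and the provisional b values per difficulty label are all assumptions for illustration. The plan itself derives provisional b from content signals such as section heading and sentence complexity, which this sketch does not attempt.

```python
import json

REQUIRED_KEYS = {"text", "answer_options", "correct_answer", "difficulty"}
# Assumed placeholder b values per requested difficulty label.
PROVISIONAL_B = {"easy": -1.0, "medium": 0.0, "hard": 1.0}


def parse_generated_items(raw: str, item_bank_id: int, domain: str) -> list[dict]:
    """Keep only well-formed generated items and mark them uncalibrated."""
    accepted = []
    for cand in json.loads(raw):
        if not isinstance(cand, dict) or not REQUIRED_KEYS <= cand.keys():
            continue  # not an object, or a required field is missing
        options = cand["answer_options"]
        if not isinstance(options, list) or len(set(options)) != len(options):
            continue  # options must be a list of distinct strings
        if cand["correct_answer"] not in options:
            continue  # answer key does not match any option
        if cand["difficulty"] not in PROVISIONAL_B:
            continue  # unknown difficulty label
        accepted.append({
            "item_bank_id": item_bank_id,
            "text": cand["text"],
            "answer_options": options,
            "correct_answer": cand["correct_answer"],
            "domain": domain,
            "irt_b": PROVISIONAL_B[cand["difficulty"]],
            "calibration_status": "uncalibrated",
            "response_count": 0,
        })
    return accepted
```

Rejected candidates are silently dropped here; in the pilot it would probably be more useful to route them to the review queue with the reason attached.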

Candidate Portal: Screen Flow

Login
  -> Dashboard (Day X of 60, Today's Session, Theta Curve, Upcoming Schedule)
  -> Today's Session (varies by phase and SM-2 schedule)
      -> Adaptive Quiz (IRT-driven item selection)
      -> Oral Board Simulation (voice-AI)
      -> Mock Exam (full-length, timed)
  -> My Progress
      -> Theta Curve over time
      -> Domain Heat Map (green/yellow/red per competency area)
      -> Missed Items Review
      -> Predicted Score Range
  -> E-Learning (if flagged)
      -> Module catalog
      -> Booking / access link
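
The plan does not define how the "Predicted Score Range" shown under My Progress is computed. One plausible approach, sketched here as an assumption, maps θ ± z·SE through the test characteristic curve of the target exam under the Rasch model. The function names, the 95% band (z = 1.96), and the use of percent correct are illustrative choices.

```python
import math


def expected_score(theta: float, difficulties: list[float]) -> float:
    """Expected number correct under the Rasch model (test characteristic curve)."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def predicted_score_range(
    theta: float, se: float, difficulties: list[float], z: float = 1.96
) -> tuple[int, int]:
    """Map theta +/- z*SE to a percent-correct band on the given exam."""
    n = len(difficulties)
    low = 100.0 * expected_score(theta - z * se, difficulties) / n
    high = 100.0 * expected_score(theta + z * se, difficulties) / n
    return round(low), round(high)
```

The band narrows as SE shrinks, so a candidate early in the cycle would see a wide range that tightens with each session.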

Admin / HR Dashboard: Key Views

Force Readiness Heat Map - anonymized unit-level theta distribution. Green = ready, Yellow = approaching threshold, Red = needs intervention. Refreshes after every candidate session.
Policy Blind Spot Report - which items have the highest error rate across all candidates. Tells the Chief which General Orders the whole force is misreading. Exportable as PDF for training bulletin.
Bench Strength Summary - for each promotional rank, how many candidates are projected above the passing threshold as of today. Trend line over the 60-day cycle.
Individual Candidate Exports - for HR to attach to a promotional file. Shows θ trajectory, domains mastered, simulation performance, and predicted score range. Formatted as a professional report, not raw data.
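
The Policy Blind Spot Report is an aggregation over the Response table. A minimal sketch, assuming rows are available as dictionaries with item_id and is_correct; the `min_responses` floor and `top_n` cutoff are hypothetical parameters added so that an item answered once does not top the list.

```python
from collections import defaultdict


def blind_spot_report(
    responses: list[dict], min_responses: int = 5, top_n: int = 10
) -> list[tuple[str, float, int]]:
    """Rank items by error rate; returns (item_id, error_rate, response_count) rows."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for r in responses:
        totals[r["item_id"]] += 1
        if not r["is_correct"]:
            errors[r["item_id"]] += 1
    rates = [
        (item_id, errors[item_id] / n, n)
        for item_id, n in totals.items()
        if n >= min_responses  # skip items with too little data to be meaningful
    ]
    return sorted(rates, key=lambda row: row[1], reverse=True)[:top_n]
```

In production this would more likely be a SQL GROUP BY scoped to one agency_id; the Python form is only meant to pin down the definition of "highest error rate".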

Voice-AI Oral Board Simulation: Architecture

Candidate speaks (or types)
  -> Whisper (OpenAI STT) transcribes
  -> Claude API receives:
       - Scenario prompt (e.g., "You are a newly promoted Sergeant. 
         Officer Jones has filed a grievance against you for an
         assignment decision. How do you respond?")
       - Conversation history
       - Agency-specific policy context (retrieved from vector DB)
  -> Claude generates evaluator response + internal scoring
  -> ElevenLabs TTS reads the response aloud (or text shown)
  -> After session ends:
       - Decision Latency (response time per turn) is logged
       - Content accuracy is scored against policy citations
       - Behavioral consistency across scenarios is flagged
       - Transcript saved to candidate profile

Key engineering constraint: The voice loop needs to feel responsive. Target latency from candidate speech end to AI speech start: under 2 seconds. This requires streaming responses from Claude and parallel TTS generation.
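
The streaming-plus-parallel-TTS idea can be sketched with asyncio. The code below groups streamed text fragments into sentences and starts synthesis for each sentence as soon as it completes, while later text is still arriving; audio is then played in order. The `synthesize` and `play` callables are hypothetical stand-ins for the ElevenLabs call and the audio output, and the token stream stands in for a streamed Claude response. No vendor SDK calls are shown.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed text fragments into sentences as soon as each one ends.

    Simplification: any '.', '?' or '!' ends a sentence, so abbreviations
    such as "Sgt." would split early.
    """
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


async def speak_streaming(
    tokens: AsyncIterator[str],
    synthesize: Callable[[str], Awaitable[bytes]],
    play: Callable[[bytes], Awaitable[None]],
) -> None:
    """Start TTS for each sentence immediately; play the results in order."""
    queue: asyncio.Queue = asyncio.Queue()

    async def producer() -> None:
        async for sentence in sentences(tokens):
            # create_task starts synthesis now, while later text is still streaming.
            await queue.put(asyncio.create_task(synthesize(sentence)))
        await queue.put(None)  # end-of-stream marker

    async def consumer() -> None:
        while (task := await queue.get()) is not None:
            await play(await task)

    await asyncio.gather(producer(), consumer())
```

With this shape, time to first audio is roughly the time to the first complete sentence plus one TTS call, rather than the time to the full response. Whether that lands under 2 seconds depends on vendor latencies that need to be measured.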

Security and Compliance Requirements

Requirement	Implementation
Data isolation per agency	Row-level security in PostgreSQL (agency_id on all tables); separate S3 buckets per agency for SOP documents
No cross-agency training	IRT calibration runs per agency item bank, never pooled
Candidate anonymization	Admin dashboard never shows candidate name or badge number - anonymized_id only
SOC 2 Type II path	AWS infrastructure with CloudTrail logging; plan for Year 1 audit
Data retention policy	Candidate response data retained for 7 years (standard civil service audit window); configurable per agency
Candidate data portability	Candidates can export their own full profile as PDF at any time
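
The plan does not say how the anonymized_id in the Candidate table is generated. One option, shown here purely as an assumption, is a keyed hash of the internal candidate id with a per-agency secret. The pseudonym is then stable across sessions but cannot be reversed or linked across agencies without the key. The function name, prefix, and truncation length are illustrative.

```python
import hashlib
import hmac


def anonymized_id(candidate_id: str, agency_secret: bytes) -> str:
    """Stable pseudonym: HMAC-SHA256 of the internal id, truncated for display."""
    digest = hmac.new(agency_secret, candidate_id.encode("utf-8"), hashlib.sha256)
    return "C-" + digest.hexdigest()[:12]
```

A random identifier stored at enrollment would work equally well and avoids key management; the keyed hash is mainly useful if the pseudonym ever has to be regenerated from the internal id.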

Iteration Plan: What Changes as We Learn from Beta Users

The following components are deliberately designed to be easy to change based on what we hear from the first agency pilot:

Component	What might change	How to keep it flexible
Session length	15-minute micro-sessions may be too long or too short for shift workers	Session cap is a configurable parameter per agency, not hardcoded
Oral board scenarios	Agency culture varies - SDPD scenarios may not fit a small rural department	Scenario library is seeded by agency at onboarding; agencies can add custom scenarios
Criticality weights	Our domain weights are assumptions; an agency’s Training Officer may disagree	Criticality weights are editable by the agency admin post-ingestion
E-learning trigger threshold	The theta cutoff that triggers an e-learning upsell needs beta testing to calibrate	Threshold is a configurable setting per agency per domain
Pricing tiers	Bronze/Silver/Gold may not map to how agencies actually budget	Track how agencies describe their constraints; adjust bundling accordingly
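
The "configurable, not hardcoded" items in the table above could live in one per-agency settings object. A minimal sketch follows; the class name, field names, and default values are hypothetical placeholders (the 15-minute session cap comes from the table, the default theta threshold does not come from this plan and must be calibrated in beta).

```python
from dataclasses import dataclass, field


@dataclass
class AgencySettings:
    """Per-agency tunables that the pilot is expected to change."""
    session_cap_minutes: int = 15
    # Per-domain theta cutoffs for the e-learning trigger.
    elearning_theta_threshold: dict[str, float] = field(default_factory=dict)
    # Placeholder fallback used when a domain has no explicit cutoff.
    default_theta_threshold: float = -0.5
    # Agency admin overrides of ingestion-time criticality weights (1-10).
    criticality_overrides: dict[str, int] = field(default_factory=dict)

    def threshold_for(self, domain: str) -> float:
        return self.elearning_theta_threshold.get(domain, self.default_theta_threshold)

    def should_trigger_elearning(self, domain: str, theta: float) -> bool:
        return theta < self.threshold_for(domain)


settings = AgencySettings(elearning_theta_threshold={"Use of Force": 0.0})
```

Stored as one JSON column per agency, this keeps every beta-driven adjustment a data change rather than a deploy.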

Build vs. Buy Decisions

Component	Build or Buy	Rationale
IRT adaptive engine	Build (using catsim + custom logic)	Core IP - this is what differentiates Merit Engine from every competitor
Document parsing	Buy (Claude API)	Commodity capability; not worth building from scratch
Voice AI	Buy (ElevenLabs + Whisper)	Commodity capability; focus dev effort on IRT and analytics
Spaced repetition	Build (SM-2 implementation is ~50 lines of code)	Simple enough to own; dependency risk not worth it
Auth / user management	Buy (Supabase or Auth0)	Security-critical; not a differentiator
Analytics dashboard	Build (React + Recharts)	The specific visualizations (theta curve, heat map) require custom implementation
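
To back the "~50 lines" estimate for spaced repetition, here is a sketch of the SM-2 update in its commonly published form, using the field names from the SpacedRepetitionSchedule table. The `quality` argument is the 0-5 recall grade that SM-2 expects; how Merit Engine derives that grade from is_correct and response_time_ms is an open design choice not covered here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Schedule:
    interval_days: int = 0
    ease_factor: float = 2.5
    repetition_count: int = 0
    next_review_date: Optional[date] = None


def review(s: Schedule, quality: int, today: date) -> Schedule:
    """Apply one SM-2 update and return the new schedule."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        # Successful recall: 1 day, then 6 days, then grow by the ease factor.
        if s.repetition_count == 0:
            interval = 1
        elif s.repetition_count == 1:
            interval = 6
        else:
            interval = round(s.interval_days * s.ease_factor)
        reps = s.repetition_count + 1
    else:
        # Failed recall: restart the repetition sequence.
        interval, reps = 1, 0
    delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    ease = max(1.3, s.ease_factor + delta)  # SM-2 floors the ease factor at 1.3
    return Schedule(interval, ease, reps, today + timedelta(days=interval))
```

Three perfect recalls in a row produce intervals of 1, 6, and 16 days; a failed recall resets the interval to 1 day and lowers the ease factor.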

First Sprint: 2-Week MVP Scope

To have something real to show a pilot agency in 2 weeks, build only this:

SOP upload + ingestion - agency uploads one PDF chapter, system parses and generates 20 items, items appear in a review queue
Static adaptive quiz - candidate answers 20 items, system uses 1PL Rasch scoring with a pre-seeded b-parameter per item, outputs a theta estimate and a domain heat map
Candidate profile page - shows current theta, domain scores (color coded), and a recommended “focus area” for next session
One admin view - shows all candidates enrolled, their current theta, and flagged low-theta domains

That is the minimum viable proof of concept. Everything else - voice simulation, spaced repetition, oral boards, the 60-day scheduler - is Phase 2.

Open Engineering Questions (Needs Decision Before Build)

1PL vs. 3PL at launch: Start with Rasch for speed of calibration, or invest in 3PL upfront to lead with the strongest pitch? Recommendation: start with 1PL, market it as “IRT-validated” (accurate), upgrade transparently as data accumulates.
On-premise vs. cloud deployment: Do pilot agencies require on-premise? This changes the architecture significantly. Needs to be answered in discovery calls before any build begins.
Who generates the initial item bank: FSP staff manually review AI-generated items before they go live, or auto-publish with a post-hoc quality flag? Recommendation: manual review for the pilot; scale with confidence scores later.
How does the 60-day scheduler handle a candidate who misses days? Does it compress, extend, or recalibrate? Needs a defined rule before build.
Single sign-on with agency systems: Some departments will want Merit Engine to authenticate against their existing Active Directory. Is that required for the pilot, or can we use standalone auth? Needs discovery call answer.

Document Version 1.0 - June 29, 2026
Fairlawn Strategy Partners, LLC, an affiliate of the Institute for Transformative Change - Engineering Confidential