Merit Engine - Software Build Plan

Published

June 29, 2026

internal confidential

Engineering Specification - Version 1.0

Engineer: Claude (Sonnet 4.6) acting as Lead Software Engineer for FSP
Client: Fairlawn Strategy Partners / Tonya R. Dawson
Date: June 29, 2026


Overview

This document translates the Merit Engine concept into a concrete software engineering plan. The goal is to build an IRT-adaptive tutoring platform that can be piloted with one public safety agency within 90-120 days of development start.

The stack is chosen for:
  - Rapid iteration - we will learn from beta users and adjust
  - Minimal ops burden - small team, no full-time DevOps needed
  - Psychometric correctness - the IRT math must be right; this is a validated professional tool


Data Model (Core Tables)

Agency
  - id
  - name
  - state
  - sop_document_ids[]
  - item_bank_id (sovereign per agency)

ItemBank
  - id
  - agency_id
  - items[]

Item (Question)
  - id
  - item_bank_id
  - text
  - answer_options (JSON)
  - correct_answer
  - domain (e.g., "Use of Force", "Search and Seizure", "Budget")
  - criticality_weight (1-10, set at ingestion)
  - irt_a (discrimination parameter - 2PL/3PL)
  - irt_b (difficulty parameter)
  - irt_c (pseudo-guessing parameter - 3PL only)
  - calibration_status (uncalibrated | provisional | calibrated)
  - response_count (how many officer responses used to estimate params)

Candidate
  - id
  - agency_id
  - anonymized_id (for admin dashboard - never PII to command)
  - current_theta (float, updates after every session)
  - theta_history (JSON array of {timestamp, theta})
  - enrollment_date
  - exam_date

Session
  - id
  - candidate_id
  - session_type (diagnostic | mastery | simulation | mock_exam)
  - start_time
  - end_time
  - phase (1 | 2 | 3)

Response
  - id
  - session_id
  - item_id
  - answer_given
  - is_correct
  - response_time_ms (for decision latency tracking)
  - theta_before
  - theta_after

SpacedRepetitionSchedule
  - id
  - candidate_id
  - item_id
  - next_review_date
  - interval_days
  - ease_factor (SM-2 parameter)
  - repetition_count

ElearningTrigger
  - id
  - candidate_id
  - domain
  - theta_at_trigger
  - stage_reached           (1=targeted_practice | 2=source_review | 3=elearning_invite)
  - trigger_date
  - stage2_source_shown     (bool - was verified source displayed at Stage 2?)
  - module_invited          (bool)
  - module_enrolled         (bool)
  - module_completed        (bool)
  - not_now_count           (int - how many times candidate dismissed Stage 3 invite)

SourceVerificationLog
  - id
  - item_id
  - check_timestamp
  - check_type              (AUTOMATED | AI_CHECK | HUMAN_REVIEW | SCHEDULED)
  - result                  (PASS | FAIL | FLAGGED)
  - failure_reason          (null if pass; description if fail)
  - reviewed_by             (null if automated; FSP user id if human)

Item table additions (append to existing Item record):

  - source_document         (e.g., "Alabama Code Title 13A")
  - source_section          (e.g., "Section 13A-6-132")
  - source_text_excerpt     (the specific paragraph the item tests)
  - source_url              (if document is accessible online or in agency portal)
  - source_content_hash     (hash of source_text_excerpt at last verification)
  - source_verification_status  (UNVERIFIED | VERIFIED | SUSPENDED | ARCHIVED)
  - source_verified_date        (timestamp of last successful verification)
  - source_verification_notes   (human reviewer comments if manually reviewed)
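
The source_content_hash field exists so that a change in the cited source text can be detected after an item has been verified. A minimal sketch of that check, assuming SHA-256 over whitespace-normalized text; the hash algorithm, the normalization rule, and the function names content_hash and source_changed are illustrative assumptions, not decisions made in this plan:

```python
import hashlib


def content_hash(excerpt: str) -> str:
    """Hash a source excerpt after collapsing whitespace (assumed normalization)."""
    normalized = " ".join(excerpt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def source_changed(stored_hash: str, current_excerpt: str) -> bool:
    """True when the current source text no longer matches the verified hash."""
    return content_hash(current_excerpt) != stored_hash


# Placeholder text, not a real policy excerpt.
verified = content_hash("Example  policy\ntext.")
print(source_changed(verified, "Example policy text."))           # False
print(source_changed(verified, "Example policy text, amended."))  # True
```

A scheduled check could call source_changed for each VERIFIED item and move mismatches to SUSPENDED pending review.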

IRT Implementation Plan

Phase 1: Rasch Model (1PL) - Start here

The simplest IRT model. Every item has only a difficulty parameter (b). Discrimination is assumed equal across all items.

Why start here:
  - Item calibration requires less data (can work with 100-200 responses vs. 400+ for 3PL)
  - The math is simpler to validate and explain to clients
  - Still dramatically superior to classical test theory (CTT) scoring
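
For reference, the Rasch model gives the probability of a correct response as a logistic function of the gap between ability (θ) and item difficulty (b). A minimal sketch; the function name rasch_probability is illustrative:

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """P(correct) under the Rasch (1PL) model: logistic in (theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# A candidate whose ability equals the item difficulty has a 50% chance.
print(rasch_probability(0.0, 0.0))            # 0.5
# One logit above the item difficulty raises that to about 73%.
print(round(rasch_probability(1.0, 0.0), 3))  # 0.731
```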

The adaptive algorithm:
  1. Start candidate at θ = 0 (average ability)
  2. Serve item with b closest to current θ estimate
  3. Score response; update θ using maximum likelihood estimation (MLE) or expected a posteriori (EAP)
  4. Serve next item closest to updated θ
  5. Repeat until standard error of θ falls below threshold (typically SE < 0.30) or item count reaches session cap
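
The loop above can be sketched in plain Python. This version assumes EAP scoring with a standard normal prior evaluated on a fixed θ grid; the grid, the prior, and the names eap_estimate and run_session are assumptions for illustration, and a production build would more likely lean on catsim:

```python
import math

GRID = [i / 10 for i in range(-40, 41)]  # theta grid from -4.0 to 4.0 (assumed)


def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap_estimate(responses):
    """Return (theta, se) from a list of (b, is_correct) pairs.

    Posterior = standard normal prior x Rasch likelihood, evaluated on GRID.
    """
    weights = [math.exp(-0.5 * t * t) for t in GRID]
    for b, correct in responses:
        for i, t in enumerate(GRID):
            p = p_correct(t, b)
            weights[i] *= p if correct else 1.0 - p
    total = sum(weights)
    theta = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - theta) ** 2 for w, t in zip(weights, GRID)) / total
    return theta, math.sqrt(var)


def run_session(bank, answer_fn, se_target=0.30, max_items=20):
    """bank: dict item_id -> b. answer_fn(item_id) -> bool (scored response)."""
    theta, se = 0.0, float("inf")  # step 1: start at theta = 0
    remaining = dict(bank)
    responses = []
    while remaining and len(responses) < max_items and se > se_target:
        # steps 2 and 4: serve the unused item whose b is closest to theta
        item_id = min(remaining, key=lambda k: abs(remaining[k] - theta))
        b = remaining.pop(item_id)
        responses.append((b, answer_fn(item_id)))  # step 3: score the response
        theta, se = eap_estimate(responses)        # step 3: update theta
    return theta, se, len(responses)               # step 5: stop on SE or cap


# Toy bank with difficulties from -2.0 to 2.0 and a simulated candidate who
# answers every item easier than b = 1.0 correctly.
bank = {f"item{i}": (i - 20) / 10 for i in range(41)}
theta, se, n = run_session(bank, lambda item_id: bank[item_id] < 1.0)
print(round(theta, 2), round(se, 2), n)
```

One design point worth checking during the build: a Rasch item contributes at most 0.25 units of Fisher information, so reaching SE < 0.30 takes on the order of 40 well-targeted items. With a shorter session cap, the item-count limit will usually be what ends the session.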

Phase 2: 2-Parameter Logistic (2PL) - After first 500 item responses

Adds discrimination parameter (a) - how sharply an item separates high-ability from low-ability candidates.

Migration path: Once you have real response data from the pilot agency, run a calibration analysis (for example, with the mirt package in R, or an equivalent Python IRT library) to estimate both a and b parameters per item. Update the database. The adaptive engine uses the same logic; it just queries more parameters.
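
Once items carry different discriminations, one common way to generalize "serve the closest item" is to serve the item with the highest Fisher information at the current θ; when every a is equal, this reduces to the closest-b rule. The sketch below assumes that selection rule and a logistic scale without the 1.7 constant; neither is specified in this plan, and the function names are illustrative:

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_item(theta, items):
    """items: dict item_id -> (a, b). Returns the id with maximum information."""
    return max(items, key=lambda k: item_information(theta, *items[k]))


# A sharply discriminating item slightly off target can beat a weak item
# that sits exactly at the candidate's theta.
items = {"weak_on_target": (0.5, 0.0), "sharp_nearby": (2.0, 0.3)}
print(pick_item(0.0, items))  # sharp_nearby
```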

Phase 3: 3-Parameter Logistic (3PL) - After first 2,000 item responses

Adds pseudo-guessing parameter (c). Most important for multiple-choice where lucky guessing inflates scores.

This is the “guessing filter” you pitch to Chiefs - the Merit Engine distinguishes true mastery from lucky guessing.
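
In the 3PL model, c acts as a floor on the probability of a correct response: even a very low-ability candidate answers correctly about c of the time. A minimal sketch, with p_3pl as an illustrative name and c = 0.25 chosen only because it matches blind guessing on a four-option item:

```python
import math


def p_3pl(theta, a, b, c):
    """P(correct) = c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# At theta == b the probability is halfway between the floor c and 1.
print(p_3pl(0.0, 1.0, 0.0, 0.25))             # 0.625
# Far below the item difficulty, the probability approaches c, not 0.
print(round(p_3pl(-6.0, 1.0, 0.0, 0.25), 3))  # 0.252
```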


Document Ingestion Pipeline

This is how agency SOPs become quiz items:

1. Agency uploads SOP document (PDF, Word, or URL to their policy portal)
2. Pipeline splits document into sections by policy number / chapter
3. Each section is tagged with:
   - Domain classification (NLP classifier: Use of Force, Search and Seizure, etc.)
   - Criticality weight (rules-based on domain + keyword signals)
4. Claude API generates N candidate quiz items per section using a structured prompt:
   - "Generate 3 multiple-choice questions at difficulty levels: easy, medium, hard"
   - "Each question must be answerable from the text provided, no inference required"
   - "Each incorrect answer option must be a plausible distractor"
5. Items enter the item bank with calibration_status = 'uncalibrated'
6. Uncalibrated items are served with a flag - their b parameter is estimated from
   content difficulty signals (section heading, sentence complexity score)
   until real response data calibrates them properly
7. After 200+ responses per item, run calibration and update irt_b
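
Steps 5 through 7 imply a small amount of bookkeeping per item: a starting b value and a status that depends on how many responses have accumulated. The sketch below is a stand-in under stated assumptions. It seeds b from the requested difficulty label rather than from the content difficulty signals described in step 6, and the 50-response cutoff for "provisional" is invented for illustration; only the 200-response figure comes from this plan.

```python
CALIBRATED_MIN = 200   # from step 7 of the pipeline
PROVISIONAL_MIN = 50   # assumed; the plan does not set this threshold

# Assumed starting difficulties for the three requested difficulty labels.
SEED_B = {"easy": -1.0, "medium": 0.0, "hard": 1.0}


def seed_difficulty(label: str) -> float:
    """Provisional irt_b for a freshly generated item (hypothetical rule)."""
    return SEED_B[label]


def calibration_status(response_count: int) -> str:
    """Status an item is eligible for, given its response_count."""
    if response_count >= CALIBRATED_MIN:
        return "calibrated"
    if response_count >= PROVISIONAL_MIN:
        return "provisional"
    return "uncalibrated"


print(seed_difficulty("hard"), calibration_status(0))  # 1.0 uncalibrated
print(calibration_status(120))                         # provisional
print(calibration_status(200))                         # calibrated
```

In practice an item would only be marked calibrated after the calibration run in step 7 has actually updated irt_b, not merely because the count has crossed the threshold.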

Candidate Portal: Screen Flow

Login
  -> Dashboard (Day X of 60, Today's Session, Theta Curve, Upcoming Schedule)
  -> Today's Session (varies by phase and SM-2 schedule)
      -> Adaptive Quiz (IRT-driven item selection)
      -> Oral Board Simulation (voice-AI)
      -> Mock Exam (full-length, timed)
  -> My Progress
      -> Theta Curve over time
      -> Domain Heat Map (green/yellow/red per competency area)
      -> Missed Items Review
      -> Predicted Score Range
  -> E-Learning (if flagged)
      -> Module catalog
      -> Booking / access link

Admin / HR Dashboard: Key Views

  1. Force Readiness Heat Map - anonymized unit-level theta distribution. Green = ready, Yellow = approaching threshold, Red = needs intervention. Refreshes after every candidate session.

  2. Policy Blind Spot Report - which items have the highest error rate across all candidates. Tells the Chief which General Orders the whole force is misreading. Exportable as PDF for training bulletin.

  3. Bench Strength Summary - for each promotional rank, how many candidates are projected above the passing threshold as of today. Trend line over the 60-day cycle.

  4. Individual Candidate Exports - for HR to attach to a promotional file. Shows θ trajectory, domains mastered, simulation performance, and predicted score range. Formatted as a professional report, not raw data.
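
The Policy Blind Spot Report (view 2) reduces to an error-rate aggregation over Response rows. A minimal sketch, assuming responses arrive as (item_id, is_correct) pairs and that items with very few responses are excluded; the function name, the minimum-response cutoff, and the item ids are illustrative:

```python
from collections import defaultdict


def blind_spots(responses, min_responses=5, top_n=3):
    """Return [(item_id, error_rate)] for the items missed most often.

    responses: iterable of (item_id, is_correct) pairs.
    Items with fewer than min_responses answers are ignored.
    """
    seen = defaultdict(int)
    wrong = defaultdict(int)
    for item_id, is_correct in responses:
        seen[item_id] += 1
        if not is_correct:
            wrong[item_id] += 1
    rates = [(i, wrong[i] / n) for i, n in seen.items() if n >= min_responses]
    rates.sort(key=lambda r: r[1], reverse=True)
    return rates[:top_n]


# Made-up responses: GO-1 is missed 4 times out of 5, GO-3 has too few answers.
sample = (
    [("GO-1", False)] * 4 + [("GO-1", True)]
    + [("GO-2", True)] * 5
    + [("GO-3", False)] * 2
)
print(blind_spots(sample))  # [('GO-1', 0.8), ('GO-2', 0.0)]
```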


Voice-AI Oral Board Simulation: Architecture

Candidate speaks (or types)
  -> Whisper (OpenAI STT) transcribes
  -> Claude API receives:
       - Scenario prompt (e.g., "You are a newly promoted Sergeant. 
         Officer Jones has filed a grievance against you for an
         assignment decision. How do you respond?")
       - Conversation history
       - Agency-specific policy context (retrieved from vector DB)
  -> Claude generates evaluator response + internal scoring
  -> ElevenLabs TTS reads the response aloud (or text shown)
  -> After session ends:
       - Decision Latency (response time per turn) is logged
       - Content accuracy is scored against policy citations
       - Behavioral consistency across scenarios is flagged
       - Transcript saved to candidate profile

Key engineering constraint: The voice loop needs to feel responsive. Target latency from candidate speech end to AI speech start: under 2 seconds. This requires streaming responses from Claude and parallel TTS generation.
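
One way to approach the latency target is to hand text to TTS sentence by sentence while the model is still streaming, rather than waiting for the full reply. The sketch below covers only the buffering step and makes no calls to Claude or ElevenLabs; sentences_from_stream is a hypothetical helper, and the punctuation-based split is a simplifying assumption (it will split after abbreviations such as "Sgt.").

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in streamed text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


streamed = ["Thank you for rais", "ing that. Let us ", "review the policy. What", " happened next?"]
for sentence in sentences_from_stream(streamed):
    print(sentence)  # each sentence could be sent to TTS as it is printed
```

Each yielded sentence can be submitted to TTS immediately, so audio for the first sentence can start while later sentences are still being generated.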


Security and Compliance Requirements

Requirement | Implementation
Data isolation per agency | Row-level security in PostgreSQL (agency_id on all tables); separate S3 buckets per agency for SOP documents
No cross-agency training | IRT calibration runs per agency item bank, never pooled
Candidate anonymization | Admin dashboard never shows candidate name or badge number - anonymized_id only
SOC 2 Type II path | AWS infrastructure with CloudTrail logging; plan for Year 1 audit
Data retention policy | Candidate response data retained for 7 years (standard civil service audit window); configurable per agency
Candidate data portability | Candidates can export their own full profile as PDF at any time

Iteration Plan: What Changes as We Learn from Beta Users

The following components are deliberately designed to be easy to change based on what we hear from the first agency pilot:

Component | What might change | How to keep it flexible
Session length | 15-minute micro-sessions may be too long or too short for shift workers | Session cap is a configurable parameter per agency, not hardcoded
Oral board scenarios | Agency culture varies - SDPD scenarios may not fit a small rural department | Scenario library is seeded by agency at onboarding; agencies can add custom scenarios
Criticality weights | Our domain weights are assumptions; an agency’s Training Officer may disagree | Criticality weights are editable by the agency admin post-ingestion
E-learning trigger threshold | The theta cutoff that triggers an e-learning upsell needs beta testing to calibrate | Threshold is a configurable setting per agency per domain
Pricing tiers | Bronze/Silver/Gold may not map to how agencies actually budget | Track how agencies describe their constraints; adjust bundling accordingly

Build vs. Buy Decisions

Component | Build or Buy | Rationale
IRT adaptive engine | Build (using catsim + custom logic) | Core IP - this is what differentiates Merit Engine from every competitor
Document parsing | Buy (Claude API) | Commodity capability; not worth building from scratch
Voice AI | Buy (ElevenLabs + Whisper) | Commodity capability; focus dev effort on IRT and analytics
Spaced repetition | Build (SM-2 implementation is ~50 lines of code) | Simple enough to own; dependency risk not worth it
Auth / user management | Buy (Supabase or Auth0) | Security-critical; not a differentiator
Analytics dashboard | Build (React + Recharts) | The specific visualizations (theta curve, heat map) require custom implementation
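
To make the size of the spaced repetition build concrete, here is a sketch of the standard SM-2 update using the field names from SpacedRepetitionSchedule. How a Merit Engine response (is_correct, response_time_ms) maps onto SM-2's 0-5 quality grade is not defined in this plan and is left as an input here; the Schedule class and sm2_update name are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    interval_days: int = 0
    ease_factor: float = 2.5   # SM-2 starting ease
    repetition_count: int = 0


def sm2_update(s: Schedule, quality: int) -> Schedule:
    """Apply one SM-2 review. quality is 0-5; below 3 counts as a lapse."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        if s.repetition_count == 0:
            interval = 1
        elif s.repetition_count == 1:
            interval = 6
        else:
            interval = round(s.interval_days * s.ease_factor)
        reps = s.repetition_count + 1
    else:
        interval, reps = 1, 0  # lapse: restart the repetition sequence
    # Ease moves up for easy recalls and down for hard ones, floored at 1.3.
    ease = s.ease_factor + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return Schedule(interval, max(1.3, ease), reps)


state = Schedule()
for grade in (5, 5, 5):
    state = sm2_update(state, grade)
    print(state.interval_days, round(state.ease_factor, 2))  # 1 2.6, 6 2.7, 16 2.8
```

next_review_date would then be the review date plus interval_days.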

First Sprint: 2-Week MVP Scope

To have something real to show a pilot agency in 2 weeks, build only this:

  1. SOP upload + ingestion - agency uploads one PDF chapter, system parses and generates 20 items, items appear in a review queue
  2. Static adaptive quiz - candidate answers 20 items, system uses 1PL Rasch scoring with a pre-seeded b-parameter per item, outputs a theta estimate and a domain heat map
  3. Candidate profile page - shows current theta, domain scores (color coded), and a recommended “focus area” for next session
  4. One admin view - shows all candidates enrolled, their current theta, and flagged low-theta domains

That is the minimum viable proof of concept. Everything else - voice simulation, spaced repetition, oral boards, the 60-day scheduler - is Phase 2.


Open Engineering Questions (Needs Decision Before Build)

  1. 1PL vs. 3PL at launch: Start with Rasch for speed of calibration, or invest in 3PL upfront to lead with the strongest pitch? Recommendation: start with 1PL, market it as “IRT-validated” (accurate), upgrade transparently as data accumulates.

  2. On-premise vs. cloud deployment: Do pilot agencies require on-premise? This changes the architecture significantly. Needs to be answered in discovery calls before any build begins.

  3. Who generates the initial item bank: FSP staff manually review AI-generated items before they go live, or auto-publish with a post-hoc quality flag? Recommendation: manual review for the pilot; scale with confidence scores later.

  4. How does the 60-day scheduler handle a candidate who misses days? Does it compress, extend, or recalibrate? Needs a defined rule before build.

  5. Single sign-on with agency systems: Some departments will want Merit Engine to authenticate against their existing Active Directory. Is that required for the pilot, or can we use standalone auth? Needs discovery call answer.


Document Version 1.0 - June 29, 2026
Fairlawn Strategy Partners, LLC, an affiliate of the Institute for Transformative Change - Engineering Confidential