Merit Engine - Item Generation Pipeline

Published

June 29, 2026

internal

Multiple Choice Architecture, Distractor Design, and Assessment Type Roadmap

Prepared by: Fairlawn Strategy Partners
Date: June 29, 2026
Version: 1.0

Part 1: Challenging the Pilot Scope Assumption

The Assumption Being Challenged

The assumption is that pilot items must be multiple choice only because situational judgment tests (SJTs), structured interviews, oral boards, and other assessment types require subject matter expert (SME) input - specifically scenario development and critical incident building - that is not yet available.

This assumption is only partially correct. Understanding where it holds and where it does not will shape how aggressively you can expand the pilot scope.

What Actually Requires SME Input (And Why)

Situational Judgment Tests (SJTs)

SJTs present a realistic scenario and ask the candidate to choose the most and least effective response from a set of options. The assumption that SJTs require SME input is correct, for two reasons:

First, scenario validity. The scenario must reflect actual critical incidents that occur at the target rank. A Sergeant’s SJT scenario about managing a subordinate’s misconduct needs to be drawn from real departmental experiences, not invented. A scenario that feels inauthentic to officers will be dismissed as irrelevant, and worse - if an agency’s union challenges the promotional process, inauthentic scenarios are a vulnerability.

Second, scoring key development. SJT scoring requires expert consensus on which responses are most and least effective. This is not something AI can reliably determine without reference to validated expert judgment. The standard methods (consensus scoring, expert scoring, empirical scoring) all require either multiple subject matter experts rating the options or criterion-related validity data linking SJT scores to job performance outcomes. Neither exists without SME involvement.

Verdict on SJTs for the pilot: The assumption is mostly right, but there is a simplified path. Publicly available SJT item frameworks from IACP model policies and POST scenario libraries can seed a basic SJT format for the pilot - clearly labeled as developmental and not used for high-stakes scoring. This lets you demonstrate the format to agencies without claiming it is validated. Use it as a preview, not a scored component.

Structured Interviews (Behavioral Event Interviews)

Structured interviews elicit past behavior (“Tell me about a time when…”) and rate it against defined behavioral anchors. These require a full job analysis to define the behavioral dimensions being rated, and validated Behaviorally Anchored Rating Scales (BARS) for each dimension. This is a multi-month I-O Psychology engagement before a single interview question can be defensibly scored.

Verdict on structured interviews for the pilot: The assumption is correct. Do not include structured interviews in the pilot. They require job analysis data specific to the agency, dimension development, and anchor validation that cannot be shortcut. Including them prematurely would undermine the platform’s psychometric credibility.

Oral Board Simulations

Oral boards are the assessment type where the Merit Engine has the most exciting long-term potential, but the assumption is substantially correct for the pilot. Effective oral board simulation requires:

Scenario content drawn from real critical incidents at the specific rank
Rating dimensions grounded in a job analysis
Behavioral anchors that define what a 1, 3, and 5 response looks like per dimension
Calibrated rater scoring (even if the rater is AI, it needs ground truth to calibrate against)

None of these can be built without SME input and agency-specific data. An AI oral board simulation without validated anchors produces scores that feel meaningful but cannot be defended in a civil service audit or a grievance hearing.

Verdict on oral boards for the pilot: The assumption is correct with one caveat. The voice-AI simulation module can be included in the pilot as a practice tool - a low-stakes environment for candidates to rehearse responses and hear themselves think through scenarios. Label it explicitly as a practice tool, not a scored assessment. Collect the transcripts. Those transcripts become the raw data from which, later, you build validated rating dimensions with a real SME. The pilot earns you the data that makes the scored version possible.

What the Multiple Choice Format Can Do That You May Be Underestimating

This is the more important challenge to your assumption. You are right to start with MC for the pilot - but you may be underestimating how much of the core value proposition MC items can deliver on their own.

MC items can test application, not just recall. A well-written MC item does not ask “What is the definition of probable cause?” It presents a scenario with specific facts and asks “Under the standard established in Illinois v. Gates, does Officer Jones have probable cause to arrest the suspect?” That is applied reasoning, not memorization. The scenario is embedded in the item stem. You do not need a separate SJT format to test situational judgment at a basic level - you need better item stems.

This is the key insight for the pilot: The difference between a knowledge-recall MC item and a situational-judgment-lite MC item is the item stem construction. The former tests whether the candidate can recite a rule. The latter tests whether they can apply it correctly to a realistic set of facts. The IRT model works identically for both. The discrimination parameter (a) is expected to be higher for application items, because they tend to separate stronger candidates from weaker ones more sharply - which is exactly what you want for a promotional exam.
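
To make the role of the discrimination parameter concrete, the following is a minimal sketch using the two-parameter logistic (2PL) form. The a values (0.8 and 1.6) are illustrative assumptions, not estimates from Merit Engine data, and the function names are hypothetical.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def separation(a: float, b: float = 0.0, low: float = -1.0, high: float = 1.0) -> float:
    """Gap in success probability between a stronger (high) and weaker (low) candidate."""
    return p_correct_2pl(high, a, b) - p_correct_2pl(low, a, b)

# Assumed values for illustration only: a lower-discrimination recall item
# versus a higher-discrimination application item of the same difficulty.
recall_gap = separation(a=0.8)
application_gap = separation(a=1.6)
print(f"a=0.8 gap: {recall_gap:.3f}, a=1.6 gap: {application_gap:.3f}")
```

With the same difficulty, the item with the larger a produces a wider gap between the two candidates, which is what "separating candidates more sharply" means in model terms.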

The implication for distractor design is significant. A recall item’s wrong answers are other definitions. An application item’s wrong answers are other plausible legal conclusions a candidate might reach from the same set of facts. Those distractors are richer, more discriminating, and more defensible. They are also harder to write, which is why the item generation pipeline needs to be designed carefully.

Part 2: Multiple Choice Item Architecture

Format Specification

The Merit Engine pilot will use 4-option multiple choice items:

- 1 correct answer (the keyed response)
- 3 distractors (incorrect options)
- Of the 3 distractors, 1 is designated the primary distractor - the most plausible incorrect answer
- The remaining 2 are secondary distractors - plausible but less attractive than the primary

Why 4 options, not 2: A 2-option (binary) format has a guessing parameter (c) of 0.50. A candidate who knows nothing has a 50% chance of getting it right. The IRT 3PL model corrects for this, but high guessing parameters reduce item information and require more items to achieve the same θ precision. The 4-option format drops the theoretical guessing floor to 0.25, which is the standard for professional certification and promotional exams.
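
The effect of the guessing parameter on item information can be sketched as follows. This assumes the standard 3PL response function on the logistic metric (no 1.7 scaling constant) and the usual 3PL item information formula; the item parameters (a = 1.0, b = 0.0) are illustrative assumptions, and the function names are hypothetical.

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL response function: guessing floor c plus a scaled logistic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information contributed by one 3PL item at ability theta."""
    p = p_correct_3pl(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Same item, evaluated at theta = b, under the two theoretical guessing floors.
info_binary = item_information_3pl(theta=0.0, a=1.0, b=0.0, c=0.50)
info_four_option = item_information_3pl(theta=0.0, a=1.0, b=0.0, c=0.25)
print(f"2-option: {info_binary:.3f}, 4-option: {info_four_option:.3f}")
```

Under these assumed parameters the 4-option item carries information of 0.15 at θ = b, against roughly 0.083 for the binary item, so a binary test would need about 1.8 times as many comparable items to reach the same precision at that ability level.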

Why specify a primary distractor: The primary distractor is what separates this from a generic quiz. It represents the most common misconception or the most tempting wrong conclusion. Candidates who select the primary distractor reveal something specific about their misunderstanding - they are not guessing, they are reasoning incorrectly. When the IRT model identifies a candidate who consistently selects primary distractors in a given domain, that is diagnostic information the engine uses to target the source material review in Stage 2.
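
One way this diagnostic could be computed is sketched below: for each domain, the share of a candidate's wrong answers that landed on the primary distractor. The response record layout, the function name, and the "Use of Force" domain label are assumptions for illustration; the source does not specify how the engine stores responses.

```python
from collections import defaultdict

def primary_distractor_rates(responses):
    """Share of wrong answers per domain that hit the primary distractor.

    Each response is a dict with hypothetical keys:
    'domain', 'selected', 'keyed_answer', 'primary_distractor'.
    """
    wrong = defaultdict(int)
    primary = defaultdict(int)
    for r in responses:
        if r["selected"] == r["keyed_answer"]:
            continue  # correct answers carry no distractor signal
        wrong[r["domain"]] += 1
        if r["selected"] == r["primary_distractor"]:
            primary[r["domain"]] += 1
    return {domain: primary[domain] / count for domain, count in wrong.items()}

sample = [
    {"domain": "Constitutional Law", "selected": "b", "keyed_answer": "a", "primary_distractor": "b"},
    {"domain": "Constitutional Law", "selected": "b", "keyed_answer": "a", "primary_distractor": "b"},
    {"domain": "Constitutional Law", "selected": "a", "keyed_answer": "a", "primary_distractor": "b"},
    {"domain": "Constitutional Law", "selected": "d", "keyed_answer": "a", "primary_distractor": "b"},
    {"domain": "Use of Force", "selected": "c", "keyed_answer": "d", "primary_distractor": "a"},
]
rates = primary_distractor_rates(sample)
```

With three distractors, pure guessing among wrong options would put about one third of errors on the primary distractor, so a rate well above that in a domain suggests systematic misreasoning rather than guessing.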

The Four Distractor Types (and Which to Use)

Type	Description	IRT Value	When to Use
Plausible Alternative	Related content that is wrong in this context	High discrimination	Primary distractor - always include one
Common Misconception	A belief many candidates hold that is factually incorrect	Very high discrimination	Primary distractor when misconception is known
Partially Correct	True in some circumstances but not the specific facts presented	High discrimination	Secondary distractor - use for application items
Foil	Clearly wrong to anyone with basic knowledge	Low discrimination, inflates guessing	Avoid - adds noise, reduces item quality

The AI generation pipeline will be instructed to produce at minimum one primary distractor of type Plausible Alternative or Common Misconception, and to explicitly avoid Foil distractors. Foils appear frequently in low-quality practice test banks and are easy to generate - resisting that temptation is part of what makes the Merit Engine item bank professionally defensible.

Item Difficulty Targeting by Stem Type

The item stem type sets the pre-calibration estimate of the b parameter. The pipeline generates items at three to five explicit difficulty targets per source section, depending on which stem types the content supports.

Stem Type	Typical b Range	Example Construction
Definition/Rule Recall	-1.0 to 0.0	“Under Alabama Code 13A-X-X, which of the following defines…”
Single-Fact Application	0.0 to +1.0	“Officer Jones observes X. Which charge is most appropriate?”
Multi-Fact Application	+0.5 to +1.5	“Given the following set of facts [3-4 facts], what is the legal conclusion?”
Exception or Limitation	+1.0 to +2.0	“Which of the following circumstances would NOT justify…?”
Competing Principles	+1.5 to +2.5	“A supervisor faces both X requirement and Y constraint. What takes priority?”

The pipeline requests all five stem types from each source section where the content supports them. Not every section will yield all five - a narrow policy section may only support Definition/Rule Recall and Single-Fact Application items. The domain taxonomy document identifies which domains are likely to yield high-difficulty items.
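
The difficulty targeting table can be encoded as a simple lookup. The sketch below uses the item_type labels from the pipeline's item schema as keys; taking the midpoint of each range as the default estimate is an assumption, as are the function names.

```python
# Typical b ranges from the stem type table, keyed by item_type.
B_RANGES = {
    "recall": (-1.0, 0.0),
    "single_application": (0.0, 1.0),
    "multi_application": (0.5, 1.5),
    "exception": (1.0, 2.0),
    "competing_principles": (1.5, 2.5),
}

def default_b(item_type: str) -> float:
    """Midpoint of the typical range (an assumed convention, not from the source)."""
    low, high = B_RANGES[item_type]
    return (low + high) / 2.0

def b_in_range(item_type: str, estimated_b: float) -> bool:
    """True when an estimated_b falls inside the typical range for its stem type."""
    low, high = B_RANGES[item_type]
    return low <= estimated_b <= high

print(default_b("recall"), b_in_range("single_application", 0.8))
```

A check like b_in_range could be used to surface generated items whose estimated_b is inconsistent with their declared stem type before they reach a reviewer.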

Part 3: The Item Generation Pipeline

Overview

The pipeline converts verified source text into reviewed, pre-calibrated MC items ready for the active bank. It runs in stages with human gates at two points - after AI generation and after source verification - before any item reaches a candidate.

STAGE 0: SOURCE PREPARATION
  Source text section is ingested and tagged
  Domain, criticality weight, and source citation are assigned
  Source content hash is recorded
  Source verification status = UNVERIFIED

STAGE 1: AI ITEM GENERATION (Claude API)
  Input: Source text section + domain + target difficulty levels + format spec
  Output: JSON array of candidate items at 3-5 difficulty levels
  Each item includes:
    - stem
    - option_a through option_d
    - keyed_answer
    - primary_distractor designation
    - distractor_rationale (why each wrong answer is plausible)
    - estimated_b (difficulty estimate based on stem type)
    - source_section reference
    - item_type (recall | single_application | multi_application |
                  exception | competing_principles)
  Items enter bank with status = GENERATED, not yet reviewable

STAGE 2: AUTOMATED QUALITY CHECKS
  Check 1: Stem length is within bounds (20-120 words)
  Check 2: All four options are present and non-empty
  Check 3: No option restates text from the stem, and no option is an
           "all of the above" or "none of the above" construction
  Check 4: Keyed answer is unambiguous - AI self-check: given only the source text,
           is there one clearly correct answer?
  Check 5: Primary distractor is identifiably wrong - AI self-check: a candidate who
           read the source carefully would not select this
  Check 6: No foil detected - AI check: is any option obviously wrong to anyone with
           basic domain knowledge?
  Items passing all 6 checks advance to STAGE 3
  Items failing any check are flagged with failure reason and routed to human review

STAGE 3: SOURCE VERIFICATION (Default, Required)
  Verification point 1: Source URL resolves and content hash matches ingestion record
  Verification point 2: AI reads current source text and confirms it supports keyed answer
  Verification point 3: Source verification date is current (within window)
  Pass: source_verification_status = VERIFIED, date stamped
  Fail: item suspended, flagged in FSP dashboard, routed to human review queue

STAGE 4: HUMAN REVIEW QUEUE (FSP Admin - Tonya or designated reviewer)
  Reviewer sees: item stem, all four options, keyed answer, distractor rationales,
                 source text excerpt, estimated difficulty, automated check results
  Reviewer actions:
    APPROVE: item enters active bank
    APPROVE WITH EDIT: reviewer edits stem or options inline, then approves
    REJECT: item archived with rejection reason logged
    FLAG FOR SME: item flagged for external subject matter expert review before
                  activation (used when policy interpretation is ambiguous)

STAGE 5: IRT PRE-CALIBRATION
  Approved items receive:
    - estimated_b based on stem type (from difficulty targeting table)
    - estimated_a = 1.0 (default discrimination, assumes average, revised after
                         empirical data accumulates)
    - estimated_c = 0.25 (default guessing floor for 4-option MC)
    - calibration_status = PRE-CALIBRATION
  Items are now eligible to be served to candidates

STAGE 6: EMPIRICAL CALIBRATION (Ongoing, Automated)
  After 200 candidate responses per item:
    - Run maximum likelihood estimation on response data
    - Update b estimate with empirical value
    - If using 2PL: update a estimate
    - calibration_status = PROVISIONAL
  After 500 candidate responses per item:
    - Run full calibration with confidence intervals
    - Flag items where empirical b differs from estimated b by more than 0.5
      (signals item may be ambiguous or source has changed)
    - calibration_status = CALIBRATED

STAGE 7: PERIODIC MAINTENANCE
  Scheduled re-verification of source citations (90/30 day windows)
  Items failing re-verification: suspended from serving, flagged in FSP dashboard
  Items with b shift detected: flagged for human review
  Items with high error rate at all theta levels: flagged as potentially flawed
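
The Stage 5 defaults and the Stage 6 thresholds can be sketched as a small state update. This is a minimal illustration, assuming the empirical b value arrives from a separate estimation routine that is not shown; the class, field, and function names are hypothetical, and the empirical value is stored alongside the original estimate so the two can still be compared at 500 responses.

```python
from dataclasses import dataclass
from typing import Optional

PROVISIONAL_MIN = 200    # responses needed for PROVISIONAL status
CALIBRATED_MIN = 500     # responses needed for CALIBRATED status
B_DRIFT_THRESHOLD = 0.5  # flag when empirical b moves this far from the estimate

@dataclass
class ItemCalibration:
    estimated_b: float                  # from the stem type table
    estimated_a: float = 1.0            # Stage 5 default discrimination
    estimated_c: float = 0.25           # Stage 5 default guessing floor
    empirical_b: Optional[float] = None
    calibration_status: str = "PRE-CALIBRATION"
    flagged_for_review: bool = False

def update_calibration(item: ItemCalibration, n_responses: int,
                       empirical_b: float) -> ItemCalibration:
    """Apply the response-count thresholds and the b-drift flag."""
    if n_responses < PROVISIONAL_MIN:
        return item  # too little data; keep pre-calibration values
    item.empirical_b = empirical_b
    if n_responses >= CALIBRATED_MIN:
        item.calibration_status = "CALIBRATED"
        drift = abs(empirical_b - item.estimated_b)
        item.flagged_for_review = drift > B_DRIFT_THRESHOLD
    else:
        item.calibration_status = "PROVISIONAL"
    return item

item = ItemCalibration(estimated_b=0.8)
update_calibration(item, n_responses=600, empirical_b=1.4)
print(item.calibration_status, item.flagged_for_review)
```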

The Claude API Prompt Template

This is the structured prompt that drives Stage 1. Consistency in the prompt is critical - the JSON output format must be identical every time so the ingest pipeline can parse it reliably.

SYSTEM PROMPT:
You are a psychometrician and item writer specializing in public safety
promotional examinations. You write multiple choice items that meet
professional testing standards: one unambiguously correct answer,
no trick questions, no ambiguous stems, no "all of the above" or
"none of the above" options, and no obviously wrong (foil) distractors.

Every item must be answerable correctly by a candidate who has read
and understood the provided source text. No item should require
knowledge beyond what is in the source text.

USER PROMPT:
Source Domain: {domain}
Source Section: {source_section}
Source Text: {source_text_excerpt}
Target Rank: {rank} (Sergeant | Lieutenant | Captain | Fire Officer I | etc.)
Criticality Weight: {criticality_weight} (1-10)

Generate items at the following difficulty levels:
- 1 item at RECALL level (stem type: Definition/Rule Recall)
- 1 item at BASIC APPLICATION level (stem type: Single-Fact Application)
- 1 item at APPLIED REASONING level (stem type: Multi-Fact Application)
- 1 item at EXCEPTION level if the source text contains exceptions or limitations
- 1 item at COMPETING PRINCIPLES level if the source text involves
  balancing two or more requirements

For each item, return a JSON object with this exact structure:
{
  "stem": "string",
  "option_a": "string",
  "option_b": "string",
  "option_c": "string",
  "option_d": "string",
  "keyed_answer": "a" | "b" | "c" | "d",
  "primary_distractor": "a" | "b" | "c" | "d",
  "distractor_rationale": {
    "option_x": "why this option is plausible but wrong",
    "option_y": "why this option is plausible but wrong",
    "option_z": "why this option is plausible but wrong"
  },
  "item_type": "recall" | "single_application" | "multi_application" |
               "exception" | "competing_principles",
  "estimated_b": float,
  "source_section": "string",
  "reviewer_notes": "any ambiguities or caveats the human reviewer should check"
}

Return a JSON array of all generated items. Do not generate an item
at a difficulty level if the source text does not support it.
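
Because the ingest pipeline depends on this structure, the model's reply would normally be validated before items are stored. The sketch below checks a parsed reply against the field list in the template using only the standard library; the function names and error messages are hypothetical, and the rule that rationales must cover exactly the three non-keyed options is an interpretation of the option_x / option_y / option_z placeholders.

```python
import json

OPTION_LETTERS = {"a", "b", "c", "d"}
ITEM_TYPES = {"recall", "single_application", "multi_application",
              "exception", "competing_principles"}
REQUIRED_KEYS = {"stem", "option_a", "option_b", "option_c", "option_d",
                 "keyed_answer", "primary_distractor", "distractor_rationale",
                 "item_type", "estimated_b", "source_section", "reviewer_notes"}

def validate_item(item: dict) -> list:
    """Return a list of schema problems; an empty list means the item is well-formed."""
    missing = REQUIRED_KEYS - set(item)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    errors = []
    key, primary = item["keyed_answer"], item["primary_distractor"]
    if key not in OPTION_LETTERS:
        errors.append("keyed_answer must be one of a-d")
    if primary not in OPTION_LETTERS or primary == key:
        errors.append("primary_distractor must be one of a-d and differ from keyed_answer")
    if item["item_type"] not in ITEM_TYPES:
        errors.append("unknown item_type")
    b = item["estimated_b"]
    if isinstance(b, bool) or not isinstance(b, (int, float)):
        errors.append("estimated_b must be a number")
    # Rationales are expected for exactly the three non-keyed options.
    expected = {f"option_{letter}" for letter in OPTION_LETTERS - {key}}
    rationale = item["distractor_rationale"]
    if not isinstance(rationale, dict) or set(rationale) != expected:
        errors.append("distractor_rationale must cover exactly the three distractors")
    return errors

def parse_stage1_output(raw: str) -> list:
    """Parse the model's reply and pair each item with its validation errors."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return [(item, validate_item(item)) for item in items]
```

Items with a non-empty error list could be routed to the same human review path as items that fail the automated quality checks.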

Example Output (Domain 2: Constitutional Law)

Source text: Alabama Code of Criminal Procedure, coupled with U.S. Supreme Court precedent on the automobile exception to the warrant requirement.

[
  {
    "stem": "The automobile exception to the Fourth Amendment warrant requirement was established in which landmark Supreme Court case?",
    "option_a": "Terry v. Ohio",
    "option_b": "Carroll v. United States",
    "option_c": "Illinois v. Gates",
    "option_d": "Chimel v. California",
    "keyed_answer": "b",
    "primary_distractor": "c",
    "distractor_rationale": {
      "option_a": "Terry v. Ohio is frequently tested and involves stop-and-frisk, not vehicle searches - candidates may confuse landmark cases",
      "option_c": "Illinois v. Gates established the totality of circumstances test for probable cause - thematically related to vehicle searches but not the automobile exception",
      "option_d": "Chimel v. California governs search incident to arrest, not vehicle searches"
    },
    "item_type": "recall",
    "estimated_b": -0.3,
    "source_section": "4th Amendment - Automobile Exception",
    "reviewer_notes": "Straightforward recall item. Verify Carroll v. United States citation is still the controlling precedent - no subsequent case has displaced it but confirm."
  },
  {
    "stem": "Officer Davis stops a vehicle for a broken tail light. During the stop, she detects a strong odor of marijuana coming from inside the vehicle. She has no search warrant. Under the automobile exception, which of the following best describes her authority?",
    "option_a": "She may search the entire vehicle including any closed containers that could contain marijuana",
    "option_b": "She may search only the passenger compartment, not the trunk",
    "option_c": "She must obtain a warrant before conducting any search because the stop was for a traffic violation",
    "option_d": "She may search only if the driver consents",
    "keyed_answer": "a",
    "primary_distractor": "b",
    "distractor_rationale": {
      "option_b": "A common misconception - candidates often believe trunk searches require separate authority. United States v. Ross extends automobile exception to all areas where contraband might be found.",
      "option_c": "Plausible to candidates who confuse the basis for the stop with the basis for the search - probable cause for the search arises independently from the odor",
      "option_d": "Consent is one pathway but not required when probable cause exists - conflates two independent exceptions"
    },
    "item_type": "single_application",
    "estimated_b": 0.8,
    "source_section": "4th Amendment - Automobile Exception",
    "reviewer_notes": "Primary distractor (b) tests a genuine common misconception confirmed by training literature. High discrimination expected. Reviewer should confirm United States v. Ross is cited in agency training materials."
  }
]
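
Once output like this is parsed, the deterministic part of the Stage 2 checks can run without another model call. The sketch below covers Checks 1 and 2 and reads Check 3 as "no option repeats stem text or uses an all/none-of-the-above construction", which is an interpretation; Checks 4 through 6 are AI self-checks and are not shown. The function name is hypothetical.

```python
BANNED_OPTIONS = ("all of the above", "none of the above")

def stage2_deterministic_checks(item: dict, min_words: int = 20,
                                max_words: int = 120) -> list:
    """Return failure reasons for Checks 1-3; an empty list means all three passed."""
    failures = []
    stem = item.get("stem", "")

    # Check 1: stem length within bounds, counted in whitespace-separated words.
    n_words = len(stem.split())
    if not (min_words <= n_words <= max_words):
        failures.append(f"Check 1: stem has {n_words} words")

    # Check 2: all four options present and non-empty.
    options = [item.get(f"option_{letter}", "") for letter in "abcd"]
    if any(not text.strip() for text in options):
        failures.append("Check 2: missing or empty option")

    # Check 3 (as interpreted here): no banned construction, no option copied from the stem.
    for text in options:
        cleaned = text.strip().lower().rstrip(".")
        if cleaned and (cleaned in BANNED_OPTIONS or cleaned in stem.lower()):
            failures.append("Check 3: option repeats the stem or uses a banned construction")
            break
    return failures

# Hypothetical item that fails all three checks.
bad_item = {
    "stem": "What is probable cause?",
    "option_a": "All of the above",
    "option_b": "",
    "option_c": "A warrant",
    "option_d": "A hunch",
}
failures = stage2_deterministic_checks(bad_item)
```

Each failure string names the check that tripped, which matches the pipeline's requirement that failing items be flagged with a failure reason before routing to human review.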

Part 4: Assessment Type Roadmap

The following table defines the full roadmap from pilot through mature platform, with the SME and validation requirements that gate each assessment type.

Assessment Type	Pilot (Phase 1)	Condition to Unlock	SME Requirement	IRT Model
Multiple Choice - Recall	Yes, scored	None - available at launch	FSP review only	1PL Rasch
Multiple Choice - Application	Yes, scored	None - available at launch	FSP review only	1PL, upgrade to 2PL
Multiple Choice - Competing Principles	Yes, scored	200+ responses per item for calibration	FSP review + optional Training Officer review	2PL after calibration
SJT - Practice Preview	Yes, unscored	None - labeled as developmental	Scenario seeded from IACP/POST public frameworks	Not scored in pilot
SJT - Scored	Phase 2	Expert consensus scoring key from 3+ SMEs per agency	Agency Training Officer + I-O Psychologist (Tonya)	Nominal Response Model or GRM
Voice-AI Oral Board Practice	Phase 2, unscored	Voice pipeline built and tested	Scenarios from public prep materials - labeled as practice	Not scored
Voice-AI Oral Board Scored	Phase 3	Validated BARS anchors per dimension, per agency	Agency Training Officer, Field supervisors, Tonya for I-O validation	Requires separate scoring rubric tied to behavioral dimensions
Structured Interview	Phase 3	Full job analysis per agency, dimension development, anchor validation	Multiple agency SMEs + Tonya as I-O lead	Behavioral dimension scoring - not standard IRT
Written Simulation / In-Basket	Phase 3	Scenario development + scoring rubric + pilot test	Agency Training Officer + Tonya	Partial credit / GRM

The Unscored Preview Strategy

For assessment types not yet validated (SJTs and oral boards), the pilot uses an “unscored preview” approach:

- The format is demonstrated to candidates so they understand what it looks and feels like
- Responses are collected but not fed into the θ estimate
- The collected responses become the raw data for SME review and eventual scoring key development
- Candidates are told explicitly: “This is a practice format. Your responses here do not affect your readiness score. They are for your own reflection only.”

This is not a placeholder - it is a deliberate data collection strategy. The transcripts and response patterns from the pilot become the empirical foundation for Phase 2 validation.

Part 5: The SME Engagement Plan

Given the SME requirements for Phase 2 and 3 assessment types, FSP needs a structured approach to engaging subject matter experts. This is not something to figure out later - it should begin during the pilot with the first agency.

Who qualifies as an SME for the Merit Engine:

Role	What They Contribute	How FSP Engages Them
Agency Training Officer	Reviews MC items for policy accuracy, provides SJT scenarios, rates oral board responses	Formal paid consultation agreement per agency
Field Supervisor (Sergeant/Lieutenant)	Rates SJT response options, validates behavioral anchors for oral boards	Focus group during agency onboarding, compensated
Academy Instructor (APOSTC or AFC)	Reviews statutory items, confirms difficulty level accuracy	One-time engagement per item bank build
Civil Service Commission Representative	Reviews item content for adverse impact and legal defensibility	Invited to review item bank before exam cycle
Tonya R. Dawson (FSP)	I-O Psychology framework, item review sign-off, BARS development lead	Built into service tier pricing

The pilot agency SME ask: When onboarding the first agency, build the Training Officer review into the contract - not as an optional service, but as a required step for the agency layer of the item bank. Frame it as protecting the agency: “We need your Training Officer to review the items we build from your policies before any candidate sees them. That protects the integrity of your process.”

That conversation also surfaces the first round of SJT scenarios organically - a Training Officer who is reviewing policy-based MC items will naturally describe the situations where officers get the policy wrong in the field. Those are your critical incidents.

Fairlawn Strategy Partners, LLC, an affiliate of the Institute for Transformative Change - Confidential and Proprietary
Contact: Tonya R. Dawson | tonya@fairlawnstrategy.com
Document Version 1.0 - June 29, 2026