Merit Engine - Item Generation Pipeline
Multiple Choice Architecture, Distractor Design, and Assessment Type Roadmap
Prepared by: Fairlawn Strategy Partners
Date: June 29, 2026
Version: 1.0
Part 1: Challenging the Pilot Scope Assumption
The Assumption Being Challenged
The assumption is that pilot items must be multiple choice only because situational judgment tests (SJTs), structured interviews, oral boards, and other assessment types require subject matter expert (SME) input - specifically scenario development and critical incident building - that is not yet available.
This assumption is partially correct. Understanding where it holds and where it does not will shape how aggressively you can expand the pilot scope.
What Actually Requires SME Input (And Why)
Situational Judgment Tests (SJTs)
SJTs present a realistic scenario and ask the candidate to choose the most and least effective response from a set of options. The assumption that SJTs require SME input is CORRECT for two reasons:
First, scenario validity. The scenario must reflect actual critical incidents that occur at the target rank. A Sergeant’s SJT scenario about managing a subordinate’s misconduct needs to be drawn from real departmental experiences, not invented. A scenario that feels inauthentic to officers will be dismissed as irrelevant, and worse - if an agency’s union challenges the promotional process, inauthentic scenarios are a vulnerability.
Second, scoring key development. SJT scoring requires expert consensus on which responses are most and least effective. This is not something AI can reliably determine without reference to validated expert judgment. The standard methods (consensus scoring, expert scoring, empirical scoring) all require either multiple subject matter experts rating the options or criterion-related validity data linking SJT scores to job performance outcomes. Neither exists without SME involvement.
Verdict on SJTs for the pilot: The assumption is mostly right, but there is a simplified path. Publicly available SJT item frameworks from IACP model policies and POST scenario libraries can seed a basic SJT format for the pilot - clearly labeled as developmental and not used for high-stakes scoring. This lets you demonstrate the format to agencies without claiming it is validated. Use it as a preview, not a scored component.
Structured Interviews (Behavioral Event Interviews)
Structured interviews elicit past behavior (“Tell me about a time when…”) and rate it against defined behavioral anchors. These require a full job analysis to define the behavioral dimensions being rated, and validated Behaviorally Anchored Rating Scales (BARS) for each dimension. This is a multi-month I-O Psychology engagement before a single interview question can be defensibly scored.
Verdict on structured interviews for the pilot: The assumption is correct. Do not include structured interviews in the pilot. They require job analysis data specific to the agency, dimension development, and anchor validation that cannot be shortcut. Including them prematurely would undermine the platform’s psychometric credibility.
Oral Board Simulations
Oral boards are the assessment type where the Merit Engine has the most exciting long-term potential, but the assumption is substantially correct for the pilot. Effective oral board simulation requires:
- Scenario content drawn from real critical incidents at the specific rank
- Rating dimensions grounded in a job analysis
- Behavioral anchors that define what a 1, 3, and 5 response looks like per dimension
- Calibrated rater scoring (even if the rater is AI, it needs ground truth to calibrate against)
None of these can be built without SME input and agency-specific data. An AI oral board simulation without validated anchors produces scores that feel meaningful but cannot be defended in a civil service audit or a grievance hearing.
Verdict on oral boards for the pilot: The assumption is correct with one caveat. The voice-AI simulation module can be included in the pilot as a practice tool - a low-stakes environment for candidates to rehearse responses and hear themselves think through scenarios. Label it explicitly as a practice tool, not a scored assessment. Collect the transcripts. Those transcripts become the raw data from which, later, you build validated rating dimensions with a real SME. The pilot earns you the data that makes the scored version possible.
What the Multiple Choice Format Can Do That You May Be Underestimating
This is the more important challenge to your assumption. You are right to start with MC for the pilot - but you may be underestimating how much of the core value proposition MC items can deliver on their own.
MC items can test application, not just recall. A well-written MC item does not ask “What is the definition of probable cause?” It presents a scenario with specific facts and asks “Under the standard established in Illinois v. Gates, does Officer Jones have probable cause to arrest the suspect?” That is applied reasoning, not memorization. The scenario is embedded in the item stem. You do not need a separate SJT format to test situational judgment at a basic level - you need better item stems.
This is the key insight for the pilot: The difference between a knowledge-recall MC item and a situational-judgment-lite MC item is the item stem construction. The former tests whether the candidate can recite a rule. The latter tests whether they can apply it correctly to a realistic set of facts. The IRT model works identically for both. The discrimination parameter (a) is expected to be higher for the application items because they tend to separate candidates more sharply - which is exactly what you want for a promotional exam. Empirical calibration will confirm whether that holds item by item.
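To make the role of the discrimination parameter concrete, the two-parameter logistic (2PL) response function can be written as below. This is the standard textbook form, shown without the optional 1.7 scaling constant. The document does not state which parameterization the engine uses, so the notation here is an assumption.

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}, \qquad
\left.\frac{dP_i}{d\theta}\right|_{\theta = b_i} = \frac{a_i}{4}
```

The slope of the curve at θ = b_i is a_i/4, so an item with a = 1.6 changes its probability of a correct response twice as fast around its difficulty point as an item with a = 0.8. That steeper slope is what "separating candidates more sharply" means in model terms.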
The implication for distractor design is significant. A recall item’s wrong answers are other definitions. An application item’s wrong answers are other plausible legal conclusions a candidate might reach from the same set of facts. Those distractors are richer, more discriminating, and more defensible. They are also harder to write, which is why the item generation pipeline needs to be designed carefully.
Part 2: Multiple Choice Item Architecture
Format Specification
The Merit Engine pilot will use 4-option multiple choice items:
- 1 correct answer (the keyed response)
- 3 distractors (incorrect options)
- Of the 3 distractors, 1 is designated the primary distractor - the most plausible incorrect answer
- The remaining 2 are secondary distractors - plausible but less attractive than the primary
Why 4 options, not 2: A 2-option (binary) format has a guessing parameter (c) of 0.50. A candidate who knows nothing has a 50% chance of getting it right. The IRT 3PL model corrects for this, but high guessing parameters reduce item information and require more items to achieve the same θ precision. The 4-option format drops the theoretical guessing floor to 0.25, which is the standard for professional certification and promotional exams.
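The effect of the guessing floor on item information can be checked numerically. The sketch below uses the standard 3PL response function and its item information formula, without the 1.7 scaling constant. The function names are illustrative and are not part of the Merit Engine codebase.

```python
import math


def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information contributed by one 3PL item at ability theta."""
    p = p_correct_3pl(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Same item (a = 1.0, b = 0.0) evaluated at theta = b, changing only the guessing floor.
info_binary = item_information_3pl(0.0, 1.0, 0.0, 0.50)  # 2-option format
info_four = item_information_3pl(0.0, 1.0, 0.0, 0.25)    # 4-option format
```

With these inputs the 4-option item carries 0.15 units of information at θ = b, against roughly 0.083 for the binary item. Under these assumptions, nearly twice as many binary items would be needed to reach the same precision at that ability level.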
Why specify a primary distractor: The primary distractor is what separates this from a generic quiz. It represents the most common misconception or the most tempting wrong conclusion. Candidates who select the primary distractor reveal something specific about their misunderstanding - they are not guessing, they are reasoning incorrectly. When the IRT model identifies a candidate who consistently selects primary distractors in a given domain, that is diagnostic information the engine uses to target the source material review in Stage 2.
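One way the diagnostic signal described above could be computed is sketched here. The `Response` record and the `primary_distractor_rates` function are hypothetical names used for illustration, and the "Use of Force" domain label is invented for the sample data. The document does not specify how the engine stores responses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    domain: str
    selected: str            # option the candidate chose: "a".."d"
    keyed_answer: str
    primary_distractor: str


def primary_distractor_rates(responses: list[Response]) -> dict[str, float]:
    """Per domain, the share of incorrect answers that hit the primary distractor."""
    wrong = defaultdict(int)
    primary = defaultdict(int)
    for r in responses:
        if r.selected == r.keyed_answer:
            continue  # correct answers carry no distractor signal
        wrong[r.domain] += 1
        if r.selected == r.primary_distractor:
            primary[r.domain] += 1
    return {domain: primary[domain] / wrong[domain] for domain in wrong}


sample = [
    Response("Constitutional Law", "b", "a", "b"),
    Response("Constitutional Law", "b", "a", "b"),
    Response("Constitutional Law", "d", "a", "b"),
    Response("Constitutional Law", "a", "a", "b"),
    Response("Use of Force", "c", "c", "a"),
]
rates = primary_distractor_rates(sample)
```

If wrong answers were spread evenly across the three distractors, the rate would sit near one third. A candidate whose rate in a domain is well above that is more likely reasoning from a specific misconception than guessing, although any cutoff for acting on it would need to be chosen and validated separately.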
The Four Distractor Types (and Which to Use)
| Type | Description | IRT Value | When to Use |
|---|---|---|---|
| Plausible Alternative | Related content that is wrong in this context | High discrimination | Primary distractor - always include one |
| Common Misconception | A belief many candidates hold that is factually incorrect | Very high discrimination | Primary distractor when misconception is known |
| Partially Correct | True in some circumstances but not the specific facts presented | High discrimination | Secondary distractor - use for application items |
| Foil | Clearly wrong to anyone with basic knowledge | Low discrimination, inflates guessing | Avoid - adds noise, reduces item quality |
The AI generation pipeline will be instructed to make the primary distractor a Plausible Alternative or Common Misconception, and to explicitly avoid Foil distractors. Foils appear frequently in low-quality practice test banks and are easy to generate - resisting that temptation is part of what makes the Merit Engine item bank professionally defensible.
Item Difficulty Targeting by Stem Type
The stem construction sets the pre-calibration estimate of the b parameter. The pipeline targets up to five explicit difficulty levels per source section, one for each stem type below.
| Stem Type | Typical b Range | Example Construction |
|---|---|---|
| Definition/Rule Recall | -1.0 to 0.0 | “Under Alabama Code 13A-X-X, which of the following defines…” |
| Single-Fact Application | 0.0 to +1.0 | “Officer Jones observes X. Which charge is most appropriate?” |
| Multi-Fact Application | +0.5 to +1.5 | “Given the following set of facts [3-4 facts], what is the legal conclusion?” |
| Exception or Limitation | +1.0 to +2.0 | “Which of the following circumstances would NOT justify…?” |
| Competing Principles | +1.5 to +2.5 | “A supervisor faces both X requirement and Y constraint. What takes priority?” |
The pipeline requests all five stem types from each source section where the content supports them. Not every section will yield all five - a narrow policy section may only support Definition/Rule Recall and Single-Fact Application items. The domain taxonomy document identifies which domains are likely to yield high-difficulty items.
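The difficulty targeting table can be expressed as a lookup that supplies a starting b estimate and flags estimates that fall outside the expected range. This is a minimal sketch: the keys reuse the pipeline's `item_type` values, and taking the midpoint of each range as the default is an assumption rather than a documented rule.

```python
# Typical b ranges per stem type, copied from the difficulty targeting table.
STEM_TYPE_B_RANGES = {
    "recall": (-1.0, 0.0),
    "single_application": (0.0, 1.0),
    "multi_application": (0.5, 1.5),
    "exception": (1.0, 2.0),
    "competing_principles": (1.5, 2.5),
}


def default_estimated_b(item_type: str) -> float:
    """Midpoint of the typical range (assumed default, not a documented rule)."""
    low, high = STEM_TYPE_B_RANGES[item_type]
    return (low + high) / 2.0


def b_within_range(item_type: str, estimated_b: float) -> bool:
    """True when a proposed estimate falls inside the range for its stem type."""
    low, high = STEM_TYPE_B_RANGES[item_type]
    return low <= estimated_b <= high
```

A check like `b_within_range` could be applied to the `estimated_b` values returned by the generation step, so that an estimate inconsistent with its declared stem type is surfaced to the reviewer instead of passing through unnoticed.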
Part 3: The Item Generation Pipeline
Overview
The pipeline converts verified source text into reviewed, pre-calibrated MC items ready for the active bank. It runs in stages with human gates at two points - after AI generation and after source verification - before any item reaches a candidate.
STAGE 0: SOURCE PREPARATION
Source text section is ingested and tagged
Domain, criticality weight, and source citation are assigned
Source content hash is recorded
Source verification status = UNVERIFIED
STAGE 1: AI ITEM GENERATION (Claude API)
Input: Source text section + domain + target difficulty levels + format spec
Output: JSON array of candidate items at 3-5 difficulty levels
Each item includes:
- stem
- option_a through option_d
- keyed_answer
- primary_distractor designation
- distractor_rationale (why each wrong answer is plausible)
- estimated_b (difficulty estimate based on stem type)
- source_section reference
- item_type (recall | single_application | multi_application |
exception | competing_principles)
Items enter bank with status = GENERATED, not yet reviewable
STAGE 2: AUTOMATED QUALITY CHECKS
Check 1: Stem length is within bounds (20-120 words)
Check 2: All four options are present and non-empty
Check 3: No option restates the stem or is an "all of the above" / "none of the above" construction
Check 4: Keyed answer is unambiguous - AI self-check: given only the source text,
is there one clearly correct answer?
Check 5: Primary distractor is identifiably wrong - AI self-check: a candidate who
read the source carefully would not select this
Check 6: No foil detected - AI check: is any option obviously wrong to anyone with
basic domain knowledge?
Items passing all 6 checks advance to STAGE 3
Items failing any check are flagged with failure reason and routed to human review
STAGE 3: SOURCE VERIFICATION (Default, Required)
Verification point 1: Source URL resolves and content hash matches ingestion record
Verification point 2: AI reads current source text and confirms it supports keyed answer
Verification point 3: Source verification date is current (within window)
Pass: source_verification_status = VERIFIED, date stamped
Fail: item suspended, flagged in FSP dashboard, routed to human review queue
STAGE 4: HUMAN REVIEW QUEUE (FSP Admin - Tonya or designated reviewer)
Reviewer sees: item stem, all four options, keyed answer, distractor rationales,
source text excerpt, estimated difficulty, automated check results
Reviewer actions:
APPROVE: item enters active bank
APPROVE WITH EDIT: reviewer edits stem or options inline, then approves
REJECT: item archived with rejection reason logged
FLAG FOR SME: item flagged for external subject matter expert review before
activation (used when policy interpretation is ambiguous)
STAGE 5: IRT PRE-CALIBRATION
Approved items receive:
- estimated_b based on stem type (from difficulty targeting table)
- estimated_a = 1.0 (default discrimination, assumes average, revised after
empirical data accumulates)
- estimated_c = 0.25 (default guessing floor for 4-option MC)
- calibration_status = PRE-CALIBRATION
Items are now eligible to be served to candidates
STAGE 6: EMPIRICAL CALIBRATION (Ongoing, Automated)
After 200 candidate responses per item:
- Run maximum likelihood estimation on response data
- Update b estimate with empirical value
- If using 2PL: update a estimate
- calibration_status = PROVISIONAL
After 500 candidate responses per item:
- Run full calibration with confidence intervals
- Flag items where empirical b differs from estimated b by more than 0.5
(signals item may be ambiguous or source has changed)
- calibration_status = CALIBRATED
STAGE 7: PERIODIC MAINTENANCE
Scheduled re-verification of source citations (90/30 day windows)
Items failing re-verification: suspended from serving, flagged in FSP dashboard
Items with b shift detected: flagged for human review
Items with high error rate at all theta levels: flagged as potentially flawed
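The Stage 6 thresholds and the b-shift flag can be sketched as two small functions. The constant and function names are illustrative. Reading "after 200 responses" as "200 or more" is an assumption, and the parameter estimation itself is out of scope here.

```python
PROVISIONAL_MIN_RESPONSES = 200   # Stage 6: first empirical update
CALIBRATED_MIN_RESPONSES = 500    # Stage 6: full calibration
B_DRIFT_THRESHOLD = 0.5           # flag when empirical b moves further than this


def calibration_status(response_count: int) -> str:
    """Map an item's response count to its calibration_status value."""
    if response_count >= CALIBRATED_MIN_RESPONSES:
        return "CALIBRATED"
    if response_count >= PROVISIONAL_MIN_RESPONSES:
        return "PROVISIONAL"
    return "PRE-CALIBRATION"


def needs_drift_review(estimated_b: float, empirical_b: float) -> bool:
    """True when the empirical b differs from the estimate by more than 0.5."""
    return abs(empirical_b - estimated_b) > B_DRIFT_THRESHOLD
```

For example, an item pre-calibrated at b = 0.8 that calibrates empirically at b = 1.4 would be flagged for human review, while one that lands at b = 1.2 would not.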
The Claude API Prompt Template
This is the structured prompt that drives Stage 1. Consistency in the prompt is critical - the JSON output format must be identical every time so the ingest pipeline can parse it reliably.
SYSTEM PROMPT:
You are a psychometrician and item writer specializing in public safety
promotional examinations. You write multiple choice items that meet
professional testing standards: one unambiguously correct answer,
no trick questions, no ambiguous stems, no "all of the above" or
"none of the above" options, and no obviously wrong (foil) distractors.
Every item must be answerable correctly by a candidate who has read
and understood the provided source text. No item should require
knowledge beyond what is in the source text.
USER PROMPT:
Source Domain: {domain}
Source Section: {source_section}
Source Text: {source_text_excerpt}
Target Rank: {rank} (Sergeant | Lieutenant | Captain | Fire Officer I | etc.)
Criticality Weight: {criticality_weight} (1-10)
Generate items at the following difficulty levels:
- 1 item at RECALL level (stem type: Definition/Rule Recall)
- 1 item at BASIC APPLICATION level (stem type: Single-Fact Application)
- 1 item at APPLIED REASONING level (stem type: Multi-Fact Application)
- 1 item at EXCEPTION level if the source text contains exceptions or limitations
- 1 item at COMPETING PRINCIPLES level if the source text involves
balancing two or more requirements
For each item, return a JSON object with this exact structure:
{
"stem": "string",
"option_a": "string",
"option_b": "string",
"option_c": "string",
"option_d": "string",
"keyed_answer": "a" | "b" | "c" | "d",
"primary_distractor": "a" | "b" | "c" | "d",
"distractor_rationale": {
"option_x": "why this option is plausible but wrong",
"option_y": "why this option is plausible but wrong",
"option_z": "why this option is plausible but wrong"
},
"item_type": "recall" | "single_application" | "multi_application" |
"exception" | "competing_principles",
"estimated_b": float,
"source_section": "string",
"reviewer_notes": "any ambiguities or caveats the human reviewer should check"
}
Return a JSON array of all generated items. Do not generate an item
at a difficulty level if the source text does not support it.
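Because the ingest step depends on the response matching this structure, it may help to validate each parsed item before it enters the bank. The sketch below covers only the deterministic parts: the field checks implied by the template plus the Stage 2 stem-length and option-presence checks. The AI self-checks are not modeled, and `validate_item` and `parse_items` are hypothetical names.

```python
import json

OPTION_KEYS = ("option_a", "option_b", "option_c", "option_d")
LETTERS = {"a", "b", "c", "d"}
ITEM_TYPES = {"recall", "single_application", "multi_application",
              "exception", "competing_principles"}
BANNED_OPTIONS = {"all of the above", "none of the above"}


def validate_item(item: dict) -> list[str]:
    """Return a list of failure reasons; an empty list means the item passed."""
    failures = []
    stem_words = len(str(item.get("stem", "")).split())
    if not 20 <= stem_words <= 120:
        failures.append(f"stem length {stem_words} words is outside 20-120")
    for key in OPTION_KEYS:
        text = str(item.get(key, "")).strip()
        if not text:
            failures.append(f"{key} is missing or empty")
        elif text.lower().rstrip(".") in BANNED_OPTIONS:
            failures.append(f"{key} uses a banned construction")
    keyed = item.get("keyed_answer")
    primary = item.get("primary_distractor")
    if keyed not in LETTERS:
        failures.append("keyed_answer must be one of a-d")
    if primary not in LETTERS:
        failures.append("primary_distractor must be one of a-d")
    elif primary == keyed:
        failures.append("primary_distractor equals keyed_answer")
    if item.get("item_type") not in ITEM_TYPES:
        failures.append("unknown item_type")
    return failures


def parse_items(raw: str) -> list[dict]:
    """Parse the model response, which is expected to be a JSON array of items."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items
```

Returning failure reasons, not a single pass/fail flag, matches the Stage 2 requirement that failing items are flagged with the reason before being routed to human review. Note that a short recall stem can fall under the 20-word lower bound, so the bound may need to vary by item type.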
Example Output (Domain 2: Constitutional Law)
Source text: Alabama Code of Criminal Procedure, coupled with U.S. Supreme Court precedent on the automobile exception to the warrant requirement.
[
{
"stem": "The automobile exception to the Fourth Amendment warrant requirement was established in which landmark Supreme Court case?",
"option_a": "Terry v. Ohio",
"option_b": "Carroll v. United States",
"option_c": "Illinois v. Gates",
"option_d": "Chimel v. California",
"keyed_answer": "b",
"primary_distractor": "c",
"distractor_rationale": {
"option_a": "Terry v. Ohio is frequently tested and involves stop-and-frisk, not vehicle searches - candidates may confuse landmark cases",
"option_c": "Illinois v. Gates established the totality of circumstances test for probable cause - thematically related to vehicle searches but not the automobile exception",
"option_d": "Chimel v. California governs search incident to arrest, not vehicle searches"
},
"item_type": "recall",
"estimated_b": -0.3,
"source_section": "4th Amendment - Automobile Exception",
"reviewer_notes": "Straightforward recall item. Verify Carroll v. United States citation is still the controlling precedent - no subsequent case has displaced it but confirm."
},
{
"stem": "Officer Davis stops a vehicle for a broken tail light. During the stop, she detects a strong odor of marijuana coming from inside the vehicle. She has no search warrant. Under the automobile exception, which of the following best describes her authority?",
"option_a": "She may search the entire vehicle including any closed containers that could contain marijuana",
"option_b": "She may search only the passenger compartment, not the trunk",
"option_c": "She must obtain a warrant before conducting any search because the stop was for a traffic violation",
"option_d": "She may search only if the driver consents",
"keyed_answer": "a",
"primary_distractor": "b",
"distractor_rationale": {
"option_b": "A common misconception - candidates often believe trunk searches require separate authority. United States v. Ross extends automobile exception to all areas where contraband might be found.",
"option_c": "Plausible to candidates who confuse the basis for the stop with the basis for the search - probable cause for the search arises independently from the odor",
"option_d": "Consent is one pathway but not required when probable cause exists - conflates two independent exceptions"
},
"item_type": "single_application",
"estimated_b": 0.8,
"source_section": "4th Amendment - Automobile Exception",
"reviewer_notes": "Primary distractor (b) tests a genuine common misconception confirmed by training literature. High discrimination expected. Reviewer should confirm United States v. Ross is cited in agency training materials."
}
]
Part 4: Assessment Type Roadmap
The following table defines the full roadmap from pilot through mature platform, with the SME and validation requirements that gate each assessment type.
| Assessment Type | Pilot (Phase 1) | Condition to Unlock | SME Requirement | IRT Model |
|---|---|---|---|---|
| Multiple Choice - Recall | Yes, scored | None - available at launch | FSP review only | 1PL Rasch |
| Multiple Choice - Application | Yes, scored | None - available at launch | FSP review only | 1PL, upgrade to 2PL |
| Multiple Choice - Competing Principles | Yes, scored | 200+ responses per item for calibration | FSP review + optional Training Officer review | 2PL after calibration |
| SJT - Practice Preview | Yes, unscored | None - labeled as developmental | Scenario seeded from IACP/POST public frameworks | Not scored in pilot |
| SJT - Scored | Phase 2 | Expert consensus scoring key from 3+ SMEs per agency | Agency Training Officer + I-O Psychologist (Tonya) | Nominal Response Model or GRM |
| Voice-AI Oral Board Practice | Phase 2, unscored | Voice pipeline built and tested | Scenarios from public prep materials - labeled as practice | Not scored |
| Voice-AI Oral Board Scored | Phase 3 | Validated BARS anchors per dimension, per agency | Agency Training Officer, Field supervisors, Tonya for I-O validation | Requires separate scoring rubric tied to behavioral dimensions |
| Structured Interview | Phase 3 | Full job analysis per agency, dimension development, anchor validation | Multiple agency SMEs + Tonya as I-O lead | Behavioral dimension scoring - not standard IRT |
| Written Simulation / In-Basket | Phase 3 | Scenario development + scoring rubric + pilot test | Agency Training Officer + Tonya | Partial credit / GRM |
The Unscored Preview Strategy
For assessment types not yet validated (SJTs and oral boards), the pilot uses an “unscored preview” approach:
- The format is demonstrated to candidates so they understand what it looks and feels like
- Responses are collected but not fed into the θ estimate
- The collected responses become the raw data for SME review and eventual scoring key development
- Candidates are told explicitly: “This is a practice format. Your responses here do not affect your readiness score. They are for your own reflection only.”
This is not a placeholder - it is a deliberate data collection strategy. The transcripts and response patterns from the pilot become the empirical foundation for Phase 2 validation.
Part 5: The SME Engagement Plan
Given the SME requirements for Phase 2 and 3 assessment types, FSP needs a structured approach to engaging subject matter experts. This is not something to figure out later - it should begin during the pilot with the first agency.
Who qualifies as an SME for the Merit Engine:
| Role | What They Contribute | How FSP Engages Them |
|---|---|---|
| Agency Training Officer | Reviews MC items for policy accuracy, provides SJT scenarios, rates oral board responses | Formal paid consultation agreement per agency |
| Field Supervisor (Sergeant/Lieutenant) | Rates SJT response options, validates behavioral anchors for oral boards | Focus group during agency onboarding, compensated |
| Academy Instructor (APOSTC or AFC) | Reviews statutory items, confirms difficulty level accuracy | One-time engagement per item bank build |
| Civil Service Commission Representative | Reviews item content for adverse impact and legal defensibility | Invited to review item bank before exam cycle |
| Tonya R. Dawson (FSP) | I-O Psychology framework, item review sign-off, BARS development lead | Built into service tier pricing |
The pilot agency SME ask: When onboarding the first agency, build the Training Officer review into the contract - not as an optional service, but as a required step for the agency layer of the item bank. Frame it as protecting the agency: “We need your Training Officer to review the items we build from your policies before any candidate sees them. That protects the integrity of your process.”
That conversation also surfaces the first round of SJT scenarios organically - a Training Officer who is reviewing policy-based MC items will naturally describe the situations where officers get the policy wrong in the field. Those are your critical incidents.
Fairlawn Strategy Partners, LLC, an affiliate of the Institute for Transformative Change - Confidential and Proprietary
Contact: Tonya R. Dawson | tonya@fairlawnstrategy.com
Document Version 1.0 - June 29, 2026