Assessment Validity Checker
Audit a proposed assessment for construct validity, reliability, and alignment to learning objectives. Use when reviewing or quality-assuring assessments before deployment.
What it does
Evaluates a proposed assessment against three dimensions:
- validity: does it measure what it claims to measure?
- reliability: would different markers agree on the score?
- authenticity: is the task meaningful, and does it require genuine demonstration of the intended learning?
The output identifies specific threats to validity and provides specific, actionable recommendations for each one:
- construct-irrelevant variance: the assessment measures something other than what it claims
- construct underrepresentation: the assessment doesn't cover enough of what it claims to measure
- consequential validity problems: the assessment has unintended negative effects
AI is specifically valuable here because most teacher-designed assessments contain validity threats that are hard to see without an explicit analytical framework. A teacher designing a "reading comprehension" test may inadvertently create a writing test, and a "science understanding" assessment may actually test literacy.
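The reliability question (would different markers agree on the score?) can be made concrete once two markers have graded the same scripts. The sketch below is an illustration only and is not described as part of the tool. It computes Cohen's kappa, a standard chance-corrected agreement statistic and one common choice among several. The function name and the grades are hypothetical.

```python
from collections import Counter


def cohens_kappa(marker_a, marker_b):
    """Chance-corrected agreement between two markers grading the same scripts."""
    if len(marker_a) != len(marker_b) or not marker_a:
        raise ValueError("both markers must grade the same non-empty set of scripts")
    n = len(marker_a)

    # Observed agreement: share of scripts given the same grade by both markers.
    observed = sum(a == b for a, b in zip(marker_a, marker_b)) / n

    # Expected agreement: chance that both pick the same grade if each marker
    # graded independently according to their own grade frequencies.
    freq_a = Counter(marker_a)
    freq_b = Counter(marker_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both markers used one identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two markers for the same eight scripts.
marker_a = ["A", "B", "B", "C", "A", "B", "C", "C"]
marker_b = ["A", "B", "C", "C", "A", "B", "B", "C"]

kappa = cohens_kappa(marker_a, marker_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the markers agree on six of eight scripts (75%), but kappa is lower because some of that agreement would be expected by chance. A low value suggests the marking criteria leave too much room for interpretation.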
The evidence behind it
Messick (1989) unified the concept of validity into a single framework: validity is not a property of a test but of the interpretation and use of test scores. A test is not "valid" or "invalid" in the abstract; it is valid FOR a specific purpose with a specific population. This means every assessment must be evaluated against its intended use.
Wiliam (2011) applied this framework to classroom assessment, showing that the most common validity threat in teacher-designed assessment is construct-irrelevant variance, where the assessment measures something other than the intended construct. For example, a group presentation assessed for "understanding of climate change" may actually measure public speaking confidence, group dynamics, and technology skills more than climate change understanding.
Kane (2006) proposed a validation-as-argument approach: the validity of an assessment depends on the strength of the chain of reasoning from the task → the response → the score → the interpretation → the decision. Any weak link in this chain is a validity threat.
Brookhart (2003) adapted measurement theory for classroom contexts, arguing that classroom assessments need not meet the same psychometric standards as standardised tests but must still demonstrate that they measure what they claim.
Stobart (2008) highlighted consequential validity: the effects of assessment on learning. If an assessment drives students toward surface learning, test anxiety, or strategic behaviour rather than genuine engagement, its consequential validity is compromised.
Sources
- Wiliam (2011), Embedded Formative Assessment
- Messick (1989), Validity, in R. L. Linn (ed.), Educational Measurement, 3rd ed. (the unified validity framework)
- Kane (2006), Validation, in R. L. Brennan (ed.), Educational Measurement, 4th ed. (the argument-based approach)
- Brookhart (2003), Developing measurement theory for classroom assessment purposes and uses
- Stobart (2008), Testing Times: The Uses and Abuses of Assessment
How to use it in your lesson
For the best results with EvidenceLesson, give it:
- assessment_description — Description of the proposed assessment — what students do, how it is marked
- intended_learning — What the assessment claims to measure
- student_level — Age/year group
- subject_area (optional) — The curriculum subject
- assessment_purpose (optional) — Formative, summative, diagnostic, or evaluative
- marking_approach (optional) — How the assessment will be marked — rubric, mark scheme, holistic judgement
- stakes (optional) — The consequences of the assessment — low stakes (informing teaching), high stakes (grading, reporting)
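As a rough illustration of how these inputs fit together, the sketch below collects them in a small Python structure. Only the field names come from the list above. The class name, the `to_payload` helper, and the example values (including the year group) are hypothetical, and this section does not specify how EvidenceLesson actually accepts its inputs.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AssessmentAuditRequest:
    """Hypothetical container for the inputs listed above."""

    # Inputs not marked optional in the list.
    assessment_description: str
    intended_learning: str
    student_level: str
    # Inputs marked optional in the list.
    subject_area: Optional[str] = None
    assessment_purpose: Optional[str] = None
    marking_approach: Optional[str] = None
    stakes: Optional[str] = None

    def to_payload(self) -> dict:
        """Return only the fields that were actually supplied."""
        return {key: value for key, value in asdict(self).items() if value is not None}


# Example values are illustrative only.
request = AssessmentAuditRequest(
    assessment_description=(
        "Groups give a slide presentation on climate change; "
        "the class teacher marks it."
    ),
    intended_learning="Understanding of climate change",
    student_level="Year 9",
    assessment_purpose="summative",
    marking_approach="rubric",
)

payload = request.to_payload()
print(payload)
```

Leaving an optional field unset simply omits it from the payload. Supplying `marking_approach` and `stakes` where they are known gives the analysis more to work with on reliability and consequences.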
Known limitations
- The analysis evaluates the assessment as described, which may differ from how it is implemented. For example, where the mark allocation gives too much weight to design, a teacher who marks generously on design and strictly on content may partially compensate in practice, but the structural problem remains. The assessment's design, not just its implementation, should be valid.
- Validity is always relative to purpose. The analysis evaluates validity for the STATED purpose (for example, measuring understanding of climate change). If the assessment's actual purpose also includes developing presentation skills, the validity analysis would differ, but the assessment should then be labelled as measuring multiple constructs.
- Some validity threats are trade-offs, not errors. Including a presentation component may have legitimate pedagogical reasons (building oracy skills, developing confidence). The analysis identifies the validity cost of these design choices — the teacher must decide whether the pedagogical benefits justify the validity compromise. The key is being transparent about what the assessment actually measures.