Historical Thinking Assessment Designer
Design formative assessments that make students' historical thinking visible — revealing whether they source, close-read, contextualise, and corroborate. Use when assessing historical thinking skills or planning a diagnostic.
What it does
Designs formative assessment tasks that make students' historical thinking visible — revealing whether students are sourcing, close reading, contextualising, and corroborating, and at what level of sophistication. The output includes assessment tasks with documents and prompts, a scoring guide distinguishing novice/developing/proficient responses, diagnostic guidance on how to interpret results, and validity notes identifying what the assessment does and does not measure.
The fundamental problem this skill addresses is that most history assessments test factual recall rather than historical thinking. A test that asks "When was the Battle of Gettysburg?" measures whether students memorised a date. A test that asks "Does this painting help us understand what happened at the first Thanksgiving in 1621? Explain your reasoning" — and provides a painting created in 1932 — measures whether students consider the temporal relationship between a source and the event it depicts. The first tells you what students remember; the second tells you how students think.
Wineburg, Smith, and Breakstone (2018) documented how wide this gap is. Using History Assessments of Thinking (HATs), brief constructed-response tasks targeting sourcing, contextualisation, and claim-evidence reasoning, they found that 94% of introductory college students ignored the gap of more than 300 years between a 1621 event and a 1932 painting. Upper-level history majors performed only slightly better: 78% earned zero on a task requiring them to use a Senate hearing's existence (not its content) as evidence of public opposition to war. Students consistently read for content rather than context: they engaged with what documents said but not with the circumstances of their creation.
These findings confirm that historical thinking skills must be explicitly assessed, not assumed. And the HATs model shows how: brief tasks centred on a document, with questions that can only be answered well by deploying specific historical thinking skills. This skill generates assessments following that model.
The evidence behind it
Wineburg, Smith, and Breakstone (2018) developed History Assessments of Thinking (HATs) with Library of Congress support. Three HATs were administered to 78 introductory and 49 upper-level college students. Each HAT presented a document (a painting, testimony transcripts, a playbill) with full bibliographic information and asked questions that required sourcing, contextualisation, or claim-evidence reasoning to answer well. The scoring rubric used three levels: Basic (0), no evidence of the target skill; Emergent (1), some evidence but incomplete or superficial; and Proficient (2), clear demonstration of the skill. The results were stark: introductory students averaged less than 0.5 out of 12, and the highest score was 3 out of 12. The HATs model demonstrates that historical thinking can be assessed briefly, authentically, and diagnostically, because each task reveals the presence or absence of a specific skill.
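The three-level rubric described above can be sketched as a small data structure. This is only an illustration, not the scoring tool used in the study: the names `Level` and `total_score` and the three sample responses are assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """Three rubric levels, numbered as in the description above."""
    BASIC = 0       # no evidence of the target skill
    EMERGENT = 1    # some evidence, but incomplete or superficial
    PROFICIENT = 2  # clear demonstration of the skill


def total_score(item_levels):
    """Sum rubric levels across scored items and return (earned, maximum)."""
    earned = sum(int(level) for level in item_levels)
    maximum = len(item_levels) * int(Level.PROFICIENT)
    return earned, maximum


# Hypothetical ratings for one student on three scored items.
responses = [Level.BASIC, Level.EMERGENT, Level.BASIC]
earned, maximum = total_score(responses)
print(f"{earned} out of {maximum}")
```

Reporting the maximum alongside the earned score keeps totals comparable when assessments contain different numbers of scored items.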
Reisman (2012) used a 30-item Historical Thinking Test (22 multiple choice + 8 constructed response, α = .79) and a 20-item Transfer test as outcome measures. The Historical Thinking Test assessed application of sourcing, close reading, contextualisation, and corroboration strategies. Treatment students showed significant gains on this measure, demonstrating that historical thinking skills are both teachable and assessable. Reisman's finding that effects were concentrated in sourcing and close reading (not contextualisation or corroboration) provides diagnostic guidance: if an assessment shows students performing well on sourcing but poorly on contextualisation, this is consistent with the research — the skills develop at different rates and require different instructional approaches.
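One way to read results skill by skill, rather than as a single total, is to average rubric levels separately for each skill. The sketch below assumes items are scored 0 to 2 and tagged with the skill they target. The function names, the sample scores, and the 1.0 threshold are all hypothetical and do not come from Reisman's study.

```python
from collections import defaultdict

SKILLS = ("sourcing", "close reading", "contextualisation", "corroboration")


def skill_profile(scored_items):
    """Return the mean rubric level per skill from (skill, level) pairs."""
    by_skill = defaultdict(list)
    for skill, level in scored_items:
        if skill not in SKILLS:
            raise ValueError(f"unknown skill: {skill!r}")
        by_skill[skill].append(level)
    return {skill: sum(levels) / len(levels) for skill, levels in by_skill.items()}


def skills_needing_attention(profile, threshold=1.0):
    """List skills whose mean falls below the threshold (an assumed cut-off)."""
    return sorted(skill for skill, mean in profile.items() if mean < threshold)


# Hypothetical class results: (targeted skill, rubric level 0-2).
class_scores = [
    ("sourcing", 2), ("sourcing", 1),
    ("contextualisation", 0), ("contextualisation", 1),
    ("close reading", 2),
]
profile = skill_profile(class_scores)
print(skills_needing_attention(profile))
```

A profile like this one, with sourcing strong and contextualisation weak, matches the uneven pattern described above and points to where instruction should go next.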
Wiliam (2011) established that formative assessment's value lies in making student thinking visible so that teachers can adjust instruction. The most useful formative assessment tasks are those that distinguish between students who have developed a skill and students who have not — tasks where the wrong answer is as informative as the right answer because it reveals a specific misconception or skill gap. For historical thinking, this means assessment tasks where a student who sources will answer differently from a student who doesn't, and where the specific way a student fails to source reveals what they need to learn next.
Wineburg (2007) argued that the distinction between "being critical" and "thinking critically" is invisible without appropriate assessment. His student Jacob was passionately critical of Columbus — but his criticism was driven by spread of activation (everything he already believed about Columbus), not by disciplinary analysis of the document. A multiple-choice test on Columbus facts would not have revealed this. Only a task requiring Jacob to engage with the document as a document — attending to its source, context, and language — could make the distinction visible.
Gottlieb and Wineburg (2012) demonstrated that the quantity of sourcing references was similar between experts and novices — the difference was qualitative. Assessment tasks must therefore distinguish between mentioning a source and interrogating it. A student who writes "This was written by John Smith in 1624" has mentioned the source. A student who writes "Smith wrote this 17 years after the event, at a time when Pocahontas had become famous, which may explain why he added the rescue story" has interrogated it. The scoring guide must capture this distinction.
Sources
- Wineburg, Smith & Breakstone (2018) — What is learned in college history classes?
- Reisman (2012) — Reading like a historian: a document-based history curriculum intervention in urban high schools
- Wineburg (2007) — Unnatural and essential: the nature of historical thinking
- Wineburg (1991) — Historical problem solving: cognitive processes in the evaluation of documentary and pictorial evidence
- Wiliam (2011) — Embedded Formative Assessment
- Gottlieb & Wineburg (2012) — Between Veritas and Communitas: epistemic switching in the reading of academic and sacred history
How to use it in your lesson
For the best results with EvidenceLesson, give it:
- target_skills — Which historical thinking skills the assessment should target — one or more of: sourcing, close reading, contextualisation, corroboration, or 'all four'
- student_level — Age/year group and current level of historical thinking development
- assessment_purpose — What the teacher wants to learn from the assessment — e.g. 'whether students source before reading without prompting', 'whether students can contextualise a document they haven't seen before', 'baseline diagnostic at the start of a unit', 'end-of-unit check on all four skills'
- historical_topic (optional) — The topic or period the assessment should draw from — if not provided, the skill will use a topic that is accessible without specialised prior knowledge
- assessment_format (optional) — Preferred format — e.g. 'constructed response (short written answers)', 'multiple choice', 'mixed', 'discussion-based', 'document analysis task'
- time_available (optional) — How much time is available for the assessment — affects the number and complexity of tasks
- prior_instruction (optional) — What historical thinking instruction students have received — so the assessment targets what has been taught
- curriculum_framework (optional) — From context engine: relevant curriculum standards or historical thinking framework in use
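The inputs listed above can be gathered into a single structure before being handed to the skill. The sketch below is not EvidenceLesson's actual interface: the `AssessmentRequest` class and its validation rules are assumptions made for illustration, and only the field names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

VALID_SKILLS = {"sourcing", "close reading", "contextualisation", "corroboration"}


@dataclass
class AssessmentRequest:
    # Required inputs
    target_skills: list
    student_level: str
    assessment_purpose: str
    # Optional inputs
    historical_topic: Optional[str] = None
    assessment_format: Optional[str] = None
    time_available: Optional[str] = None
    prior_instruction: Optional[str] = None
    curriculum_framework: Optional[str] = None

    def __post_init__(self):
        # Expand the 'all four' shorthand into the four named skills.
        if self.target_skills == ["all four"]:
            self.target_skills = sorted(VALID_SKILLS)
        if not self.target_skills:
            raise ValueError("at least one target skill is required")
        unknown = set(self.target_skills) - VALID_SKILLS
        if unknown:
            raise ValueError(f"unknown skills: {sorted(unknown)}")


# Hypothetical request for a baseline diagnostic.
request = AssessmentRequest(
    target_skills=["sourcing", "contextualisation"],
    student_level="Year 9, little prior work with primary sources",
    assessment_purpose="baseline diagnostic at the start of a unit",
    time_available="20 minutes",
)
print(request.target_skills)
```

Checking the skill names up front catches spelling variants such as "contextualization" before they reach the assessment design step.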
Known limitations
- Constructed-response tasks reveal more about student thinking than multiple choice, but they also measure writing ability. A student who thinks historically but writes poorly may score lower than their historical thinking warrants. Teachers should supplement written assessments with observation during discussion (see the assessment indicators in each skill builder) to triangulate. Where possible, allow students to explain their reasoning verbally as well as in writing.
- The HATs model (Wineburg, Smith & Breakstone, 2018) was developed and tested with US college students reading US history documents. The task structure (document + question requiring a specific skill) transfers to other contexts, but the specific documents and scoring descriptors must be adapted for different age groups, national curricula, and historical topics.
- Formative assessment is only useful if it changes instruction. The diagnostic guidance in the output gives specific instructional suggestions for each score pattern, but the teacher must act on them. An assessment that reveals students are not contextualising but leads to no change in instruction is a wasted opportunity. The assessment's value lies entirely in what happens next.