Generalizability of automated scores of writing quality in Grades 3–5.

The present study examined the reliability of writing assessment in the elementary grades, both overall and among samples of struggling and nonstruggling writers. It also extended nascent research on the reliability and practical applications of automated essay scoring (AES) systems in Response to Intervention frameworks aimed at preventing and remediating writing difficulties (RTI-W). Students in Grade 3 (n = 185), Grade 4 (n = 192), and Grade 5 (n = 193) responded to six writing prompts, two in each of the three genres emphasized in the Common Core and similar “Next Generation” academic standards: narrative, informative, and persuasive. Responses were scored using an AES system called Project Essay Grade (PEG). Generalizability theory was used to examine the following sources of variation in PEG’s quality scores: prompts, genres, and the interactions among those facets and the object of measurement, students. Separate generalizability and decision studies were conducted for each grade level and for subsamples of nonstruggling and struggling writers identified using a composite measure of writing skill. Low-stakes decisions (reliability ≥ .80) could be made by averaging scores from a single prompt per genre (i.e., 3 total) or from 2 prompts per genre if administered to struggling writers (i.e., 6 total). High-stakes decisions (reliability ≥ .90) could be made by averaging across 2 prompts per genre (6 total) or across 4–5 prompts per genre if administered to struggling writers (12–15 total). Implications for use of AES within RTI-W and the construct validity of AES writing quality scores are discussed. (PsycINFO Database Record (c) 2019 APA, all rights reserved)
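
The decision-study logic summarized above can be illustrated with a minimal sketch. It assumes a design in which students are crossed with prompts nested in genres and uses the standard relative-error generalizability coefficient for that design; the abstract does not report the exact design or any variance components, so the function names and every numeric value below are hypothetical placeholders, not estimates from the study.

```python
def g_coefficient(var_p, var_pg, var_res, n_genres, n_prompts_per_genre):
    """Generalizability coefficient for relative decisions.

    Assumed design: students crossed with prompts nested in genres.
    var_p   : variance due to students (object of measurement)
    var_pg  : student-by-genre interaction variance
    var_res : student-by-prompt-within-genre variance, confounded with error
    """
    relative_error = (var_pg / n_genres
                      + var_res / (n_genres * n_prompts_per_genre))
    return var_p / (var_p + relative_error)


def min_prompts_per_genre(var_p, var_pg, var_res, n_genres, target,
                          max_prompts=20):
    """Smallest number of prompts per genre that reaches the target, or None."""
    for n in range(1, max_prompts + 1):
        if g_coefficient(var_p, var_pg, var_res, n_genres, n) >= target:
            return n
    return None


# Hypothetical variance components (placeholders, NOT the study's estimates).
components = dict(var_p=0.60, var_pg=0.05, var_res=0.25, n_genres=3)

low_stakes = min_prompts_per_genre(target=0.80, **components)
high_stakes = min_prompts_per_genre(target=0.90, **components)
print(low_stakes, high_stakes)
```

With a smaller student variance relative to the error terms, as might be expected in a more homogeneous subsample, the same functions return a larger number of prompts for each threshold.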