What is an Automated Scoring Engine?

By Scott W. Wood

You may have seen the phrases automated scoring, machine scoring, or even robo-readers used in news articles about assessment. All three phrases refer to the use of a computer algorithm to score open-ended free text responses (like essays) in a way that emulates scoring by a human rater in formative and summative assessments.

But how does the computer algorithm know how to score responses?

In this post, I give an introduction to how a typical automated essay scoring engine works. But first, it is important to define three processes for which engines are used.

  • When an automated scoring professional trains an engine, the engine determines (1) which response characteristics need to be calculated to produce the score and (2) which statistical model best converts response characteristics into scores.
  • When a professional validates an engine, he or she submits new responses with known human-assigned scores to predict how the trained engine will perform in operational practice.
  • When an engine is used in operational practice, responses are being scored by the engine as part of a live formative or summative testing program.

Although the same engine is used for all three processes, the steps involved differ slightly in each.

Training the Engine

To start training, most engines require a set of essays produced by actual test takers along with human-derived scores. It may seem unusual that training an engine requires human scores. But remember that engines are designed to emulate human scoring, and we need evidence of how human beings would score the item.

The human-scored essays are imported into the automated scoring engine. Pre-processing of the responses is necessary to standardize spacing and remove non-valid responses from further training. Next, a set of quantitative essay characteristics, called features, is obtained from the training data. A feature is any characteristic of an essay that is relevant to the scoring of the essay and can be expressed numerically. A feature can be a count (e.g., the number of words in the essay), a decimal (e.g., the Flesch-Kincaid grade level), or a 0/1 binary value (e.g., the presence of the word “government” in the essay). Features like these represent a small portion of the total feature set available in ACT’s automated scoring engine, CRASE+.
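To make the idea of a feature concrete, here is a minimal Python sketch of pre-processing followed by the three example features named above. It is an illustration only, not how CRASE+ works: the function names, the validity rule (at least one word), and the vowel-group syllable heuristic are all assumptions made for this example. The Flesch-Kincaid grade level uses the standard published formula.

```python
import re

def preprocess(text):
    """Standardize spacing: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def is_valid(text, min_words=1):
    """Hypothetical validity rule: a response needs at least min_words words."""
    return len(text.split()) >= min_words

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per vowel group, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def extract_features(text):
    """Return a count, a decimal, and a 0/1 binary feature for one essay."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_sentences = max(1, len(sentences))
    if n_words == 0:
        fk_grade = 0.0
    else:
        n_syllables = sum(count_syllables(w) for w in words)
        # Flesch-Kincaid grade level formula.
        fk_grade = (0.39 * (n_words / n_sentences)
                    + 11.8 * (n_syllables / n_words)
                    - 15.59)
    return {
        "word_count": n_words,                      # a count
        "fk_grade": fk_grade,                       # a decimal
        "has_government": int(any(w.lower() == "government" for w in words)),  # 0/1
    }

raw = "The   government  passed a law.\n\nPeople were happy!"
essay = preprocess(raw)
if is_valid(essay):
    print(extract_features(essay))
```

A production engine would compute many more features than these three, and would likely use a more careful syllable counter and sentence splitter.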

After obtaining all the relevant feature values for all the training essays, a statistical model is fit to convert the features into a predicted score. Using the features as independent variables (the X’s) and the human-derived score as the dependent variable (the Y), we can use linear regression to produce a scoring model. Furthermore, we can use that model to assign scores to new essays. The original version of CRASE+ used linear regression, but new versions utilize the latest in machine learning methods.
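The regression step can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the training data is invented for illustration, the two features (word count and a 0/1 word-presence flag) are chosen only to echo the examples above, and the function names are hypothetical. It fits ordinary least squares by solving the normal equations, which is one standard way to do linear regression, not necessarily the method any particular engine uses.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of feature rows and y the matching human scores.
    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]  # add intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit the model")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, features):
    """Apply the fitted model to one essay's feature values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], features))

# Invented training data: [word_count, has_government] -> human score.
X_train = [[100, 0], [200, 1], [300, 0], [400, 1], [250, 1]]
y_train = [2.0, 4.0, 4.0, 6.0, 4.5]

coefs = fit_linear_regression(X_train, y_train)
print(predict(coefs, [150, 1]))  # predicted score for a new, unseen essay
```

The fitted coefficients play the role of the scoring model: once they are fixed, scoring a new essay is just computing its features and calling `predict`.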

There is one last training step. Some statistical models, like linear regression, predict a decimal value rather than a whole-number rubric score. Discretization rules are established to convert each decimal prediction into a final score.
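One simple discretization rule, shown here purely as an example, is to round the prediction to the nearest whole number and then clamp it to the rubric's range. The 1 to 6 rubric and the function name are assumptions for this sketch; real engines may use other rules, such as cut points tuned on the training data.

```python
import math

def discretize(prediction, min_score=1, max_score=6):
    """Round half up, then clamp to the (assumed) rubric range."""
    # math.floor(x + 0.5) is used because Python's round() rounds halves to even.
    rounded = math.floor(prediction + 0.5)
    return max(min_score, min(max_score, rounded))

for raw_prediction in (0.2, 3.49, 3.5, 7.8):
    print(raw_prediction, "->", discretize(raw_prediction))
```

Clamping matters because a regression model can predict values below the lowest rubric score or above the highest one.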

At the end of training, we have determined which features are important for determining an essay’s score, and we have determined what statistical models and rules are needed to convert feature values into predicted scores.

Validating the Engine

It is important to confirm that the trained engine will perform well on new essays—essays the engine has not seen before. This is done during the engine validation process.

To begin, we need to import a new set of essays and their human-derived scores into the engine. We apply the same pre-processing steps, compute the relevant features, produce the predicted score, and discretize the prediction via the scoring model developed during training.

The engine produces scores that we believe replicate the way a human scorer would produce scores. But we cannot confirm that until we compare the engine scores to the human-derived scores for the essays. Automated scoring professionals will produce a variety of metrics to compare the automated scores with the human scores, such as score point distributions, exact agreement, and quadratic weighted kappa. The automated scoring staff will compare these metrics to established best practices and advise the customer whether the trained engine is acceptable for use in an operational testing program.
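Two of these metrics are easy to sketch. Exact agreement is the share of essays where the engine and the human gave the same score. Quadratic weighted kappa compares observed disagreement with the disagreement expected by chance, penalizing each disagreement by the squared distance between the two scores. The code below is a minimal illustration with invented scores and hypothetical function names; treating the undefined all-one-category case as 1.0 is a choice made for this sketch.

```python
def exact_agreement(human, engine):
    """Proportion of essays where both scores are identical."""
    return sum(h == e for h, e in zip(human, engine)) / len(human)

def quadratic_weighted_kappa(human, engine, min_score, max_score):
    """1 - (weighted observed disagreement / weighted chance disagreement)."""
    n_cat = max_score - min_score + 1
    n = len(human)
    # Confusion matrix: rows are human scores, columns are engine scores.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, e in zip(human, engine):
        observed[h - min_score][e - min_score] += 1
    human_totals = [sum(row) for row in observed]
    engine_totals = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            expected = human_totals[i] * engine_totals[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        # Every score falls in a single category, so kappa is undefined;
        # this sketch reports 1.0 for that case.
        return 1.0
    return 1.0 - numerator / denominator

# Invented validation scores on an assumed 1-4 rubric.
human_scores = [1, 2, 3, 4]
engine_scores = [1, 2, 3, 3]
print(exact_agreement(human_scores, engine_scores))                  # 0.75
print(quadratic_weighted_kappa(human_scores, engine_scores, 1, 4))   # about 0.875
```

Because of the squared weights, an engine that misses by one score point is penalized far less than one that misses by three, which is why this metric is often reported alongside exact agreement.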

Conclusion

Automated scoring engines emulate human scoring by using previously scored essays to identify important essay features and to create a statistical model based on those features. Once trained, the engine can be validated to confirm that the scores returned from the engine meet specifications for accuracy.

There are several benefits to using automated essay scoring. First, engines can score essays at any time of day, 365 days a year, since most automated scoring engines use cloud computing. Second, engines can score essays quickly: CRASE+ can reliably score a single essay in less than one second. Third, in many circumstances, automated scoring can produce cost savings over human hand scoring. Finally, essays sent to the engine’s models will receive consistent scores, regardless of whether the response is submitted on the first day of testing or the last.

Automated scoring does have limitations, though. Training an engine often requires large amounts of human-scored response data, which may not be available for some items. Automated scoring experts are also concerned about bias in the form of subgroup differences. Additionally, how can an engine capture information about a response from the bottom up (via the response data) and from the top down (high-level abstract features)?

In our next post, Dr. Luyao Peng will discuss a methodology that addresses this question of capturing bottom-up and top-down information.

This article is part of Augmented Intelligence, a blog series about automated essay scoring.