In August 2019, Motherboard (vice.com’s Tech blog) published a critical article about automated essay scoring, “Flawed Algorithms Are Grading Millions of Students’ Essays.” The story notes that: “research from psychometricians—professionals who study testing—and AI experts, as well as documents obtained by Motherboard, show that these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups.”
The bias problem is not restricted to essay scoring and arises from the tendency of artificial intelligence to amplify patterns from data, in this case the human graders that train the essay scoring engines.
In this episode of ACTNext Navigator podcast, we’ll go under the hood of ACT’s automated essay scoring engine, CRASE+ (Constructed Response Automated Scoring Engine). Our guests are Erin Yao and Scott Wood. They’ve been working for many years on CRASE+, a product acquired in 2014 when ACT purchased Pacific Metrics. CRASE+ is a writing assessment tool that begins with human graders to develop a rubric. Data from human graders is used to train the automatic grading on a large scale.
We discuss CRASE+, automated scoring, the challenges of automated scoring using natural language processing, how to address the biases in human graders and the “flawed algorithms.”
Scott wrote “Public perception and communication around automated essay scoring,” a chapter in the Handbook of Automated Scoring: Theory into Practice (Chapman and Hall/CRC, 2020).
Want to learn more about CRASE+? Download the product overview here.
The views and opinions expressed in this podcast are those of the authors only and do not necessarily reflect the official policy or position of ACT, Inc.
[Erin Yao] My perspective is that, in high-stakes assessment, automated scoring will always be used with humans in the loop. Algorithms are good, they perform well, but they’re not the same as a human reader. We’re really doing our homework with automated scoring before it’s integrated into a high-stakes ACT product, to demonstrate that it’s an appropriate solution to be integrated with humans for scoring purposes. So, you know, the article that came out, those are real concerns, and they’re really important ones, and there are many things we do to address them. We’ve got to have that strong research foundation, otherwise you will get really strong critics.
[Host] That’s Erin Yao talking about ACT’s scoring engine CRASE+, which stands for Constructed Response Automated Scoring Engine. The article she mentions, published on vice.com in August 2019, examines essay scoring from a critical perspective.
In this episode of the ACTNext Navigator podcast, we’ll go under the hood of ACT’s automated essay scoring engine, CRASE+. Our guests are Erin Yao and Scott Wood. They’ve been working for many years on CRASE, a product acquired in 2014 when ACT purchased Pacific Metrics. We’ll talk about automated essay scoring and some of the challenges of automated scoring using natural language processing. CRASE+ is a writing assessment tool that begins with human graders, who develop a rubric. Data from human graders is used to train the automatic grading on a large scale. Why auto scoring? Well, human graders alone can take weeks to grade a stack of essays. CRASE can grade a pile of essays in seconds.
Getting a machine to grade multiple-choice tests is easy. ACT has been doing it for 60 years, and it’s kind of our specialty. But how do you grade a short-answer or open-ended essay response?
Here’s Scott Wood talking about CRASE+.
[Scott Wood] CRASE is an acronym; it stands for Constructed Response Automated Scoring Engine. The engine itself has been around since about late 2007, so about 12 years at the time that we’re talking here. CRASE+ is our newest version of the engine. The engine has gone through several transitions and iterations over the years, so CRASE+ is the newest name for it.
We call it that because our hope is that the engine we’ve produced will do more than just score responses and essays. What we hope it will also do is provide feedback that students and examinees, as well as instructors and teachers, may find valuable. Because we’re providing more than just scores, we’ve added the plus to the name of the engine.
AES stands for automated essay scoring. The goal of automated essay scoring is to produce a computer program, an algorithm, or a script that will assign scores to examinees based on their essay writing, in a way that emulates how a human scorer or hand scorer might score that particular essay, or the essays for that particular product. The idea behind automated scoring in general is to help speed up the process of scoring. Human scoring is very effective, but it can be very slow, because human beings have to read the essays or responses, digest what’s in them, and turn that into a score based on the rubric they’ve been given. Automated essay scoring engines are able to do that very quickly, at all hours of the day, 24 hours a day, seven days a week.
[Host] And here’s Erin Yao again…
[Erin] You know, it’s really important to get that really timely feedback, where teachers might just not have time to do all the grading they want to do. We can think of this technology as an enabler that takes care of some of that less fun work, so that teachers can really dig in to what misconceptions their students have and what the class needs, and move forward, whether it’s in the same lesson or, you know, the next day. So we’re really looking to integrate CRASE+ with learning products. It already supports ACT online practice, but we’re looking now to take advantage of that high availability and things like that, to really get the formative feedback back to the people who need it.
[Scott] Usually it actually starts, believe it or not, with human score data. To train an automated essay scoring engine, what will typically happen in an application is that we, the engine trainers, will be sent human score data by the customer or client for the given item, or for the set of prompts they’ll use automated scoring with. From there we usually get quite a bit of very useful information. One of those things is that we usually get scores from multiple independent human readers. Often that data is double scored or beyond: a human rater gives a score to a particular response, a second rater, independent from the first, also gives a score to that response, and then, if necessary, when there is disagreement, somebody comes in, usually an expert reader or what’s called a table leader, and they’ll help resolve those disagreements. One rater gives a 2, another rater gives a 4, and the expert reader may come in and say, oh, that really was a 4, or they might split the difference and say it was really a 3.
When we have customers that do that, it gives us on the human scoring team a lot of information, because we can then see where some of those disagreements are. We’ll actually take the average of the human 1 and human 2 scores, the rater 1 and rater 2 scores. That information then gets conveyed to the engine: okay, this response is perhaps a borderline 2-3 or a borderline 3-4. The scoring engine can take that information, and when it goes and builds its statistical models, it can leverage that.
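The double-scoring workflow Scott describes can be sketched in a few lines. This is a hypothetical illustration based on his description, not CRASE+’s actual resolution logic; the function name and tie-breaking rules are assumptions.

```python
def resolve_scores(rater1, rater2, expert=None):
    """Turn two independent human scores into one training label.

    If the raters agree, use their shared score. If they disagree and
    an expert (table leader) score is available, defer to it; otherwise
    average the two, which conveys "borderline" information to the
    engine (e.g. scores of 2 and 4 average to 3.0).
    """
    if rater1 == rater2:
        return float(rater1)
    if expert is not None:
        return float(expert)
    return (rater1 + rater2) / 2.0

print(resolve_scores(3, 3))     # agreement -> 3.0
print(resolve_scores(3, 4))     # borderline 3-4 -> 3.5
print(resolve_scores(2, 4, 4))  # expert resolves -> 4.0
```

The averaged label is what lets the statistical model treat a borderline response differently from one every rater agreed on.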
[Host] So if there were disagreements among the humans on some of these responses, the engine will try its best to score and order them properly compared to other essays it sees. Where humans might not agree, the machine, and I’ll use the term machine for CRASE+, might split the difference, or come to some agreement, or at least be adjacent to their score if not match it.

[Scott] That’s true, that’s a good summary of it with regard to disagreement. The engine has to make a decision somehow. It has to slot that particular response into the proper scoring bucket, if you will. An engine like CRASE+ does have the advantage of knowing that in the training data there may be multiple instances of a given response, or multiple similar instances, and it can use the characteristics of that response, as well as characteristics of similar-sounding responses, to narrow down where that score should wind up, based on the algorithms and the modeling techniques that the engine uses.
[Host] Now we’re going to talk about some of the biases inherent in artificial intelligence and natural language processing. How do biases become trained into the model? Is there a way to avoid them?
[Erin] So it’s really important, because the algorithm only knows the data it’s seen, to make sure that data is of high quality. Many assessment programs have things they do to make sure their human rater data is quality data, so it doesn’t have those biases, but there is a concern out there: if somehow those biases slipped in, we don’t want an algorithm perpetuating the unfairness. So something we do with our algorithms after we train them is subgroup analyses, to determine whether we see any biases in how our model is performing. If we don’t see anything, that’s great. If we do, there’s the question of where it came from, right? Is it something that was present in the human rater data the model is trained on, or is it something new that is being added by the algorithm itself?
So those are really important conversations, and fairness is a big deal to us, so we want to eliminate the biases. Human agreement varies; raters don’t agree 100 percent of the time, and our goal is for our system to agree the way they agree with each other. So if they agree 80 percent of the time, our goal is to have our system agree with a human rater 80 percent of the time. Oftentimes what comes out of those conversations is exemplars: this particular response is a really good example of one that should receive a three, this one’s a really good example of one that should receive a score of four. Something we’re pretty excited about is that we can make our system aware of those responses, to increase its accuracy in identifying, you know, the threes and the fours and so on correctly.
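Erin’s target, matching the machine-human agreement rate to the human-human agreement rate, can be made concrete with a small sketch. The scores below are invented for illustration, and exact agreement is only one of several agreement statistics used in practice (quadratic weighted kappa is another common one).

```python
def exact_agreement(scores_a, scores_b):
    """Fraction of responses where two score sequences match exactly."""
    assert len(scores_a) == len(scores_b)
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

human1  = [3, 2, 4, 3, 1, 4, 2, 3, 3, 2]
human2  = [3, 2, 4, 2, 1, 4, 2, 3, 4, 2]   # humans agree on 8 of 10
machine = [3, 2, 3, 3, 1, 4, 2, 3, 3, 2]   # machine vs. human1: 9 of 10

print(exact_agreement(human1, human2))   # 0.8
print(exact_agreement(machine, human1))  # 0.9
```

On this toy data the machine slightly exceeds the human-human rate, which is the benchmark Erin describes: the system should agree with a human about as often as two humans agree with each other.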
Alina von Davier has done previous work on automated checks to ensure that algorithms are fair for subgroups, so that’s something we’re looking to add in the future as an automated check in our system. This is really important because, as you’re scoring operationally, you want to make sure that your algorithm is still working as expected. When we evaluate our models for fairness and things like that, it’s on the data we were provided with to build the models, and you want to make sure, as the model gets used, that it’s still fair and scoring to the high standards you expect of it. That’s something we’re looking to integrate into our system as an automated check: check subgroups. And another automated check we’re looking to do is making sure that model quality stays consistent over time. Over time, grading practices by human graders naturally change; even the same rubric can be applied differently over time.
So, given that our algorithm is going to score every response in the exact same way, we want to make sure that it still meets the standards we have for how it should agree with human raters. The same thing goes if anything changes in the testing program, right? If the population of students changes, whether a different grade is being given the test or just a different type of student is receiving it, all those things can affect how well our model works, because the assumption when we build the model is that it only knows the data.
So it knows the representative sample you gave it at the very beginning. If something changes, you need to reevaluate and update. One of the advantages of CRASE+ is that it has really high availability and really fast turnaround times. On average we’re scoring responses in about a second, and what’s really exciting about that is that, as we integrate CRASE with ACT products, students and teachers will be able to get results back even quicker.
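The two automated checks Erin mentions, per-subgroup agreement and agreement drift over time, could be sketched roughly as follows. The group labels, tolerance threshold, and function names here are illustrative assumptions, not ACT’s actual monitoring code.

```python
from collections import defaultdict

def agreement_by_subgroup(records):
    """records: iterable of (subgroup, human_score, machine_score).
    Returns exact machine-human agreement per subgroup, a simple
    fairness check of the kind described above."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, human, machine in records:
        totals[group] += 1
        hits[group] += int(human == machine)
    return {g: hits[g] / totals[g] for g in totals}

def drift_alert(baseline, recent, tolerance=0.05):
    """Flag when recent agreement drops more than `tolerance` below
    the agreement established at model-building time."""
    return (baseline - recent) > tolerance

records = [
    ("group_a", 3, 3), ("group_a", 2, 2), ("group_a", 4, 3),
    ("group_b", 3, 3), ("group_b", 2, 2), ("group_b", 4, 4),
]
print(agreement_by_subgroup(records))  # group_a lags group_b here
print(drift_alert(0.80, 0.78))         # False: within tolerance
print(drift_alert(0.80, 0.70))         # True: agreement has degraded
```

A gap between subgroups, or a drift alert, would trigger the follow-up question Erin raises: did the difference come from the human rater data, or from the model itself?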
[Host] Seems like it would be easy, doesn’t it? If you detect a bias, then you fix the training data, or maybe you revise the item question. But as Scott says, it’s more complicated than that.
[Scott] One thing that I thought was very important in that article was a quotation, I believe from a professor at the University of Washington, about this issue of potential bias toward certain subgroups by an automated scoring engine. The professor acknowledged that this is very much a problem in artificial intelligence in general, and automated scoring is an application of artificial intelligence. The professor was very clear that the goal of an artificial intelligence model or application is to find the trends that it sees in the data, and when it comes to automated scoring, it is also trying to find those trends in that data. Where automated scoring training starts is with human score data, so it’s possible that human scorers are introducing, I believe the phrase they used is, unconscious bias. When an automated scoring engine then reads in the human score data and tries to find those trends, it may pick up on these unintended biases that the human scorers are somehow introducing, and, as the Vice article said, end up magnifying them, because that’s what artificial intelligence algorithms do. So it really is a core problem right now in artificial intelligence.
I would argue that’s an open problem, and a very difficult one. Human beings are not, again, instituting or adding bias in a malicious way, or even a conscious way, but sometimes, you know, as human beings read essay after essay after essay, something could creep in. We don’t know the origins of it. We don’t know why that could be, and oftentimes it’s hard for us to figure out how to even get rid of it, to avoid modeling that kind of noise in the data. We shouldn’t be myopic about it.
Checking for bias is a very important step, an action that automated scoring professionals and engine trainers should be taking as part of the engine training process. It’s so important, in fact, that in The Standards for Educational and Psychological Testing, a very important set of standards for psychometricians, test development people, and assessment experts, automated scoring is mentioned very often, and when it is, it’s almost always in the same sentence as looking out for potential bias, to make sure that students from different sub-populations are being treated equitably.
It’s an important concern that the more psychometrically oriented automated scoring experts, especially, are very much aware of, and so there are approaches toward at least the detection of potential subgroup bias, or sub-population bias, within automated scoring. Now, the research is new enough that we don’t know what the next step is. Does that mean we shouldn’t be using that feature in score modeling? That probably is a reasonable next step. It may also mean that the human scorers, when they were creating their scores for the training sample, perhaps picked up on something, again unconsciously. You can fairly easily find when there are differences, but then what’s the next step?
How do you resolve it? I think that’s an open research question for a lot of people involved in subgroup analyses of all stripes, to be honest. Do we go back to the stimulus, to the item itself, and say we need to rework the whole question? That’s a tough call, and I’ve talked with other folks who deal with automated scoring, and they’ve asked the same kinds of questions.
Creating a test item is not a cheap process, as I understand it. A lot of time, a lot of resources, both people resources and monetary resources too. So for the automated scoring person to come back and say, hey, we’re seeing a subgroup difference on this particular characteristic of the essay, it’s easy for us to report that, but to then rehash the item, it’s hard to determine how you would even do that. What is it about the item prompt that would have to be changed for this effect to disappear, to go away? Revising the item is a very strong reaction, but who’s to say that the revision doesn’t then change another feature, which all of a sudden shows up as a subgroup difference?
So again, I would argue that it’s an open research question as to what the proper step is to take. It’s a little easier in the differential item functioning world, because those are often multiple-choice items, and it’s sometimes easier to swap them out, because there’s usually a bigger item pool. But even then that’s not a trivial thing to do either; that has ripple effects and consequences for the assessment, in psychometric terms. The short answer to your question is, I’m not sure, and I think there are other automated scoring experts who wouldn’t necessarily know what the next step is to resolving that subgroup difference without doing extraordinary damage to an item or to the assessment itself. [Music]
[Host] How would you explain the AI as a black box metaphor to someone?
[Scott] Well, let’s compare what CRASE+ looked like when I first started versus what CRASE+ could potentially look like with these so-called black box methods, these machine learning methods.
So when I started, it was called CRASE at the time, and it used multiple linear regression, the same kind of linear regression you would see in a college-level statistics course. What was nice about those models is that you had your set of features, you had the human scores, and you could create this regression model. Because linear regression is a somewhat familiar statistical modeling tool, a lot of people know it from the math and stats curricula of college, or perhaps even high school, and you can show them that equation. You can explain: this feature has a positive regression coefficient, so as this feature goes up, we would expect the predicted score to go up. Those kinds of things are fairly easy to explain even to a non-technical person, a stakeholder who might be using the engine. When it comes to the newer methods out there today, with machine learning and neural networks and some of these so-called black box approaches, you don’t have the same interpretability; you don’t have those same methods to establish interpretability in quite the same way.
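The interpretability Scott describes comes from the fact that a linear scoring model is just a weighted sum of features, so each coefficient can be read off directly. The features and weights below are invented purely for illustration; they are not the actual CRASE feature set.

```python
# Hypothetical linear scoring model: score = intercept + sum(w_i * x_i).
coefficients = {
    "intercept":        0.5,
    "word_count":       0.004,  # positive: score rises with length
    "spelling_errors": -0.10,   # negative: score falls with errors
    "vocab_diversity":  2.0,    # positive: richer vocabulary helps
}

def predict_score(features):
    """Weighted sum of essay features; every term is inspectable."""
    score = coefficients["intercept"]
    for name, value in features.items():
        score += coefficients[name] * value
    return score

essay = {"word_count": 400, "spelling_errors": 3, "vocab_diversity": 0.6}
print(round(predict_score(essay), 2))  # 3.0
```

Each coefficient supports a plain-language explanation of the kind Scott mentions, e.g. "each spelling error costs about a tenth of a score point in this model."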
I can’t write out a neural network. I can’t write it out, I don’t believe, in a nice simple equation and say, okay, well, this feature had a positive coefficient, so you can interpret it in such and such a way, and this one had a negative.
So that’s where it gets harder. We can’t rely on a simple equation anymore, so we have to rely on new methods, new approaches. For example, there’s an approach we’ve been using more often now called gradient boosted machines, or gradient boosted models. In that case you don’t necessarily get a nice regression equation, but you can get a list of what are called relative importances: a ranking of the features you’ve included in your model. This one was used most often when creating the regression trees that are part of the method, this one was used 40 percent of the time compared to the top feature, this one 14 percent of the time. You can see how that explanation is not as satisfying, I would say, speaking as a statistician, as a regression equation, and it’s more technical. Gradient boosted models are a modeling technique that’s not frequently taught in an introductory stats course. Now you have to explain how the model works, and you don’t have the techniques you can usually use to explain that the model is treating this feature this way while treating that feature another way. So it becomes more difficult to explain to a lay audience of stakeholders, and that’s a big challenge.
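The relative-importance report Scott describes (top feature at 100%, others as a percentage of it) can be sketched like this. The feature names and usage counts are invented for illustration; real gradient boosted implementations compute importances from split quality, not raw usage counts.

```python
# Invented counts of how often each feature was used in tree splits.
raw_usage = {
    "organization": 500,  # the top feature
    "word_count":   200,  # 40% of the top feature's usage
    "spelling":      70,  # 14% of the top feature's usage
}

# Scale so the most-used feature reads 100%, as in Scott's example.
top = max(raw_usage.values())
relative_importance = {f: 100 * n / top for f, n in raw_usage.items()}

for feature, pct in sorted(relative_importance.items(),
                           key=lambda kv: -kv[1]):
    print(f"{feature}: {pct:.0f}%")
```

A ranked percentage list like this tells you which features the model leans on, but, as Scott notes, it does not tell you the direction or size of each feature’s effect the way a regression coefficient does.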
And that’s something that a lot of machine learning folks are aware of. In some applications that use these machine learning methods, being able to explain that isn’t quite as important; people in certain applications may not care, they just care about what number comes out of the engine. But from my experience working in education applications, people want to know why the engine is doing what it’s doing. They want to know, and at least understand at a high level, why the score came out the way it did. From a psychometric standpoint that’s also important, for validity purposes to some extent, because, you know, people want to know that we’re not just coming up with a bunch of random characteristics about the essay and magically getting a score. They want those features to be grounded in the construct that’s being measured; they want them grounded in what makes for good writing, the characteristics of good writing. So again, it’s kind of give-and-take in automated scoring right now. People want to use these very fancy methods, and I’ve seen wonderful results from these machine learning methods, but they’re harder to explain to a lay audience, and when there is already some distrust in the models, amongst automated scoring critics especially, it’s a tall order to then sit down with folks and try to explain all of these new machine learning methods being used in engines across the industry.
ETS has their own automated scoring engine, and I have seen, in multiple publications and presentations, a flow chart that is their overview of the features that go into their automated scoring model. What’s wonderful about that is that they’re able to give you a pretty darn good picture of the kinds of things they’re looking for, while at the same time preserving their intellectual property and not giving away everything. And that’s another one of the difficulties here: most of the automated scoring engines out there are owned by companies, and companies want to protect their intellectual property, so there’s some hesitancy to reveal all of the details of what’s in these engines. But ETS has done a very good job of providing at least a high-level view without revealing all of the details. The other story that comes to mind is one here at ACT, actually. I was in conversations with our former senior vice president of research, and she pointed out that the features I had listed were a bit anonymous; I had only referred to them by number. And she said, you know, the industry is really moving to where we don’t want to just call a feature, Feature 12.
We want to say what it relates to, if nothing else: spelling, or the frequency of spelling errors, or whatever the feature is in reference to.
So I really do think there is a movement toward making these systems a little more open at the right level: not revealing too much intellectual property, but enough to give people at least some awareness of what’s going on, what we’re doing with their responses, and how the engine is doing what it’s doing. I think that’s important.
I’ve been doing some research over the last couple of years now; I have a book chapter coming out in early 2020. I’ve read through a lot of these criticisms and tried to figure out, okay, where are the critics coming from when it comes to automated scoring? And the black box thing, I think, is a very valid point.
People want to know what’s going on with their data; they want to know how their data is being handled and processed. If we can find a nice happy medium, meeting the needs of the stakeholders so they have awareness of what’s going on, while also meeting the business need of not revealing everything, I think that’s going to do a lot toward reducing the criticisms of automated scoring.
[Host] What is the book, what is the book chapter? Can you tell me?
[Scott] Yeah, absolutely. It’s about communication and public perception of automated scoring. The book itself is, I believe, the Handbook of Automated Scoring; I can provide the full title. It’s coming out in February 2020, and my chapter in particular goes through some of the history of criticism of automated scoring, which dates back to at least the late 1990s, maybe not 1999 exactly, but there were criticisms. Nobody really paid attention back then, because automated scoring really wasn’t a big thing. It certainly existed, it had been around for a while, but it wasn’t until probably the last decade that it became more visible and people became more aware of it, at the same time that people were becoming more aware of testing and assessment and student privacy and issues like that, and all of a sudden the criticisms just ballooned in terms of what people were saying. So the chapter goes through what those criticisms were and tries to organize them in a way that automated scoring experts can then attempt to address and handle. Certainly the criticisms are valid; some are a little easier to address than others, but they’re all worth at least listening to and considering, and doing our due diligence to work with stakeholders, critics, and people who have a stake in automated scoring, making sure that their voices are heard.
[Host] Thank you, Scott and Erin. I also want to thank their colleague Jacque Carlson, a data scientist who shared some background on CRASE+ during the ACT annual meeting last month, when we celebrated our 60th anniversary. You can learn more about CRASE+, the Constructed Response Automated Scoring Engine, on the act.org website.
Thanks for listening.