In Episode 6 of the ACTNext Navigator podcast, we discussed automated essay scoring. This time we cover automated item generation (AIG): using computers to create test questions quickly.
A lot of research goes into ACT tests. Every test item begins with psychometric grounding. Subject matter experts (SMEs) develop questions in math, reading comprehension, science, and graphic literacy. Items are field tested and answers are checked for validity. Assessments must measure ability, not reward random guesswork by the test-taker. In addition to accuracy, test items are extensively reviewed for fairness.
But that can take time, so test developers have created some ways to streamline the assessment pipeline.
In this episode (#7) of the ACTNext Navigator podcast, we discuss the history of AIG at ACT with Rick Meisner. He’s been with ACT for thirty years and developed some of the first AIG content for math in the BASIC programming language. Meisner also holds several patents related to AIG and automated scoring.
Later in the show, we’ll hear from Ian MacMillan and Brad Bolender. They’ve developed AIG software for the WorkKeys graphic literacy assessment (AIGL) and the Passage Organizer and Extractor (POE), respectively.
Each presented a poster at the 2019 Education Technology and Computational Psychometrics Symposium research poster and tech demo reception on October 9, 2019.
The views and opinions expressed in this podcast are those of the authors only and do not necessarily reflect the official policy or position of ACT, Inc.
[Brad Bolender] The poster we’re looking at is about the Passage Organizer and Extractor application; the abbreviation for that is POE. We designed this application to help test developers find excerpts from long texts that they could use to write items for our English test. Looking for passages in hardcover books is a long and costly process, and sometimes we write these from scratch, so the goal here is to put in digital texts that we can use and scan them with a computer. We analyze potential passages for things like reading level and cohesion; cohesion lets us know that a topic is intertwined throughout the whole passage, so it doesn’t abruptly change topics in the middle.
[Host] That’s Brad Bolender talking about the Passage Organizer and Extractor, or POE. We’ll join him again at the end of the show. I’m Adam Burke. In the last episode of the ACTNext Navigator podcast we talked about automated essay scoring. Today we’re discussing what goes into creating assessments, specifically the automation of assessment creation. Writing tests by hand is an intense process: test items take many hours to create, check, field test, and verify for fairness and accuracy.
My guests are Rick Meisner, Ian MacMillan, and Brad Bolender. Rick Meisner is an expert on automated item generation (AIG). He’s been with ACT for over 30 years and programmed ACT’s first automated item generation in BASIC.
I’m Rick Meisner, and I started at ACT in 1988 as a math test specialist, so I’ve been here about thirty-one years, the whole time in test development, in various positions from math specialist to more software-development-oriented positions. I’m currently with the automated content generation group in ACTNext.
[Host] Walk us through how you create a test. What goes into all that?
A really straightforward example from math might be if we want to test whether the student understands how to calculate the area of a rectangle. We could make this a word problem, maybe something like: a plot of land is 25 feet wide by 200 feet long; what’s the area in square feet? The key [correct answer] would be five thousand square feet, 25 times 200, and the distractors [incorrect answers] would try to reflect misconceptions. Maybe the student confuses perimeter with area, so the perimeter might be one of the answer choices. They might add the two numbers instead of multiplying them; that could be another distractor. Leave off a zero on the actual answer and answer 500 instead of 5,000; that could be another.
After the item is pretested, we like to take a look at how those distractors performed. If one doesn’t draw many responses, we might replace it in the future with something that works better, something a little closer or more attractive to the students.
[Host] So distractors, attractors, you’re playing a game between those two things.
[Host] And the answer would never be like three or elephant or something like that.
Yeah, that wouldn’t have much value because nobody would pick it and it wouldn’t add any information that would help us discriminate between high and low ability.
[Host] Okay, AIG is automated item generation. We want to explain this to people who maybe have never thought about how a test is made, or about how a machine could do it instead of someone writing out questions and answers by hand. So what do we mean by AIG, or automated item generation?
Again, we can illustrate that with our math item example. With most math items, the numbers in the stem are pretty arbitrarily chosen; you’re just picking numbers that will help show whether the student has the corresponding knowledge or skill. It doesn’t have to be 25 feet wide by 200 feet long; it could be 15 by 300. You have a wide range of choices for the numbers, so this is something a computer can handle with random number generation, producing all kinds of variations on the item that still test pretty much the same skill. If we turn this item into a more abstract item model, where we say a plot of land is fill-in-the-blank feet wide and fill-in-the-blank feet long, the computer can randomly fill in those blanks, and you can have 100 items of the same type, all a little different, all testing the same skill. There are a lot of advantages to having that many items to use. Parts of the stem stay the same; other parts are variable and get filled in. You recalculate the key so it’s still correct with the new numbers, and the distractors so they still reflect the same misconceptions as before.
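The item model Meisner describes can be sketched in a few lines. This is a hypothetical Python illustration, not ACT's actual code (which Meisner says he originally wrote in BASIC); the field names, number ranges, and distractor rules are illustrative assumptions.

```python
import random

def generate_area_item(rng):
    """Fill in the 'plot of land' item model with random dimensions,
    then recompute the key and the misconception-based distractors."""
    width = rng.randrange(5, 100, 5)       # e.g. 25
    length = rng.randrange(100, 500, 25)   # e.g. 200
    stem = (f"A plot of land is {width} feet wide and {length} feet long. "
            f"What is its area in square feet?")
    key = width * length  # correct answer: area
    distractors = {
        "perimeter": 2 * (width + length),  # confused perimeter with area
        "sum": width + length,              # added instead of multiplying
        "dropped_zero": key // 10,          # left off a zero
    }
    return {"stem": stem, "key": key, "distractors": distractors}

# Generate 100 variations of the same model, all testing the same skill.
items = [generate_area_item(random.Random(seed)) for seed in range(100)]
print(items[0]["stem"])
```

Each generated item keeps the fixed parts of the stem, varies the blanks, and recomputes both the key and the distractors so the misconceptions they capture stay intact.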
[Host] That makes sense, math seems easy and you’ve been doing AIG with math for a long time.
Since about 1992. I was still a math test specialist then, so I was seeing a lot of these items and thinking, if you have one item, you could just as easily have a hundred very similar items. We had just received these new machines on our desks called personal computers, which had a programming language built in, a very early version of BASIC. Playing around with that, it was pretty easy to generate items from models, so we began doing that, mostly to fill in weaknesses in the item pool. Certain types of items were a little underrepresented. Automated item generation fills in those weak areas, which makes test assembly go a little more smoothly because you’re not short on anything.
[Host] What was that like when you, or your team, said, let’s do this automatically? Was it kind of a game-changer?
It was embraced right away by management. They really wanted us to pursue this, because a typical item costs about a thousand dollars to develop once you add in item-writer costs and all the staff time involved in editing, fine-tuning, and all of the reviews: fairness reviews, external reviews. When you can generate an item that you know is going to be sound, because it’s based on an already tested item, and you can do it that quickly and easily, there’s quite a savings in time and cost.
[Host] But you’re not all the way there with math. There are still some challenges?
Right, a lot of math items involve graphics. If you change the numbers in the stem, any figure or graph will have to change too. If you’re plotting a line in the xy-coordinate plane and the slope changes or the points change, the graphic has to change also, and that’s a little more challenging to do. That’s still at an early stage, but we’re looking at possible tools we can incorporate to make it easier.
[Host] And is that the AIGL that we’ll talk about later?
That would apply to some of it, anything involving a data representation like a bar chart. The AIGL project is already able to incorporate those kinds of variations in bar graphs, and we hope to bring that over to the ACT math test as well.
[Host] So math, again, that seems pretty straightforward to me. You could just flip some variables in and out and come up with an unlimited number of item rewrites.
Yeah, it’s all based on numbers and algorithms and computers are really good with those.
[Host] Then English is another one of the ACT tests, and that gets a little trickier.
Yeah, it does. It took us a long time to even think of a way to do it, because now you’re dealing with a lot more entities. What’s relevant in an English item is punctuation, subject-verb agreement, words, phrases: linguistic rules instead of mathematical rules. Additionally, these items are all embedded in an essay, so there was the challenge of coming up with a way to generalize item generation across essays, and to do it with words, phrases, and types of phrases, which as far as we know hadn’t been done before at all.
[Host] So that’s an ongoing project?
Yeah, we’ve made quite a bit of progress. We did a prototype in 2016, I think late 2016, with item models.
[Host] Are there additional ways you can see to apply AIG in the future? What’s coming down the pike?
It can definitely be used for more than what we’re using it for.
Our particular need is generating content for assessments, but you could picture AIG use in the classroom. I think it’s already being used there to some extent for things like generating worksheets. If you want to test a particular skill or set of skills, you ask the AIG software to generate items in those areas. You could even generate parallel but different worksheets, where the items differ a bit from each other so there’s no chance of copying; within a class you could have four or five tests being passed out that all test the same skills. AIG makes that fairly straightforward, so it’s useful for practice and repeat assessment. The distractors in multiple-choice items capture misconceptions, so there’s some diagnostic value there. You could picture generating items, looking at patterns in the misconceptions, routing the student to remedial learning resources tied to those misconceptions, then retesting with more generated items on the same skill. It could become part of a whole integrated system of that kind.
[Host] We’ll talk more about classroom applications for AIG and passage extraction with Brad Bolender during his POE project later in the show. Now let’s visit with Ian MacMillan at his AIGL poster. AIGL is automated item generation for testing graphic literacy.
I’m at the ACTNext annual edtech event poster session, about to talk to Ian MacMillan about AIGL. Let’s find out what that is… Okay, we’re standing in front of the AIGL poster that Ian MacMillan made, and he’s going to talk a little bit about graphic literacy and automatically generating graphs and item questions. So Ian, tell me what you do and how you do it.
[Ian MacMillan] Yeah, so I work in the assessment strategy group within ACTNext, and what I’ve built here is a program that automatically generates items and graphics simultaneously. The graphic literacy test is a test of your ability to work with graphs: bar graphs, line graphs, tables, pie charts; pulling out information, drawing inferences, identifying trends, things like that, working with data. For this proof of concept I really just went for the simplest item type we have, which is locating information.
[Host] I’m looking at a very simple bar chart. It says cars sold in two countries across four quarters, and the countries are Lithuania and Macedonia. You found some data that created this graph?
I found the data in my own head; I made it up for this purpose. But any realistic data you might find could be fed into the program as well.
[Host] Okay, and you said you used to generate graphic literacy items by hand: researching data, getting permission to use it, then doing much the same thing, creating a graph and then writing items and questions.
Yes, that’s right. For a few years I was writing graphic literacy items across the entire test, across all difficulty levels and all types.
[Host] Each one could have taken you, I think you said, an hour or less, depending.
Oh yeah, there’s always a lot of variation. Sometimes you get lucky, sometimes you can make up data.
I used to be a college professor. I would make up graphs about student test scores just off the top of my head, you know, a regular distribution of scores, and create a graphic. That’s realistic, because that’s the same kind of data a teacher might use. So one way we create graphics is by drawing on our own expertise. But usually we find graphics about subjects where the person writing the item is not necessarily an expert in the subject matter. They may be very good at analyzing the data, but they might not know where the bars in the bar graph came from or what they represent; they just know how to ask questions about them.
Once the graphic is given: the graph is called a stimulus, and that’s the first part of the question. A test taker looks at the graph and then goes down to the items, the questions. Each question has one right answer, the key; it’s A, B, C, or D multiple choice, and then there are three of what we call distractors, the incorrect answers.
[Host] I want to get that terminology, so people know what we’re talking about.
That’s right, yes: the graphic stimulus and the items. What AIGL does is create the items and the graphics simultaneously, from the same data, in seconds.
[Host] So this is phase one. Is this kind of proof of concept?
Yes. For the sample I actually produced, it generated 192 items, four questions per graphic, alternating bar graphs and line graphs; 192 divided by four gets you somewhere around 40 or 50 graphics. The bars come in groups. Here we’ve got two groups, red bars and blue bars, but these four questions are simply asking about the red bars. You can see the questions are numbered one, two, three, four; those are literally the first questions output from the program.
Instead of talking about cars being sold in two countries (that’s the title of our graph), we might talk about boats being sold, and that creates a whole new universe from which you can start asking mostly the same questions. It’s all about being able to read a graph and answer accurately: how many cars, in this case, were sold.
[Host] This is one of the tests in WorkKeys.
Yes, graphic literacy. The others are workplace documents, which is kind of reading for information, and applied mathematics, basic math questions. So that’s phase one: this kind of simple bar graph and generating questions from it.
Phase two is going to focus on the graphics themselves, and then in phase three we’ll work to integrate the graphics with additional item types and additional skills: harder test questions. The proof of concept really just asked you to identify a particular point in a graphic. Some of the more complicated items ask you to identify a trend or recognize a pattern, and some of our hardest-level skills ask you to go beyond the data and draw an inference or make a decision. Different item types are going to require different types of graphics to go along with them. So again, phase two works on the graphics side and phase three integrates graphics with items. On the future side of the poster, we have a difficult bar graph, a high-moderate pie chart, a high-moderate line graph: different kinds of graphs you can generate from data.
[Host] You do it all in Python? How does it work in Python? You kind of feed it data?
It reads an Excel file and uses the data to create the graphs, the item text, and the distractors that you see in the items. The particular details that make each graphic what it is are based on the data file.
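The data-to-items flow MacMillan describes can be sketched roughly as follows. This is a hypothetical illustration, not the actual AIGL code: it uses CSV text as a stand-in for the Excel input, made-up sales figures like the poster's, and skips the graph-drawing step, generating only "locating information" items whose distractors are other values from the same table.

```python
import csv
import io
import random

# Hypothetical CSV stand-in for the Excel data file the real tool reads.
DATA = """quarter,Lithuania,Macedonia
Q1,120,95
Q2,140,110
Q3,90,130
Q4,160,105
"""

def locate_items(csv_text, series="Lithuania", n_distractors=3):
    """Generate 'locating information' items: one question per row, asking
    for the value of one bar, with other table values as distractors."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [int(r[series]) for r in rows]
    items = []
    for row, key in zip(rows, values):
        pool = [v for v in values if v != key]          # wrong answers only
        distractors = random.sample(pool, min(n_distractors, len(pool)))
        items.append({
            "stem": f"How many cars were sold in {series} in {row['quarter']}?",
            "key": key,
            "distractors": distractors,
        })
    return items

for item in locate_items(DATA):
    print(item["stem"], item["key"], item["distractors"])
```

Swapping the data file (boats instead of cars, different countries) would regenerate both the chart and the questions, which is the point MacMillan makes next.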
Right now we’re more interested in exploring the capabilities we can build and seeing what’s possible. I have been in some discussions with people who are currently writing graphic literacy items manually; they’re actually part of the project team, and I’ll talk with them as we go. But for now I’m mostly working by myself. Because I wrote items for the test for several years, I know exactly how it’s supposed to look, so in a way I’m my own consultant: I’m both the subject matter expert and the programmer.
[Host] That’s great, thank you. Now let’s head over to the poster presentation from Brad Bolender. Brad has created a program that identifies, in publicly available texts and books, passages that can be used for reading and science test items. Users can select variables to get the exact length and grade level they want to assess. Technically, POE is not automated item generation, but it’s a tool that helps find text passages, which leads to the creation of test items. Why don’t you introduce yourself and then tell me about POE.
I’m Brad Bolender. I’m talking about the passage organizer extractor application, otherwise known as POE.
I’ve been at ACT for about 12 years. I started on our English language arts development team, moved over to more computational kinds of work, and have now moved into ACTNext. POE is an application that can take a whole book, slice it up into pieces, and pull out sections that meet certain criteria. Those could be criteria for a test developer or for a learner. We designed this application to help test developers find excerpts from long texts that they could use to write items for our English test.
Looking for passages in hardcover books is a long and costly process, and sometimes we write these from scratch, so the goal here is to put in digital texts that we can use and scan them with a computer. We analyze potential passages for things like reading level and cohesion; cohesion lets us know that a topic is intertwined throughout the whole passage, so it doesn’t abruptly change topics in the middle. We can compare the potential passages output from this application to a typical English essay, and those that are most similar are sorted to the top, so the test developers can look for ones that are close to a final state they can use.
A few teachers actually stopped by to look at this poster, and they were interested in the possibility of inputting book-length texts that students are interested in and slicing them up into passages they could use for lessons. That’s something we hadn’t even really considered; we were designing this application to help test developers, but it would be great if teachers could use it as a resource too.
[Host] I see you have some sliders here, so you can set a word count range, a grade level range, a sentence count, and a paragraph count. You can slide these a lot of different ways and then see how many little 300-400 word passages in a book might be good for a test.
That’s exactly right.
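The slicing-and-filtering idea behind those sliders can be sketched in a few lines. POE itself is built in R with the Shiny framework (as Bolender notes later), so this Python version is only an illustration of the concept; the sentence splitter is naive and the thresholds are placeholders for the slider values, not POE's actual criteria, which also score grade level and cohesion.

```python
import re

def split_passages(text, sentences_per_passage=12):
    """Slice a long text into candidate passages of N sentences each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + sentences_per_passage])
            for i in range(0, len(sentences), sentences_per_passage)]

def passes_filters(passage, min_words=300, max_words=400):
    """Keep only passages inside the slider ranges (word count here;
    the real tool also filters on grade level, cohesion, etc.)."""
    return min_words <= len(passage.split()) <= max_words

# Toy demo: a 24-sentence "book" sliced into 12-sentence passages,
# then filtered by a loose word-count range.
book = " ".join(f"Sentence number {i} ends here." for i in range(24))
candidates = [p for p in split_passages(book)
              if passes_filters(p, min_words=10, max_words=100)]
print(len(candidates))  # prints 2
```

Ranking the surviving candidates by similarity to a typical English essay, as Bolender describes, would then be a sort over this filtered list.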
[Host] Okay and no item generation, is that right?
We’re working on that now, so we can start to analyze these passages and know how many items we can generate from each one. That way we can pick the ones that have the most value to us and automatically generate the items.
[Host] It gives you a ranked list already, so you know these are probably the good ones, and then you might not even need to get to the bottom of the list.
We’re implementing that. We’ve done it on the back end, on the research bench, but we’re still working on incorporating that part into the application. And this says version two.
The next version is going to analyze the passages to find how many items we can generate, just like we talked about. That’s what we’re working on now.
[Host] What else can you tell me about this?
Right now, what we’re working on is for it to find opportunities for items. Rick Meisner has developed the AEGIS application that generates the actual English items. Developing stimulus passages for the English language test requires a considerable investment of resources, so to assist with that process we developed the Passage Organizer and Extractor, or POE, application prototype. Test developers can upload a full book-length text, and POE segments the text into potential excerpts that meet criteria such as length, reading level, and cohesion.
[Host] And you developed this?
[Host] What is “Shiny”?
It was built in the R programming language with the Shiny application framework, and we’ve posted it on Amazon Web Services, so developers can access it through a web browser right at their desks.
[Host] Is this being used today for tests?
We’ve found five passages with this application that were developed into units for the ACT test and are in field testing right now. If the field testing comes back fine, they’ll make it all the way out to the test.
[Host] What’s your background? Were you an SME before this or did you do item generation?
I was a language arts test developer for about three years before I moved over and started applying some of these computational approaches.
[Host] When did you learn how to program in R?
I learned how to program in R about five years ago. It attracted me because it’s a research programming language, and I was interested in what we could do with ACT datasets, so I dove in and learned everything I could about it: did tutorials, read blogs from rock-star R programmers, things like that.
[Host] Were you a coder before?
Not really. I had tinkered, but I had never gotten anywhere near building a whole application.
[Host] Wow, how long did it take you to build POE?
I’ve been working on it off and on for a couple years.
[Host] That’s our show. Thank you Brad, Ian and Rick and thank you for listening to ACTNext Navigator podcast. If you missed the 2019 poster show, we’ll do another one next year, probably in October, at the annual ACTNext education technology event.