Navigator Episode 5: Validating cognitive processes with eye-tracking
In this episode, guest Jay Thomas, Senior Assessment Designer at ACT, discusses how eye-tracking can go beyond psychometrics to evaluate and validate assessment and testing.
There are several aspects of recording and sampling eye movement. First are saccades, the rapid eye movements made during reading that don’t always run left to right. Sometimes our eyes make retrograde saccades to backtrack through text. It’s estimated that humans make over 100,000 saccades daily, including during rapid eye movement (REM) sleep cycles when we’re dreaming.
Other obvious parts are fixations, moments when the eyes are focused on a particular location, and blinking.
Another important piece of eye-tracking that has been around for decades is pupillometry, the measurement of pupil dilation.
Dilation can be interpreted to measure thinking and cognitive effort, in particular the size changes and acceleration of dilation. Pupil dilation is an autonomic bodily response. Unlike breathing or heart rates, which can be controlled or modified to trick lie detectors, you cannot “fake” or consciously control dilating your pupils.
Thomas says that eye-tracking goes beyond psychometrics to validate tests and give insight into cognitive thought processes. With Langenfeld, Zhu, and Morris, he created a formula to measure Total Cognitive Effort (TCE) for test items. He walks us through the TCE formula (above) in the podcast.
Before joining ACT, Mr. Thomas was a science teacher for 19 years and also worked for Kaplan Test Prep.
[Jay Thomas] Now that we can do this, it’s not going to replace psychometrics, but it’s going to allow us, like we said before, to triangulate better conclusions about test takers, and that’s our goal. We want to make claims that are valid about people, about what they know and can do, based upon constructs that we think are important and measurable.
[Host] Welcome to the ACTNext Navigator podcast. I’m Adam Burke. Today we’re talking about using eye-tracking data to evaluate and validate assessment and testing. Eye tracking: what is it, and why do you want to know about it? Eye movement is a good way to tell how people think and even how much they are thinking. Saccades, fixations and blinks are all part of eye tracking.
Saccades are the rapid eye movements we make when reading and when dreaming in REM sleep. Our eyes make over 100,000 saccades per day.
Another part of eye tracking is pupillometry, the measurement of pupil dilation. The pupil is controlled automatically by the body and nervous system. The size changes and acceleration of pupil dilation can be interpreted to measure thinking and cognitive effort.
Our guest today is Jay Thomas. He’s a senior assessment designer with ACTNext. Before he joined ACT he worked with Kaplan Test Prep and he was a science teacher for 19 years. Later in the show we’ll talk about his classroom experience.
Jay presented a Tech Talk, which is ACT’s version of a TED Talk, in September 2019. He mentioned that AERA, APA and the NCME call for evidence of cognitive processes including eye tracking to validate assessment in their publication, The Standards for Educational and Psychological Testing.
[Thomas] You can train people to lower their heart rate. You can train people to beat a lie detector, because there are both sympathetic and parasympathetic parts of blood pressure and heart rate and things like that that you can learn to control. Pupils are not something where you have any kind of conscious control: you can’t say, “I’m not going to let my pupil dilate.” Your pupil naturally dilates whenever you are engaged in cognitive effort, and the speed and the acceleration of that dilation can be analyzed by the software to find out exactly how much instantaneous effort you’re giving. It’s not something that you can teach people to beat, unlike other physiological measures.
[Host] Is pupillometry a subset or a part of eye tracking?
[Thomas] Yes. When you think about traditional eye tracking, there are a lot of different kinds of things you can do. The most basic and easiest to understand is a heat map, which is based upon the percentage of time that an individual looks at a particular part of a piece of paper or a screen. One color will indicate a large percentage of time, and white space indicates no looks in that area. This is used a lot by web designers and user experience folks to determine whether or not the way that they design their web pages is going to work. For instance, Google and Apple and Amazon spend a lot of money on eye-tracking research to make certain that wherever they’re putting their ads or the “buy now” sorts of things is going to draw enough attention to draw enough revenue. That’s really what’s driven a lot of innovation in eye-tracking equipment and research: this idea of user experience and web design.
Only in the last few years has it become something that we can use in testing, because now we can do the same thing with a student, or any kind of test taker, with an item, and we can go, “Wow, that person spent a lot of time on the foils instead of looking for the answer in the graph or in the passage.” We can get a very quick view (a picture tells a thousand words) of the differences between, say, someone with high literacy skills or high graphic literacy skills or high math skills and someone with low skills in those areas, because they don’t look at the same information. In addition to that, you have sequence maps, which play out like a movie, so they show you where they looked and in what order.
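The heat map idea described here, the share of viewing time that falls on each part of the screen, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(x, y, duration)` fixation format, the grid cell size, and the function name `dwell_share` are hypothetical and are not taken from any eye-tracking vendor's software.

```python
from collections import defaultdict


def dwell_share(fixations, cell=100):
    """Return {(col, row): share of total fixation time} for a coarse grid.

    fixations: iterable of (x, y, duration) tuples in screen pixels and seconds
               (a hypothetical export format, assumed for this sketch).
    cell:      grid cell size in pixels.
    """
    totals = defaultdict(float)
    for x, y, duration in fixations:
        totals[(int(x // cell), int(y // cell))] += duration
    total_time = sum(totals.values())
    if total_time == 0:
        return {}
    # Cells that never appear in the result correspond to "white space".
    return {grid_cell: t / total_time for grid_cell, t in totals.items()}


# Made-up fixations: two near the top left, one longer look lower right.
fixations = [(120, 80, 0.25), (130, 90, 0.25), (640, 400, 0.5)]
shares = dwell_share(fixations)
```

Cells with a large share would be drawn in the "hot" color; comparing the share maps of high and low scorers on the same item is the kind of contrast Jay describes.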
You can see how people answer the question and whether it matches what you propose that someone with high skills would do. For instance, if you think about the way that somebody with high graphic literacy skill would approach an assessment item, or how a graphic literacy task would work, you’re gonna look at the graphic and make sense of it, then you’re gonna look at the question and go, “Oh, I need to find that part of the graphic,” and you’re gonna scan and look for it.
The reading one was the one that just blew me away the first time I saw it. A good reader in English, or any other language that reads left to right, will predominantly have fixations on particular words and then jump from left to right to the next word. Then they will jump backwards, in what’s called a retrograde saccade, when they need to make better sense of something, and then they’ll go left, right, and down, and so on. People who are poor readers don’t do that. Particularly in a testing situation, what we found was that they used a word-search technique: they found a word in an answer, and they went hunting for that word in vertical scans and didn’t read horizontally like you would to understand a passage. “Oh, I need a word with a K, and I’m gonna go find the only 12 K’s in the word search.”
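The contrast between left-to-right reading with occasional retrograde saccades and vertical "word search" scanning can be illustrated with a rough classifier over consecutive fixation positions. This is a simplified sketch, not the method used in the study: the `line_height` threshold and the three labels are assumptions, and the return sweep to the start of the next line is counted as "vertical" here.

```python
def classify_saccades(fixations, line_height=30.0):
    """Label each jump between consecutive fixations.

    fixations:   list of (x, y) screen positions in reading order (assumed format).
    line_height: approximate text line height in pixels (an assumption).
    """
    labels = []
    for (x0, y0), (x1, y1) in zip(fixations, fixations[1:]):
        if abs(y1 - y0) > line_height / 2:
            labels.append("vertical")      # moved to a different line
        elif x1 >= x0:
            labels.append("forward")       # left-to-right along the line
        else:
            labels.append("retrograde")    # backward jump on the same line
    return labels


def vertical_share(labels):
    """Fraction of jumps that changed line; high values suggest scanning."""
    return labels.count("vertical") / len(labels) if labels else 0.0


# Made-up gaze paths: one reads along a line, one scans straight down.
reader = [(100, 100), (180, 100), (260, 100), (200, 100), (300, 100), (100, 130)]
scanner = [(400, 100), (400, 130), (400, 160), (400, 190)]
reader_labels = classify_saccades(reader)
scanner_labels = classify_saccades(scanner)
```

A high `vertical_share` on a reading passage would flag the hunting pattern Jay describes, though it cannot say whether that is test-taking behavior or how the person normally reads.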
I wish we had asked the question: is this the way that you read normally? Because we’re not certain whether that’s maladaptive testing behavior or whether this is the way this person reads normally: that they don’t read left to right all the time and that they’re just scanning for key words that might help them to do something.
So in our future research we have those questions planned: is this the way that you would normally read, like a textbook for your class or an article that you had to read for something? Then we can maybe draw some more claims about that, having never done the research before. To be perfectly honest, people have been writing about eye-tracking research in testing: ETS had an article in 1969 saying that we should be doing this, and the last two editions of the Standards for Educational and Psychological Testing have mentioned that evidence of cognitive processes, including eye tracking, should be gathered to validate assessment. It’s been there, but people don’t do it because up until recently the equipment was very expensive.
[Host] I asked Jay how eye tracking goes beyond psychometrics to provide validity and illustrations of cognitive thought processes.
[Thomas] Psychometrics can tell you what percentage of a population got it right and how the different groups of the population (quintiles or thirds or however you want to do the psychometrics) are different from each other, but it doesn’t tell you about the actual processes by which people got to those answers. Did they get to the answers because they know how to do the math, or because they’ve learned some trick in test-taking skills? The other part of it is that in psychometrics the assumption is that if very few people get something right, so the p-value is low or, in IRT, the b parameter is high, then you would say that item is difficult. However, that ignores the effects of good teaching. Good teaching can take something that’s very cognitively complex, a complex construct, and if it’s taught well those students will get it right a very high percentage of the time, which would make a psychometrician go, “Oh, that’s easy. In this class 90% of the students got it right, and therefore it’s an easy task.” But it’s not an easy task; it’s just a well-taught task. That’s one of the problems of psychometrics: it ignores the impact of quality instruction by thinking that everything’s going to be along some sort of a bell curve.
The idea is that when you are designing assessments, you should have given forethought to the claims that you’re going to be making about test takers, how the constructs that you want to measure are related to those claims, how they give you evidence that test takers know and can do what you say they know and can do, and why it’s relevant to whatever claims you want to make.
So for instance, if you are gonna make claims that students understand three-dimensional science based upon NGSS, you need to have assessment items and tasks that together ask about all three strands: about practices, about cross-cutting concepts in science, and about scientific knowledge. If you only ask about one of those three, then you haven’t found information that lets you make claims about what students know and can do across all three. So you really want to come in with this kind of scientific approach: what evidence do I need to make the claims I’m going to make? If you think about the Toulmin claim, evidence, reasoning argumentation model, it’s really applying that in a scientific way to assessment.
If you look at the Standards, there are five parts of validity, and validity is really about the claims you’re gonna make about test takers, about what they know and can do. The psychometric research has always been very easy to collect. You can pretest an item on two thousand people and you can vet it. You can apply classical theory, you can apply IRT, you can apply all sorts of other ways of getting evidence about reliability and those sorts of things, but they don’t tell you anything about how people got to those answers, those cognitive processes.
What they discovered a very long time ago was that when you are engaged in cognitive effort, your pupils will rapidly dilate, and this is in contrast to the slow constriction you get when you’re exposed to a bright light. For a long time, what they would do is give you some memory tasks. They would take you into a dark room and put your head in something that would not allow it to move, so they could get a really good picture. In this dark room, across the room from you on a screen, they would show a list of, say, ten letters, then they would wait five seconds and say, “Okay, now recall it,” and every half second they would take a picture of your pupils. Based upon the diameter of your pupil, they said the bigger the diameter got, the more cognitive effort you were engaging in.
Well, there are a couple of problems with that, a lot of them in terms of testing. You can’t put somebody in a taxonomic shell where their head is locked into a vise so they can’t move it left and right, give them the test in a dark room, and take a picture every half second. Because of those limitations, the tasks were mostly memory and quick math (they gave you two two-digit numbers to multiply, those sorts of things), and they were only dealing with the final result of how big the pupil was. Normally they would take all these pictures, blow them up, and somebody would take a micrometer or a caliper and measure how big the diameter of the eye was in millimeters.
As the technology got better and you could start taking 60 readings per second of somebody’s eyes, or 250 or whatever, a woman by the name of Sandra Marshall, who works at the University of San Diego, came up with the Index of Cognitive Activity (ICA). She realized that it’s really the speed and the acceleration of the pupil that measure how much effort there is. She built upon Kahneman and a bunch of other people’s earlier work and came up with this measure that we’re using now. It’s based on much better, smaller time-increment measurements, so we can really get instantaneous load. For the research that we published, we used 60 readings per second, and we chose that only because it matched the frame rate of the screen.
[Host] Okay, what is cognitive effort and how do you talk about it?
[Thomas] So technically what the software measures is called the Index of Cognitive Activity, and it’s your instantaneous cognitive load relative to some baseline, based upon a Fourier transformation that’s inside the software. They can determine what your baseline cognitive effort is just from where your pupils go with a change in lighting and so on. So your instantaneous cognitive effort is based upon this increase of your pupil and the rate at which it goes. What we decided to do was to measure cognitive effort.
We said that we were going to integrate that over time. So if during a particular saccade of, let’s say, 0.005 seconds your cognitive activity was 0.65 of your maximum, we said, okay, we’re gonna take 0.65 times 0.05, the time range, using a rectangle approximation, and then add up all of those little instantaneous efforts to come up with a total effort over a task. One of the nice things about the software and this equipment is that it takes hundreds, thousands of readings for everybody as they’re doing these tasks.
So if you think about a typical task that takes 30 to 120 seconds, in 30 seconds you’re getting, you know, 1,800 readings of this person, so there’s an awful lot of data there. Since it’s all spit out in kind of a spreadsheet format, it’s relatively easy to just say: what’s the change in time, what’s the cognitive activity, multiply them together, and then do a running total. So you can measure the cognitive effort from the beginning of a task to the end of a task, for each individual, for each task that you give them. Anybody who’s studying educational psychology or cognitive psychology would go, “Okay, yeah, if I have a little bit of effort for a little bit of time, and then I have a little bit more, well, that’s a little bit more effort,” and you just keep adding them up. When that number becomes very large, you end up with a large amount of cognitive effort.
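The "multiply them together and then do a running total" step can be sketched in Python. This is a minimal illustration, not the team's actual spreadsheet or SAS workflow: the `(duration_seconds, ica)` sample format and the function names are assumptions, and the ICA values below are made up.

```python
from itertools import accumulate


def total_cognitive_effort(samples):
    """Rectangle approximation: sum of ICA x duration over all samples.

    samples: iterable of (duration_seconds, ica) pairs, with ica between 0 and 1
             (a hypothetical layout standing in for the exported spreadsheet).
    """
    return sum(duration * ica for duration, ica in samples)


def running_effort(samples):
    """Running total of effort after each measurement."""
    return list(accumulate(duration * ica for duration, ica in samples))


# Two made-up measurements: 0.5 s at ICA 0.5, then 0.25 s at ICA 1.0.
samples = [(0.5, 0.5), (0.25, 1.0)]
tce = total_cognitive_effort(samples)
running = running_effort(samples)

# A 30-second task sampled at 60 readings per second gives 1,800 readings.
readings_30s = [(1 / 60, 0.5)] * (30 * 60)
tce_30s = total_cognitive_effort(readings_30s)
```

With a constant ICA of 0.5 over 30 seconds, the total comes out to about 15 in units of ICA-seconds; the unit depends only on the time unit chosen.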
The other part of this is that there’s a maximum cognitive activity that people are willing to give. Different people have called it different things, but this maximum cognitive effort is a function of how many things you can hold in your working memory, and your skills will determine whether or not you can do a particular task, because it might be too complex for you: in order to do it you would have to have way too much working memory. So I think measuring the maximum cognitive load of something is important, because that will tell you whether or not somebody’s engaging in what’s commonly called type 1 or type 2 thinking. But I think that measure of the total cognitive effort over time is gonna be really important. We’re beginning to try to work with Steven Wise on looking at his measure of engagement, whether or not people are meaningfully engaged in an item and a task. We would argue that if the TCE doesn’t exceed a certain amount, then the person didn’t engage in meaningful effort, regardless of what the latency is. Latency is what’s measured with computer-based tests: you know when the item started and you know when it ended, but you don’t know where they looked, or how hard they were thinking, or whether they were just staring at the screen. Eye tracking allows us to know: were they engaging in meaningful effort, were they looking at important parts of the screen, or were they staring into blank space? So total cognitive effort is another way of being able to tell whether or not a test taker is engaged.
It would be awesome if we got to a point where this hardware and software was integrated into the test, where we could go, “You know, there’s no point in wasting any more of your time. You are not actively giving us any mental effort that’s gonna let us know anything about what you know and can do. We know that you have tapped out. We know that the maximum you are is level X, and we can give you all these level II questions and you might guess some, and we might learn something about the psychometric guessing parameters or the discrimination of that item, but we’re not gonna learn anything about you.” So we could use these to greatly speed up diagnostic testing, to get to the point where we’re not wasting time. Or we could go, “Wow, with this person there’s so little cognitive effort that we know these questions aren’t challenging to them. Let’s jump to the next level.”
When you think about adaptive testing, that normally is based upon, again, IRT parameters and the maximum information function. Well, maybe maximum effort could be more important than the maximum information for the whole population, because then we can focus in on, “Oh, this item targets these skills and required more effort for this person to get right, and this other one that’s supposed to be equally difficult engaged almost no cognitive effort, so they must have that one fairly well mastered,” because they’re showing evidence of mastery as opposed to novice behavior on those items.
My guess is that technology is far down the road, twenty or thirty years, because of the cost of the adaptive testing that you can do that with: it has to be in all of these testing devices, and we’d have to find a way to make sure that these computer-based tests don’t have other things that add to cognitive effort and mess things up. Is scrolling really different from reading on paper and pencil? If you have to scroll, and that means you have to hold more in your working memory of what’s going on, how is that going to affect your cognitive effort? We haven’t done enough comparisons of paper and pencil to computer to be able to draw those conclusions yet, so that’s going to take a bunch of studies. Psychometrically you could go, “Well, it does this or does this,” but then we can look at the vision patterns and the gaze patterns and the cognitive effort to draw those conclusions.
That evidence of cognitive processes is time-consuming to collect and usually has small n values. If you’re doing think-alouds, you can’t do a think-aloud on a thousand students!
It’s just impractical. You can’t do a thousand students in eye tracking with current technology, so the n values are very small, which makes them suspect to people who need to have statistical significance and not just the kind of trends we’re looking at. One of the great things that was said to me by one of the psychometricians here was when we showed the difference between high scorers and low scorers, showed the heat map for this item, and she’s like, “I don’t need a t-test to know that those low scorers are doing something significantly different from the people on the left. Whether it’s statistically significant or not, the heat map tells me that there’s a difference.” That’s one of the problems: we end up sometimes a slave to the psychometric properties of statistical significance, and so you lose things in small sample sizes.
[Host] If you’ve listened this far, maybe you want to know about the formula for TCE, total cognitive effort, which measures thinking with the pupillometry data.
I put this TCE equation on the webpage. Jay is going to explain it a little bit further here. I want you to kind of walk us through this formula.
[Thomas] Okay, so the formula: total cognitive effort of person i. The subscripts are all about the items and the people.
So this is TCE for one item.
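For readers following along without the equation image, the formula can be reconstructed from Jay's spoken description as a sum over the n measurements recorded for one item. The notation below is an assumption pieced together from the walk-through (i for the person, j for the measurement, t_j for the duration of measurement j); the published equation may use different or additional subscripts.

```latex
\mathrm{TCE}_{i} \;=\; \sum_{j=1}^{n} \mathrm{ICA}_{j} \cdot t_{j}
```

Each term is one rectangle in the approximation: an ICA value between 0 and 1 multiplied by the time that measurement covers.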
Total cognitive effort, TCE, is our total cognitive effort for the item. We’ve got the summation, so we’re saying we’re just gonna add up whatever is to the right of the summation, from j equals 1, which would be the first measurement, to n, the last measurement: from when the item starts till it ends. We define the item as starting when the entire item was on the computer screen and could be read, so the page has finished loading completely, and we measure the end of the item as when the test taker clicked on an answer and clicked “next.” We didn’t really care what was going on cognitively for them as the computer was trying to send the answer, reload the next page, and wait. We considered that to be dead space, because from our point of view it really had nothing to do with what we were interested in. Here ICA is this Index of Cognitive Activity that the software spits out, based upon the Fourier transformation of changes in the pupil, the velocity and the acceleration changes, for measurement j.
So again, this would be the first measurement, and the time is the time for that measurement. When you’re doing eye tracking you get saccades, which are rapid jumps of the eyes from one place to another, and you get fixations, which are when the eye moves very little over an extended period of time, usually between 0.15 and 0.3 seconds. That doesn’t seem like an extended period of time, but for eye movements, when you think about the way that you actually use your eyes, that’s a long time. And you get blinks. Every blink is measured: based upon what your pupil diameter was before and after the blink, the software can go, okay, in order to do that it had to accelerate at this rate, and its velocity is this at this point. The software will spit out the time of each one of those kinds of measurements, and then you just multiply the ICA, which is a number between 0 and 1, by the time, generally speaking in milliseconds, but you can pick any unit of time that is convenient. Milliseconds are convenient because almost all of the measurements are really, really brief. So for each one of those measurements you multiply those two things together, and then you just add them up: the first fixation, then there’s a saccade, then there’s another fixation, and in the middle there’s a blink. In a spreadsheet, or using SAS or anything, it’s relatively easy to just say, okay, multiply across, add down, put in the timestamps for when you want to begin and end, and it will spit that out for each item.
To be perfectly honest, the hardest part of doing all this eye-tracking research with testing is the fact that all of this equipment and software was designed for web designers.
Now at Amazon, when you click on an item you go to a new URL. In test-taking software the URL stays the same, and when you click on “next” an applet actually loads; there’s some sort of hidden program in the back. The URL doesn’t change, and the computer and the software think that you haven’t gone anywhere, so you actually have to manually go in and process it: this is the beginning of the page and this is the end of the page. One of the things that you lose is that the software is designed to automatically know where the scroll point of the screen is the moment that page loads, because it’s part of the HTML data. The problem with the way that we do testing is that for most programs, for test security, it’s the same HTML page, so wherever the scroll line is on that first item is where the computer thinks the scroll line is for every item. If you have items that have different scrolling, it becomes more difficult, because you actually have to take a screenshot of the screen and then overlay the heat map on top of it. Whenever there’s no scrolling, that’s not an issue, because you screenshot the whole screen and everything lays on perfectly in x and y coordinates. But if there is scrolling, then your heat map is suspect as to where it is. You know, “Oh, it’s from here down, but I’m not certain where it is.” So you want to be very careful about how much you judge from those heat maps, other than that if there’s a big gap around an area, you know that they didn’t look in that area before or after they scrolled down. That’s one of the things that comes out of it.
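To make the scrolling problem concrete, here is a small Python sketch of what the correction would look like if the scroll position were known at each moment. That is exactly the information Jay says is lost when every item shares one HTML page, so the timestamped `scroll_events` log here is purely hypothetical, as are the function name and data layout.

```python
from bisect import bisect_right


def to_page_coords(gaze_samples, scroll_events):
    """Shift screen-space gaze points into page space using a scroll log.

    gaze_samples:  list of (t, x, y) in seconds and screen pixels (assumed format).
    scroll_events: list of (t, scroll_y) sorted by time, giving the vertical
                   scroll offset in effect from time t onward (hypothetical log).
    """
    times = [t for t, _ in scroll_events]
    page_points = []
    for t, x, y in gaze_samples:
        k = bisect_right(times, t) - 1          # latest scroll event at or before t
        offset = scroll_events[k][1] if k >= 0 else 0
        page_points.append((t, x, y + offset))  # same screen spot, different page spot
    return page_points


# The test taker scrolls down 300 px at t = 2.0 s, then looks at the same screen spot.
scroll_events = [(0.0, 0), (2.0, 300)]
gaze = [(1.0, 50, 400), (2.5, 50, 400)]
page_gaze = to_page_coords(gaze, scroll_events)
```

Without the offset, both gaze points would land on the same pixel of a screenshot even though they fall on different parts of the item, which is why heat maps over scrolled items are suspect.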
[Host] That’s a challenge.
[Thomas] It is a challenge. Scrolling is a challenge. We have some data that we have collected but not yet analyzed from an NSF-funded project that we’ve worked on, where each page actually loads independently, so when you click on “next” it actually has a new URL. I’m looking forward to doing that. We have a very small sample size. We actually went to Ann Arbor and tested a bunch of students who were going to be taking these tests in science class anyway. This is portable equipment, so I was able to take the equipment, go into high schools, set up, have biology teachers give the tests in third, fourth and sixth bell, and just sit there and have kids give us data. It’ll be interesting to see how much easier it is to deal with the scrolling issue for that. Unfortunately, I don’t foresee any time in the near future where that’s going to become the case for high-stakes testing. I mean, there’s a difference between that and classroom testing, for formative and summative testing.
[Host] If you could go back to being a teacher, how would you teach differently, or what would you do in assessment design for your classes? You were teaching for 19 years.
[Thomas] Yes, so I taught every kind of science except biology at some point, and I taught gifted junior high, a lot of classes, as well. In addition to that I taught for Kaplan Test Prep for 15 years, so I’ve taught junior high up through adults. One of the things that the graphic literacy work taught me was that I would spend a lot more time teaching kids to look at the axes and keys and legends of graphics, because that really was a huge, jump-out, obvious difference between people with high graphic literacy and low graphic literacy. One of the problems with teaching high school and above is that most of the time you assume that students have learned certain fundamental skills to get to where you’re teaching. That’s one thing that would change for me, particularly when you think about how much data analysis in science is graphical analysis. We used phase diagrams, and we would do real-time collection of data using probes for velocity, acceleration, and all these other sorts of things, where graphical analysis was really important. Paying attention to what’s on the axes and what’s in the labels is something that I would spend much more time on, making certain that they all got it. The other thing is that I would love to be able to give this to a class to see what was driving wrong answers.
So, for instance, I wrote distractor-driven multiple-choice, so I was looking at misconceptions. I didn’t write very many multiple-choice questions, because the students would do very poorly on them because they were distractor-driven. To see what they could do in chemistry and physics, the questions oftentimes would be: here’s something that happened, explain it; here’s a reaction, how much product do you get in a stoichiometry problem; if you only got this much, what was your percent yield; analyze what went wrong with this experiment. So there were a lot of constructed-response sorts of things. I would love to go back and go, why were they so bad at multiple-choice? What was it about the distractors that was driving them, beyond the fact that, having read misconception research and conceptual change research, I knew that I was picking things where students are likely to take the wrong answer?
That would be very interesting to see, if you could do it with the cognitive effort piece that we’ve been able to quantify. I am one of those weird people who has two psychology degrees in addition to my science degree, so I gave a lot of thought to this cognitive psychology piece of teaching and of breaking things down. One of the things that I told my students was that it’s really hard to be a good science teacher or math teacher, because usually you got into it because you were good at it, and the reason why you were good at it was that it was intuitively obvious. The problem with something that’s intuitively obvious to you is that it doesn’t require a lot of mental effort. So my job was to take something that was intuitively obvious to me and break it down into smaller components that students could cascade together, link together (you could talk about it as a spiraling curriculum) so that they built the pieces. I was very cognizant that there’s only so much mental effort that they can give at any given point, so they’re not gonna look at this whole problem and do it all at once, the way I can look at it and go, “That’s what’s gonna happen.” My job is to break it down into these smaller, manageable pieces. It would be interesting to see, from the beginning of the unit to the end of the unit, how the cognitive effort changed once they had mastered it (“Oh, I know that these steps are gonna work”), to see how they would learn to chunk it so it would take less mental effort to do. But it comes back to the problem of eye tracking in general: it becomes expensive and hard to test lots and lots of people, and then there’s a lot of post-processing time. It’d be awesome if we could do that.
One of the things that came up, and I mentioned it in the Tech Talk that we did, was that Apple bought SensoMotoric Instruments (SMI), which was one of the three leading companies making this eye-tracking research equipment for professors and researchers, and people at ACT were doing that research. Apple surprisingly bought SMI and then stopped technical support and innovation on the products SMI was releasing. Their press release was very cryptic as to whether they were doing it so that they could do market research or so that they could work on accessibility issues. For people with all sorts of mobility issues, whether it’s Parkinson’s or ALS or anything where somebody has difficulty using a mouse or a touchpad, eye tracking would be a great way to do that, because oftentimes, even for folks with those disabilities, eye movement control can be learned, and blinks can be used as mouse clicks and things like that, and there has been some research done on that. So when they bought it, it was hard to tell whether they were doing this for accessibility or so that they could do a better job of selling things in iTunes. So one of the things that we’ve had to do is go to a different vendor. Going back to cost: we thought we had this great way of doing it, and we had invested a bunch of time and money and training in doing it, and now we’re using a different vendor because our vendor no longer exists.
The same thing could happen with anything when you’re using a vendor to do it. Imagine if we hadn’t bought Pacific Metrics and we were hiring them as a vendor, and somebody else bought them out and said, “Okay, you’re not working with ACT anymore because they’re a competitor.” It could happen. That’s kind of what happened to us, so we’ve now switched to a different vendor for these things, and we’re learning. It’s similar; it’s the same but different. We’re learning new ways of doing things in the software, and once you know one you can kind of figure out the others, but getting up the learning curve to be able to do it very quickly is gonna take some time.
[Host] Thank you, Jay, for the conversation. I also want to give credit to his collaborators Thomas Langenfeld, Rongchun Zhu, and Carrie Morris.
Have you heard about Feed Your Brain Friday? Feed Your Brain Friday is a regularly scheduled release of ACT and ACTNext research. Look for the hashtag #FYBF every Friday on our Twitter and LinkedIn pages and look for some of this eye tracking research there too.
Thank you for listening to episode 5 of the ACTNext Navigator podcast!