Assessing Assessing: September 2011

There is a tendency, in certain precincts in and around higher education, to fetishize rubrics. One gets the impression at conferences and from consultants that arranging something in rows and columns with a few numbers around the edges will call forth the spirit of rational measurement, science even, to descend upon the task at hand. That said, one can acknowledge the heuristic value of rubrics without succumbing to a belief in their magic. Indeed, a critical examination of almost any of the higher education rubrics in current circulation will quickly disenchant, but one need not abandon all hope: if assessment is "here to stay," as some say, it need not be the intellectual train wreck its regional and national champions sometimes seem inclined to produce.

Consider this single item from a rubric used to assess a general education goal in gender:

As is typical of rubric cell content, each of these is "multi-barrelled" -- that is, the description in each cell asks more than one question at a time. It's not unlike a survey in which respondents are asked, "Are you conservative and in favor of ending welfare?" It's a methodological no-no, and, in general, it defeats the very idea of disaggregation (i.e., "what makes up an A?") that a rubric is meant to provide.

In addition, rubrics presented like this are notoriously hard to read. That's not just an aesthetic issue -- failure to communicate effectively leads to misuse of the rubric (measurement error) and reduces the likelihood of effective constructive critique.

Here is the same information presented in a manner that's more methodologically sound and more intellectually legible:

At the risk of getting ahead of ourselves, there IS a serious problem when these rank-ordered categories are used as scores that can be added up and averaged, but we'll save that for another discussion. Then, too, there is the issue of operationalization -- what does "deep" mean, after all, and how do you distinguish it from not-so-deep? But this too is for another day.

Let's, for the sake of argument, assume that each of these judgments can be made reliably by competent judges. All told, 4 separate judgments are to be made and each has 3 values. If these knowledges and skills are, in fact, independent (if not, a whole different can of worms), then there are 3 x 3 x 3 x 3 = 81 combinations of ratings possible. Each of these 81 possible assessments is eventually mapped onto 1 of 4 ratings. Four combinations are specified, but the other 77 possibilities are not:

Now let us make a (probably invalid) assumption: that each of THESE scores is worth 1, 2 or 3 "points," and then let's calculate the distance between each pair of the four categories. We use standard Euclidean distance, r = sqrt(x^2 + y^2), extended from two coordinates to four, with the categories being: Mastery = 3 3 3 3, Practiced = 2 2 2 3, Introduced = 2 2 2 2, Benchmark = 1 1 1 1
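The counting is easy to check by brute force. What follows is only a minimal sketch: it assumes the four point codings given above are exactly the four combinations the rubric specifies, and the names in it (LEVELS, SPECIFIED, and so on) are mine, not the rubric's.

```python
from itertools import product

# Coding from the text: each of the 4 judgments is scored 1, 2 or 3.
LEVELS = (1, 2, 3)
N_JUDGMENTS = 4

# The four profiles the rubric names, as coded in the text.
# (Assumption: these are the only combinations the rubric defines.)
SPECIFIED = {
    "Mastery": (3, 3, 3, 3),
    "Practiced": (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark": (1, 1, 1, 1),
}

# Every possible profile of four independent three-valued judgments.
all_profiles = list(product(LEVELS, repeat=N_JUDGMENTS))

# Profiles a judge could produce that the rubric gives no label for.
unspecified = [p for p in all_profiles if p not in SPECIFIED.values()]

print(len(all_profiles))  # 81
print(len(unspecified))   # 77
```

A profile such as (3, 2, 1, 2) turns up in the unspecified list: a judge can perfectly well arrive at it, but the rubric does not say which of the four ratings it should receive.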

So, how do these categories spread out along the dimension we are measuring here? Mastery, Introduced, and Benchmark are nicely spaced, 2 units apart (and Mastery to Benchmark at 4 units). But then we try to fit Practiced in. It's 1.7 units from Mastery and about 2.6 from Benchmark (sqrt(7) under the coding above), but it's also 1 unit from Introduced. No point on the line satisfies all three distances at once, so to represent them we have to locate it off to the side.
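For anyone who wants to check the arithmetic, here is a rough sketch that computes all six pairwise distances from the point coding as stated, and then works out how far Practiced sits from the Benchmark-to-Mastery line. The helper names (pairwise_distances, place_relative_to_line) are invented for this illustration, and the whole thing inherits the probably invalid assumption that the ratings are points.

```python
from itertools import combinations
from math import dist, sqrt

# Point coding taken from the text (the "probably invalid" assumption).
PROFILES = {
    "Mastery": (3, 3, 3, 3),
    "Practiced": (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark": (1, 1, 1, 1),
}

def pairwise_distances(profiles):
    """Euclidean distance between every pair of named profiles."""
    return {
        (a, b): dist(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }

def place_relative_to_line(point, origin, far_end):
    """Return (along, off): position along the origin->far_end line,
    and perpendicular distance away from that line."""
    axis_len = dist(origin, far_end)
    unit = [(f - o) / axis_len for o, f in zip(origin, far_end)]
    rel = [p - o for o, p in zip(origin, point)]
    along = sum(r * u for r, u in zip(rel, unit))
    off = sqrt(max(sum(r * r for r in rel) - along ** 2, 0.0))
    return along, off

distances = pairwise_distances(PROFILES)
for (a, b), value in distances.items():
    print(f"{a:>10} to {b:<10} {value:.2f}")

for name in ("Introduced", "Practiced"):
    along, off = place_relative_to_line(
        PROFILES[name], PROFILES["Benchmark"], PROFILES["Mastery"]
    )
    print(f"{name}: {along:.2f} along the line, {off:.2f} off it")
```

Under this coding, Introduced lands exactly on the line (2 units along, 0 off), while Practiced lands 2.5 units along and about 0.87 units off it. The Practiced-to-Benchmark distance comes out at sqrt(7), roughly 2.65.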

This little exercise suggests that this line of the rubric is measuring two dimensions.

This should provoke us into thinking about what dimensions of learning are being mixed together in this measurement operation.

It is conventional in this sort of exercise to try to characterize the dimensions in which the items are spread out. Looking back at how we defined the categories, we speculate that one dimension might have to do with skill (analysis) and the other with knowledge. But Mastery and Practiced were on the same level on analysis. What do we do?

It turns out that the orientation of a diagram like this is arbitrary -- all it is showing us is relative distance. And so we can rotate it like this to show how our assessment categories for this goal relate to one another.
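The claim that orientation is arbitrary can be checked numerically. The sketch below is illustrative only. It assumes a flat layout with Benchmark, Introduced and Mastery on a line at 0, 2 and 4, and Practiced at (2.5, sqrt(0.75)), which reproduces the Euclidean distances implied by the point coding. It then picks one particular rotation, the one that puts Mastery and Practiced at the same height. Whether that is the rotation used in the original diagram is a guess on my part.

```python
from itertools import combinations
from math import atan2, cos, dist, sin, sqrt

# Assumed 2-D layout reproducing the pairwise distances from the coding
# Mastery = 3 3 3 3, Practiced = 2 2 2 3, Introduced = 2 2 2 2,
# Benchmark = 1 1 1 1. These coordinates are not from the original post.
LAYOUT = {
    "Benchmark": (0.0, 0.0),
    "Introduced": (2.0, 0.0),
    "Mastery": (4.0, 0.0),
    "Practiced": (2.5, sqrt(0.75)),
}

def rotate(layout, angle):
    """Rotate every point counterclockwise about the origin by angle (radians)."""
    c, s = cos(angle), sin(angle)
    return {
        name: (c * x - s * y, s * x + c * y)
        for name, (x, y) in layout.items()
    }

def pairwise(layout):
    """Distance between every pair of points in the layout."""
    return {
        (a, b): dist(layout[a], layout[b])
        for a, b in combinations(layout, 2)
    }

# Choose the angle that makes the Practiced-to-Mastery segment horizontal,
# i.e. puts the two categories on the same level of the vertical axis.
mx, my = LAYOUT["Mastery"]
px, py = LAYOUT["Practiced"]
angle = -atan2(my - py, mx - px)

rotated = rotate(LAYOUT, angle)

before, after = pairwise(LAYOUT), pairwise(rotated)
for pair in before:
    print(pair, f"{before[pair]:.2f}", f"{after[pair]:.2f}")

for name, (x, y) in rotated.items():
    print(f"{name:>10}: x = {x:.2f}, y = {y:.2f}")
```

Every distance is the same before and after, which is all the diagram ever promised. With this particular rotation (30 degrees), Mastery and Practiced share a vertical position, and Practiced and Introduced happen to share a horizontal one, which is at least consistent with reading one axis as skill (analysis) and the other as knowledge.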

Now, you may ask, what was the point of this exercise? First, if the point of assessment is to get teachers to think about teaching and learning, and to do so in a manner that applies the same sort of critical thinking skills that we think are important for students to acquire, then a careful critique of our assessment methods is absolutely necessary.

Second, this little bit of quick and dirty analysis of a single rubric might actually help people design better rubrics AND assess the quality of existing rubrics (there's lots more to worry about on these issues, but that's for another time). Maybe, for example, we might conceptualize "introduce" to include knowledge but not skill, or vice versa? Maybe we'd think about whether the skill (analysis) is something that should cross GE categories and be expressed in common language. And so on.

Third, this is a first step toward showing why it makes very little sense to take the scores produced by rubrics like this and then add them up and average them out in order to assess learning. That will be the focus of a subsequent post.

Sunday, September 25, 2011

Rubrics, Disenchantment, and Analysis I