Assessing Assessing: 2011

Sunday, September 25, 2011

Rubrics, Disenchantment, and Analysis I

There is a tendency, in certain precincts in and around higher education, to fetishize rubrics. One gets the impression at conferences and from consultants that arranging something in rows and columns with a few numbers around the edges will call forth the spirit of rational measurement, science even, to descend upon the task at hand. That said, one can acknowledge the heuristic value of rubrics without succumbing to a belief in their magic. Indeed, the critical examination of almost any of the higher education rubrics in current circulation will quickly disenchant, but one need not abandon all hope: if assessment is "here to stay," as some say, it need not be the intellectual train wreck its regional and national champions sometimes seem inclined to produce.

Consider this single item from a rubric used to assess a general education goal in gender:

As is typical of rubric cell content, each of these is "multi-barrelled" -- that is, the description in each cell is asking more than one question at a time. It's not unlike a survey in which respondents are asked, "Are you conservative and in favor of ending welfare?" It's a methodological no-no, and, in general, it defeats the very idea of dis-aggregation (i.e., "what makes up an A?") that a rubric is meant to provide.

In addition, rubrics presented like this are notoriously hard to read. That's not just an aesthetic issue -- failure to communicate effectively leads to misuse of the rubric (measurement error) and reduces the likelihood of effective constructive critique.

Here is the same information presented in a manner that's more methodologically sound and more intellectually legible:

At the risk of getting ahead of ourselves, there IS a serious problem when these rank ordered categories are used as scores that can be added up and averaged, but we'll save that for another discussion. Too, there is the issue of operationalization -- what does "deep" mean, after all, and how do you distinguish it from not so deep? But this too is for another day.

Let's, for the sake of argument, assume that each of these judgments can be made reliably by competent judges. All told, 4 separate judgments are to be made and each has 3 values. If these knowledges and skills are, in fact, independent (if not, a whole different can of worms), then there are 3 x 3 x 3 x 3 = 81 combinations of ratings possible. Each of these 81 possible assessments is eventually mapped onto 1 of 4 ratings. Four combinations are specified, but the other 77 possibilities are not:

Now let us make a (probably invalid) assumption: that each of THESE scores is worth 1, 2 or 3 "points" and then let's calculate the distance between each of the four scores. We use standard Euclidean distance (r = sqrt(x^2 + y^2) in two dimensions, extended here to the four judgments) with the categories being: Mastery = 3 3 3 3, Practiced = 2 2 2 3, Introduced = 2 2 2 2, Benchmark = 1 1 1 1
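Before looking at distances, the counting argument can be checked directly. The following is a minimal Python sketch, assuming the 1/2/3 point values and the four named profiles given above; the variable names are mine, not part of the rubric.

```python
from itertools import product

LEVELS = (1, 2, 3)   # three possible values for each judgment
N_JUDGMENTS = 4      # four separate judgments per rubric line

# The four profiles the rubric names, scored as assumed in the text.
NAMED = {
    "Mastery": (3, 3, 3, 3),
    "Practiced": (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark": (1, 1, 1, 1),
}

# Every possible combination of ratings, assuming the judgments are independent.
all_profiles = list(product(LEVELS, repeat=N_JUDGMENTS))

# Profiles a judge could record that the rubric gives no label for.
named_profiles = set(NAMED.values())
unspecified = [p for p in all_profiles if p not in named_profiles]

print(len(all_profiles))  # 81
print(len(unspecified))   # 77
```

A profile such as (3, 1, 3, 1) is among the 77: a judge can observe it, but the rubric line does not say which of the four ratings it should receive.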

So, how do these categories spread out along the dimension we are measuring here? Mastery, Introduced, and Benchmark are nicely spaced, 2 units apart (and M to B at 4 units). But then we try to fit P in. It's about 1.7 units from Mastery and about 2.6 from Benchmark, but it's also 1 unit from Introduced. To represent these distances we have to locate it off to the side.
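For anyone who wants to redo the arithmetic, here is a small Python sketch that recomputes the pairwise distances from the four profiles as listed, under the same (probably invalid) assumption that the levels are worth 1, 2 and 3 points. The helper names are mine. With these profiles, Practiced to Benchmark works out to sqrt(7), roughly 2.65.

```python
import math
from itertools import combinations

CATEGORIES = {
    "Mastery": (3, 3, 3, 3),
    "Practiced": (2, 2, 2, 3),
    "Introduced": (2, 2, 2, 2),
    "Benchmark": (1, 1, 1, 1),
}

def euclidean(a, b):
    """Straight-line distance between two equal-length score profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance for every unordered pair of categories.
distances = {
    frozenset((m, n)): euclidean(CATEGORIES[m], CATEGORIES[n])
    for m, n in combinations(CATEGORIES, 2)
}

def d(m, n):
    """Look up the distance between two named categories."""
    return distances[frozenset((m, n))]

for m, n in combinations(CATEGORIES, 2):
    print(f"{m:>10} - {n:<10} {d(m, n):.2f}")

# Three points sit on one line when the two short legs add up to the long one.
line_without_p = math.isclose(
    d("Mastery", "Introduced") + d("Introduced", "Benchmark"),
    d("Mastery", "Benchmark"),
)
p_fits_on_line = math.isclose(
    d("Mastery", "Practiced") + d("Practiced", "Benchmark"),
    d("Mastery", "Benchmark"),
)
print(line_without_p, p_fits_on_line)  # True False
```

The last two checks are the numerical version of the picture: Mastery, Introduced and Benchmark line up, and Practiced cannot be placed on that line without distorting its distances.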

This little exercise suggests that this line of the rubric is measuring two dimensions.

This should provoke us into thinking about what dimensions of learning are being mixed together in this measurement operation.

It is conventional in this sort of exercise to try to characterize the dimensions in which the items are spread out. Looking back at how we defined the categories we speculate that one dimension might have to do with skill (analysis) and the other knowledge. But Mastery and Practiced were on the same level on analysis. What do we do?

It turns out that the orientation of a diagram like this is arbitrary -- all it is showing us is relative distance. And so we can rotate it like this to show how our assessment categories for this goal relate to one another.

Now you may ask, what was the point of this exercise? First, if the point of assessment is to get teachers to think about teaching and learning, and to do so in a manner that applies the same sort of critical thinking skills that we think are important for students to acquire, then a careful critique of our assessment methods is absolutely necessary.

Second, this little bit of quick and dirty analysis of a single rubric might actually help people design better rubrics AND to assess the quality of existing rubrics (there's lots more to worry about on these issues, but that's for another time). Maybe, for example, we might conceptualize "introduce" to include knowledge but not skill or vice versa? Maybe we'd think about whether the skill (analysis) is something that should cross GE categories and be expressed in common language. And so on.

Third, this is a first step toward showing why it makes very little sense to take the scores produced by using rubrics like this and then adding them up and averaging them out in order to assess learning. That will be the focus of a subsequent post.

Sunday, August 14, 2011

What Will "Assessment 2.0" Look Like? A Proposal

The most serious flaw in assessment as now practiced is the premise that it is something that teachers are not interested in, do not want to do, have not been doing, etc. A word that comes up a lot in connection with assessment is "accountability," but most folks who use the word don't take the time to be explicit about just who is supposed to be accountable to whom for what. When someone does get beyond just parroting the word, the most common interpretation seems to be "we need to hold teachers accountable."

We have some news for those who have discovered assessment. Teachers -- lecturers, instructors, professors -- have long been interested in what works and what doesn't in the classroom. Those who would appoint themselves guardians of learning have a nasty habit of trotting out stereotypes of the worst professor ever and, in a classic example of question begging, concluding that such figures dominate the academy and represent a threat to the future of higher education.

But rather than argue about that, here's a proposal for what the next stage in assessment might look like.

Given that most professors and most departments are actually interested in student learning and in how to maximize it -- this is, after all, the vocation these folks have chosen -- the resources that have been pumped into assessment projects should be put at the service of the faculty. Throughout Assessment 1.0 the dominant pattern was for an office of assessment to be in the driver's seat, more or less dictating to faculty (generally relaying what had been dictated to them by accreditation agencies) how and when assessment would happen. Many faculty found the methods wanting and the tasks tedious and pointless, but most went along -- at some institutions more willingly and at some less. The interaction between faculty and assessment offices generally came down to the latter making work for the former without the former seeing much in the way of benefits.

That's unfortunate because there are lots of potential benefits for us as instructors. But to realize them, we need to turn the tables. The basic premise of Assessment 2.0 should be (1) that it be faculty driven and (2) that assessment offices work for the faculty, rather than the other way round. Assessment offices should think of themselves as a support service for the academic program rather than a support service for a regulatory body that oversees the academic program from the outside. The main job of assessment offices should be to make a part of the job that faculty do, as professionals practicing their craft, easier. A part of what professionals do is self-monitor and mutually monitor outcomes. As faculty, we need to think about what information will help us to make micro-, meso-, and macro-adjustments in our practice that will improve the outcomes we are collectively trying to achieve.

And the services of our assessment offices should be available to us to obtain it. We need to put the focus back on this side of the operation and shift away from the idea that the primary motivation behind assessment is to prove something to outsiders. Even the rhetoric from the accreditation agencies, if you slow the tape down and listen, resonates with this: they demand evidence that assessment is happening, that program adjustments happen in response to it, and so on. Where they are wrong is in their ignorant insistence that such things were not already happening.

The assessment industry did not invent assessment -- they simply codified it and figured out how to make a living off of doing it instead of being involved directly in educating.

Thursday, August 11, 2011

Too Bad Higher Education "Experts" and Vendors aren't Graded

I was inspired by a TeachSoc post from Kathe Lowney today to have a look at two articles in the Chronicle of Higher Education on computer essay grading.

The articles are "Professors Cede Grading Power to Outsiders—Even Computers" and "Can Software Make the Grade?"

My Review: A typical Chronicle hack job to my mind. Articles like this remind me of the National Enquirer. The author makes little attempt to critically assess comments from his sources and gives little weight to contrary information (failing to infer, for example, anything from the reported fact that in six years of marketing, almost no one has bought into the computer grading product mentioned). He jumps on the grade inflation bandwagon instead of offering an analytic take on it. In typical COHE fashion he sets up false dichotomies and debates between advocates and defenders as if there is a big divide down the middle of higher education. In effect, articles like this are just product placement -- hopefully without kickbacks -- and "if someone says it then it's a usable quote" journalism. As with many COHE articles, it reflects journalism that's more in touch with the higher education industry than with higher education. It's mediocre work such as this that makes me let my subscription lapse every year or so. It's interesting how COHE seems to have no qualms at all about trashing educators and educational institutions but only ever so rarely do they seem to take an even gentle critical look at education vendors.

On the accompanying "compare yourself to the computer" article: I think I'd fire a TA who graded like that -- the words "capitalism" and "rationality" showing up constitute "concepts related to him" and an answer on Marx where "expelled for advocating revolution" = "significance for social science"? I scored them 4 and 2 and that was generous. I'd be mighty disappointed if I were the makers of that software and this is how my product placement in COHE turned out -- would anyone buy it based on this portrayal?!

Tuesday, March 22, 2011

The Rubrikization of Higher Education

The rubricization of education has always rubbed me the wrong way but I’ve never been able to put my finger on concrete flaws beyond the obvious. This past January I attended the AAC&U conference in San Francisco. A few more problems became clear.

There are three obvious methodological/measurement problems that have long stood out:

1. Almost every rubric I have ever seen has exhibited gads of multi-dimensionality in the different skills/items/categories/rows. Another way to say this is that the rows typically posed double or multi-barreled questions to the evaluator. Or, even if the construct named in the row was simple, the description of the different scale levels would be multi-dimensional. Example:

Category	Advanced (4)	Competent (3)	Developing (2)	Underdeveloped (1)
Structure	Sections fit together in logical sequence; claims, evidence, analysis, conclusions distinguished; logic of argument telescoped and reviewed

One argument that this is not a problem is that all the things listed here typically go together and that they are all indicators of the same underlying skill. Maybe. But it seems to be a stretch that all these skills nicely fall into a simple four level linear scale.

2. The second problem here is just that four point scale. What evidence is there to support the idea that “Advanced” level structure is two times as much structure (or as much skill) as “Developing”? This does not matter much when we are simply looking at these four levels, but the first thing that folks with just a little quantitative skill do is come up with average ratings for a group of students on a skill rating like this.

Let us be clear: computing the average of a scale that has not been shown to have the arithmetic properties of what we call an interval scale PRODUCES MEANINGLESS RESULTS.
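A toy illustration of why, as a Python sketch. The two groups of ratings and the second coding are invented for the example; the only thing assumed about the scale is what an ordinal scale guarantees, namely the rank order of the four levels.

```python
from statistics import mean

# Hypothetical ratings for two groups of four students (levels 1-4).
group_a = [1, 1, 4, 4]
group_b = [2, 3, 3, 3]

# Two numeric codings that both preserve the rank order of the levels.
# Nothing in an ordinal scale says which spacing is the "right" one.
equal_steps = {1: 1, 2: 2, 3: 3, 4: 4}
big_top_step = {1: 1, 2: 2, 3: 3, 4: 10}

def mean_under(coding, ratings):
    """Average the ratings after translating each level to its coded value."""
    return mean(coding[r] for r in ratings)

for name, coding in (("equal steps", equal_steps), ("big top step", big_top_step)):
    a = mean_under(coding, group_a)
    b = mean_under(coding, group_b)
    verdict = "A higher" if a > b else "B higher"
    print(f"{name}: A={a}, B={b} -> {verdict}")
```

Under the equal-step coding group B has the higher average (2.75 against 2.5); under the other coding group A does (5.5 against 2.75). The students and their ratings did not change, only the arbitrary numbers attached to the levels, which is what it means for the average to carry no information on such a scale.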

3. The third problem with rubriks like this is that the items (rows) are not necessarily exhaustive or mutually exclusive. In other words, they do not always include all the components of learning that might be (or should be) happening and the individual items often tap into the same underlying skill. The former is a substantive problem to be solved by better conversations about the goals of education. The latter, though, leads to bad data. Suppose three items X, Y, and Z are listed in a rubric and that the elaborate operationalizations of the different levels of these involve underlying skills a, b, c, d, and e.

Category	Advanced (4)	Competent (3)	Developing (2)	Underdeveloped (1)
X	Blah blah blah {a} blah blah blah {c}
Y	Blah blah blah {b} blah blah blah {c}
Z	Blah blah blah {d} blah blah blah {a} blah blah blah {e} blah blah blah {c}

Here we’ve put in curly brackets the underlying skill that the description “blah blah blah” refers to. In this rubrik, skill a gets counted twice and skill c three times, while b, d, and e are each counted once. When data is aggregated, success on a or c will easily mask lack of progress on b, d, or e.
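The double counting can be tallied mechanically. Below is a Python sketch that counts the curly-bracket tags in the table as printed and then runs one hypothetical student through it. The student's skill levels, and the rule that a row's score is the plain average of the skills it mentions, are my assumptions for illustration only.

```python
from collections import Counter
from statistics import mean

# Skill tags per rubric row, transcribed from the curly brackets in the table.
ROWS = {
    "X": ["a", "c"],
    "Y": ["b", "c"],
    "Z": ["d", "a", "e", "c"],
}

# How many rows each underlying skill feeds into.
weights = Counter(tag for tags in ROWS.values() for tag in tags)
print(sorted(weights.items()))
# [('a', 2), ('b', 1), ('c', 3), ('d', 1), ('e', 1)]

# Hypothetical student: strong on a, b, c but weak on d and e.
skill_level = {"a": 4, "b": 4, "c": 4, "d": 1, "e": 1}

# Assumed scoring rule: a row's score is the mean of the skills it mentions.
row_scores = {row: mean(skill_level[t] for t in tags) for row, tags in ROWS.items()}
rubric_average = mean(row_scores.values())
skill_average = mean(skill_level.values())

print(row_scores)      # X and Y come out at 4, Z at 2.5
print(rubric_average)  # 3.5
print(skill_average)   # 2.8
```

Under these assumptions the rubric average (3.5) looks comfortably "Competent" even though the student is at the bottom level on two of the five underlying skills, because the heavily repeated skills dominate the total.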

4. But here is the most serious problem of rubricization. It completely drives out of the teaching and learning process any response to individual variations in understanding. The role of the teacher as offering constructive criticism about the wide range of variability in learning is driven out in favor of a set of categories.

One great irony in this is that so many of the champions of this approach to educational reform are the very folks who preach about variability of learning styles.

Another is the high level of concern about students who “fall between the cracks.” Here we are developing a system with explicitly designed cracks between which they can fall.

Yet another is that a mantra of the rubrik crowd is “evidence based” and “data driven” decisions. And yet the very devices that lie at the heart of the enterprise are custom-built to degrade information and result in misleading data.

The fundamental absence of critical thinking in the rubrik/assessment literature – and total lack of interest in critical discourse about these techniques – is the final irony.

One can conclude that what we have here is a bunch of middle-brow thinkers designing a system that will maximize the production of people like themselves and guarantee their own employment in the higher education industry. If only there were some evidence that this is what the world will need in the 21st century.