ReliableSurveys.com


What's Wrong With Rating Scales?

"All I Really Need to Know I Learned in Kindergarten"

It wasn't in Robert Fulghum's original essay, but another one of the things we all learn by the time we finish kindergarten is the ability to measure. We know how to measure weight, height, length, distance, sound (loud or soft), brightness (light or dark), age, beauty, kindness and many other attributes of people and things. And, what's more amazing, we learn to measure without adult tools like tape measures, scales (for weighing), decibel meters, appraisal forms or surveys! How do we do it? It's so simple we don't even have to be taught, which means it is an inherently natural cognitive process.

A Child Learns Naturally to Measure by Comparing.

If you ask a child which of two stones is heavier, the child will pick them up, one in each hand, and answer quickly, with no hesitation. If you ask a child which of two classmates is taller, the child will take a quick look and give you the right answer. If you ask the child to line up all of his or her classmates at the blackboard with the tallest at one end and the shortest at the other, the task will be completed in a minute or two. All of these are measurements, and all of them yield information about the relative intensity of an attribute or characteristic in the objects being compared.
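The line-up task can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any survey tool: the classmates' names and heights are hypothetical, and the numbers exist only so that the `taller` function has something to look at. The point is that `line_up` never reads a height as a number; it only asks "which of these two is taller?", the way the child does.

```python
from functools import cmp_to_key

# Hypothetical classmates. The heights are hidden inside the comparison;
# the ordering routine itself never sees them.
HEIGHTS_CM = {"Ana": 118, "Ben": 112, "Cy": 121, "Dee": 115}


def taller(a: str, b: str) -> int:
    """Comparative judgment: report only which of two classmates is taller.

    Returns -1 if a is taller, 1 if b is taller, 0 if they look the same.
    """
    if HEIGHTS_CM[a] > HEIGHTS_CM[b]:
        return -1
    if HEIGHTS_CM[a] < HEIGHTS_CM[b]:
        return 1
    return 0


def line_up(classmates):
    """Order classmates from tallest to shortest using pairwise judgments only."""
    return sorted(classmates, key=cmp_to_key(taller))


print(line_up(["Ana", "Ben", "Cy", "Dee"]))  # ['Cy', 'Ana', 'Dee', 'Ben']
```

Notice that the result is an ordering, not a set of scores: it tells you who is taller than whom, which is exactly the information the comparisons contain.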

Why Start This Discussion with Children?

To make clear the point that subjective measurement is natural at all ages and not the special domain of measurement gurus or survey experts.  We are all prepared as children for subjective measurement.

But a distinction needs to be made here.  There is absolute measurement and comparative measurement.

  • Comparative measurement is comparing one thing with another (or one with many) to know which has more of some quality.  This is the method we learn early because most judgments in daily life are inherently comparative.
  • Absolute measurement is measuring without comparing to any other similar object.  It's like a carpenter finding the length of a 2x4 with a tape measure instead of comparing one 2x4 against another.  It is quicker than comparative measurement, and that is why psychometricians (a fancy word for psychologists who like numbers) invented the rating scale.  They want the rating scale to be the tape measure for human attributes.

If we are going to measure without comparing things, we obviously need something to measure with, because our language fails us.  How would you respond, for example, if you were asked, "How bright is this light?" or "How loud is this tone?", without comparing it to something?  That's the challenge we face with absolute measurement.

Lacking the Language For Absolute Measurement...

We have been carefully sold the idea of using rating scales to measure things absolutely.  A rating scale is a series of numbers or categories arranged so that one end of the scale represents a low level of the quality in an object, and the other end of the scale represents a high level of the quality.  More than 50 years of using rating scales have led to the following conclusions (not universally accepted, but irrefutable):

  • Rating Scales Are Missing Something - The attractiveness of the rating scale is its similarity to the tape measure, with equally spaced increments of the quality being measured. But rating scales are missing the one thing that makes the results meaningful. You can see this missing "something" in the following example:

    How tall is Eddie?

    Very Short |___|___|_X_|___|___| Very Tall

    What does that "X" in the middle box tell you? What do you know about Eddie's height? (Don't tell me that a rating scale is a silly way to measure height. The psychometricians say rating scales are fine for assessing leadership, so they should work just as well for measuring height.) Would you know enough about Eddie's height to pick him out in a police lineup? Would Eddie sit comfortably in a coach-class seat in an airplane? Would anyone want to have Eddie play on their basketball team? What if you knew that 99 others were measured at the same time, in the same room, with the same scale? In other words, would a large sample size tell you something more about Eddie's height?

    Do you really know anything at all about how tall Eddie is? Could Eddie be the dog on the "Frasier" TV show? Could Eddie be a 6-year-old in kindergarten? Or could he be Eddie Johnson, the 17-year veteran pro basketball player? Eddie could be any of these, depending on what the person who placed the "X" was looking at, which we can't know. Without knowing what other things a rater is looking at (Jack Russell Terriers, 6-year-olds, or NBA players), rating scale results can mean whatever anyone wants them to mean.

    The inventors of the rating scale left something important out of their invention.  What's missing is like that little hook at the end of a regular tape measure that slips over the edge of the object being measured, so that the measurement starts where the object starts.  The rating scale is missing a point of reference.  Look back again at the rating scale supposedly measuring Eddie's height.  Do you find yourself asking, "How tall compared to what?"  A reference point is not one of those "nice to have, but not essential" qualities, nor is it a problem solved "with a large enough sample size", as the psychometricians tell us.  We simply have to know something about what or who Eddie was being compared to in order to make sense of his "score".  It sounds as if the only way to make sense of rating scale data is to make it more comparative.

  • The Mathematical Treatment of Rating Scale Data is Dubious - By far, the most common way to use rating scale data is to convert check marks to numbers that correspond to the sequence of the scale categories, then to add, subtract, multiply and divide those numbers as if they were quantities of something.  These mathematical operations require the assumption that the intervals between the scale categories are equal.  The most widely used rating scale is probably the 0 to 4 scale used to evaluate academic performance in schools in the U.S.  The mathematics we employ to calculate a Grade Point Average assume that these numbers are quantities — that the difference between an "F" (0 points) and a "D" (1 point) is the same as the difference between a "C" (2 points) and a "B" (3 points).  Since we accept that to be the case, we must also conclude that a "B" is three times a "D", and that an "A" is two times a "C" and 33% greater than a "B".  Not many like that last statement, but the arithmetic requires it.

  • Rating Scales Work Worst When You Need Them the Most - The accuracy of data obtainable from absolute rating scales will be inversely related to the importance of the issue. The consequences of a rating are patently obvious to the rater: you can see the "good" end and the "bad" end of any scale. The more a person is emotionally invested in the outcome, the more likely those emotions will influence their judgment. Relatively benign assessments, such as climate surveys and market research, are less likely to arouse strong emotions than performance evaluations or downsizing assessments.

  • Rating Scales are Unreliable - Ask anyone whether they would like their boss to write their annual performance appraisal on a Monday morning or Wednesday afternoon.  They will answer something like "Wednesday afternoon — he's in a better mood then."  Most performance appraisal systems are based, at least partially, on rating scale assessments of employee performance.  If the mood of the boss can affect the resulting appraisal, the process is not reliable.  There are ways of measuring the reliability of rating scale results and perhaps improving reliability, but they are rarely used.

  • Rating Scales Are Used Everywhere and Nowhere - Almost every survey or questionnaire uses rating scales, almost all formal performance appraisals use rating scales, virtually all customer service assessments use rating scales, but nobody uses rating scales to make decisions in everyday life.

    • If you were going to purchase a car, you could use rating scales to evaluate the Honda on a 5-point scale across multiple criteria, followed by the Toyota, the Dodge, and the Chevrolet, then sit down with a calculator, add up the scores, calculate the overall score for each car, and go buy the one with the best score.

    • At a restaurant you could look over the menu, assigning scores from 1 to 7 to each entree, then look back at your "evaluations" and select the meal with the highest score.

    • Or you could select a husband or wife by judging the available candidates on a 1 to 10 scale against the important marital criteria, calculating the overall score of each, then going after the one with the highest score.

    We could, but we don't.  Nor do businesses when they are making critical decisions, such as mergers and acquisitions.  Nor do doctors diagnosing their patients.  Nor do policemen facing difficult choices.  Nor do teachers selecting the best way to teach an important subject.

Ironically, what we have in the rating scale is a process used only in artificial, contrived or controlled circumstances, unlike the real world we live in. Think of beauty pageants, Olympic figure skating (my, doesn't it work well there!), customer service surveys, formal performance appraisals, academic grading and employee satisfaction surveys. When one of these events comes along, we dutifully press the "pause" button on reality, suspending our rational thought for a while, until the event is over, then return to real life.

On the other hand, we regularly face choices that require an examination of the alternatives, an evaluation of the relevant variables, and an assessment of the outcomes, costs and risks. It's all subjective measurement. We do what is required to make the best decisions we can based on the data at hand. We don't skip measurement; we do it the normal way, which doesn't involve rating scales.
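To see what the grade-point arithmetic discussed earlier commits us to, here is a small Python sketch. The 0-to-4 mapping comes straight from the text; the two transcripts and the helper names `gpa` and `ratio` are hypothetical, chosen only for illustration.

```python
# Grade points on the 0-to-4 scale used for U.S. academic grading.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def gpa(grades):
    """Average the grade points, treating them as equal-interval quantities."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)


def ratio(grade_a, grade_b):
    """How many times 'bigger' grade_a is than grade_b, taken at face value."""
    return GRADE_POINTS[grade_a] / GRADE_POINTS[grade_b]


# Two hypothetical transcripts that the arithmetic cannot tell apart.
print(gpa(["A", "F"]))  # 2.0
print(gpa(["C", "C"]))  # 2.0

# The ratios the arithmetic forces on us.
print(ratio("B", "D"))  # 3.0  -> a "B" is three times a "D"
print(ratio("A", "C"))  # 2.0  -> an "A" is two times a "C"
print(round((ratio("A", "B") - 1) * 100))  # 33 -> an "A" is 33% greater than a "B"
```

If the intervals between grades are not really equal, then the student with an "A" and an "F" and the student with two "C"s are not really equivalent, yet the average reports the same number for both.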


How Do We Do It?

The same way we have always done it from the earliest age: by comparative judgment, not absolute judgment. A prominent measurement theorist (Nunnally, 1967), after 20 years of studying the way people make judgments, concluded,

"Whereas people are notoriously inaccurate when judging [absolutely], they are notoriously accurate in making comparative judgments." (emphasis added)

In contrast to the difficulties we face when trying to measure absolutely, the comparative measurement tools:

  • Are Far More Reliable - Reliability means you would get the same results if you repeated the assessment a number of times. To go back to the example cited earlier, a 6-year-old child can line up the other kids in the class with the tallest at one end, and the shortest at the other. Then everyone could sit down, and the child could do it again, and again, with very similar results from one time to the next. That's reliability.
  • Have Consistency Built In - Comparative methods allow you to track inconsistencies as a way to check the quality of the results.  Inconsistent judgments may mean that evaluators don't have enough data, are not willing to express their real opinions, or that the instructions were not clear, or any of a number of other reasons — all of which degrade the quality of the results.  With rating scale data, you don't have a clue about the quality of the results you are getting.
  • Are Easier - To complete, to administer, to interpret, to report the results, to see action steps in the results.
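
One way to picture how inconsistencies can be tracked in comparative data is to look for "circular" judgments, where an evaluator says A beats B, B beats C, and yet C beats A. The Python sketch below counts such cycles. It is an illustration under assumptions, not a description of any particular product: the items and judgments are hypothetical, every pair is assumed to have been judged exactly once with no ties, and counting circular triads is just one well-known consistency check.

```python
from itertools import combinations

# Hypothetical paired-comparison results, stored as (winner, loser).
# A > B, B > C and C > A form a cycle; D loses to everyone.
judgments = {
    ("A", "B"), ("B", "C"), ("C", "A"),
    ("A", "D"), ("B", "D"), ("C", "D"),
}


def prefers(a, b):
    """True when the evaluator judged a to have more of the quality than b."""
    return (a, b) in judgments


def circular_triads(items):
    """Return every triple of items whose three pairwise judgments form a cycle."""
    cycles = []
    for a, b, c in combinations(items, 3):
        # Count wins inside the triple. In a consistent triple one item wins
        # twice; in a cycle each of the three items wins exactly once.
        wins = [
            prefers(a, b) + prefers(a, c),
            prefers(b, a) + prefers(b, c),
            prefers(c, a) + prefers(c, b),
        ]
        if wins == [1, 1, 1]:
            cycles.append((a, b, c))
    return cycles


print(circular_triads(["A", "B", "C", "D"]))  # [('A', 'B', 'C')]
```

A long list of cycles would be a warning sign about the quality of the judgments. A set of rating scale check marks offers no comparable internal check.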
It's Our Contention that...

  • if the natural human measurement process is comparative, not absolute, and
  • if the comparative methods we have developed are far more reliable and consistent than the absolute, and
  • if we need comparative data, not absolute data, for the vast majority of our decisions,

then we should be using comparative methods to gather the data we need for these decisions.

Comparative methods are the kind of assessment tools you can find here.  We deploy them online, or in paper and pencil form.  Contact us for more information.

Nunnally, J.C. (1967). Psychometric Theory. New York: McGraw-Hill.



posted: 19:06 - 06.08.08