What is the Scaled Comparison?
The Scaled Comparison is a computerized data gathering and analysis process. The term derives from the fact that data are gathered by making comparisons between objects using a scale. It is similar to, and builds upon, the "Paired Comparison" method proposed by Thurstone (1927) for the measurement of social values.
Is it really something new? Haven't we had paired comparison and scaling methods for years?
The Scaled Comparison represents a new advance in methodology over either traditional scaling techniques or the paired comparison. It delivers more information, more reliably than either method.
Why is the Scaled Comparison more reliable?
For one, the Scaled Comparison is a comparative rather than non-comparative method. Non-comparative appraisal methods, like rating scales or scoring techniques, compare an object to some vague concept in the mind of the person doing the evaluating. Comparative methods, like ranking, the paired comparison or the Scaled Comparison, compare the objects directly. They are always more reliable because they require raters to be consistent. With non-comparative methods, raters can, and often do, rate everyone outstanding, every issue important, every action top priority. That's fine, if they really are all equally high. But when an important comparative decision needs to be made, like a promotion, or a choice between two "top priorities", the inconsistency of these kinds of systems becomes evident.
For an example of a comparative process and other methods trying to measure the same thing, see Different Methods, Different Results.
Another reason the Scaled Comparison is more reliable is because it requires the use of multiple evaluators or judges, rather than a single person. You can make any evaluation more reliable by using multiple evaluators, because idiosyncratic rater preferences and biases are controlled. With the Scaled Comparison, these idiosyncrasies are not only controlled, they are identified and reported.
But aren't the comparative methods, especially the pairwise kind, cumbersome or even impossible with larger numbers of people or objects being evaluated?
In most cases, yes. Take the paired comparison, for example. Almost as soon as it was developed, it was clear that its biggest limitation was that it was too cumbersome for large numbers. A manager who wished to evaluate 20 subordinates using the paired comparison would be faced with almost 200 decisions. The Scaled Comparison, on the other hand, would allow him to rate many more with fewer decisions.
How does the Scaled Comparison do it?
Let's look at the manager's situation again. Using the simple formula N(N-1)/2, we learn that there are 190 ways to compare those 20 people. That would be too much work for a single rater on even a single overall criterion of performance. But if you divide the task up among 5 raters, each one has to make only 38 decisions. This is oversimplified for the purpose of illustration, but the principle is there.
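The arithmetic behind that claim is easy to check; the even five-way split is the simplification mentioned above:

```python
def total_pairs(n):
    """Number of distinct pairwise comparisons among n objects: N(N-1)/2."""
    return n * (n - 1) // 2

n_people, n_raters = 20, 5
pairs = total_pairs(n_people)      # 190 comparisons in all
per_rater = pairs // n_raters      # 38 decisions each, split evenly
print(pairs, per_rater)            # 190 38
```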
But then no rater is making all the possible comparisons. How can you get reliable results from incomplete data?
Let me answer that with an analogy. Suppose you had the job of lining those 20 people up along the wall with the tallest at one end down to the shortest at the other. It's unlikely you would make all 190 comparisons. Each person would have to be considered at least once, but you probably wouldn't have to make more than 30 decisions to complete the task. Now that you've lined them all up like that, let's consider the reliability of the process you have just used to get them there. (Reliability simply means that your process would give you the same results every time you used it.) You could have everybody sit down and repeat the task 10 times and get the same results every single time. And in no case would you make anywhere close to 190 decisions.
Sure but that's a different situation. You can look around the room and pick out the tallest right off the bat. The Scaled Comparison doesn't do that, does it?
No, but let's modify our task a little to make it more like the Scaled Comparison. Instead of a room full of the 20 people, you are alone in a room with a door at each end. In walk 2 people, you make a decision about their height...much taller, slightly taller, or exactly equal...record your decision, and they walk out the other door. After ten decisions, you've seen all 20 people. And, since you're just getting warmed up, let's have each of the 20 people go back down the hall to the first door, pair up with someone else, and walk through a second time, and then a third. Now you've made 30 decisions. You could go all the way up to 190 decisions, but you wouldn't need to. After 30 judgments, you have most of what you need to complete the ranking.
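That walk-through amounts to three rounds of ten non-overlapping pairs. The text never discloses how the pairings are actually generated; the classic round-robin "circle method" is one familiar way to produce such rounds so that, over enough rounds, every pair occurs exactly once. This is offered only as an illustrative sketch, not as the Scaled Comparison's own procedure:

```python
def round_robin_rounds(items):
    """Circle method: schedule all pairs of an even-sized list into
    rounds of disjoint pairs, with each pair appearing exactly once."""
    items = list(items)
    n = len(items)
    assert n % 2 == 0
    rounds = []
    for _ in range(n - 1):
        rounds.append([(items[i], items[n - 1 - i]) for i in range(n // 2)])
        # hold the first position fixed and rotate the rest by one
        items = [items[0]] + [items[-1]] + items[1:-1]
    return rounds

rounds = round_robin_rounds(range(20))
# 19 rounds of 10 disjoint pairs = all 190 possible pairs, each seen once
all_pairs = {frozenset(p) for r in rounds for p in r}
print(len(rounds), len(all_pairs))   # 19 190
```

Giving each rater a different subset of rounds is one way to ensure the raters see different portions of the possible combinations.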
But what if by chance an average person was paired 3 times with only short people? Wouldn't your scoring make him appear as tall as the tallest one in the group?
It would if you were using the paired comparison and stopped before making all the comparisons. But the Scaled Comparison is different. To start with, remember that this same exercise is being conducted simultaneously by 4 to 6 more evaluators...the other multiple judges. The way the pairings are generated precludes the other raters from seeing that same average person with the same three short people. While each rater sees only a portion of all the possible combinations, they each see a different portion, until all the possible combinations have been evaluated. But the Scaled Comparison doesn't stop there. I think I hear you saying that being taller than a tall person ought to count more toward height than being taller than a short person...
That's right. It's like playing in the minors or the majors. It's easy to have a high batting average if you're always hitting against poor pitchers.
Agreed. One of the most innovative aspects of the Scaled Comparison is that it qualifies the result of a comparison by the relative strength of each of the two objects being compared. To put it a little more simply, if you were compared to a strong performer and came out better, the result of that comparison would be worth more to your score than if you came out better than a medium or lower performer. In the same way, being judged lower than a low performer has a greater negative effect on your score than being rated lower than a high performer.
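The Scaled Comparison's actual weighting scheme isn't given here, but the Elo rating update used in chess is a familiar example of the same principle: a win over a strong opponent moves your score more than a win over a weak one. A minimal sketch:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Elo-style update: score_a is 1 if A wins, 0 if A loses, 0.5 for a tie.
    The gain is scaled by how surprising the result is, so beating a
    stronger opponent is worth more than beating a weaker one."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a)

# Beating a strong (1800) opponent vs a weak (1200) one, starting from 1500:
gain_vs_strong = elo_update(1500, 1800, 1) - 1500
gain_vs_weak   = elo_update(1500, 1200, 1) - 1500
print(gain_vs_strong > gain_vs_weak)   # True
```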
Because the Scaled Comparison is sensitive to these distinctions, information of the same quality may be obtained from fewer data. That sounds like "we can do the same for less", but in reality, the Scaled Comparison provides better information than the other methods.
Why do you say better information?
Ranking and paired comparison methods give you ordinal data, which show only the order, from top to bottom, of the things you are evaluating. You have no way of knowing the degree of difference between objects on the list. But the Scaled Comparison score is an interval measurement, and reports not only rank position, but relative distance between objects in the ranking.
But the paired comparison and rating scales give you interval data, don't they?
If they do, the intervals have only been inferred from other computations. The paired comparison asks raters to judge ordinal position only: which one is better, with no allowance for an "equals" judgment. The scoring procedure tallies the number of times a subject was judged better than others. A win/lose tally is not a measure of interval distance. Interval data can't be generated from a summary of ordinal decisions. You can say it's interval, but saying it doesn't make it so.
The rating scale method asks raters to judge the objects on a linear scale with 5, 7 or 10 points, each scale point being a different gradation of the criterion being used for evaluation. Those scale scores are then added up, multiplied or divided as if they were quantities of something (quantities of the criterion?). In principle, it is just the way we measure height: apply a scale to each individual, read the corresponding value, and record the number, which can then be manipulated arithmetically. But in practice, rating scales have no reference point, no floor upon which the bottom of the person and the bottom of the scale rest. Without that, it's like holding a ruler up beside someone's head, with its lower end floating in midair, and "measuring" their height. You really don't even have ordinal information with that kind of measurement.
The Scaled Comparison literally asks the rater to indicate how far apart the two objects are on the criterion. It asks them directly for the interval. The final report is a summary of the judgments, not an inference from win/lose tallies or scale scores of dubious value. The Scaled Comparison returns interval results because it asks for an interval judgment.
But the Scaled Comparison uses a scale. Aren't the checkmarks in those boxes used as numerical values in the calculations?
Yes, the checkmarks are converted to numeric values, but not in the same way they are with rating scales. Rating scales are treated as if the intervals between the scale points were equal. The most widely used rating scale is probably the 0-to-4 scale used to evaluate academic performance in U.S. schools. The mathematics we employ to calculate a Grade Point Average assume that the difference between an "F" (0 points) and a "D" (1 point) is the same as the difference between a "C" (2 points) and a "B" (3 points). If that is true, we must also conclude that a "B" is three times a "D", or that an "A" is two times a "C". Not many academics would agree with that last statement, but their arithmetic requires it. If they disagree too much with it, the whole house of cards falls down, so nobody talks much about it. (For more on the weaknesses of rating scales, see What's Wrong With Rating Scales?)
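The ratio problem follows directly from the point values themselves:

```python
# Standard U.S. grade points, treated as equal-interval quantities.
points = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}

# Equal intervals: the F-to-D gap equals the C-to-B gap...
assert points['D'] - points['F'] == points['B'] - points['C']

# ...but the same arithmetic also makes a "B" three times a "D"
# and an "A" two times a "C", which few graders would endorse.
print(points['B'] / points['D'], points['A'] / points['C'])   # 3.0 2.0
```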
The values used to score a Scaled Comparison are not based on any such assumptions. They represent empirical measurements obtained before the computational procedures were established. Before the Scaled Comparison could be developed, it first had to be determined how people view the difference between "equal" and "slightly more", and between "slightly more" and "much more". Using visual, auditory and tactile stimuli of measurable intensity, subjects were asked to continuously vary the intensity of one of the stimuli until it was "much more" or "slightly more" than the other. What was learned was that the difference between "equal" and "slightly more" was not the same as the difference between "slightly more" and "much more". In fact, the differences were consistent across stimulus modalities and were significantly unequal. Those perceived values became the values used by the Scaled Comparison.
So, does that mean you use your special values for each box, add them up and average them for each object being evaluated?
No, because those values can only be used as the starting values for the computations. We must never forget that scales (including the Scaled Comparison's scale) have no reference points. A tape measure has a zero point, so any "score" we obtain from it is immediately a meaningful value that can be manipulated arithmetically. The scales we use to measure opinions and judgments do not work that way; they do not have a zero point, nor do they produce meaningful values by themselves.
The only reference points available to the Scaled Comparison are relative ones. Using our earlier example of the room with 20 people we want to rank according to height, it doesn't matter how tall they are absolutely. They could be pro basketball players or kindergarten children. Some would be taller and some shorter; one would be the tallest and one the shortest. We are never able to "measure" their height, only the relative differences between them.
The Scaled Comparison uses the initial values to arrive at a rough arrangement, or approximate ranking. Then it returns to the raw data and tests each comparison to see if it is consistent or inconsistent with that first arrangement, and revises the arrangement according to what it finds. It continues that process until it reaches the arrangement with the minimum number of inconsistent judgments. Or, said another way, until it finds the arrangement with which the most decisions are consistent. That's why we say it is a method that "searches for agreement".
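The Scaled Comparison's actual algorithm isn't published here, but the "search for agreement" idea can be sketched, under assumed simplifications, as an adjacent-swap local search: keep swapping neighbors in the arrangement whenever the swap reduces the number of judgments that disagree with it:

```python
def disagreements(order, judgments):
    """Count judgments that the current order contradicts.
    'judgments' is a list of (winner, loser) decisions."""
    pos = {item: i for i, item in enumerate(order)}
    return sum(1 for w, l in judgments if pos[w] > pos[l])

def search_for_agreement(order, judgments):
    """Adjacent-swap local search: revise the arrangement until no
    single swap leaves fewer inconsistent judgments."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            trial = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            if disagreements(trial, judgments) < disagreements(order, judgments):
                order, improved = trial, True
    return order

# Three judgments, one of them ("c beats a") inconsistent with the rest;
# the cycle forces at least one inconsistent judgment in any arrangement.
judgments = [("a", "b"), ("b", "c"), ("c", "a")]
order = search_for_agreement(["c", "b", "a"], judgments)
print(order, disagreements(order, judgments))
```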
Does the Scaled Comparison use new or secret mathematical procedures?
Not really. Its development was more hard work than creative discovery. It's not like e=mc². It is complex, to be sure, but the procedures involved are not unlike the ones used in many other statistical computations. And it is not unique in the way it returns to the raw data until it finds the right "fit" in the results. In this respect, it belongs to a family of statistical methods that take advantage of the computer's ability to process and reprocess large amounts of data to arrive at a best solution. Most behavioral scientists have a difficult time with any mathematical process they can't take out of the computer and work through by hand. The Scaled Comparison simply would not be possible without the computer.
So you obviously rely on computer programs for your analysis. Are your programs available to your clients?
Sometimes a client organization will wish to perform some of the steps in the process themselves, such as data entry or printing reports, in order to keep costs down. We provide them our programs at no cost.
What about the programs that perform the data analysis?
We have licensed, and will license, all of our software to any organization that will sign a non-disclosure agreement and anticipates a level of usage at which the annual license fee and usage costs are more cost-effective than paying directly for the service.
Thurstone, L.L. (1927). The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21, 384-400.
How do I get more information?