After running an experiment, we are typically left with a large number of scores. Knowing each of those individual scores, however, is not normally very informative. What we’re really interested in is the characteristics of the distribution of scores. Later, we’ll learn how to characterize the distribution with a few calculated values, but it’s often informative to examine the entire distribution of scores as a first step.
So how do we determine the distribution of the data? Consider the following simple example:
A researcher asks 20 students how many hours of sleep they had the night before, and receives the following 20 responses:
5 7 7 8 6
9 6 8 7 7
8 6 9 10 5
7 8 6 9 7
We can characterize these data by constructing a frequency distribution.
· Frequency distribution: shows the number of times each score occurs in a set of data.
We determine the frequency distribution by:
1. Determining the unique scores present in the data (e.g., 5, 6, 7, 8, 9, 10 in our example)
2. Counting how often each score appears in the data (its frequency, f), e.g.:
| Score | Frequency (f) |
| 10 | 1 |
| 9 | 3 |
| 8 | 4 |
| 7 | 6 |
| 6 | 4 |
| 5 | 2 |
Looking at this table, we can easily see that most students (14 of 20) slept between 6 and 8 hours, but only a few slept less than 6 hours or more than 8 hours.
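The two counting steps can be sketched in a few lines of Python. This is only an illustration using the 20 sleep responses from the example; the function name `frequency_distribution` is made up for this sketch.

```python
from collections import Counter

# The 20 responses from the sleep example above.
sleep_hours = [5, 7, 7, 8, 6,
               9, 6, 8, 7, 7,
               8, 6, 9, 10, 5,
               7, 8, 6, 9, 7]

def frequency_distribution(scores):
    """Return (score, frequency) pairs, highest score first."""
    counts = Counter(scores)  # finds the unique scores and counts each one
    return sorted(counts.items(), reverse=True)

freq_table = frequency_distribution(sleep_hours)
for score, f in freq_table:
    print(f"{score:>5} | {f}")
```

The pairs come out in the same order as the table above (10 down to 5), and the frequencies sum to the 20 responses.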
It’s even easier to visualize the data by displaying the frequency distribution as a graph. In fact, graphs are a useful way of presenting data in general.
· Graphs are useful for making complex data sets more understandable.
· Graphs make it easier to visualize relationships between variables.
· Graphs can be useful for identifying potential relationships and guiding further analyses.
1. Bar graph: used primarily with ordinal or nominal data.
2. Histogram: used primarily with interval or ratio data.
3. Polygon: also used for interval or ratio data; similar to histogram, but with points, rather than bars, plotted.
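As a rough stand-in for a histogram, the sleep frequencies from the table above can be drawn as rows of asterisks. This is a text-only sketch, not a substitute for a proper plotting tool; the name `text_histogram` is made up for the illustration, and the bars run sideways rather than upward.

```python
# Frequencies from the sleep example: score -> frequency (f).
sleep_frequencies = {5: 2, 6: 4, 7: 6, 8: 4, 9: 3, 10: 1}

def text_histogram(frequencies, symbol="*"):
    """Return one text line per score, with a bar as long as its frequency."""
    lines = []
    for score in sorted(frequencies):
        bar = symbol * frequencies[score]
        lines.append(f"{score:>3} | {bar}")
    return lines

histogram_lines = text_histogram(sleep_frequencies)
print("\n".join(histogram_lines))
```

Even in this crude form, the peak at 7 hours and the thin tails at 5 and 10 hours are visible at a glance.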
Listing the frequency of every unique score works fine when there are only a few unique scores, as in our example of number of hours slept. When there are many unique scores, though, the data remain difficult to understand, even after constructing the frequency distribution.
In that case, we group similar scores together, and construct a frequency distribution on the grouped scores. The worked example below uses maze completion times (in seconds) for a group of 120 rats, which range from 80 to 155.
To construct a frequency distribution for the grouped scores:
1. Determine the range of scores
· Range = largest score - smallest score
· Range = 155 - 80 = 75
2. Determine the number of intervals you need (typically 10-20 intervals for 100 or more scores).
· We’ll use 10 intervals.
3. Determine the required interval width (i).
· i = Range / number of intervals
· i = 75 / 10 = 7.5
4. Round i to the same precision as the raw scores in the data set (round to 8 in this example).
5. Construct intervals of width i, starting with a lower bound that is no higher than the smallest score and is a multiple of i (in this example, the lower bound is 80).
6. Tally the scores into the appropriate interval.
7. Sum the tallies to determine the frequency for each interval.
| Interval | Frequency |
| 152-159 | 3 |
| 144-151 | 9 |
| 136-143 | 13 |
| 128-135 | 15 |
| 120-127 | 20 |
| 112-119 | 21 |
| 104-111 | 17 |
| 96-103 | 13 |
| 88-95 | 7 |
| 80-87 | 2 |
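The interval-building steps can be sketched as follows. The sketch assumes whole-number scores and rounds the width half-up; the function names are made up for the illustration. Because the raw maze times are not listed here, the tally is demonstrated on a handful of hypothetical scores rather than the real data.

```python
import math

def build_intervals(smallest, largest, n_intervals):
    """Return the interval width and a list of (lower, upper) bounds."""
    score_range = largest - smallest                       # step 1
    width = math.floor(score_range / n_intervals + 0.5)    # steps 3-4, round half up
    lower = (smallest // width) * width                    # step 5: multiple of width
    intervals = []
    while lower <= largest:
        intervals.append((lower, lower + width - 1))
        lower += width
    return width, intervals

def tally(scores, intervals):
    """Count how many scores fall in each interval (steps 6-7)."""
    counts = {interval: 0 for interval in intervals}
    for score in scores:
        for low, high in intervals:
            if low <= score <= high:
                counts[(low, high)] += 1
                break
    return counts

width, intervals = build_intervals(smallest=80, largest=155, n_intervals=10)
print(width, intervals[0], intervals[-1])

# Hypothetical scores, only to show the tallying step.
demo_counts = tally([80, 87, 88, 155], intervals)
```

With a smallest score of 80, a largest score of 155, and 10 requested intervals, this reproduces the width of 8 and the ten intervals from 80-87 up to 152-159 shown in the table.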
Again, it’s much easier to visualize the data using the frequency distribution than it was given just the raw data. And, again, it’s even easier to visualize using a graph.
Graphing the data can help identify trends for further analysis. For example, suppose the researcher had another group of 120 rats, who had been raised in an enriched environment, and she wanted to find out how they compare to the first group of rats. Call the first group of rats “Group A” and the second group “Group B.”
After running Group B through the maze, the researcher has 120 more scores. How do the two groups compare? Does Group B tend to be faster than Group A?
It’s hard to spot that sort of trend based just on the raw scores, but comparing graphs makes the trend more obvious.
Sometimes it’s useful to know what percentage of scores lie above and below a particular score. For example, after you’ve taken the SAT, you want to know what your raw score was, but also what percentage of people scored higher or lower than you did.
To do this, we construct a cumulative frequency distribution or a cumulative percentage distribution.
For each interval, the cumulative frequency is equal to the sum of the frequency in that interval and the frequencies in all intervals below it. The cumulative percentage is equal to the cumulative frequency divided by the total number of scores, multiplied by 100.
| Interval | Frequency | Cumulative Frequency | Cumulative Percentage |
| 152-159 | 3 | 120 | 100.0 |
| 144-151 | 9 | 117 | 97.5 |
| 136-143 | 13 | 108 | 90.0 |
| 128-135 | 15 | 95 | 79.2 |
| 120-127 | 20 | 80 | 66.7 |
| 112-119 | 21 | 60 | 50.0 |
| 104-111 | 17 | 39 | 32.5 |
| 96-103 | 13 | 22 | 18.3 |
| 88-95 | 7 | 9 | 7.5 |
| 80-87 | 2 | 2 | 1.7 |
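The cumulative columns can be checked with a short sketch. It starts from the interval frequencies in the table, listed from the lowest interval to the highest; the function name is made up for the illustration.

```python
from itertools import accumulate

# Interval labels and frequencies from the table, lowest interval first.
interval_labels = ["80-87", "88-95", "96-103", "104-111", "112-119",
                   "120-127", "128-135", "136-143", "144-151", "152-159"]
frequencies = [2, 7, 13, 17, 21, 20, 15, 13, 9, 3]

def cumulative_distribution(freqs):
    """Return cumulative frequencies and cumulative percentages (1 decimal)."""
    cum_freqs = list(accumulate(freqs))   # running total from the bottom up
    total = cum_freqs[-1]                 # N, the total number of scores
    cum_pcts = [round(100 * cf / total, 1) for cf in cum_freqs]
    return cum_freqs, cum_pcts

cum_freqs, cum_pcts = cumulative_distribution(frequencies)
for label, f, cf, cp in zip(interval_labels, frequencies, cum_freqs, cum_pcts):
    print(f"{label:>8} | {f:>3} | {cf:>3} | {cp:>5.1f}")
```

The output lists the rows from the bottom of the table upward, ending at a cumulative frequency of 120 and a cumulative percentage of 100.0.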
Once we’ve constructed a cumulative frequency distribution, we can calculate percentiles or percentile ranks.
· Percentile: a value below which a particular percentage of scores fall.
For example, we can calculate the value below which 75% of the maze completion times for our first group of rats fall (P75).
To do this:
1. Calculate how many scores will fall below the percentile (cum fp).
cum fp = (% of scores below) x # of scores (N)
cum fp = (0.75) x 120 = 90
2. Determine the lower real limit (XL) of the interval containing the percentile.
a. The 90th score falls in the 128-135 interval (cumulative frequency 95), so XL = 127.5
3. Determine how many additional scores are required within the interval in order to reach the percentile.
a. What’s the cumulative frequency below XL? 80
b. What is cum fp? 90
c. Therefore, we need 10 more scores.
4. Determine the number of units within the interval we need in order to get those extra scores.
additional units = (additional scores needed / frequency in the interval) x i
additional units = (10 / 15) x 8 = 5.3
5. Determine the score (the percentile point) that corresponds to the correct percentile.
Percentile point = XL + additional units
P75 = 127.5 + 5.3 = 132.8
75% of the obtained scores are less than 132.8.
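The five steps can be collected into one function. This is a sketch under two assumptions: scores are whole numbers, so each lower real limit sits 0.5 below the interval's lower bound, and scores are spread evenly within an interval. The function and variable names are made up for the illustration.

```python
# Lower bounds and frequencies of the grouped maze times, lowest interval first.
lower_bounds = [80, 88, 96, 104, 112, 120, 128, 136, 144, 152]
frequencies = [2, 7, 13, 17, 21, 20, 15, 13, 9, 3]

def grouped_percentile(pct, lower_bounds, freqs, width):
    """Return the score below which pct percent of the grouped scores fall."""
    n = sum(freqs)
    cum_fp = pct / 100 * n                    # step 1: scores below the percentile
    cum_below = 0
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_below + f >= cum_fp:
            lower_real_limit = lower - 0.5    # step 2: XL
            needed = cum_fp - cum_below       # step 3: additional scores
            units = needed / f * width        # step 4: additional units
            return lower_real_limit + units   # step 5: percentile point
        cum_below += f
    raise ValueError("pct must be between 0 and 100")

p75 = grouped_percentile(75, lower_bounds, frequencies, width=8)
print(round(p75, 1))
```

For the table above this gives 132.8 after rounding, matching the hand calculation.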
Suppose what you’re really interested in is how a particular rat (your favorite rat, I suppose) did on the task, relative to other rats in Group A. Did your rat do better than 50% of the others? 80%?
To answer that question, you need to compute a percentile rank: the percentage of scores falling below a particular score.
Suppose your rat escaped the maze in 110 seconds. That score (X) falls in the 104-111 interval, so XL = 103.5, the frequency in the interval (f) is 17, and the cumulative frequency below XL is 22.
percentile rank = [(cum f below XL + ((X - XL) / i) x f) / N] x 100
percentile rank = [(22 + ((110 - 103.5) / 8) x 17) / 120] x 100
percentile rank = 29.8
So, 29.8% of the rats had lower completion times than your rat. When the dependent variable is a time, however, less is better, so your rat outperformed 70.2% of his cohort!
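The percentile rank calculation runs the percentile steps in reverse, and can be sketched the same way. As before, this assumes whole-number scores (real limits 0.5 beyond the interval bounds) and an even spread of scores within each interval; the names are made up for the illustration.

```python
# Lower bounds and frequencies of the grouped maze times, lowest interval first.
lower_bounds = [80, 88, 96, 104, 112, 120, 128, 136, 144, 152]
frequencies = [2, 7, 13, 17, 21, 20, 15, 13, 9, 3]

def grouped_percentile_rank(score, lower_bounds, freqs, width):
    """Return the percentage of grouped scores falling below `score`."""
    n = sum(freqs)
    cum_below = 0
    for lower, f in zip(lower_bounds, freqs):
        lower_real = lower - 0.5            # XL for this interval
        upper_real = lower_real + width
        if lower_real <= score < upper_real:
            # Share of this interval's scores that lie below `score`.
            within = (score - lower_real) / width * f
            return 100 * (cum_below + within) / n
        cum_below += f
    raise ValueError("score lies outside the grouped intervals")

rank = grouped_percentile_rank(110, lower_bounds, frequencies, width=8)
print(round(rank, 1), round(100 - rank, 1))
```

For a time of 110 seconds this gives a percentile rank of 29.8, leaving 70.2% of scores at or above it.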
One problem with using frequency distributions on grouped scores is that you lose some information about the individual scores within each interval. One method for overcoming this in some cases is to represent the data using a stem and leaf diagram rather than a histogram or polygon.
In a stem and leaf diagram, each score is represented by two components: a stem (usually the first digit) and a leaf (usually the remaining digits).
Consider the following 30 scores on a hypothetical memory test:
85 90 64 73 94 82 67 78 89 98
76 63 84 92 76 85 88 93 69 72
84 66 78 82 94 75 86 95 63 78
An example stem and leaf diagram for these data would be:
6 | 3 3 4 6 7 9
7 | 2 3 5 6 6 8 8 8
8 | 2 2 4 4 5 5 6 8 9
9 | 0 2 3 4 4 5 8
Note that you can stretch this out a little to make it more informative by repeating stems:
6 | 3 3 4
6 | 6 7 9
7 | 2 3
7 | 5 6 6 8 8 8
8 | 2 2 4 4
8 | 5 5 6 8 9
9 | 0 2 3
9 | 4 4 5 8
Stem and leaf diagrams represent a nice compromise between the simplicity of graphical representation and the usefulness of retaining individual data values. Their usefulness is limited, though, to cases in which there are relatively few scores (< 100).
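A basic stem and leaf diagram is easy to generate. The sketch below assumes non-negative whole-number scores and takes the tens digit as the stem and the ones digit as the leaf; the function name is made up for the illustration.

```python
from collections import defaultdict

# The 30 memory test scores from the example above.
memory_scores = [85, 90, 64, 73, 94, 82, 67, 78, 89, 98,
                 76, 63, 84, 92, 76, 85, 88, 93, 69, 72,
                 84, 66, 78, 82, 94, 75, 86, 95, 63, 78]

def stem_and_leaf(scores):
    """Return one 'stem | leaves' line per stem, with leaves in ascending order."""
    stems = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)   # tens digit and ones digit
        stems[stem].append(leaf)
    lines = []
    for stem in sorted(stems):
        leaf_text = " ".join(str(leaf) for leaf in stems[stem])
        lines.append(f"{stem} | {leaf_text}")
    return lines

diagram = stem_and_leaf(memory_scores)
print("\n".join(diagram))
```

This reproduces the four-stem version of the diagram; splitting each stem into a low half (leaves 0-4) and a high half (leaves 5-9) would give the stretched version.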
| Quantitative SAT Score | Final Exam Score |
| 595 | 68 |
| 520 | 55 |
| 715 | 65 |
| 405 | 42 |
| 680 | 64 |
| 490 | 45 |
| 565 | 56 |
| 580 | 59 |
| 615 | 56 |
| 435 | 42 |
| 440 | 38 |
| 515 | 50 |
| 380 | 37 |
| 510 | 42 |
| 565 | 53 |
| 520 | 46 |
| 495 | 43 |
| 600 | 56 |
| 580 | 53 |
| 525 | 50 |
| 485 | 45 |
| 560 | 52 |
| 620 | 58 |
| 680 | 64 |
| 570 | 56 |