Diagnose my Graphs November 11, 2008
Posted by Lee in linkedin.
Frequently, statisticians have to act like doctors. We see statistical reports that try to describe a relationship: how fast rumors spread based on how large a company is, the relationship between nitrogen content and crop yield, speed and gas usage, almost anything you can think of. We get measurements on the relationships, then try to see what we can determine about them.
So today, put on your diagnostician’s cap and look at the four relationships I show you here. To keep you from guessing, I’ve hidden the labels for the two variables, so you’ll be looking at Y1 and X1, Y2 and X2, Y3 and X3, and so on. Here’s the DATA step and PROC REG code to generate the output.
DATA Anscombe;
  INPUT X1 Y1 X2 Y2 X3 Y3 X4 Y4;
  LINES;
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.1 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.1 4 5.39 19 12.5
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
;
RUN;

proc reg;
  model Y1=X1;
  model Y2=X2;
  model Y3=X3;
  model Y4=X4;
run;
quit;
The PROC REG step fits the least-squares line to each set, giving me the equation of fit and all the statistics you could want.
Here’s Y1 vs X1:
I highlighted some typical statistics that statisticians might use in discussing how well this line fits. Circles in the picture show the equation of the line (essentially y = 3 + ½x), the R² (≅ 0.666), and the p-value of the F-test (≅ 0.002). If you don’t know what these statistics are, bear with me. You’ll still get the joke.
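If you’d like to see how those numbers hang together, here is a rough back-of-the-envelope check rather than anything PROC REG prints verbatim. In a simple regression with one predictor, the overall F-statistic can be recovered from R² and the number of points, which is n = 11 rows in the DATA step above:

```latex
F \;=\; \frac{R^2 / 1}{(1 - R^2)/(n - 2)}
  \;=\; \frac{(n-2)\,R^2}{1 - R^2}
  \;\approx\; \frac{9 \times 0.666}{0.334}
  \;\approx\; 18
```

An F of about 18 on 1 and 9 degrees of freedom has a p-value of roughly 0.002, so the line looks comfortably “significant” by the usual standards.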
Here’s Y2 by X2. Check the labels if you don’t believe me.
Here’s Y3 vs X3.
And Y4 by X4.
You should have noticed that all the statistics are essentially identical. The line of best fit is pretty much y = 3 + ½x in every case. And getting all those statistics to be the same, well, that’s something, right?
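The sameness isn’t limited to the regression output. Here is a minimal sketch, assuming the Anscombe data set from the DATA step is still in your session, that compares the plain summary statistics. The choice of PROC MEANS and PROC CORR is mine; any summary procedure would do.

```sas
/* Means and variances of all eight columns, side by side */
proc means data=Anscombe n mean var maxdec=2;
  var X1-X4 Y1-Y4;
run;

/* Correlations of each Y with each X; the pairs that matter  */
/* (Y1 with X1, Y2 with X2, ...) sit on the diagonal          */
proc corr data=Anscombe nosimple;
  var X1 X2 X3 X4;
  with Y1 Y2 Y3 Y4;
run;
```

You should find each X averaging 9, each Y averaging about 7.5, and each matched pair correlated at roughly 0.82.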
Here’s the playing-doctor part. Consider the fact that you’ve got four patients (graphs) exhibiting identical symptoms. What can you tell me about the underlying causes?
It turns out, not much. I may be the master of reading regression output, but there are some things that absolutely require a graph.
I can add the following to my existing SAS code:
proc sgscatter;
  plot Y1*X1 Y2*X2 Y3*X3 Y4*X4 / reg;
run;
Here are the four graphs, with the data points turned on.
- The first graph is exactly what you want to see in a regression. Points are reasonably dispersed around the line of best fit.
- The second graph is clearly not a linear relationship. If I wanted to show off, I’d say that you could fit a second-degree polynomial (a parabola) to this data. But I don’t want to show off, so I’ll just say that if your data isn’t linear, you shouldn’t fit a line to it. Either transform the original data to make it linear (it’s not cheating, really) or use another kind of model.
- The third graph shows what happens when one point is an outlier. The single point is essentially pulling the least squares line upward. I would check that data point, since anyone can have a bad day when transcribing data.
- The fourth graph is an even more extreme case of the third one. All the points line up along the vertical line x = 8. Except one. And it completely determines the equation of the line. Move it anywhere, and the least-squares line follows.
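If you want numbers to back up what the pictures show, here is one way to follow up on graphs two through four. Treat it as a sketch: the data set name Anscombe2 and the variable X2sq are names I made up for illustration, and the R and INFLUENCE options on the MODEL statement are just one of several ways to ask PROC REG for diagnostics.

```sas
/* Graph 2: add a squared term and refit as a parabola */
data Anscombe2;            /* hypothetical name */
  set Anscombe;
  X2sq = X2*X2;            /* hypothetical name */
run;

proc reg data=Anscombe2;
  model Y2 = X2 X2sq;
run;
quit;

/* Graphs 3 and 4: residual and influence diagnostics for each point */
proc reg data=Anscombe;
  model Y3 = X3 / r influence;
  model Y4 = X4 / r influence;
run;
quit;
```

The quadratic fit should hug the second data set very closely. In the diagnostics, look for the one observation in the third set with a residual far larger than the rest, and for the lone x = 19 point in the fourth set, whose leverage (the hat diagonal) works out to 1. That is the numeric version of “move it anywhere, and the line follows.”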
An even better solution is to turn ODS graphics on at the start of your session:
ods graphics on;
and let SAS give you pretty and appropriate graphics without you having to worry about them.
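For example, here is a small sketch that assumes SAS 9.2 or later, where ODS Graphics is available in PROC REG. The PLOTS= request is my choice; with ODS graphics on, PROC REG draws a default set of plots even if you leave it out.

```sas
ods graphics on;

/* Ask only for the fit plot and the diagnostics panel */
proc reg data=Anscombe plots(only)=(fitplot diagnostics);
  model Y1 = X1;
run;
quit;

ods graphics off;
```

Swap in Y2=X2, Y3=X3, or Y4=X4 and the fit plot makes each diagnosis above hard to miss.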




