## Generating Data

### Introduction

This first computer exercise introduces you to many of the basic computer and statistical concepts that will be used in later exercises. In this exercise you are going to create some data and then perform some simple analyses of it. If you follow the steps outlined below you should be able to work through the entire exercise without a problem. However, in order to really benefit, it is important that you work slowly and think about what you are doing at each step.

After completing the exercise you are certainly encouraged to play with variations of your own and some that will be suggested. You can ask for a short or long description of any command by typing HINT or HELP followed by the command name. When you encounter a new command you are encouraged to do this. You might also take a moment to see if the command is listed in the MINITAB Handbook.

Now, start the MINITAB program. If you don't know how to do this, look it up in the MINITAB manual that came with the program. You should see the MINITAB prompt, which looks like this: MTB>. Now you are ready to enter the following commands:

MTB> Random 10 C1;
SUBC> Normal 0 1.
You begin by having the computer create or generate 10 random numbers. We want these numbers to be normally distributed (i.e., to come from a "bell-shaped" distribution) with a mean or average of zero and a standard deviation of 1. (Remember that the standard deviation is a measure of the "spread" of scores around the mean.) You told the computer to put these ten numbers in variable C1. To get an idea of what the RANDOM command does type:
MTB> Help Random
Before getting to more serious matters, you should play a little with the ten observations you created. First, print them out to your screen....
MTB> Print C1
Or get means and standard deviations....
MTB> Describe C1
The mean should be near zero and the standard deviation near one. Or, draw a histogram or bar graph....
MTB> Histogram C1
Does it look like a bell-shaped curve? Probably not, because you are only dealing with 10 observations. Why don't you start over, this time generating 50 numbers instead of 10....
MTB> Random 50 C1;
SUBC> Normal 0 1.
Notice that you have erased or overwritten the original 10 observations in C1. If you do....
MTB> Print C1
all you see are the newest 50 numbers. To describe the data...
MTB> Describe C1
Notice that the mean and standard deviation are probably closer to 0 and 1 than was the case with 10 observations. Why?
MTB> Histogram C1
This should look a little more like a normal curve than the first time (although it may still look pretty bizarre).
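
For readers who want to see the effect of sample size outside MINITAB, here is a minimal Python sketch (standard library only). It is an illustration, not part of the MINITAB exercise: the helper names `draw_normal` and `average_miss`, the seeds, and the choice of 2000 replications are all assumptions made for this example. It repeats the "generate n numbers and take the mean" step many times and reports how far the sample mean typically lands from the true mean of zero.

```python
import random
import statistics


def draw_normal(n, mean=0.0, sd=1.0, rng=None):
    """Return n draws from a normal distribution (analogous to Random n C1; Normal mean sd)."""
    if rng is None:
        rng = random.Random()
    return [rng.gauss(mean, sd) for _ in range(n)]


def average_miss(n, replications=2000, seed=1):
    """Average absolute distance between the sample mean and the true mean (0)
    across many independent samples of size n. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    misses = [abs(statistics.mean(draw_normal(n, rng=rng))) for _ in range(replications)]
    return statistics.mean(misses)


for n in (10, 50):
    sample = draw_normal(n, rng=random.Random(n))  # one sample, like one MINITAB run
    print("n =", n,
          "mean =", round(statistics.mean(sample), 3),
          "sd =", round(statistics.stdev(sample), 3),
          "typical miss of the mean =", round(average_miss(n), 3))
```

A single run of 10 or 50 numbers can land anywhere, so the comparison of interest is the "typical miss" column, which averages over many runs.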

### Simulation: Generation of Two Variables

The above commands were included to familiarize you with the Random/Normal command. Now you will conduct a real simulation. You'll create data according to a simple measurement model. You will generate two imaginary test scores for 500 individuals. If you like, you can imagine that you have given two achievement tests to 500 school children.

The measurement model we'll use assumes that a test score is made up of two parts - true ability and random error. We can depict the model as:

$$O = T + e_O$$

Here, $O$ is the observed score on a test, $T$ is true ability on that test and $e_O$ is random error.

Notice what we're doing here. We will create 500 test scores or Os for two separate tests. In real life this is all we would be given and we would assume that each test score is a reflection of some true ability and random error. We would not (in real life) see the two components on the right side of the equation - we only see the observed score. We'll call our first test the X achievement test, or just plain X. It has the model....

$$X = T + e_X$$

which just says that the X test score is assumed to have both true ability and error in measurement. Similarly, we'll call the second test the Y achievement test, or Y, and assume the model....

$$Y = T + e_Y$$

Notice that both of our tests are measuring the same construct, for example, achievement. For any given child, we assume this true ability is the same on both tests (i.e., T). Further, we assume that a child gets different scores on X and Y entirely because of the random error on either test - if the tests both measured achievement perfectly (i.e., without error) both tests would yield the same score for every child. OK, now try the following....
MTB> Random 500 C1;
SUBC> Normal 0 3.
MTB> Random 500 C2;
SUBC> Normal 0 1.
MTB> Random 500 C3;
SUBC> Normal 0 1.
Be sure to enter these exactly as shown. The first command created 500 numbers which we'll call the true scores or T for the 500 imaginary students. The second command generated the 500 random errors for the X test while the final command generated the 500 errors for the Y test. All three (C1-C3) will have a mean near zero and the true score will have a bigger standard deviation than the two random errors. How do we know that this will be the case? We set it up this way because we wanted to create an X and Y test that were fairly accurate - reflected more true ability than error. Now, name the three variables so you can keep track of them.
MTB> Name C1 "true" C2 "x-error" C3 "y-error"
Now get descriptive statistics for these three variables....
MTB> Describe C1-C3
Note that the means and standard deviations should be close to what you specified. Now construct the X test...
MTB> Let C4 = C1 + C2
Remember, C1 is the true score and C2 is random error on the X test. You are actually creating 500 new scores by adding together a true score, C1, and random error, C2. Now, construct the Y test...
MTB> Let C5 = C1 + C3
Notice that you use the same true ability, C1 (both tests are assumed to measure the same thing) but a different random error.
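
The construction of the two tests can also be sketched in Python (standard library only). This is a hedged illustration of the measurement model, not a replacement for the MINITAB steps: the function name `simulate_tests`, its parameters, and the seed are assumptions introduced here. The comments map each list to the MINITAB column it stands in for.

```python
import random
import statistics


def simulate_tests(n=500, sd_true=3.0, sd_error=1.0, seed=None):
    """Generate true scores, two independent error terms, and the two observed tests."""
    rng = random.Random(seed)
    true = [rng.gauss(0.0, sd_true) for _ in range(n)]       # stands in for C1 ("true")
    x_error = [rng.gauss(0.0, sd_error) for _ in range(n)]   # stands in for C2 ("x-error")
    y_error = [rng.gauss(0.0, sd_error) for _ in range(n)]   # stands in for C3 ("y-error")
    x = [t + e for t, e in zip(true, x_error)]  # X = T + e_X, stored in C4 in the exercise
    y = [t + e for t, e in zip(true, y_error)]  # Y = T + e_Y, stored in C5 in the exercise
    return true, x_error, y_error, x, y


true, x_error, y_error, x, y = simulate_tests(seed=42)
for name, column in (("true", true), ("x-error", x_error), ("y-error", y_error),
                     ("X", x), ("Y", y)):
    print(name, round(statistics.mean(column), 3), round(statistics.stdev(column), 3))
```

Both observed tests reuse the same `true` list but get their own error list, which mirrors the point made above about a shared true score and separate random errors.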

It would be worth stopping at this point to think about what you have done. You have been creating imaginary test scores. You have constructed two tests which you labeled X and Y. Both of these imaginary tests measure the same trait because both of them share the same true score. This true score (C1) reflects the true ability of each child on an imaginary achievement test, for example. In addition, each test has its own random error (C2 for X and C3 for Y). This random error reflects all the situational factors (e.g., bad lighting, not enough sleep the night before, noise in the testing room, lucky guesses, etc.) that can cause a child to score better or worse on the test than his true ability alone would yield.

One more word about the scores. Because the true score and error variables were constructed to all have zero means, it should be obvious that the X and Y tests will also have means near zero. This might seem like an unusual kind of test score, but it was done for technical reasons. If you feel more comfortable doing so, you may think of these scores as achievement test scores where a positive value indicates a child who scores above average for his/her age or grade, and a negative score indicates a child who scores below average.

If this were real life, of course, you would not be constructing test scores like this. Instead, you would measure the two sets of scores, X and Y, and would do an analysis of them. You would assume that the two measures have a common true score and independent errors, but you would not see these. Thus, you have generated what we call simulated data. The advantage of using such data is that, unlike with real data, you know how the X and Y tests are constructed because you constructed them. You will see in later simulations that this enables you to test different analysis approaches to see if they give back the results that you put into the data. If the analyses work on simulated data then you might assume that they will also work for real data if the real data meet the assumptions of the measurement model used in the simulations.

Now, pretend that you didn't create the X and Y tests but, rather, that you were given these two sets of test scores and asked to do a simple analysis of them. You might begin by exploring the data to see what it looks like. First, name the two tests....

MTB> Name C4 "X" C5 "Y"
Try this command...
MTB> Info
This just tells you how many variables, what names (if any) and how many observations you have. Now, describe the data....
MTB> Describe C4-C5
By the way, you might also try some of the other column operations listed in MINITAB Help. For example....
MTB> Count C4
tells you there are 500 observations in C4,
MTB> Sum C4
gives the sum,
MTB> Average C4
gives the mean (which should be near zero),
MTB> Medi C4
gives the median,
MTB> Standard C4
gives the standard deviation, and
MTB> Maxi C4
MTB> Mini C4
give the highest and lowest value in C4.
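
As a rough cross-reference, the same column summaries can be computed with Python's standard library. This is a sketch only: the function name `column_summary` and the dictionary keys are made up for this example, and it assumes the MINITAB standard deviation is the sample standard deviation (n - 1 in the denominator), which is what `statistics.stdev` computes.

```python
import statistics


def column_summary(values):
    """Return the basic column statistics discussed above for a list of numbers."""
    return {
        "count": len(values),
        "sum": sum(values),
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "standard": statistics.stdev(values),  # sample SD (assumed to match MINITAB's)
        "maximum": max(values),
        "minimum": min(values),
    }


# Small hand-checkable example rather than the 500 simulated scores.
for key, value in column_summary([1.0, 2.0, 3.0, 4.0]).items():
    print(key, value)
```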

Now look at the distributions....

MTB> Histogram C4
MTB> Histogram C5
These should look a lot more like bell-shaped or normal curves than the earlier graphs did. Look at the bivariate relationship between X and Y....
MTB> Plot C5 * C4;
SUBC> Symbol.
Notice a few things. You plotted C5 on the vertical axis and C4 on the horizontal. Each point on the graph indicates an X score paired with a Y score. It should be clear that the X and Y tests are positively correlated, that is, higher scores on one test tend to be associated with higher scores on the other. To confirm this, do...
MTB> Correlation C4 C5
The correlation should be near .90. You can predict scores on one test using scores on the other. To do this you will use regression analysis. Fit the straight-line regression of Y on X...
MTB> Regress;
SUBC> Response 'Y';
SUBC> Continuous 'X';
SUBC> Terms 'X'.
For now, don't worry about what all the output means (although you might want to start looking at the section on Regression in MINITAB Help). The regression equation describes the best-fitting straight line for the regression of Y on X. You could draw this line on the graph you did earlier. Just substitute some values in for X (try X = 0, 1, -1, 2, and -2) and calculate the Y using the equation which the regression analysis gives you. Then plot the X, Y pairs and you will see that they fall on a straight line. Recall from your high school algebra days that the number immediately to the right of the equal sign is the intercept and tells you where the line hits the Y axis (i.e., when X = 0). The number next to the X variable name is the slope. It is possible to look at a plot of the residuals and the regression line if we use the subcommand form of the regress statement....

MTB> Regress;
SUBC> Response 'Y';
SUBC> Continuous 'X';
SUBC> Terms 'X';
SUBC> Residuals C20;
SUBC> Coefficients C22.
MTB> Let C21=C5-C20

We have arbitrarily chosen columns C20-C22 to store the residuals, predicted values and coefficients, respectively. The predicted Y value is simply the observed Y minus the residual. The LET command is used to construct the predicted Y value. (Try 'Help regress' to get information about the command). Now, to do a plot of the regression line you plot the predicted values against the X variable....

MTB> Plot C21 * C4;
SUBC> Symbol.

This is actually a plot of the straight line that you fit with the regression analysis. It doesn't look like a "perfect" straight line because it is done on a line printer and there is rounding error, but it should give you some idea of the type of line that you fit. Now, you can also look at the residuals (i.e., the Y-distance from the fitted regression line to each of the data points). To do this type....
MTB> Plot C20 * C4;
SUBC> Symbol.
Notice that the bivariate distribution is circular in shape indicating that the residuals are uncorrelated with the X variable (remember the assumption in regression that these must be uncorrelated?). This graph shows that the regression line fits the data well - there appear to be about as many residuals which are positive (i.e., above the regression line) as negative. You might also want to examine the assumption that the residuals are normally distributed. Can you figure out a way to do this?
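
The bookkeeping behind the residuals and predicted values can be sketched in Python (standard library only). This is an illustration under stated assumptions, not MINITAB's implementation: the helpers `fit_line` and `correlation` are hypothetical names, the line is fit with the textbook least-squares formulas, and the seed is arbitrary. The sketch regenerates data with the same model, fits Y on X, and then builds the residuals and predicted values the same way the Let command does.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for the regression of y on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2024)  # arbitrary seed so the run is repeatable
true = [rng.gauss(0, 3) for _ in range(500)]
x = [t + rng.gauss(0, 1) for t in true]
y = [t + rng.gauss(0, 1) for t in true]

intercept, slope = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]  # plays the role of C20
predicted = [b - r for b, r in zip(y, residuals)]                # like Let C21 = C5 - C20

print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("correlation of residuals with X:", round(correlation(x, residuals), 6))
```

Because the predicted values are just the observed Y minus the residual, they fall exactly on the fitted line, and the least-squares residuals are uncorrelated with X by construction.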

Now, you should again stop to consider what you have done. In the first part of the exercise you generated two imaginary tests, X and Y. In the second part you did some analyses of these tests. The analyses told you that the means of the tests were near zero, which is no surprise because that's the way you set things up. Similarly, the bivariate graph and the correlation showed you that the two tests were positively related to each other. Again, you set them up to be correlated by including the same true ability score in both tests. Thus, in this first simulation exercise, you have confirmed through simulation that these statistical procedures do tell you something about what is in the data.
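
A short worked calculation suggests where a correlation near .90 comes from. It assumes, as in the Random/Normal commands above, that the true score and the two error terms are mutually independent, with a true score standard deviation of 3 (variance 9) and error standard deviations of 1 (variance 1). The notation $\operatorname{Var}$, $\operatorname{Cov}$ and $\rho_{XY}$ is introduced here for the illustration.

$$
\begin{aligned}
\operatorname{Var}(X) &= \operatorname{Var}(T) + \operatorname{Var}(e_X) = 9 + 1 = 10, \\
\operatorname{Var}(Y) &= \operatorname{Var}(T) + \operatorname{Var}(e_Y) = 9 + 1 = 10, \\
\operatorname{Cov}(X, Y) &= \operatorname{Var}(T) = 9, \\
\rho_{XY} &= \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}} = \frac{9}{\sqrt{10 \cdot 10}} = 0.90.
\end{aligned}
$$

These are population values; the correlation in any one run of 500 simulated scores will differ slightly from .90.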

It would probably be worth your time to play around with variations on this exercise. This would help familiarize you with MINITAB and with basic simulation ideas. For example, try some of the following....

• Change the reliability of the X and Y tests. Recall that reliability is simply the ratio of true score variance to total score variance. You can increase reliability by increasing the standard deviation in the first Random/Normal statement from 3 to some higher number (while leaving the error variable standard deviations at 1). Similarly, you can lower the reliability by lowering the true score standard deviation to some value less than 3. Look at what happens to the correlation between X and Y when you do this. Also, look at what happens to the slope estimate in the regression equation.
• Construct tests with more "realistic" averages. You can do this very simply by putting in a different mean than 0 in the first Random/Normal statement. However, you should note that the true score measurement model always assumes that the mean of the error variables is zero. Confirm that the statistical analyses can detect the mean that you put in.
• You can always generate more than two tests. Just make sure that each test has some element of true score and error. If you really want to get fancy, why not try generating three variables using two different true scores. Have one variable get the first true score, one get only the second true score, and one have both true scores (you'll have to add in both in a Let statement). Of course, all three variables should have their own independent errors. Look at the correlations between the three variables. Also, try to run a regression analysis with all three (look at the Chapter in the MINITAB Handbook or try the Help command to see how to do this).
• One concept that we will discuss later in the simulation on nonequivalent group designs involves what happens in a regression analysis when we have error in the independent variable. To begin exploring this idea you might want to rerun the simulation with the one difference being that you don't add in the error on the X test (i.e., when you do the Let command the first time you leave out the C2 variable). Here, the X measure would have nothing but true score - it would be a perfect measure of true ability. See what happens to the correlation between X and Y. Also, see how the slope in the regression differs from the slope in the original analysis. Now, run the simulation again, this time including the error in the X test but not in the Y test. Again, observe what happens to the correlation and regression coefficient. The key question is whether there is bias in the regression analysis. How similar are the results from the three simulations (i.e., the original simulation, the perfect X simulation and the perfect Y simulation)? The hard question is why the results of the three simulations come out the way they do. If you are concerned that the results differ because the random numbers you generate are different, run the original three Random/Normal commands, do the original simulation as is, and then run the X-perfect and Y-perfect simulations beginning with the Let commands and eliminating either the X or Y error terms. In this case you are using the same set of random numbers for all three runs and differences in results can be attributed to use of either perfect or imperfect measurement. At any rate, this is a complex issue that will be discussed in more detail later.
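
The variations above can also be tried quickly in Python before (or after) running them in MINITAB. The sketch below is an assumption-laden illustration, not the exercise itself: `reliability`, `slope_and_correlation` and `three_runs` are hypothetical helper names, and the seed is arbitrary. It computes reliability as the ratio of true score variance to total score variance, and it runs the original, perfect X and perfect Y simulations on the same set of random numbers so that differences cannot be blamed on different draws.

```python
import random
import statistics


def reliability(sd_true, sd_error):
    """Ratio of true score variance to total (true + error) score variance."""
    return sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)


def slope_and_correlation(x, y):
    """Least-squares slope of y on x, and the Pearson correlation."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def three_runs(n=500, sd_true=3.0, sd_error=1.0, seed=None):
    """Run the original, perfect X, and perfect Y analyses on one set of random numbers."""
    rng = random.Random(seed)
    true = [rng.gauss(0.0, sd_true) for _ in range(n)]
    x_error = [rng.gauss(0.0, sd_error) for _ in range(n)]
    y_error = [rng.gauss(0.0, sd_error) for _ in range(n)]
    x = [t + e for t, e in zip(true, x_error)]
    y = [t + e for t, e in zip(true, y_error)]
    return {
        "original": slope_and_correlation(x, y),      # error in both tests
        "perfect X": slope_and_correlation(true, y),  # X error left out
        "perfect Y": slope_and_correlation(x, true),  # Y error left out
    }


print("reliability with SDs 3 and 1:", reliability(3, 1))
for label, (slope, r) in three_runs(seed=7).items():
    print(label, "slope =", round(slope, 3), "correlation =", round(r, 3))
```

Changing `sd_true` (for example to 1 or 5) while leaving `sd_error` at 1 is one way to explore the first bullet; comparing the three printed rows is one way to explore the last.
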
One of the best ways to learn MINITAB is to do it. Don't hesitate to sit at the computer and use the Help and Hint commands to explore the system.