The four types of Sums of Squares in SAS

Why FOUR Types of Sums of Squares?

Many of the basic ideas of statistical design concern how to lay out an experiment so that straightforward comparisons between treatments will be valid. There are techniques for dealing effectively with variable experimental units, multiple treatment factors, and physical constraints on the randomization, without sacrificing validity of interpretation. Most of us get an introduction to these topics by taking an intermediate-level statistics course or by picking up a standard textbook on the design of experiments. Most of what we are told there pertains to data from experiments that have the convenient property of being balanced. Balance is one of several ways of ensuring that our design will have orthogonal effects. Think of orthogonal effects as being independently estimable. When effects are orthogonal, there is no need to worry about possible masking or interference from other variables in the model when we want to assess the importance of one particular variable, or even the interactions among variables.

If all experiments included only orthogonal effects, there would be no need for different types of sums of squares, because there would be little room for controversy over how one should interpret an analysis of variance. As things now stand, it can be hard to find a clearly written set of recommendations that applies to all situations. SAS does have some advice on the subject, but to use this advice wisely you should understand what you are really choosing when you choose a particular type of sums of squares. This Note is intended to give you both the background and the specific pointers you need to make informed decisions.

Balanced and unbalanced data

Much of the data obtained from studies in the biological and social sciences do not have orthogonal effects, because of practical considerations rather than by design.
Sometimes the researcher does not get to assign levels of the research variables randomly to subjects (for example, the sex of human subjects), and must take whatever experimental subjects are obtainable. Often a model includes quantitative covariates that were measured at the same time as the response variable. Such a covariate could only be orthogonal to the research treatments by accident, since its values were unknown at the time of randomization. Organisms have been known to die in the middle of an experiment to assess the beneficial effect of new drugs or diets, leaving even the most well-planned experiment lopsided.

The result is that researchers spend all their time in statistics class learning about balanced/orthogonal designs, and much of their careers grappling with designs that have non-orthogonal effects. It is not that teachers are unwilling to talk about non-orthogonal effects, but that intermediate students are not ready to hear about them. So many side issues, special cases, and qualifying remarks come up that confusion sets in.

The effort of dealing with unbalanced data can seem so overwhelming that the data set is trimmed down (by random deletion of data points) until balance is achieved. This should be done only if data points are cheap. If the expense or toil involved in collecting data is substantial, then it is a blow to one's pride and an admission of defeat to toss out perfectly good data because you don't know how to handle it. Keeping all of the data is the economical thing to do, but only if some sense can be made of it without calling in a team of consultants. I encourage you to take on the unbalanced analysis boldly, but to be bold without being foolish requires that you have the necessary information.

This Stat Note will discuss the special issues that only arise with non-orthogonal layouts, on the assumption that you are already familiar with the general principles of design of experiments.
(If you aren't, then you may want to read Stat Note 6 before going on.) Afterwards, I will look at the four types of SAS sums of squares, and suggest some ways you can use your understanding of the basic issues to decide which type will best help you to interpret your analysis.

A WARNING - Randomization Restrictions and Multiple Error Terms

Much as I hate to start with a warning note, it should be admitted that some types of non-orthogonal designs won't be discussed in detail. This section is an attempt to explain why not.

Some design layouts restrict how the treatments may be randomly assigned to the experimental units. For example, in a split-plot design the units are first separated into groups, then one set of treatments is randomly assigned to entire groups, after which a second set of treatments is randomly assigned to the units within each group. The hallmark of the ANOVA from such a design is the inclusion of multiple error terms in the table, one to be used for the "whole-plot treatment" tests, the other for the "split-plot treatment" tests. Since SAS likes to compute an error term by subtraction, one way to get the error term for the whole-plot analysis is to compute it either as a nested whole-plot replication effect or, if the whole-plot experiment has blocking, as a block by whole-plot treatment interaction effect. Anyone who uses these tricks with unbalanced data is warned that some thought is in order before deciding whether these effects (which are really ERROR effects) should be used to correct other effects in the model. Balanced split-plot experiments pose no such problem, since there is no need to correct effects for each other.

The analysis of a repeated measures design is very similar to the split-plot analysis, and repeated measures designs are extremely common in all disciplines involving experiments with human or large animal subjects. Balancing such an experiment can really pay off when the time comes to interpret it.
For unbalanced data, the aid of a statistical consultant may be necessary. So many things must be taken into account with such designs that I won't try to make general recommendations here.

Types of data and types of sums of squares

Now that we have prepared the way, it's finally time to talk about sums of squares. This material is adapted from the SAS course notes (a full reference is included at the end of this document). For a one-way completely randomized layout, there is no issue of what type of sums of squares to use, since they are all the same. A two-way layout is the simplest design for which these issues arise. First we distinguish four types of data sets in terms of their pattern of cell frequencies in a two-way layout.

1. Equal cell frequencies: n(i,j) constant for all i,j combinations. This is the so-called BALANCED layout.
2. Proportional cell frequencies: n(i,j)/n(i,l) = n(k,j)/n(k,l) for all i,j,k,l combinations. This is called a PROPORTIONALLY BALANCED layout.
3. Disproportionate, non-zero cell frequencies: n(i,j) > 0 for all i,j, but condition 2 does NOT hold for some i,j,k,l combinations. This is called an UNBALANCED layout. It can result when a balanced layout has missing values.
4. Empty cells: n(i,j) = 0 for some i,j combination(s). This layout has MISSING CELLS.

I'll introduce the four types of sums of squares, then give recommendations on which type of SS to use for which type of data. Suppose in a Proc GLM we were to fit the following model and options:

    MODEL Y = A B A*B / SS1 SS2 SS3 SS4;

SAS would generate four different types of sums of squares for testing interaction and main effects. They are:

I. Fully sequential - analogous to the SS obtainable from regression routines. This type of SS depends on the order in which effects, even main effects, are listed in the MODEL statement.
For some data sets, we could get different results for the model B A A*B than we get from the MODEL statement above, because only the SECOND main effect listed in the MODEL statement is adjusted for the other main effect. If you fit two models, one with A then B, the other with B then A, not only can the type I SS for factor A be different under the two models, but there is NO certain way to predict whether the SS will go up or down when A comes second instead of first! One good thing about type I SS is that it is the only one of the four types for which the SS for the three effects must always add up to the model SS.

II. Method of fitting constants - SS for a given effect is adjusted for all effects listed in the MODEL statement that DO NOT contain the given effect. A and B main effects will both be adjusted for each other (since neither contains the other), but will NOT be adjusted for A*B (since it contains both A and B). A*B will be adjusted for both main effects.

III. Weighted squares of means - SS for a given effect is adjusted for all other effects listed in the MODEL statement, regardless of whether they contain the given effect or not. In particular, A and B main effects WILL be adjusted for the A*B interaction.

IV. Goodnight's method for missing cells layouts - similar to Type III in spirit, with a different strategy for compensating for missing cells when estimating the model parameters.

The table shows which types of SS will be equal for the four kinds of data sets:

                            DATA SET TYPE
EFFECT   BALANCED      PROPORTIONAL   UNBALANCED     MISSING CELL
A        I=II=III=IV   I=II, III=IV   III=IV
B        I=II=III=IV   I=II, III=IV   I=II, III=IV   I=II
A*B      I=II=III=IV   I=II=III=IV    I=II=III=IV    I=II=III=IV

Clearly, it makes no difference which type of SS one uses for balanced data, since they are all the same. Proc ANOVA for balanced data gives only type I SS; no other SS are needed for interpretation of a balanced ANOVA.
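The order dependence of type I SS can be checked by hand on a small example. The following Python sketch (not SAS) uses a made-up unbalanced 2 by 2 data set with cell counts 3, 1, 1, 3 and, to stay short, fits main effects only. The data and the helper names `rss` and `sequential_ss` are hypothetical, introduced here purely for illustration. Each sequential SS is computed as the drop in residual sum of squares when an effect is added to the terms listed before it.

```python
from fractions import Fraction

def rss(columns, y):
    """Residual sum of squares from a least squares fit of y on the given columns."""
    p = len(columns)
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    m = [[Fraction(sum(u * v for u, v in zip(columns[r], columns[c]))) for c in range(p)]
         + [Fraction(sum(u * v for u, v in zip(columns[r], y)))] for r in range(p)]
    # Gauss-Jordan elimination (assumes the columns are linearly independent).
    for i in range(p):
        pivot = next(r for r in range(i, p) if m[r][i] != 0)
        m[i], m[pivot] = m[pivot], m[i]
        lead = m[i][i]
        m[i] = [v / lead for v in m[i]]
        for r in range(p):
            if r != i:
                factor = m[r][i]
                m[r] = [vr - factor * vi for vr, vi in zip(m[r], m[i])]
    beta = [row[p] for row in m]
    fitted = [sum(b * col[k] for b, col in zip(beta, columns)) for k in range(len(y))]
    return sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))

# Hypothetical unbalanced data: 0/1 dummies for two-level factors A and B.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 0, 1, 0, 1, 1, 1]
y = [1, 2, 3, 6, 4, 9, 10, 11]
one = [1] * len(y)  # intercept column

def sequential_ss(first, second):
    """Type I (sequential) SS for two main effects entered in the given order."""
    ss_first = rss([one], y) - rss([one, first], y)
    ss_second = rss([one, first], y) - rss([one, first, second], y)
    return ss_first, ss_second

ss_a_first, ss_b_second = sequential_ss(a, b)  # like MODEL Y = A B
ss_b_first, ss_a_second = sequential_ss(b, a)  # like MODEL Y = B A
print(float(ss_a_first), float(ss_a_second))   # 60.5 13.5
```

With these data the SS for A is 60.5 when A is listed first and 13.5 when it is listed second, yet the two sequential SS add up to the same model SS (98) in either order.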
For unbalanced or proportionally balanced data, it is usually advisable to use type III SS. On occasion, one may want to look at other types of SS. For example, if you believe that testing for main effects when interactions are known to be present is unwise, then you may prefer not to adjust main effect SS for interaction, and you can use type II SS. SAS does not recommend this method, but it is available should you want it. When analyzing data from an experiment with multiple error terms (such as a split-plot) you can use some type I SS, provided you pay attention to how terms are ordered in the MODEL statement. On the whole, however, type III SS are preferred for unbalanced data. You will get this type without asking when you run a GLM on unbalanced data.

SAS recommends that type IV SS be used with data containing missing cells. The SS for main effects can change, depending upon how the cells of the layout are labelled, so some caution in interpreting results is advisable. This will be explained more fully later on. SAS will automatically give you type IV SS from a GLM run on data with missing cells.

Always remember that the choice of SS type will not influence the SS for error, so contrasts won't be affected. Hence, when in doubt you can use contrast SS to test the hypotheses of interest.

When you carry out an analysis in which some of the multiple error terms are COMPUTED as interaction or nested effects, you must ensure that the 'error' terms have been corrected for the treatment effects they are to be used to test. You must pay attention to the type of SS you choose, as well as to the order in which you list any effects in the MODEL statement for which you are planning to use type I SS.
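The four layout types defined earlier and the recommendations just given can be combined into a small check. This is a minimal Python sketch, not a SAS feature: the function `classify_layout` and the table `RECOMMENDED_SS` are hypothetical names, and the recommendations simply restate the advice in this section. The input is a table of cell frequencies n(i,j), one list per level of factor A.

```python
from fractions import Fraction

def classify_layout(counts):
    """Classify a two-way table of cell frequencies n(i,j), given as a list of rows."""
    cells = [n for row in counts for n in row]
    if any(n == 0 for n in cells):
        return "missing cells"
    if len(set(cells)) == 1:
        return "balanced"
    # Condition 2 holds exactly when every row is a constant multiple of the first row.
    first = counts[0]
    for row in counts[1:]:
        if len({Fraction(n, m) for n, m in zip(row, first)}) != 1:
            return "unbalanced"
    return "proportionally balanced"

# Recommendations as stated in the text (a summary only).
RECOMMENDED_SS = {
    "balanced": "any type (all four agree)",
    "proportionally balanced": "type III",
    "unbalanced": "type III",
    "missing cells": "type IV, interpreted with caution",
}

for counts in ([[3, 3], [3, 3]], [[2, 4], [3, 6]], [[2, 4], [3, 5]], [[2, 0], [3, 5]]):
    layout = classify_layout(counts)
    print(counts, "->", layout, "->", RECOMMENDED_SS[layout])
```

Exact fractions are used for the proportionality test so that ratios such as 3/2 and 6/4 compare equal without rounding error.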
Sum of Squares type determines the test hypotheses

The reason it is so hard to digest the information about which type of sums of squares is best for various situations is this: ultimately, the question of which type of sums of squares you want is a question of what specific hypotheses about the model parameters you want to test. The various kinds of SS are actually testing different sets of hypotheses. A description of these hypotheses will make complete sense only if you are familiar with the cell means model, which can be discussed only briefly here. It is explained in full in the reference by Hocking. For the purposes of understanding this document, make sure you note the way in which mean cell responses are used to estimate model parameters.

The parameters of the cell means model are just the expected (or mean) responses in each cell of the layout. For an n by m layout there are thus n x m parameters, which collectively account for all the degrees of freedom that would, in the usual version of the two-way model, go to the grand mean, the factor A and B main effects, and the A by B interaction. Estimating these parameters is straightforward: the average response in cell i,j is our estimator for the expected response in that cell. Relating these cell estimates and parameters to the usual main effects and interaction effects is a bit tougher.

Under the cell means model for our two-way layout, the expected response for level one of factor A is estimated as follows: add up all the mean response values for cells where factor A is at level one; this will entail adding across the levels of factor B. Then divide by the number of levels of factor B, since that is how many cell means you sum. You may wind up averaging together cell means that are computed from quite different numbers of observations, depending on how poorly balanced the data are.
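The averaging rule just described can be written out directly. The Python sketch below uses made-up responses for a 2 by 2 layout; the data and the function names are hypothetical. It computes the cell means estimate for a level of factor A (a plain average of cell means) and, for comparison only, the average of all raw observations at that level, which weights each cell by its sample size.

```python
# Hypothetical responses, keyed by (level of A, level of B).
cells = {
    (1, 1): [10, 12],                          # 2 observations, cell mean 11
    (1, 2): [19, 21, 20, 20, 18, 22, 20, 20],  # 8 observations, cell mean 20
    (2, 1): [14, 16, 15],
    (2, 2): [25, 23],
}

def cell_mean(obs):
    return sum(obs) / len(obs)

def unweighted_mean_a(cells, level):
    """Cell means estimate: average the cell means across the levels of B."""
    means = [cell_mean(obs) for (i, j), obs in cells.items() if i == level]
    return sum(means) / len(means)

def weighted_mean_a(cells, level):
    """Average of all raw observations, so each cell counts in proportion to its size."""
    obs = [v for (i, j), vals in cells.items() if i == level for v in vals]
    return sum(obs) / len(obs)

print(unweighted_mean_a(cells, 1))  # 15.5
print(weighted_mean_a(cells, 1))    # 18.2
```

For level one of A, the unweighted estimate treats the two cells equally even though they rest on 2 and 8 observations.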
This is not a concern, as long as there is no reason to think that the imbalance in the data is a reflection of the nature of the population we are studying (as it would be, for instance, if the relative cell sample sizes were in the same proportions as the sizes of strata in the study population). The cell means model assumes that the data lack balance due to accidental loss of some data points, or because of the sampling design, NOT because of some intrinsic property of the population under study.

When the cell means model is used, the tests for A and B main effects are tests of hypotheses about ordinary sums of expected cell responses. Some other versions of the linear model for the two-way layout lead to parameter estimates and hypothesis tests involving WEIGHTED sums of cell responses, with weights proportional to cell sample sizes. These weighted sums have no clear interpretation if the cell sample sizes are just accidental, so it is better to stick to hypotheses about ordinary sums (again, unless the unequal sample sizes reflect relative population strata sizes). In most unbalanced data sets, the way to get F-tests that test these cell-means hypotheses is to request type III sums of squares.

Problems arise when entire cells are empty in the layout, and it may not be immediately clear what hypotheses are being tested by the various SS. The issue comes down to whether or not partial rows and/or columns (those containing the empty cell) are used in comparisons between rows or columns. Because main effect testing boils down to just this kind of comparison, you cannot know exactly what hypotheses are being tested unless you know how these partial rows and columns are used.

Let us illustrate how types IV and III differ using an example from the reference by Pendleton et al. Suppose I have a two-way layout, with two rows and four columns and one missing cell, as follows:

                 Factor B
               1   2   3   4
Factor A   1               X
           2

The X shows the position of the missing cell.
When comparing two column means, that is, assessing differences among the levels of factor B, type III SS only uses information from ROWS that are intact in these two columns. To contrast level 4 with level 1, 2 or 3, only cell means in the bottom row of the layout will be used, because the top row is not intact in column 4. When contrasting columns 2 and 3, however, the top row means ARE used, since those columns are intact. You can see that the presence of interactions among A and B would render the first contrast suspect (because the pattern of mean differences can change from row to row when we have interactions).

Type III contrasts DO have the nice property that they don't depend upon the location of the missing cell in any way that would be changed if we arbitrarily re-labelled the levels of the factors. This is NOT true of type IV SS, yet there are examples of quite simple layouts where type III can give very misleading results, so type IV with all its flaws is the preferred method for layouts with missing cells.

To illustrate how the position of the empty cell can affect things, consider a second location for the missing cell:

                 Factor B
               1   2   3   4
Factor A   1
           2           X

When the first location for the missing cell holds, ALL type IV contrasts for factor B ignore the entire set of top row cell means. When the second location holds, however, only the contrast that compares column 3 to column 4 will ignore the top row means. This is because type IV contrasts always compare treatment means to the LAST treatment. These two layouts thus lead to different test hypotheses.

This strategy for setting up treatment contrasts can be used to your advantage. If you label the treatment levels so the last one is the control, then the contrasts will automatically compare all treatments to this control. Of course, the method can also work to the disadvantage of the unassuming user who trusts SAS to choose for him or her the hypotheses to test.
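The bookkeeping for the first missing-cell location (row 1, column 4) can be traced with a few lines of Python. This is only a sketch of the rule as stated above, namely that a contrast between two columns draws on the rows that are intact in both; it is not SAS's procedure for constructing type III or type IV estimable functions, and the function name `rows_used` is hypothetical.

```python
def rows_used(missing, col_a, col_b, n_rows=2):
    """Rows whose cells are present in both of the columns being compared."""
    return [r for r in range(1, n_rows + 1)
            if (r, col_a) not in missing and (r, col_b) not in missing]

missing = {(1, 4)}  # first layout in the text: row 1, column 4 is empty

# A contrast between two intact columns uses both rows.
print(rows_used(missing, 2, 3))  # [1, 2]
# A contrast involving column 4 falls back on the bottom row alone.
print(rows_used(missing, 1, 4))  # [2]

# Contrasts of each column against the last one, as described for type IV.
last = 4
against_last = {c: rows_used(missing, c, last) for c in range(1, last)}
print(against_last)              # {1: [2], 2: [2], 3: [2]}
```

Changing the entry in `missing` shows how the set of contributing rows shifts with the position of the empty cell.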
Both SAS and Pendleton advise that type IV SS be used (with caution) with missing cell layouts.

References

Hocking, Ronald R. The Analysis of Linear Models. Brooks/Cole, 1985.

Pendleton, O. J., M. Von Tress and R. Bremer. Interpretation of the four types of analysis of variance tables in SAS. Comm. Statist. Theor. Meth. 15(9), 2785-2808, 1986.

General Linear Models: Practical Applications - Course Notes, pp. 266-67. SAS Institute.