| Psychological Assessment | © 1995 by the American Psychological Association, Inc. |
September 1995 Vol. 7, No. 3, 320-329 | For personal use only--not for distribution. |
Personality questionnaires are among the most versatile and user-friendly approaches to personality assessment. This article focuses on methodological considerations in conducting research on the Minnesota Multiphasic Personality Inventory-2 (MMPI-2; J. N. Butcher, W. G. Dahlstrom, J. R. Graham, A. Tellegen, & B. Kaemmer, 1989), the most widely used clinical personality instrument. The article addresses ways of identifying methodological problems in research and alerts researchers to potential pitfalls in conducting personality assessment research. The topics addressed include the following: methodological factors addressing the continuity of the MMPI-2 and the original MMPI; sample selection in MMPI-2 research; issues concerning test administration; the application of exclusionary criteria in developing research samples; methodological factors in processing, reporting, and analyzing data; developing and evaluating new MMPI-2 scales; and assessing test bias in personality research.
Personality assessment by the questionnaire method has a long history in psychology, dating to Francis Galton at the end of the 19th century. Galton's early questionnaire results, as with those of many who followed in his footsteps, attested to the power and efficiency of this technique for obtaining information about human problems, characteristics, and motivations. The questionnaire method is one of the most versatile and user-friendly approaches to personality assessment in contemporary psychology and is likely to be an important method in the future, although computers or other electronic recording devices could eventually replace traditional paper-and-pencil forms.
The ability to use self-report measures in psychology is dependent on the quality of the data respondents provide. The importance of establishing the validity of psychological measures is, of course, well recognized. Of the various self-report measures in psychology, none has been subjected to more scrutiny and research than the Minnesota Multiphasic Personality Inventory (MMPI; Butcher & Rouse, in press). This article focuses on methodological considerations in conducting research on the MMPI-2. We will point out some ways of identifying problems and alert researchers to potential pitfalls in conducting MMPI-2-based research. This article will address research in which the MMPI-2 is the main focus. We will not address research efforts in which MMPI-2 scales simply are used as dependent variables to assess other constructs of interest. Finally, although some of the issues raised in this article may seem rather rudimentary, the problems addressed appear frequently enough in MMPI-2 studies submitted for publication that they merit identification and discussion.
Although the MMPI was revised in 1989 to update its norms and to expand its measurement scope by adding new items and developing new scales, a large portion of the original MMPI is contained in the MMPI-2. The traditional validity scales (L [Lie], F [Infrequency], and K [Correction]) and the 10 clinical scales, on which substantial research has accumulated, are essentially the same as they were in the original instrument. These scales were kept intact in the MMPI-2 to ensure continuity between the two versions of the instrument for clinical and research purposes. Investigators who have collected data on the original version can convert MMPI scores to MMPI-2 scores by deleting the 13 items that were dropped from the original scales and using the modified raw scores to derive T scores from MMPI-2 norms. Similarly, for researchers wanting to compare data collected on the MMPI-2 clinical scales with data from the original version of the instrument, a conversion table is provided in the test manual (Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989). Basic research information on the MMPI-2 is available in several sources (Butcher et al., 1989; Dahlstrom, 1993; Graham, 1993; Greene, 1991; Lucio, 1994). Information concerning the Minnesota Multiphasic Personality Inventory-Adolescent (MMPI-A) can be found in Archer (1992), Butcher et al. (1992), Butcher and Williams (1992), and Williams, Butcher, Ben-Porath, and Graham (1992).
An investigator conducting research with the MMPI-2 must make several critical decisions at the outset of a study. A crucial decision, one that will perhaps more than any other affect the success of the study, is the selection of appropriate samples. The two most important decisions to be made in the selection of samples have to do with their composition and size. Each of these issues, in turn, relates to other important sample selection decisions.
Sample Composition
Optimally, a sample used in an MMPI-2 study should be drawn randomly from the population with which the test is to be used. For example, if we are interested in determining how a certain scale functions for inpatients, we should draw randomly from the entire population of inpatients who might be administered an MMPI-2. Unfortunately, the pragmatics of clinical research are such that this is rarely possible (this problem is by no means specific to the MMPI-2 or, for that matter, to assessment studies). Most samples used in MMPI-2 research are, therefore, samples of convenience. These could range from samples of individuals at a given clinical setting (e.g., in one particular inpatient facility) to samples of college undergraduates.
The use of a sample of convenience, in itself, although not optimal, is not necessarily a major methodological flaw. It does, however, impose significant limitations on the inferences one might draw. And, depending on the topic of study, certain samples of convenience, most notably college student samples, are inappropriate. More generally, because MMPI-2 research samples are inevitably local samples of convenience, it is vital to provide in the method section of an MMPI-2 research article a very complete description of the sample.
A specific consequence of using samples of convenience is that one cannot assume that one setting's findings will apply elsewhere. Rather, one must await replication across diverse clinical settings before accepting an MMPI-2 finding as applying generally across a broad population. For example, if a new scale is developed, a researcher will have greater confidence in its overall utility if it is found to be valid across different clinical settings.
One of the more common methodological shortcomings in MMPI-2 studies is the inappropriate use of nonclinical samples. As we just mentioned, the use of nonclinical samples is not uniformly a methodological flaw. In some instances (e.g., initial development and validation of a new scale or procedure or the study of normal range personality characteristics), nonclinical samples can be used. In such instances, samples of college students typically are used. When college students are used in these studies, investigators should recognize that they are not an appropriate proxy for a normal sample. College students are younger, better educated, and more intelligent than the general population. To the extent that these and other differentiating attributes of college students may be correlated with the criterion variable or MMPI-2 scale of interest, one should not expect the results to generalize to other nonclinical samples.
The so-called faking studies are a good example of this limitation. In a typical study, a group of college students is instructed to "fake-bad" on the MMPI-2, and their scores on the validity scales of the test are compared with those of other college students and, it is hoped, with those of clinical participants who took the test under standard instructions. In recent faking studies, there has been an emphasis on "coaching" college students to avoid the detection of faking. These studies (e.g., Lamb, Berry, Wetter, & Baer, 1994) typically find that college students can be coached to avoid detection of faking bad. However, to what extent might one expect these findings to generalize to the general population? Moreover, to what extent should one expect these findings to generalize to clinical populations? Although the college student studies in this case are informative, their findings must be replicated in broader nonclinical and clinical samples.
In some cases, it is inappropriate to use college students in MMPI-2 studies. A general rule of thumb is that whenever the topic of study is specifically a clinical phenomenon, a college sample is inappropriate. For example, an investigator interested in studying the MMPI-2's ability to measure symptoms or characteristics of personality disorders should not use a college sample. Although it is a fact that some college students have full-blown personality disorders and some have various symptoms and characteristics of these conditions, there is significant restriction in the variance of the dependent variable(s) of interest. As a result, scales that might work well in a clinical sample might be found to be ineffective in a college sample. A negative finding in a college sample is therefore uninformative.
The problem of differential base rates of clinical phenomena is clear and pronounced when considering clinical and nonclinical samples. However, differential base rates may also be found between different clinical samples. For example, one would expect a much higher prevalence of positive symptoms of schizophrenia in an inpatient setting than in a community mental health center. Therefore, the efficacy of a scale in predicting these symptoms may be greater in one setting than in another. This is another reason why it is important to replicate MMPI-2 findings across diverse clinical settings. We discuss this issue further later in this article.
A final consideration in composing an MMPI-2 research sample is the importance of having adequate comparison groups. For example, if an investigator is interested in studying the ability of an MMPI-2 scale to detect the presence of symptoms of schizophrenia, it is important to include in the sample not only individuals with symptoms of schizophrenia and others who are symptom free, but also individuals who have other psychological problems. A good MMPI-2 indicator of a particular set of symptoms will discriminate between individuals who display the symptoms of interest and those who have other serious psychological problems. Collection of such samples is vital to the demonstration of a scale's discriminant validity.
Sample Size
Another important consideration in planning an MMPI-2 study is the size of the sample to be used. For various reasons, some types of MMPI-2 studies tend to require relatively large samples. The advantage of large samples in MMPI-2 studies is that they provide sufficient power for the type of analyses typically conducted. For example, a typical MMPI-2 researcher might be interested in analyzing scores on all 10 clinical scales. When 10 statistical tests are performed, it is necessary to correct for familywise error by setting alpha at a lower level. Conducting an initial multivariate analysis (e.g., a multivariate analysis of variance) does not fulfill this requirement (Huberty & Morris, 1989). Correction for familywise error, in turn, reduces the power of the analysis. However, collection of a sufficiently large sample should leave the investigator with adequate power even when alpha is set lower to correct for multiple statistical tests.
A second power consideration has to do with the typically small effect sizes in MMPI-2 research. As is the case with almost any personality study, effect sizes (e.g., correlations) tend to be relatively small. The reasons for this range from the limited reliability of criterion (dependent) measures to the heterogeneity of many clinical phenomena. Regardless of the causes, the result of attenuated effect sizes typical of MMPI-2 studies is reduced power. This, compounded by the need to set alpha lower to correct for familywise error, highlights even further the need to collect relatively large samples in MMPI-2 studies.
To illustrate the problem, consider the following example. An investigator is interested in identifying empirical correlates of the MMPI-2 clinical scales in a given setting. Recognizing that effect sizes for personality scales tend to be small, the investigator sets a correlation of .20 as a minimum magnitude for a meaningful correlation. To control for familywise error, the investigator sets alpha at .005 (.05/10 [clinical scales]). With a sample of 20, the proposed data analyses have .03 power. Even with a generally respectable sample size of 100, there is only .21 power. To achieve what we would recommend as a minimum level of power, .80, 326 participants are required.
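The arithmetic in this example can be approximated with a short calculation. The sketch below is not the procedure the authors used (the source does not say how the power values were obtained); it assumes a two-tailed test of a zero correlation and the Fisher z normal approximation, and the function names `correlation_power` and `required_n` are illustrative. Because it is an approximation, its results should fall close to, but may not exactly match, the figures quoted above.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def correlation_power(r, n, alpha):
    """Approximate two-tailed power for testing H0: rho = 0.

    Assumption: Fisher z approximation, where atanh(r) is treated as
    normally distributed with standard error 1 / sqrt(n - 3).
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    # Probability of landing in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def required_n(r, alpha, power):
    """Approximate sample size needed to detect r with the given power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(((z_crit + z_power) / atanh(r)) ** 2 + 3)


# Bonferroni-style correction for 10 clinical scales, as in the text.
alpha = 0.05 / 10

for n in (20, 100):
    print(f"n = {n}: power is roughly {correlation_power(0.20, n, alpha):.2f}")
print(f"n needed for .80 power: about {required_n(0.20, alpha, 0.80)}")
```

The same two functions can be reused with a different minimum correlation or a different number of tests by changing `r` and the divisor in `alpha`.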
The challenge of collecting adequate sample sizes is compounded even further when investigators are interested in studying MMPI-2 code types. In these studies, participants are typically classified into groups based on specified features of their clinical scale profile. Depending on the rigor of code-type definition (see the section later in this article for additional discussion of the issue of code-type definition), relatively small numbers of participants will be assigned to each code-type group. In fact, several thousand participants would be required to identify sufficiently large samples of individuals for a relatively large number of code types.
Although insufficient power is the most frequent sample size difficulty in MMPI-2 research, in some instances a related but very different problem arises. Because the MMPI-2 is so widely administered, it is sometimes possible for investigators to collect very large samples of participants. In such cases, even when alpha is corrected to a lower level, very small effect sizes may be statistically significant. For example, with alpha set at .005 and power at .80, in a sample of 1,000 participants a correlation of .11 will be statistically significant. In a sample of 1,500 individuals under the same constraints, a correlation of .09 will be significant. The clinical significance of these statistically significant correlations is questionable.
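The large-sample problem can be illustrated by inverting the power calculation: given a sample size, alpha, and desired power, what is the smallest correlation the study can reliably detect? The sketch below again assumes the Fisher z normal approximation with a two-tailed test (an assumption, not a method stated in the source), and `min_detectable_r` is an illustrative name.

```python
from math import sqrt, tanh
from statistics import NormalDist


def min_detectable_r(n, alpha, power):
    """Smallest correlation detectable with the given power (approximate).

    Assumption: Fisher z approximation, two-tailed test of rho = 0.
    """
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    # Convert the required Fisher z value back to the correlation metric.
    return tanh(z_total / sqrt(n - 3))


for n in (1000, 1500):
    print(f"n = {n}: r of about {min_detectable_r(n, 0.005, 0.80):.2f}")
```

Comparing the returned value with a preset minimum for clinical significance makes it easy to see when a sample is large enough that statistical significance alone is no longer informative.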
Thus, it is incumbent on investigators to set a minimum level for the clinical significance of an effect size. In our own research, we have generally found correlations lower than .20 to be uninformative, although exceptions sometimes can be made when there is reason to suspect range restriction in the dependent or independent variables.
Matched Versus Unmatched Samples
Typically, studies that have used the MMPI-2 to compare two or more groups of individuals have attempted to match the groups on variables other than the one of primary interest in the studies. For example, a study comparing MMPI-2 scores of persons with diagnoses of schizophrenia and persons with diagnoses of depressive disorder might match the groups in terms of education or some other indicator of socioeconomic status. It is our contention that such matching is not always appropriate. This recommendation is consistent with Meehl's (1971) position on this matter. Matching groups of individuals on variables other than the one of primary interest makes sense if differences between the groups on these variables can be assumed to result from sampling procedures. Thus, if the researchers are comfortable in assuming that the schizophrenic and depressive groups in the hypothetical study mentioned above do not really differ in terms of educational levels, it would be appropriate to try to match the groups on this variable. On the other hand, if schizophrenic and depressive individuals in the population differ significantly on educational level, which probably is the case, matching would not be appropriate. To do so would result in samples that are not representative of their underlying populations. Therefore, results based on these samples could not be generalized to all members of these diagnostic groups. It is our recommendation that, unless data are available suggesting that differences between groups on variables such as educational level are due to sampling procedures and do not represent actual differences in the populations from which the samples were chosen, groups should not be matched on these variables.
The sine qua non of objective psychological testing is that standard procedures should be used in administering a test to ensure that the conditions under which the test is being completed are similar for all individuals and, of course, as similar as possible to the procedures for the normative sample. The ability to replicate and generalize research findings is contingent on the use of standard administration procedures when conducting research with the MMPI-2. The MMPI-2 is relatively easy to administer by having participants read through the instructions and respond to the items in a true or false manner. Because administration is so simple, there is potential for deviation from standard instructions, and such invalidating deviations are common. In this section, we describe some problems that can occur with the alteration of test stimuli or by deviating from standard administration procedures.
Controlled Versus Uncontrolled Administration
One important consideration in the use of any standardized psychological test is that it should be administered in a controlled setting, thus ensuring that the participant responds to the items in a serious, cooperative manner. It is usually not appropriate to allow participants to self-administer the MMPI-2 by taking the instrument home with them or by mailing test materials to them. In such uncontrolled situations, one cannot be confident that the intended participant is actually the person tested in the study or that the participant took the task seriously.
Full Form Versus Shortened Versions
Researchers need to be aware of the limitations imposed on their data by administering abbreviated versions of the MMPI-2. Much has been written about the problems of using short forms of the MMPI (see Butcher & Hostetler, 1990; Dahlstrom, 1980; Greene, 1982; Lachar, 1979). A short form of a scale is a version that purports to measure the construct assessed by the full scale using a reduced number of items, often only a few. A short form should be distinguished from what has been referred to as an abbreviated MMPI, in which all of the items in scales of interest are administered but not all possible MMPI-2 scales are available. Administering the first 370 items on the MMPI-2 or the first 350 items on the MMPI-A allows the researcher to score fully the standard validity and clinical scales. However, the content scales and many other supplementary scales cannot be scored from the abbreviated version.
Short forms, on the other hand, are reduced item versions of the traditional clinical scales that provide inadequate estimates of the full-scale scores. Investigators have found that MMPI short forms cannot be used for clinical prediction because the resulting profiles are not sufficiently close to full-form profiles (see Hoffman & Butcher, 1975). To ensure that psychometric properties (e.g., reliability) of the full scales are maintained, it is best to avoid shortened versions of scales.
Administering Individual Scales Out of Context
Several problems can result from extracting and administering scales, such as the Post Traumatic Stress Disorder (PTSD) Scale or the MacAndrew Alcoholism Scale-Revised (MAC-R), out of the context of the other items in the MMPI-2. One problem that results when items with highly similar and obvious content are administered together in close sequence is that the mental set of the participant may differ from that during the completion of the entire MMPI-2, in which the items are intermingled with other, perhaps more neutral items. The potentially altered stimuli (e.g., only the Depression scale items) could produce a different response attitude from that obtained if all of the items had been administered. For example, Megargee (1979) reported that he obtained a correlation of only .55 when an extracted version of the Overcontrolled Hostility (OH) scale was compared with the full form of the MMPI administered on the same day. However, in a similar study, MacAndrew (1979) found that scores on the MacAndrew Alcoholism Scale were similar when administered independently and in the context of the entire MMPI. Another problem associated with extracted scales is that the researchers may not be able to examine the MMPI-2 validity scales to exclude invalid data from their studies. In summary, investigators who use extracted scales have the responsibility of demonstrating that scores on extracted scales are equivalent to scores that would have been obtained if the entire MMPI-2 had been administered. Even then, the unavailability of the validity scales may preclude using extracted scales.
Using Translated Versions of the MMPI-2
The MMPI and the MMPI-2 have been found to have considerable validity when translated and adapted to other languages and cultures. An important first step in developing a viable test translation is to ensure that the items are translated in such a way as to have both linguistic and psychological equivalence (Sperber, Devellis, & Boehlecke, 1994). Effective test translation procedures have been developed for the MMPI (Butcher, in press; Butcher & Pancheri, 1976) as follows: careful initial translation using multiple bilingual participants to obtain a satisfactory initial translation of the item pool; back-translation to determine the equivalence of the item translations; and demonstration of equivalence through bilingual test-retest procedures. In the bilingual test-retest study, a group of bilingual individuals, that is, people who have lived in both cultures for 5 years or more and are fluent in both languages, is administered the instrument in both languages. Usually half are administered the English version first and the other half the target language first. The two administrations are then evaluated as a test-retest study by examining differences in mean scores, scale intercorrelations, factor analyses (Ben-Porath, 1990), and item response comparison studies (Hulin, Drasgow, & Parsons, 1983). It is never appropriate to translate an instrument "on the spot" for individuals in a research population because content equivalency cannot be ensured.
For MMPI-2 data to be useful for either clinical or research purposes, one must have confidence that individuals responded to the test items in the manner that was intended. Basically, researchers expect persons completing the MMPI-2 to read each item, consider its content, and respond honestly concerning how the item applies to them. To the extent that these expectations are met, the resulting MMPI-2 scores could offer important information about persons completing the test. To the extent that participants deviate from expectations, the resulting scores are likely to be of limited to no use.
It is very important for investigators to exclude from MMPI-2 research studies the data of persons who have not responded to the items in an acceptable (valid) manner. There are two kinds of deviant responding to be considered: content nonresponsiveness and content responsive faking (Nichols, Greene, & Schmolck, 1989). Content nonresponsiveness occurs when individuals fail to respond to the content of items in a meaningful way (e.g., random responding) or when they omit a large number of items. In most MMPI-2 studies it will be important to exclude persons who have responded in either of these ways. The Variable Response Inconsistency (VRIN) Scale is the best way to detect random responding. As recommended in the MMPI-2 test manual (Butcher et al., 1989), VRIN T scores greater than 80 indicate random responding, and persons with scores above this level should be excluded. The F scale was developed initially to detect random responding to MMPI items. The Backside F (Fb) scale was developed to identify invalid responding to items in the second half of the MMPI-2 booklet. Although random responding produces very high F scale and Fb scale scores, there are other reasons for these scales being elevated (e.g., malingering, serious psychopathology). Therefore, the F and Fb scales should not be used to detect random responding.
When participants omit large numbers of MMPI-2 items, scores on most scales will be artificially low. The Cannot Say scale provides an index of omitted items. Protocols with more than 30 omitted items (including items answered as both true and false) should be excluded from research studies. It is desirable to have even fewer omitted items, and investigators should encourage participants to complete items that initially were not answered.
Tendencies to answer MMPI-2 items true or false without consideration of their content (i.e., yea saying and nay saying) lead to scores that do not accurately represent characteristics of test participants. Thus, participants who have adopted either of these response sets should be excluded from research studies. The True Response Inconsistency (TRIN) Scale was developed to detect these response sets. As recommended in the MMPI-2 manual (Butcher et al., 1989), TRIN raw scores of 13 or greater suggest a true response bias, and TRIN raw scores of 5 or less indicate false response bias. Individuals with such scores should be eliminated from research studies.
Content responsive faking occurs when participants consider the content of MMPI-2 items but do not respond in an honest or candid manner to the items. Usually, such individuals are trying to appear better adjusted than they really are (faking good) or more poorly adjusted than they really are (faking bad or malingering). Because such approaches to the test produce scores that do not accurately reflect characteristics of these individuals, it is important to exclude persons who have approached the test in either of these ways. T scores greater than 65 on either the L or K scale suggest the possibility that individuals have responded to items with the intention of presenting themselves in unrealistically favorable ways. However, in some settings (e.g., employment screening, child custody), relatively high L and K scores are normative. Therefore, it is recommended that a T score of 80 be used to exclude individuals from research studies on the basis of defensiveness or faking good. The newly developed Superlative Self-Presentation (S) scale (Butcher & Han, 1995) might also provide researchers with a means of detecting invalid (highly virtuous or superlative) self-descriptions.
The F scale and the F - K index (raw score of the F scale minus raw score on the K scale), or Dissimulation Index, are effective measures of faking bad or malingering. Prior research has demonstrated that the optimal cutoffs for clinical use vary from setting to setting. For example, Graham, Watts, and Timbrook (1991) found that an F scale cutoff of 18 accurately detected normal participants who were instructed to fake bad. However, a much higher F scale raw score (> 27 for men and > 29 for women) was needed to correctly discriminate between normal individuals faking bad and psychiatric patients taking the MMPI-2 with standard instructions. For research purposes, it is recommended that a relatively high F scale raw score (> 30) be used to exclude individuals on the basis of faking bad or malingering. Although the F - K index also is effective in identifying faking bad or malingering, there does not appear to be any advantage to using the index instead of or in addition to the F scale.
When research is designed to study the extent to which invalid responding can be detected by the MMPI-2 validity scales and indexes, participants should not be excluded on the basis of scores on the measures that are being studied. For example, in a study of the extent to which VRIN scale scores can identify random responding, obviously one would not want to exclude individuals on the basis of VRIN scale scores or other scores that are likely to be affected by random responding (e.g., F or Fb). The only exclusion criterion that should be used in such an instance would be more than 30 omitted items. In a study of faking bad or faking good it would not be appropriate to exclude individuals on the basis of L, F, Fb, or K scale scores. However, individuals should be excluded on the basis of VRIN and TRIN scale scores and omitted items.
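The exclusion rules described in this section can be collected into a single screening routine. The following is a minimal sketch, not a published scoring procedure: the `Protocol` class, its field names, and the `study` labels are hypothetical, and the L and K rule is implemented as T of 80 or higher, which is one reading of the recommendation that "a T score of 80 be used." Scale scores are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass
class Protocol:
    """Validity indicators for one participant (hypothetical field names)."""
    cannot_say: int   # omitted items, including items marked both true and false
    vrin_t: float     # VRIN T score
    trin_raw: int     # TRIN raw score
    l_t: float        # L scale T score
    k_t: float        # K scale T score
    f_raw: int        # F scale raw score


def exclusion_reasons(p, study="standard"):
    """Return the reasons a protocol would be excluded; empty list = keep.

    study: "standard", "random_responding" (validity study of random
    responding), or "faking" (study of faking good or faking bad).
    """
    reasons = []
    if p.cannot_say > 30:
        reasons.append("more than 30 omitted items")
    if study == "random_responding":
        # Only the omitted-items criterion applies in such studies.
        return reasons
    if p.vrin_t > 80:
        reasons.append("VRIN T > 80 (random responding)")
    if p.trin_raw >= 13:
        reasons.append("TRIN raw >= 13 (true response bias)")
    if p.trin_raw <= 5:
        reasons.append("TRIN raw <= 5 (false response bias)")
    if study == "faking":
        # Do not screen on L, F, or K when faking itself is under study.
        return reasons
    if p.l_t >= 80 or p.k_t >= 80:  # assumed reading of "a T score of 80"
        reasons.append("L or K T >= 80 (defensiveness or faking good)")
    if p.f_raw > 30:
        reasons.append("F raw > 30 (faking bad or malingering)")
    return reasons


example = Protocol(cannot_say=2, vrin_t=85, trin_raw=9, l_t=50, k_t=50, f_raw=5)
print(exclusion_reasons(example))
```

Returning the list of reasons, rather than a simple yes or no, makes it straightforward to report in the method section how many participants were excluded under each criterion.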
If personality test items are processed manually it is important that item responses are entered accurately into data files. Human errors can creep into the data entry process. Consequently, protection against inaccurate or erroneous data entry needs to be implemented. One way to maximize accuracy with manual data entry is to double-key or verify the item response entries before analyses are conducted.
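Double-key verification amounts to comparing two independent entries of the same answer sheet and flagging every item on which they disagree. A minimal sketch follows; the function name and the use of "T", "F", and blank strings to represent responses are illustrative assumptions, not a format specified in the source.

```python
def double_key_mismatches(first_pass, second_pass):
    """Return 1-based item numbers where two keyed entries disagree."""
    if len(first_pass) != len(second_pass):
        raise ValueError("The two entries contain different numbers of items")
    return [
        item_number
        for item_number, (a, b) in enumerate(zip(first_pass, second_pass), start=1)
        if a != b
    ]


entry_one = ["T", "F", "T", "", "F"]
entry_two = ["T", "T", "T", "", "F"]
print(double_key_mismatches(entry_one, entry_two))
```

Flagged items would then be checked against the original answer sheet before any scales are scored.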
An efficient and cost-effective means of scoring objective test answer sheets involves using a mark sensing device or an optical scanner to read item responses from answer sheets to a data file. Scanning is usually a reliable means of data processing; however, there can be problems with this procedure, particularly if the answer sheets contain records that are lightly marked. Lightly marked answers may be read as blanks, resulting in high rates of item omissions. To prevent this data entry problem the researcher should enhance the marks on the answer sheet using a number two pencil or a special marking pen before scanning.
With the wide availability of computers and optical scanners, relatively few research studies use manual scoring of answer sheets. Researchers relying on manual scoring need to verify that the scoring templates used are correct (homemade stencils require careful checking for accuracy), and the scores should be checked (i.e., scored twice) to eliminate counting errors.
Analyzing K-Corrected or Non-K-Corrected T Scores
An important decision in conducting MMPI-2 research is whether to use K-corrected or non-K-corrected scores on scales that have the K correction. Either of these procedures could be appropriate to address some research questions but inappropriate if other research questions are being addressed. If the research is designed to address issues of clinical interpretation, K-corrected scores are preferable. For example, when studying correlates of clinical scales, it is appropriate to use K-corrected scores because those are the scores that clinicians typically will use in interpreting profiles. If the research is concerned with more basic psychometric issues, such as scale-level factor structure of the MMPI-2, non-K-corrected scores are preferable, because the use of K-corrected scores would introduce artificial covariance among scales. The extent to which K-corrected or non-K-corrected scores are more useful in clinical applications of the MMPI-2 is currently being investigated (e.g., Weed, Ben-Porath, & Butcher, 1990), but it is premature to conclude that one kind of score is superior to the other.
Controlling for Demographic Differences Between Groups
Earlier in this article we discussed the issue of matching groups in terms of demographic variables such as socioeconomic status. The same issues apply when considering whether to control statistically for demographic differences between groups. If it can be assumed that demographic differences between groups result from sampling procedures and matching groups on these variables is not possible, statistical procedures such as analysis of covariance should be considered. However, such procedures are necessary only when there are statistically significant relationships between the demographic variables and the dependent variables in the study.
Using Well-Defined Code Types
Since the MMPI was published in 1943, there has been emphasis placed on the interpretation of configurations of scores on the clinical scales (i.e., code types). Such interpretation requires empirical research data concerning the characteristics of persons having a particular code type (e.g., 49/94) that differentiates them from persons not having that code type. Although considerable research was published concerning code-type correlates for the original MMPI, to date little published research is available concerning correlates of MMPI-2 code types. Because of the continuity between MMPI and MMPI-2 clinical scales, it is likely that the correlates of MMPI and MMPI-2 code types will be very similar. However, it will be important for future research to demonstrate the similarity of correlates.
When conducting research concerning MMPI2 code types, it is important to take into account the extent to which the code types are defined. Definition refers to the difference between the score on the lowest scale in a code type and the next highest clinical scale score in the profile. Given the standard errors of measurement of the clinical scales, T-score differences between scales of less than 5 points should not be considered meaningful. For example, with a 2-point code type with Scale 2 as the highest scale and Scale 7 as the second highest scale, one would require a difference of 5 T-score points between Scale 7 and whatever scale is third highest in the profile before considering the "27" code type well defined.
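The definition rule described above can be expressed as a short check. The following Python sketch is illustrative only: the function name, the dictionary layout, and the T scores in the usage example are hypothetical, and the decision about which clinical scales enter the profile is left to the researcher.

```python
def two_point_code_type(t_scores, min_definition=5):
    """Return (code, definition, is_well_defined) for a clinical-scale profile.

    t_scores maps scale labels (e.g., "2", "7") to T scores.
    Definition is the T-score difference between the lower scale in the
    code type and the next highest scale in the profile.
    """
    if len(t_scores) < 3:
        raise ValueError("at least three clinical scales are required")
    # Rank scales from highest to lowest T score.
    ranked = sorted(t_scores.items(), key=lambda item: item[1], reverse=True)
    (first, _), (second, second_t), (_, third_t) = ranked[:3]
    definition = second_t - third_t
    return first + second, definition, definition >= min_definition


# Hypothetical profile: Scale 2 highest, Scale 7 second, Scale 8 third.
profile = {"1": 58, "2": 78, "3": 60, "4": 55, "6": 62,
           "7": 74, "8": 66, "9": 50}
code, definition, well_defined = two_point_code_type(profile)
print(code, definition, well_defined)
```

In this made-up profile Scale 7 exceeds the third highest scale by 8 T-score points, so the "27" code type would be treated as well defined; raising Scale 8 to 71 would leave only 3 points of definition and the profile would be excluded from a defined-code-type sample.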
Those who have argued against using defined code types (e.g., Dahlstrom, 1992 ) seem to be ignoring measurement error. Tellegen and Ben-Porath (1993) have articulated the rationale for using defined code types. Basically, when persons are categorized using code types that are not well defined, researchers cannot have confidence that these persons really belong to the category suggested by their two highest scales. On retest, even after very short periods, their code types are not likely to remain the same as on the original testing. Furthermore, nonrestrictive code types are too heterogeneous, and their reported replicated correlates have been comparatively sparse.
The use of defined code types (i.e., those in which the lowest scale in the code type is at least 5 T-score points higher than the next highest clinical scale in the profile) ensures that the persons who are categorized as belonging to a particular code type have MMPI2 scale configurations that are meaningfully different from persons who do not have that code type. Therefore, it is more likely that persons with the code type will differ in meaningful ways from other persons on relevant extratest measures. Correlates of defined code types are expected to be more homogeneous than those of nonrestricted code types. Because the relationships between code type membership and extratest characteristics are expected to be stronger when defined code types are studied, the likelihood that identified characteristics will apply to other persons with designated code types increases when the data are based on defined code types.
The original MMPI clinical scales epitomize the development of scales according to an empirical item selection approach. These scales were developed by empirically differentiating clearly defined clinical groups from a general sample of normal individuals and from each other ( Hathaway & McKinley, 1943 ). This contrasting groups method produced scales that were externally valid, but the scales possessed internal properties that were found to be cumbersome and undesirable (e.g., item overlap, heterogeneous content, and chance [subtle] items).
Despite these problems, empirically derived scales can be valuable measuring instruments, particularly if the internal structure problems can be reduced or eliminated. It is important to ensure that external criteria selected for scale development are appropriate to the construct of interest. The criterion group problem is notoriously difficult to solve in psychology (e.g., Hathaway, 1972; Megargee, 1979). However, empirical construct definition is a defensible alternative to rational or theoretical construct development. Given an ample item pool, careful construct definition, and selection of appropriate external criteria, empirical scale development is viable. Once an empirical construct is defined and an initial scale constructed, then careful refinement and development can improve on the initial item set.
The MMPI2 contains a broad range of new items that could allow for the development of an expanded array of effective scales. The following eight suggestions were made by Butcher and Williams (1992) for developing and improving new MMPI2 scales:
1. The construct on which the scale is being developed should be well defined. The construct should be demonstrated to be related to personality or symptomatic variables. Nonpersonality factors, such as abilities, work skills, and intellectual qualities, are not appropriate criteria for MMPI2 scale development.
2. The item pool needs to be pertinent to the construct being assessed. The MMPI2 item pool does not contain a sufficient item base for all potential personality variables. It is not possible, for example, to develop a scale measuring language ability or successful performance as a business manager or professional athlete. Scale developers need to determine if the item content of the MMPI2 contains a sufficient range of items for the construct in question.
3. Cross-validation is required to develop empirical scales. Cross-validation is important in empirical scale construction to eliminate items that would be selected on the basis of chance or specific sample characteristics. The sample sizes used for the developmental and the cross-validated studies should be large enough to provide stable test scores and to reduce error.
4. The resulting empirical scale should possess appropriate statistical properties. For example, if the scale is proposed as a measure of a single dimension, it should be internally consistent (e.g., a coefficient alpha of .70 or greater). If the proposed measure is a multidimensional empirical scale, it is not appropriate to require that the scale be internally consistent. However, the resulting scale should possess acceptable test-retest reliability.
5. All empirical scales should have clearly defined empirical correlates. It is important for any scale to measure what it is supposed to measure, regardless of the methodology used in scale construction. However, it is particularly important for empirical scales to have externally based test correlates.
6. Incremental validity of the scale should be established. A new scale must add significantly to prediction of relevant behaviors beyond what is possible using existing scales if it is to be useful for applied purposes. There is a more detailed discussion of establishing incremental validity later in this article.
7. Uses for the scale should be explored and demonstrated. Does the scale possess sufficient predictive power? How well does the scale classify relevant cases?
8. Construct validity for the scale should be reported. Initial publication on a newly derived empirical scale should present as much evidence as possible concerning its validity ( Cronbach & Meehl, 1955 ). Additional evidence concerning criterion, discriminant, convergent, and concurrent validity will accumulate over time.
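As an illustration of the internal consistency criterion in Suggestion 4, the following Python sketch computes coefficient alpha from item-level responses. It is a minimal example under stated assumptions: the function name and the 0/1 (false/true) item responses are hypothetical and are not data from any MMPI2 sample.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient (Cronbach's) alpha.

    item_scores is a list of respondents; each respondent is a list of
    item scores of equal length (e.g., 0 = false, 1 = true).
    """
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores])
                 for i in range(n_items)]
    # Variance of the total (summed) scale score.
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of six persons to a four-item scale.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
]
alpha = coefficient_alpha(responses)
print(round(alpha, 2), "meets .70 criterion:", alpha >= 0.70)
```

A value computed this way applies only to scales proposed as unidimensional; for a multidimensional empirical scale, test-retest reliability is the more appropriate index, as noted in Suggestion 4.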
Establishing Incremental Validity
Many research studies (published and unpublished) have developed new MMPI and MMPI2 scales or indexes based on new or existing scales. Such studies should report data concerning the validity of the new scales or indexes. Validity information should include relationships between MMPI2 measures and conceptually relevant extratest measures. For example, a scale developed to predict response to a particular treatment program should be validated by comparing the scale with outcome data for clients who have or have not been involved with the treatment program. However, for a new scale or index to have clinical utility, it must contribute incrementally to the prediction of the relevant behaviors. In other words, data must be presented demonstrating that the new scale or index adds significantly to the prediction of the relevant behaviors beyond what can be accomplished by using existing and more familiar scales or indexes.
A study by Ben-Porath, Butcher, and Graham (1991) demonstrates the appropriate methodology. These investigators studied the incremental validity of MMPI2 content scales. They reported that two standard MMPI2 clinical scales (Scales 2 and 8) could discriminate significantly between patients who had diagnoses of either schizophrenia or major affective disorder. However, adding two MMPI2 content scales (Depression and Bizarre Mentation) to the two clinical scales resulted in a significant increase in discrimination between the two groups. Thus, these findings offered some preliminary support for the clinical use of the Depression and Bizarre Mentation content scales.
Timbrook, Graham, Keiller, and Watts (1993) studied the extent to which subtle and obvious subscale scores could detect persons who were malingering on the MMPI2. They found that scores on the Obvious subscales were related significantly to the detection of malingering, but adding Obvious subscale scores to a standard validity scale (F) did not increase the detection of malingering. Thus, their study did not support the clinical use of the Subtle and Obvious subscales.
It should be noted that the utility of a new scale may not be apparent in a single study. A new scale that does not add incrementally to existing scales in one setting or for one purpose could eventually be demonstrated to have incremental validity in a different setting or for a different purpose. However, until such time as a scale's incremental validity has been established, there is no foundation for recommending its use in clinical applications.
Establishing Validity of Cutoff Scores
A common practice in MMPI2 interpretation is to use cutoff scores to indicate clinically meaningful elevation. The best example of this is the black line drawn at the T score of 65 on the MMPI2 profile to mark the point of clinically meaningful elevation on the clinical scales. Cutoffs also have been proposed for other MMPI2 scales (e.g., a raw score greater than 26 or 28 on MAC-R or a T score greater than 80 on VRIN).
MMPI2 researchers who are interested in studying the validity of cutoff scores often report various hit rate indexes. In the typical study, a sample of individuals known to have a certain condition is compared with others who are known not to have that condition. For both groups, MMPI2 scores are available. To illustrate this paradigm, let us consider the condition of alcohol abuse and the MAC-R scale. Figure 1 presents the 2 × 2 contingency table that can be derived from the available data along with formulas that define the relevant indexes.
The base rate, then, is the percentage of participants in the sample who actually have the condition of interest (in this case alcohol abuse). The selection ratio is the percentage of participants who exceed the MAC-R cutoff score and are therefore predicted to have the condition of alcohol abuse. In many MMPI2 cutoff score studies, researchers report the sensitivity and specificity indexes. In our example, sensitivity is the probability that an individual who has the alcohol abuse condition will be identified as an alcohol abuser by an elevated MAC-R score (i.e., having a score on MAC-R that exceeds the designated cutoff), and specificity is the probability that an individual who does not have the alcohol abuse condition will not be identified as an alcohol abuser by the MAC-R (i.e., the MAC-R score does not exceed the designated cutoff).
The sensitivity and specificity of tests are of particular interest to epidemiologists who wish to assess the accuracy of prevalence and other epidemiological indexes. However, they are not of primary interest to MMPI2 researchers who wish to investigate the validity of proposed cutoff scores. Instead, the focus of MMPI2 researchers should be on the positive and negative predictive power of the scale using the proposed cutoff. Positive predictive power refers to the probability that an individual identified by an elevated MAC-R actually has the condition of alcohol abuse. Negative predictive power is the probability that an individual who does not have an elevated MAC-R score actually does not have the condition of alcohol abuse. Positive and negative predictive power are more important than sensitivity and specificity because in actual clinical practice one knows the individual's test score and wants to evaluate the score's accuracy in determining the presence of a particular condition. This information is conveyed by predictive power indicators, not by sensitivity and specificity.
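The indexes discussed above follow directly from the four cells of the contingency table. The Python sketch below applies the formulas given in the Figure 1 caption; the function name is hypothetical, the cell meanings (A = condition present and score above cutoff, B = condition absent and score above cutoff, C = condition present and score at or below cutoff, D = condition absent and score at or below cutoff) are inferred from those formulas, and the counts in the usage example are invented for illustration.

```python
def hit_rate_indexes(a, b, c, d):
    """Compute the Figure 1 indexes from the cells of a 2 x 2 table.

    a: condition present, score above cutoff
    b: condition absent, score above cutoff
    c: condition present, score at or below cutoff
    d: condition absent, score at or below cutoff
    """
    n = a + b + c + d
    return {
        "base_rate": (a + c) / n,
        "selection_ratio": (a + b) / n,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "positive_predictive_power": a / (a + b),
        "negative_predictive_power": d / (c + d),
    }


# Invented counts: 100 persons, half of whom have the condition.
for name, value in hit_rate_indexes(a=45, b=5, c=5, d=45).items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity condition on the known status of the person (the columns of the table), whereas positive and negative predictive power condition on the test result (the rows), which is the information available to a clinician.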
Positive and negative predictive power are sometimes termed true positives and true negatives. These are the relevant validity indicators for cutoff score evaluation; sensitivity and specificity are not. A common practice in MMPI2 cutoff research is to present cumulative frequency data on the scale of interest for individuals who have and do not have the condition of interest. These frequencies are then examined to identify the optimal cutoff score. Unfortunately, such analyses identify a score that maximizes sensitivity and specificity, which, as was just indicated, are less relevant validity indicators. Researchers who wish to identify optimal cutoff scores must focus on positive and negative predictive power instead.
A second methodological flaw found in some MMPI2 cutoff studies is failure to cross-validate proposed cutoff scores. Just as one would expect some shrinkage in a multiple correlation when a regression equation is cross-validated with a new sample, it is also likely that hit rates will shrink when a cutoff is applied to a new sample. It is therefore inappropriate to report hit rate data based on the same sample that was used to identify the optimal cutoff score.
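The shrinkage described above can be illustrated with a small Python sketch in which a cutoff is selected in one sample and its hit rates are then reported for a separate sample. Everything here is hypothetical: the function names, the raw scores, and in particular the selection rule (the cutoff with the largest sum of positive and negative predictive power), which is only one of several rules a researcher might adopt.

```python
def predictive_power(scores, has_condition, cutoff):
    """Return (PPP, NPP) when scores above `cutoff` are called positive.

    An index is None when its denominator is zero.
    """
    tp = fp = fn = tn = 0
    for score, condition in zip(scores, has_condition):
        if score > cutoff:
            if condition:
                tp += 1
            else:
                fp += 1
        elif condition:
            fn += 1
        else:
            tn += 1
    ppp = tp / (tp + fp) if tp + fp else None
    npp = tn / (tn + fn) if tn + fn else None
    return ppp, npp


def choose_cutoff(scores, has_condition, candidates):
    """Pick the candidate cutoff with the largest PPP + NPP (assumed rule)."""
    best_cutoff, best_value = None, -1.0
    for cutoff in candidates:
        ppp, npp = predictive_power(scores, has_condition, cutoff)
        if ppp is None or npp is None:
            continue  # skip cutoffs that classify everyone the same way
        if ppp + npp > best_value:
            best_cutoff, best_value = cutoff, ppp + npp
    return best_cutoff


# Invented derivation sample (raw scores and known status).
dev_scores = [20, 22, 24, 25, 27, 28, 29, 30, 31, 33]
dev_status = [False, False, False, True, False, True, True, True, True, True]
# Invented cross-validation sample.
new_scores = [19, 23, 25, 26, 26, 28, 30, 32]
new_status = [False, False, False, True, False, True, False, True]

cutoff = choose_cutoff(dev_scores, dev_status, range(21, 32))
print("cutoff:", cutoff)
print("derivation sample:", predictive_power(dev_scores, dev_status, cutoff))
print("cross-validation:", predictive_power(new_scores, new_status, cutoff))
```

With these invented data, positive predictive power is lower in the second sample than in the sample used to select the cutoff, which is why hit rates should be reported for the cross-validation sample.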
A further consideration in evaluating cutoff scores is the validity of the criterion information. The essential question in our example would be: Are all individuals whom we assign to the known alcohol abuse group in fact alcohol abusers, and are all individuals assigned to the nonalcohol-abusing group indeed not alcohol abusers? The accuracy of our hit rate estimates is constrained by the validity of these classifications. Often, these assignments are based on clinical diagnosis. It is, of course, well known that clinical diagnoses are not infallible, and we should therefore recognize that our hit rate estimates are only as accurate as our criterion data. For example, it is possible that there would be unrecognized substance abusers in a psychiatric comparison sample.
A final consideration in conducting MMPI2 cutoff research is the issue of base rates ( Finn & Kamphuis, 1995 ). It has long been recognized that base rates play a vital role in the accuracy of categorical prediction ( Meehl & Rosen, 1955 ). In conducting cutoff research, investigators sometimes create artificial base rates. Using our example, a researcher might collect an equal number of alcohol abusers and nonalcohol abusers. As seen in the formula in Figure 1 , this creates a situation in which the base rate is .50. However, as the base rate in a given clinical setting deviates from .50, the predictive power of the test diminishes. Moreover, this effect could be masked by very high sensitivity indexes. An example of this situation is depicted in Figure 2 .
The top panel of Figure 2 depicts an example in which the base rate is .50 and all four indexes are .90. However, in the bottom panel, when the base rate drops to .20, sensitivity remains at a misleadingly high .90, whereas the more relevant validity indicator, positive predictive power, is reduced to an unacceptably low .20 (negative predictive power is .80). All else being equal, as the base rate of a variable diminishes, the positive predictive power of tests that are designed to identify the variable decreases.
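The dependence of predictive power on the base rate can be made explicit by computing both indexes from the base rate, sensitivity, and specificity. The Python sketch below is illustrative; the function name is hypothetical, and the input values in the first two calls are the index values reported in the Figure 2 caption.

```python
def predictive_power_from_rates(base_rate, sensitivity, specificity):
    """Return (PPP, NPP) implied by a base rate, sensitivity, and specificity.

    An index is None when its denominator is zero.
    """
    for value in (base_rate, sensitivity, specificity):
        if not 0 <= value <= 1:
            raise ValueError("all inputs must lie between 0 and 1")
    # Expected proportions of the sample falling in each cell of the table.
    true_pos = base_rate * sensitivity
    false_neg = base_rate * (1 - sensitivity)
    true_neg = (1 - base_rate) * specificity
    false_pos = (1 - base_rate) * (1 - specificity)
    ppp = true_pos / (true_pos + false_pos) if true_pos + false_pos else None
    npp = true_neg / (true_neg + false_neg) if true_neg + false_neg else None
    return ppp, npp


# Index values from the Figure 2 caption (top panel, then bottom panel).
print(predictive_power_from_rates(0.50, 0.90, 0.90))
print(predictive_power_from_rates(0.20, 0.90, 0.10))

# Holding sensitivity and specificity at .90 while the base rate falls.
for base_rate in (0.50, 0.20, 0.05):
    ppp, npp = predictive_power_from_rates(base_rate, 0.90, 0.90)
    print(f"base rate {base_rate:.2f}: PPP {ppp:.2f}, NPP {npp:.2f}")
```

The final loop shows the point made in the text: even when sensitivity and specificity are both held at .90, positive predictive power falls as the condition becomes rarer (to about .69 at a base rate of .20).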
Therefore, when conducting MMPI2 cutoff research, it is essential to design the study such that the base rate in the sample approximates that found in the clinical setting in which the results are to be applied. Moreover, it is wrong to assume that hit rates identified in one setting apply to other settings in which the base rate of the condition is different. For actual examples of the issues discussed in this section and illustrated in Figure 2 , the reader is referred to a review by Gottesman and Prescott (1989) of the methodological limitations in research with the original MMPI MAC scale.
Because individuals from ethnic or racial groups were not included in the development or norming of the original MMPI, there was concern that the test might be biased against members of such groups. Although early studies suggested important MMPI differences between ethnic and racial group participants and majority participants, later research indicated that MMPI differences between these groups were minimal when the groups were equated for socioeconomic status.
Although early research concerning test bias emphasized differences in mean MMPI scores between ethnic and racial and majority groups, current research stresses that the absence of mean differences does not prove that the test is unbiased, nor does the presence of mean differences prove that the test is biased. Pritchard and Rosenblatt (1980) emphasized that test bias should be assessed by determining if the MMPI is as valid for racial and ethnic group participants as it is for majority group participants. In other words, does the MMPI2 permit equally accurate predictions about members of both groups? If it does, then the test is not biased, whether or not the groups have similar or different mean scores.
Timbrook and Graham (1994) illustrated this methodology using individuals from the MMPI2 normative sample. MMPI2 scores were used to predict conceptually relevant characteristics of African American and Caucasian normative individuals, and the relative accuracy of predictions was assessed by comparing predictions to ratings of the normative individuals completed by spouses or others who knew them very well. Although there were some significant mean differences between African Americans and Caucasians on some MMPI2 scales, accuracy of predictions of extratest characteristics from MMPI2 scores was not different for African Americans and Caucasians. Although this study is viewed as preliminary because it used nonclinical participants, examined only one ethnic group, and did not include all MMPI2 scales (because appropriate criterion measures were not available), it illustrates a strategy for studying possible ethnic bias.
Figure 1. 2 × 2 contingency table derived from MacAndrew Alcoholism Scale-Revised (MAC-R) cutoff scores for alcohol abusers versus nonalcohol abusers. Formulas: base rate = (A + C)/(A + B + C + D); selection ratio = (A + B)/(A + B + C + D); sensitivity = A/(A + C); specificity = D/(B + D); positive predictive power = A/(A + B); negative predictive power = D/(C + D).
Correspondence may be addressed to
Electronic mail may be sent to butch001@maroon.tc.umn.edu
Received:
Revised: April 3, 1995
Accepted: April 3, 1995

Figure 2. 2 × 2 contingency tables derived from MacAndrew Alcoholism Scale-Revised (MAC-R) cutoff scores for alcohol abusers versus nonalcohol abusers. Top panel: Table and corresponding index values when the base rate = .50 (sensitivity = .90; specificity = .90; positive predictive power = .90; negative predictive power = .90). Bottom panel: Table and corresponding index values when the base rate = .20 (sensitivity = .90; specificity = .10; positive predictive power = .20; negative predictive power = .80).