
Statistical Science
1987, Vol. 2, No. 4, 396-433

A Review of Multivariate Analysis

Mark J. Schervish

A survey of topics in multivariate analysis inspired by the publication of T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd ed., John Wiley & Sons, New York, 1984, xvii + 675 pages, $47.50, and William R. Dillon and Matthew Goldstein, Multivariate Analysis: Methods and Applications, John Wiley & Sons, New York, 1984, xii + 587 pages, $39.95. This review and discussion are dedicated to the memory of P. R. Krishnaiah, a leader in the area of multivariate analysis, who died of cancer on August 1, 1987.

1. INTRODUCTION

It has been a long time coming, but it is finally here. The second edition of T. W. Anderson's classic, An Introduction to Multivariate Statistical Analysis, will please all of those who have enjoyed the first edition for so many years. It essentially updates the material in the first edition without going far beyond the topics already included there. A reader who had spent the intervening 26 years on another planet might get the impression that work in multivariate analysis has been concentrated on just those topics, with the addition of factor analysis. Of course this impression is mistaken, and Anderson himself notes in the Preface (page vii) that "It is impossible to cover all relevant material in this book." So, in the course of reviewing this book, and comparing it to the first edition, I thought it might be interesting to take a thoroughly biased and narrow look at the development of multivariate analysis over the 26 years between the two editions. A reader interested in a more complete and less personalistic review might refer to Subramaniam and Subramaniam (1973) and/or Anderson, Das Gupta and Styan (1972). Recent reviews of some contemporary multivariate texts (less cluttered by reviewer bias) were performed by Wijsman (1984) and Sen (1986).

Suppose we begin at the end. Nearly simultaneous with the publication of the second edition of Anderson's book is the release of Multivariate Analysis by Dillon and Goldstein (the Prefaces are dated June and May 1984, respectively). This text, which is subtitled Methods and Applications, is different from Anderson's in every respect except the publisher. It even seems to begin where Anderson leaves off, with factor analysis and principal components. I believe that the

Mark J. Schervish is Associate Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213.

differences between the texts reflect two very different directions in which multivariate analysis has progressed. The topics covered by Dillon and Goldstein have, by and large, been developed more recently than those covered by Anderson. As an illustration, fewer than 18% of the references cited by Dillon and Goldstein are pre-1958, whereas almost 42% of Anderson's references are pre-1958. (Of course Anderson had a head start, but the other authors had access to his 1958 book. In three places, they cite Anderson's 1958 book in lieu of earlier work.) The major difference in emphasis is between theory and methods. To illustrate this distinction, Anderson had twelve examples worked out with data in his first edition, and the same examples appear in the second edition, with no new ones (but one correction). This is due, in large part, to the fact that the topics covered in the two editions are nearly identical. (Although factor analysis has been added as a topic, no numerical examples are given, and no numerical exercises are included.) Dillon and Goldstein work out numerous examples, often reanalyzing the same data several times to illustrate the differences between various techniques.

Since 1958, the development of multivariate theory has been concentrated, to a large extent, in the general areas that Anderson covered in his first edition. Multivariate methods, on the other hand, have taken on a life of their own, with or without the theory that mathematical statisticians would like to see developed. This has led to an entire industry of exploratory and ad hoc methods for dealing with multivariate data. Researchers are not about to wait for theoreticians to develop the necessary theory when they perceive the need for methods that they think they understand. The theoretical statisticians' approach to multivariate analysis seems to have been to follow the first principle of classical inference: "If the problem is too hard, test a hypothesis." The development of procedures like cluster analysis, factor analysis, graphical



methods and the like argues that more tests are not going to be enough to satisfy the growing desire for useful multivariate methods.

2. BACK TO THE BEGINNING

2.1 What's Old

The basic theoretical results with which Anderson began his first edition are repeated in the second edition with only minor clarifications. These include the properties of the multivariate normal distribution and the sampling distributions of the sufficient statistics. They comprise the bulk of Chapters 2 and 3. Dillon and Goldstein deal with all of these concepts in fewer than 12 pages of an appendix. The new material that Anderson adds to Chapter 3 includes the noncentral χ² distribution for calculation of the power functions of tests with known covariance matrices. In Chapter 5, he adds a section on the power of tests based on Hotelling's T². The pace at which power functions have been calculated for multivariate procedures is very much slower than the pace at which tests have been proposed, even though it does not make much sense to test a hypothesis without being able to examine the power function. For a level α chosen without regard to the power, one could reject with too high a probability for alternatives fairly close to the hypothesis or with too low a probability for alternatives far away without knowing it. (See Lehmann, 1958, and Schervish, 1983, for discussions of this issue in the univariate case.) Multivariate power functions are, of course, much more difficult to produce than are tests. They are also more difficult to understand than univariate power functions. Even in the simple case of testing that the mean vector μ equals a specific value ν based on Hotelling's T², the power function depends on the quantity τ² = (μ − ν)ᵀΣ⁻¹(μ − ν). Just as in univariate analysis, it is rarely (if ever) the case that one is interested in testing that μ exactly equals ν; rather one is interested in how far μ is from ν. If one uses the T² test, one is implicitly assuming that τ² adequately measures that distance. If it does not, one needs a different test. If τ² is an adequate measure, what one needs is some post-data measure of how far τ² is likely to be from 0. The posterior distribution of τ² would serve this purpose. This posterior distribution is easy to derive in the conjugate prior case. In Chapter 7 (page 270), Anderson derives the posterior joint distribution of μ and Σ. This posterior is given by

\mu \mid \Sigma \sim N_p(\mu_1, \Sigma/\lambda_1), \qquad \Sigma \sim W_p^{-1}(A_1, a_1), \qquad (1)

where W_p⁻¹(A₁, a₁) denotes the inverse Wishart distribution with scale matrix A₁, dimension p, and a₁ degrees of freedom. In words, the conditional distribution of μ given Σ is p-variate normal with mean vector μ₁ and covariance matrix Σ/λ₁; the marginal distribution of Σ is inverse Wishart. The constants μ₁, λ₁, A₁, and a₁ are functions of both the data and the prior, but their particular values are not important to the present discussion. (For large sample sizes, a₁ and λ₁ are both approximately the size of the sample, whereas μ₁ is approximately the sample mean vector and A₁ is approximately the sample sum of squares and cross-products matrix.) It follows that, conditional on Σ, λ₁τ² has a noncentral χ² distribution with p degrees of freedom and noncentrality parameter

\eta = \lambda_1 (\mu_1 - \nu)^T \Sigma^{-1} (\mu_1 - \nu).

The distribution of η is a one-dimensional Wishart or gamma distribution Γ(a₁/2, ψ⁻²/2), where

"1/;2 = AdJ..Ll - v)TAll(J..Ll - v).

We get the marginal distribution of τ² by integrating η out of the joint distribution of τ² and η. The result is that the cumulative distribution function of τ² is

F(t) = \sum_{k=0}^{\infty} \left[ \left(\frac{1}{1+\psi^2}\right)^{a_1/2} \left(\frac{\psi^2}{1+\psi^2}\right)^{k} \frac{\Gamma(k + a_1/2)}{k!\,\Gamma(a_1/2)} \int_0^t \frac{(\lambda_1/2)^{k+p/2}}{\Gamma(k+p/2)}\, u^{k+p/2-1} \exp\!\left(-\frac{\lambda_1 u}{2}\right) du \right].

This function can be accurately calculated numerically by using an incomplete gamma function program and only a few terms in the summation, because the integral decreases as k increases. Due to the similarity that this distribution bears to the noncentral χ² distribution (the only difference being that the coefficients are generalized negative binomial probabilities rather than Poisson probabilities), I will call it the alternate noncentral χ²(p, a₁, ψ²/(1+ψ²)), abbreviated ANC χ². The ANC χ² distribution was derived in a discriminant analysis setting by Geisser (1967). It also turns out to be the distribution of many of the noncentrality parameters in univariate analysis of variance tests.

In other cases, when τ² does not adequately measure the distance between μ and ν, the experimenter will have to say exactly how he/she would like to measure that distance. Perhaps several different measures are important. One thing theoretical statisticians can do is to derive posterior distributions for a wide class of possible distance measures in the hope that at least one of them will be appropriate in a given application. What they are more likely to do is to propose more tests whose power functions depend on parameters other than τ². Any movement in this direction, however, would be welcome in that it would force users to think about what is important to detect before just using the easiest procedure.
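As a concrete illustration of the numerical remark above, the series for F(t) can be evaluated with a regularized incomplete gamma routine and a truncated sum. The Python sketch below is not from either book; the function name, argument names and truncation point kmax are illustrative choices.

```python
import numpy as np
from scipy.special import gammaln, gammainc   # gammainc = regularized lower incomplete gamma

def anc_chi2_cdf(t, p, a1, lam1, psi2, kmax=200):
    """Posterior CDF of tau^2 at t: a negative-binomial mixture of gamma CDFs (a sketch).

    Term k has weight Gamma(k + a1/2) / (k! Gamma(a1/2)) * (1/(1+psi2))^(a1/2)
    * (psi2/(1+psi2))^k, multiplied by the CDF at t of a
    Gamma(shape = k + p/2, rate = lam1/2) random variable.
    """
    ks = np.arange(kmax + 1)
    log_w = (gammaln(ks + a1 / 2.0) - gammaln(ks + 1.0) - gammaln(a1 / 2.0)
             + (a1 / 2.0) * np.log(1.0 / (1.0 + psi2))
             + ks * np.log(psi2 / (1.0 + psi2)))
    # gammainc(a, x) is the CDF at x of a Gamma(a, rate 1) variable, so rescale t by lam1/2
    gamma_cdfs = gammainc(ks + p / 2.0, lam1 * t / 2.0)
    return float(np.sum(np.exp(log_w) * gamma_cdfs))
```

Because the weights decay geometrically in k, the truncated sum converges after a modest number of terms.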

2.2 What's New



An interesting addition to the chapter on Hotelling's T² is Section 5.5 on the multivariate Behrens-Fisher problem. Consider q samples of size Nᵢ, i = 1, ..., q, from normal distributions with different covariance matrices. The goal is to test H: Σᵢ₌₁^q βᵢμᵢ = ν. The procedures described amount to transforming the q samples into one sample of size min{N₁, ..., N_q} in such a way that the mean of the observations in the one sample is Σᵢ₌₁^q βᵢμᵢ. The usual T² statistic is now calculated for this transformed sample. These methods are classic illustrations of the level α mindset, that is, the overriding concern for having a test procedure with prechosen level α regardless of the data structure, sample size or application. Data is discarded with a vengeance by the methods described in this section, although Anderson claims (page 178), "The sacrifice of observations in estimating a covariance matrix is not so important." Also, the results depend on the order in which observations are numbered. Of course, the posterior distribution of Σᵢ₌₁^q βᵢμᵢ is no simple item to calculate, but some effort might usefully be devoted to its derivation or approximation.

One other unfortunate feature of Section 5.5 is the inclusion of what Anderson calls (page 180) "Another problem that is amenable to this kind of treatment." This is a test of the hypothesis μ⁽¹⁾ = μ⁽²⁾, where

\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}

is the mean vector of a 2q-variate normal distribution. The test given is a special case of the general test of H: Aμ = 0 with A of full rank. The general test is based on T² = N(Ax̄)ᵀ(ASAᵀ)⁻¹(Ax̄), and it neither discards degrees of freedom nor depends on the ordering of the observations. This test is simply not another example of the type of test proposed for the Behrens-Fisher problem.

A topic that has been added to the treatment of correlation is the unbiased estimation of correlation coefficients. This topic illustrates the second principle of classical inference: "Always use an unbiased estimator except when you shouldn't." The case of the squared multiple correlation is one in which you shouldn't use an unbiased estimator. When the sample multiple correlation R² is near 0, the unique unbiased estimator based on R² may be negative. This is not uncommon for unbiased estimators. Just because the average of an estimator over the sample space is equal to the parameter doesn't mean that the observed value of the estimator will be a sensible estimate of the parameter, even if the variance is as small as possible. I would suggest an alternative to the second principle of classical inference: "Only use an unbiased estimator when you can justify its use on other grounds."
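For concreteness, here is a minimal sketch (Python, not from Anderson's text; the function and argument names are illustrative) of the general test of H: Aμ = 0 described above, using the standard F transformation of T².

```python
import numpy as np
from scipy import stats

def general_t2_test(X, A):
    """Hotelling's T^2 for H: A mu = 0, with A of full row rank (a sketch).

    X is an (N, p) sample assumed multivariate normal; A is (q, p).
    Under H, T^2 * (N - q) / (q * (N - 1)) has an F(q, N - q) distribution.
    """
    N = X.shape[0]
    q = A.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                      # unbiased sample covariance matrix
    Ax = A @ xbar
    T2 = N * Ax @ np.linalg.solve(A @ S @ A.T, Ax)
    F = T2 * (N - q) / (q * (N - 1))
    return T2, stats.f.sf(F, q, N - q)
```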

3. DECISION THEORY AND BAYESIAN INFERENCE

A welcome addition to the second edition is the treatment of decision theoretic concepts in various places in the text. In Section 3.4.2, the reader first sees loss and risk as well as Bayesian estimation. Admissibility of tests based on T² is discussed in Section 5.6. One topic in the area of admissibility of estimators that has been studied almost furiously since 1958 is James-Stein type estimation. Stein (1956) showed that the maximum likelihood estimate (MLE) of a multivariate mean (with known covariance) is inadmissible with respect to sum of squared errors loss when the dimension is at least 3. Then, James and Stein (1961) produced the famous "shrunken" estimator, which has everywhere smaller risk function. Since that time, the literature on shrunken estimators has expanded dramatically to include a host of results concerning their admissibility, minimaxity and proper Bayesianity. Anderson has added a brief survey of those results in a new Section 3.5. He seems, however, reluctant to recommend a procedure that acknowledges its dependence on subjective information. This is evidenced by his comment (page 91) concerning the improvement in risk for the James-Stein estimator of μ, shrunken toward ν:

However, as seen from Table 3.2, the improvement is small if μ − ν is very large. Thus, to be effective some knowledge of the position of μ is necessary. A disadvantage of the procedure is that it is not objective; the choice of ν is up to the investigator.

Anderson comes so close to recognizing the importance of subjective information in making good inferences, but I will not accuse him of having Bayesian tendencies based on the above remark. It should also be noted, of course, that the choice of the multivariate normal distribution as a model for the data Y is also not objective, and is probably of greater consequence than the choice of ν. For example, if the chosen distribution of Y had infinite second moments and μ were still a location vector, admissibility with respect to sum of squared errors loss would not even be studied seriously.

In addition to the simple shrinkage estimator and its varieties, Anderson reviews such estimators for the mean in the case in which the covariance matrix is unknown (Section 5.3.7) and for the covariance matrix itself (Section 7.8). He also gives the joint posterior distribution of μ and Σ based on a conjugate prior, as well as the marginal posteriors of μ and Σ. He does not give any predictive distributions, for example, the distribution of a single future random vector, or of the average of an arbitrary number of future observations. Unfortunately, he got the covariance matrix of the marginal distribution of μ incorrect. For those of you


who are reading along (page 273), the correct formula is [(N + k)(N + m − 1 − p)]⁻¹ B. Press (1982) gives a more detailed presentation of a Bayesian approach to inference in multivariate analysis. Bayesian inference in multivariate analysis has not progressed by anywhere near the amount that classical inference has. An oversimplified reason may be the fact that everyone knows what to do when you use conjugate prior distributions and nobody knows what to do when you don't. There are, however, many (perhaps too many) problems that can still be addressed within the conjugate prior setting. There is the issue of exactly what summaries should be calculated from the posterior distribution. The standard calculations are moments and regions of high posterior density. The first principle of Bayesian inference appears to be "Calculate something that is analogous to a classical calculation." The Bayesian paradigm is much more powerful than that, however. Having the posterior distribution theoretically allows the calculation of posterior probabilities that parameters are in arbitrary sets. It also allows the calculation of the predictive distribution of future data, which in turn includes the probabilities that future observations lie in arbitrary sets. These are the sorts of numerical summaries that people would like to see, but the technology needed to supply them is very slow in developing.

One reason for the slow progress in Bayesian methods is the computational burden of performing even the simplest of theoretical calculations. Multivariate probabilities require enormous amounts of computer time to calculate. Also, calculation of summary measures when prior distributions are not conjugate is very time consuming. Programs like those developed by Smith, Skene, Shaw, Naylor and Dransfield (1984) are making such calculations easier, but more effort is needed. Computational difficulties have also hindered the development of power function calculations for multivariate tests. Perhaps breakthroughs in one area will help researchers in the other also.
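As a sketch of the kind of posterior summary being asked for, simple Monte Carlo under the conjugate posterior (1) already gives posterior probabilities of arbitrary sets. The fragment below is an illustration only; the function name and arguments are hypothetical, and SciPy's inverse Wishart parameterization should be checked against the convention used in (1).

```python
import numpy as np
from scipy.stats import invwishart

def posterior_event_probability(mu1, lam1, A1, a1, event, ndraws=5000, seed=0):
    """Monte Carlo posterior probability of an arbitrary event about (mu, Sigma), a sketch.

    Draws Sigma ~ inverse Wishart(scale A1, a1 df) and mu | Sigma ~ N_p(mu1, Sigma/lam1),
    then averages the indicator event(mu, Sigma).
    """
    rng = np.random.default_rng(seed)
    mu1 = np.asarray(mu1, dtype=float)
    hits = 0
    for _ in range(ndraws):
        Sigma = invwishart.rvs(df=a1, scale=A1, random_state=rng)
        mu = rng.multivariate_normal(mu1, Sigma / lam1)
        hits += bool(event(mu, Sigma))
    return hits / ndraws

# Example: Pr(tau^2 > 1 | data), with tau^2 = (mu - nu)' Sigma^{-1} (mu - nu) and nu = 0:
# prob = posterior_event_probability(mu1, lam1, A1, a1,
#            lambda mu, Sigma: mu @ np.linalg.solve(Sigma, mu) > 1.0)
```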

4. DISCRIMINANT ANALYSIS

Chapter 6 of Anderson, concerning classification, has expanded somewhat compared to the first edition, although the introductory sections have remained basically intact. Notation has been altered to reflect standardization. In addition, the formula for the "plug-in" discriminant function W and the formula for the maximum likelihood criterion Z are introduced for future comparison in a new section on error rates. A great deal of work had been done between the two editions in the area of error rate estimation. Some of this work is discussed in Section 6.6, "Probabilities of Misclassification." The presentation consists of several theorems and corollaries giving asymptotic expansions for error rates of classification rules based both

on W and on Z for the two population case. In light of the dryness of this section, perhaps the author can be forgiven for failing to discuss any results on error rate estimation in the case of several populations, such as the asymptotic expansions given by Schervish (1981a, b). Surprisingly, Dillon and Goldstein say even less about error rate estimation, giving only a verbal description of a few existing methods. This is an area in which recent progress has consisted mainly of the introduction of several methods involving bootstraps, jackknives and asymptotics. The theory behind the methods is a bit sparse, which helps to explain their neglect by Anderson, but not their shallow treatment by Dillon and Goldstein.

Anderson's treatment of the multiple group classification problem is identical in the two editions, although Dillon and Goldstein adopt the alternative approach based on the eigenanalysis of the matrix W⁻¹B, in their notation. In this approach, one tries to find a reduced set of discriminant functions that provides nearly the same discriminatory power as the optimal discriminant functions. For example, if one wishes to use only one discriminant function, one would choose the eigenvector of W⁻¹B corresponding to the largest eigenvalue. Geisser (1977) gives an example illustrating how this first linear discriminant function can lead to poorer classification than other linear functions that are not eigenvectors of W⁻¹B. The problem is that discriminatory power (measured by misclassification probability) is not reflected in the squared deviations that the eigenvalues of W⁻¹B measure. Guseman, Peters and Walker (1975) attack the problem of finding optimal reduced sets of discriminant functions for the purposes of classification. A simplified solution in the case of three populations was given by Schervish (1984). The theoretical analysis through the eigenstructure of W⁻¹B is based on (what else?) tests of the hypotheses that successive eigenvalues are 0. I hesitate to mention that the successive tests are rarely performed conditionally on the previous hypotheses being rejected, for fear that someone may then think that this would be an interesting problem to pursue. I was surprised to see Anderson suggesting a similar sort of sequential test procedure in the related problem of determining the number of nonzero canonical correlations. Anderson does note (page 498) that "these procedures are not statistically independent, even asymptotically." Dillon and Goldstein also give an example (11.2-2, page 405) of this successive unconditional testing. This example is noteworthy for another lapse of rigor which may be even more dangerous. They use V to denote the test statistic and say:

Because V = 269.59 is approximately distributed as χ² with P(K − 1) = 5(3) = 15 df, it is statistically significant at better than the 0.01 level.

Obviously, 269.59 is not approximately χ², but neither is V, since the hypothesis is most likely false. It seems a bit strange to use the approbatory description "better than" when "less than" is meant. It is as if one were rooting for the alternative. What kind of hypothesis testing habits will a reader with little theoretical statistical training develop if this is the type of example he/she is learning from?
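For readers who want the eigenanalysis of W⁻¹B in computational form, the following sketch (Python, illustrative names, not code from either book) builds the within- and between-groups SSCP matrices and extracts the discriminant directions in decreasing order of eigenvalue.

```python
import numpy as np

def discriminant_directions(X, labels):
    """Eigenanalysis of W^{-1} B for multiple-group discrimination (a sketch).

    X is (n, p) data and labels gives group membership.  Returns the eigenvalues
    and the corresponding coefficient vectors (columns), in decreasing order.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        centered = Xg - Xg.mean(axis=0)
        W += centered.T @ centered                    # within-groups SSCP
        d = Xg.mean(axis=0) - grand_mean
        B += len(Xg) * np.outer(d, d)                 # between-groups SSCP
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))   # W^{-1} B is not symmetric
    order = np.argsort(eigvals.real)[::-1]
    return eigvals.real[order], eigvecs[:, order].real
```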

5. EXPLORATORY METHODS

As mentioned earlier, several well known ad hoc procedures have emerged from the need to do exploratory analysis with multivariate data. These procedures can be quite useful for gaining insight from data sets or helping to develop theories about how the data is generated. Theoreticians often think of these procedures as incomplete unless they can lead to the calculation of a significance level or a posterior probability. (This reviewer admits to being guilty of that charge on occasion.) Although some procedures are essentially exploratory, such as Chernoff's (1973) faces, others may suggest probability models, which in turn lead to inferences. I discuss a few of the better known exploratory methods below. Of course, it is impossible to cover all exploratory methods in this review. None of these methods is described in Anderson's book, presumably due to the lack of theoretical results. Dillon and Goldstein give at least some coverage to each topic. Their coverage of cluster analysis and multidimensional scaling is adequate for an introductory text on multivariate methods, but I believe they shortchange the reader with regard to graphical methods (as does virtually every other text on multivariate analysis). Now that the computer age is in full swing, exploratory methods will become more and more important in data analysis as researchers realize that they do not have to settle for an inferential analysis based on normal distributions when all they want is a good look at the data.

5.1 Cluster Analysis

Cluster analysis is an old topic that has flourished to a large extent in the last 30 years, partly due to the advent of high speed computers that made it a feasible technique. It consists of a variety of procedures that usually require significant amounts of computation. It is essentially an exploratory tool, which helps a researcher search for groups of data values even without any clear idea of where they might be or how many there might be. Statistical concepts such as between groups and within groups dispersion have proven useful in developing such methods, but little statistical theory exists concerning the problems that give rise to the need for clustering.

Not surprisingly, some authors have begun to develop tests of the hypothesis that there is only one cluster. Here, one must distinguish two forms of cluster analysis. Cluster analysis of observations concerns ways of grouping observation vectors into homogeneous clusters. It is this form that has proven amenable to probabilistic analysis. The other form is cluster analysis of variables (or abstract objects) in which the only input is a matrix of pairwise similarities (or differences) between the objects. The actual values of the similarity measures often have no clear meaning, and when they do have clear meaning, there may be no suggestion of any population from which the objects were sampled or to which future inference will be applied. In these cases, cluster analysis may be nothing more than a technique for summarizing the similarity or difference measures in less numerical form. As an exploratory technique, cluster analysis will succeed or fail according to whether it either does or does not help a user better understand his/her data.

From a theoretical viewpoint, interesting questions arise from problems in which data clusters. Suppose we define a cluster probabilistically as a subset of the observations that arose independently (conditional on some parameters if necessary) from the same probability distribution. For convenience consider the case in which each of those specific distributions is a multivariate normal and the data all arose in one large sample. We may be interested in questions such as (i) What is the probability that there are 2 clusters? (ii) What is the probability that items k and j are in separate clusters if there are 2 clusters? (iii) If there are two clusters, where are they located? Answers to the three questions raised require probabilities that there are K clusters for K = 1, 2. They also require conditional distributions for the cluster means and covariances given the number of clusters, and they require probabilities for the 2ⁿ partitions of the n data values among the two clusters given that there are two clusters. There are some sensible ways to construct the above distributions, but the computations get out of hand rapidly as n increases. Furthermore, as the number of potential clusters gets larger than 2 or as the dimension of the data gets large, the theoretical problems become overwhelming. Following the first principle of classical inference, Engleman and Hartigan (1969) have proposed a test, in the univariate case, of the one cluster hypothesis with the alternative being that there are two clusters. Although easier to construct than the distributions mentioned, such a test doesn't begin to answer any of the three questions raised above.

5.2 Multidimensional Scaling

Dillon and Goldstein introduce multidimensional scaling (MDS) as a data reduction technique. Another


way to describe it would be as a data reconstruction technique. One begins with a set of pairwise similarities or differences among a set of objects and constructs a set of points in some Euclidean space (one point for each object) so that the distances between the points correspond (in some sense) to the differences or similarities between the objects (closer points being more similar). If the Euclidean space is two dimensional, such methods can provide graphical displays of otherwise difficult to read difference matrices. For example, the dimensions of the constructed space may be interpretable as measuring gross features of the objects. Any objects that are very different in those features should be far apart along the corresponding dimension.

There are two types of MDS. When the similarities or differences are measured on interval or ratio scales, then metric MDS can be used to try to make the distances between points in the Euclidean representation match the differences between the objects in magnitude. This type of scaling dates back to Torgerson (1952). When the similarities or differences are only ordinal, then nonmetric MDS can be used to find a Euclidean representation that matches the rank order of the distances to the rank order of the original difference measures. Shepard (1962a, b) and Kruskal (1964a, b) introduced the methods and computational algorithms of nonmetric MDS. The methodology of both types of MDS is not cluttered with tests of significance or probability models. In its current state it appears to be a purely exploratory technique designed for gaining insight rather than making inference.
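A minimal sketch of the metric (Torgerson-style) computation, assuming a complete dissimilarity matrix, is given below; it is an illustration rather than an algorithm taken from either book.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS in the spirit of Torgerson (1952), a sketch.

    D is an (n, n) matrix of pairwise dissimilarities.  Returns an (n, k)
    configuration whose interpoint distances approximate D.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:k]          # keep the k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[idx], 0, None))
    return eigvecs[:, idx] * scale
```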

5.3 Graphical Methods

Graphical display of multivariate data has been performed for many years. Tufte (1983) gives some excellent historical examples of multivariate displays. Computers have made the display of multivariate data much easier and allowed the introduction of techniques not considered feasible before. Chernoff's (1973) faces are one ingenious example, as are Andrews' (1972) function plots. Such methods are often used as part of a cluster analysis in order to suggest the number of clusters or to visually assess the results of a clustering algorithm. Gnanadesikan (1977) describes several other graphical techniques that can be used to detect outliers in multivariate samples. Tukey and Tukey (1981a, b, c) describe a large number of approaches to viewing multivariate samples, including Anderson's (1957) glyphs and the trees of Kleiner and Hartigan (1981). Most of these techniques require sophisticated graphics hardware and software in order to be used routinely. Their popularity (or lack thereof) is due in large part to both the expense involved in acquiring good graphics equip-

ment and the lack of a widely accepted graphics standard. That is, what runs on a Tektronix device will not necessarily run on an IBM PC or a CALCOMP, etc., unless the software is completely rewritten. Most statisticians (this author included) can think of more interesting things to do than rewriting graphics software to run on their own particular device. Perhaps the graphics kernel standard (GKS) will (slowly) eliminate this problem.
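Of the displays just mentioned, Andrews' function plots are simple enough to sketch in a few lines. The fragment below is illustrative only (names and plotting choices are arbitrary): each observation x is mapped to the curve f_x(t) = x₁/√2 + x₂ sin t + x₃ cos t + x₄ sin 2t + ... on [−π, π].

```python
import numpy as np
import matplotlib.pyplot as plt

def andrews_curves(X, n_points=200):
    """Andrews (1972) function plots, a sketch: similar rows trace similar curves."""
    X = np.asarray(X, dtype=float)
    t = np.linspace(-np.pi, np.pi, n_points)
    p = X.shape[1]
    # basis functions 1/sqrt(2), sin t, cos t, sin 2t, cos 2t, ...
    basis = [np.full_like(t, 1 / np.sqrt(2))]
    k = 1
    while len(basis) < p:
        basis.append(np.sin(k * t))
        if len(basis) < p:
            basis.append(np.cos(k * t))
        k += 1
    basis = np.vstack(basis)                     # shape (p, n_points)
    for curve in X @ basis:                      # one curve per observation
        plt.plot(t, curve, linewidth=0.8)
    plt.xlabel("t")
    plt.show()
```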

6. REGRESSION

Regression analysis, in one form or another, is probably the most widely used statistical method in the computer age. What would have taken many minutes or hours (if attempted at all) in the early days of multivariate analysis is now done in seconds or less, even on microcomputers. Hence, we expect to see some discussion of multivariate regression in any modern multivariate analysis text. Chapter 8 of Anderson's text deals with the multivariate general linear model. The title of the chapter, unfortunately, exposes what the emphasis will be: "Testing the general linear hypothesis; MANOVA." Nevertheless, the treatment is thorough, providing more distributions, confidence regions and tests than in the first edition.

Oddly enough, however, Dillon and Goldstein devote two chapters of their text to multiple regression with a single criterion variable. This is a topic usually covered as part of a univariate analysis course, because only the criterion variable is considered random. But this reasoning only goes to further illustrate the distinction between the theoretical and methodological approaches to statistics. If the observation consists of (X₁, ..., X_p, Y), then why not treat it as multivariate? The authors reinforce this point by denoting the regression line E(Y | X). In addition to the mandatory tests of hypotheses, they also discuss model selection procedures, outliers, influence, leverage, multicollinearity (in some depth), weighted least squares and autocorrelation. Neither text, however, considers those additional topics in the case of multivariate regression. Gnanadesikan (1977) has some suggestions for how to deal with a few of them. As an alternative to the usual MANOVA treatment of the multivariate linear model, Dillon and Goldstein include a chapter on linear structural relations (LISREL), which I discuss in Section 10.
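To fix ideas, here is a minimal sketch (Python, illustrative names) of the multivariate general linear model computations that the MANOVA-style tests are built from: the estimated coefficient matrix and the residual sum of squares and cross-products matrix.

```python
import numpy as np

def multivariate_ols(X, Y):
    """Multivariate linear model Y = X B + E, a sketch.

    X is an (n, q) design matrix (include a column of ones for an intercept);
    Y is an (n, p) matrix of responses.  Returns B_hat and the residual SSCP matrix.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)    # (q, p) coefficient matrix
    resid = Y - X @ B_hat
    sscp_resid = resid.T @ resid                     # residual sum of squares and cross-products
    return B_hat, sscp_resid
```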

7. CANONICAL CORRELATIONS

A topic very closely related to multivariate regression, but usually developed separately, is canonical correlation analysis. Anderson develops it as an exploratory technique, being sure to add new material on tests of hypotheses. Dillon and Goldstein introduce the topic by saying (page 337), "The study of the


relationship between a set of predictor variables and a set of response measures is known as canonical correlation analysis." It seems clear that they intend this to at least replace any discussion of multivariate regression. What coverage of MANOVA they provide is a special topic under multiple discriminant analysis. Canonical correlation goes one step beyond multivariate regression, however. In regression analysis, the focus is on predicting the criterion variables Y from the independent variables X. Canonical correlation goes on to ask which linear functions of Y can be most effectively predicted by X. The canonical variables become those linear functions of Y together with their best linear predictors. Because the multivariate regression βX already gives the best linear predictor of Y, the X canonical variable corresponding to canonical variable aᵀY turns out to be aᵀβX times a normalizing constant.

The theory and methodology of canonical correlation, as described above, has been available for many years. Anderson takes the methodology further by showing how it applies to structural equation models and linear functional relationships. For those unfamiliar with these topics, the introduction of linear functional relationships in Section 12.6.5 will be a bit confusing. It begins, essentially, as follows (page 507): For example, the balanced one-way analysis of variance can be set up as

Y_{\alpha j} = \nu_\alpha + \mu + U_{\alpha j}, \qquad \alpha = 1, \ldots, m, \; j = 1, \ldots, l,

where

\sum_{\alpha=1}^{m} \nu_\alpha = 0 \quad \text{and}

\Theta \nu_\alpha = 0, \qquad \alpha = 1, \ldots, m,

where Θ is q × p₁ of rank q (≤ p₁). No mention is given in this discussion of where the matrix Θ comes from or what it means. The inference is that it specifies linear functional relationships, but these have not been part of any discussion of the one-way analysis of variance prior to this point in the text. The discussion of structural equation models and two-stage least squares in Section 12.7 is more coherent and illustrates the author's ingenuity. Although the limited information maximum likelihood estimator introduced there appears ad hoc, it does show that canonical correlation analysis is a bit more versatile than most textbooks give it credit for being. Dillon and Goldstein present a much more grandiose treatment of linear structural relations (LISREL), which I discuss in Section 10.
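Since neither book presents the computation in code, a minimal sketch of obtaining the canonical correlations from two centered data matrices is given below; whitening by symmetric inverse square roots is just one of several equivalent routes, and the names are illustrative.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two data matrices, a sketch.

    X is (n, p) and Y is (n, q); both are centered here.  The canonical
    correlations are the singular values of the whitened cross-covariance matrix.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y

    def inv_sqrt(S):
        # symmetric inverse square root via the eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(K, compute_uv=False)
```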

8. PRINCIPAL COMPONENTS

As mentioned earlier, Dillon and Goldstein begin where Andersonleaves off by discussing principal

components. Although both authors give only a brief treatment of this topic, their treatments differ dramatically. Anderson gives asymptotic distributions for the vectors and eigenvalues. He even adds some new discussion of efficient methods of computing the eigenstructure. Other new material includes confidence bounds for the characteristic roots and tests of various hypotheses about the roots. Dillon and Goldstein, in contrast, say next to nothing about how to calculate principal components, aside from the mathematical formulas. They give brief mention of one hypothesis test (lip service to the first principle of classical inference, no doubt). They describe the geometry of principal components in extensive detail, and they present a brief treatment of some ad hoc methods for choosing how many components to keep. The major difference between the two treatments, however, is that Dillon and Goldstein present principal components analysis as one part of a larger factor analysis rather than as a separate procedure.

An interesting alternative derivation and interpretation of principal components is suggested by results of O'Hagan (1984). Let R be the correlation matrix of a random vector X that has been standardized so that R is also the covariance matrix. In most treatments, the first principal component is that linear function of X that has the highest variance subject to the coefficient vector having norm 1. It also happens to be that linear function whose average squared correlation with each of the Xᵢ's is largest. That is, if rᵢ(c) = corr(cᵀX, Xᵢ), then the c which maximizes Σᵢ rᵢ²(c) is the first principal component. So the first principal component is that linear function of X that would best serve as a regressor variable if one wished to predict all coordinates of X from the same regressor. Suppose now that we regress X on the first principal component and calculate the residual covariance matrix. In the residual problem, the second principal component is that linear function of X that maximizes the weighted average of the squared correlations with the coordinates of X. The weights are the residual variances after regression on the first principal component. That is, the second principal component is the best regressor variable for predicting all of the residuals of the Xᵢ's after regression on the first principal component. The remaining principal components are generated in a similar fashion. The advantages of this approach over the more standard approaches are twofold. First, if one wishes to reduce dimensionality, the goal should be to be able to predict the whole data vector as well as possible from the reduced data vector. That this is achieved by principal components is not at all obvious from their derivation as linear functions with maximum variance. Second, there is no need to introduce the artificial constraints that the principal components have norm 1 and that they be uncorrelated or orthogonal. One can scale


them any way one wishes for uniqueness, and they are automatically uncorrelated because each one lies in the space of residuals from regression on the previous ones. Hence, the maximization problem one solves for each principal component is identical with all of the others except that the covariance matrix keeps changing. This approach is described in more detail by Schervish (1986).
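The interpretation above is easy to check numerically: for standardized X with correlation matrix R, corr(cᵀX, Xᵢ) = (Rc)ᵢ/√(cᵀRc), so the average squared correlation is cᵀR²c/(p·cᵀRc), and the leading eigenvector of R maximizes this ratio. The following sketch (illustrative names, not taken from Schervish, 1986) computes that eigenvector and the criterion value.

```python
import numpy as np

def first_pc_as_best_regressor(R):
    """First principal component viewed as the best single linear predictor, a sketch.

    Returns the leading eigenvector c of the correlation matrix R and the
    average squared correlation (c'R^2 c) / (p * c'Rc) that it maximizes.
    """
    R = np.asarray(R, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(R)
    c = eigvecs[:, -1]                              # eigenvector with the largest eigenvalue
    avg_sq_corr = (c @ R @ R @ c) / (R.shape[0] * (c @ R @ c))
    return c, avg_sq_corr
```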

9. FACTOR ANALYSIS

Factor analysis has been described both as a data reduction technique and as a data expansion technique. The basic goal is to find a small number of underlying factors such that the observed variables are all linear combinations of these few factors plus small amounts of independent noise. Because the factors are not observable variables, it turns out that there is a great deal of indeterminacy in any particular factor solution. That is, given a particular solution, there are many alternative solutions that produce the very same estimated covariance structure for the observed variables, but with different factors. Some arbitrary restrictions must be placed on the solution in order to obtain a unique answer. Chapter 14 of Anderson's second edition is all new and contains a good exposition of the maximum likelihood approach to factor analysis. This is the only approach in which statistical theory has played an important role. It includes a particular arbitrary restriction that allows calculation of a unique solution.

9.1 Exploratory Factor Analysis

There are traditionally two modes in which one can perform factor analysis. First, there is exploratory factor analysis. In this mode, one is trying to determine both how many (if any) factors there are and what they mean, if there are any. Once one has fit a model with a specific number of factors, one can rotate the factors through all of the equivalent solutions by using any of several exotically named techniques. With the maximum likelihood approach, one can also test the hypothesis that there are only m common factors, where m is smaller than the dimension of the observation vectors. If the test rejects the hypothesis, one is free to add more factors until the result is insignificant. This practice is deplorable in the usual hypothesis testing framework, although I am sure that some unfortunate person somewhere is currently trying to solve the problem of determining the level of this procedure, or sequences of critical values to guarantee a specified level. Because it is never conclusively decidable how many factors there are in a given application, it would be worthwhile to have a model that would allow calculation of the probability distribution of the number of factors. This would require subjective information about the factor structure.

Consider the example analyzed in Section 3.4 of Dillon and Goldstein by both the principal factor method and maximum likelihood. The example concerns ten political and economic variables measured on 14 countries. Dillon and Goldstein present a principal factor solution with four factors and a maximum likelihood solution with three factors. The fourth principal factor contributes almost as much to the solution as does the third. But Dillon and Goldstein claim that the likelihood ratio test of the three-factor model (using the maximum likelihood method) produces a χ² value of 20.36 with 18 degrees of freedom, and accepts the model at any commonly used α level. They do not report the result of a test of the two-factor model, and they claim that the fitting of a four-factor model failed to converge. I used BMDP4M (cf. Dixon, 1985) to fit the two-, three- and four-factor models so that I could compare them. Unfortunately, I was unable to reproduce Dillon and Goldstein's results. The two-, three- and four-factor models converged in 7, 17 and 8 iterations, respectively. The χ² values were 50.475, 38.400 and 19.857 for two, three, and four factors, respectively, with 26, 18 and 11 degrees of freedom. (Note that BMDP4M does not calculate the χ² value, so I had to work with the output, which was rounded to three digits. Hence, some rounding error has been introduced into my calculation. I used both the raw data and the correlation matrix and got similar results.) The results of the three-factor fit with a varimax rotation are given in Table 1. The results of the four-factor fit with a varimax rotation are given in Table 2.

The point of this example is to illustrate the difficulty one has in determining the number of factors. The hypothesis test is not conclusive (regardless of whether Dillon and Goldstein's or my calculations are correct). The fourth factor in Table 2 is certainly not easy to interpret, but does that mean that we should believe there are only three factors? The fourth factor contributes 84% as much variance as does the third factor. One has to look carefully at the meanings of

TABLE 1
Maximum likelihood solution with 3 factors and varimax rotation

                 Factor
Variable      1        2        3
1         0.846    0.298    0.338
2         0.870    0.471    0.145
3         0.769    0.010   -0.095
4         0.442    0.141    0.658
5        -0.102    0.929    0.356
6         0.510   -0.375    0.224
7         0.237    0.754    0.192
8         0.814   -0.076    0.241
9         0.341   -0.254   -0.034
10       -0.038    0.288    0.823

TABLE 2
Maximum likelihood solution with 4 factors and varimax rotation

                 Factor
Variable      1        2        3        4
1         0.588    0.570    0.352    0.444
2         0.714    0.606    0.177    0.219
3         0.993   -0.100   -0.007    0.055
4         0.275    0.302    0.572    0.204
5         0.005    0.750    0.360   -0.555
6         0.212   -0.105    0.169    0.547
7         0.028    0.929    0.112   -0.014
8         0.653    0.181    0.212    0.510
9         0.042    0.028   -0.159    0.444
10        0.013    0.140    0.970   -0.200

the variables and try to imagine what, if anything, could contribute to the variables in the proportions given by each of the columns. If this is not possible, rotate the factors and try again. When done, one may have a deeper understanding of the data set or even have developed a new theory for explaining the data. One does not (in this case at least) have a conclusion as to how many factors there are. I am beginning to understand why Anderson did not include any numerical examples of factor analysis in his second edition.
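For reference, the likelihood ratio statistic and its degrees of freedom can be computed from any maximum likelihood factor solution. The sketch below uses the usual Bartlett correction factor, which may differ slightly from whatever BMDP4M uses internally, and the function and argument names are illustrative; with p = 10 it reproduces the 26, 18 and 11 degrees of freedom quoted above for m = 2, 3 and 4.

```python
import numpy as np

def factor_lr_test(R, loadings, uniquenesses, n):
    """Likelihood ratio test of an m-factor model, a sketch.

    R : (p, p) sample correlation matrix; loadings : (p, m) ML loadings;
    uniquenesses : (p,) specific variances; n : sample size.
    Returns the (Bartlett-corrected) chi-square statistic and its
    degrees of freedom, ((p - m)^2 - p - m) / 2.
    """
    p, m = loadings.shape
    Sigma_hat = loadings @ loadings.T + np.diag(uniquenesses)
    fit = (np.log(np.linalg.det(Sigma_hat)) - np.log(np.linalg.det(R))
           + np.trace(R @ np.linalg.inv(Sigma_hat)) - p)
    chi2 = (n - 1 - (2 * p + 4 * m + 5) / 6.0) * fit
    df = ((p - m) ** 2 - p - m) / 2.0
    return chi2, df
```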

9.2 Confirmatory Factor Analysis

In the second mode of operation, namely confirmatory factor analysis, one hypothesizes a factor structure of a particular sort and then uses the data to find the best fitting model satisfying the hypothesized structure. The specified structure may be extremely specific (going so far as to specify all of the factor loadings) or less specific, such as only saying that some loadings are required to be zero. In general, confirmatory analysis does not permit arbitrary rotations of the factors, because the specified structure might be destroyed by the rotation. After fitting the model, one is compelled to test the hypothesis that the model fits, presumably by using the likelihood ratio test. Dillon and Goldstein present an example of this procedure in Section 3.8.5. The example concerns eleven variables on n = 840 subjects and three hypothesized factors with certain specified loadings equal to zero. They calculate the likelihood ratio χ² statistic as 50.99 with 35 df (p = 0.0395) and claim (page 104), "The fit of this model is not satisfactory." First of all, a χ² value so close to the degrees of freedom with n = 840 is not bad if the hypothesized model has any a priori credibility. Aside from this often neglected point, one must ask, "Then what?" Dillon and Goldstein fit a second model with comparable results and conclude (page 106) "that the data do not confirm the a priori assumptions about their structure." I suggest that this is more a failure of the hypothesis testing mentality

than of the hypothesized model. I took the same data and used BMDP4M to find the unrestricted maximum likelihood solution with three factors and a varimax rotation. The χ² statistic was 2p.99 with 25 df (I refuse to look up the p-value). This is presumably a pretty good fit. The solution bears a good deal of resemblance to the hypothesized solution and only has high loadings in two of the thirteen places hypothesized to be zero. This is not to say that the hypothesis should be accepted, but rather that one should not (just) calculate the p-value and ignore how close the data really are to the hypothesis.

9.3 Interpretation

As an exploratory technique, factor analysis is as good as the insights its users gain from using it. As an inferential technique, however, it suffers from a lack of predictive validity. One cannot observe factor scores and then predict observables. However, there is no arguing the fact that the statement of the factor analysis problem is very appealing intuitively. Large sets of moderately correlated variables probably have some common structure, the discovery of which might shed considerable light on the process generating the variables. What seems so mystifying about factor analysis is how that discovery occurs. After forming a factor solution, one is still left with the question of whether the original variables are linear combinations of the factors or if the factors are just linear combinations of the original variables. Certainly the estimated factor scores are just linear combinations of the original variables. If these later prove useful in some as yet unspecified problem, it may still be the original variables and not the hypothesized factors that are doing the work. Put more simply, the way the common factor model is implemented, it is as if the user is regressing the original variables on each other to find a few best linear predictors. This is essentially what principal components analysis does, and that is why the two methods are often used for similar purposes. This discussion is not intended to discourage or denigrate work in the area of factor analysis, but rather to encourage those who feel that the common factor model has something to offer to develop experiments in which the use of that model can be distinguished from regression.

10. PATH ANALYSIS AND LISREL

The path analysis and LISREL models are generally not well known to mathematical statisticians, because they are most commonly discussed in writings by and for psychometricians. In this section, I present a very cursory overview of the ideas underlying these models and some examples of how they can be used and misused.

10.1 Path Analysis



When dealing with a large collection of variables, it is very useful to sort out which of them one would like to be able to predict from which other ones. The same variables may play the role of predictor in one situation and criterion in another. The power of multivariate analysis is its ability to treat joint distributions, not just conditional ones like traditional regression analysis. Hence, the initial stages of a path analysis can be quite useful. A diagram illustrating which variables one thinks influence which others, and which ought to be correlated with each other, can help one to organize the analysis more sensibly. (See Darroch, Lauritzen and Speed, 1980, for an introduction to general graphical models. Also, see Howard and Matheson (1981) and Shachter (1986) for descriptions of how influence diagrams can be used to model subjective probabilistic dependence between variables. Spiegelhalter (1986) and Lauritzen and Spiegelhalter (1987) show how such diagrams can be useful in expert systems.)

What I would object to in the practice of path analysis are the attempts to interpret the coefficients placed along the path arrows. Take the following trivial example in which two correlated exogenous variables X₁, X₂ are thought to influence the endogenous variable Y. The residual of Y is e_Y. The notation is borrowed from Dillon and Goldstein (Chapter 12). Figure 1 is a typical path diagram. The single-headed arrows denote effect or causation, whereas the double-headed arrows denote correlation. Suppose all three variables have variance one and intercorrelations of 0.9. Without going into details, the path coefficients would be as follows:

p_{YX_1} = .4737, \qquad p_{YX_2} = .4737, \qquad p_{e_Y} = .2768.

One would be led, by the path analysis methodology, to interpret p_{YX_1} = .4737 as the direct effect of X₁ on Y. The remainder of the correlation between Y and X₁ is .9 − .4737 = p_{YX_2} r_{X_1X_2} = .4263 and is attributed to "unanalyzed effects." (If X₂ had not been in the picture, p_{YX_1} would equal .9 and there would be no unanalyzed effects.)

FIG. 1. A path diagram.

Suppose that we know that X₂ = X₁ + Z, and we set X₃ = √5 Z (the standardized version of Z). Then r_{X_1X_3} = −.2236. Replacing X₂ by X₃ in the path analysis leads to the following path coefficients:

r_{X_1X_3} = -.2236, \qquad p_{YX_1} = .9474, \qquad p_{YX_3} = .2120, \qquad p_{e_Y} = .2768.

Now the direct effect of X₁ is .9474 and the unanalyzed effect is −.0474. For a simple path diagram like Figure 1, such an ambiguous definition of "direct effect of X₁ on Y" is easy to understand. But in more complicated analyses, such ambiguity will affect the direct and indirect effects of X₁ on variables in other parts of the diagram, making any interpretation tenuous at best.

Of course, the ambiguity of regression coefficients is not news to most readers. For this reason, it is surprising that Dillon and Goldstein do not mention multicollinearity as one of the potential drawbacks to such models. Statisticians constantly tell their students to be careful not to interpret a regression coefficient as measuring the effect of one variable on another when the data arise from an observational study. It is not even the effect of one variable on the other ceteris paribus. In the example above, it would be impossible to vary X₁ while keeping X₂ and X₃ fixed. The only safe interpretation of a regression coefficient is simply as the number you multiply by the independent variable Xᵢ in a specific regression model to try to predict the dependent variable Y, assuming that Y and the Xᵢ's all arise in a fashion similar to the way they arose in the original data set. When the variables all arise in a designed experiment, in which each Xᵢ is fixed at each of several values and the other Xᵢ are chosen equal to one of their several values, then the interpretation is clearer due to the way the data arose. If one now fixes all of the Xᵢ but one, the coefficient of the remaining variable does measure how much we expect the response to change for one unit of change in that variable (assuming the change occurs in a manner consistent with how the variable changed in the experiment). If, on the other hand, one merely observes the Xᵢ for a new observation and then wishes to predict Y, based on the results of a designed experiment, one has the problem of assuming that the conditions of the experiment were sufficiently similar to those under which the new observation is generated. This is closely related to Rubin's (1978) notion of ignorable treatment assignments. The basic question to be answered is, "What effect, if any, does a deliberate intervention to affect the exogenous variables have on the relationship between the endogenous and exogenous variables?" This question can only be addressed by people with significant subject matter knowledge.
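The coefficients above are just standardized regression coefficients computed from a correlation matrix, so the ambiguity is easy to reproduce. The sketch below (illustrative names) solves b = R_XX⁻¹ r_XY for both versions of the example and also returns the residual variance of the standardized Y.

```python
import numpy as np

def path_coefficients(Rxx, r_xy):
    """Standardized path (regression) coefficients from a correlation matrix, a sketch."""
    b = np.linalg.solve(Rxx, r_xy)
    resid_var = 1.0 - b @ r_xy          # residual variance of the standardized Y
    return b, resid_var

# X1 and X2 with intercorrelation .9, both correlated .9 with Y:
print(path_coefficients(np.array([[1.0, 0.9], [0.9, 1.0]]), np.array([0.9, 0.9])))
# X1 and X3 = sqrt(5) * Z, so r_{X1 X3} = -.2236 and r_{Y X3} = 0:
print(path_coefficients(np.array([[1.0, -0.2236], [-0.2236, 1.0]]), np.array([0.9, 0.0])))
```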


10.2 Linear Structural Relations

A more general method for analyzing path diagrams is the LISREL model for linear structural relations. This model is quite general and allows the fitting of


hybrids of factor analysis and general linear models. Its generality also makes it very easy to misuse, however. In Section 12.5.3, Dillon and Goldstein consider an example borrowed from Bagozzi (1980). The goal of the example was (Bagozzi, 1980, page 65) "... to discover the true relationship between performance and satisfaction in an industrial sales force." More specifically (same page) "... four possibilities exist: (1) satisfaction causes performance, (2) performance causes satisfaction, (3) the two variables are related reciprocally, or (4) the variables are not causally related at all and any empirical association must be a spurious one due to common antecedents." The linear structural relations are stated in terms of latent variables ξ₁ = achievement motivation, ξ₂ = task specific self esteem, ξ₃ = verbal intelligence, η₁ = performance, η₂ = job satisfaction. The exogenous latent variables ξᵢ are introduced as possible "common antecedents." Based on the above statement of goals, one would now expect to see models in which η₁ and η₂ were causally related to each other, along with models in which they were causally unrelated but in which causal effects existed from the ξᵢ to the ηᵢ. The initial model of Bagozzi (1980) is described by the equation

(2)

where the ζᵢ are disturbance terms and the matrix multiplying the η's is assumed nonsingular. This equation is the algebraic representation of the path diagram in Figure 2. Figure 2 is the portion of the path diagram that concerns the latent variables only. The observed variables can be appended with more arrows to make a much more impressive diagram. The paths in Figure 2 with coefficients β₁ and β₂ represent reciprocal causation between η₁ and η₂. The covariance matrices of the ξ and η vectors are, respectively,

\begin{pmatrix} \phi_{11} & & \\ \phi_{21} & \phi_{22} & \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix} \qquad \text{and} \qquad \begin{pmatrix} \psi_{11} & \\ \psi_{21} & \psi_{22} \end{pmatrix}.

FIG. 2. Bagozzi (1980) initial model.


Bagozzi deletes those paths with coefficients β₁ and ψ₂₁ because the estimates are not significant at level .05, and arrives at his final model. It has a likelihood ratio χ² of 15.4 with 15 degrees of freedom and is depicted in Figure 3. Because the β₁ coefficient is estimated to be zero (more precisely, because the hypothesis that β₁ = 0 is not rejected), Bagozzi claims (page 71), "Perhaps the most striking finding is that job satisfaction does not necessarily lead to better performance." He then goes on to offer advice to management based on this finding, such as (page 71) ". . . resources should be devoted to enhancement of job satisfaction only if this is valued as an end in and of itself . . ." Bagozzi appears to have fallen into a common trap described by Pratt and Schlaifer (1984, page 14) (but presumably known in 1980):

Exclusion of a regressor because it contributes little to R² or because its estimated coefficient is not statistically significant may make sense when one wants to predict y given a naturally occurring x, but not when one wants to know how two or more x's affect y. Here it implies that if the data provide very little information about the separate effects of two factors, it is better to attribute almost all of their joint effect to one and none to the other than to acknowledge the unavoidable uncertainty about their separate effects.

As an example of how to fit a specified LISREL model, the Bagozzi example is excellent in that it illustrates several features of the model and allows comparison of the initial and final models. As an example of how causal analysis should be done, however, I find this example disappointing. First of all, it was an expressed goal of the project to see if common antecedents can explain the association between performance and satisfaction. No causal models involving only paths from common antecedents were described

FIG. 3. Bagozzi final model.

Some hypothesis tests on the partial correlation between performance and satisfaction given some other variables were performed, but the other variables did not include all three of the ξ variables. In fact, there are models involving no causal arrows between performance and satisfaction which are equivalent (not just similar) to the model in (2). It is well known that, in many cases, several causal models are equivalent in the sense that the parameters are one-to-one functions of each other. As an example, the following model is equivalent to (2):

\[
\begin{pmatrix} \eta_1 \\ \eta_2 \end{pmatrix}
=
\begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & c\,a_2 & c\,a_3 \end{pmatrix}
\begin{pmatrix} \xi_1 \\ \xi_2 \\ \xi_3 \end{pmatrix}
+
\begin{pmatrix} \zeta_1^* \\ \zeta_2^* \end{pmatrix}
\tag{3}
\]

where the ζ*i are disturbance terms with covariance matrix Ψ*, with elements ψ*ij.

The model of (3) is not linear in the parameters; hence, it cannot be fit with the computer program LISREL IV of Jöreskog and Sörbom (1978), nor can it be fit with the EQS program of Bentler (1985). However, it can be fit via straightforward maximum likelihood. The equations relating the two models are

\[
\beta_1 = a_1/a_4, \qquad \beta_2 = c,
\]
\[
\gamma_{12} = a_2(1 - c\,a_1/a_4), \qquad \gamma_{13} = a_3(1 - c\,a_1/a_4),
\]
\[
\begin{pmatrix} \psi_{11} & \\ \psi_{21} & \psi_{22} \end{pmatrix}
=
\begin{pmatrix} 1 & -\beta_1 \\ -\beta_2 & 1 \end{pmatrix}
\begin{pmatrix} \psi_{11}^* & \\ \psi_{21}^* & \psi_{22}^* \end{pmatrix}
\begin{pmatrix} 1 & -\beta_2 \\ -\beta_1 & 1 \end{pmatrix},
\]

with some restrictions on the parameters. The model (3) corresponds to the path diagram in Figure 4. Notice that there are no paths between η1 and η2, although there are extra paths from the ξi to the ηj. In this model, the η variables are not causally related, but are both affected by the three common antecedents. One could just as easily start with a model of this sort and delete paths until one had a model that made sense and fit acceptably. The final model would lead to different conclusions from the model that Bagozzi arrived at, and one would be hard pressed to distinguish them based on the data. As an example, I replaced the coefficients c·a2 and c·a3 in Figure 4 with a5 and a6, respectively, so that I could use the program EQS of Bentler (1985). The model had a likelihood ratio χ² of 9.3 with 12 degrees of freedom. To fit a model more like the final model of Bagozzi, I set ψ*21 = 0 and a4 = 0 and got a likelihood ratio χ² = 14.2 with 14 degrees of freedom. If I set a6 = 0, I get χ² = 16 with 15 degrees of freedom. This last model is depicted in Figure 5. All of these models (the ones depicted in Figures 2 to 5) fit the data comparably, with an average absolute difference between the observed and fitted correlations of about 11% of the average absolute correlation.
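As a minimal numerical check of the parameter mapping displayed above (not part of the original analysis; the parameter values below are arbitrary illustrations, not estimates from the Bagozzi data), one can verify in a few lines of Python that models (2) and (3) imply the same reduced form and the same disturbance covariance:

import numpy as np

# Arbitrary illustrative values for the parameters of model (3).
a1, a2, a3, a4, c = 0.4, 0.7, 0.3, 0.9, 0.5
psi_star = np.array([[1.0, 0.2],
                     [0.2, 1.5]])      # covariance of the disturbances in (3)

# Map them to the parameters of model (2) using the relations above.
b1, b2 = a1 / a4, c
B = np.array([[1.0, -b1],
              [-b2, 1.0]])             # matrix multiplying the eta's in (2)
gamma = np.array([[0.0, a2 * (1 - c * a1 / a4), a3 * (1 - c * a1 / a4)],
                  [a4 - c * a1, 0.0, 0.0]])
psi = B @ psi_star @ B.T               # covariance of the disturbances in (2)

# Model (2) has reduced form eta = B^{-1} gamma xi + B^{-1} zeta, so the two
# models agree if B^{-1} gamma equals the coefficient matrix of (3) and
# B^{-1} psi B^{-T} equals psi_star.
A = np.array([[a1, a2, a3],
              [a4, c * a2, c * a3]])
Binv = np.linalg.inv(B)
print(np.allclose(Binv @ gamma, A))                 # True
print(np.allclose(Binv @ psi @ Binv.T, psi_star))   # True

Wherever the matrix multiplying the η's is nonsingular, the two parameterizations are one-to-one functions of each other, which is exactly the sense in which the data cannot distinguish the two models.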


FIG. 4. Model equivalent to Bagozzi initial model.


FIG. 5. Final model with no causation between η's.

The model of Figure 5 is not equivalent to that of Figure 3, due to the deleted paths, and the causal conclusions that would be drawn from the two models would be different. Because I do not claim to be an expert in management science, I will not begin to offer advice to managers. Nor will I recommend the model of Figure 5 over that of Figure 3. (In fact, I would recommend that neither model be used for causal inference, but rather only for prediction, as suggested by Pratt and Schlaifer.) But I would offer advice to users of structural equation models: don't start drawing conclusions from your models until you have spent more time looking at alternative but nearly equivalent models that have different causal links. (See Glymour, Scheines, Spirtes and Kelly, 1987, for a description of one way to examine alternative causal models.)

10.3 Interpretations

The issue of how to detect causation is a difficult one. Philosophers have been arguing about it for centuries, and I do not propose to settle it here. Holland (1986) describes a precise but narrow view of how to define and detect causation. Pratt and Schlaifer (1984) offer a different account of causation in statistical models. The discussions of these papers suggest that we are no closer to understanding causation than were Aristotle and Hume. Fortunately, the sensible practice of statistical techniques does not require that one even pretend to have an understanding of causation. It is in the various subject matter disciplines in which statistics is used that researchers can attempt to model and understand causation. Take Bagozzi's model for example. It may or may not be reasonable within the various theories of management science to model a causal relationship between the various latent constructs described in the example. The statistical methods merely give you ways to quantify your uncertainty about those relationships, given that you believe a particular model for the generation of the data. It is the beliefs about those relationships, whether stated explicitly or implied by the form of the model, that express the causal relations. Two different researchers who believed strongly in two different, but predictively equivalent, causal models for the data could collect data for an eternity and never be able to distinguish the two models based on the data. Only by arguing from subject matter considerations (or designing different experiments) would they be able to conclude that one model is better supported than the other. Perhaps the Bagozzi example is an isolated instance, but I would remind the reader that Bentler (1985) also presents it as an example of the use of EQS. If this example is being singled out as exemplary or prototypical, then those who teach the use of LISREL models to their students ought to look for some better examples.

The most important thing which Dillon and Goldstein have to say about the use of the LISREL model is contained in a paragraph at the end of Chapter 12 entitled "Indeterminacy":

If the analysis is data driven and not grounded in strong a priori theoretical notions, it is always possible to find an acceptable χ² fit, and it is always possible to find several models that fit the data equally well. Thus, in the absence of theoretical knowledge, covariance structure analysis becomes a limitless exercise in data snooping, contributing little, if anything, to scientific progress. It is a simple fact that exploratory analysis is better performed by other methods that impose fewer restrictive assumptions [e.g., principal components analysis (Chapter 2)].

It is possible, of course, to make use of structural equation models without getting hog-tied by the ambiguity of causal interpretations. By making only predictive inferences, one gives up the compulsion to draw causal inferences from observational data and concentrates on simply modeling the joint distributions of the unknown quantities. For example, if I were to learn the value of η2, "job satisfaction," for a salesperson selected from a population like that in this study, then what would be the (conditional) distribution of η1, "performance"? Which one "causes" the other is not an issue. In fact, Lauritzen and Spiegelhalter (1987) drop the directional arrows from the paths in their graphical models to further emphasize that inference is a two-way street. One can condition on whatever variables become known and make inference about the others. On the other hand, if I need to make some policy decisions as to whether to try to increase job satisfaction or something else in the hopes of affecting performance, I must raise the question of whether the associations of the variables measured in the observational study remain the same when I intervene with new policies. This is a subject matter question that mere statistics alone cannot address (at least not without a different data set). Such issues do not invalidate the use of structural equation models, but rather, they make it clear that it is irresponsible to teach causal modeling without preparing the students to make the appropriate subject matter judgments.
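To make the point concrete, here is a small sketch (with a made-up joint distribution; the numbers are purely illustrative and are not estimates from any model discussed above) showing that a bivariate normal model for (η1, η2) can be conditioned in either direction without any causal commitment:

import numpy as np

# Hypothetical joint distribution for (eta1, eta2) = (performance,
# job satisfaction); the means and covariance are illustrative only.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])

def conditional(known, value):
    """Conditional mean and variance of the other coordinate of a
    bivariate normal, given that coordinate `known` equals `value`."""
    other = 1 - known
    slope = Sigma[other, known] / Sigma[known, known]
    mean = mu[other] + slope * (value - mu[known])
    var = Sigma[other, other] - slope * Sigma[other, known]
    return mean, var

print(conditional(known=1, value=1.0))   # performance given satisfaction
print(conditional(known=0, value=1.0))   # satisfaction given performance

Either conditional distribution is legitimate for prediction; neither settles which variable "causes" the other, and neither says what would happen under an intervention.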

11. TESTING HYPOTHESES

As mentioned earlier, a great deal of the theoretical research performed in multivariate analysis since 1958 has been in the area of hypothesis testing. Hence, it is not surprising that Chapters 8, 9 and 10 of Anderson's book have been substantially rewritten. These chapters consider testing everything under the sun. Discussion of more invariant tests has been added, where just the likelihood ratio tests were discussed before. Distributions of the test statistics have been developed in the intervening years and these are given for all of the tests considered. New results on admissibility of tests and properties of power functions have been included. There is also an expanded treatment of confidence regions. A remark from the Preface of the first edition seems to have been adopted as a battle cry by an entire generation of multivariate researchers: "In many situations, however, the theory of desirable or optimum procedures is lacking." Unfortunately, the emphasis has been on the procedures and not on the desirability and/or optimality of them. The result is that the likelihood ratio criterion has been augmented by a battery of uniformly most wonderful invariant tests and confidence regions. One possible explanation for the plethora of invariant multivariate tests, despite their dubious inferential relevance, is the fact that the distributions of the test statistics depend only on the small dimensional maximal invariant, and are therefore easier to derive mathematically. Power function calculations are largely ignored, even when they are available, because the maximal invariant is generally not the parameter of interest to the researcher who collected the data. Ease of derivation is also a reason why so much of the Bayesian methodology in multivariate analysis relies on conjugate priors. This situation is reminiscent of the following story of a man who lost his room key:

A man lost his room key one night and began searching for it under a street lamp. A police officer happened by and began to help him look.
Officer: What are you looking for?
Man: My room key. I heard it drop from my key chain.
Officer: Where were you standing when you heard it drop?
Man: About half-way up the next block.
Officer: Then why are you looking for it here?
Man: Because the light is better under the street lamp.

In multivariate analysis (if not in the entire field of statistics), we have taken to solving problems because we can solve them and not because somebody needs the solution. If a problem is hard to solve, it makes more sense to try to approximate a solution to the problem than to make up and solve a problem whose solution nobody wants. The theory of invariant tests is elegant mathematically, but it does not begin to address the questions of interest to researchers, such as "How much better or worse will my predictions be if I use model B instead of model A?" or "To what extent has the treatment improved the response and how certain can I be of my conclusion?" This point about the relevance of the maximal invariant parameter was raised in Section 2.1 with regard to Hotelling's T². As Lehmann (1959, page 275) puts it:

When applying the principle of invariance, it is important to make sure that the underlying symmetry assumptions really are satisfied. In the problem of testing the equality of a number of normal means μ1, ..., μs, for example, all parameter points, which have the same value of ψ² = Σ nᵢ(μᵢ − μ·)²/σ², are identified under the principle of invariance. This is appropriate only when these alternatives can be considered as being equidistant from the hypothesis. In particular, it should then be immaterial whether the given value of ψ² is built up by a number of small contributions or a single large one. Situations where instead the main emphasis is on the detection of large individual deviations do not possess the required symmetry, ...
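Lehmann's point is easy to see numerically: the power of the usual invariant test in his example (the one-way ANOVA F test) depends on the means only through ψ², so one large deviation and several small ones are treated as exactly equally distant from the null. A short sketch (the group means below are invented so that both configurations give the same ψ²):

import numpy as np
from scipy import stats

def anova_power(means, n_per_group, sigma=1.0, alpha=0.05):
    """Noncentrality and power of the one-way ANOVA F test; the power
    depends on the means only through sum_i n_i (mu_i - mu.)^2 / sigma^2."""
    means = np.asarray(means, dtype=float)
    k, n = len(means), n_per_group
    grand = means.mean()
    ncp = n * np.sum((means - grand) ** 2) / sigma**2
    dfn, dfd = k - 1, k * n - k
    crit = stats.f.ppf(1 - alpha, dfn, dfd)
    return ncp, 1 - stats.ncf.cdf(crit, dfn, dfd, ncp)

d = np.sqrt(0.05)
print(anova_power([0.5, 0.0, 0.0, 0.0, 0.0], n_per_group=20))   # one large deviation
print(anova_power([d, d, -d, -d, 0.0], n_per_group=20))         # several small ones

Both calls report the same noncentrality (4.0, up to rounding) and therefore the same power, even though a researcher looking for one large deviation would presumably not regard the two alternatives as equally interesting.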

The justification for the use of invariant procedures has always been mystifying. Anderson (page 322) gives the only legitimate reason of which I am aware for using invariant procedures: "We shall use the principle of invariance to reduce the set of tests to be considered."

Perhaps in the next 26 years, those who feel compelled to develop tests for null hypotheses will at least enlarge their horizons and consider tests whose power functions depend on more general parameters that might be of interest in specific applications. Implicit also is the hope that the derivation of the power function will be treated as equal in importance to the derivation of the test. But it will take more than a new battery of variant (opposite of invariant?) tests to get the focus of multivariate analysis straight. The entire hypothesis testing mentality needs to be reassessed. The level α mindset has caused people to lose sight of what they are actually testing. The following example is taken from one of the few numerical problems worked out in Anderson's text (page 341) and is attributed to Barnard (1935) and Bartlett (1947). It concerns p = 4 measurements taken on a total of N = 398 skulls from q = 4 different periods. The hypothesis is that the mean vectors μ(i) for the four different periods are the same. Anderson uses the likelihood ratio criterion −k log U_{p, q−1, n}, where n = N − q and k = n − ½(p − q + 2), and writes (page 342):

Since n is very large, we may assume −k log U_{4, 3, 394} is distributed as χ² with 12 degrees of freedom (when the null hypothesis is true). Here −k log U = 77.30. Since the 1% point of the χ² distribution with 12 degrees of freedom is 26.2, the hypothesis of μ(1) = μ(2) = μ(3) = μ(4) is rejected.
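The quoted figures are easy to verify numerically (a quick sketch, assuming only the statistic and degrees of freedom quoted above):

from scipy import stats

stat, df = 77.30, 12                 # -k log U and its asymptotic chi-square df
print(stats.chi2.ppf(0.99, df))      # about 26.2, the 1% point quoted above
print(stats.chi2.sf(stat, df))       # p-value of the observed statistic: vanishingly small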

The corresponding coordinates of the sample mean vectors do not differ very much compared to the sample standard deviations. If we were to consider the problem of sampling a new observation and classifying it into one of the four populations, we could calculate the correct classification rates for the four populations (assuming a uniform prior over the four populations). By using the asymptotic expansions of Schervish (1981a), we get the results in Table 3. The reason these numbers are so small (we could get 0.25 by just guessing), despite the low p-value for the hypothesis, is that the mean vectors are actually quite close. The square roots of the estimated Mahalanobis distances between the pairs of populations, (ȳ(i) − ȳ(j))ᵀ Σ̂⁻¹ (ȳ(i) − ȳ(j)), are given in Table 4. Population 4 does seem to be uniformly separated from the others, accounting for it having the largest correct classification rate. Even so, it is no more than one estimated standard deviation (in the observation scale) from any of the other three populations.

TABLE 3
Estimated correct classification rates

Population    Estimated rate
1             0.41
2             0.32
3             0.30
4             0.54


TABLE 4
Pairwise estimated distances

                    Population
Population      1         2         3
2             0.653
3             0.632     0.407
4             0.986     0.946     0.923

A one standard deviation difference between two populations allows a correct classification rate of 0.69, compared to the 0.5 you would get by mere guessing. On the average, the correct classification rates are not much larger than what one could obtain by guessing, except for population 4. Simply rejecting the hypothesis does not tell the story of how little the mean vectors differ. The low p-value is due as much to the large sample size as it is to the differences between the mean vectors.
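The 0.69 figure comes from the standard fact that, for two equally likely normal populations with a common covariance matrix and Mahalanobis distance Δ between their means, the optimal rule classifies correctly with probability Φ(Δ/2). A short check, also applied to the pairwise distances of Table 4:

from scipy.stats import norm

def two_group_rate(delta):
    """Correct classification probability for two equally likely normal
    populations with common covariance and Mahalanobis distance delta."""
    return norm.cdf(delta / 2)

print(round(two_group_rate(1.0), 2))                  # 0.69, as in the text
for delta in (0.653, 0.632, 0.407, 0.986, 0.946, 0.923):
    print(delta, round(two_group_rate(delta), 2))     # between about 0.58 and 0.69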

12. DISCRETE MULTIVARIATE ANALYSIS

Some people do not consider categorical data analysis as "multivariate." Bishop, Fienberg and Holland (1975) are notable exceptions. Anderson does not say a word about it. Nor does he even acknowledge it as a multivariate topic which he will not cover. On the other hand, Dillon and Goldstein devote two chapters to discrete multivariate analysis. These two chapters, however, are as distant in approach as they are in location in the book (Chapters 8 and 13). The earlier chapter discusses classical methods like χ² tests and log-linear models. The later chapter describes an approach more familiar to psychometricians, namely latent structure analysis. Latent structure analysis attempts to construct an additional discrete unobserved variable X, whose values are called latent classes, to go with the observed categorical variables Yi. The Yi, in turn, are modeled as conditionally independent given X. This sounds a lot like the construction of factors in factor analysis. In fact, latent class modeling is actually quite a bit like discrete factor analysis. In particular, it shares some of, but not all of, the identifiability problems of factor analysis.

Take the first example of Chapter 13, given in Table 13.1-1 on page 492 of Dillon and Goldstein. It is a hypothetical two-way table exhibiting significant dependence between rows and columns. Below the table are two subtables corresponding to levels of a third (unobserved) variable (in this case education). In each of the subtables, the two observed variables are independent. This is an example of conditional independence of two categorical variables given a third latent variable. Because the actual tables given by Dillon and Goldstein have errors in them (for example, the subtables do not add up to the aggregate table and one of the subtables does not have the rows and columns independent), I have revised the data as little as possible to make them correspond to the description above. The data are in Table 5. I have converted the subtables to probabilities, so as to avoid the embarrassment of fractional persons. The subtables give the conditional probabilities given the corresponding level of the latent variable. The probability in the lower right corner of each subtable is the marginal probability of that latent class.

TABLE 5
Hypothetical two-way tables

                                 Regularly Read Daily News
Regularly Read Times                Yes        No     Total

Aggregate table
  Yes                               116       244       360
  No                                524       116       640
  Total                             640       360      1000

Latent class 1 (high education)
  Yes                             .2311     .5689     .8000
  No                              .0578     .1422     .2000
  Total                           .2889     .7111     .3714

Latent class 2 (low education)
  Yes                             .0812     .0188     .1000
  No                              .7308     .1692     .9000
  Total                           .8120     .1880     .6286

The strange feature of this example is that it would be impossible to use the latent class modeling methodology to arrive at the solution given in Table 5 without placing arbitrary restrictions on the parameters of the solution. The reason is that a latent class model with two latent classes is nonidentifiable in a 2 × K table. Such a model would require 4K − 3 parameters to be estimated, whereas there are only 2K − 1 degrees of freedom in the table. The nonidentifiability in this example is disguised by the fact that the latent classes have been named "Low Education" and "High Education" and corresponded to actually observable variables. Had they been unspecified, as in most problems in which latent class modeling is applied, the user would have had a two-dimensional space of possible latent classes from which to choose. With expressed prior beliefs about the classes, one can at least find an "average" solution by finding the posterior mean, say, of the cell probabilities under the model. For example, suppose I have a uniform prior over the five probabilities p1, the probability of being in the first class (assumed less than 0.5 for identifiability), p_{T|1}, the conditional probability of reading the Times given class 1, p_{D|1}, the conditional probability of reading the Daily News given class 1, and p_{T|2} and p_{D|2}, similarly defined for class 2. The posterior means of the conditional and marginal cell counts are given in Table 6.
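The calculation behind Table 6 was done with specialized numerical integration software (cited below); as an illustration of the same idea, here is a minimal importance-sampling sketch, assuming the aggregate counts of Table 5 as the observed data and the uniform prior just described. It is a sketch of the approach, not a reproduction of the published numbers:

import numpy as np

rng = np.random.default_rng(0)

# Aggregate counts from Table 5, in the order (Times yes, News yes),
# (yes, no), (no, yes), (no, no).
counts = np.array([116, 244, 524, 116])

# Uniform prior: p1 ~ U(0, 0.5) for identifiability; the four conditional
# reading probabilities are independent U(0, 1).
m = 200_000
p1 = rng.uniform(0.0, 0.5, m)
pT1, pD1, pT2, pD2 = rng.uniform(0.0, 1.0, (4, m))

def cell_probs(p1, pT1, pD1, pT2, pD2):
    """2 x 2 cell probabilities under the two-class latent structure model
    (conditional independence of the two readership variables within class)."""
    c1 = np.stack([pT1 * pD1, pT1 * (1 - pD1),
                   (1 - pT1) * pD1, (1 - pT1) * (1 - pD1)])
    c2 = np.stack([pT2 * pD2, pT2 * (1 - pD2),
                   (1 - pT2) * pD2, (1 - pT2) * (1 - pD2)])
    return p1 * c1 + (1 - p1) * c2

probs = cell_probs(p1, pT1, pD1, pT2, pD2)       # shape (4, m)
loglik = counts @ np.log(probs)                  # multinomial log-likelihood
w = np.exp(loglik - loglik.max())
w /= w.sum()                                     # importance weights on prior draws

print((probs * w).sum(axis=1))   # posterior mean cell probabilities
print((p1 * w).sum())            # posterior mean of the smaller class probability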

TABLE 6
Posterior from uniform prior

                                 Regularly Read Daily News
Regularly Read Times                Yes        No     Total

Smaller latent class (1)
  Yes                             .0232     .0005     .0236
  No                              .9532     .0232     .9764
  Total                           .9764     .0236     .4830

Larger latent class (2)
  Yes                             .2137     .4766     .6904
  No                              .0959     .2137     .3096
  Total                           .3096     .6904     .5170

Marginal probabilities
  Yes                             .1218     .2465     .3683
  No                              .5099     .1218     .6317
  Total                           .6317     .3683     1.0

The estimation was done by using the program of Smith, Skene, Shaw, Naylor and Dransfield (1984). The numbers in the lower right corners of the subtables are the means of p1 and 1 − p1. The marginal table is not identical with the original table, but we do not expect it to be, due to substantial uncertainty and asymmetry in the posterior distribution. I also used a different prior distribution that had high prior means in the cells with low probabilities in Table 6 to see how sensitive the fit was to the prior. The posterior means of the cell probabilities were very close to those in Table 6. The important thing to keep in mind when estimating latent class parameters is that, unless one has an a priori reason to believe there are such classes and what they are, one will be hard pressed to offer any explanation for what the estimates are estimates of. If one has prior beliefs about what the latent classes are, we saw how a Bayesian analysis can help to deal with the nonidentifiability in small tables. Identifiability is not a problem in larger tables in which the number of cells is much larger than the number of parameters fit by a latent class model. Also, "rotation" of latent classes is not an option as was rotation of factors in factor analysis. However, there is still more to the analogy between latent class models and factor analysis. The analogy extends to the two modes in which they can operate. Exploratory latent class modeling is searching for latent classes and hoping you can interpret them. There is also a mode which I would call confirmatory latent class modeling. Just as in confirmatory factor analysis, one can incorporate prior assumptions about the latent classes and then fit models, performing a confirmatory latent class analysis and avoiding the trap.

13. CONCLUSION

The theory and practice of multivariate analysis has come a long way since 1958, and a great many talented people have contributed to the progress. The books by Anderson and Dillon and Goldstein give an excellent overview of that progress. Each one does a good job of what it sets out to do. Were one to teach a purely theoretical course in multivariate statistics to graduate students, one could do much worse than follow Anderson's text. One could do slightly better by augmenting it with a supplementary text offering ...
