Plot supplementary variables and observations in principal components analysis


As part of a principal components analysis (PCA), you can produce useful plots about both the variables and the observations being analyzed. For variables, the component pattern plot shows the loadings (correlations) of the analyzed variables with each of the principal components. The component pattern plot is available with the PLOTS=PATTERN option in the PRINCOMP procedure. For observations, a plot of their scores on the principal components is available with the PLOTS=SCORE option.

Suppose there are additional variables or observations that were not used in the PCA, and you want to include them in the pattern and score plots. These are called supplementary variables and observations. Including supplementary variables in the pattern plot lets you see how strongly they correlate with the principal components and how close they lie to the analyzed variables, which suggests similarity. Likewise, in the score plot, the location of a supplementary observation relative to the analyzed observations indicates similarity with nearby observations.

The following example shows how you can produce pattern and score plots that include supplementary variables and observations. It uses the crime data in the Getting Started example in the PRINCOMP documentation. This data set contains crime statistics for the various states.

Plot Supplementary Observations

PROC PRINCOMP ignores observations with a missing or nonpositive weight. However, if such an observation has no missing values on the analysis variables specified in the VAR statement, its scores on the principal components are still computed and are available in the OUT= data set. Supplementary observations can therefore be designated by assigning them a missing or nonpositive weight.

To illustrate, suppose that the states beginning with the letter W (Washington, West Virginia, Wisconsin, and Wyoming) are not included in the PCA but are to be plotted as supplementary observations. This is done in the DATA step below by adding a weight variable (SupObs) in the data set with value 1 for each observation to be included in the analysis and 0 for the supplementary observations:

   data crime; set crime;
      SupObs=(substr(state,1,1) ne 'W');
      run;
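
As a quick check before running the analysis, the flagged observations can be listed. This is a minimal sketch that relies only on the State and SupObs variables created in the DATA step above; with the crime data it would be expected to list the four states named earlier.

   /* List the observations designated as supplementary (weight 0) */
   proc print data=crime noobs;
      where SupObs=0;
      var state SupObs;
      run;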

The following statements conduct the PCA using PROC PRINCOMP on five of the variables in the Crime data set. The N=2 option requests extraction of two principal components. The PLOTS option requests both the component pattern plot, with circles at the 50%, 75%, and 100% levels (CIRCLES=50 75 100), and the component score plot with a 95% prediction ellipse. The OUT= option saves a data set that adds variables containing the principal component scores to the input data set. The WEIGHT statement weights each observation using the SupObs variable as described above. The ID statement uses the State variable to label the observations in the score plot. The ODS OUTPUT statement saves the data set that defines the pattern plot, to be used in the next section.

   proc princomp data=crime out=PCscores n=2
      plots(only)=(pattern(circles=50 75 100)
                   score(ellipse));
      weight SupObs;
      var robbery--auto_theft;
      id state;
      ods output PatternPlot=PatPlot;
      run;

In the OUT= data set (not shown), even though the supplementary observations were omitted from the analysis because they have zero weights, they have scores on the principal components. Because of this, they appear in the component scores plot.
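
To make the supplementary observations easy to pick out, the scores can also be redrawn with PROC SGPLOT, using SupObs as a grouping variable. The following is only a sketch: it assumes that the OUT= data set holds the scores in variables named Prin1 and Prin2 (the names used later in this example) together with State and SupObs. The axis labels and title are illustrative choices.

   /* Redraw the scores; GROUP=SupObs separates analyzed (1) from supplementary (0) */
   proc sgplot data=PCscores;
      scatter x=Prin1 y=Prin2 / group=SupObs datalabel=state;
      refline 0 / axis=x;
      refline 0 / axis=y;
      xaxis label="Component 1";
      yaxis label="Component 2";
      title "Component Scores by Analysis Status";
      run;
   title;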

Plot Supplementary Variables

The PLOTS=PATTERN option displays a plot (not shown) of the loadings on the principal components of the five variables included in the PCA. Suppose the remaining two variables, Murder and Rape, are to be treated as supplementary variables and added to the pattern plot. Unlike for supplementary observations, supplementary variables cannot be designated and automatically added in the plot produced by the PLOTS=PATTERN option in PRINCOMP. Instead, the correlations of these variables with the principal components must be computed and added to the data set that defines the pattern plot. You can then produce the plot, with the supplementary variables added, using PROC SGPLOT.

The following statements use PROC CORR to compute the Pearson correlations between the supplementary variables and the principal components. The correlations (appearing in observations with _TYPE_='CORR') are saved in the OUTP= data set. To match the structure of the PatternPlot data set (saved as PatPlot), the OUTP= data set from PROC CORR is transposed and then appended to the PatPlot data set. The resulting data set (AllCorrs) can then be used to reconstruct the pattern plot in PROC SGPLOT. The SupVar variable is added with value 1 for the correlations of the supplementary variables and value 0 for the loadings of the analysis variables. SupVar is used as a grouping variable to distinguish the analysis and supplementary variables in the plot.

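   /* Correlate the supplementary variables with the component scores */
   proc corr data=PCscores outp=SupCorr(where=(_type_='CORR'));
      var murder rape;
      with prin:;
      run;
   /* Transpose so that each supplementary variable becomes one row */
   proc transpose data=SupCorr out=SupCorr name=Variable;
      run;
   /* Append to the pattern plot data; SupVar=1 flags the supplementary rows */
   data AllCorrs;
      set PatPlot SupCorr(in=s);
      SupVar=(s=1);
      run;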
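
To inspect the combined data set, it can be printed. This sketch lists only the columns that the plotting statements later in this example refer to (Variable, Prin1, Prin2, and SupVar); PatPlot may carry other columns, such as those used for the circle labels, which are left out here.

   /* Show the loadings of the analysis and supplementary variables */
   proc print data=AllCorrs noobs;
      var Variable Prin1 Prin2 SupVar;
      run;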

The pattern plot has several components, each of which is created by a statement in PROC SGPLOT. The REFLINE statements create the central axes, while the XAXIS and YAXIS statements set the value ranges shown at the edges of the plot. The ELLIPSEPARM statements draw the circles that indicate specific correlation magnitudes. A circle with unit radius designates a correlation of 1 (100%). The radius value, r, that appears in the SEMIMINOR= and SEMIMAJOR= options of each statement is not the circle's level itself; it is derived from the area of the desired circle relative to the area of the circle with unit radius. For a circle whose level is a (0.5 for the 50% circle, for example), that relative area is a, so the radius, r, is then

r = sqrt(a/π) / sqrt(1/π) = sqrt(a).

For the circles at levels 0.5, 0.75, and 1, r = 0.7071, 0.8660, and 1, respectively (rounded to 0.71 and 0.87 in the ELLIPSEPARM statements). The first SCATTER statement places the labels of the correlation circles on the plot. The second SCATTER statement plots the symbols for the variable loadings. The GROUP=SupVar option allows different colors to be used for the loadings of the analysis variables and the supplementary variables. The ASPECT= option in the PROC SGPLOT statement is used to ensure that the correlation circles appear circular rather than elliptical; its value might require adjustment. The NOAUTOLEGEND option omits the default legend.

   proc sgplot data=AllCorrs aspect=.9 noautolegend;
     ellipseparm semimajor=1 semiminor=1 /
       slope=0 xorigin=0 yorigin=0 clip
       lineattrs=(color=blue)
       transparency=0.9;
     ellipseparm semimajor=0.87 semiminor=0.87 /
       slope=0 xorigin=0 yorigin=0 clip
       lineattrs=(color=blue)
       transparency=0.9;
     ellipseparm semimajor=0.71 semiminor=0.71 /
       slope=0 xorigin=0 yorigin=0 clip
       lineattrs=(color=blue)
       transparency=0.9;
     scatter x=xcirclelabel y=ycirclelabel /
        markercharattrs=(size=9pt)
        markerchar=circlelabel transparency=0.7;
     scatter x=prin1 y=prin2 / group=SupVar datalabel=variable;
     refline 0 / axis=x;
     refline 0 / axis=y;
     xaxis values=(-1.0 to 1.0 by 0.2) label="Component 1";
     yaxis values=(-1.0 to 1.0 by 0.2) label="Component 2";
     title "Component Pattern";
     run;

In the resulting pattern plot, note that the supplementary variables, Murder and Rape, point in a direction similar to that of Assault, probably because these are crimes against the person rather than against property, such as Auto Theft. This distinction between personal and property crime seems to be the interpretation of the second (vertical) principal component. Because all crime types point in a positive direction on the first component, that component can be interpreted as the overall amount of crime.
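
The radii used in the ELLIPSEPARM statements, and the position of each variable relative to the circles, can be checked numerically. The following is a sketch rather than part of the original example: the first step writes r = sqrt(a) to the log for the three circle levels, and the second computes the squared distance of each variable from the origin, which can be compared with those levels. The data set name LoadingLength and the variable name SqLength are made up for illustration, and the WHERE statement assumes that any rows of AllCorrs that do not hold loadings have missing Prin1 or Prin2 values.

   /* Radii of the circles at levels 0.5, 0.75, and 1 */
   data _null_;
      do a = 0.5, 0.75, 1;
         r = sqrt(a);
         put a= r=6.4;
      end;
      run;

   /* Squared distance of each variable from the origin */
   data LoadingLength;
      set AllCorrs;
      where not missing(prin1) and not missing(prin2);
      SqLength = prin1**2 + prin2**2;   /* compare with 0.5, 0.75, and 1 */
      keep Variable SupVar SqLength;
      run;
   proc print data=LoadingLength noobs;
      run;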

Biplot of Scores and Loadings

The plots of observation scores and variable loadings can be combined to form a biplot, which makes it possible to see the association between observations and variables. As is conventional, the variable loadings are depicted as vectors that indicate the direction of increasing variable value.

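The following statements combine the data sets of observation scores (PCscores) and loadings (AllCorrs). The first DATA step re-creates AllCorrs with the loadings multiplied by 4, which lengthens the loading vectors and makes them more visible in the plot. The scaled loadings are stored in new variables (vecprin1 and vecprin2), and the original Prin1 and Prin2 columns are dropped so that they do not conflict with the score variables of the same names when the two data sets are merged.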

   data AllCorrs;
      set PatPlot SupCorr(in=s);
      SupVar=(s=1);
      vecprin1=4*prin1;
      vecprin2=4*prin2;
      drop prin1 prin2;
      run;
   data Biplot;
      merge AllCorrs PCscores;
      run;
   proc sgplot data=Biplot aspect=.9 noautolegend;
      vector x=vecprin1 y=vecprin2 / group=SupVar datalabel=variable;
      scatter x=prin1 y=prin2 / datalabel=state;
      refline 0 / axis=x;
      refline 0 / axis=y;
      xaxis values=(-5.0 to 5.0) label="Component 1";
      yaxis values=(-3.0 to 3.0) label="Component 2";
      title "Biplot of scores and loadings";
      run;

In this plot, perpendicular projections of observations onto a variable vector roughly indicate their relative strengths on that variable. For example, Massachusetts, Rhode Island, Alaska, and New York have high values on Auto Theft. North Carolina and Mississippi have low values. Similarly, South Carolina and Florida have high values on the personal crimes (Assault, Murder, Rape) while North Dakota and Wisconsin are low.
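
The multiplier of 4 applied to the loadings is a visual choice. One way to judge whether a different value would suit other data, offered here only as a sketch, is to look at the largest absolute scores on each component, because the scaled vectors have a length of at most the multiplier. The macro variable names MaxP1 and MaxP2 are made up for illustration.

   /* Largest absolute score on each component, written to the log */
   proc sql noprint;
      select max(abs(Prin1)), max(abs(Prin2))
         into :MaxP1 trimmed, :MaxP2 trimmed
         from PCscores;
      quit;
   %put NOTE: Largest absolute scores: Prin1=&MaxP1 Prin2=&MaxP2;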