PROC CLUSTER results might be incorrect for TYPE=DISTANCE data depending on the order of the variables


PROC CLUSTER can use a TYPE=DISTANCE data set as input. Because of the special structure of this data set, PROC CLUSTER assumes that the order of the columns matches the order of the rows. This is typically not a problem, but incorrect results can occur under the conditions described here.

The most common way of creating a TYPE=DISTANCE data set is to use PROC DISTANCE. The DISTANCE procedure organizes the OUT= data set so that the rows and columns are in the same order, and PROC CLUSTER expects the data to be in this arrangement. If you do not use a VAR statement when you run PROC CLUSTER, it correctly uses the variables in the order in which PROC DISTANCE created them.
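The following is a minimal sketch of this default workflow, not a definitive recipe. The input data set (the first three observations of SASHELP.CLASS), the analysis variables Height and Weight, the output name DIST, and METHOD=AVERAGE are all assumptions chosen for illustration.

```sas
/* Assumption: SASHELP.CLASS is used only as convenient sample data. */
proc distance data=sashelp.class(obs=3) out=dist method=euclid;
   var interval(height weight);   /* variables used to compute distances */
   id name;                       /* ID values become the column names   */
run;

/* No VAR statement: PROC CLUSTER reads the distance columns in the */
/* order in which PROC DISTANCE wrote them, which matches the rows. */
proc cluster data=dist method=average;
   id name;
run;
```

Because the VAR statement is omitted in PROC CLUSTER, there is no opportunity to list the distance columns in an order that differs from the rows.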

However, if you choose to use a VAR statement when you run PROC CLUSTER on a TYPE=DISTANCE data set created by PROC DISTANCE, the VAR statement has to be identical to the VAR statement used in PROC DISTANCE; both the number and the order of the variables have to be the same. If PROC DISTANCE did not have a VAR statement, then the VAR statement in PROC CLUSTER must include all numeric variables contained in the data set input to PROC DISTANCE, in the same order. If either the number or the order of the variables differs, the PROC CLUSTER results will be incorrect. There will be no warnings or errors to indicate a problem.

If you create your own TYPE=DISTANCE data set by making changes to the data set created by PROC DISTANCE, or by inputting your own distance data, the resulting rows and columns must be in the same order. If the rows and columns are not ordered in the same way, the PROC CLUSTER results might be incorrect. There will be no warnings or errors to indicate that a problem has occurred.
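Because PROC CLUSTER does not report a mismatch, it can help to check the ordering yourself before clustering. The sketch below is one possible check, not a built-in feature. It assumes a hypothetical data set WORK.DIST whose ID variable is Name, whose ID values are valid SAS names, and whose only numeric variables are the distance columns. The data set names COLS and NUMCOLS are also hypothetical.

```sas
/* List the variables of WORK.DIST with their position and type. */
proc contents data=dist out=cols(keep=name type varnum) noprint;
run;

/* Keep the numeric (TYPE=1) columns, in data set order. */
proc sort data=cols(where=(type=1)) out=numcols;
   by varnum;
run;

/* One-to-one merge (no BY statement): pairs row i with numeric column i. */
data _null_;
   merge dist(keep=name rename=(name=row_label))
         numcols(rename=(name=col_name));
   if upcase(strip(row_label)) ne upcase(strip(col_name)) then
      put 'Order mismatch at position ' _n_
          ': row is ' row_label 'but column is ' col_name;
run;
```

If the log shows no mismatch messages, the row labels and the numeric columns appear in the same order under the stated assumptions. The check does not examine any VAR statement that you later add to PROC CLUSTER.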

For more information about TYPE=DISTANCE data sets, see the chapter "Special SAS Data Sets" in the SAS/STAT User's Guide.


EXAMPLES

Name       Alfred     Alice     Barbara

Alfred      0.0000      .           .
Alice      31.1207     0.0000       .
Barbara    14.9646    16.5360       0

proc cluster...
  id name;
run;

The names of the columns (Alfred, Alice, Barbara) are in the same order as the rows (Alfred, Alice, Barbara).

Name       Alfred     Alice     Barbara

Alfred      0.0000      .           .
Alice      31.1207     0.0000       .
Barbara    14.9646    16.5360       0

proc cluster...
  id name;
  var Alfred Alice Barbara;
run;

This time a VAR statement is used, and the order of the variables in the VAR statement (Alfred, Alice, Barbara) is the same as the order of the columns in the data.

Name       Alfred     Alice     Barbara

Alfred      0.0000      .         .
Alice      31.1207     0.0000     .
Barbara    14.9646    16.5360     0


proc cluster...
   id name;
   var Alice Alfred Barbara;
run;

The variables in the VAR statement (Alice, Alfred, Barbara) are in a different order from those in the data set (Alfred, Alice, Barbara). The PROC CLUSTER results might be incorrect.

Name       Alice      Alfred    Barbara

Alfred     31.1207     0.0000    14.9646
Alice       0.0000    31.1207    16.5360
Barbara    16.5360    14.9646     0.0000

proc cluster...
   id name;
run;

The order of the columns in the data set (Alice, Alfred, Barbara) is different from the order of the rows (Alfred, Alice, Barbara), so the PROC CLUSTER results might be incorrect.
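One way to repair a data set like this is to rewrite it with the columns in row order before clustering. The following is a sketch only: the data set names DIST_BAD and DIST_FIXED are hypothetical, METHOD=AVERAGE is an arbitrary choice, and the column list must be adapted to your own data.

```sas
/* A RETAIN statement placed before SET fixes the variable order */
/* in the new data set; the values themselves are not changed.   */
data dist_fixed(type=distance);
   retain Name Alfred Alice Barbara;   /* columns listed in row order */
   set dist_bad;
run;

proc cluster data=dist_fixed method=average;
   id name;
run;
```

The TYPE=DISTANCE data set option is specified explicitly because a DATA step does not carry the data set type over from its input.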

Name       Alice      Alfred    Barbara

Alfred     31.1207     0.0000    14.9646
Alice       0.0000    31.1207    16.5360
Barbara    16.5360    14.9646     0.0000

proc cluster...
   id name;
   var Alice Alfred Barbara;
run;