Understanding automatically retained variables when performing a merge: results of doing a merge and conditional processing in the same DATA step


When you perform a many-to-one merge, the variable values in the non-duplicate data set are retained and carried down the BY-group. The values are not reinitialized to missing until the BY-group changes. If you are performing a merge and then making changes to one of the variable values that originated in the non-duplicate data set, that change carries down the BY-group. This is why you will see different results from doing the merge and variable manipulation in one DATA step versus doing the merge in one DATA step and then doing the variable manipulation in a second DATA step.

The sample code below illustrates this.

data one;
   input a b;
   datalines;
10 10 
10 20 
10 30
10 40
10 50
;
run;

data two;
   input a c;
   datalines;
10 100
;
run;

/*
   Do the merge and subset in two DATA steps

   The data set MATCH1 is created via a many-to-one merge.
   The value of variable C carries down the BY-Group, as is
   the nature of merge.
*/

data match1;
   merge one two;
   by a; 
run;

/*
  Subsetting with a second step

  Since you are doing a separate SET to the data set, 
  any changes that are made to a variable's
  value apply only to that observation. Note that only 
  one observation has a value of 50 for C in this example.
*/

data match2 ;
   set match1;
   if b = 20 then c = 50;
run;

/* RESULTS *

  Obs     a     b     c

   1     10    10    100
   2     10    20     50
   3     10    30    100
   4     10    40    100
   5     10    50    100


*/

/*
Try it all in one step. Here, the value of variable C is
changed in the second observation from 100 to 50. Then, because
you are using BY-group processing in the merge, the
value is retained for the remainder of the BY-group. Note that 
four observations have a value of 50 for C in this example. */

data match3 ;
   merge one two;
   by a;
   if b = 20 then c = 50;
run;

/* RESULTS *

  Obs     a     b     c

   1     10    10    100
   2     10    20     50
   3     10    30     50
   4     10    40     50
   5     10    50     50
*/