PROC SURVEYLOGISTIC provides parameter estimates for levels of CLASS variables that do not exist in the domain


The complete data and code for this example appear in the Full Code section at the end.

When a model is built using sample data from several domains, the model design matrix is built using all levels of the CLASS variables found across all domains. In a domain analysis, the procedure does not remove columns from this design matrix the way a BY statement would. Instead, it modifies the weights associated with non-domain observations. Parameters for CLASS variable levels that are not observed in the domain therefore remain in the model, because the design matrix still contains columns for them. Because of the dummy variables produced by the CLASS statement and the nonlinear operations involved in estimating the logistic model, the procedure can produce nonzero parameter estimates for levels that are unobserved in the domain. This is unlike ordinary linear regression, in which a zero weight in an unobserved level for a particular domain always results in zero estimates for those levels in the domain.

The presence of nonzero estimates for the unobserved levels can make interpretation of the results difficult. To ensure that unobserved CLASS levels have zero estimates, set the response for all the observations outside the domain of interest to missing. Specify the NOMCAR option in the SURVEYLOGISTIC statement and omit the DOMAIN statement. Because the NOMCAR option essentially treats missing values as a separate domain, columns are added in the design matrix only for levels observed in the nonmissing observations. The model fit will be unaffected, yielding the same log likelihood.

The following example illustrates this issue. Data were collected from farms in two separate domains (State): Iowa, with three regions, and Nebraska, with four regions. These statements save the data in a data set named Farms.

      data Farms;
        input State $ Region FarmArea CornYield Weight Resp;
        datalines;
      Iowa   1 100 54 33.333 0
      Iowa   1 83 25 33.333 0
      Iowa   1 25 10 33.333 1
      Iowa   4 120 83 10.000 0
      Iowa   4 50 35 10.000 0
      Iowa   4 110 65 10.000 1
      Iowa   4 60 35 10.000 1
      Iowa   4 45 20 10.000 1
      Iowa   3 23  5 5.000 0
      Iowa   3 10  8 5.000 0
      Iowa   3 350 125 5.000 1
      Nebraska 1 130 20 5.000 0
      Nebraska 1 245 25 5.000 0
      Nebraska 1 150 33 5.000 0
      Nebraska 1 263 50 5.000 1
      Nebraska 1 320 47 5.000 1
      Nebraska 1 204 25 5.000 1
      Nebraska 2 80 11 10.000 0
      Nebraska 2 48  8 10.000 1
      Nebraska 3 180 13 10.000 0
      Nebraska 3 148 28 10.000 1
      Nebraska 4 180 13 10.000 0
      Nebraska 4 128 48 13.000 1
      ;
      proc freq data=Farms;
        tables State*Region;
        title 'Domain Specific Observations for each Region';
        run;

Note that Iowa does not have any observations in Region=2 while Nebraska has observations from all four regions.

Domain Specific Observations for each Region

The FREQ Procedure

Table of State by Region
(each cell lists Frequency, Percent, Row Pct, Col Pct)

      State                  Region=1  Region=2  Region=3  Region=4    Total
      Iowa      Frequency           3         0         3         5       11
                Percent         13.04      0.00     13.04     21.74    47.83
                Row Pct         27.27      0.00     27.27     45.45
                Col Pct         33.33      0.00     60.00     71.43
      Nebraska  Frequency           6         2         2         2       12
                Percent         26.09      8.70      8.70      8.70    52.17
                Row Pct         50.00     16.67     16.67     16.67
                Col Pct         66.67    100.00     40.00     28.57
      Total     Frequency           9         2         5         7       23
                Percent         39.13      8.70     21.74     30.43   100.00
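
The empty cell can also be located programmatically, which is convenient when there are many domains or CLASS levels. The following is a minimal sketch that uses the SPARSE option so that zero-count combinations are written to the OUT= data set. The data set name CellCounts is an assumption introduced for this illustration.

      /* Write every State-by-Region combination, including empty cells */
      proc freq data=Farms noprint;
        tables State*Region / sparse out=CellCounts;
      run;

      /* List only the combinations that have no observations */
      proc print data=CellCounts noobs;
        where Count = 0;
        var State Region Count;
        title 'State-by-Region cells with no observations';
      run;

For the Farms data, the listing should contain only the Iowa, Region=2 combination.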

The following statements perform a domain analysis, fitting a separate model for each state. In this case, however, only the results for Iowa are of interest to the investigator.

      proc surveylogistic data=Farms;
        domain State;
        class Region / param=glm;
        model Resp = Region FarmArea;
        weight Weight;
        title 'Domain Analysis results for Iowa only - DOMAIN statement';
        run;

Although Iowa has no observations in Region=2, the domain-specific estimates for Iowa include an estimate for Region=2, which makes the individual estimates difficult to interpret.

The SURVEYLOGISTIC Procedure

Domain Analysis for domain State=Iowa

Class Level Information

      Class    Value    Design Variables
      Region   1        1  0  0  0
               2        0  1  0  0
               3        0  0  1  0
               4        0  0  0  1

Model Fit Statistics

      Criterion    Intercept Only    Intercept and Covariates
      AIC                 225.848                     221.169
      SC                  228.954                     236.699
      -2 Log L            223.848                     211.169

Analysis of Maximum Likelihood Estimates

      Parameter      Estimate    Standard Error    t Value    Pr > |t|
      Intercept       -0.7987            1.9191      -0.42      0.6813
      Region 1         1.1443            1.6025       0.71      0.4827
      Region 2         0.4739            1.8686       0.25      0.8022
      Region 3         0.9667            2.2668       0.43      0.6739
      Region 4              0                 .          .           .
      FarmArea        0.00507            0.0209       0.24      0.8102

NOTE: The degrees of freedom for the t tests is 22.
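
For contrast, the BY-statement behavior mentioned earlier can be sketched as follows. This is an illustration only, not a recommended analysis: BY-group processing fits each state completely separately, so the design matrix for Iowa is built only from the levels observed in Iowa, but the SAS documentation notes that BY-group processing does not provide a statistically valid domain analysis. The data set name FarmsSorted is an assumption introduced for this sketch.

      /* BY-group processing requires the data to be sorted by the BY variable */
      proc sort data=Farms out=FarmsSorted;
        by State;
      run;

      /* Each state is fit separately; Region=2 is not a CLASS level for Iowa */
      proc surveylogistic data=FarmsSorted;
        by State;
        class Region / param=glm;
        model Resp = Region FarmArea;
        weight Weight;
        title 'Separate BY-group fits (not a valid domain analysis)';
      run;

The approach described next also drops the unobserved level, but it keeps the variance adjustment for non-domain observations.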

To avoid the problem, the same model can be fit for Iowa alone by setting the response to missing for all the Nebraska observations as done in the following DATA step.

      data Farms2;
        set Farms;
        if State='Nebraska' then Resp=.;
        run;
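
Before fitting the model, it can be worth confirming that only the Nebraska responses were set to missing. A minimal check, assuming the Farms2 data set created above, is to count nonmissing and missing values of Resp within each state:

      /* N = nonmissing responses, NMISS = missing responses, by State */
      proc means data=Farms2 n nmiss;
        class State;
        var Resp;
        title 'Nonmissing and missing Resp counts by State';
      run;

With the Farms data, Iowa should show 11 nonmissing responses and Nebraska should show 12 missing responses.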

Adding the NOMCAR option in the SURVEYLOGISTIC statement provides a proper domain analysis that adjusts the variance for non-domain observations.

      proc surveylogistic data=Farms2 nomcar;
        class Region / param=glm;
        model Resp = Region FarmArea;
        weight Weight;
        title 'Domain Analysis results for Iowa only - NOMCAR option';
        run;

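Notice that the -2 Log L value and the estimates of the remaining parameters are identical to those in the first SURVEYLOGISTIC analysis, which used the DOMAIN statement. AIC and SC are smaller because the model has one fewer parameter, and the standard errors differ slightly. The absence of a Region=2 estimate makes the resulting parameter estimates more consistent with the attributes of the domain.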

Domain Analysis results for Iowa only - NOMCAR option

The SURVEYLOGISTIC Procedure

Class Level Information

      Class    Value    Design Variables
      Region   1        1  0  0
               3        0  1  0
               4        0  0  1

Model Fit Statistics

      Criterion    Intercept Only    Intercept and Covariates
      AIC                 225.848                     219.169
      SC                  228.954                     231.593
      -2 Log L            223.848                     211.169

Analysis of Maximum Likelihood Estimates

      Parameter      Estimate    Standard Error    t Value    Pr > |t|
      Intercept       -0.7987            1.8680      -0.43      0.6731
      Region 1         1.1443            1.5597       0.73      0.4709
      Region 3         0.9667            2.2063       0.44      0.6656
      Region 4              0                 .          .           .
      FarmArea        0.00507            0.0203       0.25      0.8051

NOTE: The degrees of freedom for the t tests is 22.


Full Code for this example:

data Farms;
      input State $ Region FarmArea CornYield Weight resp; 
      datalines; 
   Iowa     1 100  54 33.333 0
   Iowa     1  83  25 33.333 0
   Iowa     1  25  10 33.333 1
   Iowa     4 120  83 10.000 0
   Iowa     4  50  35 10.000 0
   Iowa     4 110  65 10.000 1
   Iowa     4  60  35 10.000 1
   Iowa     4  45  20 10.000 1
   Iowa     3  23   5  5.000 0
   Iowa     3  10   8  5.000 0
   Iowa     3 350 125  5.000 1
   Nebraska 1 130  20  5.000 0 
   Nebraska 1 245  25  5.000 0
   Nebraska 1 150  33  5.000 0
   Nebraska 1 263  50  5.000 1
   Nebraska 1 320  47  5.000 1
   Nebraska 1 204  25  5.000 1
   Nebraska 2  80  11 10.000 0
   Nebraska 2  48   8 10.000 1
   Nebraska 3 180  13 10.000 0
   Nebraska 3 148  28 10.000 1
   Nebraska 4 180  13 10.000 0
   Nebraska 4 128  48 13.000 1
   ;
run;

proc freq data=farms;
tables state*region;
title 'Domain Specific Observations for each region';
run;

proc surveylogistic data=Farms; 
domain state;
class  region/param=glm;
model  resp = region FarmArea;
weight Weight;
title 'Domain Analysis results for Iowa Only using DOMAIN statement';
ods select Surveylogistic.Domain1.ClassLevelInfo Surveylogistic.Domain1.FitStatistics Surveylogistic.Domain1.ParameterEstimates;
run;

data farms2;
   set farms;
   if state='Nebraska' then resp=.;
run;

proc surveylogistic data=farms2 nomcar;
class  region/param=glm;
title 'Domain Analysis results for Iowa Only using NOMCAR option';
model  resp = region FarmArea;
weight Weight;
ods select ClassLevelInfo FitStatistics ParameterEstimates;
run;