The complete data and code for this example are in the Full Code section at the end of this article.

When a model is fit to sample data from several domains, the design matrix is built from all levels of the CLASS variables found across all domains. In a domain analysis, the procedure does not remove columns from this design matrix the way a BY statement would. Instead, it modifies the weights associated with the non-domain observations. As a result, parameters for CLASS variable levels that are not observed in the domain remain in the model, because the design matrix still contains columns for them.

Because of the dummy variables that the CLASS statement produces and the nonlinear operations involved in estimating the logistic model, the procedure can produce nonzero estimates for levels that are unobserved in the domain. This is unlike ordinary linear regression, in which a zero weight for an unobserved level in a particular domain always results in a zero estimate for that level in the domain.

The presence of nonzero estimates for the unobserved levels can make the results difficult to interpret. To ensure that unobserved CLASS levels do not receive nonzero estimates, set the response to missing for all observations outside the domain of interest. Then specify the NOMCAR option in the PROC SURVEYLOGISTIC statement and omit the DOMAIN statement. Because the NOMCAR option essentially treats missing values as a separate domain, columns are added to the design matrix only for levels observed in the nonmissing observations. The model fit is unaffected and yields the same log likelihood.

The following example illustrates this issue. Data were collected from farms in two domains defined by State: Iowa, with observations in three regions, and Nebraska, with observations in four regions. These statements save the data in a data set named Farms.

      data Farms;
        input State $ Region FarmArea CornYield Weight Resp;
        datalines;
      Iowa   1 100 54 33.333 0
      Iowa   1 83 25 33.333 0
      Iowa   1 25 10 33.333 1
      Iowa   4 120 83 10.000 0
      Iowa   4 50 35 10.000 0
      Iowa   4 110 65 10.000 1
      Iowa   4 60 35 10.000 1
      Iowa   4 45 20 10.000 1
      Iowa   3 23  5 5.000 0
      Iowa   3 10  8 5.000 0
      Iowa   3 350 125 5.000 1
      Nebraska 1 130 20 5.000 0
      Nebraska 1 245 25 5.000 0
      Nebraska 1 150 33 5.000 0
      Nebraska 1 263 50 5.000 1
      Nebraska 1 320 47 5.000 1
      Nebraska 1 204 25 5.000 1
      Nebraska 2 80 11 10.000 0
      Nebraska 2 48  8 10.000 1
      Nebraska 3 180 13 10.000 0
      Nebraska 3 148 28 10.000 1
      Nebraska 4 180 13 10.000 0
      Nebraska 4 128 48 13.000 1
      ;
      proc freq data=Farms;
        tables State*Region;
        title 'Domain Specific Observations for each Region';
        run;

Note that Iowa does not have any observations in Region=2 while Nebraska has observations from all four regions.

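Domain Specific Observations for each Region

The FREQ Procedure

Table of State by Region
State	Statistic	Region 1	Region 2	Region 3	Region 4	Total
Iowa	Frequency	3	0	3	5	11
	Percent	13.04	0.00	13.04	21.74	47.83
	Row Pct	27.27	0.00	27.27	45.45	
	Col Pct	33.33	0.00	60.00	71.43	
Nebraska	Frequency	6	2	2	2	12
	Percent	26.09	8.70	8.70	8.70	52.17
	Row Pct	50.00	16.67	16.67	16.67	
	Col Pct	66.67	100.00	40.00	28.57	
Total	Frequency	9	2	5	7	23
	Percent	39.13	8.70	21.74	30.43	100.00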
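
The same gap can be confirmed without scanning the crosstabulation. The following sketch is not part of the original example. It uses PROC SQL to count the distinct Region levels observed in each State. The column name RegionsObserved is chosen here only for illustration.

      proc sql;
        /* Count the distinct Region levels that actually occur in each State */
        select State, count(distinct Region) as RegionsObserved
          from Farms
          group by State;
        quit;

With the Farms data, Iowa shows three observed levels and Nebraska shows four. Any state with fewer levels than the maximum is a candidate for the issue described here.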

The following statements perform a domain analysis that fits a separate model for each state. In this case, however, only the results for Iowa are of interest to the investigator.

      proc surveylogistic data=Farms;
        domain State;
        class Region / param=glm;
        model Resp = Region FarmArea;
        weight Weight;
        title 'Domain Analysis results for Iowa only - DOMAIN statement';
        run;

Although Iowa has no observations in Region=2, the domain-specific estimates for Iowa include an estimate for Region=2, which makes the individual estimates difficult to interpret.

The SURVEYLOGISTIC Procedure

Domain Analysis for domain State=Iowa

Class Level Information
Class	Value	Design Variables
Region	1	1	0	0	0
	2	0	1	0	0
	3	0	0	1	0
	4	0	0	0	1

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	225.848	221.169
SC	228.954	236.699
-2 Log L	223.848	211.169

Analysis of Maximum Likelihood Estimates
Parameter		Estimate	Standard Error	t Value	Pr > |t|
Intercept		-0.7987	1.9191	-0.42	0.6813
Region	1	1.1443	1.6025	0.71	0.4827
Region	2	0.4739	1.8686	0.25	0.8022
Region	3	0.9667	2.2668	0.43	0.6739
Region	4	0	.	.	.
FarmArea		0.00507	0.0209	0.24	0.8102
NOTE: The degrees of freedom for the t tests is 22.

To avoid the problem, fit the same model for Iowa alone by setting the response to missing for all Nebraska observations, as in the following DATA step.

      data Farms2;
        set Farms;
        if State='Nebraska' then Resp=.;
        run;
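
Before fitting the model, it can be worth checking that the DATA step changed only the response. The following sketch is an addition to the original example. It cross-tabulates State by Resp in Farms2, and the MISSING option makes PROC FREQ display the missing values of Resp as a level of their own. The title text is illustrative.

      proc freq data=Farms2;
        /* MISSING shows the missing Resp values as their own column */
        tables State*Resp / missing norow nocol nopercent;
        title 'Check: Nebraska responses set to missing';
        run;

All 12 Nebraska observations should fall in the missing column, while the 11 Iowa observations keep their original 0 and 1 responses. All 23 observations stay in the data set, so the non-domain observations are still available to the procedure.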

Adding the NOMCAR option to the PROC SURVEYLOGISTIC statement provides a proper domain analysis that adjusts the variance for the non-domain observations.

      proc surveylogistic data=Farms2 nomcar;
        class Region / param=glm;
        model Resp = Region FarmArea;
        weight Weight;
        title 'Domain Analysis results for Iowa only - NOMCAR option';
        run;

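Notice that the -2 Log L values are identical to those in the first SURVEYLOGISTIC analysis, which used the DOMAIN statement, and that the estimates for the remaining parameters are unchanged, although their standard errors differ slightly. AIC and SC are smaller because the model no longer includes a Region=2 parameter. The absence of a Region=2 estimate makes the resulting parameter estimates more consistent with the attributes of the domain.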

Domain Analysis results for Iowa only - NOMCAR option

The SURVEYLOGISTIC Procedure

Class Level Information
Class	Value	Design Variables
Region	1	1	0	0
	3	0	1	0
	4	0	0	1

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	225.848	219.169
SC	228.954	231.593
-2 Log L	223.848	211.169

Analysis of Maximum Likelihood Estimates
Parameter		Estimate	Standard Error	t Value	Pr > |t|
Intercept		-0.7987	1.8680	-0.43	0.6731
Region	1	1.1443	1.5597	0.73	0.4709
Region	3	0.9667	2.2063	0.44	0.6656
Region	4	0	.	.	.
FarmArea		0.00507	0.0203	0.25	0.8051
NOTE: The degrees of freedom for the t tests is 22.
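
The two analyses report the same -2 Log L but different AIC and SC values. The following DATA step is a sketch, not part of the original example, that reproduces the reported values under two assumptions. The first is that AIC equals -2 Log L plus twice the number of parameters counted. The second is that the SC penalty uses the natural log of the sum of the Iowa weights (about 165). The second assumption is inferred from the numbers in the output rather than taken from the procedure documentation. The variable names are illustrative.

      data _null_;
        neg2logl = 211.169;              /* -2 Log L reported by both analyses */
        sumw = 3*33.333 + 5*10 + 3*5;    /* sum of the 11 Iowa weights */
        do p = 5, 4;                     /* parameters counted with and without Region=2 */
          aic = neg2logl + 2*p;
          sc  = neg2logl + p*log(sumw);  /* LOG is the natural logarithm */
          put p= aic= 8.3 sc= 8.3;
        end;
        run;

With p=5 the step writes the AIC and SC from the DOMAIN analysis (221.169 and 236.699) to the log. With p=4 it writes the values from the NOMCAR analysis (219.169 and 231.593). Under these assumptions, the difference in the two criteria reflects only the extra Region=2 parameter.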

Full Code for this example:

data Farms;
input State $ Region FarmArea CornYield Weight resp;
datalines;
Iowa 1 100 54 33.333 0
Iowa 1 83 25 33.333 0
Iowa 1 25 10 33.333 1
Iowa 4 120 83 10.000 0
Iowa 4 50 35 10.000 0
Iowa 4 110 65 10.000 1
Iowa 4 60 35 10.000 1
Iowa 4 45 20 10.000 1
Iowa 3 23 5 5.000 0
Iowa 3 10 8 5.000 0
Iowa 3 350 125 5.000 1
Nebraska 1 130 20 5.000 0
Nebraska 1 245 25 5.000 0
Nebraska 1 150 33 5.000 0
Nebraska 1 263 50 5.000 1
Nebraska 1 320 47 5.000 1
Nebraska 1 204 25 5.000 1
Nebraska 2 80 11 10.000 0
Nebraska 2 48 8 10.000 1
Nebraska 3 180 13 10.000 0
Nebraska 3 148 28 10.000 1
Nebraska 4 180 13 10.000 0
Nebraska 4 128 48 13.000 1
;
run;

proc freq data=farms;
tables state*region;
title 'Domain Specific Observations for each region';
run;

proc surveylogistic data=Farms;
domain state;
class region/param=glm;
model resp = region FarmArea;
weight Weight;
title 'Domain Analysis results for Iowa Only using DOMAIN statement';
ods select Surveylogistic.Domain1.ClassLevelInfo Surveylogistic.Domain1.FitStatistics Surveylogistic.Domain1.ParameterEstimates;
run;

data farms2;
set farms;
if state='Nebraska' then resp=.;
run;

proc surveylogistic data=farms2 nomcar;
class region/param=glm;
title 'Domain Analysis results for Iowa Only using NOMCAR option';
model resp = region FarmArea;
weight Weight;
ods select ClassLevelInfo FitStatistics ParameterEstimates;
run;

PROC SURVEYLOGISTIC provides parameter estimates for levels of CLASS variables that do not exist in the domain