The complete data and code for this example appear in the Full Code section at the end.
When a model is built using sample data from several domains, the model design matrix is built using all levels of the CLASS variables found across all domains. In a domain analysis, the procedure does not remove columns from this design matrix the way a BY statement would, but instead modifies the weights associated with non-domain observations. As a result, parameters for CLASS variable levels that are not observed in the domain remain in the model because the design matrix still contains columns for them. Because of the dummy variables produced by the CLASS statement and the nonlinear operations involved in estimating the logistic model, the procedure can produce nonzero parameter estimates for levels that are unobserved in the domain. This is unlike ordinary linear regression, in which a zero weight in an unobserved level for a particular domain always results in zero estimates for those levels in the domain.
The presence of nonzero estimates for the unobserved levels can make interpretation of the results difficult. To ensure that unobserved CLASS levels have zero estimates, set the response for all the observations outside the domain of interest to missing. Specify the NOMCAR option in the SURVEYLOGISTIC statement and omit the DOMAIN statement. Because the NOMCAR option essentially treats missing values as a separate domain, columns are added in the design matrix only for levels observed in the nonmissing observations. The model fit will be unaffected, yielding the same log likelihood.
The following example illustrates this issue. Data were collected from farms in two separate domains (State): Iowa, with three regions, and Nebraska, with four regions. The following statements save the data in a data set named Farms.
data Farms;
input State $ Region FarmArea CornYield Weight Resp;
datalines;
Iowa 1 100 54 33.333 0
Iowa 1 83 25 33.333 0
Iowa 1 25 10 33.333 1
Iowa 4 120 83 10.000 0
Iowa 4 50 35 10.000 0
Iowa 4 110 65 10.000 1
Iowa 4 60 35 10.000 1
Iowa 4 45 20 10.000 1
Iowa 3 23 5 5.000 0
Iowa 3 10 8 5.000 0
Iowa 3 350 125 5.000 1
Nebraska 1 130 20 5.000 0
Nebraska 1 245 25 5.000 0
Nebraska 1 150 33 5.000 0
Nebraska 1 263 50 5.000 1
Nebraska 1 320 47 5.000 1
Nebraska 1 204 25 5.000 1
Nebraska 2 80 11 10.000 0
Nebraska 2 48 8 10.000 1
Nebraska 3 180 13 10.000 0
Nebraska 3 148 28 10.000 1
Nebraska 4 180 13 10.000 0
Nebraska 4 128 48 13.000 1
;
proc freq data=Farms;
tables State*Region;
title 'Domain Specific Observations for each Region';
run;
Note that Iowa has no observations in Region=2, whereas Nebraska has observations in all four regions.
[Output: The FREQ Procedure, table of State by Region]
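The empty cell can also be found programmatically rather than by reading the crosstabulation. The following is a minimal sketch, assuming the Farms data set created above; CellCounts is a hypothetical data set name introduced only for this illustration. The SPARSE option asks PROC FREQ to write zero-count combinations to the OUT= data set, which can then be filtered.

```sas
/* Write every State-by-Region cell, including empty ones, to a data set. */
/* CellCounts is a hypothetical name used only for this illustration.     */
proc freq data=Farms noprint;
tables State*Region / sparse out=CellCounts;
run;

/* List the CLASS levels that are unobserved within a domain. */
proc print data=CellCounts noobs;
where Count = 0;
var State Region Count;
title 'State-by-Region cells with no observations';
run;
```

With the data shown above, the only cell with a zero count should be State=Iowa, Region=2.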
The following statements perform a domain analysis that fits a separate model for each state. In this case, however, only the results for Iowa are of interest to the investigator.
proc surveylogistic data=Farms;
domain State;
class Region / param=glm;
model Resp = Region FarmArea;
weight Weight;
title 'Domain Analysis results for Iowa only - DOMAIN statement';
run;
Although Iowa has no observations in Region=2, the domain-specific estimates for Iowa include an estimate for Region=2, which makes the individual estimates difficult to interpret.
[Output: The SURVEYLOGISTIC Procedure, Domain Analysis for domain State=Iowa]
To avoid the problem, fit the same model for Iowa alone by setting the response to missing for all Nebraska observations, as in the following DATA step.
data Farms2;
set Farms;
if State='Nebraska' then Resp=.;
run;
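Before fitting the model, it can be worth confirming that the response is missing only for the non-domain observations. The following is a small sketch of such a check, assuming the Farms2 data set created by the preceding DATA step.

```sas
/* Count nonmissing (N) and missing (NMISS) values of Resp in each state. */
proc means data=Farms2 n nmiss;
class State;
var Resp;
title 'Nonmissing and missing Resp values by State in Farms2';
run;
```

With the data shown above, Iowa should have 11 nonmissing responses and no missing responses, and Nebraska should have no nonmissing responses and 12 missing responses.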
Adding the NOMCAR option in the SURVEYLOGISTIC statement provides a proper domain analysis that adjusts the variance for non-domain observations.
proc surveylogistic data=Farms2 nomcar;
class Region / param=glm;
model Resp = Region FarmArea;
weight Weight;
title 'Domain Analysis results for Iowa only - NOMCAR option';
run;
Notice that the fit statistics are identical to those in the first SURVEYLOGISTIC analysis that used the DOMAIN statement, but the absence of a Region=2 estimate makes the resulting parameter estimates more consistent with the attributes of the domain.
[Output: The SURVEYLOGISTIC Procedure]
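One way to check that the two approaches give the same fit statistics is to capture the FitStatistics table from each run with ODS OUTPUT and print the two data sets. The following is a sketch only. It assumes that the Iowa domain is the first domain, so that its table has the ODS path Surveylogistic.Domain1.FitStatistics, as the ODS SELECT statement in the complete code at the end suggests. FitDomain and FitNomcar are hypothetical data set names.

```sas
/* Domain analysis: save the fit statistics for the first domain.     */
/* Assumption: Domain1 corresponds to State=Iowa.                     */
ods output Surveylogistic.Domain1.FitStatistics=FitDomain;
proc surveylogistic data=Farms;
domain State;
class Region / param=glm;
model Resp = Region FarmArea;
weight Weight;
run;

/* NOMCAR analysis: save the fit statistics from the single model. */
ods output FitStatistics=FitNomcar;
proc surveylogistic data=Farms2 nomcar;
class Region / param=glm;
model Resp = Region FarmArea;
weight Weight;
run;

/* Print both tables for a visual comparison. */
proc print data=FitDomain noobs;
title 'Fit statistics: DOMAIN statement (Iowa)';
run;

proc print data=FitNomcar noobs;
title 'Fit statistics: NOMCAR option';
run;
```

If the path assumption does not hold in a given session, the ODS TRACE ON statement can be used to display the actual table names and paths.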
Full Code
data Farms;
input State $ Region FarmArea CornYield Weight resp;
datalines;
Iowa 1 100 54 33.333 0
Iowa 1 83 25 33.333 0
Iowa 1 25 10 33.333 1
Iowa 4 120 83 10.000 0
Iowa 4 50 35 10.000 0
Iowa 4 110 65 10.000 1
Iowa 4 60 35 10.000 1
Iowa 4 45 20 10.000 1
Iowa 3 23 5 5.000 0
Iowa 3 10 8 5.000 0
Iowa 3 350 125 5.000 1
Nebraska 1 130 20 5.000 0
Nebraska 1 245 25 5.000 0
Nebraska 1 150 33 5.000 0
Nebraska 1 263 50 5.000 1
Nebraska 1 320 47 5.000 1
Nebraska 1 204 25 5.000 1
Nebraska 2 80 11 10.000 0
Nebraska 2 48 8 10.000 1
Nebraska 3 180 13 10.000 0
Nebraska 3 148 28 10.000 1
Nebraska 4 180 13 10.000 0
Nebraska 4 128 48 13.000 1
;
run;
proc freq data=farms;
tables state*region;
title 'Domain Specific Observations for each region';
run;
proc surveylogistic data=Farms;
domain state;
class region/param=glm;
model resp = region FarmArea;
weight Weight;
title 'Domain Analysis results for Iowa Only using DOMAIN statement';
ods select Surveylogistic.Domain1.ClassLevelInfo Surveylogistic.Domain1.FitStatistics Surveylogistic.Domain1.ParameterEstimates;
run;
data farms2;
set farms;
if state='Nebraska' then resp=.;
run;
proc surveylogistic data=farms2 nomcar;
class region/param=glm;
title 'Domain Analysis results for Iowa Only using NOMCAR option';
model resp = region FarmArea;
weight Weight;
ods select ClassLevelInfo FitStatistics ParameterEstimates;
run;