Addressing endogeneity in models involving limited and qualitative endogenous variables using the QLIM procedure


You can use the SAS/ETS® procedure QLIM to estimate simultaneous equations models that involve discrete or limited dependent variables, endogenous explanatory variables, or both. In linear models with continuous dependent variables and endogenous regressors, two-stage least squares and three-stage least squares are commonly used to address endogeneity. However, when the dependent variable or an endogenous regressor is discrete or limited, these traditional methods are usually not appropriate. In these cases, you can account for endogeneity by specifying the structural model together with the reduced form models for the endogenous explanatory variables and using PROC QLIM to maximize the joint likelihood of the dependent variable and the endogenous variables.

When there is only one endogenous explanatory variable, PROC QLIM reports the full information maximum likelihood (FIML) estimates, using the analytical likelihood function from the joint normal distribution of the dependent variable and the endogenous variable. When there is more than one endogenous explanatory variable, the analytical form of the likelihood function is usually not available; in this case, PROC QLIM reports simulated maximum likelihood estimates.
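For reference, the single-endogenous-regressor case described above can be written in generic notation. The symbols used here (delta, beta, gamma, epsilon, rho) are illustrative assumptions and are not taken from the PROC QLIM documentation; this is only a sketch of the setup with jointly normal errors.

```latex
\begin{aligned}
y_1^{*} &= \delta\, y_2 + x'\beta + \varepsilon_1
  && \text{(structural equation)} \\
y_2^{*} &= x'\gamma_1 + z'\gamma_2 + \varepsilon_2
  && \text{(reduced form equation)} \\
(\varepsilon_1, \varepsilon_2) &\sim \text{bivariate normal},
  \qquad \operatorname{Corr}(\varepsilon_1, \varepsilon_2) = \rho
\end{aligned}
```

In this notation, y1 and y2 are the observed (possibly discrete or limited) counterparts of the latent variables, x holds the exogenous regressors, and z holds the instruments that are excluded from the structural equation. The variable y2 is exogenous in the structural equation when rho is zero.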

Below are two examples that use PROC QLIM to fit models involving discrete and limited dependent and endogenous variables and to perform endogeneity and overidentification tests. For more details and additional examples with different models, see "Endogeneity and Instrumental Variables" in the Details section of the PROC QLIM documentation.

Example 1. Endogenous Dummy Variable Model

The following DATA step generates the data that is used in the analysis.

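data a;
   do i = 1 to 100;
      /* exogenous regressors (x1, x2) and instruments (z3, z4) */
      x1 = normal(1);
      x2 = normal(1);
      z3 = normal(1);
      z4 = normal(1);
      /* standard normal error components */
      u1 = rannor(235);
      u2 = rannor(2352);
      /* latent index and observed binary endogenous regressor y2 */
      y2star = 0.5 + 0.8*x1 + 1.2*x2 + 0.9*z3 + 0.5*z4 + u2;
      if (y2star > 0) then y2 = 1;
      else y2 = 0;
      /* continuous outcome; the shared term 0.6*u2 makes y2 endogenous */
      y1 = 3 + 3*y2 + 3*x1 + 2*x2 + 0.6*u2 + u1;
      output;
   end;
run;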

The following PROC QLIM step specifies the model. The two MODEL statements specify the structural equation for y1 and the reduced form equation for the endogenous variable y2. The DISCRETE option specifies that the endogenous variable y2 is discrete. PROC QLIM uses the bivariate normal log likelihood to estimate the parameters of this model.

proc qlim data = a;
  model y1 = y2 x1 x2;
  model y2 = x1 x2 z3 z4 / discrete;
  run;

The parameter estimates of the model are shown below. All parameters are significant at the 5% level. The _Rho parameter in the output is the correlation coefficient between the error terms of the structural equation for y1 and the reduced form equation for the binary endogenous variable y2. The estimate of _Rho is 0.767 and is significant (p<.0001). This is evidence that y2 is endogenous, which the subsequent endogeneity tests confirm.

parameter estimates table
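As a point of comparison for the _Rho estimate, the error correlation implied by the DATA step can be worked out by hand. This is a sketch that assumes u1 and u2 are independent standard normal draws, as the RANNOR calls suggest; the epsilon notation is introduced here for illustration only.

```latex
\begin{aligned}
\varepsilon_1 &= 0.6\,u_2 + u_1, \qquad \varepsilon_2 = u_2 \\
\operatorname{Cov}(\varepsilon_1, \varepsilon_2) &= 0.6, \qquad
\operatorname{Var}(\varepsilon_1) = 0.6^2 + 1 = 1.36, \qquad
\operatorname{Var}(\varepsilon_2) = 1 \\
\rho &= \frac{0.6}{\sqrt{1.36}} \approx 0.514
\end{aligned}
```

The reported estimate of 0.767 is based on only 100 observations, so some distance from this population value is not surprising.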

The following steps perform two endogeneity tests for binary variable y2. The LR option in the TEST statement requests the likelihood ratio test that the correlation between y1 and y2 is zero. The ENDOTEST option requests an endogeneity test of variable y2, specified in parentheses.

proc qlim data = a;
  model y1 = y2 x1 x2;
  model y2 = x1 x2 z3 z4 / discrete;
  test _rho = 0 / lr;
  run;
proc qlim data = a;
  model y1 = y2 x1 x2 / endotest(y2);
  model y2 = x1 x2 z3 z4 / discrete;
  run;
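The likelihood ratio test requested by the LR option in the TEST statement can be summarized with the usual textbook statistic. The notation below is an assumption made for illustration and is not quoted from the PROC QLIM documentation.

```latex
LR = 2\left[\ln L(\hat{\theta}) - \ln L(\tilde{\theta}_{\rho = 0})\right]
\;\xrightarrow{d}\; \chi^2(1) \quad \text{under } H_0\colon \rho = 0
```

Here the first term is the maximized log likelihood of the joint model and the second is the maximized log likelihood with the correlation restricted to zero. A large value of the statistic is evidence against the exogeneity of y2.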

The results from the two endogeneity tests are shown below. The likelihood ratio test rejects the null hypothesis that y2 is exogenous in the model for y1 (p=0.0006). The Wald test produced by the ENDOTEST option also rejects this null hypothesis (p=0.0156).

test results tables
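To see what ignoring the endogeneity of y2 would look like, you could fit the y1 equation by itself with ordinary least squares. The step below is a comparison sketch only and is not part of the original analysis; it assumes that SAS/STAT is available for PROC REG.

```sas
/* Comparison only: single-equation least squares that treats y2 as exogenous. */
proc reg data = a;
   model y1 = y2 x1 x2;
run;
quit;
```

In the DATA step, the coefficient of y2 is 3. Because the error in the y1 equation shares the component u2 with the equation that determines y2, the least squares coefficient of y2 would be expected to be biased (upward here, because the shared component enters both equations with a positive sign). Comparing it with the PROC QLIM estimate gives a rough sense of the size of the effect.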

The following step specifies the OVERID option to perform the overidentification test for validity of instruments z3 and z4.

proc qlim data = a;
  model y1 = y2 x1 x2 / overid(y2.z4);
  model y2 = x1 x2 z3 z4 / discrete;
  run;

In this case, PROC QLIM estimates the structural model for y1, with the overidentifying instrumental variable z4 included as an additional explanatory variable, jointly with the reduced form model for y2. It then uses a likelihood ratio test of the hypothesis that the coefficient of the overidentifying instrumental variable is zero. The following results show that the OVERID likelihood ratio test statistic, 0.06, is not significant (p=0.8093), so the validity of the instruments z3 and z4 is not rejected.

overid test results table
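The p-value of a likelihood ratio statistic such as the one above can be checked by hand with the chi-square survival function. The following DATA step is a minimal sketch. It assumes one degree of freedom (one tested instrument, z4) and uses the rounded statistic 0.06 as input; the variable names lr, df, and p are chosen here for illustration.

```sas
/* Sketch: p-value for a likelihood ratio statistic with 1 degree of freedom. */
data _null_;
   lr = 0.06;                      /* rounded statistic from the OVERID output   */
   df = 1;                         /* assumed: one tested instrument (z4)        */
   p  = sdf('chisquare', lr, df);  /* upper-tail chi-square probability          */
   put lr= df= p=;                 /* write the values to the SAS log            */
run;
```

Because the input is rounded to two decimal places, the computed value (about 0.81) is close to, but not identical to, the reported p=0.8093.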

Example 2. Probit Model with a Binary Endogenous Explanatory Variable

The following DATA step generates the data that is used in the analysis.

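data a;
   keep y1 y2 x1 x2 z3 z4;
   do i = 1 to 500;
      /* exogenous regressors (x1, x2) and instruments (z3, z4) */
      x1 = rannor( 19283 );
      x2 = rannor( 19283 );
      z3 = rannor( 19283 );
      z4 = rannor( 19283 );
      /* standard normal error components */
      u1 = rannor( 19283 );
      u2 = rannor( 19283 );
      /* latent index and observed binary endogenous regressor y2 */
      y2l = 0.5 + 0.8*x1 + 1.2*x2 + 0.9*z3 + 0.6*z4 + u2;
      if ( y2l > 0 ) then y2 = 1;
      else y2 = 0;
      /* latent index and observed binary outcome y1; u2 is shared with y2 */
      y1l = 2 + 3*y2 + 3 * x1 + 2 * x2 + u2 + u1;
      if ( y1l > 0 ) then y1 = 1;
      else y1 = 0;
      output;
   end;
run;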

The following statements specify the bivariate probit model. The two MODEL statements specify the structural equation for y1 and the reduced form equation for the endogenous variable y2. The DISCRETE option in the ENDOGENOUS statement specifies that both endogenous variables, y1 and y2, are discrete variables that follow probit (normal) distributions. PROC QLIM fits the bivariate probit model for the structural equation and the reduced form equation.

proc qlim data=a;
   model y1 = y2 x1 x2;
   model y2 = x1 x2 z3 z4;
   endogenous y1 y2 ~ discrete;
   run;
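The joint probability that underlies a bivariate probit model of this kind can be written in a standard textbook form. The notation below (w, q, delta, beta, gamma, rho) is an assumption made for illustration and is not quoted from the PROC QLIM documentation.

```latex
\begin{aligned}
w_1 &= \delta\, y_2 + x'\beta, \qquad w_2 = x'\gamma_1 + z'\gamma_2, \qquad
q_j = 2 y_j - 1 \quad (j = 1, 2) \\
\Pr(y_1, y_2 \mid x, z) &= \Phi_2\!\left(q_1 w_1,\; q_2 w_2;\; q_1 q_2 \rho\right) \\
\ln L &= \sum_{i=1}^{n} \ln \Phi_2\!\left(q_{1i} w_{1i},\; q_{2i} w_{2i};\; q_{1i} q_{2i} \rho\right)
\end{aligned}
```

Here Phi_2 denotes the standard bivariate normal distribution function with correlation given by its third argument. When rho is zero, the joint probability factors into two separate probit probabilities, which corresponds to y2 being exogenous in the equation for y1.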

The parameter estimates of the model are shown below. All parameters are significant at the 5% significance level. The estimate of _Rho, the correlation coefficient between the error terms of the structural equation for y1 and the reduced form equation for y2, is 0.77 and is significant (p<.0001). This indicates that y2 is endogenous, which the subsequent endogeneity tests confirm.

parameter estimates table
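The error correlation implied by the DATA step for this example can also be computed by hand. This sketch assumes that u1 and u2 are independent standard normal draws; the epsilon notation is introduced here for illustration only.

```latex
\begin{aligned}
\varepsilon_1 &= u_2 + u_1, \qquad \varepsilon_2 = u_2 \\
\operatorname{Cov}(\varepsilon_1, \varepsilon_2) &= 1, \qquad
\operatorname{Var}(\varepsilon_1) = 2, \qquad
\operatorname{Var}(\varepsilon_2) = 1 \\
\rho &= \frac{1}{\sqrt{2}} \approx 0.707
\end{aligned}
```

The reported _Rho estimate of 0.77 is close to this value. Note also that a probit model identifies coefficients only up to the scale of the error term. Because the error in the latent y1 equation has variance 2, the structural coefficients that the probit equation estimates correspond to the DATA step values divided by the square root of 2, while the correlation itself is unaffected by this rescaling.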

The following steps perform two endogeneity tests for binary variable y2. The LR option in the TEST statement requests the likelihood ratio test that the correlation between y1 and y2 is zero. The ENDOTEST option tests the endogenous variable, y2, specified in parentheses. The null hypothesis for both tests is that the endogenous variable y2 is exogenous. Rejection of the null hypothesis indicates that y2 is endogenous.

proc qlim data=a;
   model y1 = y2 x1 x2;
   model y2 = x1 x2 z3 z4;
   endogenous y1 y2 ~ discrete;
   test _rho = 0 / lr;
   run;
proc qlim data=a;
   model y1 = y2 x1 x2 / endotest(y2);
   model y2 = x1 x2 z3 z4;
   endogenous y1 y2 ~ discrete;
   run;

Results of the two endogeneity tests are shown below. The likelihood ratio test rejects the null hypothesis that the correlation between the equations for y1 and y2 is zero (p=0.0093). This indicates that y2 is endogenous in the model for y1. The Wald test produced by the ENDOTEST option also rejects the null hypothesis that y2 is exogenous in the model for y1 (p=0.037).

test results tables

The following step performs the overidentification test for validity of instruments z3 and z4.

proc qlim data=a;
   model y1 = y2 x1 x2 / overid(y2.z4);
   model y2 = x1 x2 z3 z4;
   endogenous y1 y2 ~ discrete;
   run;

The resulting OVERID likelihood ratio statistic, 0.05, is not significant (p=0.8187), so the validity of the instruments z3 and z4 is not rejected.

overid test results table