Resolve the CAS error "A block of records on a failed node could not be activated among the remaining worker nodes"


When you interact with tables loaded to the SAS® Cloud Analytic Services (CAS) server, the following errors might occur:

The table is in an unsatisfactory state and a reload of the table is recommended

A block of records on a failed node could not be activated among the remaining worker nodes

These messages mean that one or more CAS worker nodes have failed and the redundant copies on the remaining CAS worker nodes do not hold enough table data to provide the full data set. The number of worker nodes that can be lost is equal to the number of table copies. Therefore, if the table was loaded to CAS with the default value of one copy (when you use the table.loadTable CAS action or PROC CASUTIL LOAD), the error occurs when two worker nodes have been lost. If the table was loaded with zero copies, losing one worker node triggers this problem.
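
The rule above can be expressed as simple arithmetic. The following shell sketch is illustrative only: the function name table_survives is hypothetical and is not part of any SAS tooling, and it encodes nothing more than the rule as stated in this note.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a SAS tool). Per the rule in this note, a table
# remains available while the number of failed worker nodes does not exceed
# the number of redundant copies it was loaded with.
table_survives() {
  local copies="$1" failed="$2"
  if [ "$failed" -le "$copies" ]; then
    echo "available"
  else
    echo "reload required"
  fi
}
```

For example, table_survives 1 1 reports "available" (default of one copy, one worker lost), while table_survives 1 2 and table_survives 0 1 both report "reload required".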

Circumvention

To circumvent the issue, an admin should check whether all the expected CAS workers are connected to the CAS server:

  1. In SAS® Environment Manager, open Servers ► cas-shared-default ► Configuration ► Nodes tab.
  2. On this tab, all the expected workers should appear with green connected check marks. If any workers are not connected, use either of the options below to add them to the CAS server:

Option 1

Follow the documentation at Manage CAS Server Nodes in the SAS® Viya® 3.5 Administration Guide. In step 6, use "Add a worker node."

Option 2

Restart the CAS server, which results in a brief CAS outage and means that you must reload all tables afterward. If all the CAS tables are in a broken state, this might be the most efficient resolution.


To restart CAS, run the following on the CAS controller host:

sudo systemctl stop sas-viya-cascontroller-default
sudo systemctl start sas-viya-cascontroller-default

Using systemctl stop followed by systemctl start is recommended over systemctl restart, because file-locking issues can occur on restart if the CAS server is not given enough time to terminate.
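
As one possible way to make sure the server has fully stopped before it is started again, the stop and start commands can be wrapped in a small script that waits until systemd reports the unit as inactive. This is a minimal sketch, not a SAS-supplied script: the function name restart_cas_controller, the SYSTEMCTL variable, and the 60-second default timeout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: stop the CAS controller, wait until systemd reports the unit
# as inactive (or failed), then start it. The SYSTEMCTL variable, the
# function name, and the default timeout are illustrative assumptions.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"
CAS_UNIT="sas-viya-cascontroller-default"

restart_cas_controller() {
  local timeout="${1:-60}" waited=0 state
  $SYSTEMCTL stop "$CAS_UNIT" || return 1
  while :; do
    # is-active prints the unit state; it exits non-zero when not active,
    # so "|| true" keeps the assignment from aborting under "set -e".
    state="$($SYSTEMCTL is-active "$CAS_UNIT" || true)"
    case "$state" in
      inactive|failed) break ;;
    esac
    if [ "$waited" -ge "$timeout" ]; then
      echo "CAS controller still '$state' after ${timeout}s; not starting" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  $SYSTEMCTL start "$CAS_UNIT"
}
```

After the file is sourced on the CAS controller host, calling restart_cas_controller 120 would allow up to two minutes for the server to terminate before it is started again.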

  1. After you use either of these options, you can check to see whether the workers are connected by navigating to SAS Environment Manager's Servers ► cas-shared-default ► Configuration ► Nodes tab.
  2. If the workers are not connected, review the recent logs from the CAS controller and the failing CAS worker nodes, found in this location: /opt/sas/viya/config/var/log/cas/default
  3. Once all CAS workers are connected, reload the table. If you restarted the CAS server so that all tables are now unloaded, simply load the table. If you added only the CAS worker nodes so the server did not completely restart, first unload the broken table, and then reload it via your preferred method.
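
For the log review in step 2 above, a quick scan of recently modified log files can help narrow down where to look. The sketch below is an illustration under stated assumptions: the function name recent_cas_errors is hypothetical, the search string ERROR is an assumption about how error lines are marked, and no particular log file naming is assumed. Reading the logs might require elevated privileges.

```shell
#!/usr/bin/env bash
# Sketch only: print lines containing "ERROR" from files under the CAS log
# directory that were modified within the last N minutes. The function name
# and the "ERROR" search string are illustrative assumptions.
CAS_LOG_DIR="${CAS_LOG_DIR:-/opt/sas/viya/config/var/log/cas/default}"

recent_cas_errors() {
  local dir="${1:-$CAS_LOG_DIR}" minutes="${2:-60}"
  # grep -H prefixes each match with its file name; "|| true" keeps the
  # function from failing when no file contains a match.
  find "$dir" -type f -mmin "-$minutes" -exec grep -H "ERROR" {} + || true
}
```

Run recent_cas_errors with no arguments on the CAS controller and on each failing worker node to list matches from the last hour, or pass a directory and a number of minutes to widen the search.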