Apache SkyWalking 修復TTL timer 失效bug詳解

2022-09-26 14:05:56

正文

近期，Apache SkyWalking 修復了一個隱藏了近4年的Bug - TTL timer 可能失效問題，這個 bug 在 SkyWalking <=9.2.0 版本中存在。關於這個 bug 的詳細資訊可以看郵寄清單 lists.apache.org/thread/ztp4… 具體如下

首先說下這個 Bug 導致的現象：

過期的索引不能被刪除，所有的OAP節點都出現類似紀錄檔 The selected first getAddress is xxx.xxx.xx.xx:port. The remove stage is skipped.
對於以 no-init 模式啟動的 OAP 節點，重啟的時候會一直列印類似紀錄檔 table:xxx does not exist. OAP is running in 'no-init' mode, waiting... retry 3s later.

如果 SkyWalking OAP 出現上面的兩個問題，很可能就是這個 Bug 導致的。

下面我們先了解一下 SkyWalking OAP 叢集方面的設計

SkyWalking OAP 角色

SkyWalking OAP 可選的角色有 Mixed、Receiver、Aggregator

Mixed 角色主要負責接收資料、L1聚合和L2聚合；
Receiver 角色負責接收資料和L1聚合；
Aggregator 角色負責L2聚合。

預設角色是 Mixed，可以通過修改 application.yml 進行設定

core:
  selector: ${SW_CORE:default}
  default:
    # Mixed: Receive agent data, Level 1 aggregate, Level 2 aggregate
    # Receiver: Receive agent data, Level 1 aggregate
    # Aggregator: Level 2 aggregate
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
# 省略部分設定...

L1聚合：為了減少記憶體及網路負載，對於接收到的 metrics 資料進行當前 OAP 節點內的聚合，具體實現參考 MetricsAggregateWorker#onWork() 方法的實現；

L2聚合：又稱分散式聚合，OAP 節點將L1聚合後的資料，根據一定的路由規則，傳送給叢集中的其他OAP節點，進行二次聚合，併入庫。具體實現見 MetricsPersistentWorker 類。

SkyWalking OAP 叢集

OAP 支援叢集部署，目前支援的註冊中心有

zookeeper
kubernetes
consul
etcd
nacos

預設是 standalone，可以通過修改 application.yml 進行設定

cluster:
  selector: ${SW_CLUSTER:standalone}
  standalone:
  # Please check your ZooKeeper is 3.5+, However, it is also compatible with ZooKeeper 3.4.x. Replace the ZooKeeper 3.5+
  # library the oap-libs folder with your ZooKeeper 3.4.x library.
  zookeeper:
    namespace: ${SW_NAMESPACE:""}
    hostPort: ${SW_CLUSTER_ZK_HOST_PORT:localhost:2181}
    # Retry Policy
    baseSleepTimeMs: ${SW_CLUSTER_ZK_SLEEP_TIME:1000} # initial amount of time to wait between retries
    maxRetries: ${SW_CLUSTER_ZK_MAX_RETRIES:3} # max number of times to retry
    # Enable ACL
    enableACL: ${SW_ZK_ENABLE_ACL:false} # disable ACL in default
    schema: ${SW_ZK_SCHEMA:digest} # only support digest schema
    expression: ${SW_ZK_EXPRESSION:skywalking:skywalking}
    internalComHost: ${SW_CLUSTER_INTERNAL_COM_HOST:""}
    internalComPort: ${SW_CLUSTER_INTERNAL_COM_PORT:-1}
  kubernetes:
    namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
  # 省略部分設定...

OAP 啟動的時候，如果當前角色是 Mixed 或 Aggregator，則會將自己註冊到叢集註冊中心，standalone 模式下也有一個記憶體級叢集管理器，參見 StandaloneManager 類的實現。

Data TTL timer 設定

application.yml 中的設定

core:
  selector: ${SW_CORE:default}
  default:
    # Mixed: Receive agent data, Level 1 aggregate, Level 2 aggregate
    # Receiver: Receive agent data, Level 1 aggregate
    # Aggregator: Level 2 aggregate
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
    # 省略部分設定...
    # Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Turn it off then automatically metrics data delete will be close.
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
    # 省略部分設定...

enableDataKeeperExecutor 自動刪除過去資料的執行器開關，預設是開啟的；
dataKeeperExecutePeriod 執行週期，預設5分鐘；
recordDataTTL record 資料的 TTL(Time To Live)，單位：天；
metricsDataTTL metrics 資料的 TTL，單位：天。

DataTTLKeeperTimer 定時任務

DataTTLKeeperTimer 負責刪除過期的資料，SkyWalking OAP 在啟動的時候會根據 enableDataKeeperExecutor 設定決定是否開啟 DataTTLKeeperTimer，也就是是否執行 DataTTLKeeperTimer#start() 方法。 DataTTLKeeperTimer#start() 方法的執行邏輯主要是通過 JDK 內建的 Executors.newSingleThreadScheduledExecutor() 建立一個單執行緒的定時任務，執行 DataTTLKeeperTimer#delete() 方法刪除過期的資料，執行週期是dataKeeperExecutePeriod 設定值，預設5分鐘執行一次。

Bug 產生的原因

DataTTLKeeperTimer#start() 方法會在所有 OAP 節點啟動一個定時任務，那如果所有節點都去執行資料刪除操作可能會有問題，那麼如何保證只有一個節點執行呢？

如果讓我們設計的話，可能會引入一個分散式任務排程框架或者實現分散式鎖，這樣的話 SkyWalking 就要強依賴某個中介軟體了，SkyWalking 可能是考慮到了這些也沒有選擇這麼實現。

那我們看下 SkyWalking 是如何解決這個問題的呢，我們前面提到 OAP 在啟動的時候，如果當前角色是 Mixed 或 Aggregator，則會將自己註冊到叢集註冊中心，SkyWalking OAP 呼叫 clusterNodesQuery#queryRemoteNodes() 方法，從註冊中心獲取這些節點的註冊資訊（host:port）集合，然後判斷集合中的第一個節點是否就是當前節點，如果是那麼當前節點執行過期資料刪除操作，如下圖所示

節點A和節點集合中的第一個元素相等，則節點A負責執行過期資料刪除操作。

這就要求 queryRemoteNodes 返回的節點集合是有序的，為什麼這麼說呢，試想一下，如果每個 OAP 節點呼叫 queryRemoteNodes 方法返回的註冊資訊順序不一致的話，就可能出現所有節點都不和集合中的第一個節點相等，這種情況下就沒有 OAP 節點能執行過期資料刪除操作了，而 queryRemoteNodes 方法恰恰無法保證返回的註冊資訊順序一致。

解決 Bug

我們既然知道了 bug 產生的原因，解決起來就比較簡單了，只需要對獲取到的節點集合呼叫 Collections.sort() 方法對 RemoteInstance（實現了java.lang.Comparable 介面）做排序，保證所有OAP節點做比較時都是一致的順序，程式碼如下

相關程式碼如下：

/**
 * TTL = Time To Live
 *
 * DataTTLKeeperTimer is an internal timer, it drives the {@link IHistoryDeleteDAO} to remove the expired data. TTL
 * configurations are provided in {@link CoreModuleConfig}, some storage implementations, such as ES6/ES7, provides an
 * override TTL, which could be more suitable for the implementation. No matter which TTL configurations are set, they
 * are all driven by this timer.
 */
@Slf4j
public enum DataTTLKeeperTimer {
    INSTANCE;
    private ModuleManager moduleManager;
    private ClusterNodesQuery clusterNodesQuery;
    private CoreModuleConfig moduleConfig;
    public void start(ModuleManager moduleManager, CoreModuleConfig moduleConfig) {
        this.moduleManager = moduleManager;
        this.clusterNodesQuery = moduleManager.find(ClusterModule.NAME).provider().getService(ClusterNodesQuery.class);
        this.moduleConfig = moduleConfig;
        // 建立定時任務
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(
                     new RunnableWithExceptionProtection(
                         this::delete, // 刪除過期的資料
                         t -> log.error("Remove data in background failure.", t)
                     ), moduleConfig
                         .getDataKeeperExecutePeriod(), moduleConfig.getDataKeeperExecutePeriod(), TimeUnit.MINUTES);
    }
    /**
     * DataTTLKeeperTimer starts in every OAP node, but the deletion only work when it is as the first node in the OAP
     * node list from {@link ClusterNodesQuery}.
     */
    private void delete() {
        IModelManager modelGetter = moduleManager.find(CoreModule.NAME).provider().getService(IModelManager.class);
        List<Model> models = modelGetter.allModels();
        try {
            // 查詢服務節點
            List<RemoteInstance> remoteInstances = clusterNodesQuery.queryRemoteNodes();
            if (CollectionUtils.isNotEmpty(remoteInstances) && !remoteInstances.get(0).getAddress().isSelf()) {
                log.info(
                    "The selected first getAddress is {}. The remove stage is skipped.",
                    remoteInstances.get(0).toString()
                );
                return;
            }
            // 返回的第一個節點是自己，則執行刪除操作
            log.info("Beginning to remove expired metrics from the storage.");
            models.forEach(this::execute);
        } finally {
            log.info("Beginning to inspect data boundaries.");
            this.inspect(models);
        }
    }
    private void execute(Model model) {
        try {
            if (!model.isTimeSeries()) {
                return;
            }
            if (log.isDebugEnabled()) {
                log.debug(
                    "Is record? {}. RecordDataTTL {}, MetricsDataTTL {}",
                    model.isRecord(),
                    moduleConfig.getRecordDataTTL(),
                    moduleConfig.getMetricsDataTTL());
            }
            // 獲取 IHistoryDeleteDAO 介面的具體實現
            moduleManager.find(StorageModule.NAME)
                         .provider()
                         .getService(IHistoryDeleteDAO.class)
                         .deleteHistory(model, Metrics.TIME_BUCKET,
                                        model.isRecord() ? moduleConfig.getRecordDataTTL() : moduleConfig.getMetricsDataTTL()
                         );
        } catch (IOException e) {
            log.warn("History of {} delete failure", model.getName());
            log.error(e.getMessage(), e);
        }
    }
    private void inspect(List<Model> models) {
        try {
            moduleManager.find(StorageModule.NAME)
                         .provider()
                         .getService(IHistoryDeleteDAO.class)
                         .inspect(models, Metrics.TIME_BUCKET);
        } catch (IOException e) {
            log.error(e.getMessage(), e);
        }
    }
}

更多技術細節大家可以參考下面的連結

相關連結

以上就是Apache SkyWalking 修復TTL timer 失效bug詳解的詳細內容，更多關於Apache SkyWalking 修復bug的資料請關注it145.com其它相關文章！