Kubernetes High Availability (HA)

4 min read · Mar 15, 2020


A cluster consists of a master node and worker nodes. The master node runs the control plane, which manages the worker nodes in the cluster. The worker nodes run the applications that the master node assigns to them.

If a cluster has only one master node and that node fails, objects such as pods and services can no longer be created in the cluster.

HA solves this problem by running multiple master nodes in the cluster. The official documentation lists two topologies.

1. Stacked etcd topology

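The datastore, etcd, is co-located with the master node. The main advantage is that it saves machines. The drawback is that when a master node fails, its etcd member and its control plane instance are lost together.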

2. External etcd topology

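The datastore, etcd, runs on hosts separate from the master nodes. The advantage is that a failure of either a control plane instance or an etcd member does not affect the other. The drawback is that it needs more machines.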
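The trade-off between the two topologies can be sketched with a small model. This is only an illustration: the `Topology` class, its method names, and the choice of three control plane nodes and three etcd hosts are assumptions made for the example, not values taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical model of an HA topology (illustration only)."""
    name: str
    control_plane_nodes: int
    dedicated_etcd_nodes: int  # 0 means etcd is stacked on the control plane nodes

    def machines_needed(self) -> int:
        # Total hosts required for the control plane plus the datastore.
        return self.control_plane_nodes + self.dedicated_etcd_nodes

    def etcd_members(self) -> int:
        # Stacked: one etcd member per control plane node.
        return self.dedicated_etcd_nodes or self.control_plane_nodes

    def etcd_members_lost_with_node(self) -> int:
        # etcd members that go down when one control plane node fails.
        return 1 if self.dedicated_etcd_nodes == 0 else 0


# Assumed sizes: 3 control plane nodes, and 3 separate etcd hosts when external.
stacked = Topology("stacked etcd", control_plane_nodes=3, dedicated_etcd_nodes=0)
external = Topology("external etcd", control_plane_nodes=3, dedicated_etcd_nodes=3)

for t in (stacked, external):
    print(
        f"{t.name}: {t.machines_needed()} machines, "
        f"{t.etcd_members()} etcd members, "
        f"{t.etcd_members_lost_with_node()} etcd member(s) lost per failed control plane node"
    )
```

With these assumed sizes, the stacked layout needs half as many machines, but every control plane node failure also removes an etcd member.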

The architectures above raise two questions:

1. Does every node run the control plane functions?

According to the Kubernetes document "Set up High-Availability Kubernetes Masters":

etcd instance: all instances will be clustered together using consensus.

API server: each server will talk to local etcd - all API servers in the cluster will be available.

controllers, scheduler, and cluster auto-scaler: will use lease mechanism - only one instance of each of them will be active in the cluster.

add-on manager: each manager will work independently trying to keep add-ons in sync.
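The lease mechanism mentioned for the controllers and scheduler can be pictured with a toy, in-memory simulation. This is a sketch under stated assumptions, not the Kubernetes implementation: the `Lease` class, the 15-second duration, and the names `scheduler-a` and `scheduler-b` are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class Lease:
    """In-memory stand-in for a shared lease record (hypothetical, not the Kubernetes API)."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.holder = None      # identity of the current active instance, if any
        self.expires_at = 0.0   # time after which the lease may be taken over

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Acquire or renew the lease.

        Succeeds when the lease is free, already held by the candidate,
        or expired because the previous holder stopped renewing it.
        """
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.duration
            return True
        return False


lease = Lease(duration=15.0)

print(lease.try_acquire("scheduler-a", now=0.0))   # a becomes active
print(lease.try_acquire("scheduler-b", now=5.0))   # b stays on standby
print(lease.try_acquire("scheduler-a", now=10.0))  # a renews before expiry
print(lease.try_acquire("scheduler-b", now=20.0))  # lease still valid, b waits
print(lease.try_acquire("scheduler-b", now=30.0))  # a stopped renewing, b takes over
```

At any point in time only the current holder acts, which matches the "only one instance of each of them will be active" behavior quoted above; the standby instances simply keep retrying.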

2. How many nodes are needed?

etcd uses Raft to keep data consistent across its members, and Raft needs a majority of the members (more than half) to be alive in order to operate normally.
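The majority rule can be written down as a short check. This is a minimal sketch: the function names `majority` and `has_quorum` and the 3-member cluster are chosen for illustration and are not part of etcd.

```python
def majority(cluster_size: int) -> int:
    """Smallest member count that is more than half of the cluster."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, alive: int) -> bool:
    """True when enough members are alive for the cluster to keep operating."""
    return alive >= majority(cluster_size)


# Walk a 3-member cluster through successive member failures.
cluster_size = 3
for failed in range(cluster_size + 1):
    alive = cluster_size - failed
    state = "available" if has_quorum(cluster_size, alive) else "no quorum"
    print(f"{failed} failed, {alive} alive: {state}")
```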

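The relationship between cluster size and majority also determines how many member failures the cluster can tolerate (its failure tolerance). The table in the etcd FAQ shows that an odd cluster size compares favorably with the adjacent even sizes: it tolerates more failures than the even size below it, and it matches the tolerance of the even size above it with fewer nodes.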

For example:

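Compare a cluster of 3 with a cluster of 2: both have the same majority (2), but the cluster of 3 has better failure tolerance (1 instead of 0). Compare it with a cluster of 4: both have the same failure tolerance (1), but the cluster of 4 needs a larger majority (3 instead of 2).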
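The comparison can be reproduced for other sizes with a few lines of arithmetic. The sketch below assumes majority is `size // 2 + 1` and failure tolerance is `size - majority`; the helper name `quorum_row` is hypothetical.

```python
def quorum_row(size: int) -> tuple:
    """Return (cluster size, majority, failure tolerance) for one cluster size."""
    majority = size // 2 + 1
    return size, majority, size - majority


print(f"{'size':>4}  {'majority':>8}  {'tolerance':>9}")
for size in range(1, 10):
    n, majority, tolerance = quorum_row(size)
    print(f"{n:>4}  {majority:>8}  {tolerance:>9}")
```

Each even size has the same tolerance as the odd size just below it, so the extra member adds cost without adding resilience.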

According to the etcd FAQ, a cluster of 5 members is enough for most situations:

A 5-member etcd cluster can tolerate two member failures, which is enough in most cases.



Written by Yi Jyun