XMiDT

Move from SD+Consistent Hash to Fixed Cluster Size & Rendezvous Hash

Proposal:

Goals:

  • Simplify deployment by:
    • Eliminating the “automatic inclusion” of Talaria nodes into the production server group via service discovery.
    • Eliminating re-sharding events that impact CPEs connected to Talaria nodes that are not being pulled from service.
    • Enabling specific Talaria nodes to be drained before being pulled from service.
  • Enable stronger geo affinity by providing automatic fail-over within a DC with minimal impact to CPEs attached to unaffected Talaria nodes.
  • Enable a deployment that moves the minimal number of CPEs, and does so under very controlled conditions.
  • Retain complete DC failover support.

What:

Change from consistent hashing with service discovery (which dynamically determines the set of Talaria nodes) to rendezvous hashing over a fixed list of Talarias per DC. The list will be stored and shared via Consul’s key-value store in each DC.

In addition to the simple rendezvous hashing, the fan-out pattern will be limited by an xxx_hash_k parameter that caps the fan-out permitted by talaria, petasos, and scytale. This feature is designed to enable highly efficient fan-out (normally 1 or 2 valid talaria per DC) versus needing to either progressively walk all talaria nodes or fan out to every talaria node in a DC to find the CPE.
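As a rough illustration, the sketch below shows rendezvous (highest-random-weight) hashing with a top-k cut-off standing in for the xxx_hash_k parameter. The hash function, hostnames, and device ID are illustrative assumptions, not the actual XMiDT implementation.

```go
// Minimal sketch: rendezvous hashing with a configurable fan-out limit.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score computes a per-(node, deviceID) weight; the node(s) with the highest
// scores "win" the device.
func score(node, deviceID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(node))
	h.Write([]byte(deviceID))
	return h.Sum64()
}

// topK returns the k highest-scoring talaria entries for a device, i.e. the
// maximum fan-out petasos/scytale would need to consider.
func topK(talarias []string, deviceID string, k int) []string {
	sorted := append([]string(nil), talarias...)
	sort.Slice(sorted, func(i, j int) bool {
		return score(sorted[i], deviceID) > score(sorted[j], deviceID)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	talarias := []string{
		"talaria-0000-v010203-xyz.xmidt.comcast.net",
		"talaria-0001-v010203-aad.xmidt.comcast.net",
	}
	// With talaria_hash_k = 1 only the single best candidate is considered.
	fmt.Println(topK(talarias, "mac:112233445566", 1))
}
```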

Petasos will have a new feature where it picks a random index in [0, petasos_hash_k) and returns only that talaria URL. Because petasos randomizes, failures are automatically routed around without the need to determine whether a node has failed. This feature also means clients do not need to alter their existing behavior.
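A sketch of that petasos behavior follows, assuming the candidate list has already been ordered by rendezvous hashing (as in the earlier sketch); the function name and hostnames are illustrative.

```go
// Sketch: petasos picks a random candidate from the top petasos_hash_k entries,
// so failed nodes are routed around statistically rather than detected explicitly.
package main

import (
	"fmt"
	"math/rand"
)

// pickRedirect returns the talaria URL petasos would hand back to a CPE.
func pickRedirect(candidates []string, petasosHashK int) string {
	if petasosHashK > len(candidates) {
		petasosHashK = len(candidates)
	}
	return candidates[rand.Intn(petasosHashK)]
}

func main() {
	// candidates would come from rendezvous hashing over /kv/talaria_list.
	candidates := []string{
		"talaria-0000-v010203-xyz.xmidt.comcast.net",
		"talaria-0001-v010203-aad.xmidt.comcast.net",
	}
	fmt.Println(pickRedirect(candidates, 2))
}
```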

Talaria will also have a new management function that forces talaria to re-read the /kv/talaria_list and /kv/talaria_hash_k values and apply a “pruning process” in which CPEs that are no longer eligible to be attached to that talaria are disconnected. The API call will return when the process has completed and will include the combination of /kv/talaria_list and /kv/talaria_hash_k in the response to inform the caller of the state achieved. This allows automation to verify that talaria is behaving correctly before changing the /kv/scytale_hash_k value to a more restrictive value.
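A hypothetical sketch of that management call is below; the endpoint path, helper functions, and response shape are assumptions made for illustration, not the real talaria API.

```go
// Sketch: a management endpoint that re-reads the KV values, prunes CPEs that
// no longer hash to this node, and reports the values it applied.
package main

import (
	"encoding/json"
	"net/http"
)

type pruneResponse struct {
	TalariaList  []string `json:"talaria_list"`
	TalariaHashK int      `json:"talaria_hash_k"`
	Disconnected int      `json:"disconnected"`
}

// readKV and pruneIneligible stand in for the Consul read and the
// connection-manager walk; both are placeholders here.
func readKV() ([]string, int)                  { return nil, 1 }
func pruneIneligible(list []string, k int) int { return 0 }

func pruneHandler(w http.ResponseWriter, r *http.Request) {
	list, k := readKV()
	disconnected := pruneIneligible(list, k) // blocks until pruning completes
	json.NewEncoder(w).Encode(pruneResponse{
		TalariaList:  list,
		TalariaHashK: k,
		Disconnected: disconnected,
	})
}

func main() {
	// The path and port are illustrative, not the real management interface.
	http.HandleFunc("/management/prune", pruneHandler)
	http.ListenAndServe(":8080", nil)
}
```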

Finally, this feature will require some new tools to be developed to perform parts of the upgrade process automatically. These tools will need to interact with Consul and its KV store as well as with service discovery (a sketch follows the key list below).

Consul key list:

/kv/talaria_list = "talaria-0000-v010203-xyz.xmidt.comcast.net,
                    talaria-0001-v010203-aad.xmidt.comcast.net,
                    ..."

/kv/talaria_hash_k = 1
/kv/scytale_hash_k = 1
/kv/petasos_hash_k = 1
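
As referenced above, here is a minimal sketch of how an upgrade tool might read and update these keys with the official Consul Go client (github.com/hashicorp/consul/api), assuming the /kv/ prefix above denotes Consul’s KV API path rather than part of the key name. Error handling is trimmed for brevity.

```go
// Sketch: read the fixed talaria list and tighten scytale_hash_k via Consul KV.
package main

import (
	"fmt"
	"strconv"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	kv := client.KV()

	// Read the fixed talaria list for this DC.
	pair, _, err := kv.Get("talaria_list", nil)
	if err != nil || pair == nil {
		panic("talaria_list not found")
	}
	fmt.Printf("talaria_list: %s\n", pair.Value)

	// Tighten the scytale fan-out once every talaria reports a clean prune.
	_, err = kv.Put(&api.KVPair{Key: "scytale_hash_k", Value: []byte(strconv.Itoa(1))}, nil)
	if err != nil {
		panic(err)
	}
}
```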

I think there are some clarifications needed.

The goal of the proposal is to create independent regions that are highly available without requiring clients to connect to a different region. This allows for a deployment scenario with geographic affinity; however, that scenario is not part of the proposal. In the current design, a failure of a single Talaria node creates a time window during which devices belonging to the failed hash buckets cannot be serviced until the failure is detected by the service discovery mechanism. The current design relies on having redundant regions that can service these failed hash buckets.

It should be noted that simply switching to rendezvous hashing with a k value of 1 would not perceptibly change the behavior from how hashing works today if we don’t change the service discovery mechanism. At a minimum, a k value of 2 is needed for any kind of redundancy. Having this redundancy allows us to abandon service discovery altogether, needing only failure detection at the instance level.
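To make the k = 2 case concrete, the sketch below assumes the caller already knows a device’s top-2 rendezvous candidates and simply tries them in hash order, falling back on failure; the tryTalaria helper and hostnames are placeholders, not real XMiDT code.

```go
// Sketch: with k = 2 there is a second eligible talaria to fall back to, so
// only instance-level failure detection is needed.
package main

import (
	"errors"
	"fmt"
)

func tryTalaria(url, deviceID string) error {
	// Placeholder: pretend the first candidate is down.
	if url == "talaria-0000-v010203-xyz.xmidt.comcast.net" {
		return errors.New("connection refused")
	}
	return nil
}

func main() {
	candidates := []string{ // top-2 rendezvous candidates for this device
		"talaria-0000-v010203-xyz.xmidt.comcast.net",
		"talaria-0001-v010203-aad.xmidt.comcast.net",
	}
	for _, url := range candidates {
		if err := tryTalaria(url, "mac:112233445566"); err == nil {
			fmt.Println("served by", url)
			return
		}
	}
	fmt.Println("no eligible talaria reachable")
}
```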

The proposed Consul key list would also require a list size. Currently the list size is dictated by the number of available nodes, with failing nodes falling out of the list and changing its size. The result is a reshard event, where some portion of connected devices will need to reconnect. The proposal calls for a list of fixed size, where the example Talaria nodes are actual instances taking on the responsibility of the hash buckets associated with their particular index. Conceptually, this means the hashing algorithm operates on the list:

* index0
* index1
* index2
* ......

The outcome of the hashing algorithm must then be applied to the list of Talaria instances via a table lookup.

* index0 --> /kv/talaria_list[0]

This also means the configured list size (the number of hash indices) must not be larger than the number of entries in /kv/talaria_list.
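
A sketch of this fixed-index approach follows: rendezvous hashing is performed over the stable index labels (index0, index1, ...), and the winning index is then mapped to a concrete host via /kv/talaria_list. Because the index count never changes, replacing the host at a slot does not reshard devices attached to other slots. The hash function and hostnames are illustrative.

```go
// Sketch: hash over fixed index labels, then map the winning index to a host.
package main

import (
	"fmt"
	"hash/fnv"
)

func score(label, deviceID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(label))
	h.Write([]byte(deviceID))
	return h.Sum64()
}

// winningIndex hashes over the fixed slots, not the host names.
func winningIndex(slots int, deviceID string) int {
	best, bestScore := 0, uint64(0)
	for i := 0; i < slots; i++ {
		if s := score(fmt.Sprintf("index%d", i), deviceID); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	talariaList := []string{ // parsed from /kv/talaria_list
		"talaria-0000-v010203-xyz.xmidt.comcast.net",
		"talaria-0001-v010203-aad.xmidt.comcast.net",
	}
	idx := winningIndex(len(talariaList), "mac:112233445566")
	fmt.Printf("index%d --> %s\n", idx, talariaList[idx])
}
```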