XMiDT

Duplicate Connected Devices Policy

This topic was prompted by this PR: https://github.com/xmidt-org/webpa-common/pull/439 because I wanted to share some of the design history & open up the discussion to a wider audience. I also want to thank @diogoViana95 for taking the time to make the PR & spark this conversation topic.


Pretty much since the start of Webpa/Xmidt we have tried to prevent duplicate devices in the system because they caused confusion & some difficult-to-understand behaviors.

As of the code available on Nov 4, 2019, Talaria allows a single connected device per device-id. If a device disconnects & reconnects quickly, Talaria will choose the new connection as the preferred connection and close the old connection assuming it’s dead.
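To make the "newest connection wins" policy concrete, here is a minimal sketch in Python. It is not Talaria's code: `Connection`, `Registry`, and their methods are hypothetical names invented for illustration. The sketch assumes the registry swaps in the new connection first and then closes only the one it replaced, and that a late cleanup of the old connection must not evict the new one.

```python
class Connection:
    """Hypothetical stand-in for one device websocket connection."""

    def __init__(self, device_id, label):
        self.device_id = device_id
        self.label = label
        self.open = True

    def close(self):
        self.open = False


class Registry:
    """Hypothetical registry that keeps at most one connection per device-id."""

    def __init__(self):
        self._by_id = {}

    def connect(self, conn):
        # Register the new connection, then close only the one it replaced.
        old = self._by_id.get(conn.device_id)
        self._by_id[conn.device_id] = conn
        if old is not None and old is not conn:
            old.close()
        return old

    def disconnect(self, conn):
        # Remove the entry only if it still points at this exact connection,
        # so cleaning up the old connection cannot remove the new one.
        if self._by_id.get(conn.device_id) is conn:
            del self._by_id[conn.device_id]
        conn.close()

    def get(self, device_id):
        return self._by_id.get(device_id)
```

Keying the cleanup on the connection object rather than on the device-id alone is the design choice that keeps a quick reconnect from taking down both connections.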

If we run in a single data center (DC) configuration, that means there can be at most 1 device connected with a given device-id at a time, but there are trade-offs.

  • If by mistake or attack multiple devices attempt to use the same device-id then you can end up with devices that are constantly kicking each other off the system.
  • If you run with multiple DCs you can actually have up to 1 connection per device-id in each DC, so you end up with “some” duplication. API calls will favor the device that responds fastest, but each device connected in the different DCs will get a copy of the message. Event origins can’t really be distinguished between the two.
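The multi-DC behavior can be sketched as follows. This is a simplified simulation, not the real request path: `DeviceConn`, `fan_out`, the per-DC dictionaries, and the `latency_ms` field are all assumptions made for illustration, and response time is simulated with a number rather than measured.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceConn:
    """Hypothetical connection record; latency_ms simulates response time."""
    dc: str
    latency_ms: int
    received: list = field(default_factory=list)

    def handle(self, message):
        # Every connected copy of the device receives the message.
        self.received.append(message)
        return {"dc": self.dc, "latency_ms": self.latency_ms, "echo": message}


def fan_out(dcs, device_id, message):
    """Send message to device_id in every DC; return the fastest response."""
    responses = []
    for registry in dcs.values():
        conn = registry.get(device_id)  # at most one connection per DC
        if conn is not None:
            responses.append(conn.handle(message))
    if not responses:
        return None
    # The caller only sees the response that would have arrived first.
    return min(responses, key=lambda r: r["latency_ms"])
```

With one connection at 40 ms in one DC and another at 15 ms in a second DC, both connections record the message, but only the 15 ms response is returned to the caller.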

Alternative Approach

An alternative approach would be to keep the old connections & let them expire “naturally” when the client doesn’t respond, or in the case where multiple devices choose the same name, they both exist.

With this approach there are several trade-offs. I’ll start with the ones I’m aware of, but there could be others.

  • One API call would result in multiple outcomes. Instead of 1 response, we could get any number.
    • I’m not sure how Talaria would respond to Scytale in the fan out of requests.
      • 1 response per device allows for fastest responses, but how to signal how many to expect?
      • 1 response per Talaria would slow down all responses to the slowest responding device.
  • How could the API caller specify which device to target with a request since the unique value is no longer unique on the Talaria?
  • Themis & the security check prior to allowing a connection should be able to catch this class of problem as it develops into a more robust solution.
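The two response options in the list above differ mainly in when the caller hears back. A small simulation shows the difference; the function names and the millisecond figures are hypothetical, and nothing here reflects how Talaria or Scytale actually exchange responses.

```python
def per_device_timeline(latencies_ms):
    """Forward each device's reply as soon as it arrives.

    Returns (arrival_ms, device_index) pairs in arrival order.
    """
    return sorted((ms, i) for i, ms in enumerate(latencies_ms))


def per_talaria_timeline(latencies_ms):
    """Hold replies until every device has answered, then send one batch.

    Returns a single (ready_ms, device_indexes) pair.
    """
    return (max(latencies_ms), list(range(len(latencies_ms))))
```

For three duplicate devices answering in 20 ms, 500 ms, and 35 ms, the per-device option delivers the first reply at 20 ms but leaves the caller unsure how many more to wait for, while the per-Talaria option delivers everything at once at 500 ms.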

I’m posting this because I wanted to share the design decisions we made quite a while ago, but also to provide a mechanism for change if we would like to do so.

I’m not in favor of changing this behavior at this point despite it not being perfect, but am open to discussing. Thoughts?

Best,
Wes

Hello @schmidtw, relating to the PR, I think it’s not a design discussion. It is literally a bug in the logic of disconnect of a single talaria in a single DC. If you check the PR thread, you can see that the bug causes the disconnect of both connections (the new and old one). And by design, it should be only the old one to disconnect.

Relating to the architecture design itself, I would say that for the time being it’s a good enough approach. The truth is that the ping/pong logic of the websocket will eventually converge to a single device in a multi-DC scenario (assuming that the security of a device is not compromised, and therefore we cannot physically have two devices with the same id).

If the architecture wants to support by design a concept of “clone” devices with the same id, that is probably a different kind of problem/solution. Do you have a real use case for this type of scenario?

Best,
Carlos

Oh ok. That makes more sense. I’ve asked @johnabass to look at the PR since he’s really familiar with the code (he likely wrote what is there).

I don’t have any strong use cases for supporting a clone today. I thought that was what the PR was about & wanted to have a discussion that wasn’t buried in the PR code :)

Best,
Wes