Ideas and Notes from Brainstorming Sessions (2017-09-08)
========================================================

Protocol:
~~~~~~~~~

sender-id/mux:

  We already discussed the possibility of splitting up the mux in order to
  support link-local OOB messages. The downside is that this reduces the
  number of concurrent virtual connections...

  New idea: don't sub-assign parts of the mux but reduce the sender-id to
  12 bits. This is most probably still enough for very big anycast clusters
  and frees up 4 bits for additional signaling.

  The new header would look like this:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                        sequence number                        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |X ? ? ?|       sender ID       |              MUX              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    X .. key-exchange flag or unencrypted flag?
    ? .. reserved

inline Key-exchange:

  Idea: Key-exchange daemons can communicate with the other side via
  link-local IPv6 addresses (works with tun and tap, at least on Linux...).
  If packets incoming on the tun/tap interface are IPv6 and have a
  link-local source or destination IP, the messages are sent to the other
  side unencrypted and with the X flag set.

  Idea: use the crypto role (server/client, left/right, alice/bob) for
  addressing.

  possible addressing scheme:

    role(server,left,alice) -> fe80::xx:xx00:1/64
    role(client,right,bob)  -> fe80::xx:xx00:/64

    (xx:xx is a well-known number for SATP, i.e. it is always '5A:DB'.
    Mind that IPv6 stateless autoconfiguration will always generate
    addresses with xx:xx set to FF:FE. Question: what about privacy
    extensions?)

  alternative addressing:

    Make use of the link-local address that is generated by the OS (which
    should be the case for any interface with IPv6 enabled) and only add a
    well-known address on one side, the server (which wouldn't be selected
    by the automatic address selection algorithm).
    But in this case the server needs to learn the link-local addresses of
    the muxes aka clients.

  Question: How to handle systems with IPv6 disabled? No inline key-exchange
  support in that case? IPv4 link-local addresses only have a /16 range and
  we would lose one mux value in that case (or 3 if we also omit the network
  and broadcast addresses -> not too bad...).

  The advantage of using link-local addresses is that the key-exchange can
  use TCP from the OS kernel, which is already resilient against packet
  duplication and does retransmits -> very nice for RAIL-mode, which will
  produce a lot of duplicates and probably still has packet loss.
  A possible downside is that not all programs/key-exchange daemons support
  link-local addresses -> write a proxy application for that case!

  An anycast receiver will send a "redirect" message when it receives a
  packet with the X flag set on its anycast address. This redirect will
  point to a unicast address on the same host. This way key-exchanges can
  be sure they only talk to a single host.

  For some key-exchanges it should be possible to send early data with the
  initial packet and the "redirect" message to save some round-trips.
  E.g. IKEv2 needs two round trips to establish an SA. The first two
  messages can be in the initial packet and the "redirect" message. The
  remaining 2 packets will then be sent to the unicast address of the
  anycast host, which guarantees that they reach an IKEv2 daemon which has
  already seen the first part of the handshake.
  Does this work together with the IPv6 link-local address idea from above?

  Question: for the first key-exchange it makes sense to update the remote
  address in the SA even if the received packets are unauthenticated, but
  during normal operation it is very bad to update the remote addresses,
  which are the result of authenticated packets, in favor of
  unauthenticated info (aka packets with the X flag set).

  Idea: have one address list for encrypted/authenticated packets and a
  separate one for unauthenticated packets.
  If the key-exchange succeeds, the addresses learned by it are copied to
  the address list for encrypted packets.


Golang Implementation:
~~~~~~~~~~~~~~~~~~~~~~

Packet Handling (Marshal/Unmarshal):

  Encrypted- and PlainPacket have an internal buffer using fixed
  pre-allocated memory. This might even be 64k (the UDP maximum size)
  because there won't be a lot of them allocated at once (maximum one per
  NumCPU?!).
  Header, Payload and Authtag of EncryptedPacket as well as Type and
  Payload of PlainPacket are Go slices pointing to the underlying buffer.
  The Header of EncryptedPacket and the Type of PlainPacket have getters
  and setters which directly encode/decode using
  BigEndian.(Put)?Uint(16|32). All of this shouldn't need any mallocs and
  would therefore be pretty fast.

  EncryptedPacket has a function VerifyAndDecrypt() which takes a
  PlainPacket to store the result.
  PlainPacket has a function EncryptAndAuthenticate() which takes an
  EncryptedPacket to store the result.
  The implicit copy operations of these crypto functions are free because
  the encrypt/decrypt process needs to read and write the memory anyway,
  and it makes no difference whether the destination is the same or some
  other memory area.

  Both packet types implement the ReaderFrom and WriterTo interfaces in
  order to read directly from and write directly to the tun/tap device and
  UDP sockets.

  Conclusion: Any packet handling goroutine holds one EncryptedPacket and
  one PlainPacket.

  Idea: Have NumCPU goroutines for receiving and NumCPU goroutines for
  sending.

    Receiving: UDP     --> verify&decrypt --> tun/tap
    Sending:   tun/tap --> encrypt&auth   --> UDP

  Question: How can multiple goroutines listen on multiple UDP sockets
  while the overall system allows only NumCPU packets to be handled at
  once? There are several cases where the above scheme leads to either too
  few or too many concurrent operations (a lot of traffic from a single
  source sent only in one direction vs. all sources sending a lot of data
  in both directions).
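
  Sketch for the packet handling notes above (fixed buffer, slices into it,
  getters/setters via encoding/binary). This is only an illustration under
  assumptions: the bit positions follow the proposed header (X flag in the
  top bit, 12 bit sender ID, 16 bit mux), and the names NewEncryptedPacket,
  headerLen, flagX and senderMask are made up for this example. Authtag,
  crypto and the ReaderFrom/WriterTo parts are left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	headerLen  = 8       // 4 bytes sequence number + 2 bytes flags/sender ID + 2 bytes mux
	maxPacket  = 1 << 16 // 64k, the upper bound for a UDP datagram
	flagX      = 0x8000  // assumed position of the X flag: top bit of the flags/sender-ID word
	senderMask = 0x0FFF  // sender ID: low 12 bits of that word
)

// EncryptedPacket owns one fixed buffer; header and payload are slices
// into it, so the accessors below never allocate.
type EncryptedPacket struct {
	buf     [maxPacket]byte
	header  []byte
	payload []byte
}

// NewEncryptedPacket is a hypothetical constructor, not taken from the notes.
func NewEncryptedPacket() *EncryptedPacket {
	p := &EncryptedPacket{}
	p.header = p.buf[:headerLen]
	p.payload = p.buf[headerLen:headerLen] // empty until a packet is read
	return p
}

func (p *EncryptedPacket) SeqNr() uint32     { return binary.BigEndian.Uint32(p.header[0:4]) }
func (p *EncryptedPacket) SetSeqNr(n uint32) { binary.BigEndian.PutUint32(p.header[0:4], n) }

func (p *EncryptedPacket) Mux() uint16     { return binary.BigEndian.Uint16(p.header[6:8]) }
func (p *EncryptedPacket) SetMux(m uint16) { binary.BigEndian.PutUint16(p.header[6:8], m) }

func (p *EncryptedPacket) SenderID() uint16 {
	return binary.BigEndian.Uint16(p.header[4:6]) & senderMask
}

// SetSenderID keeps the four flag bits and replaces only the low 12 bits.
func (p *EncryptedPacket) SetSenderID(id uint16) {
	w := binary.BigEndian.Uint16(p.header[4:6])
	binary.BigEndian.PutUint16(p.header[4:6], (w&^senderMask)|(id&senderMask))
}

func (p *EncryptedPacket) X() bool {
	return binary.BigEndian.Uint16(p.header[4:6])&flagX != 0
}

// SetX sets or clears the X flag without touching the sender ID.
func (p *EncryptedPacket) SetX(on bool) {
	w := binary.BigEndian.Uint16(p.header[4:6]) &^ flagX
	if on {
		w |= flagX
	}
	binary.BigEndian.PutUint16(p.header[4:6], w)
}

func main() {
	p := NewEncryptedPacket()
	p.SetSeqNr(1)
	p.SetX(true)
	p.SetSenderID(0x123)
	p.SetMux(0x4567)
	fmt.Printf("% x\n", p.header) // prints "00 00 00 01 81 23 45 67"
}
```

  The 64k array lives inside the struct, so one allocation per packet
  object is all that is needed for the lifetime of a worker goroutine.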

  different approach:

    - one goroutine listening on all UDP sockets + tun/tap using select()
    - when the dispatcher goroutine wakes up it starts up to NumCPU
      goroutines for all the sockets and the tun/tap device that are ready
      for read.
    - only if all the file descriptors returned by select() are assigned to
      a running goroutine does the dispatcher goroutine call select() again.
    - if a worker goroutine is done it returns its resources to the
      dispatcher's pool (resources = EncryptedPacket + PlainPacket)
    - number of available resources (aka packets) = NumCPU

Security Assoc DB:

  A map with the mux as key and a single RW lock. The writers lock only
  needs to be acquired when clients are added or removed. Any other
  goroutine only needs to acquire the readers lock. The values of the map
  have their own RW lock to guard against concurrent access to them.

  The value struct contains:

    - RW-mutex (see above)
    - timestamp when the SA was generated/updated by the key-exchange
    - last sequence number used for outgoing packets
    - a list of remote addresses, one for each socket (RAIL-mode)
      possibly: a second list of remote addresses for unauthenticated
      packets
    - a list of sequence windows, one for each sender-id (anycast cluster)
    - the master key and salt and algo for the key derivation function
    - the cipher and auth algo to use (might be the same -> AES-GCM)
    - auth tag length

  For sending goroutines the next sequence number to be used can be
  calculated using AddUint32() from sync/atomic, hence only the readers
  lock is required.

  EncryptedPacket.DecryptAndVerify possibly needs to update the remote
  address(es) after the packet is verified. In RAIL-mode this needs to be
  done regardless of whether the packet is accepted by the sequence window.
  If RAIL-mode is off, the remote address should only be updated if the
  sequence window accepts the packet.

  Question: checking whether the remote addresses need to be changed only
  needs the readers lock, but in case they differ the goroutine needs to
  release the readers lock and acquire the writers lock.
  Is this a problem? Shall we acquire the writers lock in any case?
  For IPv4 addresses we could use sync/atomic CompareAndSwapUint32 but
  there is no such thing for IPv6 aka 128 bit values. (And we would even
  need to include the port!)

Sequence Window:

  EncryptedPacket.DecryptAndVerify needs to check the sequence window,
  which is a compare and write operation.

  Idea: The sequence window consists of one uint64 and a number of uint32
  values. The uint64 is split into a 32 bit part for the current top
  sequence number and 32 bits of flags. Each flag represents one sequence
  number (aligned to multiples of 32 in the sequence number space). Any
  subsequent 32 bit value contains flags for older packets.
  The 64 bit value and all subsequent 32 bit values can be modified using
  functions from sync/atomic. When the bitmaps need to be rotated (i.e.
  when a new sequence number advances the window to the next 32 bit
  boundary) the writers lock for the window needs to be held. In all other
  cases the readers lock is enough and the bit test & set ops are atomic.
  This reduces the number of times the writers lock is held to roughly one
  in every 32 incoming packets for that sequence window (Note: there is one
  sequence window per mux and sender-id).