Ideas and Notes from Brainstorming Sessions (2017-09-08)
========================================================

Protocol:
~~~~~~~~~

sender-id/mux:

  We already discussed the possibility of splitting up the mux in order to have
  support for link-local OOB messages. The downside is that this reduces the
  number of concurrent virtual connections...

  New Idea: don't sub-assign parts of mux but reduce sender-id to 12 bits. This
  is most probably still enough for very big anycast clusters and frees up 4
  bits for additional signaling.
  The new header would look like this:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         sequence number                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |X ? ? ?|       sender ID       |              MUX              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   X .. key-exchange flag or unencrypted flag?
   ? .. reserved
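
  To make the bit layout concrete, here is a minimal Go sketch that packs and
  unpacks the proposed header. The type and function names (Header, Marshal,
  Unmarshal, flagX) are made up for illustration, and the X flag is assumed to
  be the most significant bit of the second 32-bit word, as drawn above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	flagX        = 0x8    // X flag: first of the 4 flag bits (assumed position)
	senderIDMask = 0x0FFF // sender ID is reduced to 12 bits
)

// Header is a hypothetical in-memory form of the 8-byte header.
type Header struct {
	Seq      uint32
	Flags    uint8  // 4 bits: X plus three reserved bits
	SenderID uint16 // 12 bits
	Mux      uint16
}

// Marshal writes the header into the first 8 bytes of b in network byte order.
func (h Header) Marshal(b []byte) {
	binary.BigEndian.PutUint32(b[0:4], h.Seq)
	binary.BigEndian.PutUint16(b[4:6], uint16(h.Flags&0xF)<<12|h.SenderID&senderIDMask)
	binary.BigEndian.PutUint16(b[6:8], h.Mux)
}

// Unmarshal reads a header from the first 8 bytes of b.
func Unmarshal(b []byte) Header {
	w := binary.BigEndian.Uint16(b[4:6])
	return Header{
		Seq:      binary.BigEndian.Uint32(b[0:4]),
		Flags:    uint8(w >> 12),
		SenderID: w & senderIDMask,
		Mux:      binary.BigEndian.Uint16(b[6:8]),
	}
}

func main() {
	h := Header{Seq: 1, Flags: flagX, SenderID: 0xABC, Mux: 0x1234}
	buf := make([]byte, 8)
	h.Marshal(buf)
	fmt.Printf("% x\n", buf) // 00 00 00 01 8a bc 12 34
	fmt.Println(Unmarshal(buf) == h)
}
```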


inline Key-exchange:

  Idea: Key-exchange daemons can communicate with other side via link-local IPv6
  addresses (works with tun and tap, at least on Linux...)
  If packets incoming on tun/tap interface are IPv6 and have a link-local source
  or destination IP, messages are sent to the other side unencrypted and with
  the X flag set.

  Idea: use crypto role (server/client, left/right, alice/bob) for addressing
  possible addressing scheme:
     role(server,left,alice) -> fe80::xx:xx00:1/64
     role(client,right,bob)  -> fe80::xx:xx00:<mux>/64
      (xx:xx is a well-known number for SATP, i.e. it is always '5A:DB';
       mind that IPv6 stateless autoconfiguration will always generate addresses
       with xx:xx set to FF:FE. Question: what about privacy extensions?)
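
  A small Go sketch of this addressing scheme, assuming the marker 5A:DB goes
  into the two bytes that EUI-64 based autoconfiguration fills with FF:FE. The
  helper names (linkLocal, ServerAddr, ClientAddr) are hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
)

// linkLocal builds fe80::5a:db00:<host>. Bytes 11 and 12 of the address are
// the ones that EUI-64 sets to ff:fe, so the marker goes there.
func linkLocal(host uint16) netip.Addr {
	var a [16]byte
	a[0], a[1] = 0xfe, 0x80
	a[11], a[12] = 0x5a, 0xdb
	a[14], a[15] = byte(host>>8), byte(host)
	return netip.AddrFrom16(a)
}

// ServerAddr is the fixed address of role(server,left,alice).
func ServerAddr() netip.Addr { return linkLocal(1) }

// ClientAddr derives the address of role(client,right,bob) from its mux.
func ClientAddr(mux uint16) netip.Addr { return linkLocal(mux) }

func main() {
	fmt.Println(ServerAddr())       // fe80::5a:db00:1
	fmt.Println(ClientAddr(0x0102)) // fe80::5a:db00:102
}
```

  Note that in this sketch a client with mux 1 ends up with the same address as
  the server; how to deal with that is not decided in these notes.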

  alternative addressing: make use of the link-local address that is generated
  by the OS (which should be the case for any interface with IPv6 enabled)
  and only add a well-known address on one side, the server (which wouldn't be
  selected by the automatic address selection algorithm). But in this case the
  server needs to learn the link-local addresses of the muxes aka clients.

  Question: How to handle systems with IPv6 disabled? No inline key-exchange
  support in that case? IPv4 link-local addresses only have a /16 range and we
  would lose one mux value in that case (or 3 if we also omit network and
  broadcast addresses -> not too bad...)

  The advantage of the use of link-local addresses is that in that case the
  key-exchange can use TCP from OS kernel which is already resilient against
  packet duplication and does retransmits -> very nice for RAIL-mode which will
  produce a lot of duplicates and probably still has packet loss.
  Possible downside is that not all programs/key-exchange daemons support
  link-local addresses -> write proxy application for that case!

  An anycast receiver will send a "redirect" message when it receives a packet
  with the X flag set on its anycast address. This redirect will point to a
  unicast address on the same host. This way key-exchanges can be sure they only
  talk to a single host. For some key-exchanges it should be possible to send
  early data with the initial packet and the "redirect" message to save some
  round-trips.
  E.g. IKEv2 needs two round trips to establish an SA. The first two messages
  can be in the initial packet and the "redirect" message. The remaining 2
  packets will then be sent to the unicast address of the anycast host which
  guarantees to reach an IKEv2 daemon which has already seen the first part of
  the handshake.
  Does this work together with the IPv6 Link-Local address idea from above?


  Question: for the first key-exchange it makes sense to update the remote
  address in the SA even if the received packets are unauthenticated, but during
  normal operation it is very bad to update the remote addresses, which are the
  result of authenticated packets, in favor of unauthenticated info (aka packets
  with X flag set).
  Idea: have a separate address list for encrypted/authenticated packets and for
  unauthenticated packets. If key-exchange succeeds the addresses learned by it
  are copied to the address list for encrypted packets.


Golang Implementation:
~~~~~~~~~~~~~~~~~~~~~~

Packet Handling (Marshal/Unmarshal):

  Encrypted- and PlainPacket have an internal buffer using fixed pre-allocated
  memory. This might even be 64k (the UDP maximum size) because there won't be a
  lot of them allocated at once (maximum one per NumCPU?!).
  Header, Payload and Authtag of EncryptedPacket as well as Type and Payload of
  PlainPacket are Go slices pointing to the underlying buffer. The Header of
  EncryptedPacket and Type of PlainPacket have getters and setters which
  directly encode/decode using BigEndian.(Put)?Uint(16|32). All of this
  shouldn't need any mallocs and would therefore be pretty fast.
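
  A rough sketch of how EncryptedPacket could be laid out. The buffer size, the
  field names and the Resize helper are assumptions, and the header offsets
  follow the 8-byte header proposed in the protocol section.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	bufSize    = 64 * 1024 // assumed buffer size ("64k")
	headerSize = 8         // sequence number + flags/sender ID + mux
)

// EncryptedPacket owns one fixed buffer. header, payload and authTag are
// views into that buffer, so re-slicing them never allocates.
type EncryptedPacket struct {
	buf     [bufSize]byte
	header  []byte
	payload []byte
	authTag []byte
}

// NewEncryptedPacket allocates the packet once. It is meant to be reused.
func NewEncryptedPacket() *EncryptedPacket {
	p := &EncryptedPacket{}
	p.header = p.buf[:headerSize]
	return p
}

// Resize points payload and authTag at the right parts of the buffer.
func (p *EncryptedPacket) Resize(payloadLen, tagLen int) error {
	if payloadLen < 0 || tagLen < 0 || headerSize+payloadLen+tagLen > bufSize {
		return errors.New("packet does not fit into buffer")
	}
	p.payload = p.buf[headerSize : headerSize+payloadLen]
	p.authTag = p.buf[headerSize+payloadLen : headerSize+payloadLen+tagLen]
	return nil
}

// Getters and setters encode/decode directly in the buffer.
func (p *EncryptedPacket) SeqNr() uint32     { return binary.BigEndian.Uint32(p.header[0:4]) }
func (p *EncryptedPacket) SetSeqNr(n uint32) { binary.BigEndian.PutUint32(p.header[0:4], n) }
func (p *EncryptedPacket) Mux() uint16       { return binary.BigEndian.Uint16(p.header[6:8]) }
func (p *EncryptedPacket) SetMux(m uint16)   { binary.BigEndian.PutUint16(p.header[6:8], m) }

func main() {
	p := NewEncryptedPacket()
	p.SetSeqNr(42)
	p.SetMux(7)
	if err := p.Resize(1400, 16); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(p.SeqNr(), p.Mux(), len(p.payload), len(p.authTag)) // 42 7 1400 16
}
```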

  EncryptedPacket has a function VerifyAndDecrypt() which takes a PlainPacket to
  store the result. PlainPacket has a function EncryptAndAuthenticate() which
  takes an EncryptedPacket to store the result. The implicit copy operations of
  these crypto functions are free because the encrypt/decrypt process needs to
  read and write the memory anyway and it makes no difference whether the
  destination is the same or some other memory area.
  Both packet types implement the ReaderFrom and WriterTo interface in order
  to directly read-from/write-to tun/tap device and UDP sockets.
  Conclusion: Any packet handling goroutine holds one EncryptedPacket and one
  PlainPacket.

  Idea: Have NumCPU goroutines for receiving and NumCPU goroutines for sending.

    Receiving:    UDP   --> verify&decrypt --> tun/tap
    Sending:    tun/tap -->  encrypt&auth  -->   UDP


  Question: How can multiple goroutines listen to multiple UDP sockets but have
  the overall system allow only NumCPU packets to be handled at once? There
  are several cases where the above scheme leads to either too few or too many
  concurrent operations (a lot of traffic from a single source sent only in one
  direction vs. all sources send a lot of data in both directions).

  different approach:
    - one goroutine listening on all udp sockets + tun/tap using select()
    - when the dispatcher goroutine wakes up it starts up to NumCPU goroutines
      for all the sockets and tun/tap device ready for read.
    - only if all the file descriptors returned by select() are assigned to
      a running goroutine the dispatcher goroutine calls select() again.
    - if a worker goroutine is done it returns its resources to the dispatcher's
      pool (resources = EncryptedPacket + PlainPacket)
    - number of available resources (aka packets) = NumCPU


Security Assoc DB:

  A map with mux as key with a single RW lock. Only if clients are added or
  removed the writers lock needs to be acquired. Any other goroutine only needs
  to acquire the readers lock. The values of the map have their own RW lock for
  locking concurrent access to them.

  The value struct contains:
    - RW-mutex (see above)
    - timestamp when the SA was generated/updated by key-exchange
    - last sequence number used for outgoing packets
    - a list of remote addresses, one for each socket (RAIL-mode)
      possibly: a second list of remote addresses for unauthenticated packets
    - a list of sequence windows, one for each sender-id (anycast cluster)
    - the master key and salt and algo for the key derivation function
    - the cipher and auth algo to use (might be the same -> AES-GCM)
    - auth tag length

  For sending goroutines the next sequence number to be used can be calculated
  using AddUint32() from sync/atomic hence only the readers lock is required.
  EncryptedPacket.VerifyAndDecrypt() possibly needs to update the remote
  address(es) after the packet is verified. In RAIL-mode this needs to be done
  regardless of the packet being accepted by the sequence window. If RAIL-mode
  is off the remote address should only be updated if the sequence window
  accepts the packet.
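
  A minimal sketch of the map with its RW lock and the atomic sequence counter.
  SADB, SA and the method names are hypothetical, and most fields of the value
  struct are left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// SA holds the per-mux state. Only the parts needed here are shown.
type SA struct {
	mu     sync.RWMutex // guards the fields omitted in this sketch
	seqOut uint32       // last sequence number used for outgoing packets
}

// NextSeq hands out the next outgoing sequence number without taking a lock.
func (sa *SA) NextSeq() uint32 { return atomic.AddUint32(&sa.seqOut, 1) }

// SADB maps mux values to security associations.
type SADB struct {
	mu  sync.RWMutex
	sas map[uint16]*SA
}

func NewSADB() *SADB { return &SADB{sas: make(map[uint16]*SA)} }

// Add and Remove change the map and therefore need the writers lock.
func (db *SADB) Add(mux uint16, sa *SA) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.sas[mux] = sa
}

func (db *SADB) Remove(mux uint16) {
	db.mu.Lock()
	defer db.mu.Unlock()
	delete(db.sas, mux)
}

// Get only reads the map, so the readers lock is enough.
func (db *SADB) Get(mux uint16) (*SA, bool) {
	db.mu.RLock()
	defer db.mu.RUnlock()
	sa, ok := db.sas[mux]
	return sa, ok
}

func main() {
	db := NewSADB()
	db.Add(7, &SA{})
	sa, _ := db.Get(7)

	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				sa.NextSeq()
			}
		}()
	}
	wg.Wait()
	fmt.Println(sa.NextSeq()) // 4001
}
```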

  Question: the check if remote addresses need to be changed only needs the
  readers lock but in case it differs the goroutine needs to release the readers
  lock and acquire the writers lock. Is this a problem? Shall we acquire the
  writers lock in any case?
  For IPv4 addresses we could use sync/atomic CompareAndSwapUint32 but there is
  no such thing for IPv6 aka 128bit values.
  (And we would even need to include the port!)


Sequence Window:

  EncryptedPacket.VerifyAndDecrypt() needs to check the sequence window which
  is a compare and write operation.
  Idea: Sequence window consists of one uint64 and a number of uint32 values.
  The uint64 is split into a 32bit part for the current top sequence number
  and 32 bits of flags. Each flag represents one sequence number (the flag
  blocks are aligned to multiples of 32 in sequence number space). Any
  subsequent 32bit value contains flags for older packets.
  The 64bit and all subsequent 32bit values can be modified using functions
  from sync/atomic. When the bitmaps need to be rotated (i.e. when the new
  sequence number advances the window past the next multiple-of-32 boundary)
  the writers lock for the window needs to be held. In any other case the
  readers lock is enough and the bit test & set ops are atomic. This reduces
  the number of times the writers lock is held to roughly once per 32 incoming
  packets for that sequence window (Note: there is one sequence window per mux
  and sender-id).
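
  The following Go sketch is one possible reading of this idea, not a finished
  design. SeqWindow, NewSeqWindow and CheckAndSet are hypothetical names, the
  number of older 32bit blocks is a free parameter, and sequence number
  wrap-around is not handled.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// SeqWindow tracks which sequence numbers have been seen already.
type SeqWindow struct {
	mu    sync.RWMutex
	head  atomic.Uint64   // high 32 bits: top sequence number, low 32 bits: flags of its block
	older []atomic.Uint32 // older[i]: flags of block (topBlock - 1 - i)
}

func NewSeqWindow(olderBlocks int) *SeqWindow {
	return &SeqWindow{older: make([]atomic.Uint32, olderBlocks)}
}

// mark tests and sets the flag for seq using atomic operations only. The
// caller holds at least the readers lock and seq is not ahead of the top block.
func (w *SeqWindow) mark(seq uint32) bool {
	bit := uint32(1) << (seq & 31)
	for {
		head := w.head.Load()
		top, flags := uint32(head>>32), uint32(head)
		if seq>>5 < top>>5 { // seq lives in one of the older blocks
			idx := int(top>>5 - seq>>5 - 1)
			if idx >= len(w.older) {
				return false // too old for the window
			}
			for {
				old := w.older[idx].Load()
				if old&bit != 0 {
					return false // duplicate
				}
				if w.older[idx].CompareAndSwap(old, old|bit) {
					return true
				}
			}
		}
		if flags&bit != 0 {
			return false // duplicate
		}
		if seq > top {
			top = seq
		}
		if w.head.CompareAndSwap(head, uint64(top)<<32|uint64(flags|bit)) {
			return true
		}
	}
}

// CheckAndSet reports whether seq is new and marks it as seen.
func (w *SeqWindow) CheckAndSet(seq uint32) bool {
	w.mu.RLock()
	if seq>>5 <= uint32(w.head.Load()>>32)>>5 {
		defer w.mu.RUnlock()
		return w.mark(seq)
	}
	w.mu.RUnlock()

	// seq starts a new block: rotate the bitmaps under the writers lock.
	w.mu.Lock()
	defer w.mu.Unlock()
	head := w.head.Load()
	shift := int(seq>>5) - int(uint32(head>>32)>>5)
	if shift <= 0 {
		return w.mark(seq) // another goroutine rotated first
	}
	for i := len(w.older) - 1; i >= 0; i-- {
		switch j := i - shift; {
		case j >= 0:
			w.older[i].Store(w.older[j].Load())
		case j == -1:
			w.older[i].Store(uint32(head)) // flags of the previous top block
		default:
			w.older[i].Store(0) // block that was skipped entirely
		}
	}
	w.head.Store(uint64(seq)<<32 | uint64(uint32(1)<<(seq&31)))
	return true
}

func main() {
	w := NewSeqWindow(2)
	fmt.Println(w.CheckAndSet(5), w.CheckAndSet(5))   // true false
	fmt.Println(w.CheckAndSet(40), w.CheckAndSet(6))  // true true
	fmt.Println(w.CheckAndSet(200), w.CheckAndSet(6)) // true false
}
```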