For packet transport over SSH tunnels or other tunneling setups, we use TCP. SSH tunnels in particular forward only TCP connections, so there is no gain in wrapping packets in any other protocol.
As TCP connections are always between two parties, serialized, and guaranteed, transmission is straightforward: we serialize the message and stream it to the other end. To allow an end node to resynchronize on a message boundary in the stream, we prepend each message with a simple preamble. This is necessary to resync if length parsing fails or is incorrect.
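A minimal sketch of this framing in Python, assuming a hypothetical 4-byte preamble and a 4-byte big-endian length field; the document does not specify the actual preamble bytes or header layout:

```python
import struct

# Assumed framing constants; the real preamble and header layout are
# not specified in this document.
PREAMBLE = b"MAGI"
HEADER = struct.Struct(">4sI")  # preamble + message length

def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the stream")
        data += chunk
    return data

def send_message(sock, message):
    """Stream one framed message: preamble + length + body."""
    sock.sendall(HEADER.pack(PREAMBLE, len(message)) + message)

def recv_message(sock):
    """Read one message, scanning forward byte by byte to resync on the
    preamble if a previous length field was parsed incorrectly."""
    buf = _recv_exact(sock, HEADER.size)
    while not buf.startswith(PREAMBLE):
        buf = buf[1:] + _recv_exact(sock, 1)
    _, length = HEADER.unpack(buf)
    return _recv_exact(sock, length)
```

The byte-by-byte scan is what makes recovery possible: a corrupted length field desynchronizes the reader, but the next preamble in the stream re-establishes a message boundary.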
In a regular experiment or a federated experiment, there is a natural hierarchical structure that we can leverage when transmitting information to many nodes.
We reserve some extra hardware from the available experiment nodes to provide the top layer of this tree. These are known as the control nodes. They provide a location for external TCP connections and for aggregation facilities, keeping both off of the regular experiment nodes. The control nodes form the top of each tree. Communication between control nodes uses standard unicast schemes; this simplifies top-layer intercommunication and allows control nodes to exist in different testbeds or on different control planes, where special forwarding may be needed due to security constraints or conflicting private address spaces.
Below the control nodes, on a single switched backplane, are standard experiment nodes that form the first layer of the tree. An experiment node may be a single instance in the experiment, or it may host multiple virtual machines that are each their own entity in the experiment. The virtual nodes become the second layer of the tree.
Using this structure, we can multicast packets from the control node to the experiment nodes and broadcast from the experiment nodes to the virtual nodes. With a receiver-based multicast agreement and NAK avoidance, it is possible to support many ‘nodes’, as seen by the experimenter, across multiple testbeds.
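The NAK-avoidance piece can be sketched as receiver-side suppression: on detecting a missing packet, each receiver schedules its NAK after a random backoff and cancels it if it overhears another receiver's NAK for the same packet first, so one loss seen by many receivers produces only a few NAKs. The class and parameter names below are illustrative, not taken from the MAGI code:

```python
import random
import threading

class NakSuppressor:
    """Hypothetical NAK-avoidance sketch: delay each NAK by a random
    backoff and cancel it if another receiver asks first."""

    def __init__(self, send_nak, max_delay=0.2):
        self.send_nak = send_nak      # callback that multicasts a NAK
        self.max_delay = max_delay    # upper bound on the random backoff
        self.pending = {}             # packet key -> armed timer

    def missing(self, key):
        """Called when a gap is detected; arm a delayed NAK."""
        if key in self.pending:
            return
        timer = threading.Timer(random.uniform(0, self.max_delay),
                                self._fire, args=(key,))
        self.pending[key] = timer
        timer.start()

    def overheard_nak(self, key):
        """Another receiver NAKed this packet; suppress our own."""
        timer = self.pending.pop(key, None)
        if timer:
            timer.cancel()

    def _fire(self, key):
        if self.pending.pop(key, None) is not None:
            self.send_nak(key)
```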
As MAGI messages can be larger than 64 KB, the maximum UDP datagram size, the multicast transport is expected to fragment messages for transport over the network. To reduce the number of packets that we must retransmit due to packet loss, we fragment messages into chunks of 1470 bytes or less (i.e., each fits in a single IP packet over an Ethernet link). IP fragmentation is cheaper in bandwidth when sending UDP packets greater than 1480 bytes, but the loss of any single fragment requires us to retransmit the entire datagram (up to 64 KB), adding wasted bandwidth and time to recover from a single lost packet. Furthermore, we require message-level fragmentation regardless for anything over 64 KB.
Each MAGI message is assigned a simple incrementing ID and divided into chunks by the transport code. Each chunk is then sent in its own UDP multicast packet along with a data header (7 extra bytes). Each data chunk is the same size, typically set to the largest value that fits inside the link-layer MTU without IP-level fragmentation; the chunk size is deduced from the UDP packet length. The last packet of a set (partnum == totalparts) may be smaller than the standard size. If the last chunk is received before any previous chunk, the receiver must hold onto the data until a full-size chunk arrives to determine its placement in the buffer.
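A sketch of the chunking step follows. The document fixes the data header at 7 bytes but not its field layout; here we assume 1 byte of delivery flags, a 2-byte message ID, and 2-byte part/total counters:

```python
import struct

# Assumed 7-byte header layout: flags, msgid, partnum, totalparts.
CHUNK_HEADER = struct.Struct(">BHHH")   # 1 + 2 + 2 + 2 = 7 bytes
CHUNK_SIZE = 1470                       # fits one Ethernet-link IP packet

def chunk_message(msgid, message, flags=0):
    """Split a serialized MAGI message into equal-size chunks, numbered
    from 1 so the receiver can spot the last part (partnum == totalparts)."""
    total = max(1, -(-len(message) // CHUNK_SIZE))   # ceiling division
    for part in range(1, total + 1):
        payload = message[(part - 1) * CHUNK_SIZE : part * CHUNK_SIZE]
        yield CHUNK_HEADER.pack(flags, msgid, part, total) + payload
```

On the receiving side, the payload length of any non-final packet establishes the standard chunk size, and the byte offset (partnum - 1) * chunk_size places each payload in the reassembly buffer; this is why a short final chunk that arrives first must be held aside until a full-size chunk is seen.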
The multicast group communication scheme relies on the receiver to guarantee receipt. The sender does not attempt to maintain a list of all potential recipients, so it cannot know precisely when to release message buffers, though group communication will generally take place on a single switched link. The multicast transport maintains a fixed-size list of previous packets for retransmission. When a new packet arrives, it must bump out an older packet. We choose which packet to evict based on time since last transmit (initial or retransmit) and message delivery flags. Packets that have just been transmitted remain in the queue to await any quick requests for retransmit. Higher-priority packets, such as those requesting an ACK, remain longer than those marked best-effort, as their delivery is more important.
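One way to realize this eviction policy, sketched with hypothetical flag values and timings; the real transport's data structures and weighting are not specified here:

```python
import time

WANT_ACK = 0x01        # hypothetical delivery flag bit
ACK_GRACE = 5.0        # extra seconds of protection for ACK-requested packets

class RetransmitBuffer:
    """Fixed-size cache of recently sent packets, kept so receivers can
    NAK for retransmission."""

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self.packets = {}          # key -> (packet, flags, last_tx_time)

    def add(self, key, packet, flags):
        if len(self.packets) >= self.maxsize:
            self._evict_one()
        self.packets[key] = (packet, flags, time.monotonic())

    def retransmit(self, key, send):
        entry = self.packets.get(key)
        if entry:
            packet, flags, _ = entry
            send(packet)
            # Refresh the transmit time so the packet survives a while
            # longer, awaiting any further quick retransmit requests.
            self.packets[key] = (packet, flags, time.monotonic())

    def _evict_one(self):
        now = time.monotonic()
        def score(item):
            _, (_, flags, last_tx) = item
            age = now - last_tx
            # ACK-requested packets act "younger" than they are, so
            # best-effort packets are bumped out first.
            return age - (ACK_GRACE if flags & WANT_ACK else 0.0)
        victim = max(self.packets.items(), key=score)[0]
        del self.packets[victim]
```

A packet might be keyed by (msgid, partnum). Freshly transmitted packets have near-zero age and so are never the eviction victim, which matches the behavior described above.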
Buffers store whole MAGI messages. Most of these are simple in-memory buffers; buffers for larger messages, such as large files, may be spooled to disk to reduce memory usage.
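Python's standard library offers exactly this behavior; a sketch, with the 1 MB threshold being an illustrative value:

```python
import tempfile

def make_message_buffer(spool_threshold=1 << 20):
    """Buffer for one whole MAGI message: held in memory until it grows
    past spool_threshold bytes, then transparently rolled over to a
    temporary file on disk."""
    return tempfile.SpooledTemporaryFile(max_size=spool_threshold)

# Usage: append reassembled chunks, then rewind to read the message.
buf = make_message_buffer()
buf.write(b"...reassembled message bytes...")
buf.seek(0)
data = buf.read()
```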