Each message can be uniquely identified by the Source and Id fields in the message. These are stored temporarily and used to filter duplicate packets from a restransmission.
A common scheme in testbed control is to send data or commands to a series of nodes at the same time. The same group of nodes may be addressed repeatedly. To simplify addressing large groups and routing messages, node are allowed to join and leave named groups.
Group subscriptions are are spread through the messaging network for routing purposes. A node receiving a group subscription notes the nodes that are subscribing and then repeats informs neighbors on non-receiving transports, very much like IGMP. Each node only needs to know which direct neighbors wish to recieve the group messages and nothing beyond that.
The router messages are formatted in YAML much like most application messages and sends the messages to the dock __GROUPS__. The structure of a message includes:
{
add: [ list of groups ]
del: [ list of groups ]
set: [ list of groups ]
resend: True
counter: counter of all the groups in the full list of groups
checksum: checksum of all groups alphabetically ordered and fed through zlib's adler32
}
These are all the keys that can be present in any message. However, some of these keys are multually exclusive.
- add
- used to inform neighbors that you want to receive new groups, can be combined with del
- del
- used to inform neighbors of groups you no longer wish to received, can be combined with add
- set
- a (re)transmission of all groups that you wish to receive when resynchronization is required, cannot be combined
- resend
- a request for a Group List to another node when a lack of syncrhonization is detected, cannot be combined
Add, Del and Set must contain the counter and checksum values that help nodes determine when they are out of sync.
Most nodes in the network act as slaves to a central controller or orchestrater. In this case, it make sense for the controller to be able to construct groups from a single point. Rather than fill the application layer with extra code, we push that into the messaging layer.
Any node in the network can ‘build’ or ‘teardown’ nodes from a group. The messaging layer collects the responses from the remote nodes and sends an acknowledgment to the application when complete. The actual build and teardown messages are sent to all nodes. This reduces the need for single node route caching and takes full advantage of the broadcast nature of the topology. The return messages are sent to the originator of the build request and can be aggregated by the future automatic aggregation feature.
The build and teardown messages are formatted in YAML and send to the dock __BUILDER__.
Build and teardown messages:
{
builderid: id
group: "groupname"
build: [ list of nodes ]
teardown: [list of nodes]
}
Response messages:
{
builderid: id
complete: [ list of nodes ]
}
The initial response from any nodes contains a single entry array with its own name. This allows aggregation facilities to combine the messages.
There are a couple special group names that all nodes must understand but are not passed around in any routing messages:
- __ALL__
- this is a broadcast to all nodes in the experiment, it is processsed locally and then forwarded out all other interfaces
- __NEIGH__
- this is a broadcast to any receiving nodes on a transport, it is processed locally and routing stops
There are times when we may wish to send data to only a single node. The routing procedure used for groups is sufficient as the number of groups will probably remain in the range of 100 or perhaps up to 1000 and groups can be torn down before others are created. When dealing with single ‘nodes’, however, the number is static and could be into the millions if a large federation of testbeds along with virtual nodes is created.
There is a natural tree structure that forms in the testbed and if we could assign multipart node names based on that structure we could peform heirarchical routing where the sender just needs to know the next sub network to send to. However, this puts extra requirements on the testbed and synchronization with MAGI.
Building a dynamic routing information for this list would not take that long but the amount of storage required and lookup time for routing messages would quickly bring the system to a crawl. In this case, we take an idea from the AODV work in mobile adhoc networks and perform a request for routing using another broadcast message. If a route is not available, the node will send a ‘RouteRequest’ broadcast message for the destination node. The destination will respond with a broadcast route information message that has its hop counter incremented at each hop. Nodes cache the next hop in the path.
Using a cache lets us place an upper bound on the number of entries we have to maintain. Of course, there is still an issue if the number of required routes at any one time exceeds the cache size.
Each time the cache is queried, its access time is updated. If a new cache entry must be created but the cache is currently at its limit, we must remove some entries. One method is to maintain an ordered list such as a python deque and pull out the accessed route and move it to the front of the list, however, this means that there is a O(n) process occuring for every routed packet.
We prefer to do any heavy lifiting a limited number of times so cache purging involves removing a large chunk of entries and per packet processing should not do anything expensive. Packet processing only updates the route entry’s access time. When the cache reaches its maximum size, we retrieved a sorted list of all the access times in the cache. We pick a time in this list based on how many entries we want to remove and then create a new cache with access times newer than this. This removes x% of the cache.
It is possible for a message to be routed down one path for destination A when it is still waiting for a route for destination B that will also take the same route. This leads to each hop or group along the path to destination A creating a route request message for destination B. To squelch these requests, a node that passes on a route request will record that the request has been made and surpress any further requests for a set period of time.
To create a new group from a script or a graphical interface, we must instruct each individual node to join the group. Rather than attempting to address each individual node, we send a broadcast command to join a group to all experimental nodes (a default group) via the built-in ‘Joiner’ agent. Each node then examines the list in the requst and if they are in the list, they perform a join operation alleviating the need to perform single node routing.
At the experiment node level of the messaging network, there is a natural tree strucutre that will always be followed. At the upper levels where TCP transports are used to connect nodes, there is a possibility of a loop occuring in the network. This plays havoc with our simplified method of group announcements, unicast route responses in addition to regular unicast and mulicast routing.
To alleviate this problem, we implement an algorithm much like the spanning tree protocol in switches that runs only amongst nodes using the upper level transports like TCP. All nodes at the top level with more than one such connection exchange information about their links and build a shortest path tree with the same root (using node name with ‘smallest’ name). Each node then ignores packets on links that don’t exist in the spanning tree.