This is a brief document outlining current thinking on the WISP-in-a-box project that I (Tom) have been working on. It is meant to eventually be merged into the rest of the WISPiab articles, but for the time being it's here so new thinking can be easily seen and commented upon.
Our basic mesh setup consists of mesh nodes that provide both a wireless backhaul to a central server/gateway and end-user access in the form of Ethernet connections to the mesh node. We aren't explicitly concerned with running access points; if that functionality is desired, separate devices will need to be attached to the mesh nodes to act as wireless bridges. Each mesh node will have its own captive portal, supporting multiple clients, and communicate with a central AAA server.
The mantra here is reusing existing solutions on low-cost hardware. We don't want to reinvent the wheel, and we need to do everything as cheaply as possible. The cost decisions mainly influence [hardware choice]. The goal of software reuse means we want to do as little development in-house as possible, instead relying on synthesizing a single product from various existing components. At the end of the day we may need to do some of our own development, but the decisions I'm suggesting in this document are made with the intent of minimizing that kind of work.
The platform for the mesh nodes will be the Linksys WRT. On the basis of cost and availability, there's really no better platform, and those needs trump the technical requirements that the Broadcom chipsets fail to meet. There are some consequences of this choice, however, in terms of what software can be run on the mesh nodes. More on this in a later section on the mesh software.
Network Component Layout
Other thoughts & issues: Past Concerns, Future Considerations
Positioning of the billing system
There is some question as to where the captive portal should sit. We've thus far assumed it would reside on the nodes themselves, in the form of CoovaAP. CoovaAP is no good for this purpose, however, because in order to run its captive portal it automatically takes over the wireless interface as an access point (so the interface can't be used for the mesh backhaul). Just using the CoovaChilli component, however, and swapping around its normal outgoing and incoming interfaces, might work. The idea would be to run BATMAN on the wireless interface, tell CoovaChilli that the wireless interface is its outgoing interface, and then somehow bridge the LAN ports together and run a DHCP server only on them, not on the wireless. This seems easy enough, but as far as I can tell it has never really been done, so this is a good point for experimentation.
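To make the idea concrete, here is a sketch of what the Kamikaze configuration might look like. The interface names, SSID, and addresses are all assumptions for illustration, not a tested setup:

```
# /etc/config/wireless -- put the radio in ad-hoc mode for the mesh backhaul
config wifi-iface
    option device  wifi0
    option mode    adhoc
    option ssid    mesh-backhaul    # assumed SSID

# /etc/config/network -- bridge the LAN ports together; DHCP runs only here
config interface lan
    option type    bridge
    option ifname  "eth0.0"
    option proto   static
    option ipaddr  192.168.182.1
    option netmask 255.255.255.0
```

BATMAN would then be started on the wireless interface (e.g. `batmand ath0`), leaving the bridged LAN for client access.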
An alternate strategy would be to have the captive portal reside on the server/gateway. This wouldn't allow for control of traffic inside the mesh, but that's probably all right (local traffic isn't what's going to be billed). This has the distinct advantage of there being only *one* instance in the entire network, so configuration changes made to it don't need to be pushed out to all mesh nodes. The problem with this method is that it needs to be implemented at Layer 3, because clients won't be on the same Ethernet segment as the captive portal (they'll be multiple wireless hops away). CoovaChilli is a Layer 2 captive portal solution, and furthermore it *needs* to be the DHCP server for its clients, at least in its stock form (there may be ways to hack it to not require this). Layer 3 captive portals exist, and this may be worth looking into as well. If we can find one that works with RADIUS, we can just plug it into our existing FreeRADIUS / phpMyPrepaid billing architecture.
Update: I've gotten CoovaChilli running on a WRT running Kamikaze, configured such that the wireless is used as the default gateway (right now it's in client mode tied to another AP, but it could be used in ad-hoc mode for a mesh too) and the LAN hands out DHCP leases. RADIUS configuration is a bit tedious, so it's not quite fully functional yet, but the approach seems to work without problems. This is a good thing, because it means we can run just CoovaChilli on all the mesh nodes without it affecting the wireless interface, alongside any other software that runs on Kamikaze. There is one open problem here, however: CoovaChilli itself doesn't come with a web interface. Its interface is built into the rest of the interface for the CoovaAP firmware. There are a few possibilities for handling this (write our own interface, rip the existing one out of the CoovaAP firmware), but none are totally desirable.
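For reference, a sketch of what the standalone chilli configuration might look like in this arrangement. All addresses and secrets are placeholders, and option values would need to be checked against the CoovaChilli documentation:

```
# /etc/chilli.conf (sketch)
dhcpif        br-lan              # listen on the bridged LAN ports, not the wireless
net           192.168.182.0/24    # subnet handed out to clients
uamlisten     192.168.182.1      # address clients are redirected to for login
uamport       3990
radiusserver1 10.0.0.1           # central FreeRADIUS server, reached over the mesh
radiusserver2 10.0.0.1
radiussecret  testing123         # placeholder shared secret
```

Authenticated traffic then simply follows the node's default route out over the wireless/mesh interface.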
Update 2: We will write a CoovaChilli configuration page as a component of our dashboard. Its configuration will be pushed out to all nodes just like the standard network configuration. One remaining open question: the access controller (i.e. CoovaChilli) may need to monitor traffic going inside the mesh. As long as the destination is not behind the same CoovaChilli instance as the originating host, this is no problem. But if they are both behind the same CoovaChilli instance (i.e. connected to the same mesh node), it may be more difficult. Doable, but it may take some configuration; I'm just not sure. This also brings up the question of how two end hosts on the network can communicate if they're both behind separate NATs (in the form of CoovaChilli DHCP servers). Probably some kind of coordination via the server will be necessary to rendezvous (this is how Skype works), but it may limit the services we can offer.
Single-configuration of Mesh Nodes
One of the major goals of this project is to have a single interface from which we can make changes to all of the mesh nodes. There are several approaches to this that I can see:
- Orangemesh provides this functionality through ROBIN, but for the reasons mentioned above it doesn't really suit our purposes. We could try to modify it or otherwise use their update system. See below for a discussion of how ROBIN updates work.
- Alternatively, I have a nascent idea that rather than updating the mesh nodes by simply modifying their configuration files, we could design a small utility that would run on the mesh node connected to the server. This mesh node would have everything we needed on it (e.g. OpenWRT, BATMAN, possibly CoovaChilli, etc.). After it was satisfactorily configured, hopefully using existing interfaces, this utility would essentially make an image of the entire node and then upload this image to every other node in turn using private-key based SSH file transfers. Each node, upon receiving such an image, would write it to disk and reboot. This is certainly a little slower than directly modifying configuration files, but I think it has the potential to be a simpler, more robust, and much more modular solution (one could even add whole packages to the master node and then have those changes propagate throughout the network). The feasibility of this approach isn't certain, but I think it's worth further investigation.
Update: The need to support configuration changes across multiple devices (e.g. Ubiquiti NanoStations and Mesh Potatoes) makes the flashing system untenable. Rather, sending out update messages that are received by a client running on the mesh node seems to be a better option. This is basically how ROBIN works. See below.
How a flash-based configuration solution might work
A discussion with David Johnson made it seem like this is possible. Some specifics:

- The master node (the node connected to the server) gets configured. Once it looks good, the flash distribution utility is run. It commits all changes and writes an image file based on the master node's disk.
- The master node then goes through the list of every node it knows about (i.e. the rest of the mesh) and, using a private SSH key, initiates a script on each of those nodes. That script fetches the image from the master node, does some basic sanity checks on it (checksum, etc.), then sends a message to the master node saying it's ready.
- When the master node has received a ready message from every node in its list, it sends out another command telling each node to execute the change. If it doesn't hear back from every node, it informs the user which ones are missing, indicating that some human intervention is needed at those nodes.
- Every node, upon receiving the execute command, waits twice the maximum transmission time or so, then writes the change to disk and reboots. Furthermore (and this is borrowed from the FreeBSD guys), each node keeps the old image around (it should have enough RAM to do so) and runs a watchdog timer upon reboot. If it can't find the mesh after a few minutes, it will revert to the old image and reboot again.

The watchdog revert is probably the trickiest part, but the entire system seems fairly simple. It'll take a little while for the entire process to go through, but many configuration changes can be lumped into one update, so for normal operation this should be fine. A further feature that was discussed is the possibility of all nodes, at some set time (say midnight each night), switching to a certain "update" SSID and channel and checking the master node for a more recent firmware version than the one they currently have. Most nodes will be running the current firmware and will immediately switch back to their normal SSID and channel, resulting in little downtime. Any "lost" nodes, however, will download the correct firmware, thereby "rejoining the herd", so to speak.
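The master-side half of this process could be sketched roughly as below. Node names, key paths, and the remote helper scripts (`fetch-image`, `apply-image`) are all hypothetical; none of this exists yet:

```sh
#!/bin/sh
# Sketch of the master node's flash distribution utility.

IMAGE=/tmp/master.img
NODES="node1 node2 node3"        # would really come from the mesh routing table

# Phase 1: tell every node to fetch the image and report readiness.
push_image() {
    for n in $NODES; do
        # fetch-image (hypothetical) would download $IMAGE from the master,
        # verify its checksum, and ack back to the master when satisfied
        ssh -i /etc/dropbear/id_rsa root@"$n" /sbin/fetch-image "$IMAGE" &
    done
    wait
}

# missing_nodes "all nodes" "nodes that acked": print nodes needing a visit
missing_nodes() {
    for n in $1; do
        case " $2 " in
            *" $n "*) ;;         # this node is ready
            *) echo "$n" ;;      # no ack: human intervention needed here
        esac
    done
}

# Phase 2: only when every node has acked, tell them all to flash and reboot.
# execute_update() { for n in $NODES; do ssh root@"$n" /sbin/apply-image; done; }
```

The two-phase shape (distribute, then execute only on unanimous readiness) is the important part; everything else is plumbing.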
How ROBIN updates work, what we can reuse
ROBIN nodes periodically (via cron) run a script that contacts the dashboard (in ROBIN parlance, "update server") and sends it all of the information the dashboard requires (in the form of a URI parameter sent to a PHP script), and in return receives an update file (that same PHP script returns the update file, so for the client all it needs to do is call `wget dashboard.com?information-the-dashboard-wants` and it receives an update file). This update file is comprised of sections, each corresponding to a configuration file. The /sbin/update script sends and receives this information, writes a cleaned version of it to file, and then calls /etc/init.d/settings, which expects the file to be there already, parses it by splitting it up into its various sections, and calls a different update script for each section (e.g. update-batman for /etc/config/batman, update-mesh for /etc/config/mesh, etc). These scripts are in /usr/sbin/.
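The client-side cycle described above could be sketched as follows. The dashboard URL and the `[section]` header format are assumptions for illustration only; ROBIN's actual wire format should be checked before reusing any of this:

```sh
#!/bin/sh
# Sketch of a minimal update client in the style of ROBIN's /sbin/update.

# Fetch the update file, sending the dashboard what it wants as URI params:
# wget -q -O /tmp/update "http://dashboard.example/update.php?mac=$MAC"

# split_sections FILE DIR: write the body of each "[name]" section of FILE
# into DIR/name, one file per config section (assumed section syntax).
split_sections() {
    awk -v dir="$2" '
        /^\[.*\]$/ { name = substr($0, 2, length($0) - 2); next }
        name != "" { print > (dir "/" name) }
    ' "$1"
}

# Each section file would then be handed to its update-* script, e.g.
# /usr/sbin/update-batman /tmp/sections/batman
```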
This aspect of ROBIN we might be able to make use of, if we wanted to take these scripts and adapt them slightly. The system is somewhat modular in that only the config sections sent out in an update get written to config files, which is good for us. The format of the configuration updates is extremely simple, which is good as well. ROBIN seems to have a lot of other, yet-unexplored (by me) aspects, however. I don't yet know how crucial they are for the system to function, but compared to a basic OpenWRT Kamikaze install there are quite a few ROBIN-specific files spread out over the system. This is very bad for portability; we want to make it easy to implement clients for various hardware platforms (even initially we need to support three: the Linksys WRT, the Ubiquiti NanoStation, and the yet-to-be-designed Mesh Potato). This suggests we should have a single, simple program or script that handles all interaction with the dashboard and configuration.
Many other pieces of ROBIN are software components (e.g. BATMAN, CoovaChilli) that we would want to install separately and have run independently anyway. We still want to be able to configure them, which ROBIN does through the above-mentioned system of update-* scripts. We aren't particularly interested in the components themselves, however, because we assume they will already be installed on any system we write the client for.
Peebles Valley Visit
Thursday July 24th - Friday July 25th a large delegation from Meraka went to Peebles Valley to visit the mesh there. We toured the site, David showed everyone the basic setup, and Harry, the director of the clinic there, gave us a quite interesting tour of the medical facilities as well. On Friday we spent some time doing a little network debugging, replacing one of the routers on a house there and plugging back in and resetting a router at the school. We also spoke with some local guys who may be interested in turning the wireless mesh into some kind of business, which is clearly crucial for the network to continue functioning; as it is, we can come out and fix things as often as we like, but they'll always just fall apart again. I pitched the WISP-in-a-box system to one of the people interested in making the mesh a business; he seemed interested, but I think without an actual product to demo it's tough to really understand whether it would be useful. This kind of entrepreneur is precisely our target audience, however, so this might be an excellent first test site.
Ndlovu Clinic Visit
Monday July 28th - Friday Aug 1st I was able to go to the Ndlovu Medical Clinic in Elandsdoorn. There's a fairly sophisticated IT setup there, essentially a large Windows Active Directory installation with about 100 client PCs. Additionally, there's interest in using a wireless mesh to run telemedicine between the main clinic and some of the outlying Nutritional Unit sites. The telemedicine itself is currently tied up in political discussions, but the network is largely in place. It uses the Bokkie routers developed at Meraka, running both IPv4 and IPv6 on the mesh. I helped Henry, Ndlovu's main IT guy, to add another node to the mesh, connecting the main clinic building, where most of the computer systems reside and the Internet uplink comes in, to their computer lab a couple hundred meters away. Most of the nodes are currently off, because they can't yet be used for telemedicine and if they're on they risk getting fried by lightning. But with just those two nodes everything worked largely as expected.
We configured the main clinic's node to act as a default gateway for the network, hooking in to the main clinic network. All of the computers in the lab are now set up to use the node there as their DHCP server. Traffic can be routed out to the Internet, but there's one remaining issue: the mesh itself has no DNS support, and so the computer lab nodes need to have DNS servers manually configured. Johann has indicated that this would be fairly easy to fix, however, simply by configuring the clinic node to be the mesh's DNS server then pointing it to the DNS server of the main clinic network. Johann and Henry should be in contact to resolve this last issue (pun not intended ;)
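The fix Johann describes might look something like the fragment below, assuming the clinic node runs dnsmasq (standard on OpenWRT-based routers). The upstream address is a placeholder for the main clinic network's DNS server:

```
# /etc/dnsmasq.conf on the clinic node (sketch)
no-resolv            # ignore /etc/resolv.conf
server=192.168.1.10  # forward all queries to the main clinic's DNS server
```

Since dnsmasq advertises itself as the DNS server in its DHCP leases, the lab machines would then pick up working DNS automatically instead of needing manual configuration.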
The main challenges in getting things set up at Ndlovu were not related to the mesh so much as to the rest of the clinic's setup. As mentioned, it's entirely a Microsoft domain, and there was a Microsoft ISA 2004 firewall at the perimeter of the network. We added another one between the mesh and the rest of the network, to limit access and also to bandwidth-limit the computer lab so it doesn't take out the entire Internet uplink. This caused all manner of chaos on the internal network; at one point the entire Active Directory domain stopped working. We traced the problem to a "known issue" mentioned in an MSDN article; the solution was manually changing a registry key. That was a full day and a half's worth of headache, however. After that it was relatively smooth sailing; the only other issue was configuring the newly added firewall to route to the mesh network properly. This required some fiddling with routing tables on the Windows command line, no mean feat but doable. After that we were in good shape, save the DNS problem mentioned above.
The computer lab node itself is mounted on the lab building by means of a nice mast that the Ndlovu workmen welded for us. It should present no problems, but the Ethernet cable still needs to be routed inside the building and protected from the elements.
As soon as they get permission to start doing telemedicine or have other needs for the mesh it should be quite easy to just turn on the other Bokkie routers and expand the mesh rapidly. Before this is done, however, the DNS issue should be resolved.
For the sake of completeness, I'll also link here to the instructions I wrote for Henry, Ndlovu's IT guy, on how to work with the Bokkie routers. These are based on the instructions Johann gave me. High Performance Node - Setup