Tom's research
This is a brief document outlining current thinking on the WISP-in-a-box project that I (Tom) have been working on. It is meant to eventually be merged into the rest of the WISPiab articles, but for the time being it's here so new thinking can be easily seen and commented upon.
Use Cases
Our basic mesh setup consists of mesh nodes that do two things: provide a wireless backhaul to a central server and gateway, and provide end-user access through Ethernet connections on the mesh node. For the purposes of the business model and billing, we are assuming each node represents a single user. It may be possible to support multiple users on each node, but that depends on [our choice of captive portals].
Design Goals
The mantra here is reusing existing solutions on low-cost hardware. We don't want to reinvent the wheel, and we need to do everything as cheaply as possible. The cost decisions mainly influence [hardware choice]. The goal of software reuse means we want to do as little development in-house as possible, instead relying on synthesizing a single product from various existing components. At the end of the day we may need to do some of our own development, but the decisions I'm suggesting in this document are made with the intent of minimizing that kind of work.
Hardware Choice
The platform for the mesh nodes will be the Linksys WRT. On the basis of cost and availability, there's really no better platform, and those two needs outweigh the technical requirements that its Broadcom chipset fails to meet. There are some consequences of this choice, however, in terms of what software can be run on the mesh nodes. More on this in Software Components.
Software Components
Mesh Components
Because our hardware platform is the WRT, we cannot use Orangemesh (at least in its current manifestation). The Orangemesh / Openmesh dashboard relies on the mesh nodes running ROBIN, and ROBIN's current form uses a single radio for two purposes -- both the mesh backhaul and an open access point. This requires a single radio to have two distinct SSIDs, which the Broadcom chipsets are not capable of doing (see ROBIN forum discussion on this). While we could possibly modify Orangemesh to suit our purposes, that project's development goals don't really "mesh" with ours, since their goal is single-device, single-radio access point / mesh systems. This isn't something we're explicitly interested in covering (or, rather, it would be nice to cover it, but because the Linksys can't do it, we're not going to). Still, we may decide it's worth it to modify Orangemesh. I'm going to continue looking into that possibility.
If we don't go with Orangemesh, I think a good solution might be for each mesh node to run Kamikaze + X-WRT + BATMAN + possibly Coova, then use the flashing-based config distribution system mentioned below.
Server Components
The server will have a unified dashboard, probably a simple frame system with options across the top for viewing the network (Nagios or a modified Orangemesh dash), billing options (CoovaChilli or other, [see below]), vouchers (phpMyPrepaid), the state of the server (Webmin or similar), and possibly some further advanced options. Selecting an option would load that component's interface in the main frame, and we will skin every interface with a unified CSS style sheet so they all look the same.
Open Questions
Positioning of the billing system
There is some question as to where the captive portal should sit. We've thus far assumed it would reside on the nodes themselves, in the form of CoovaAP. CoovaAP is no good for this purpose, however: in order to run a captive portal it automatically takes over the wireless interface and runs it as an access point, so that interface can't be used for the mesh backhaul. Using just the CoovaChilli component, however, and swapping its normal outgoing and incoming interfaces, might work. The idea would be to run BATMAN on the wireless interface, tell CoovaChilli that the wireless interface is its outgoing interface, and then somehow bridge the LAN ports together and run a DHCP server only on them, not on the wireless. This seems easy enough, but as far as I can tell it has never really been done, so this is a good point for experimentation.
An alternate strategy would be to have the captive portal reside on the server/gateway. This wouldn't allow for control of traffic inside the mesh, but that's probably all right (local traffic isn't what's going to be billed). This has the distinct advantage of requiring only *one* instance in the entire network, so configuration changes made to it don't need to be pushed out to all mesh nodes. The problem with this method is that it needs to be implemented at Layer 3, because clients won't be on the same Ethernet segment as the captive portal (they'll be multiple wireless hops away). CoovaChilli is a Layer 2 captive portal solution, and furthermore it *needs* to be the DHCP server for its clients (there may be ways to hack around this, but in its stock form it is a requirement). Layer 3 captive portals exist, and this may be worth looking into as well. If we can find one that works with RADIUS, we can just plug it into our existing FreeRADIUS / phpMyPrepaid billing architecture.
Update: I've gotten CoovaChilli running on a WRT running Kamikaze, configured such that the wireless is used as the default gateway (right now it's in client mode tied to another AP, but it could be used in ad-hoc mode for a mesh too) and the LAN hands out DHCP leases. RADIUS configuration is a bit tedious, so authentication isn't fully functioning yet, but the interface arrangement itself seems to work without problems. This is a good thing, because it means we can run just CoovaChilli on all the mesh nodes without it affecting the wireless interface, and alongside any other software that runs on Kamikaze. There is one open problem here, however: CoovaChilli itself doesn't come with a web interface. Its interface is built into the rest of the interface for the CoovaAP firmware. There are a few possibilities to handle this (write our own interface, rip the existing one out of the CoovaAP firmware), but none is totally desirable.
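The interface arrangement in this update (CoovaChilli serving only the LAN side, wireless left alone) can be sketched as a small generator for the relevant chilli.conf lines. This is a sketch under assumptions, not the configuration from the test node: the bridge name `br-lan`, the addresses, and the RADIUS secret are placeholders, and the option names (`dhcpif`, `net`, `uamlisten`, `radiusserver1`, `radiussecret`) should be checked against the chilli.conf man page of the installed version.

```python
def render_chilli_conf(lan_if, net, uamlisten, radius_server, radius_secret):
    """Render a chilli.conf fragment for a node whose LAN side is captive.

    Only the LAN bridge is handed to CoovaChilli; the wireless interface is
    deliberately absent so it stays free for the mesh backhaul.
    """
    options = [
        ("dhcpif", lan_if),            # interface CoovaChilli serves DHCP on
        ("net", net),                  # subnet handed out to wired clients
        ("uamlisten", uamlisten),      # address the portal listens on
        ("radiusserver1", radius_server),
        ("radiussecret", radius_secret),
    ]
    return "\n".join(f"{key} {value}" for key, value in options) + "\n"


# All values below are placeholders chosen for illustration.
conf = render_chilli_conf(
    lan_if="br-lan",
    net="10.1.0.0/24",
    uamlisten="10.1.0.1",
    radius_server="192.168.1.1",
    radius_secret="changeme",
)
print(conf)
```

Generating the fragment from one function, rather than editing it by hand on each node, would also fit the single-configuration goal discussed in the next section.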
Single-configuration of Mesh Nodes
One of the major goals of this project is to have a single interface from which we can make changes to all of the mesh nodes. There are several approaches to this that I can see:
- There has been the suggestion of simply designing our own web interface that directly modifies the config files of every node. I am a little concerned about how robust and modular this solution would be. If we want to change a component, we need to redesign the web interface, and if a component's config file format changes (say, between versions, when we want to upgrade to get some new feature or patch a security hole) then the web interface could break. Furthermore, building it would be tedious, although certainly not impossible. In keeping with the design goal of reusing existing tools (not reinventing the wheel), I think we may be better off with a solution that doesn't require us to develop this heavily for one specific task unless we absolutely need to.
- Alternatively, Orangemesh provides this functionality through ROBIN, but for the reasons mentioned above it doesn't really suit our purposes. We could try to modify it or otherwise use their update system (which they must have, but which I'm unfamiliar with).
- Lastly, I have a nascent idea that rather than updating the mesh nodes by simply modifying their configuration files, we could design a small utility that would run on the mesh node connected to the server. This mesh node would have everything we need on it (e.g. OpenWRT, BATMAN, possibly CoovaChilli, etc.). After it was satisfactorily configured, hopefully using existing interfaces, this utility would essentially make an image of the entire node and then upload this image to every other node in turn using private-key based SSH file transfers. Each node, upon receiving such an image, would write it to disk and reboot. This is certainly a little slower than directly modifying configuration files, but I think it has the potential to be a simpler, more robust, and much more modular solution (one could even add whole packages to the master node and then have those changes propagate throughout the network). The feasibility of this approach isn't certain, but I think it's worth further investigation.
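The per-node transfer in the last option could look roughly like the sketch below, which only builds the command lines rather than running them. Everything specific here is an assumption made for illustration: the `root` login, the key and image paths, the node addresses, and in particular `/usr/sbin/apply-image`, a hypothetical script standing in for whatever would actually write the image and reboot.

```python
import shlex


def push_commands(image_path, node, key_path, remote_path="/tmp/node-image.bin"):
    """Build the two commands that push an image to one node and apply it.

    Nothing is executed here; the caller decides how to run the commands.
    """
    target = f"root@{node}"
    # Key-based copy of the image from the master node to the remote node.
    copy = ["scp", "-i", key_path, image_path, f"{target}:{remote_path}"]
    # Hypothetical remote script that writes the image and reboots.
    apply_cmd = ["ssh", "-i", key_path, target,
                 f"/usr/sbin/apply-image {shlex.quote(remote_path)}"]
    return [copy, apply_cmd]


# Placeholder node addresses; a real list would come from the mesh itself.
for node in ["10.0.0.2", "10.0.0.3"]:
    for cmd in push_commands("/tmp/master.img", node, "/etc/mesh/id_rsa"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping command construction separate from execution makes it easy to log or dry-run an update before touching any node.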
How a flash-based configuration solution might work
A discussion with David Johnson made it seem like this is possible. Some specifics: The master node (the node connected to the server) gets configured. Once it looks good, the flash distribution utility is run. It commits all changes and writes an image file based on the master node's disk. The master node then goes through the list of every node it knows about (i.e. the rest of the mesh) and, using a private SSH key, initiates a script on each of those nodes. That script fetches the image from the master node, does some basic sanity checks on it (checksum, etc.), then sends a message to the master node saying it's ready. When the master node has received a ready message from every node in its list, it sends out another command telling each node to execute the change. If it doesn't get a message from every node, it informs the user which ones are missing, indicating that some human intervention is needed at those nodes. Every node, upon receiving the execute command, waits twice the maximum transmission time or so, then writes the change to disk and reboots.
Furthermore (and this is borrowed from the FreeBSD guys), each node keeps the old image around (it should have enough RAM to do so) and, upon reboot, has a watchdog timer running. If it can't find the mesh after a few minutes, it reverts to the old image and reboots again. This is probably the trickiest part, but the entire system seems fairly simple. It'll take a little while for the entire process to go through, but many configuration changes can be lumped into one update, so for normal operation this should be fine.
A further feature that was discussed is the possibility of all nodes, at some set time (say midnight each night), switching to a certain "update" SSID and channel and checking the master node for a firmware version more recent than the one they are running. Most nodes will be running the current firmware and will immediately switch back to their normal SSID and channel, resulting in little downtime. Any "lost" nodes, however, will download the correct firmware, thereby "rejoining the herd," so to speak.
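The prepare / execute / revert sequence described above is essentially a two-phase commit, and a minimal in-memory simulation of it might look like this. It is a sketch, not the utility itself: the `Node` class stands in for a remote node that would really be driven over SSH, SHA-256 is an assumed choice for the "checksum", and the wait before writing and the reboot are left out.

```python
import hashlib


class Node:
    """Stand-in for a remote mesh node (a real one would be reached over SSH)."""

    def __init__(self, name, reachable=True, corrupt=False):
        self.name = name
        self.reachable = reachable
        self.corrupt = corrupt      # simulate a damaged transfer
        self.running = b"old-image"
        self.staged = None
        self.previous = None

    def prepare(self, image, expected_sha256):
        """Phase 1: fetch the image, sanity-check it, report readiness."""
        if not self.reachable:
            return False
        received = image[:-1] if self.corrupt else image
        if hashlib.sha256(received).hexdigest() != expected_sha256:
            return False
        self.staged = received
        return True

    def execute(self):
        """Phase 2: switch to the staged image, keeping the old one around."""
        self.previous, self.running, self.staged = self.running, self.staged, None

    def watchdog(self, mesh_found):
        """After reboot: revert to the old image if the mesh cannot be found."""
        if not mesh_found and self.previous is not None:
            self.running = self.previous


def distribute(image, nodes):
    """Return the names of nodes that were not ready; execute only if none."""
    digest = hashlib.sha256(image).hexdigest()
    missing = [n.name for n in nodes if not n.prepare(image, digest)]
    if not missing:
        for n in nodes:
            n.execute()
    return missing


mesh = [Node("node-a"), Node("node-b", reachable=False)]
print(distribute(b"new-image", mesh))  # names needing human intervention
```

Because nothing is executed unless every node reports ready, a failed transfer to one node leaves the whole mesh on the old image instead of splitting it across two versions.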