So what can be done to tackle the network monitoring challenges?

In the “network monitoring is a commodity” myth post, I argued that network monitoring is far from being a commodity and, on the contrary, needs innovation to cope with increasing complexity.

As cote mentioned in the comments of that post, there has been some fresh blood in the IT management industry. Several open source companies/projects are tackling the monitoring problem, which is a good thing, yet I feel we’re still missing some pieces. AFAIK, most of the monitoring solutions seem to follow existing paradigms:

  • monitoring the devices (nodes) through SNMP agents
  • synthetic transactions to determine the status of services running on nodes

An understanding of the network topology is missing in both paradigms. In other words, nodes are what’s being monitored, not the network. The network topology (except layer 3) is largely unknown, which limits the effectiveness of the monitoring. Monitoring tools (or rather the functionality offered by the tools) can be broadly categorized as follows:

  • Polling the devices: The most common approach in IP networks. Most IP networking devices have an SNMP agent that supports at least MIB-II, so basic availability and performance information can be obtained. For more detailed information, however, proprietary MIBs are needed. Many IT management guys have spent long hours trying to understand these MIBs, figuring out which data is where, compiling them for use by their monitoring tools, etc.
  • Listening for exceptions: Not every network device has an agent that can be polled, especially in the layers below IP. And even when one is available, the ability to listen for information is useful as it can be more immediate. In IP networks, these are typically SNMP traps or syslog events. In others, there are often element managers that convey messages. Again, IT management folks have spent countless, often frustrating hours trying to make sense of the traps, syslog events, etc.: normalizing them, translating them into human language, identifying what is important and what’s not.
  • Listening to the pipes: It is possible to learn a lot by listening to what goes on in the network. Flow tools (NetFlow and its kin cFlow, J-Flow, NetStream, sFlow, etc.) generate end-to-end traffic statistics based on the flow of data through the network devices that support them. Another approach is to analyze the traffic going through a device using a span port, although it seems this method is mainly popular for analyzing application traffic. I don’t have a lot of personal experience with these tools, so I’ll leave it to others to explain them better or correct me. From what I see, these tools often require hardware distributed throughout the network to get full visibility, which may be a hurdle for adoption.

IMHO, all of the approaches I’ve tried to summarize above have some shortcomings. As far as I can see, the situation may improve in two ways:

  • Someone may come up with a new technology, a clever way to monitor the network and identify the problems, maybe discover & represent the network, etc. IMO, this can only happen if some of the investment and attention going to tools that target “business users” with sexy, shiny UIs flows back to the muck. When the payoff is so low (who wants to tackle a “commodity” problem?), significant investment is not likely.
  • The power of the community is harnessed to solve tedious problems once and share the results, rather than each user struggling to solve the same problems over and over independently. There are already some examples of this: Splunk is attempting to create a repository of log events and what they mean. The ZipTie open source project is working on solving device configuration through collaboration of vendors and customers (how come they are not a member?)

There is a lot more that can be done in the monitoring realm if we can manage to set up the right collaboration platform (commercially and legally as well as technically) to facilitate sharing, which is sorely lacking in IT management for whatever reason.

From what I can see, the ZipTie model is particularly interesting and suitable. The ability to collaborate and share is potentially a major competitive advantage for open source projects. I believe there are opportunities here for collaboration among open source projects/companies and their users/customers.

For example, in the case of discovery and representation of the network topology, the knowledge of how to get the topology data out of the vast number of different types of devices can be shared. If a common model can be defined to represent the topology, adapters to populate the model for each device type can be developed.
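
To make the idea a bit more concrete, here is a minimal sketch of what a common topology model with per-device adapters might look like. Everything in it is my assumption for illustration: the `Link` and `Topology` classes, the `TopologyAdapter` interface, and especially the `CsvNeighborAdapter` input format (`local_port,remote_device,remote_port` per line), which is made up and does not correspond to any real device output.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Link:
    """A connection between two (device, port) endpoints, stored in sorted order."""
    a_device: str
    a_port: str
    b_device: str
    b_port: str

    @classmethod
    def between(cls, end1: tuple[str, str], end2: tuple[str, str]) -> "Link":
        # Sorting the endpoints means the same cable reported from either
        # side produces an identical Link and is stored only once.
        first, second = sorted([end1, end2])
        return cls(first[0], first[1], second[0], second[1])


@dataclass
class Topology:
    """Vendor-neutral model: the devices seen and the links between them."""
    devices: set[str] = field(default_factory=set)
    links: set[Link] = field(default_factory=set)

    def add_link(self, link: Link) -> None:
        self.devices.update((link.a_device, link.b_device))
        self.links.add(link)

    def neighbors(self, device: str) -> set[str]:
        found = set()
        for link in self.links:
            if link.a_device == device:
                found.add(link.b_device)
            elif link.b_device == device:
                found.add(link.a_device)
        return found


class TopologyAdapter(Protocol):
    """What each device-specific adapter has to provide (hypothetical interface)."""
    def parse(self, device: str, raw: str) -> list[Link]: ...


class CsvNeighborAdapter:
    """Hypothetical adapter for a made-up 'local_port,remote_device,remote_port' format."""
    def parse(self, device: str, raw: str) -> list[Link]:
        links = []
        for line in raw.splitlines():
            if not line.strip():
                continue
            local_port, remote_device, remote_port = (p.strip() for p in line.split(","))
            links.append(Link.between((device, local_port), (remote_device, remote_port)))
        return links


def build_topology(raw_by_device: dict[str, str], adapter: TopologyAdapter) -> Topology:
    """Run one adapter over the raw neighbor data collected from each device."""
    topology = Topology()
    for device, raw in raw_by_device.items():
        for link in adapter.parse(device, raw):
            topology.add_link(link)
    return topology
```

The point of the split is that only the adapters need device-specific knowhow. The model, and anything built on top of it, stays the same no matter who contributed the adapter.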

In the case of trap and event log processing, the knowhow of what each trap may mean and what the varbinds are can be shared. And again, if a common model can be defined to represent the traps/events, adapters to convert the traps into the common model can be developed.
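
As a rough sketch of the same idea for traps: a shared catalog maps trap OIDs to a meaning, and a small function converts a raw trap into a common event. The `Event` class, the `TRAP_CATALOG` structure, the severity labels and the `normalize_trap` function are all my assumptions for illustration. The OIDs are the standard linkDown/linkUp traps and the IF-MIB ifIndex column.

```python
from dataclasses import dataclass

# Standard SNMP notification OIDs and the IF-MIB ifIndex column prefix.
LINK_DOWN = "1.3.6.1.6.3.1.1.5.3"
LINK_UP = "1.3.6.1.6.3.1.1.5.4"
IF_INDEX_PREFIX = "1.3.6.1.2.1.2.2.1.1."


@dataclass(frozen=True)
class Event:
    """Common, human-readable event model (hypothetical)."""
    source: str
    kind: str
    severity: str
    summary: str


# The shareable part: what each trap means. Severity labels are assumptions.
TRAP_CATALOG = {
    LINK_DOWN: ("link_down", "major", "Interface {if_index} went down"),
    LINK_UP: ("link_up", "info", "Interface {if_index} came up"),
}


def normalize_trap(source: str, trap_oid: str, varbinds: dict[str, object]) -> Event:
    """Convert a raw trap (OID plus varbinds) into the common Event model."""
    entry = TRAP_CATALOG.get(trap_oid)
    if entry is None:
        # Unknown traps are kept, not dropped, so someone can catalog them later.
        return Event(source, "unknown", "unknown", f"Unrecognized trap {trap_oid}")

    # Pick the interface index out of the varbinds, if the trap carries one.
    if_index = next(
        (str(value) for oid, value in varbinds.items() if oid.startswith(IF_INDEX_PREFIX)),
        "unknown",
    )
    kind, severity, template = entry
    return Event(source, kind, severity, template.format(if_index=if_index))
```

With something like this, adding support for a vendor-specific trap is mostly a matter of contributing one more catalog entry, which is exactly the kind of work that only needs to be done once.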

I think these problems are naturally suited to being solved through collaboration, and life in the trenches would improve significantly if we were tackling them together instead of drowning in them alone.
