Scaling telemetry monitoring with InfluxDB
User expectations for software applications keep rising. Nowadays, services are expected to be highly reliable and perform well 24/7. Any kind of downtime is going to result in frustrated users and hurt your business long-term.
A key component in improving reliability is monitoring your application. Setting up basic monitoring is easy, but scaling it efficiently as traffic to your service grows is a major challenge. You also want visibility into every metric that matters to your service, and you want the data you collect to be useful and actionable, which means being able to query and analyze it efficiently, in real time, on demand.
In short, there’s a big difference between the problems you run into throwing something together for a side project or small-scale system and the problems of deploying telemetry monitoring at scale in a production environment.
One team at Cisco experimented with InfluxDB to create an example of a scalable telemetry monitoring architecture that other companies with large-scale production environments could draw on, without having to start from scratch. This setup allowed Cisco to scale its telemetry data ingestion up to 3TB per day (around 2GB per minute). At the core of this architecture are Cisco IOS-XR and InfluxDB.
Cisco telemetry monitoring architecture overview
There are three main components in Cisco’s telemetry architecture. The first is the Cisco hardware running IOS-XR, which produces the telemetry data. The second is the collector agent, which takes in that data and sends it to the final component for storage, which is handled by InfluxDB.
Cisco IOS-XR
IOS-XR is the network operating system that runs on Cisco’s carrier-grade routing platforms. One particularly relevant feature for this architecture is that IOS-XR provides integrated streaming telemetry to increase network visibility, along with APIs that engineers can use to take action based on the telemetry data.
For this architecture, Cisco streamed data from three different IOS-XR platforms: the NCS 5500, the ASR 9000, and the 8000 series router. The devices were configured to run in dial-out mode, streaming self-describing GPBs (Google Protocol Buffers) over a TCP connection. One of the key considerations at this stage is making sure the architecture doesn’t collect more data than it needs, both in the number of metrics collected and in how frequently they are sampled.
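To make the transport concrete, here is a minimal sketch of how a collector might read one framed telemetry message from a dial-out TCP connection. The 12-byte header layout (message type, encapsulation, header version, flags, payload length) is an assumption based on commonly documented IOS-XR model-driven telemetry dial-out framing, so verify it against your IOS-XR release; decoding the self-describing GPB payload itself is omitted.

```python
# Sketch: reading one framed message from an IOS-XR dial-out TCP stream.
# The 12-byte header (type, encapsulation, header version, flags, payload
# length; all big-endian) is an assumption, not a guaranteed wire format.
import socket
import struct

HEADER_FORMAT = ">HHHHI"                     # 2+2+2+2+4 = 12 bytes, big-endian
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

def read_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the connection, or fail if the router disconnects."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("router closed the dial-out connection")
        buf += chunk
    return buf

def read_telemetry_message(conn: socket.socket) -> bytes:
    """Return the raw self-describing GPB payload of the next framed message."""
    header = read_exact(conn, HEADER_SIZE)
    msg_type, encap, hdr_version, flags, length = struct.unpack(HEADER_FORMAT, header)
    return read_exact(conn, length)          # the GPB-encoded telemetry payload
```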
Collector agent
The telemetry data from the IOS-XR hardware was sent to a load balancer, which distributed the data across three different collector agents. At large scale, a single-threaded collector cannot handle the volume of data being sent to it. Multithreaded collectors have their own issues, because each thread uploads to the database over a separate connection, which creates a different set of problems.
To get around these problems, Cisco wrote a multi-processing collector agent, the code for which is open source on GitHub. The collector agent’s main process is decoupled from the worker pool, which parses the data and uploads it to InfluxDB. The main process adds data to a queue as it is streamed in and then hands it to the worker pool in batches. Thanks to this decoupled architecture, the collector agent can handle gigabytes of data per second while remaining reliable.
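The sketch below illustrates that decoupling. It is not Cisco’s open source agent, just the general pattern under a few assumptions: a hypothetical parse_payload() decoder, placeholder host and database names, and the influxdb Python client for the uploads. The receiving process only enqueues raw payloads; the worker pool does the parsing and the batched writes.

```python
# Rough sketch of a decoupled collector: the main process only enqueues raw
# telemetry payloads, while a pool of worker processes parses them and writes
# batches to InfluxDB. parse_payload(), host names, and sizes are placeholders.
from multiprocessing import Process, Queue

from influxdb import InfluxDBClient          # pip install influxdb

NUM_WORKERS = 4                              # size of the worker pool
BATCH_SIZE = 5000                            # points per bulk write to InfluxDB

def parse_payload(raw: bytes) -> list:
    """Hypothetical decoder: turn one telemetry message into InfluxDB point dicts."""
    raise NotImplementedError

def worker(queue: Queue) -> None:
    client = InfluxDBClient(host="influxdb.example.com", database="telemetry")
    batch = []
    while True:
        raw = queue.get()                    # blocks until the main process enqueues data
        batch.extend(parse_payload(raw))
        if len(batch) >= BATCH_SIZE:
            client.write_points(batch)       # one bulk write instead of many small ones
            batch = []

def main() -> None:
    queue: Queue = Queue()
    for _ in range(NUM_WORKERS):
        Process(target=worker, args=(queue,), daemon=True).start()
    # The receiving loop (for example, the dial-out reader sketched earlier)
    # runs here and simply calls queue.put(raw_payload) for each framed message,
    # so slow parsing or slow writes never block the network-facing process.

if __name__ == "__main__":
    main()
```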
InfluxDB
The final piece of the telemetry architecture is InfluxDB, which is used to store the data. For this experiment, InfluxDB was deployed with two data nodes and three meta nodes, forming a cluster for improved reliability and performance.
InfluxDB is a purpose-built time series database designed to handle massive volumes of time-stamped data, which made it a perfect fit for Cisco’s telemetry monitoring use case. InfluxDB also works well for any workload that requires writing large amounts of data and querying that data in real time. Common use cases include IoT, analytics, and application monitoring.
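As a small illustration of that write-then-query pattern (with placeholder host, database, measurement, and tag names, not Cisco’s actual schema), the snippet below writes a time-stamped point with the influxdb Python client and immediately queries it back with InfluxQL.

```python
# Illustration of writing and querying time series data with the influxdb
# Python client. Host, database, measurement, and tag names are placeholders.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="influxdb.example.com", port=8086, database="telemetry")

# Write a time-stamped point: a measurement name, tags for metadata, fields for values.
client.write_points([{
    "measurement": "interface_counters",
    "tags": {"router": "ncs5500-1", "interface": "HundredGigE0/0/0/0"},
    "fields": {"bytes_received": 1234567, "bytes_sent": 7654321},
}])

# Query recent data back with InfluxQL, grouped per router in 30-second buckets.
result = client.query(
    "SELECT mean(bytes_received) FROM interface_counters "
    "WHERE time > now() - 5m GROUP BY time(30s), router"
)
print(list(result.get_points()))
```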
For its use case, Cisco made a few changes to InfluxDB’s standard configuration. The first was increasing the default cache (buffer) memory size: because the collector agent writes data in large batches, InfluxDB needs more memory set aside to hold that data while it is being persisted. At the cluster level, Cisco also chose to allow out-of-order replica writes between nodes, which allows more flexibility in the relationship between the order in which data arrives and the timestamps on the points.
Scaling telemetry data is a difficult task that many companies have tried to solve on their own. Cisco’s goal in this experiment was to provide a blueprint architecture for other companies to follow so that they don’t have to reinvent the wheel for their own use case. A core part of Cisco’s solution was InfluxDB because of its performance, ease of use, and open source code base.
Sam Dillard is senior product manager of IoT and enterprise at InfluxData.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to [email protected].