Datawarehousing for Cloud Computing Metrics

Besides working as a systems engineer in the area of cloud computing, I'm currently studying information technology with a focus on computer engineering. We're a group of 16 students who need to form teams of two to five people and write our thesis as a group. I will probably lead a three-person team for the project. One possible topic I came up with is:

Datawarehousing for Cloud Computing Metrics

Description of the topic:
$company is running several Linux hypervisors with many virtual machines. We want to log usage statistics such as CPU time, IO, RAM usage, disk usage and network traffic (for example with collectd and its virt plugin), send these values to a database, transform the historical values into trends by summarising them, and make all data accessible via an API and a web interface.


  • Provide in-depth statistics for every VM and VM owner, which would allow per-usage billing (for example with Grafana and Cyclops)
  • Create an API where you can input the specs for a virtual machine you want to create, and our data warehouse will find the most suitable node for you based on its recent usage
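To make the second idea concrete, here is a minimal placement sketch in Python. The field names (`cpu_free`, `ram_free`, …) and the headroom score are assumptions for illustration, not a fixed schema:

```python
def pick_node(nodes, spec):
    """Return the node that fits `spec` with the most relative headroom left.

    `nodes` maps node name -> recent usage snapshot, `spec` is the
    requested VM size. Returns None if no node fits.
    """
    fits = {name: n for name, n in nodes.items()
            if n["cpu_free"] >= spec["cpu"] and n["ram_free"] >= spec["ram"]}
    if not fits:
        return None

    def headroom(n):
        # smallest remaining fraction across resources after placement
        return min((n["cpu_free"] - spec["cpu"]) / n["cpu_total"],
                   (n["ram_free"] - spec["ram"]) / n["ram_total"])

    return max(fits, key=lambda name: headroom(fits[name]))

nodes = {
    "n1": {"cpu_free": 8,  "cpu_total": 32, "ram_free": 64,  "ram_total": 256},
    "n2": {"cpu_free": 16, "cpu_total": 32, "ram_free": 128, "ram_total": 256},
}
best = pick_node(nodes, {"cpu": 4, "ram": 16})  # → "n2"
```

A real scheduler would of course weigh historical load, IO and network as well, which is exactly what the data warehouse is supposed to provide.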

Project scope:
We have a time frame of at least 120 hours per group member, so 360 hours for the complete project. We will meet with $company in person for kickoff and milestone meetings and work one to three weeks in their office. Most of the time, however, we will work remotely, because we still have our regular jobs plus evening classes in Germany.

This is a huge project, so we need to focus on specific aspects. These could be:

  • Which use cases exist for the data warehouse, which metrics suit these cases, how do we get them, how long do we have to keep them and how often do we have to poll them?
  • What is the best solution for aggregating values?
  • What are the requirements for our database, and which DBMS meets them (document-based, relational, time series, graph database)?
  • Which information can we get from our database and how do we use it?
  • How do we provide information?

1. There are many tools to collect information on a Linux node. I already mentioned collectd; alternative solutions are sysstat and atop, and it would also be possible to write our own collector. An important point to think about is: which information do we actually need? Many people like to save everything they can get "because I may use it later". But in a huge setup with 35,000 virtual machines, collecting samples every 10 seconds and keeping them for a few months, or for the complete life cycle of a virtual machine, creates an enormous amount of data (and may also slow down the database). Depending on the storage type (cheap disks or more expensive SSDs), it is worth thinking about the number of metrics and whether you really need all of them. We also need to compare the metrics the different tools offer, since each offers a different set: do we need all metrics from all tools?
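To get a feeling for the numbers above, a quick back-of-the-envelope calculation. The VM count and poll interval come from the text; the metric count and bytes per sample are assumed values:

```python
# Rough raw-sample volume for 35,000 VMs polled every 10 seconds.
VMS = 35_000
POLL_INTERVAL_S = 10
METRICS_PER_VM = 20       # assumed: CPU, IO, RAM, disk, network, ...
BYTES_PER_SAMPLE = 16     # assumed: timestamp + value, uncompressed

samples_per_day = VMS * METRICS_PER_VM * (86_400 // POLL_INTERVAL_S)
bytes_per_day = samples_per_day * BYTES_PER_SAMPLE

print(f"{samples_per_day:,} samples/day")            # 6,048,000,000
print(f"{bytes_per_day / 1024**3:.1f} GiB/day raw")  # ~90.1 GiB/day
```

Even with these modest assumptions that is billions of samples per day before compression, which makes the retention and polling-frequency questions very real.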

2. The central collecting and aggregation service is the core part of this project. There are existing solutions to this problem, for example Logstash with its ganglia or udp input plugins. Another option is Riemann, an event stream processor that handles many different kinds of event streams, combines them and triggers actions based on them. It would also be possible to write our own service in C/C++, Ruby or Clojure. The basic requirements for all solutions are: listen on an interface for incoming values (or pull them from the nodes), possibly aggregate them in some way, and write them into $database.
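The "listen on an interface" requirement can be sketched in a few lines of Python. The plain-text wire format (`host metric value timestamp`) is an assumption for illustration; collectd's own network protocol is binary, though its default port 25826 is reused here:

```python
import socket

def parse_line(line: str):
    """Parse a plain-text sample, e.g. 'node01 cpu.user 42.5 1700000000'."""
    host, metric, value, ts = line.split()
    return {"host": host, "metric": metric,
            "value": float(value), "time": int(ts)}

def serve(store, host="0.0.0.0", port=25826):
    """Listen for UDP datagrams and append parsed samples to `store`.

    A real service would batch the samples and flush them to $database
    instead of keeping them in a list.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, _addr = sock.recvfrom(4096)
        store.append(parse_line(data.decode()))
```

The interesting engineering is in everything this sketch leaves out: batching, aggregation, and spooling when the database is down (see point 3).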

3. Things to think about: do we need a distributed search and analytics engine like Elasticsearch, or a distributed NoSQL setup with Cassandra, or is it okay to work with a single relational database like Postgres? Another option is a time series database (a time series is a sequence of related values over time, for example the CPU temperature measured every 30 seconds over a period of 10 minutes) such as OpenTSDB, which also has a built-in API for retrieving values. Network links inside a data centre can be considered stable, so faults are the exception. Is it worth building a huge cluster if the possible downtime is only a few minutes (for example during maintenance work)? Or is a distributed setup needed because of IO bottlenecks? Should the central collecting service spool values if the database is unavailable? In a large environment, capacity planning is also important: how efficiently can the database compress and store the values? Which system has the lowest memory/CPU usage per saved and processed value (this also matters for the software that processes the values, see point 2), and how do you measure that efficiency?
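Whatever database we pick, turning raw history into trends is essentially downsampling. A minimal sketch, where the bucket size and the min/avg/max aggregates are assumptions:

```python
from collections import defaultdict
from statistics import mean

def rollup(samples, bucket_s=3600):
    """Downsample raw (timestamp, value) pairs into per-bucket
    min/avg/max trend rows."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_s].append(value)  # align to bucket start
    return {start: {"min": min(v), "avg": mean(v), "max": max(v)}
            for start, v in sorted(buckets.items())}

# three CPU samples inside one hour collapse into a single trend row
trend = rollup([(1000, 10.0), (1600, 30.0), (2200, 20.0)], bucket_s=3600)
# → {0: {"min": 10.0, "avg": 20.0, "max": 30.0}}
```

Time series databases like OpenTSDB do this kind of aggregation at query time; doing it once at write time trades flexibility for much cheaper long-term storage.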

4. We can get detailed usage statistics from our database and use them for per-usage billing; it is also possible to find the best node for a new virtual machine. Another idea is to forecast usage based on historical data. For example, if one machine always had high IO usage around Christmas, our service could warn you before Christmas that this machine will probably produce high IO again and offer an alternative node with more IO capacity. We can also analyse the current usage of the whole platform and of each node, and optimise the platform by recommending virtual machine migrations that increase packing density (free up as many host systems as possible, optionally restricted to a list of allowed source and/or destination nodes). A last idea: you often have to do maintenance work and exchange old hardware for new. We could create an algorithm that accepts any node as the source (and an optional list of allowed destination nodes) and outputs suitable migration destinations, with the goal of migrating as fast as possible.
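The maintenance-evacuation idea could start from a simple first-fit-decreasing heuristic. This sketch only considers RAM, which is a deliberate simplification; a real planner would weigh CPU, IO and network history too:

```python
def plan_evacuation(vms, destinations):
    """Greedy plan to empty one host: place its largest VMs first,
    each onto the destination with the most free RAM.

    `vms` maps VM name -> RAM (GB), `destinations` maps node -> free RAM.
    Returns (plan, leftover): migration pairs and VMs that did not fit.
    """
    free = dict(destinations)
    plan, leftover = [], []
    for vm, ram in sorted(vms.items(), key=lambda kv: -kv[1]):
        target = max(free, key=free.get)
        if free[target] >= ram:
            free[target] -= ram
            plan.append((vm, target))
        else:
            leftover.append(vm)
    return plan, leftover

plan, leftover = plan_evacuation(
    {"vm1": 32, "vm2": 16, "vm3": 8},
    {"nodeA": 40, "nodeB": 24},
)
# → plan = [("vm1", "nodeA"), ("vm2", "nodeB"), ("vm3", "nodeA")], leftover = []
```

Spreading the placements across destinations like this also serves the "migrate as fast as possible" goal, since migrations to different targets can run in parallel.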

5. We need different ways to access the data. One web interface already mentioned is Grafana; an alternative is Graphite. Both are fine for end users who want to look at stats, but what about system administrators? They need a working API to interact with. Which is the best data serialisation format? Do we need to expose graphs or the raw history/trends? How does the API need to be structured?
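One possible (assumed, not decided) answer to the serialisation question is JSON. The endpoint path and the response shape below are hypothetical, just to make the trade-off concrete: expose only the summarised trends, not the raw history:

```python
import json

def trend_response(vm, metric, rows):
    """Build the body for a hypothetical
    GET /v1/vms/<vm>/trends/<metric> endpoint."""
    return json.dumps({
        "vm": vm,
        "metric": metric,
        "unit": "percent",  # assumed; a real API would carry this per metric
        "trends": [{"start": ts, **agg} for ts, agg in rows],
    })

body = trend_response("vm42", "cpu.user",
                      [(0, {"min": 10.0, "avg": 20.0, "max": 30.0})])
```

Returning pre-aggregated trends keeps responses small; administrators who need raw samples would be better served by a separate, explicitly expensive endpoint.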
