Historical monitoring on AWS

An overview of my approach to using AWS CloudWatch as a historical monitoring platform

Tamás Sallai

Motivation

When running several production services, there are many point-in-time metrics of interest. While these can be calculated for the present instant, their dynamic properties are hard to measure this way. For some metrics there are easy-to-use tools, like Google Analytics for page visits; others are much harder to graph. For example, counting Facebook comments or G+ +1s requires several HTTP calls.

Amazon Web Services CloudWatch provides a way to store arbitrary monitoring data and an API to retrieve its history. In this post I'll cover an effective and portable way to collect point-in-time data, and a way to retrieve and graph it using JavaScript.

Data collection

AWS provides several APIs to access CloudWatch, supporting a wide range of technology stacks. For my use case I used a simple Bash script to collect and upload the data, packaged into a Docker container.

Docker setup

I've used a simple Dockerfile that installs some dependencies and the AWS CLI bundle, does some configuration, then runs the monitoring script every hour. This is all boilerplate.

FROM ubuntu:15.04

# Update and install in a single layer so a stale cached package index is never reused
RUN apt-get update && apt-get install -y python curl unzip wget

RUN curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip" \
	&& unzip awscli-bundle.zip \
	&& ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws

ADD aws_config /root/.aws

ADD scripts /root

RUN chmod +x /root/*.sh

CMD while true; do /root/put_monitor.sh; sleep 3600; done

My aws_config/config looks like this:

[default]
output = json
region = eu-west-1

And the aws_config/credentials is:

[default]
aws_access_key_id = <<Access key>>
aws_secret_access_key = <<Secret key>>

Don't forget to generate an AWS Access Key, and attach a policy that allows putting monitoring data:

{
	...
	"Action": [
		"cloudwatch:PutMetricData"
	],
	"Effect": "Allow",
	"Resource": [
		"*"
	]
}

The monitoring script

This is the part where things get a bit more interesting. The upload command is a simple one-liner, but calculating the metric value can be quite tricky. For a basic reference, this is a sample, not-so-useful script that uploads the current time:

#!/bin/bash

# Seconds since the Unix epoch
seconds=$(date +%s)

aws cloudwatch put-metric-data --metric-name "CurrentTime" --namespace "Testing" --value "$seconds" --unit "Count"
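
The same upload could also be done from JavaScript rather than the CLI. The following is only a sketch of that alternative, not what the Docker setup above uses. It assumes the AWS SDK for JavaScript (v2), whose CloudWatch client has a putMetricData method; buildPutMetricParams is a hypothetical helper introduced here for illustration.

```javascript
// Build the parameter object for CloudWatch's putMetricData call.
// Mirrors the CLI example: a single datapoint with the unit "Count".
// (buildPutMetricParams is a hypothetical helper, not part of the SDK.)
function buildPutMetricParams(namespace, metricName, value) {
	return {
		Namespace: namespace,
		MetricData: [
			{ MetricName: metricName, Value: value, Unit: "Count" }
		]
	};
}

// Usage, assuming the SDK is loaded and configured with a key that
// is allowed to call cloudwatch:PutMetricData:
//
// new AWS.CloudWatch().putMetricData(
// 	buildPutMetricParams("Testing", "CurrentTime", Math.floor(Date.now() / 1000)),
// 	function (err) { if (err) { console.error(err, err.stack); } }
// );
```

Keeping the parameter construction separate from the network call makes it easy to check without touching AWS.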

For some practical examples, let's count some social metrics for a site with a URL-list sitemap.

To sum the Facebook comment counts:

fb_comments=0
while read -r line ; do
   # "// 0" makes jq print 0 when the URL has no share data
   count=$(curl -s -G --data "fields=share{comment_count}" --data-urlencode "id=$line" https://graph.facebook.com/v2.2/ | jq -r '.share.comment_count // 0')
   fb_comments=$((fb_comments + ${count:-0}))
done < <(curl -s https://advancedweb.hu/sitemap.txt | grep -v "^$")

The same for Twitter shares:

twitter_shares=0
while read -r line ; do
   count=$(curl -s -G --data-urlencode "url=$line" http://cdn.api.twitter.com/1/urls/count.json | jq -r '.count // 0')
   twitter_shares=$((twitter_shares + ${count:-0}))
done < <(curl -s https://advancedweb.hu/sitemap.txt | grep -v "^$")

And lastly for G+ +1s, which is a bit trickier:

gplus_plusones=0
while read -r line ; do
   # The count is scraped from the button's HTML; an empty result counts as 0
   count=$(wget -qO- "https://plusone.google.com/_/+1/fastbutton?url=$line&count=true" | grep -o '<div id="aggregateCount[^>]*>[^<]*</div>' | grep -o '>[0-9]*<' | grep -o '[0-9]*')
   gplus_plusones=$((gplus_plusones + ${count:-0}))
done < <(curl -s https://advancedweb.hu/sitemap.txt | grep -v "^$")

Data retrieval

As with the data collection, AWS provides many SDKs to retrieve the data, so you'll probably find a suitable library for the language of your choice. For this example, I'll be using the JavaScript one. In the example code, which you can find on GitHub, I've used D3.js for plotting and hash parameters for all the configuration, including the AWS Access and Secret keys, in order to make it a general-purpose (and iframe-embeddable) tool.
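
To illustrate the hash-parameter idea, here is a minimal sketch that turns a URL fragment into a configuration object. The parameter names used in it (region, namespace, and so on) are assumptions made for this example and may differ from the ones in the GitHub code.

```javascript
// Parse a URL fragment such as "#region=eu-west-1&namespace=Testing"
// into a plain object. The key names are hypothetical.
function parseHashParams(hash) {
	// Drop the leading "#", if there is one
	var query = hash.charAt(0) === "#" ? hash.slice(1) : hash;
	var params = {};
	new URLSearchParams(query).forEach(function (value, key) {
		params[key] = value;
	});
	return params;
}

// In a browser: var config = parseHashParams(window.location.hash);
```

Values should be percent-encoded (for example with encodeURIComponent) when the URL is built, since URLSearchParams decodes a literal + as a space and AWS secret keys can contain that character.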

Retrieving monitoring data is pretty straightforward using the official library, as there are only a handful of required parameters. First, the library must be included in the page:

<script src="https://sdk.amazonaws.com/js/aws-sdk-2.1.12.min.js"></script>

Then some global configuration is needed: the region and the keys must be set on the global AWS.config object.

AWS.config.region = region;
AWS.config.update({accessKeyId: accessKeyId, secretAccessKey:secretAccessKey});

And then the actual call:

var cloudwatch = new AWS.CloudWatch();
cloudwatch.getMetricStatistics(
	{
		Namespace:Namespace,
		MetricName:MetricName,
		EndTime: new Date(),
		StartTime: new Date(new Date()-StartTimeMsAgo),
		Statistics: [Statistics],
		Period:Period,
		Unit:Unit
	},
	function(err,data){
		if (err){
			console.error(err, err.stack);
		} else{
			// Use the data
			// data.Datapoints is an array of objects, each with:
			//   Timestamp: the time of the datapoint
			//   <<Statistics>>: the actual value, for ex. Average
		}
	}
)
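
Before plotting, the datapoints usually need a little reshaping. As far as I know, the API does not promise to return them in chronological order, so the sketch below sorts them by timestamp and picks out the requested statistic. toSeries is a hypothetical helper name, not part of the SDK.

```javascript
// Convert the Datapoints array of a getMetricStatistics response into
// a chronologically sorted list of {time, value} pairs.
// `statistic` is the name that was requested in Statistics, e.g. "Average".
function toSeries(datapoints, statistic) {
	return datapoints
		.map(function (point) {
			return { time: new Date(point.Timestamp), value: point[statistic] };
		})
		.sort(function (a, b) {
			// Subtracting two Dates yields their difference in milliseconds
			return a.time - b.time;
		});
}

// Inside the callback above: var series = toSeries(data.Datapoints, Statistics);
```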

Also don't forget that you'll need to use an AWS key that has the required permission to read the metric statistics:

{
	...
	"Action": [
		"cloudwatch:GetMetricStatistics"
	],
	"Effect": "Allow",
	"Resource": [
		"*"
	]
}

At this point, you have all the data you need, and you can plot or use it however you'd like. In my example I've plotted it as a very simple line chart.
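
The example on GitHub uses D3.js for the chart; the following is not that code. It is a dependency-free sketch of the same idea, scaling a list of {time, value} points (a shape assumed for this example) into the "d" attribute of an SVG path. linePath is a hypothetical name.

```javascript
// Build the "d" attribute of an SVG <path> from {time, value} points.
// Assumes `series` is sorted by time; width and height are in pixels.
function linePath(series, width, height) {
	if (series.length === 0) {
		return "";
	}
	var times = series.map(function (p) { return p.time.getTime(); });
	var values = series.map(function (p) { return p.value; });
	var minT = Math.min.apply(null, times);
	var maxT = Math.max.apply(null, times);
	var minV = Math.min.apply(null, values);
	var maxV = Math.max.apply(null, values);
	// Avoid division by zero when all points share a time or a value
	var spanT = (maxT - minT) || 1;
	var spanV = (maxV - minV) || 1;
	return series.map(function (p, i) {
		var x = (p.time.getTime() - minT) / spanT * width;
		// SVG's y axis points down, so larger values get a smaller y
		var y = height - (p.value - minV) / spanV * height;
		return (i === 0 ? "M" : "L") + x.toFixed(1) + "," + y.toFixed(1);
	}).join(" ");
}
```

The resulting string can be assigned to the d attribute of a path element inside an svg element of the same width and height.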

Limitations of AWS

During this experiment, I've encountered several shortcomings in the AWS APIs that severely limit the usefulness of this monitoring approach. These are:

  1. Currently it is not possible to limit the GetMetricStatistics policy to a single metric; it is an all-or-nothing switch.
  2. There is no way to rate limit an access key. Currently, if an adversary obtains your AWS Access and Secret keys, she can flood the API with requests, which costs you money without limit.
  3. AWS CloudWatch retains data for 14 days, so that's the oldest point you can get back.

The effect of the first two is that you can't publicly disclose your AWS keys, as you might inadvertently disclose the statistics of other metrics (#1), and you would also expose your account balance to attack (#2). Of course, these can be mitigated by hosting an API and enclosing the keys inside it, but then you need to care about scaling it (for the data collection you already need a server, but it does not need to be scaled). Architecturally, it would be far better if you could just distribute the keys and it would just work.

The current data retention limit (#3) severely impacts the usefulness of historical graphing. While it allows the examination of short-term dynamic properties, it makes CloudWatch less useful as an analytical platform. It should be possible to retrieve older data, even for a fair price.
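
In practice this means a request should not reach back further than the retention window. With hourly collection, that is at most 14 × 24 = 336 datapoints per metric. The sketch below clamps a requested time range accordingly; the 14-day constant is taken from the limitation described above, and clampedRange is a hypothetical helper.

```javascript
// Retention window assumed from the text: 14 days, in milliseconds.
var RETENTION_MS = 14 * 24 * 60 * 60 * 1000;

// Return the StartTime/EndTime pair for a request reaching back
// `msAgo` milliseconds, but never further than the retention window.
// `now` is optional and defaults to the current time.
function clampedRange(msAgo, now) {
	var end = now || new Date();
	var span = Math.min(msAgo, RETENTION_MS);
	return { StartTime: new Date(end.getTime() - span), EndTime: end };
}
```

The returned object uses the same StartTime and EndTime names as the getMetricStatistics parameters, so its fields can be copied into the request.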

Conclusion

I started to examine CloudWatch hoping that it could be made into a universal and architecturally clean analytics solution. While it lives up to this promise in some ways, such as ease of use, it ultimately falls short on both counts. Its main use is to monitor instances and service health, and while it is very good at this, some important features are missing before it can be used for historical monitoring. That said, for visualizing short-term dynamics it is still a usable, easily scriptable and versatile solution.

June 23, 2015