Historical monitoring on AWS
This is an overview of my approach to using AWS CloudWatch as a historical monitoring platform.
Motivation
When running several production services, there are many point-in-time metrics of interest. While these can be calculated for the present instant, their dynamic properties are hard to measure this way. For some metrics there are easy-to-use tools, like Google Analytics for page visits, while other metrics are much harder to graph. For example, counting Facebook comments or G+ +1s requires several HTTP calls.
Amazon Web Services CloudWatch provides a way to store arbitrary monitoring data and an API to retrieve it historically. In this post I'll cover an effective and portable way to collect point-in-time data, and a way to retrieve and graph it using Javascript.
Data collection
AWS provides several APIs to access CloudWatch, supporting a wide range of technology stacks. For my use case I used a simple Bash script to collect and upload the data, packaged into a Docker container.
Docker setup
I've used a simple Dockerfile which installs some dependencies and the AWS CLI bundle, does some configuration, then runs the monitoring script every hour. This is the boilerplate part.
FROM ubuntu:15.04
RUN apt-get update
RUN apt-get install -y python curl unzip wget
RUN curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip" \
&& unzip awscli-bundle.zip \
&& ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
ADD aws_config /root/.aws
ADD scripts /root
RUN chmod +x /root/*.sh
CMD while true; do /root/put_monitor.sh; sleep 3600; done
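To get the collector going, the usual Docker workflow applies; the image name metrics-monitor here is just an illustrative choice:

```shell
# Build the image from the directory containing the Dockerfile,
# then start the collector as a detached container.
docker build -t metrics-monitor .
docker run -d --name metrics-monitor metrics-monitor
```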
My aws_config/config is like:
[default]
output = json
region = eu-west-1
And the aws_config/credentials is:
[default]
aws_access_key_id = <<Access key>>
aws_secret_access_key = <<Secret key>>
Don't forget to generate an AWS Access Key, and attach a policy that allows putting monitoring data:
{
  ...
  "Action": [
    "cloudwatch:PutMetricData"
  ],
  "Effect": "Allow",
  "Resource": [
    "*"
  ]
}
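One way to attach such a policy is with the AWS CLI (a sketch: the user name monitoring and the file name put-policy.json are illustrative, and the fragment above needs to be completed into a full policy document first):

```shell
# Attach an inline policy to the IAM user that owns the access key.
aws iam put-user-policy \
  --user-name monitoring \
  --policy-name put-metric-data \
  --policy-document file://put-policy.json
```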
The monitoring script
This is the part where things get a bit more interesting. The metric upload command is a simple one-liner, but how you calculate the metric value can be quite tricky. For basic reference, here is a sample (and not-so-useful) script that uploads the current time:
#!/bin/bash
seconds=`date +%s`
aws cloudwatch put-metric-data --metric-name "CurrentTime" --namespace "Testing" --value "$seconds" --unit "Count"
For some practical examples, let's count some social metrics from a site with a URL-list sitemap.
To sum the Facebook comments:
fb_comments=0
while read line ; do
let fb_comments=fb_comments+`curl -s -G --data "fields=share{comment_count}" --data-urlencode "id=$line" https://graph.facebook.com/v2.2/ | jq -r '.share.comment_count'`
done < <(curl -s https://advancedweb.hu/sitemap.txt | grep -v "^$")
The same for Twitter shares:
twitter_shares=0
while read line ; do
let twitter_shares=twitter_shares+`curl -s -G --data-urlencode "url=$line" http://cdn.api.twitter.com/1/urls/count.json | jq -r '.count'`
done < <(curl -s https://advancedweb.hu/sitemap.txt | grep -v "^$")
And lastly for G+ +1s, which is a bit more tricky:
gplus_plusones=0
while read line ; do
let gplus_plusones=gplus_plusones+`wget -qO- "https://plusone.google.com/_/+1/fastbutton?url=$line&count=true" | grep -o '<div id="aggregateCount[^>]*>[^<]*</div>' | grep -o '>[0-9]*<' | grep -o '[0-9]*'`
done < <(curl -s https://advancedweb.hu/sitemap.txt | grep -v "^$")
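The let arithmetic above breaks when an API returns null or an empty body, so in practice I'd funnel the per-URL numbers through a small sanitizing helper before uploading the total. A sketch (the metric and namespace names are illustrative):

```shell
#!/bin/bash
# Sum the numbers read on stdin, one per line; anything non-numeric
# (e.g. "null" from jq, or an empty line) counts as zero.
sum_lines() {
  local total=0 n
  while read -r n; do
    case "$n" in (''|*[!0-9]*) n=0 ;; esac
    total=$((total + n))
  done
  echo "$total"
}

# Example wiring: have a loop print one count per URL, pipe it through
# sum_lines, then upload the total.
# fb_comments=$(... the Facebook loop above, printing one count per URL ... | sum_lines)
# aws cloudwatch put-metric-data --metric-name "FacebookComments" \
#   --namespace "Social" --value "$fb_comments" --unit "Count"
```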
Data retrieval
As with the data collection, AWS provides many SDKs to retrieve the data, so you'll probably find a suitable library for the language of your choice. For this example, I'll be using the Javascript one. In the example code, which you can find on GitHub, I've used D3.js for data plotting, and hash parameters for all the configuration, including the AWS Access and Secret keys, to make it a general-purpose (and IFrame-embeddable) tool.
Retrieving monitoring data is pretty straightforward using the official library; there are only a handful of required parameters. First, the library must be included in the page:
<script src="https://sdk.amazonaws.com/js/aws-sdk-2.1.12.min.js"></script>
Then some global configuration is needed: the region and the keys must be set on a global object.
AWS.config.region = region;
AWS.config.update({accessKeyId: accessKeyId, secretAccessKey:secretAccessKey});
And then the actual call:
var cloudwatch = new AWS.CloudWatch();
cloudwatch.getMetricStatistics(
  {
    Namespace: Namespace,
    MetricName: MetricName,
    EndTime: new Date(),
    StartTime: new Date(new Date() - StartTimeMsAgo),
    Statistics: [Statistics],
    Period: Period,
    Unit: Unit
  },
  function(err, data) {
    if (err) {
      console.error(err, err.stack);
    } else {
      // Use the data
      // data.Datapoints is an array; each element has:
      //   Timestamp
      //   <<Statistics>>, e.g. Average, which holds the actual value
    }
  }
);
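For a quick sanity check without a browser, the same query can be issued from the AWS CLI (this sketch assumes GNU date and the Testing/CurrentTime metric from the collection script above):

```shell
# Fetch hourly averages of the CurrentTime metric for the last day.
start=$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)
end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
  --namespace "Testing" \
  --metric-name "CurrentTime" \
  --start-time "$start" --end-time "$end" \
  --period 3600 \
  --statistics Average \
  --unit Count
```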
Also don't forget that you'll need to use an AWS key that has the required permission to read the metric statistics:
{
  ...
  "Action": [
    "cloudwatch:GetMetricStatistics"
  ],
  "Effect": "Allow",
  "Resource": [
    "*"
  ]
}
At this point, you have all the data you need, and you can plot or use it however you'd like. In my example I've plotted it on a very simple line chart.
Limitations of AWS
During this experiment, I've encountered several shortcomings in the AWS APIs that severely limit the usefulness of this monitoring approach. These are:
- Currently it is not possible to limit the GetMetricStatistics policy to a single metric; it is an all-or-nothing switch.
- There is no way to rate-limit an access key. Currently, if an adversary obtains your AWS Access and Secret keys, they can flood your account with requests, which costs you money without limit.
- AWS CloudWatch retains data for 14 days, so that's the oldest point you can get back.
The effect of the first two is that you can't publicly disclose your AWS keys, as you might inadvertently disclose other metric statistics (#1), and you also open an attack surface on your account balance (#2). Of course, these can be mitigated by hosting an API that keeps the keys server-side, but then you need to care about scaling it (for the data collection you already need a server, but it does not need to scale). Architecturally it would be far better if you could just distribute the keys and it would just work.
The current data retention limit (#3) severely impacts the usefulness of historical graphing. While it allows examining short-term dynamic properties, it makes the service less useful as an analytical platform. It should be possible to retrieve older data, even if for a fee.
Conclusion
I started to examine CloudWatch hoping that it could be turned into a universal and architecturally clean analytics solution. While it lives up to this promise in some ways, like ease of use, it eventually falls short in both aspects. Its main use is to monitor instances and service health, and while it is very good at this, some important features are missing before it can be used for historical monitoring. That said, for visualizing short-term dynamics it is still a usable, easily scriptable, and versatile solution.