Show me the numbers: Part 1

What are some of the phrases that you do not like hearing in a workplace?

Think about it.

For me, some of them would be:

  1. … based on this, our learnings are ...
  2. Sorry, I thought you didn’t like cake.
  3. I have a feeling these changes will work/have worked.

Today I want to focus on the third statement and how we at @fillrapp decided to eliminate guesswork from our thinking. The process, I should clarify upfront, was not a simple wave of a wand after which everything was neat and tidy. It was trial, error, repeat until you no longer end up on error.


The Why

Some time ago we, as a startup, had a crisis of conscience: we could no longer ignore the evidence in front of us that the way we were approaching our product was not viable. The intent was there, and we did feel that the product, at its core, was brilliant, but our approach to development, design and marketing was not. It was a bitter pill to swallow, but kudos to all involved, for swallow it we did. After a few weeks of taking a hard look at ourselves and our practices, we designed a new approach, which I like to call “Show me the numbers” (even though no one else does). The idea was to pivot to a new, leaner product via a metrics-based approach.

In Software Engineering, almost all ideas can be tested and verified by experiments. Development by experimentation used to be a tedious affair, as it required a lot of changes that needed to be made easily and quickly. These days, however, with the advent of tools like @optimizely, @installrapp, etc., it has become increasingly easy to test and deploy your hypotheses. Google Analytics is now an old player in understanding the data that comes back from those experiments. At @fillrapp, however, we found that GA did not give us complete insight into the usage of our app. We learnt about downloads via iTunes Connect and distribution via GA, but we knew next to nothing about usage trends. Usage trends are the lifeline of a startup, as they allow you to focus your energies in a direction that you know is working.


The How

I sat down with the team and we discussed what it was that we really wanted. Did we want all the data lumped into one place and then have someone prepare reports as a means of disseminating the information? Or did we want buckets of information where individuals could sift through data and generate reports/results/interpretations themselves? Eventually, in the interests of automation and longevity, we decided on the latter, with one rule in mind. At no point were we to store any private/personal/sensitive data about our users. Period. All we wanted to interpret was the usage of our app! Nothing w.r.t. the users. (Fillr as a company never sees or stores any of our users' personal data. Each user's personal data is stored locally on their phone, not in the cloud, fully encrypted and PIN protected within the Fillr app.)

With the above requirements in mind, I decided to set up an ElasticSearch and Kibana engine, which I lovingly call the "Stats Engine" and everyone else calls "that Kibana thing". At this point I had set up many systems in AWS, but never a search cluster, and hence I did not know the finer details of setting one up. I spoke to Ross Simpson (@simpsora), who has experience in this area, and he was kind enough to offer some very valuable insights. Initially, my thoughts focussed on how to get stats into ElasticSearch.

The way our system is set up is quite simple. The app (be it iPhone or Android) talks over HTTPS to our middleware and, after that conversation, fills the form. It is that fill data that we want to analyse, in terms of:

  • What is being filled?
  • At what time?
  • How many times?
  • Which geographical location fills the most?

Note that at no point do we capture who fills the form, or what they fill it with. 
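
To make that concrete, the kind of anonymised event the middleware assembles contains nothing more than the what/when/where of a fill. The field names below are illustrative only, not our actual schema:

    require 'time'

    # Illustrative only: an anonymised fill event. No user identity, no form values.
    event = {
      event_type:  'form_fill',            # what happened
      field_count: 12,                     # how many fields were filled
      platform:    'ios',                  # iPhone or Android client
      country:     'AU',                   # coarse geographical location only
      filled_at:   Time.now.utc.iso8601    # when it happened
    }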

I knew the end point would be an ElasticSearch and Kibana stack running in AWS, but I also wanted redundancy of data. If the EK stack went down, I did not want to lose any stats. If everything on the stats side went down, I did not want to lose any events. Not only that, if everything on the stats side did go belly up, I wanted to know immediately. These were the priorities I set myself.

So I decided to use the following:

  1. AWS for my infrastructure: We chose AWS because I was most familiar with it and almost all of our other infrastructure is already in it.
  2. An AWS AMI as my deployable artefact: When in AWS, do as AWS does.
  3. Packer to create my deployable artefact: Due to some previous work, I had come to understand it fairly well. 

We did not want the phone to do any of the stats pushing, as it needs to be as fast as possible. Hence it is at the middleware level (shown in the wonderfully crafted diagram above) that all the stats pushing happens. All of this pushing also had to be asynchronous, which meant a queue. I decided to use the AWS Simple Queue Service (SQS), as the middleware already sits in AWS and it would be easy to do so.
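
At its simplest, the push is a single fire-and-forget SQS call. Here is a sketch using the Ruby aws-sdk; the queue URL and region are placeholders, and our middleware does the equivalent in its own stack:

    require 'aws-sdk'   # v2 of the Ruby SDK
    require 'json'

    QUEUE_URL = 'https://sqs.ap-southeast-2.amazonaws.com/123456789012/stats-events' # placeholder
    sqs = Aws::SQS::Client.new(region: 'ap-southeast-2')                             # assumed region

    # The IAM role attached to the middleware only needs sqs:SendMessage on this queue.
    def push_stat(sqs, event)
      sqs.send_message(queue_url: QUEUE_URL, message_body: event.to_json)
    end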

The middleware was redeployed with an SQS IAM (Identity and Access Management) role so that it could push messages into the queue. To make sure the queue did not fill up without us noticing, I also added CloudWatch Alarms to it, which ping me via PagerDuty when the queue reaches a pre-determined limit. The middleware, at this point, was also doing some data reformatting to make the messages simpler to ingest into the eventual ElasticSearch setup.
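
The alarm itself is nothing exotic. Something along these lines does the job, where the queue name, the threshold and the SNS topic that PagerDuty listens on are all placeholders:

    require 'aws-sdk'

    cw = Aws::CloudWatch::Client.new(region: 'ap-southeast-2')   # assumed region

    # Alarm when the backlog of visible messages exceeds the pre-determined limit.
    cw.put_metric_alarm(
      alarm_name:          'stats-queue-backlog',
      namespace:           'AWS/SQS',
      metric_name:         'ApproximateNumberOfMessagesVisible',
      dimensions:          [{ name: 'QueueName', value: 'stats-events' }],        # placeholder queue
      statistic:           'Maximum',
      period:              300,                                                   # 5-minute datapoints
      evaluation_periods:  1,
      threshold:           1000,                                                  # the pre-determined limit
      comparison_operator: 'GreaterThanThreshold',
      alarm_actions:       ['arn:aws:sns:ap-southeast-2:123456789012:pagerduty']  # placeholder SNS topic
    )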

Now that I had the data I wanted to analyse in a queue, the next step was making sure we could extract it from the queue and put it into the ElasticSearch setup. I decided to use a DSL called Retire to interact with ElasticSearch, as I had used it previously. A small Ruby script (which simply runs as an Upstart job on the box) ingests the messages from the queue and pushes them into ElasticSearch. The EC2 instance that runs this script has an SQS IAM read role for that queue associated with it. The script looks something like this:
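
(What follows is a minimal sketch rather than the production script: the region, queue URL and index name are placeholders, and error handling and logging are left out.)

    #!/usr/bin/env ruby
    # Drains the stats queue and indexes each anonymised event into ElasticSearch.
    # Runs forever; Upstart keeps it alive and restarts it if it dies.
    require 'aws-sdk'
    require 'tire'    # the 'retire' gem is required as 'tire'
    require 'json'

    QUEUE_URL = 'https://sqs.ap-southeast-2.amazonaws.com/123456789012/stats-events' # placeholder
    sqs = Aws::SQS::Client.new(region: 'ap-southeast-2')     # uses the instance's read role

    Tire.configure { url 'http://localhost:9200' }           # the local ElasticSearch node

    loop do
      resp = sqs.receive_message(queue_url: QUEUE_URL,
                                 max_number_of_messages: 10,
                                 wait_time_seconds: 20)      # long polling
      resp.messages.each do |msg|
        event = JSON.parse(msg.body)
        Tire.index('fill-events') { store event }            # index the event (placeholder index name)
        sqs.delete_message(queue_url: QUEUE_URL, receipt_handle: msg.receipt_handle)
      end
    end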

So now we have set up pushing to a queue and extraction from the queue. Next, the actual ElasticSearch and Kibana infrastructure. The Ubuntu repositories that EC2 instances point to hold old versions of both, so I obtain the latest versions from upstream and use Packer to create the AMI. Kibana, ElasticSearch and NGinX are set up as Ubuntu services, with NGinX routing requests from the front end to the respective services on their appropriate ports.
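
A heavily trimmed Packer template for that AMI looks roughly like the following, where the region, source AMI and script name are placeholders. The provisioning script is where the ElasticSearch and Kibana downloads, the NGinX install and the service definitions live:

    {
      "builders": [{
        "type": "amazon-ebs",
        "region": "ap-southeast-2",
        "source_ami": "ami-xxxxxxxx",
        "instance_type": "t2.medium",
        "ssh_username": "ubuntu",
        "ami_name": "stats-engine-{{timestamp}}"
      }],
      "provisioners": [{
        "type": "shell",
        "script": "provision-stats-engine.sh"
      }]
    }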

This is also the best time to provision ElasticSearch plugins. I chose the following:

  1. elasticsearch/elasticsearch-cloud-aws: An AWS co-ordination plugin
  2. lmenezes/elasticsearch-kopf: A Cluster Plugin
  3. lukas-vlcek/bigdesk and mobz/elasticsearch-head: Both are Index visualisation plugins. 

For setting up the AWS environment, I stuck to my friend CloudFormation. I had previously tried a few of the other DSLs out there, like cfn-dsl, but found them no less verbose or involved than CloudFormation. And I am not going to go near Elastic Beanstalk anytime soon. So, using CloudFormation, I set up the ElasticSearch-Kibana stack as follows.

  • The stack is divided into two subnets, public and private (a trimmed template sketch follows this list).
  • The public subnet contains the NAT and the Bastion instances and the ELB to interface with the outside world.
  • The private subnet contains the actual search and visualisation instances which can only be accessed via the public subnet using proper authentication measures.
  • The search instance backs up its indices into an S3 bucket periodically via a cron job. If and when the Auto Scaling Group (ASG) kicks in and creates a new instance, that instance is initialised with the latest index set from this bucket.
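
To give a flavour of the template, the subnet split looks like this. It is heavily trimmed: the NAT, Bastion, ELB and Auto Scaling resources are omitted and the CIDR blocks are placeholders:

    {
      "AWSTemplateFormatVersion": "2010-09-09",
      "Description": "Stats engine VPC: public and private subnets (trimmed sketch)",
      "Resources": {
        "StatsVPC": {
          "Type": "AWS::EC2::VPC",
          "Properties": { "CidrBlock": "10.0.0.0/16" }
        },
        "PublicSubnet": {
          "Type": "AWS::EC2::Subnet",
          "Properties": { "VpcId": { "Ref": "StatsVPC" }, "CidrBlock": "10.0.0.0/24" }
        },
        "PrivateSubnet": {
          "Type": "AWS::EC2::Subnet",
          "Properties": { "VpcId": { "Ref": "StatsVPC" }, "CidrBlock": "10.0.1.0/24" }
        }
      }
    }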


Return of the How

This was the initial setup of my ElasticSearch-Kibana stack. And it worked. All of this was behind our office VPN and authentication wall, so access was restricted. However, I was not sure about scaling at this point, so I had another conversation with Ross Simpson (@simpsora) and he suggested some changes.

  • Making the cluster an actual cluster (always setting the minimum master nodes to n/2 + 1 to avoid split-brain).
  • I was using the Knapsack plugin to store backups in the S3 bucket; I have now changed this to use the inbuilt snapshot utility of ElasticSearch.
  • I also did not need an ELB in front of the NGinX setup, so I moved the ELB in front of the actual cluster, with traffic sent to it via the NGinX + Kibana instance. This also allows the indices to be examined easily and separately from the Kibana instance.
  • Given a failure of one node, the way I had set up the backups via Knapsack would have caused issues: the cluster would have been rebalancing and, on top of that, I would have been restoring a snapshot. Hence the switch from the Knapsack backups to the inbuilt snapshot utility of ElasticSearch (sketched after this list).
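
Registering the S3 repository and cutting a snapshot are just two REST calls to ElasticSearch (the "s3" repository type comes from the cloud-aws plugin). A sketch in Ruby, with placeholder repository, bucket and region names:

    require 'net/http'
    require 'json'
    require 'uri'

    ES = URI('http://localhost:9200')   # the node being backed up

    # Register an S3 snapshot repository (one-off).
    repo = Net::HTTP::Put.new('/_snapshot/s3_backup')
    repo['Content-Type'] = 'application/json'
    repo.body = { type: 's3',
                  settings: { bucket: 'stats-engine-backups',   # placeholder bucket
                              region: 'ap-southeast-2' } }.to_json
    Net::HTTP.start(ES.host, ES.port) { |http| http.request(repo) }

    # Cut a snapshot; this is what the periodic cron job now runs instead of Knapsack.
    snap = Net::HTTP::Put.new("/_snapshot/s3_backup/snap-#{Time.now.to_i}?wait_for_completion=true")
    Net::HTTP.start(ES.host, ES.port) { |http| http.request(snap) }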

Keeping all of the above points in mind, the architecture now resembles the following diagram:


In the end

It’s all about the journey. We are not yet done with refining our metrics-based approach to product development, nor will we be anytime soon. This is a continuous process, and developing a mechanism to view the usage of our product in real time has helped our focus and direction. It has also reinforced our initial hypothesis that an empirical approach is the way to develop software. I would highly recommend ElasticSearch for such a metrics-based approach. The ease of its setup and configuration and its speed of indexing are a delight, and the diversity of Kibana's visualisations makes understanding those stats very easy.

I would like to extend my sincere thanks to Ross Simpson (@simpsora) and Andrew Humphrey (@andrewjhumphrey) for helping me understand things rather than just helping me get them done.

Mujtaba Hussain (@khalidaapps)

Head of DevOps (@fillrapp)