Disco + EC2 spot instance = WIN

October 30th, 2012

tl;dr version : This is how I spent a Saturday evening.

Warning: If you like to write Java, stop reading now. Go back to using Hadoop. It’s a much more mature project.

As a part of my job, I do a lot of number processing. Over the course of the last few weeks, I shifted to doing most of it as MapReduce jobs using Disco. It’s a wonderful approach to processing big data: the time to process a job scales with the quantity of data and shrinks with the amount of hardware you throw at it, and the amount of data to be processed can (in theory) be unlimited. While I don’t do anything of Google scale, I deal with Small Big Data. My datasets for an individual job would probably not exceed 1 GB. I could currently afford to keep not using MapReduce, but as my data set grows I would have to do distributed computing anyway, so better to start early.

Getting started with Disco

If you, like me, had given up on MapReduce in the past after trying to deal with administrating Hadoop, now is a great time to look into Disco. Installation is pretty easy. Follow the docs. Within 5 minutes I was writing jobs in Python to process data; it would have been faster if I had known beforehand that the SSH daemon should be listening on port 22.
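
If you have never seen a Disco job, the canonical word count from the Disco tutorial looks roughly like this (sketched from the tutorial, from memory; the input URL is a placeholder):

from disco.core import Job, result_iterator

def map(line, params):
    # Emit a count of 1 for every word in the input line
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    # Sum the counts for each word
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://example.com/some-text-file.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)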

Python for user scripts + Erlang for backend == match made in heaven

Enter disposable Disco

I made a set of Python scripts to launch and manage Disco clusters on EC2 for cases where no data needs to be stored on the cluster itself. In my use case, the input is read from Amazon S3 and the output goes back into S3.

There are some issues with running Disco on EC2:

  • Must have SSH keys set up so that the master can SSH into the slaves.
  • Must have a file with the Erlang cookie, with the same contents, on all slaves.
  • Must tell the master the hostnames of the slaves; an FQDN or anything with a dot gets rejected.
  • The default root directories have very limited storage space, usually 8 GB.

disposabledisco takes care of the above things and more. Everything needed to run the cluster is defined in a config file. First, generate a sample config file:

python create_config.py > config.json

This creates a new file with some pre-populated values. In my case, the config file looks like this (some info masked):

{
    "AWS_SECRET": "SNIPPED",
    "ADDITIONAL_PACKAGES": [
        "git",
        "libwww-perl",
        "mongodb-clients",
        "python-numpy",
        "python-scipy",
        "libzmq-dev",
        "s3cmd",
        "ntp",
        "libguess1",
        "python-dnspython",
        "python-dateutil",
        "pigz"
    ],
    "SLAVE_MULTIPLIER": 1,
    "PIP_REQUIREMENTS": [
        "iso8601",
        "pygeoip"
    ],
    "MASTER_MULTIPLIER": 1,
    "MGMT_KEY": "ssh-rsa SNIPPED\n",
    "SECURITY_GROUPS": ["disco"],
    "BASE_PACKAGES": [
        "python-pip",
        "python-dev",
        "lighttpd"
    ],
    "TAG_KEY": "disposabledisco",
    "NUM_SLAVES": 30,
    "KEY_NAME": "SNIPPED",
    "AWS_ACCESS": "SNIPPED",
    "INSTANCE_TYPE": "c1.medium",
    "AMI": "ami-6d3f9704",
    "MAX_BID": "0.04",
    "POST_INIT": "echo \"[default]\naccess_key = SNIPPED\nsecret_key = SNIPPED\" > /tmp/s3cfg\ncd /tmp\ns3cmd -c /tmp/s3cfg get s3://SNIPPED/GeoIPASNum.dat.gz\ns3cmd -c /tmp/s3cfg get s3://SNIPPED/GeoIP.dat.gz\ns3cmd -c /tmp/s3cfg get s3://SNIPPED/GeoLiteCity.dat.gz\ns3cmd -c /tmp/s3cfg get s3://SNIPPED/GeoIPRegion.dat.gz\ngunzip *.gz\nchown disco:disco *.dat\n\n"
}

This tells disposabledisco that I want a cluster with 1 master and 30 slaves, all of type c1.medium, using ami-6d3f9704 as the starting point. It lists the packages to be installed via apt-get and the Python dependencies to be installed using pip. You can link to an external tarball, a git repo, etc.; basically anything pip allows after pip install.

The POST_INIT portion is a bash script that runs as root after the rest of the install. In my case I am downloading and uncompressing the various GeoIP databases archived in an S3 bucket, for use from within Disco jobs.

Once the config file is ready, run the following command many times. The output is fairly verbose.

python create_cluster.py config.json

Why many times? Because there is no state stored in the system; all state is managed using EC2 tags. This is what the script does on each run (a rough sketch of the tag-based state check follows the list):

  • Check if the master is running. If not, request a spot instance for it (and kill any zombie slaves lying around from previous runs).
  • If the master is up and running:

    • Print the SSH command needed to set up port forwarding. After running the given SSH command you can open http://localhost:8090 in the browser to see Disco’s UI in all its glory.
    • Print the command to export DISCO_PROXY so you can create jobs locally.
    • Check the inventory of slaves. A slave can have 3 statuses: 1) pending – spot instance requested; 2) running – the instance is running; 3) bootstrapped – the slave is completely set up and can be added to the master.
    • If the total number of slaves is less than NUM_SLAVES, launch the remaining ones.
    • Try to bootstrap any running instances. If the bootstrap was successful, change the EC2 tag.
  • Finally, update the master’s Disco config, telling it the hostnames of the instances to use and the number of workers.
  • ???
  • Profit
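
Here is a rough sketch (my assumption, not the actual disposabledisco code) of what a tag-based state check can look like with boto; the region and tag values are illustrative:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")  # region is an assumption

def instances_by_state(tag_key="disposabledisco"):
    # Group running instances by the value of their state tag
    states = {}
    for reservation in conn.get_all_instances(filters={"tag-key": tag_key}):
        for inst in reservation.instances:
            if inst.state != "running":
                continue
            states.setdefault(inst.tags.get(tag_key), []).append(inst)
    return states

states = instances_by_state()
if not states.get("master"):
    print("No master running; this is where a spot request would be placed.")
else:
    print("Master up; %d slaves bootstrapped." % len(states.get("bootstrapped", [])))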

Cloudwatch showing 31 instances

Many steps involve EC2 provisioning spot instances, waiting for instances to get initialized, and so on, which is why the script has to be run repeatedly.
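
For illustration, placing the slave spot requests with boto might look something like this (a sketch under my assumptions, not the actual script; the region and the "slaves already up" bookkeeping are assumed or omitted):

import json
import boto.ec2

config = json.load(open("config.json"))
conn = boto.ec2.connect_to_region(
    "us-east-1",  # region is an assumption
    aws_access_key_id=config["AWS_ACCESS"],
    aws_secret_access_key=config["AWS_SECRET"])

# In reality this would be NUM_SLAVES minus the slaves already pending/running
missing = config["NUM_SLAVES"]

conn.request_spot_instances(
    price=config["MAX_BID"],
    image_id=config["AMI"],
    count=missing,
    instance_type=config["INSTANCE_TYPE"],
    key_name=config["KEY_NAME"],
    security_groups=config["SECURITY_GROUPS"])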

To help with shipping output to S3, I made some output classes for Disco:

  • S3Output – Each (key, value) pair returned creates a new file in S3, with the key as the S3 key and the value as a string that’s dumped inside it. So one key should be yielded only once from reduce.
  • S3LineOutput – Similar to S3Output, but it collects the output lines and joins them into one big file. It has options for sorting, uniqueness, etc.

Both of these can be configured to gzip the contents before uploading. A simplified sketch of the core idea follows.
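
This is not the actual S3Output class, just a simplified sketch (under my assumptions, using boto) of what uploading one (key, value) pair with optional gzipping boils down to:

import gzip
import StringIO

def upload_kv(bucket, key, value, use_gzip=True):
    # bucket is a boto S3 bucket; each (key, value) pair becomes one object
    k = bucket.new_key(key)
    if use_gzip:
        buf = StringIO.StringIO()
        gz = gzip.GzipFile(fileobj=buf, mode="wb")
        gz.write(value)
        gz.close()
        k.set_contents_from_string(buf.getvalue(),
                                    headers={"Content-Encoding": "gzip"})
    else:
        k.set_contents_from_string(value)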

As far as input is concerned, I send it a list of signed S3 URLs. (Sidenote: it seems Disco cannot handle https inputs at the moment, so I use http.) A sample job run might look like this:

import boto

# Connect to the bucket that holds the gzipped input files (credentials masked)
conn = boto.connect_s3("SNIP", "SNIP")
bucket = conn.get_bucket("SNIP")

def get_urls():
    # Signed S3 URLs for Disco to fetch as input; downgrade https to http
    # because Disco cannot handle https inputs at the moment
    urls = []
    for k in bucket.list(prefix="processed"):
        if k.name.endswith("gz"):
            urls += [k.generate_url(3660).replace("https", "http")]
    return urls

# MyExampleJob and s3_line_output_stream come from my own job/output modules
MyExampleJob().run(
    input=get_urls(),
    params={
        "AWS_KEY": "SNIP",
        "AWS_SECRET": "SNIP",
        "BUCKET_NAME": "SNIP",
        "gzip": True
    },
    partitions=10,
    required_files=["s3lineoutput.py"],
    reduce_output_stream=s3_line_output_stream
).wait()

Bonus – MagicList – A memory-efficient way to store and process potentially infinite lists.

We used Disco to compute numbers for a series of blog posts on CDN Planet. For that analysis it was a painful process for me to manually launch Disco clusters, which led me to create these helper scripts.

Shameless plug

Turbobytes, multi-CDN made easy

Have your static content delivered by 6 global content delivery networks, not just 1. Turbobytes’ platform closely monitors CDN performance and makes sure your content is always delivered by the fastest CDN, automatically.

4 reasons why I love my ISP

November 29th, 2011

I’ve been using True ADSL for years, and I absolutely love their service (and especially the transparent proxy). Here are some of the reasons why :-

Censorship Protecting me from bad stuff – The interwebs have a lot of “bad” things out there. My ISP takes good care of me by not letting me access things I am not supposed to see… even sites not explicitly blocked by court order. I don’t know what I would do without them. My head would probably explode if I saw porn, and my feelings would get hurt if I came across certain types of political messages.

Transparent proxy Web slow down machine – Buddha teaches us “The greatest prayer is patience”. A very special offering of True ISP is that it reminds us to be patient in this fast-paced world. True’s Web slow down machine sits between users and the servers they connect to. One of its features is to slow down access… It employs several brilliant methods to accomplish this :-

  • Not keeping connections alive – This is the most important factor in slowing down page loads. True does not keep connections to remote hosts alive, thus making sure you have to establish a fresh connection with each request to a server overseas. No matter how small the file, a request to the USA will take a minimum of 500 ms.
  • Making TCP optimizations useless, since the slow down machine is the one that actually makes the connections to remote hosts.
  • Overriding the destination IP – True doesn’t care what IP your computer wanted to connect to; your computer could be wrong. It looks at the Host header of the request, does its own DNS lookups and routes you to the correct server. Even if you wanted to override this for development, True correctly sends you to the production server. True knows development/staging servers are full of bugs, so requests should always go to production.
  • 512 kbps upload speed is sufficient for everyone. If you need to upload something big, why not get your lazy ass out, buy a CD and mail it!
  • Sharing is caring – My ISP oversells available bandwidth by a huge margin. It teaches us the importance of sharing.

Privacy People impersonator – Your life is boring? True has a solution! It will automagically log you in as someone else so you can get a glimpse into their exciting lives.

Incompetence Motivator – Living in Thailand, I am ashamed that I don’t read Thai yet, in part due to my own laziness. True gives you an incentive.

edited by Michael van Poppel

DFP now officially supports asynchronous rendering!

October 27th, 2011

Yesterday, DFP launched asynchronous ad loading. For the past few months I’ve been trying to load ads in a manner that doesn’t affect the rest of the page load, so this new development is like a dream come true.


(The tests above were run on webpagetest.org on IE8 at Dulles, VA)

Thank you Google! You just made my day.

Check if you are behind a transparent proxy

October 12th, 2011

Many Asian ISPs do not provide clean internet. They route all HTTP sessions thru a transparent proxy.

Here is a simple way to check if you are behind one.

sajal@sajal-laptop:~$ ping -c 4 www.cdnplanet.com
PING www.cdnplanet.com (107.20.181.99) 56(84) bytes of data.
64 bytes from ec2-107-20-181-99.compute-1.amazonaws.com (107.20.181.99): icmp_req=1 ttl=42 time=314 ms
64 bytes from ec2-107-20-181-99.compute-1.amazonaws.com (107.20.181.99): icmp_req=2 ttl=42 time=313 ms
64 bytes from ec2-107-20-181-99.compute-1.amazonaws.com (107.20.181.99): icmp_req=3 ttl=42 time=312 ms
64 bytes from ec2-107-20-181-99.compute-1.amazonaws.com (107.20.181.99): icmp_req=4 ttl=42 time=312 ms

--- www.cdnplanet.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 312.195/313.229/314.137/0.889 ms
sajal@sajal-laptop:~$ ab http://www.cdnplanet.com/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking www.cdnplanet.com (be patient).....done

Server Software:        Apache
Server Hostname:        www.cdnplanet.com
Server Port:            80

Document Path:          /
Document Length:        13084 bytes

Concurrency Level:      1
Time taken for tests:   0.944 seconds
Complete requests:      1
Failed requests:        0
Write errors:           0
Total transferred:      13296 bytes
HTML transferred:       13084 bytes
Requests per second:    1.06 [#/sec] (mean)
Time per request:       943.539 [ms] (mean)
Time per request:       943.539 [ms] (mean, across all concurrent requests)
Transfer rate:          13.76 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       21   21   0.0     21      21
Processing:   922  922   0.0    922     922
Waiting:      611  611   0.0    611     611
Total:        944  944   0.0    944     944
sajal@sajal-laptop:~$ 

My ping time to CDN Planet is 312 ms, but the TCP connection was established in just 21 ms !!!!11!!1 Something much closer than the actual server must have completed the handshake: the transparent proxy.
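
Here is the same check, scripted (a minimal sketch, not from the original post): time the TCP handshake and compare it to your ping RTT.

import socket
import time

HOST = "www.cdnplanet.com"  # any host you know is far away
PORT = 80

start = time.time()
sock = socket.create_connection((HOST, PORT), timeout=10)
connect_ms = (time.time() - start) * 1000
sock.close()

print("TCP connect took %.0f ms" % connect_ms)
# A connect time far below the ping RTT means something on the ISP's side
# (a transparent proxy) completed the handshake for you.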

Reasons for doing so include: censorship, big brother snooping, caching, hijacking users’ sessions, and probably more…

Evaluating few CDN options

June 11th, 2011

Recently, I was evaluating CDN options for a client with some unique challenges. We ended up using Amazon CloudFront, but I’ll detail the options we looked at and what led us to this decision.

Some things to note:-

  • We would serve CSS, JS, images referenced from stylesheets and some website images thru the CDN.
  • Due to the nature (trust me) of the site we expect a higher-than-normal miss rate. Access is spread across a wide number of URLs, and some may get few hits.
  • Speed is important. The website should be fast(er) from everywhere. The most important regions in order of priority are US, EU, AU and the rest of the world, with the US being most important.
  • Anything on the site can change at any time; there is no build process as such. Any object anywhere can change, and the change must be visible ASAP.
  • Developers/designers shouldn’t be harassed to purge a file if they change something.
  • CDN data usage: ~100 GB per month

The providers we looked at :-

MaxCDN: Almost sealed the deal.

Pros:-

  • Cheap: $40 for the first TB (must be used within a year) + $99 per additional TB. At current usage rates this comes to roughly $0.04+/GB.
  • Anycast/BGP routing: no way a bad DNS server can mess up routing.
  • Nice control panel, with a purge-all option just in case. Purges take effect almost instantly.
  • Handles gzip well.
  • Can have separate cache timings for caching in the browser and caching at the CDN – i.e. we can choose not to cache a file at the browser level, but cache it at the CDN and purge it when a change is made.

Cons:-

  • Poor global coverage – no POP/edge in Asia/AU – Deal Breaker
  • Pages loaded at the same speed when testing from AU with or without the CDN.

EdgeCast: Looked good at first, but poor gzipping.

Pros:-

  • Impressive list of networks
  • Highly configurable control panel.
  • Can have separate cache timings for caching in the browser and caching at the CDN – i.e. we can choose not to cache a file at the browser level, but cache it at the CDN and purge it when a change is made.

Cons:-

  • Gzippable files will not be gzipped on cache misses. – Deal Breaker
  • Requests from the edge server to the origin are uncompressed.
  • Expensive and wants higher commitments.
  • DNS Based routing

CDNetworks: Didn’t look past price

Cons:-

  • Ridiculously high price – Dealbreaker

Amazon CloudFront: WIN

Pros:-

  • Testing showed our pages to be fastest from all regions when using CloudFront. YMMV.
  • No commitments – $0.15 – $0.20/GB (depending on where the user accesses from) + a negligible per-request fee.
  • The client is already an AWS user, so one less account to maintain.
  • No need to send a gazillion emails to gazillions of people to get started. No bargaining.

Cons:-

  • No POP/edge in AU (but has them in Singapore, Hong Kong and Tokyo).
  • DNS based routing.
  • Charges a fee per request and per invalidation (purge) request.
  • No control panel; invalidation requests can only be done via the API.
  • Does not do gzipping, but honours the Vary header and serves the correct version based on what the user asks for.
  • Can’t use querystring parameters for CDN cache busting; CloudFront ignores querystrings.

Sidenote: Requests from CloudFront to the origin are HTTP/1.0. Nginx by default does not serve gzip to HTTP/1.0 requests, so the gzip_http_version setting must be changed in order to use nginx as an origin for CloudFront.
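
For reference, the relevant nginx directive (a minimal sketch; verify against your nginx version) looks like this:

gzip on;
gzip_http_version 1.0;   # default is 1.1, which skips CloudFront's HTTP/1.0 fetches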

The system we architected adds something based on the file mtime as part of the URL, so now we don’t need to do any purges at the CDN. Also, we can now have far-future expires on all CDN’d objects, because if something changes, the URL automagically changes.
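
A minimal sketch of the idea (my illustration, not the actual system; the CDN hostname and helper name are made up):

import os

CDN_BASE = "http://cdn.example.com"  # hypothetical CDN hostname

def cdn_url(local_path, web_path):
    # Embed the file's mtime in the path so the URL changes whenever the
    # file does. CloudFront ignores querystrings, so the version must live
    # in the path rather than in a ?v= parameter.
    mtime = int(os.path.getmtime(local_path))
    return "%s/%d%s" % (CDN_BASE, mtime, web_path)

# cdn_url("static/css/site.css", "/css/site.css")
# -> "http://cdn.example.com/<mtime>/css/site.css"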

For us, the price and features are important, but what’s more important is the results. We went with the provider with fewer features simply because our pages loaded fastest with them.