dlo.me

What To Do When Your Site Goes Viral

My last post told the story of how I made Breakup Notifier and how it exploded in popularity across the Internet. This post is a technical look at how I kept it from going down.

I would be lying if I told you the application as I wrote it on day one bore any resemblance to what it looked like just a few days later. One thing you must accept is that unless you have a crystal ball, it is nearly impossible to forecast the issues that will arise in your application. No matter what you do, there will be bugs you didn’t anticipate, features that should have been obvious, and money you could have saved. Be nimble and unafraid to rewrite your code. It will happen, so be prepared.

I hope that what I write here today can help you avoid the problems I faced if you ever happen to be in the same position.

Let the nerdery begin.

The Foundations

I automate as much of what I do as I possibly can, to keep my development process fast. It means I can put a small prototype of an idea together in a fraction of the time it would take to build everything from scratch. Additionally, I keep a small store of idioms in my head for taking things to MVP level very quickly.

Breakup Notifier was based on a project skeleton I have for Django projects running on Google App Engine. Hosting the project on Google’s infrastructure was the best architectural choice I made. I can’t claim too much credit though; the fact is that I didn’t want to be bothered with systems administration that weekend. Auto-scaling was just a much-appreciated added bonus.

The lessons I write about below are applicable to every site that faces a deluge of traffic. Hopefully you won’t have to make the same mistakes I did =D.

Lesson 1: If you can do something on the frontend with the same performance as doing it on the backend, do it on the frontend.

Frontend code runs on your users’ machines, so its capacity grows with the number of users you have. Backend code runs on your servers, so its capacity grows only with the number of servers you pay for. It doesn’t take a rocket scientist to figure out which is less CPU-intensive from the server’s perspective.

Here’s an example to illustrate. The following snippet runs after a user signs in with Facebook; it’s from the first working version of the application.

def facebook_oauth_complete(request):
    if 'access_token' in request.session:
        return HttpResponseRedirect("/")

    code = request.GET['code']
    params = {
            "client_id": settings.FB_APP_ID,
            "client_secret": settings.FB_APP_SECRET,
            "redirect_uri": settings.FB_REDIRECT_URI,
            "code": code }
    url = settings.FB_GRAPH_URI + "/oauth/access_token?" + urllib.urlencode(params)

    # Fetch the access token
    response = urlfetch.fetch(url)
    data = response.content
    attributes = cgi.parse_qs(data)

    access_token = attributes['access_token'][0]

    graph = facebook.GraphAPI(access_token)
    profile = graph.get_object("me")
    friends = graph.get_connections("me", "friends")
    friend_ids = map(lambda k: k['id'], friends['data'])

    ... code ...

    for friend_id in friend_ids:
        task = taskqueue.Task(url="/tasks/create-profile",
                method="POST",
                params={
                    "id": friend_id,
                    "access_token": access_token,
                    "requested_by": profile['id']})
        task.add("create-profile")

    ... code ...

Pulling in profile data on the backend unnecessarily (the Graph API calls and the per-friend task loop above) was a very bad decision. For every user who authenticated with the application, I made on average about 150 calls to Facebook! I did this so that I could fill out the user’s friend list after they authenticated. Why I decided to do this on the backend, I have no idea, but it blew me past my task queue allotment more than once. Although this didn’t bring down the site, it did prevent relationship status notification emails from going out on schedule.

Not long after this error started popping up, it occurred to me that you can easily get this information through the Facebook JavaScript SDK. Instead of making Facebook API calls from App Engine, API calls are now made straight from the browser. The result is a speedier experience for the user and lower server costs.

Lesson 2: Simple is better. And faster. And easier to maintain.

The first version of BN had a complicated queue system for checking whether a relationship status had changed.

Every set number of minutes, a cron job would:

  1. Go through every single user in the Breakup Notifier datastore, 5 at a time.
  2. For each user, add a task to the task queue to check the friends they wanted updates on.
  3. In each task, go through each of those friends and call a model method that hit Facebook’s API.
  4. If a relationship status had changed, send an email to everyone following that person.

The above approach was riddled with issues, very slow, and impossible to debug. The code was just a mess.

I spent about an hour figuring out how to optimize this process as much as possible. Compare the above to what it does now:

  1. Fetch 1000 users at a time until all the users have been accessed.
  2. For each user, append a task to a list.
  3. When that list has 100 tasks, add them all to the task queue in a single call and empty the list. Repeat step 2 as necessary.
  4. In each task, loop through all of the friends that user wants updates on.
  5. If the current status differs from the stored status, send an email and update the friend in the database.

This cut down costs by over 90%.
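Steps 1 through 3 boil down to a generic chunking pattern. Here is a minimal sketch; `chunked` is a hypothetical helper, and the task queue usage in the comment is App Engine style, shown only as an illustration:

```python
def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Hypothetical usage with App Engine's task queue, which accepts up to
# 100 tasks per add() call:
#
#   queue = taskqueue.Queue("update-check")
#   for batch in chunked(tasks, 100):
#       queue.add(batch)

print([len(b) for b in chunked(range(250), 100)])  # [100, 100, 50]
```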

Lesson 3: Minimize calls to your database, cache, or queue.

This ties in with the relationship update changes above. If you see multiple calls to your database, check your cache first. When querying the datastore, use keys. If you’re making too many calls to your cache, group them together. If you’re using a queue, add tasks in clumps, not one at a time. Latency is a huge issue; don’t discount it. Addressing it now will save you loads of trouble in the future.

I’ll go over how you might make these optimizations with App Engine; however, the general principles apply to all frameworks and languages. If you’re building for the web, you should do everything to minimize latency and disk access.

Memcache

# Bad
cache.set("president:1", "nixon")
cache.set("president:2", "obama")
cache.set("president:3", "clinton")

# Good
# This is a lower latency action. Instead of calling memcache three times, you only call it once.
cache.set_multi({"president:1": "nixon", "president:2": "obama", "president:3": "clinton"})
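The same batching applies to reads: one get_multi call replaces several round trips. A minimal sketch, using a hypothetical dict-backed stand-in for the memcache client (App Engine’s memcache exposes set_multi/get_multi with this shape):

```python
class FakeCache:
    """Dict-backed stand-in for a memcache client, for illustration only."""

    def __init__(self):
        self._data = {}

    def set_multi(self, mapping):
        # One round trip sets many keys at once.
        self._data.update(mapping)

    def get_multi(self, keys):
        # One round trip fetches many keys; missing keys are simply absent.
        return {k: self._data[k] for k in keys if k in self._data}

cache = FakeCache()
cache.set_multi({"president:1": "nixon", "president:2": "obama", "president:3": "clinton"})

# One call instead of three:
presidents = cache.get_multi(["president:1", "president:2", "president:3"])
print(presidents["president:2"])  # obama
```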

Datastore

# Bad
query = User.all()
query.filter("email =", "user@example.com")
user = query.get()

# Good
# Best to retrieve objects by what they are indexed by in the datastore.
# I'm saving the User object with the key_name as the email address.
user = User.get_by_key_name("user@example.com")

# Best
# Use the cache before hitting the datastore. Reading from disk is slow,
# reading from memory is fast.
user = mc.get("user:user@example.com")
if not user:
    user = User.get_by_key_name("user@example.com")
    mc.set("user:user@example.com", user)
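That cache-then-datastore pattern repeats in every handler, so it’s worth wrapping in a helper. A sketch with hypothetical names and dict stand-ins for memcache and the datastore (the loader callback is where a real `get_by_key_name` call would go):

```python
class SimpleCache:
    """Dict-backed stand-in for memcache, for illustration only."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

def get_or_fetch(cache, key, loader):
    """Read-through: try the cache, fall back to the loader, warm the cache."""
    value = cache.get(key)
    if value is None:
        value = loader()  # slow path, e.g. User.get_by_key_name(...)
        if value is not None:
            cache.set(key, value)
    return value

cache = SimpleCache()
datastore = {"user@example.com": {"name": "Ann"}}  # fake data for the example

user = get_or_fetch(cache, "user:user@example.com",
                    lambda: datastore.get("user@example.com"))

# The second call is served from memory and never touches the datastore.
user_again = get_or_fetch(cache, "user:user@example.com", lambda: None)
print(user_again)  # {'name': 'Ann'}
```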

Task Queue

# Bad
# The user object may change during the interval from when the task is added to
# when it executes. Can cause data loss.
task = taskqueue.Task(url="/tasks/my-task", params={"user": user1})
task.add("my-queue")

task = taskqueue.Task(url="/tasks/my-task", params={"user": user2})
task.add("my-queue")

# Good
# Retrieve the user object in the task, so as to not save stale data.
task = taskqueue.Task(url="/tasks/my-task", params={"user_key": "user1@example.com"})
task.add("my-queue")

task = taskqueue.Task(url="/tasks/my-task", params={"user_key": "user2@example.com"})
task.add("my-queue")

# Best
# Only make one call to the task queue instead of two. App Engine will allow
# you to add 100 tasks at a time. Use that to your advantage.
tasks = []
task = taskqueue.Task(url="/tasks/my-task", params={"user_key": "user1@example.com"})
tasks.append(task)

task = taskqueue.Task(url="/tasks/my-task", params={"user_key": "user2@example.com"})
tasks.append(task)

queue = taskqueue.Queue("my-queue")

# Add the tasks.
queue.add(tasks)

Lesson 4: If all else fails, log absolutely everything.

You cannot possibly know what edge cases will emerge in your application and cause you problems down the line. I don’t care how smart you think you are or how much test coverage you have. Your site will have issues, and you will not be prepared for them.

The best advice I can give you is to log profusely. Let me show you an example.

def user_update(request):
    id = request.POST['id']
    logging.info("id=%s" % id)

    cache = memcache.Client()
    logging.debug("memcache client instantiated")

    user_key = "fb:%s" % id
    user = cache.get(user_key)
    logging.debug("check if user in cache")
    if not user:
        logging.info("user not in cache")
        user = FacebookUserV2.get_by_key_name(id)
        logging.debug("user retrieved from database")
    else:
        logging.info("user retrieved from cache")

    try:
        graph = facebook.GraphAPI(user.access_token)
        logging.debug("graph api instantiated")
        profile = graph.get_object("me")
        logging.debug("get profile data")
    except facebook.GraphAPIError:
        # Reset the access token of the user
        user.access_token = None
        logging.warning("access token invalid")
        user.put()
        logging.debug("save user")
        return JsonResponse({"status": "failure"})

    locale = profile.get('locale', None)
    if not locale:
        logging.warning("locale not provided")
    else:
        logging.info("locale=%s" % locale)
    user.locale = locale
    user.put()
    logging.debug("save user")
    cache.set(user_key, user)

    logging.debug("save user to cache")
    return JsonResponse({"status": "success"})

If you read through the code above, you can get a pretty good sense of what I mean by “log everything”. The benefit of it is that you can quickly diagnose issues. In App Engine, the logging facility is directed to the “Logs” page in your application dashboard. If you use vanilla Django on your own server, I recommend using Sentry to zero in on your application errors. I’m not too familiar with what you might use for other frameworks or languages.

If your app is throwing 500s, it’s helpful to see the last line logged before it died. You can quickly zero in on the broken code and diagnose the issue far more easily than you otherwise could.

Lesson 4.1: Monitor Your Application.

Sometimes it’s just too hard to guess where your bottlenecks are. I highly recommend you use whatever monitoring tool you can get your hands on.

Here are a few tools that I would recommend. Just keep in mind that the open source options will take longer to configure and in a high traffic situation might not be the best option to start off with.

Performance Monitoring
  - AppStats: monitors Python, Java. App Engine only.
  - New Relic: monitors Ruby, Java, .NET, PHP.
  - Scout: monitors Rails, MySQL, Sphinx, MongoDB, Redis, Apache. Very extensible, open API.
  - Union Station: monitors Ruby. Free during open beta.

Server Monitoring

  - Nagios: open source.
  - Monit: open source.
  - CloudKick: free for one server.

Lesson 5: Don’t check your email.

This isn’t as “technical” as the points above, but I think it is extremely important and relevant to what you’ll face when your site is seeing lots of traffic.

Yes, I know it’s awesome that your site is going viral. All your friends know about it, the media is talking about you, and a random person in Idaho farmland heard about your website on the radio (true story).

For the first day after the Breakup Notifier craziness started, I was focused way too much on what was going on around me, instead of focusing on making Breakup Notifier better. There were too many people talking about it to even keep track of. My advice is to spend a moment to reflect on the effect you’ve had, and then get back to work. Your work does not get easier because the world is watching you!

The key is getting shit done in the face of massive distraction.

If your experience is anything like mine, you will get business proposals, friend requests, direct messages, and everything else in between. These all present fantastic potential opportunities, but my advice is to deal with them later or recruit a friend to manage the business-related issues. Otherwise you will explode and will get nothing done.

At the time, all of those things may seem really time-sensitive. What’s more important though are your users, and the comments that people are giving you right now. Build them the best product you can while eyes are on you. It will make a difference.

Thanks for reading. Be sure to subscribe if you liked this post and follow me on Twitter for the latest updates.

Have a comment? Join the discussion.