Nick's Blog

I Play WoW Scaling

06 February 2009

Having too much traffic is a great problem to have, and I'm fortunate enough to be facing it at the moment. Over the past few weeks, thanks in part to several really great articles on WoW Insider about I Play WoW, traffic and usage have been increasing significantly, and I've started to notice that increase in a few different ways.

The first has been in queue length. The big feature of the site is the ability to import your World of Warcraft characters and display them on your profile. After you've done that, actions like "dinging" and refreshing your character's data become available. All of these operations require requests to be made to fetch data from the World of Warcraft Armory.

What's great is that this information is freely available and I applaud and thank Blizzard for making it so. What isn't so great is that this can be a lot of traffic coming from my servers to theirs and it often comes in spikes and bursts. To avoid hammering them and drastically reducing my karma, I've implemented queues and throttling when fetching data. Using the erlang_wowarmory module, items consisting of the request type and meta-data about the request get stuffed into request queues that exist on all of the nodes used by the Facebook application. This works really well and has saved me lots of time, money and energy by having a stable and reliable armory data processing system. The caveat with this is that with lots of traffic come large queues.

The issue that I'm seeing is that during peak traffic hours, I may see the queues go up to 2,000+ items. Around the time of the first WoW Insider post the queues got up to 6,000+ items. These queues get processed and in due time everything resolves itself without issue. The real problem is three-fold.

There are two complementary solutions to this problem that I've got my eye on. The first is to implement a better queueing system that is less tolerant of duplicate items. The second is to add more crawling capacity to the system.
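To make the "less tolerant of duplicate items" idea concrete, here is a minimal sketch in Python. It is not the armory2 code (that is Erlang), and the class name, method names and key format are all made up for illustration. It assumes a request can be identified by a hashable key, so a second refresh of the same character is dropped while the first is still waiting.

```python
from collections import OrderedDict


class DedupQueue:
    """FIFO queue that ignores items whose key is already waiting.

    Hypothetical illustration only; not the erlang_wowarmory implementation.
    """

    def __init__(self):
        # OrderedDict keeps insertion order and gives O(1) membership checks.
        self._items = OrderedDict()

    def enqueue(self, key, payload=None):
        """Add an item; return False if the same key is already queued."""
        if key in self._items:
            return False
        self._items[key] = payload
        return True

    def dequeue(self):
        """Remove and return the oldest (key, payload) pair, or None if empty."""
        if not self._items:
            return None
        return self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)
```

With a queue like this, a burst of repeated refresh clicks for one character costs a single armory request instead of one per click.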

The first solution can be seen in the development of the armory2 module as part of the erlang_wowarmory project. What I've done here is take what I've learned about how things are queued and make several key changes. The first big change is the queue data structure itself. At first, I was using the 'queue' module that comes in the standard library to manage the FIFO queue. It worked really well and efficiently for small queues, but as they start to hit 3,000 to 5,000 items, I'm concerned about efficiency.

Now an ets table is used to manage the queue. The queue manager creates and owns an ordered_set ets table that operates as a first-in/first-out queue prioritized by item type. When an item is dequeued, the manager attempts to find character items first, then guilds, and then everything else, ordered by when each item was inserted into the queue. This is working really well so far and I'm pleased with the results.
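The dequeue order described here (characters, then guilds, then everything else, oldest first) can be sketched by sorting on a two-part key: a priority for the type, then an insertion counter. The Python below is only an illustration of that ordering and uses a heap in place of an ordered_set ets table. The priority numbers, class name and method names are assumptions, not taken from armory2.

```python
import heapq
import itertools

# Lower number = dequeued first. The order mirrors the description above;
# the numeric values themselves are assumed for this sketch.
PRIORITY = {"character": 0, "guild": 1}
DEFAULT_PRIORITY = 2  # "everything else"


class TypedQueue:
    """Queue ordered by (type priority, insertion sequence)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing tiebreaker

    def enqueue(self, item_type, data):
        priority = PRIORITY.get(item_type, DEFAULT_PRIORITY)
        # seq is unique, so comparison never reaches item_type or data.
        heapq.heappush(self._heap, (priority, next(self._seq), item_type, data))

    def dequeue(self):
        """Return the next (item_type, data) pair, or None if the queue is empty."""
        if not self._heap:
            return None
        _priority, _seq, item_type, data = heapq.heappop(self._heap)
        return item_type, data
```

The same two-part key would work as the key of an ordered set, where taking the first element gives the next item to process.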

Another change in the armory2 module is the way queues are distributed. In the armory module, each node would start a gen_server process that would store a local queue and would have a linked process that dequeues items and processes them. I'm experimenting with a different model right now that uses the global module to have a single queue manager that owns the ets table used to keep the entire queue. If at any point a crawler attempts to dequeue something and discovers that the master isn't running, it will attempt to start the master on the local node and the queue is recreated.

The second solution to the problem is essentially "grow with hardware". The reason that this applies here is that I can't increase the number of network requests currently being made on each node. The current throttling model prevents each IP address (slice) from making more than one request against the World of Warcraft Armory per second. There isn't anything that I can do to get around this on a single node. The only real solution here is to add more nodes and crawlers to the grid. Thankfully, Erlang makes this stupid simple. I found this out the hard [read as: easy] way when the WoW Insider article was published.
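A rough sketch of why more nodes is the only lever, again in Python and again not the real crawler: a throttle that allows one request per interval, plus a back-of-the-envelope drain time. The names Throttle and drain_seconds are hypothetical, and the estimate assumes each queued item needs exactly one armory request, which may not hold in practice.

```python
import time


class Throttle:
    """Allow at most one call per `interval` seconds (one per node/IP)."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


def drain_seconds(queue_length, nodes, interval=1.0):
    """Rough time to empty a queue if every node makes one request per interval.

    Assumes one request per queued item.
    """
    return queue_length * interval / nodes
```

Under those assumptions, a 6,000-item queue takes drain_seconds(6000, 1) = 6000 seconds (100 minutes) on one node and drain_seconds(6000, 4) = 1500 seconds (25 minutes) on four, which is why adding slices helps in a way that tuning a single node cannot.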

I'm using SliceHost to host my application, and two of the nodes have regular backups made daily and weekly. Each of my slices is the most basic available, which keeps costs down for me. When I needed to bring up additional nodes to handle the increase in traffic, it was just a matter of creating new slices based on the backups (which saves time on configuration) and running /etc/init.d/ipwcore start and /etc/init.d/ipwfbfe start. When the nodes came up, they detected the other nodes using net_adm:world() and started taking on traffic immediately. I should mention that I also added them to pound, the load balancer that I use.

My experiences scaling the I Play WoW Erlang Facebook application have been great. With good design and a little bit of homework it's easy to take on problems that have traditionally been show stoppers. All of the crawling code that I've referenced in this entry is open source under the MIT license and can be found on GitHub. With the 3 additional slices/nodes (I run both an ipwcore and ipwfbfe node on each slice) things have been running very smoothly.
