Scalable Social Games, Robert Zubek of Zynga (liveblog)

Social games are interesting from an engineering point of view since they live at the intersection of games and the web. We spend time thinking about making games fun, making players want to come back. We know those engineering challenges, but the web introduces its own set, especially around users arriving somewhat unpredictably and effects where huge populations come in suddenly. SNSes are a great example of this, with spreading network effects and unpredictable traffic fluctuations.

At Zynga we have 65m daily players, 225m monthly. And usage can vary drastically — Roller Coaster Kingdom gained 1m DAUs in one weekend, going from 700k to 1.7m. Another example: Fishville grew from 0 to 6m DAUs in one week. Huge scalability challenges. And finally, Farmville grew by 25m DAUs in five months. The cliff is not as steep, but the order-of-magnitude difference adds its own challenge.

Talk outline: introducing game developers to best practices for web development. Maybe you come from consoles or mobile or whatever; the web world introduces its own set of challenges and also a whole set of solutions that are already developed that we steal, or, uh, learn from. :) If you are already an experienced web developer, you may know this stuff already.

Two server approaches and two client approaches, which combine into three major types:

1. Web server stack + HTML: Mafia Wars, Vampires, etc.
2. Web server stack + Flash: Farmville, Fishville, Cafe World
3. Web + MMO stack + Flash: YoVille, Zynga Poker, Roller Coaster Kingdom

Web stack: based on LAMP, logic in PHP, HTTP comms. Very well understood protocol, limitations well known.

Mixed stack: game logic in an MMO server (e.g. in Java), web stack for everything else — used when web-stack limitations would prevent the game from being built. Use the web side for the SNS pieces.

Fishville:

DB servers and cache/queue servers → web stack

YoVille:

DB servers → MMO stack
Cache servers → web stack & CDN

Why a web stack? HTTP is very scalable: very short-lived requests, scales very well, and is easy to load balance. Each request is atomic. Stateless, so it is easy to add more servers. But there are limitations, especially for games: server-initiated actions (NPCs running around; if you come too close, the monster reacts…) are hard to do over HTTP, since it is request/response. There are some tricks, like the long poll, but fundamentally this makes it harder to scale — load balancers will get unhappy with you, and you can saturate a connection.
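
A minimal long-poll sketch in PHP, just to make the trick concrete — the cache key layout, host name, and event format here are illustrative assumptions, not how any particular Zynga game does it:

```php
<?php
// Long-poll endpoint: hold the HTTP request open until an event shows up
// for this player, or give up after a timeout and let the client reconnect.
$playerId = (int) $_GET['player_id'];
$eventKey = "events:$playerId";          // hypothetical key layout

$mc = new Memcached();
$mc->addServer('cache1.example.com', 11211);

$deadline = time() + 25;                 // stay under typical proxy timeouts
while (time() < $deadline) {
    $events = $mc->get($eventKey);
    if ($events !== false) {             // something is queued for this player
        $mc->delete($eventKey);          // consume it
        header('Content-Type: application/json');
        echo json_encode(array('events' => $events));
        exit;
    }
    usleep(250000);                      // re-check the cache every 250 ms
}
header('Content-Type: application/json');
echo json_encode(array('events' => array()));  // timed out: client polls again
```

This held-open connection is exactly the kind of thing that makes proxies and load balancers unhappy at scale, which is the speaker's point.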

The other thing is storing state between requests. This is more particular to game dev than to the web. Say you are playing Farmville and collecting apples. You do a bunch of actions which result in many different requests to the servers, but we want to make sure that only the first click gives you an apple, so you cannot click a dozen times on one tree. Which means stored state and validation. If you have clients talking to many web servers, you cannot push this into the DB — the poor thing will fall over. If we could guarantee that the client only talks to one web server, we could store it there and save to the DB later, but this is tricky to do. Ensuring that people are not allowed to break session affinity, even in the presence of malicious clients and browsers… hard.

So instead, you can wrap the DB servers in a faster caching layer that does not hit the DB all the time, such as network-attached caching. This works much better.
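
A minimal read-through sketch of that idea in PHP — the key scheme, TTL, and table name are assumptions for illustration, not the actual setup:

```php
<?php
// Read-through cache: try memcached first, fall back to the DB on a miss,
// then populate the cache so the next request skips the DB entirely.
function loadPlayer(Memcached $mc, PDO $db, $playerId) {
    $key = "player:$playerId";           // hypothetical key scheme
    $cached = $mc->get($key);
    if ($cached !== false) {
        return $cached;                  // cache hit: no DB work at all
    }

    $stmt = $db->prepare('SELECT * FROM players WHERE id = ?');
    $stmt->execute(array($playerId));
    $player = $stmt->fetch(PDO::FETCH_ASSOC);

    $mc->set($key, $player, 300);        // keep it cached for five minutes
    return $player;
}
```

On the write side the talk's pattern is to update the cache and write back to the DB later, which is why the LRU caveat mentioned further down matters.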

MMO servers… minimally MMO. A persistent socket connection per client, live game support such as chat and server-side push. Keeps game state in memory. We know when a player logs in and can load from the DB then… session affinity by default. Very different from the web! We can't do the easy load balancing like on the web.

Why do them? They are harder to scale out because of load balancing, but you get things like server-side push, live events, lots of game state.

Diagram:

DB servers, maybe with less caching wrapping them, talk to both the web servers and the MMO servers; those both talk to the client.

On the client side, things are simpler.

Flash allows high production quality games, game logic on the client, and can keep an open socket — you can talk any protocol you want.

HTML+AJAX: the game is “just” a web page, minimal system requirements, and limited graphics.

SNS integration. “Easy,” but not related to scaling. Call the host network to get friends, etc. You do run into latency and scaling problems; as you grow larger you need to build your infrastructure so it can support graceful performance degradation in the face of network issues. Networks provide REST APIs and sometimes client libraries.

Architectures:

Data is shared across all three of these: database, cache, etc.

Part II: Scaling solutions

aka not blowing up as you grow.

Two approaches: scaling up or scaling out.

Up means that as you hit processor or IO limits, you get a better box.
Out means that you add more boxes.

The difference is largely architectural. When scaling up, you do not need to change code or design, but to scale out you need an architecture that works that way. Zynga chooses scaling out — a huge win for us. At some point you cannot get a box big enough fast enough. Much easier to add small boxes, but you need the app to have architectural support for it.

Roller Coaster Kingdom gained a lot of players quickly. We started with one database, hit 500k DAUs in a week, and bottlenecked. Short term we scaled up, but switched to scaling out next.

Databases, very exciting — the first thing to fall over. Several ways to scale them. The terms here are specific to MySQL but the concepts are the same for other systems:

Everyone starts out with one database, which is great. But you need to keep track of two things. One, the limit on queries per second — do benchmarking using standard tools like Super Smack. You want to know your q/s ceiling and, beyond that, how the DB will perform; there are optimizations you can use to move it. And two, you need to know the players' query profile in terms of inserts, selects, and updates, and the average profile per second. It tends to track your DAU number, which is nice because then you can project q/s and know when you will reach capacity.
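
A hedged back-of-the-envelope version of that projection — every number below is made up for illustration:

```php
<?php
// If q/s tracks DAU, a measured per-player query profile plus a benchmarked
// q/s ceiling tells you roughly when the database runs out of headroom.
$queriesPerPlayerPerSec = 0.005;  // hypothetical measured profile
$qpsCeiling             = 5000;   // hypothetical benchmark result
$currentDau             = 700000;
$dauGrowthPerDay        = 50000;  // hypothetical growth rate

$maxDau   = $qpsCeiling / $queriesPerPlayerPerSec;       // ~1,000,000 DAU
$daysLeft = ($maxDau - $currentDau) / $dauGrowthPerDay;  // ~6 days

printf("One DB supports ~%d DAU; ~%.0f days of headroom at current growth\n",
       $maxDau, $daysLeft);
```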

If your app grows then you will need to scale out.

Approach one: replicating data to read-only slaves. Works well for blogs and web properties but not for games, because games have a higher modification profile, so your master is still a bottleneck. But useful for redundancy.

Approach two: multiple masters. Better because writes are split, but now you have consistency-resolution problems, which can be dealt with but increase CPU load.

Approach three, and the best: push the resolution logic up to the app layer — a standard sharding approach. The app knows which data goes to which DB.

Partition data two ways:

Vertical, by table, which is easy but does not scale with DAUs. Move players to a different box from items.

Horizontal, by row. Harder to do but gives the best results. Different rows on different DBs; you need a good mapping from row to DB. Stripe rows across different boxes: primary key modulo the number of DBs. Do it on an immutable property of the row — a logical RAID 0. Nice side effect for increasing capacity: to scale out a shard, you add read-only slaves, sync them, then shut down, cut replication, and hook them back up. Instant double capacity.
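
A minimal sketch of that row-to-shard mapping in PHP — the shard count, host names, and credentials are placeholders, not the actual topology:

```php
<?php
// Horizontal sharding: route each row to a database by an immutable property
// (here the player id), using primary key modulo the number of shards.
define('NUM_SHARDS', 4);                 // hypothetical shard count

$shardHosts = array(                     // hypothetical hosts
    'db0.example.com', 'db1.example.com',
    'db2.example.com', 'db3.example.com',
);

function shardFor($playerId) {
    return $playerId % NUM_SHARDS;       // immutable key -> stable mapping
}

function connectToShard(array $hosts, $playerId) {
    $host = $hosts[shardFor($playerId)];
    return new PDO("mysql:host=$host;dbname=game", 'game_rw', 'secret');
}

// Every query for this player goes to the same shard.
$db   = connectToShard($shardHosts, 1234567);
$stmt = $db->prepare('SELECT * FROM players WHERE id = ?');
$stmt->execute(array(1234567));
```

Because the mapping is a pure function of an immutable key, any web server can compute it without coordinating with the others.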

More clever schemes exist — layers of indirection that look up where each piece of data should go — but the nice thing about this approach is how straightforward it is. No automatic replication, no magic; robust and easy to maintain.

YoVille: partitioning both ways; lots of joins had to be broken and data patterns had to be redesigned — with sharding you need the shard id per query. Data replication had trouble catching up with high-volume usage. In a sharded world you cannot do joins across shards easily; there are solutions but they are expensive. Instead, do multiple selects or denormalize your data. Say you have a catalog of items and player inventory, and you want to match them: if the catalog is small enough, just keep it in memory.
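
A sketch of replacing that cross-shard join with an in-memory catalog plus a single select — the table and column names are assumptions:

```php
<?php
// Instead of JOINing inventory against the item catalog across shards,
// load the small, rarely-changing catalog once and do the "join" in app code.
function loadCatalog(PDO $catalogDb) {
    $rows = $catalogDb->query('SELECT id, name, price FROM items')
                      ->fetchAll(PDO::FETCH_ASSOC);
    $byId = array();
    foreach ($rows as $row) {
        $byId[$row['id']] = $row;        // index the catalog by item id
    }
    return $byId;
}

function loadInventoryWithNames(PDO $playerShard, array $catalog, $playerId) {
    $stmt = $playerShard->prepare(
        'SELECT item_id, quantity FROM inventory WHERE player_id = ?'
    );
    $stmt->execute(array($playerId));

    $result = array();
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $item     = $catalog[$row['item_id']];   // the "join" happens in RAM
        $result[] = array('name' => $item['name'], 'quantity' => $row['quantity']);
    }
    return $result;
}
```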

Skip transactions and foreign key constraints. Easier to push this to the app layer. The more you keep in RAM the less you will need to do this.

Caching.

If we don't have to talk to the DB, let's skip it: spend memory to buy speed. Most popular right now is memcached, a network-attached RAM cache. Not just for caching queries, but for storing shared game state as well, such as the apple-picking example. It stores simple key-value pairs. Put structured game data there, and mutexes for actions across servers. Caveat: it is an LRU (least recently used) cache, not a DB. There is no persistence! If you put too much data in it, it will start dropping old values, so you need to make sure you have already written the data to the DB.
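
A sketch of that kind of cross-server mutex applied to the apple-picking example — the key names and TTLs are illustrative assumptions:

```php
<?php
// Memcached::add() only succeeds if the key does not exist yet, which gives
// a cheap mutex that works across all web servers: only the first click on
// a given tree wins, no matter which server handles the request.
function tryPickApple(Memcached $mc, $playerId, $treeId) {
    $lockKey = "apple:$playerId:$treeId";    // hypothetical key scheme
    if (!$mc->add($lockKey, 1, 3600)) {
        return false;                        // duplicate or replayed click
    }

    // Safe to credit the apple: bump the player's shared counter (assumed to
    // have been initialized when the player was loaded into the cache) and
    // write it back to the DB later — the cache is not persistent.
    $mc->increment("apples:$playerId");
    return true;
}
```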

Because it is so foundational, you can shard it just like the DB: different keys on different servers, or shard it vertically or horizontally.
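
For what it's worth, the stock PHP Memcached client can already spread keys across a pool of cache boxes — a minimal sketch, with placeholder hosts:

```php
<?php
// The client hashes each key onto one server in the pool, so the cache is
// effectively sharded horizontally by key with no extra routing code.
$mc = new Memcached();
$mc->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, true);  // consistent hashing
$mc->addServers(array(
    array('cache1.example.com', 11211),
    array('cache2.example.com', 11211),
    array('cache3.example.com', 11211),
));

$mc->set('player:1234567', array('coins' => 500));  // lands on exactly one server
```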

Game servers.

The web server part is very well known: load balance. The preferred approach is to load balance with a proxy first. This is nice from a security standpoint… but it is a single point of failure and has capacity limits, since the proxy will have a maximum number of connections.

If you hit those limits, you load balance the load balancers, using DNS load balancing in front of them. It doesn't matter if DNS propagation takes a while.

The other thing that is useful is redirecting media traffic away from the game servers — SWFs are big, audio is big; do not serve them from the same place as game comms, or you will spend all your capacity on media files. Push them through a CDN, and if you are on the cloud already you can store them there instead. A CDN makes it fast, since the assets are close to the users. Another possibility is to use lightweight web servers that only serve media files. But essentially, you want the big server bank to only talk game data, not serve files. You gain several orders of magnitude in performance by doing this.

MMO servers, the unusual part of the setup! Scaling is easiest when servers do not need to talk to each other. DBs can shard, memcached can shard, web servers can be load-balanced farms — and MMOs? Well, our approach is to shard them like the other servers.

Remove any knowledge they have about each other and push the complexity up or down. Moving it up means load balancing it somehow. Minimize inter-server comms: all participants in a live event should be on the same server. Scaling out means no direct sharing — sharing through third parties is OK, such as a separate service for that event's traffic.

Do not let players choose their connections. Poor man's load balancing: if a server gets hot, remove it from the LB pool; if enough servers get hot, add more instances and send new connections there. Not quite true load balancing, which limits scalability.
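
A sketch of that poor-man's connection assignment — the load numbers and host names are hypothetical, and a real version would read load from monitoring rather than a hard-coded array:

```php
<?php
// Poor man's load balancing for socket servers: hand new connections only to
// servers below a load threshold; hot servers drain as players log off.
function pickGameServer(array $serverLoads, $maxLoad = 0.8) {
    $eligible = array();
    foreach ($serverLoads as $host => $load) {
        if ($load < $maxLoad) {
            $eligible[] = $host;         // only servers with headroom
        }
    }
    if (!$eligible) {
        return null;                     // everything is hot: time to add boxes
    }
    return $eligible[array_rand($eligible)];
}

// Example: load as a fraction of capacity, keyed by host.
$loads = array(
    'mmo1.example.com' => 0.35,
    'mmo2.example.com' => 0.92,          // hot: excluded from new connections
    'mmo3.example.com' => 0.50,
);
echo pickGameServer($loads), "\n";
```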

In deployment, downtime = lost revenue. On the web you just copy over PHP files; socket servers are harder. How do you deploy with zero downtime? Ideally you set up shadow new servers and slowly transition players over. This can be difficult — versioning issues.

For this reason, this is all harder than web servers.

Capacity planning.

We believe in scaling out, but demand can change fast — how do you provision enough servers? Different logistics: do you provision physical servers or go to the cloud? If you have your own machines, you have more choice and control and higher fixed costs. With the cloud you get lower costs and faster provisioning, but you cannot control CPU, virtualized IO, etc. On the cloud it is easier to scale out than up.

For a legion of servers you need a custom dashboard for health, Munin for server monitoring graphs, and Nagios for alerts. The first level of drilldown is graphs for every server family separately, so you can isolate a problem to a given layer in the system. Once you know memcached usage spiked, then you can drill down to particular machines…

Nagios… SMS alerts for server load: CPU load exceeds 4, or a test account fails to connect after 3 retries.

Put alerts on business stats too! DAUs dropping below the daily average, for example. Sometimes they react faster than server stats.
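
A sketch of such a business-stat check, meant to run from cron — the table layout, threshold, and alert hook are all assumptions:

```php
<?php
// Alert when the latest day's DAU falls well below the trailing 7-day
// average; this can fire before any server-level metric looks wrong.
function checkDauAlert(PDO $statsDb, $alert, $threshold = 0.8) {
    // Hypothetical daily_stats(day, dau) table, one row per day, newest first.
    $rows = $statsDb->query('SELECT dau FROM daily_stats ORDER BY day DESC LIMIT 8')
                    ->fetchAll(PDO::FETCH_COLUMN);
    if (count($rows) < 8) {
        return;                              // not enough history yet
    }

    $latest      = (float) $rows[0];
    $trailingAvg = array_sum(array_slice($rows, 1)) / 7;

    if ($latest < $threshold * $trailingAvg) {
        $alert(sprintf('DAU alert: %.0f vs %.0f trailing average', $latest, $trailingAvg));
    }
}
```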

If you are deployed in the cloud, network problems are more common — machines dropping off the net or restarting is routine. Be defensive: reduce single points of failure and program defensively. This includes the game side.

Q&A:

q: why MySQL? Other DBs are better for scaling.
a: there are other DBs that have been around longer and have a bigger community, but we don't use the features those large DBs offer. Looking back at the sharding slides — we don't even do things like transactions. It is easier to move that complexity to the app layer. Once you are on that path, it is a good solution.

q: did you benchmark, that sort of thing, for the different DBs?
a: yes, of course.

q: and for data integrity — if you threw out foreign key constraints, that sounds scary! Is it kind of a nightmare?
a: No, not too bad at all, actually. Especially if you do not hit the DB all the time, you find you don't get into those dangerous situations as often.

q: is the task when you add more tables… is it as complex?
a: not too bad, it has worked well.

q: assuming browsers pick it up, are you guys looking into WebGL?
a: many technologies are interesting — 3D in the browser, Silverlight. I would be interested in using them personally… once they achieve high market penetration.

q: why flash?
a: everyone has it. Very pragmatic approach.

q: Do you back up dbs?
a: of course

q: and how?
a: once you go with the cloud and Amazon, you have to use that approach… we have a number of redundant backup solutions.

q: I guess many joins are across friends… they have to talk to multiple shards. Do you try to put friends on the same shard?
a: no, everyone has different friends.

q: on SNS integration, did you run into issues with PHP not supporting async — with delays from answers from the SNS, running out of threads?
a: you will encounter delay with SNS comms; it's just part of the overall infrastructure, and it could be anything, not just PHP. You have to program around it and find good solutions for dealing with it when it happens, because it will.

q: So you don't switch from PHP — you delay the process?
a: we did encounter a number of places where we had to dig deep into PHP in order to make it work well on that scale.

q: did you patch PHP?
a: we, uh… yes.

q: what are your feelings on tools like the NoSQL sort of thing?
a: we look into those actively; once the tech matures, it will be a very good candidate for this sort of thing. But not currently implemented.

q: on sharding, you said use modulo to distribute load. Once you have found a bottleneck, how do you prepare the data to be moved from one shard to another?
a: You don't move people between shards. You just copy a shard to two machines, so both have redundant data, and then remove the redundant rows.

q: on partitioning — partitioning into two tables. Say item trading goes across two DBs; transactions may break? Changing ownership on two different DBs?
a: you need a guarantee across multiple DBs: putting the data in a memcache layer, locking it, then doing the write, or putting it in the app layer, implementing "transactions lite".
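
A hedged sketch of what such "transactions lite" could look like, reusing the memcached locking idea from earlier — the key names, TTL, and applyTrade helper are hypothetical:

```php
<?php
// "Transactions lite": take memcached locks on both players in a fixed order
// (to avoid deadlock), apply the writes to both shards, then release.
function tradeItem(Memcached $mc, $applyTrade, $sellerId, $buyerId, $itemId) {
    $ids = array($sellerId, $buyerId);
    sort($ids);                                   // fixed lock order
    $locks = array();

    foreach ($ids as $id) {
        $key = "lock:player:$id";
        if (!$mc->add($key, 1, 30)) {             // 30 s safety TTL
            foreach ($locks as $held) {           // someone else holds a lock:
                $mc->delete($held);               // back off so the caller can retry
            }
            return false;
        }
        $locks[] = $key;
    }

    try {
        // Writes to the two players' shards happen here; not a real
        // transaction, just serialized by the locks.
        $applyTrade($sellerId, $buyerId, $itemId);
    } finally {
        foreach ($locks as $key) {
            $mc->delete($key);
        }
    }
    return true;
}
```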

q: being on the cloud, did you have to drop the service approach and have each PHP layer write directly to the DB instead of going through a service layer? Say for an MMO, achievements or presence services — do you keep the service layer as a web service, or write directly to the DB? The service call time can add time… even on the cloud.
a: Yes, you want this to be nicely modular… we end up not putting it on different machines — it runs on the same box as the game logic, so there is no network traffic and no separate layer in between. So modular, but not in terms of network topology.

Posted by 솔라리스™
Many people have already written about Facebook's Credits policy. As someone who plans social services as a business item myself, I want to add some thoughts on the subject. As you probably know, today I would like to introduce Facebook Credits, the service through which Facebook is shrewdly making money on the platform it has built.

<Before reading, please see the posts below first.>
- Is Amazon Facebook's biggest competitor?
- The eBay–Facebook–Amazon virtual currency war begins
- Facebook's policy changes and new revenue models..

What is the 'Credits' service Facebook is talking about? Credits can be defined as a kind of payment-processing service, or as a virtual currency service.

First mentioned around December 2007, this virtual currency was used in 2008–2009 to pay for presents in Facebook's marketplace and gift shop. More recently Facebook introduced Credits for third-party applications and made their use mandatory as a matter of policy, which caused friction with resident companies such as Zynga and its FarmVille.

Why is Facebook trying to create Credits? Some people wonder. Put simply, because it makes money. As we learned from Cyworld's Dotori ("acorns"), cyber money is no longer thought of as currency for a virtual space but as money that can be used in real life.

Users do not have much resistance to this, either. According to Gwangpari's post "Wouldn't it be a jackpot if Facebook sold 'Dotori' to 350 million people?", gift shop revenue reaches 78 million dollars.

As for the exchange rate, the rate applied works out to about 10 Credits per US dollar, and as of last November the Credits service was available in 24 countries, with pricing differentiated to account for each country's exchange rate.

On top of 78 million dollars in gift shop revenue, if Facebook expands Credits to the application store and lets Facebook users in many more countries join the service, it can rake in money like Bong-i Kim Seon-dal, the legendary con man who sold river water.

And when you hear that the fee for this online currency was planned at 30%, you immediately realize the revenue is no joke.

What is the core of the 'Credits' policy? In a word, the answer is to turn the open marketplace Facebook has built into a place where you can shop. A system where you can buy Levi's jeans, purchase and play games, and buy books — that is the ultimate core of the Credits policy.

Also, by opening up the payment module so that third parties can use it, Facebook seems to be dreaming of becoming a true worldwide payment hub.

Because of this, Zynga — famous for online games such as FarmVille — actually clashed with Facebook. Zynga already had its own payment service, so being told to use Credits and hand over a 30% fee inevitably caused friction.

For now the matter appears settled with Zynga's renewed contract, but the embers do not seem to be entirely out. According to "The impact of the five-year Facebook–Zynga contract on social games," a post by DDing, who gives me a lot of inspiration, most of Zynga's players access its games through Facebook, so it is hard for Zynga to pull out; and Facebook, worried about a chain of departures by other services if Zynga left, found a point of agreement.

My guess is that they compromised on the fee, the area of greatest disagreement, with each side giving a little ground.

Has Facebook become a public enemy because of 'Credits'? With an online payment market already formed around Amazon Payments, Google Checkout, and eBay's PayPal, SNS services like Facebook and Twitter running their own payment systems are turning the market into a free-for-all.

Google, too, is trying to put the brakes on Facebook: today's article "Is Google's free era over?" reports that it is building Newspass, a payment service dedicated to web content, in cooperation with news organizations.

Facebook is emerging as a public enemy because, by offering Credits, it poses a serious threat to shopping-centered services like Amazon and eBay: users may increasingly be able to shop on Facebook without ever going to Amazon or eBay.

You could call this the power of a leading service.

What is the decisive difference from Korea's Cyworld Dotori? Many people are curious. You can think of the function as the same and the concept as almost identical.

The difference is that Cyworld centers on its own store (item purchases, music purchases, and so on), whereas Facebook opens up the payment system and shares revenue with third parties and the various service companies (games, music, images, and so on) residing in the Facebook marketplace.

But from the standpoint that services grounded in the Web 2.0 spirit are the ones succeeding today, you could say Cyworld is still a closed service.

In other words, the decisive difference lies in mindset and attitude. It is not that the people involved do not know this; I would like to charitably interpret it as partly a limit imposed by the power dynamics with SKT.

Wrapping up: people say Korean services are closed off. As someone who has developed and planned Korean services, and who is now pushing new work forward, I think the answer is not that Koreans are closed-minded, but that the thinking of everyone who says this, along with corporate and national policy, is what is closed off.

I will end this post hoping that, from a more macroscopic perspective, new industries expand in Korea as well, and that this kind of knowledge base and innovation takes hold.
Posted by 솔라리스™