Sunday, April 28, 2013

Unique Id Generation in a distributed environment

I was looking at various ways to generate unique ids under different constraints and system requirements. I came across Snowflake, open-source software published by Twitter for generating unique ids, and also an article by Flickr. Together they give a good idea of how to solve this problem.

It is a common requirement to generate unique ids in a distributed environment. Data stored across different databases and identified by ids (as primary keys) should not conflict when it is queried and consolidated across multiple databases. For example, Twitter may want to read all the tweets which were published on 1st Jan. This query will retrieve the data from multiple databases and return the consolidated result in a sorted fashion. Requirements like these drive Twitter to mark all tweets with unique ids so that they can be stored and queried efficiently.

Flickr uses sharding to store data in multiple databases. The primary key should be unique across the databases so that uniqueness is maintained even when data is moved between them.

These are some of the constraints which have a bearing on the id generation design.
1. A unique id should not contain more than 64 bits.
2. It must be at least k-sortable. For ex. tweets generated in a time interval should be roughly sortable.
3. The id generation service could be running on multiple machines to avoid a single point of failure.


Unique Ids generation without the above constraints:

* A database can generate unique ids. For ex. in MySQL a user can define an auto-increment column which is filled automatically when a row is inserted. Flickr uses this approach. They have dedicated MySQL instances (Flickr calls them ticket servers) with dedicated tables for generating the unique ids.

* Use UUID.randomUUID() to generate a unique id. The generated id is 128 bits and is not sortable.

* Use a combination of timestamp + UUID. This is k-sortable, but the length exceeds 128 bits.

* Use a single server to generate the ids by incrementing a counter. This becomes a single point of failure and also may not scale when the system requires thousands of unique ids per second.
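The timestamp + UUID option above can be sketched in a few lines. This is a minimal illustration (the class name, the zero-padded 13-digit millisecond prefix, and the `-` separator are my own choices, not from the Flickr or Twitter write-ups); the prefix makes ids roughly time-sortable, at the cost of exceeding 128 bits:

```java
import java.util.UUID;

public class TimestampUuidId {
    // Zero-pad the millisecond timestamp so that lexicographic order
    // matches numeric order, then append a random UUID for uniqueness.
    public static String next() {
        long now = System.currentTimeMillis();
        return String.format("%013d-%s", now, UUID.randomUUID());
    }
}
```

Note that ids generated within the same millisecond share a prefix but have random suffixes, so ordering inside one millisecond is arbitrary; this is exactly the "roughly sortable" (k-sortable) property mentioned above.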


Unique Ids generation with the above constraints:
Snowflake defines a pretty generic approach. This software was made open source by Twitter.

Snowflake runs in a distributed environment (on multiple machines) to generate unique ids which are 64 bits and sortable. It uses the following logic:

unique id  = timestamp + nodeid + counter.

We can fix the number of bits used by each of the above fields. For example, the nodeid can use 4 bits, the counter 8 bits, and the timestamp the remaining 48 bits (60 bits in total, within the 64-bit limit).
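Packing the three fields into one 64-bit long is just shifting and OR-ing. A small sketch, using the example widths from the text (4-bit nodeid, 8-bit counter, timestamp in the high bits); note that Twitter's actual Snowflake uses different widths, so treat this layout as illustrative:

```java
public class IdLayout {
    static final int NODE_BITS = 4;
    static final int COUNTER_BITS = 8;

    // Place the timestamp in the high bits, then the nodeid, then
    // the counter. Because the timestamp occupies the most significant
    // bits, ids from later timestamps always compare greater, which is
    // what makes the ids time-sortable.
    public static long pack(long timestamp, long nodeId, long counter) {
        return (timestamp << (NODE_BITS + COUNTER_BITS))
             | (nodeId << COUNTER_BITS)
             | counter;
    }
}
```

For example, `pack(1, 2, 3)` yields `(1 << 12) | (2 << 8) | 3`, i.e. 4611.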

nodeid is an id that is unique across the different machines running Snowflake instances. A nodeid is assigned to each instance when it is started. It is required to avoid any conflict between the unique ids generated by multiple instances. The nodeid assignment needs to be done carefully so that no two instances share the same nodeid; Snowflake uses ZooKeeper for this purpose.

We need a counter because a server may receive multiple requests within a single timestamp. For example, a server may receive 100 tweets at exactly 12:00:00 PM. The counter is reset every time the timestamp changes to a new value.
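The counter-reset logic above can be sketched as a minimal single-node generator. This is a simplified illustration, not Snowflake's actual code: it omits counter-overflow handling within one millisecond and protection against the clock moving backwards, both of which a production generator must handle:

```java
public class SimpleSnowflake {
    private final long nodeId;
    private long lastTimestamp = -1L;
    private long counter = 0L;

    public SimpleSnowflake(long nodeId) {
        this.nodeId = nodeId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            counter++;          // same millisecond: bump the counter
        } else {
            counter = 0;        // new millisecond: reset the counter
            lastTimestamp = now;
        }
        // 4-bit nodeid and 8-bit counter, as in the example layout above.
        return (now << 12) | (nodeId << 8) | counter;
    }
}
```

Because either the timestamp advances or the counter increments on every call, each id is strictly greater than the previous one from the same node.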

With an 8-bit counter (256 values per millisecond), a single Snowflake node can generate up to 1000 (milliseconds) * 2^8 = 256,000 unique ids per second.

Above all, we need to make sure that the system clocks of all the machines running Snowflake instances are synchronized using the Network Time Protocol (NTP).

128-bit k-sortable unique id generation
Boundary follows a similar approach to generate 128-bit unique ids that do not depend on ZooKeeper for assigning nodeids.
Here the timestamp uses 64 bits, 48 bits are used by the MAC address of the machine hosting the instance, and 16 bits by the counter.
With a 16-bit counter, a single node can generate up to 1000 (milliseconds) * 2^16 = 65,536,000 unique ids per second.
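The 128-bit layout described above can be packed the same way as the 64-bit one, just with wider fields. A sketch using `BigInteger` (the MAC address is passed in as a long here rather than read from a `NetworkInterface`, to keep the example self-contained):

```java
import java.math.BigInteger;

public class Flake128 {
    // Boundary-style layout: 64-bit timestamp in the top bits,
    // then the 48-bit MAC address, then a 16-bit counter.
    public static BigInteger pack(long timestampMillis, long mac48, int counter16) {
        return BigInteger.valueOf(timestampMillis)
                .shiftLeft(64)                                   // top 64 bits: timestamp
                .or(BigInteger.valueOf(mac48 & 0xFFFFFFFFFFFFL)
                        .shiftLeft(16))                          // next 48 bits: MAC
                .or(BigInteger.valueOf(counter16 & 0xFFFF));     // low 16 bits: counter
    }
}
```

Using the MAC address as the node identifier is what removes the dependency on ZooKeeper: each machine's network hardware already gives it a unique 48-bit value, so no coordination service is needed to assign nodeids.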