Twitter snowflake approach is cool
I was researching a solution to generate unique IDs and I liked the Twitter snowflake approach. These are my notes about this approach.
What is Twitter’s snowflake approach?
It is a solution for generating unique IDs in distributed systems. Twitter uses this approach for Tweets, DMs, Lists, and so on.
- IDs are unique and sortable
- IDs include a timestamp, so they are ordered by creation date.
- IDs fit in a 64-bit integer.
- IDs contain only numerical values.
Sign bit (1 bit): Reserved bit (it is always 0). It can be reserved for future use. Keeping it 0 also keeps the overall number positive when the ID is stored as a signed 64-bit integer.
Timestamp (41 bits): Milliseconds elapsed since a custom epoch (Snowflake’s default epoch is Nov 04, 2010, 01:42:54 UTC).
Machine ID (10 bits): Accommodates 1024 (2¹⁰) machines.
Sequence number (12 bits): A local counter on each machine that increments by 1 for every generated ID. The counter resets to 0 every millisecond. Theoretically, a machine can generate a maximum of 4096 (2¹²) new IDs per millisecond.
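To make the layout concrete, here is a minimal sketch of how the four fields above could be packed into one integer with shifts and ORs. It is not Twitter's actual implementation. The names `SnowflakeGenerator`, `decode`, and the injectable `clock` parameter are my own, and the millisecond epoch constant 1288834974657 is my assumed Unix-millisecond value for the default epoch mentioned above.

```python
import threading
import time

# Assumption: Unix-millisecond value of the default Snowflake epoch (Nov 04, 2010, 01:42:54 UTC).
TWITTER_EPOCH_MS = 1288834974657
TIMESTAMP_BITS, MACHINE_BITS, SEQUENCE_BITS = 41, 10, 12
MAX_MACHINE_ID = (1 << MACHINE_BITS) - 1  # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095


class SnowflakeGenerator:
    """Hypothetical generator for illustration only."""

    def __init__(self, machine_id, epoch_ms=TWITTER_EPOCH_MS, clock=None):
        if not 0 <= machine_id <= MAX_MACHINE_ID:
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.epoch_ms = epoch_ms
        # The clock returns Unix time in milliseconds; injectable so it can be tested.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the counter, wrapping at 4096.
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Counter exhausted: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                # New millisecond: the counter starts again at 0.
                self._sequence = 0
            self._last_ms = now
            elapsed = now - self.epoch_ms
            if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
                raise OverflowError("timestamp does not fit in 41 bits")
            # The sign bit stays 0 because elapsed uses at most 41 bits.
            return ((elapsed << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self.machine_id << SEQUENCE_BITS)
                    | self._sequence)


def decode(snowflake, epoch_ms=TWITTER_EPOCH_MS):
    """Return (unix_timestamp_ms, machine_id, sequence) for an ID."""
    sequence = snowflake & MAX_SEQUENCE
    machine_id = (snowflake >> SEQUENCE_BITS) & MAX_MACHINE_ID
    timestamp_ms = (snowflake >> (MACHINE_BITS + SEQUENCE_BITS)) + epoch_ms
    return timestamp_ms, machine_id, sequence


if __name__ == "__main__":
    gen = SnowflakeGenerator(machine_id=5)
    new_id = gen.next_id()
    print(new_id, decode(new_id))
```

Because the timestamp occupies the most significant bits, comparing two IDs as plain integers compares their creation times first, which is what makes the IDs sortable.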
Advantages & Disadvantages of the Twitter Snowflake Approach
- It is 64 bits long, which is half the size of a UUID (128 bits).
- Scalable (it can accommodate 1024 machines)
- Highly available (Each machine can generate 4096 unique IDs each millisecond)
- Some UUID versions do not include a timestamp. Compared with those, Twitter Snowflake has the advantage of being sortable.
- The design requires ZooKeeper (disadvantage).
- The generated IDs are not random like UUIDs. Future IDs can be predictable.
- The maximum time span that can be represented in 41 bits of milliseconds is about 69 years after the epoch. Need a solution after this :)
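The numbers in the list above can be checked with a few lines of arithmetic. This is only a sketch: the constant 1288834974657 is my assumed Unix-millisecond value for the default Snowflake epoch, and a year is approximated as 365.25 days.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Unix-millisecond value of the default Snowflake epoch.
TWITTER_EPOCH_MS = 1288834974657

max_machines = 1 << 10              # 10-bit machine ID
ids_per_ms = 1 << 12                # 12-bit sequence number
max_timestamp_ms = (1 << 41) - 1    # largest value of the 41-bit timestamp

# How many years of milliseconds fit in 41 bits (year approximated as 365.25 days).
ms_per_year = 365.25 * 24 * 60 * 60 * 1000
years = max_timestamp_ms / ms_per_year

# The last moment the timestamp field can represent with this epoch.
unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_moment = unix_epoch + timedelta(milliseconds=TWITTER_EPOCH_MS + max_timestamp_ms)

# Theoretical ceiling per machine and for a full fleet, per second.
ids_per_second_per_machine = ids_per_ms * 1000
ids_per_second_total = ids_per_second_per_machine * max_machines

print(f"machines: {max_machines}, IDs per ms per machine: {ids_per_ms}")
print(f"timestamp range: {years:.1f} years, until {last_moment:%Y-%m-%d}")
print(f"IDs per second: {ids_per_second_per_machine} per machine, {ids_per_second_total} in total")
```

With the default epoch, the timestamp field runs out in 2080, so the "solution after this" is not an urgent problem yet.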
Usage Notes
- Discord uses snowflakes, with their epoch set to the first second of the year 2015.
- Instagram uses a modified version of the format, with 41 bits for a timestamp, 13 bits for a shard ID, and 10 bits for a sequence number.
- Mastodon’s modified format has 48 bits for a millisecond-level timestamp and uses the UNIX epoch. The remaining 16 bits are for sequence data.
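The variants above differ mainly in field widths and epoch, so one generic helper can handle all of them. The sketch below is my own illustration, not code from any of these projects. `split_fields`, `join_fields`, and the layout tuples are hypothetical names, and the widths are taken only from the descriptions in these notes.

```python
from datetime import datetime, timezone

# Field widths from most significant to least significant bit, as described in these notes.
TWITTER_LAYOUT = (1, 41, 10, 12)    # sign, timestamp, machine ID, sequence
INSTAGRAM_LAYOUT = (41, 13, 10)     # timestamp, shard ID, sequence
MASTODON_LAYOUT = (48, 16)          # timestamp, sequence data

# Discord's epoch: the first second of 2015, as Unix milliseconds.
DISCORD_EPOCH_MS = int(datetime(2015, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def split_fields(value, widths):
    """Split a 64-bit integer into fields of the given bit widths."""
    if sum(widths) != 64:
        raise ValueError("field widths must add up to 64 bits")
    fields = []
    shift = 64
    for width in widths:
        shift -= width
        fields.append((value >> shift) & ((1 << width) - 1))
    return tuple(fields)


def join_fields(fields, widths):
    """Pack fields back into one integer; the inverse of split_fields."""
    if sum(widths) != 64 or len(fields) != len(widths):
        raise ValueError("fields and widths must match and cover 64 bits")
    value = 0
    for field, width in zip(fields, widths):
        if not 0 <= field < (1 << width):
            raise ValueError(f"{field} does not fit in {width} bits")
        value = (value << width) | field
    return value


if __name__ == "__main__":
    packed = join_fields((123456789, 42, 7), INSTAGRAM_LAYOUT)
    print(packed, split_fields(packed, INSTAGRAM_LAYOUT))
    print("Discord epoch (ms):", DISCORD_EPOCH_MS)
```

Keeping the widths as data rather than hard-coded shifts makes it easy to compare the trade-offs: more shard or machine bits allow more generators, while more sequence bits allow more IDs per millisecond.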