https://www.hiredintech.com/classrooms/system-design/lesson/56 – this is the link to the original course.
Below are my notes on the parts that interested me in some way.
Use cases
Starting with the idea of the original service, we can think of two main features:
- shortening – given a long link, we convert it to a short version;
- redirecting – given a short version, we redirect the user to the original web site.
Some second-tier features might be: custom URLs, high availability of the system.
Some things that will be out of scope but are worth mentioning: analytics, automatic link expiration, manual link removal, access to the service via UI or API, etc.
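The two core features above fit in a tiny in-memory sketch. This is only an illustration under my own assumptions (the class and method names are not from the course); base-62 encoding of a numeric id is one common way to produce the short code, and it connects to the 62-symbol alphabet used later in the hash-length calculation.

```python
import string

# 0-9, a-z, A-Z -> the 62-symbol alphabet discussed later in these notes
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

class UrlShortener:
    """Minimal in-memory sketch of the two core features: shorten and redirect."""

    def __init__(self):
        self._urls = []  # position in this list serves as the link's numeric id

    def shorten(self, long_url):
        # Store the URL and encode its id in base 62 to get the short code.
        self._urls.append(long_url)
        n = len(self._urls) - 1
        code = ""
        while True:
            code = BASE62[n % 62] + code
            n //= 62
            if n == 0:
                return code

    def redirect(self, code):
        # Decode the base-62 code back to the numeric id and look it up.
        n = 0
        for ch in code:
            n = n * 62 + BASE62.index(ch)
        return self._urls[n]
```

A real service would persist the mapping and return an HTTP 301/302 instead of the raw URL, but the shorten/redirect pair is the whole contract.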
Constraints
Once we are done with the use cases, it's time to state what we want the service to achieve in terms of load and data. For example:
- How many requests per second should we handle? (Derived from: how much traffic should the system handle?)
- How many new links should we handle in a given period of time? (Derived from: how much data should the system handle?)
Normally the interviewer may not provide any exact numbers (and first of all we should ask), or may hide them behind vague hints and assumptions, e.g. “let’s assume that the service will not be in the top 3 on the market, but it definitely will be in the top 10”. Either way we don’t get precise information; the interviewer’s intention is to see whether we can work such a problem out on our own.
In this situation, we can rely on public data. More specifically, we should know a few headline statistics about the biggest services, such as:
- total number of Facebook users: 1.3B;
- total number of Twitter users: 650M;
- number of new tweets per day: 500M.
The tweets-per-day figure is the one to use here, mainly because Twitter was one of the main drivers of URL-shortener adoption. And the most interesting part: with a few assumptions, knowing only this number lets us calculate everything else.
Total amount of tweets per month
500M * 30 = 15B
Now it’s worth mentioning that not every tweet contains a shortened URL. We can assume that only every 10th does.
Amount of URLs
15B / 10 = 1.5B
But this is the total for the whole market. What share of the URLs are we going to handle? We can apply the 80/20 rule: assume that the top 3 services handle 80% of all URLs and the rest goes to the other companies.
Amount of URLs for non-top-tier companies
1.5B * 0.2 = 300M
We can assume that we will not be the best, but one of the best, which leaves us with somewhere between 10% and 1/3 of the remaining URLs. Let’s assume that we handle 1/3.
Amount of URLs handled by our company per month
300M / 3 = 100M per month
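The whole estimation chain so far fits in a few lines. The starting number and the ratios (1 in 10 tweets, the 80/20 split, our 1/3 share) are the assumptions stated above:

```python
tweets_per_day = 500_000_000               # public Twitter statistic
tweets_per_month = tweets_per_day * 30     # 15B tweets
urls_per_month = tweets_per_month // 10    # every 10th tweet has a short URL -> 1.5B
non_top_tier = urls_per_month * 20 // 100  # 80/20 rule: top 3 services take 80% -> 300M left
our_share = non_top_tier // 3              # we take ~1/3 of the remainder -> 100M
```

Integer arithmetic is enough here; these are order-of-magnitude estimates, not measurements.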
Requests
Knowing the number of URLs, we can work out how many requests we will be handling. We need to estimate the lifespan of a link. Normally, a news item lives for one to two weeks; we can assume the same holds for links.
For sure, some links will be more popular than others. But again, keeping the 80/20 rule in mind, we can assume the distribution is fairly even over a link’s lifetime. So we can say that:
100M URLs * 10 clicks per URL (about one click a day over a ~10-day average lifetime) = 1B requests per month
Requests per second
1B / 30 days / 24 hours / 3600 seconds (60 minutes * 60 seconds) ≈ 400 requests per second
Read and write
We have one billion requests per month; some of them will be read-requests, others write-requests. We can assume that 10% are writes and the rest are reads.
400 * 0.1 = 40 write-requests per second
400 – 40 = 360 read-requests per second
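Checking the per-second arithmetic (the inputs are the assumptions above; the exact quotient is ~386, which these notes round up to ~400):

```python
requests_per_month = 100_000_000 * 10          # 100M URLs, ~10 clicks each -> 1B
seconds_per_month = 30 * 24 * 3600             # 2,592,000 seconds
rps = requests_per_month / seconds_per_month   # ~386, rounded up to ~400

write_rps = 400 * 0.1                          # 10% writes -> ~40 per second
read_rps = 400 - write_rps                     # the rest are reads -> ~360 per second
```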
Data
To calculate the amount of storage we need, we have to consider the length of the URLs and the length of the hash codes. Most importantly, we need to plan ahead, because we are going to keep this data around; let’s say we are looking at a 5-year period.
We already know the number of new URLs per month (100M).
How many URLs in 5 year range?
5 years * 12 months * 100M = 6B URLs we need to store somewhere
We can assume that the length of a URL is somewhere between 100 and 1000 characters. Let’s go with 500 characters (500 bytes).
In order to calculate the length of the hash, we need to work out how short a hash can be while staying unique for all our URLs. Assuming that we use digits plus lower- and upper-case letters, we get 26 * 2 + 10 (digits) = 62 symbols. Now, log62(6B) ≈ 5.45, which we round up to 6 – the length of a unique hash for 6B URLs.
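The hash-length step as a quick check, using Python's two-argument math.log; the ceiling is what matters, since the hash length must be a whole number of characters:

```python
import math

alphabet = 26 * 2 + 10                    # a-z, A-Z, 0-9 -> 62 symbols
total_urls = 6_000_000_000                # the 5-year total from above

exact = math.log(total_urls, alphabet)    # ~5.45: 62^5.45 ≈ 6B
hash_length = math.ceil(exact)            # round up -> 6 characters
```

In other words, 62^5 ≈ 916M is not enough for 6B URLs, but 62^6 ≈ 57B is.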
Knowing that, it’s easy to calculate:
6B * 500 bytes = 3TB to store 6B original URLs
6B * 6 bytes = 36GB to store 6B hashes
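The storage arithmetic, spelled out with the byte sizes assumed above:

```python
total_urls = 6_000_000_000

url_storage = total_urls * 500   # 500 bytes per URL -> 3 * 10^12 bytes = 3 TB
hash_storage = total_urls * 6    # 6 bytes per hash  -> 36 * 10^9 bytes = 36 GB
```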
Traffic per second
40 * (500 + 6) ≈ 20KB of write traffic per second
360 * (500 + 6) ≈ 180KB of read traffic per second
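With ~40 writes and ~360 reads per second, and 506 bytes per record (original URL plus hash), the bandwidth works out as follows:

```python
record_bytes = 500 + 6                  # original URL + hash, per record

write_bandwidth = 40 * record_bytes     # 20,240 bytes -> ~20 KB written per second
read_bandwidth = 360 * record_bytes     # 182,160 bytes -> ~180 KB read per second
```

Tiny numbers by modern standards, which is itself a useful conclusion: for this service the hard part is request rate and availability, not raw bandwidth.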