https://www.hiredintech.com/classrooms/system-design/lesson/56 – this is the link to the original course.
Below are my notes on the parts that interested me in some way.
Use cases
Starting with the idea of the original service, we can think of two main features:
- shortening – given a long link, we convert it to a short version;
- redirecting – given a short version, we redirect the user to the original web site.
Some second-tier features might be: custom URLs, high availability of the system.
Some things that will be out of scope but are worth mentioning: analytics, automatic link expiration, manual link removal, access to the service via UI or API, etc.
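The two core features above fit in a tiny in-memory sketch. This is only an illustration under my own assumptions (the class and method names are not from the course); base-62 encoding of a numeric id is one common way to produce the short code, and it connects to the 62-symbol alphabet used later in the hash-length calculation.

```python
import string

# 0-9, a-z, A-Z -> the 62-symbol alphabet discussed later in these notes
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

class UrlShortener:
    """Minimal in-memory sketch of the two core features: shorten and redirect."""

    def __init__(self):
        self._urls = []  # position in this list serves as the link's numeric id

    def shorten(self, long_url):
        # Store the URL and encode its id in base 62 to get the short code.
        self._urls.append(long_url)
        n = len(self._urls) - 1
        code = ""
        while True:
            code = BASE62[n % 62] + code
            n //= 62
            if n == 0:
                return code

    def redirect(self, code):
        # Decode the base-62 code back to the numeric id and look it up.
        n = 0
        for ch in code:
            n = n * 62 + BASE62.index(ch)
        return self._urls[n]
```

A real service would persist the mapping and return an HTTP 301/302 instead of the raw URL, but the shorten/redirect pair is the whole contract.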
Constraints
Once we are done with the use cases, it's time to state what we want the service to achieve in terms of load and data. For example:
- How many requests per second should we handle? (Derived from: how much traffic should the system handle?)
- How many new links should we handle in a given period of time? (Derived from: how much data should the system handle?)
Normally the interviewer may not provide any exact numbers (and first of all we should ask), or may hide them behind vague hints and assumptions, e.g. “let’s assume that the service will not be in the top 3 on the market, but it definitely will be in the top 10”. Either way we don’t get precise information; the interviewer’s intention is to see whether we can work such a problem out on our own.
In this situation, we can rely on public data. More specifically, we should know a few headline statistics about the biggest services, such as:
- total number of Facebook users: 1.3B;
- total number of Twitter users: 650M;
- number of new tweets per day: 500M.
The tweets-per-day figure is the one to use here, mainly because Twitter was one of the main drivers of URL-shortener adoption. And the most interesting part: with a few assumptions, knowing only this number lets us calculate everything else.
Total amount of tweets per month
500M * 30 = 15B
Now it’s worth mentioning that not every tweet contains a shortened URL. We can assume that only every 10th does.
Amount of URLs
15B / 10 = 1.5B
But this is the total for the whole market. What share of the URLs are we going to handle? We can apply the 80/20 rule: assume that the top 3 services handle 80% of all URLs and the rest goes to the other companies.
Amount of URLs for non-top-tier companies
1.5B * 0.2 = 300M
We can assume that we will not be the best, but one of the best, which leaves us with somewhere between 10% and 1/3 of the remaining URLs. Let’s assume that we handle 1/3.
Amount of URLs handled by our company per month
300M / 3 = 100M per month
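The whole estimation chain so far fits in a few lines. The starting number and the ratios (1 in 10 tweets, the 80/20 split, our 1/3 share) are the assumptions stated above:

```python
tweets_per_day = 500_000_000               # public Twitter statistic
tweets_per_month = tweets_per_day * 30     # 15B tweets
urls_per_month = tweets_per_month // 10    # every 10th tweet has a short URL -> 1.5B
non_top_tier = urls_per_month * 20 // 100  # 80/20 rule: top 3 services take 80% -> 300M left
our_share = non_top_tier // 3              # we take ~1/3 of the remainder -> 100M
```

Integer arithmetic is enough here; these are order-of-magnitude estimates, not measurements.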
Requests
Knowing the number of URLs, we can work out how many requests we will be handling. We need to estimate the lifespan of a link. Normally, a news item lives for one to two weeks; we can assume the same holds for links.
For sure, some links will be more popular than others. But again, keeping the 80/20 rule in mind, we can assume the distribution is fairly even over a link’s lifetime. So we can say that:
100M URLs * 10 clicks per URL (about one click a day over a ~10-day average lifetime) = 1B requests per month
Requests per second
1B / 30 days / 24 hours / 3600 seconds (60 minutes * 60 seconds) ≈ 400 requests per second
Read and write
We have one billion requests per month; some of them will be read-requests, others write-requests. We can assume that 10% are writes and the rest are reads.
400 * 0.1 = 40 write-requests per second
400 – 40 = 360 read-requests per second
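Checking the per-second arithmetic (the inputs are the assumptions above; the exact quotient is ~386, which these notes round up to ~400):

```python
requests_per_month = 100_000_000 * 10          # 100M URLs, ~10 clicks each -> 1B
seconds_per_month = 30 * 24 * 3600             # 2,592,000 seconds
rps = requests_per_month / seconds_per_month   # ~386, rounded up to ~400

write_rps = 400 * 0.1                          # 10% writes -> ~40 per second
read_rps = 400 - write_rps                     # the rest are reads -> ~360 per second
```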
Data
To calculate the amount of storage we need, we have to consider the length of the URLs and the length of the hash codes. Most importantly, we need to plan ahead, because we are going to keep this data around; let’s say we are looking at a 5-year period.
We already know the number of new URLs per month (100M).
How many URLs in 5 year range?
5 years * 12 months * 100M = 6B URLs we need to store somewhere
We can assume that the length of a URL is somewhere between 100 and 1000 characters. Let’s go with 500 characters (500 bytes).
In order to calculate the length of the hash, we need to work out how short a hash can be while staying unique for all our URLs. Assuming that we use digits plus lower- and upper-case letters, we get 26 * 2 + 10 (digits) = 62 symbols. Now, log62(6B) ≈ 5.45, which we round up to 6 – the length of a unique hash for 6B URLs.
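The hash-length step as a quick check, using Python's two-argument math.log; the ceiling is what matters, since the hash length must be a whole number of characters:

```python
import math

alphabet = 26 * 2 + 10                    # a-z, A-Z, 0-9 -> 62 symbols
total_urls = 6_000_000_000                # the 5-year total from above

exact = math.log(total_urls, alphabet)    # ~5.45: 62^5.45 ≈ 6B
hash_length = math.ceil(exact)            # round up -> 6 characters
```

In other words, 62^5 ≈ 916M is not enough for 6B URLs, but 62^6 ≈ 57B is.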
Knowing that, it’s easy to calculate:
6B * 500 bytes = 3TB to store 6B original URLs
6B * 6 bytes = 36GB to store 6B hashes
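The storage arithmetic, spelled out with the byte sizes assumed above:

```python
total_urls = 6_000_000_000

url_storage = total_urls * 500   # 500 bytes per URL -> 3 * 10^12 bytes = 3 TB
hash_storage = total_urls * 6    # 6 bytes per hash  -> 36 * 10^9 bytes = 36 GB
```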
Traffic per second
40 * (500 + 6) ≈ 20KB of write traffic per second
360 * (500 + 6) ≈ 180KB of read traffic per second
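With ~40 writes and ~360 reads per second, and 506 bytes per record (original URL plus hash), the bandwidth works out as follows:

```python
record_bytes = 500 + 6                  # original URL + hash, per record

write_bandwidth = 40 * record_bytes     # 20,240 bytes -> ~20 KB written per second
read_bandwidth = 360 * record_bytes     # 182,160 bytes -> ~180 KB read per second
```

Tiny numbers by modern standards, which is itself a useful conclusion: for this service the hard part is request rate and availability, not raw bandwidth.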