Designing a Scaleable URL Shortener
What is a URL Shortener?
let's assume we want to put a link in our resume which is very long, long URLs are hard to remember and seem ugly. URL shortener can solve this problem, given a long URL it will give us a much smaller pleasant-looking URL, that we can share and attach in places.
So a URL shortener app is basically where users can enter a long URL and get a smaller URL in return
now to design a URL shortener our main concern would be to
Shorten a URL
retrieve the original URL from shortened URL
Shortening a URL
to make a URL short we need a short code or string. so how can we get this string
we can generate a random string
we can hash the large url but since hashes are very long so take portions of it
problem with both approaches there are chances of a collision.
Say we generate a random string or hash for our original URL and take portions from the hash and let's call this string "code", every time we want to shorten a URL we would require to generate a unique code, so we will have to make sure that the string does not exist on the database which will require a search operation. Or we could just put in a unique constrain for the code column, but in this case, if the same code already exists in DB there will be a constrain violation and we would need to generate the code again, so for N collisions we would require to generate the code N times, as our database grows chances of collision become very high
Retrieve original URL
We would require to get the original URL from the code, We would require a search operation for this, to make The search operation faster we can create an index of the code column to make search operations faster
But numeric indexes are better in performance in comparison to the string-based index, so it would be great if the code column was numeric
A better approach by avoiding collision and making retrieval faster
our current code generation process does not ensure uniqueness by default we would have to check if it already exists, which is not very efficient. What if we could generate code in a way that is guaranteed to be unique, so here it is
we would generate code from a pre-determined range of numbers say from 10,000 to 90,000
we would have a counter in our server indicating the current number for code generation
we will convert the number to a higher base (base 16) to make it shorter and represent the shortened code
issues while scaling
Now what if we want to scale our app vertically, let's assume we want 3 instances of our app backend so what are the issues that we are going to face
earlier we provided the server with a server range of numbers
so now we are going to need a range of numbers for each server
what will happen when a server has exhausted all the numbers at its given range?
or if a new server instance joins how are we going to assign a number range?
And what if a server crashes and a new instance has to replace it, how is the new instance going to pick up from where the old server left
Solution
We can use a distributed config manager for coordinating the range configurations among the server, it would
store a list of ranges to distribute
assign a new range when a new server joins and lock that range
when a server crashes the range would be unlocked and available for other servers again, so the new server could pick up from where the old one left
when a server runs out of range, it would mark that range to be used and assign a new range
So how to build this config manager ??
Introducing Apache Zookeeper
It is a distributed config manager developed by the Apache foundation it is used along with other Apache products such as ApacheKafka for leader election and distributed config management
to learn more about Apache Zookeeper check out this blog link below