Short URL services have become popular on the major Weibo (microblog) platforms across the country, for example QQ Weibo's url.cn and Sina's sinaurl.cn.

When we post a URL on QQ Weibo, Weibo automatically detects it and converts it to a short form, for example: http://url.cn/2hytqx

Why do this? I think there are several reasons:

A Weibo post is limited to 140 characters. If we need to send a link and the link is very long, it can take up half of our content. That is clearly unacceptable, so short URLs came into being.

Short URLs let us manage outbound URLs in our project very well. Some URLs may lead to violence, advertising, and so on. With user reports we can fully control such a link so that it no longer appears in our app. For this to work, the same URL must always produce the same address when it passes through the algorithm.

We can collect traffic, click counts, and other statistics for a set of URLs and discover what most users are focusing on. This helps us make better decisions about the follow-up work of the project.

The three points above are purely my personal opinion. Because I will be using this in some of my upcoming projects, I studied it a little. Let's take a look at the theory behind the short URL mapping algorithm (source found online):

Compute the md5 of the long URL to get a 32-character signature string, and divide it into 4 segments of 8 characters each;

Loop over these four segments: take the 8 characters, treat them as a hexadecimal number, and AND it with 0x3fffffff (30 one-bits), that is, discard everything above 30 bits;

These 30 bits are divided into 6 groups, and each 5-bit number is used as an index into the alphabet to obtain a specific character. Taken in sequence, this gives a 6-character string;

The whole md5 string yields four 6-character strings; any one of them can be used as the short URL address of this long URL;

The theory is very simple. We cannot claim that a generated short URL is unique, but because we get 4 candidates per URL, duplication can almost always be avoided.

Let's take a look at the program part:

// Requires "using System;" and a reference to System.Web
public static string[] ShortUrl(string url)
{
    // Optional custom key mixed into the input before the md5 hash is computed
    string key = "leejor";
    // Characters used to build the short code
    string[] chars = new string[] {
        "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
        "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
        "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
        "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
        "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"
    };
    // md5 hash of the incoming URL, as a 32-character hexadecimal string
    string hex = System.Web.Security.FormsAuthentication.HashPasswordForStoringInConfigFile(key + url, "md5");
    string[] resUrl = new string[4];
    for (int i = 0; i < 4; i++)
    {
        // Take 8 hexadecimal characters at a time and AND them with 0x3fffffff
        int hexint = 0x3fffffff & Convert.ToInt32("0x" + hex.Substring(i * 8, 8), 16);
        string outChars = string.Empty;
        for (int j = 0; j < 6; j++)
        {
            // AND the value with 0x0000003d to get an index into the chars array.
            // Note: 0x3d is 111101 in binary, so only 32 of the 62 characters can ever be selected.
            int index = 0x0000003d & hexint;
            // Append the selected character
            outChars += chars[index];
            // Shift right by 5 bits on each iteration
            hexint = hexint >> 5;
        }
        // Save the string in the output array at the corresponding index
        resUrl[i] = outChars;
    }
    return resUrl;
}

This method can be used directly as it is. For each URL it returns four candidate values.

For storing the URL data, I personally recommend ttserver (Tokyo Tyrant, the network server for Tokyo Cabinet). Some readers may not have heard of it, so here is an introduction to this database:

Tokyo Cabinet is a DBM database developed by Mikio Hirabayashi of Japan (note: the well-known DBM database QDBM was also developed by him). The database is very fast to read and write. Insert: 0.4 sec/1,000,000 records (2,500,000 qps), so it takes only 0.4 seconds to write 1 million records. Search: 0.33 sec/1,000,000 records (3,000,000 qps), so it takes only 0.33 seconds to read 1 million records.

As you can see, for key/value dictionary lookups this is the most efficient database I have seen so far. It is also very small, so it could hardly be a better fit for storing short URL/long URL pairs.

The system uses 6 short-code characters to represent a URL of any length. Valid characters are ASCII "a" to "z" and "0" to "5", so each character has 2^5 (32) states. 6 short-code characters can therefore represent 32^6 (1,073,741,824) URLs.
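The 32-character variant described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the alphabet order ("a" to "z", then "0" to "5"), the function name short_codes, and the optional key parameter are choices made for this example and are not taken from any particular service.

```python
import hashlib

# Assumed alphabet order: "a"-"z" followed by "0"-"5" (32 characters, 5 bits each).
ALPHABET = "abcdefghijklmnopqrstuvwxyz012345"


def short_codes(url, key=""):
    """Return four 6-character candidate codes for url (hypothetical helper)."""
    # md5 of the (optionally keyed) URL as 32 hexadecimal characters
    digest = hashlib.md5((key + url).encode("utf-8")).hexdigest()
    codes = []
    for i in range(4):
        # Take 8 hex characters (32 bits) and keep only the low 30 bits
        value = int(digest[i * 8:(i + 1) * 8], 16) & 0x3FFFFFFF
        chars = []
        for _ in range(6):
            chars.append(ALPHABET[value & 0x1F])  # low 5 bits pick one character
            value >>= 5                           # move on to the next 5 bits
        codes.append("".join(chars))
    return codes
```

Calling short_codes("http://example.com/some/long/path") returns a list of four strings, each 6 characters long and drawn only from the 32-character alphabet. The same input always gives the same four candidates.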

First, you need a database table to store and retrieve your mapped URLs.

create table mappedurl (
    shortcode char(6) not null,
    longurl text not null,
    primary key (shortcode)
);

Second, you need to define an algorithm that maps long URLs to short URLs. The algorithm has been described above.

Third, you need to create a web page that looks up the original URL for a short URL in the database mapping and redirects to it.
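The three steps can be tied together roughly as below. This is a minimal sketch, assuming an in-memory SQLite database in place of the real store and a column named longurl. The helper names candidates, open_store, shorten and resolve are hypothetical. The web page from the third step would call resolve and answer with an HTTP redirect to the returned URL.

```python
import hashlib
import sqlite3

ALPHABET = "abcdefghijklmnopqrstuvwxyz012345"  # assumed 32-character alphabet


def candidates(url):
    """Four 6-character candidates: 30 bits per md5 quarter, 5 bits per character."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    result = []
    for i in range(4):
        value = int(digest[i * 8:(i + 1) * 8], 16) & 0x3FFFFFFF
        result.append("".join(ALPHABET[(value >> (5 * j)) & 0x1F] for j in range(6)))
    return result


def open_store():
    """Step 1: a table that maps short codes to long URLs (in memory for the demo)."""
    db = sqlite3.connect(":memory:")
    db.execute("create table mappedurl ("
               "shortcode char(6) not null primary key, longurl text not null)")
    return db


def shorten(db, url):
    """Step 2: store url under the first candidate that is free or already its own."""
    for code in candidates(url):
        row = db.execute("select longurl from mappedurl where shortcode = ?",
                         (code,)).fetchone()
        if row is None:
            db.execute("insert into mappedurl (shortcode, longurl) values (?, ?)",
                       (code, url))
            return code
        if row[0] == url:
            return code  # the same URL always gets the same code back
    raise RuntimeError("all four candidates are taken by other URLs")


def resolve(db, code):
    """Step 3: look up the long URL to redirect to, or None if the code is unknown."""
    row = db.execute("select longurl from mappedurl where shortcode = ?",
                     (code,)).fetchone()
    return row[0] if row else None
```

Trying the four candidates in order is one way to use the spare values when a code is already occupied by a different URL. It only works because the long URL is stored next to the code and can be compared.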


md5 has been broken, so we cannot rule out an attacker forging a URL with the same md5 for malicious purposes. Leaving that case aside, the probability of an md5 collision should be extremely low. I guess neither you nor I will ever see one in our lifetimes.

In addition, I do not understand the practical purpose of requiring that the same URL always produce the same key value. Even if the same URL maps to different key values, that generally does not cause much waste, right? Six alphanumeric characters alone can accommodate billions of combinations.

I raised this point only because I was worried about md5 collisions. The same URL must map to the same key value because each URL needs to correspond to exactly one row in the database. But querying directly by URL is slower, because:

The amount of URL and related record data to be stored is very large.

Some URLs can be very long, so a text field has to be used.

If the unique key value is a hash stored in a varchar column, querying by this key is very convenient.

Just as with object hashes in git, there is basically no need to worry about conflicts in this case.

How are bit.ly and other URL shortening services implemented?

Do you need to look up the URL from the hash key? If so, the URL definitely has to be stored somewhere, and that also makes re-hashing possible when a conflict occurs.

md5 is a 128-bit hash code (4 integers, 4 bytes each). Therefore the md5 code of a URL has 2^128 possible values. Pick any two URLs: the probability that their md5 codes are equal is one in 2^128, that is, r = 2^-128.

Suppose URLs are inserted into the database after md5 hashing. The first URL is inserted with no possibility of duplication. When the second md5 is inserted, the probability that it repeats the first is r. When the third is inserted, the probability of a repeat is 2 × r, and so on: the probability of a repeat at the nth insertion is (n-1) × r. For n md5 codes, the probability that two of them are equal is approximately the sum of these probabilities: (1 + 2 + 3 + … + (n-1)) × r = (1/2) × n × (n-1) × r

For a set of n md5 codes, the probability of a duplicate is therefore approximately (1/2) × (n/2^64)^2

Therefore, conflicts only need to be considered when n is large enough to be comparable to 2^64, and 2^64 is still a very large number.
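To get a feel for the numbers, the sum derived above can be evaluated directly. This is a rough sketch: the function name collision_probability and its bits parameter are made up for this example, and the formula is only an approximation that is meaningful while the result stays well below 1.

```python
def collision_probability(n, bits=128):
    """Approximate chance that n random hashes of the given width contain a duplicate.

    Uses (1/2) * n * (n - 1) * r with r = 2^-bits, as in the text above.
    """
    pairs = n * (n - 1) // 2  # 1 + 2 + ... + (n - 1), always an exact integer
    return pairs / 2 ** bits


# One billion stored URLs: far too few for an accidental md5 collision to matter.
print(collision_probability(10 ** 9))

# With n = 2^64 the estimate reaches about one half.
print(collision_probability(2 ** 64))
```

Passing a smaller bits value shows how quickly the estimate grows when only part of the hash is kept.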

So, as long as there is no malicious attack, collisions are rarely encountered in ordinary applications.
