The Fastest Algorithm: Hash Functions

Arnav
6 min read · Jan 3, 2024

The accidental discovery

A few weeks ago, we were given an assignment to calculate the frequency of each character in a given string. The standard procedure was to build a two-column table: the first column for characters and the second for their frequencies. We would iterate through each character of the string and check whether it already exists in our table. If it does, we increment its frequency; if it does not, we add the character to the table and initialize its frequency to one.
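For concreteness, here is a rough sketch of that standard approach in C. The parallel arrays chars/counts stand in for the two columns, and the name freqNaive is mine, not from the assignment:

#include <stdio.h>
#include <string.h>

// Naive frequency count: a table of (character, count) pairs,
// scanned front to back for every character of the input.
void freqNaive(const char *str)
{
    char chars[128];  // first column: the characters seen so far
    int counts[128];  // second column: their frequencies
    int used = 0;     // how many distinct characters are stored

    for (size_t i = 0; i < strlen(str); i++)
    {
        int found = 0;
        for (int j = 0; j < used; j++)
        {
            if (chars[j] == str[i])
            {
                counts[j]++; // seen before: bump its frequency
                found = 1;
                break;
            }
        }
        if (!found)
        {
            chars[used] = str[i]; // new character: add it...
            counts[used] = 1;     // ...with frequency one
            used++;
        }
    }
    for (int j = 0; j < used; j++)
        printf("%c is here %d times\n", chars[j], counts[j]);
}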

Here is a different way to solve the problem. Observe that there are at most 128 distinct characters: each has an ASCII value from 0 to 127. So what if we use the ASCII values of the characters as indexes into an array? This felt promising. So, I went ahead and made an array of 128 integers, and initialized all the elements to 0. Every unique character has a different ASCII value, and hence a different index. Then I simply iterated through the string, and at every character, I incremented the element of the array whose index equals the ASCII value of that character. For example, ‘A’ has an ASCII value of 65, so I would increment arr[65], which is the same as arr[‘A’], in an array named arr. This made it easy. At the end, I could simply iterate through the 128-element array and use a conditional to display only the characters whose frequency is not 0, in the form → (char)i, ” has frequency ”, arr[i], where i is the integer going from 0 to 127, (char)i in pseudocode means to display the actual character with that ASCII value, and arr[i] is its frequency. Here is the thing in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void freqDisplay(char *str) // Just a function header taking a string as its input
{
    // Do not freak out, just make an array of 128 ints, all set to zero
    int *arr = (int *)calloc(128, sizeof(int));
    size_t len = strlen(str);
    for (size_t i = 0; i < len; i++)
    {
        arr[(unsigned char)str[i]]++; // str[i] is the character at the i-th index
    }
    for (int i = 0; i < 128; i++)
    {
        if (arr[i] > 0) // We do not want to show the characters that do not exist
            printf("%c is here %d times\n", i, arr[i]);
    }
    free(arr);
}

I thought this was really cool and moved on. A few days later, I came across an amazing introductory DSA book, Grokking Algorithms, in which the author introduces hash functions. I then realized that what I had used in that code was a very, very basic implementation of a hash function. Before we get there, though, let me give you a moment to admire how stupidly fast hash functions are.

Just imagine I give you a dictionary, a HUGE dictionary of random words in no particular order, with one word per page. Your task is to check whether a word, say Ephemeral, exists in the dictionary. Given that there is no order, your only option is to go page by page, one at a time. There are about a million words in the English language, so even at a speed of one word per second, it would take you over 11 days to go through the book. If you are lucky and the word exists, you will probably find it earlier and call off the search. But in the worst case, either the word is towards the end of the book, or it simply does not exist, and to confirm that, you have to scan the entire book! That is a continuous 11 days of staring at a book. This approach of going one by one is called linear search.

If there are n pages, it can take you up to n seconds, which means that the time taken increases linearly, i.e., in a straight line. Therefore, its time complexity is given by O(n).
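In code, the page-by-page scan might look like this. It is only a sketch: the array of words and the name linearSearch are my own illustration.

#include <string.h>

// Linear search: check every word, one by one, until we find a match
// or run out of pages. Returns the word's index, or -1 if it is absent.
int linearSearch(char *words[], int n, const char *target)
{
    for (int i = 0; i < n; i++)
    {
        if (strcmp(words[i], target) == 0)
            return i; // found it: call off the search
    }
    return -1; // scanned the entire book; the word does not exist
}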

Well, reading a boring book for 11 days continuously sucks, so let me do you a favour. The dictionary is sorted now. Will this make our task any quicker?

Turns out, it will not only make it quicker, but about 50 thousand times quicker. Here is how:

You open the book roughly in the middle, at around M. Now, we know E comes before M, so you quite literally tear off the right half of the book, as you do not need it: Ephemeral, if it exists, must be in the left half. So now, you are looking from A to M. Then, you look at the centre of that, which is somewhere near F. You know E comes before F, so you tear off the upper half again, and now you are looking from A to F. The centre is now around C, and this time you tear off the first half, as Ephemeral must lie between C and F. You carry on like this, halving what remains, until you are down to a single page, which either holds the word or does not. If it does not, we are certain that the word does not exist in the dictionary. This way of always going for the middle is known as binary search.
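The same tearing-in-half idea as a C sketch, assuming the words sit in an array sorted alphabetically (the name binarySearch is mine):

#include <string.h>

// Binary search: compare against the middle word and discard the half
// that cannot possibly contain the target, until one page remains.
int binarySearch(char *words[], int n, const char *target)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;     // open the book in the middle
        int cmp = strcmp(target, words[mid]);
        if (cmp == 0)
            return mid;                   // the middle word is the one
        else if (cmp < 0)
            hi = mid - 1;                 // target comes earlier: tear off the right half
        else
            lo = mid + 1;                 // target comes later: tear off the left half
    }
    return -1; // nothing left: the word does not exist
}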

For more details: CS50 2019 — Lecture 0 — Binary Search — YouTube

The time complexity of binary search is given by O(log(n)). Log(n) is a remarkably slow-growing function, and you will soon see just how slow.

Let us check the efficiency. The former algorithm is O(n) and the latter is O(log n); how we arrive at this can be learnt in a DSA class. It simply means that while the first one takes over 11 days for a million pages, the latter takes about 20 operations, since 2^20 ≈ 1,000,000, i.e., about 20 seconds. Yes, 20 seconds. That is roughly 50 thousand times more efficient!

Let me show you some graphs to illustrate the difference.

The graph of the former, linear search, is the straight line y = x, shown in blue, and the second one, binary search, is given by y = log(x), shown in red. At this scale the gap might not look like much, but the moment we compare larger numbers, the whole picture changes!

At larger scales, the difference between the blue and the red becomes very obvious: the red curve looks almost like zero next to the blue one!
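To put rough numbers on those curves (counting worst-case comparisons):

n (words)          linear search      binary search (~log2 n)
1,000              1,000              ~10
1,000,000          1,000,000          ~20
1,000,000,000      1,000,000,000      ~30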

This all sounds amazing! O(log n) time complexity is amazing, like come on! That is probably faster than anything.

But what if searching for something could be O(1)?

This means that no matter how huge the set gets, the algorithm always takes a single second, in our example. That is about 20 times faster than binary search and a whopping million times faster than linear search! These are hash functions.

Here is how hash functions work, at a very fundamental level. Say you want to find whether a word exists in our dictionary. We design a formula, a function which, given a word, returns a specific number, such that no other word returns that same number. For example, apple always returns 2342 and banana always returns 3284 (just made-up examples). Such a function is called a hash function. Then, we make an array, let’s call it dict, a huge array whose size is larger than the largest number the function can return. For every word in the dictionary, the index returned by the hash function is set to 1; every other index is set to 0.

Now, our task is easy: we put the word Ephemeral into the hash function, and it returns a number, num. We then check whether dict[num] is a 0 or a 1. If it is a 1, the word exists. I hope you see the logic here. If we come up with a good hash function, we essentially get an O(1) algorithm.
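Here is the whole scheme as a C sketch. The toy hash below (a multiply-and-add over the character codes, wrapped to the table size) is my own stand-in; for a real dictionary it would occasionally send two different words to the same index, which is exactly the kind of issue a serious hash function has to tackle:

#include <stdio.h>

#define TABLE_SIZE 100003 // bigger than any index the toy hash can return

int dict[TABLE_SIZE]; // 1 if some word hashes here, 0 otherwise

// Toy hash function: multiply-and-add over the character codes,
// wrapped around to fit inside the table.
unsigned int hash(const char *word)
{
    unsigned int h = 0;
    for (; *word != '\0'; word++)
        h = h * 31 + (unsigned char)*word;
    return h % TABLE_SIZE;
}

// Build step: mark the index of every word in the dictionary.
void insert(const char *word)
{
    dict[hash(word)] = 1;
}

// Lookup step: one hash, one array access. No scanning, no halving.
int exists(const char *word)
{
    return dict[hash(word)] == 1;
}

int main(void)
{
    insert("apple");
    insert("banana");
    printf("%d\n", exists("apple"));     // 1: it was inserted
    printf("%d\n", exists("Ephemeral")); // 0, unless a collision lies to us
    return 0;
}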

The algorithm I used earlier also had a sort of hash function: one that takes a character and returns its ASCII value, which we then use as an index. It was extremely simplified, but it serves the same purpose. Instead of searching for a value, just summon it! A more sophisticated function would be required to tackle other issues, such as two different words hashing to the same number.

The good news, though: we do not need to come up with a good function ourselves. Many languages support hash-based structures out of the box, such as HashMap in Java and dictionaries in Python. This powerful data structure shows how amazingly efficient computer programs can get with some curiosity and brainpower.

{1: 'Thanks', 2: 'For', 3: 'Reading'}
