A new method to boost the speed of online databases
- by 7wData
Hashing is a core operation in most online databases, like a library catalogue or an e-commerce website. A hash function generates codes that replace data inputs. Since these codes are shorter than the actual data, and usually a fixed length, this makes it easier to find and retrieve the original information.
However, because traditional hash functions generate codes randomly, sometimes two pieces of data can be hashed with the same value. This causes collisions — when searching for one item points a user to many pieces of data with the same hash value. It takes much longer to find the right one, resulting in slower searches and reduced performance.
Certain types of hash functions, known as perfect hash functions, are designed to sort data in a way that prevents collisions. But they must be specially constructed for each dataset and take more time to compute than traditional hash functions.
Since hashing is used in so many applications, from database indexing to data compression to cryptography, fast and efficient hash functions are critical. So, researchers from MIT and elsewhere set out to see if they could use machine learning to build better hash functions.
They found that, in certain situations, using learned models instead of traditional hash functions could result in half as many collisions. Learned models are those that have been created by running a machine-learning algorithm on a dataset. Their experiments also showed that learned models were often more computationally efficient than perfect hash functions.
“What we found in this work is that in some situations we can come up with a better tradeoff between the computation of the hash function and the collisions we will face. We can increase the computational time for the hash function a bit, but at the same time we can reduce collisions very significantly in certain situations,” says Ibrahim Sabek, a postdoc in the MIT Data Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).
Their research, which will be presented at the International Conference on Very Large Databases, demonstrates how a hash function can be designed to significantly speed up searches in a huge database. For instance, their technique could accelerate computational systems that scientists use to store and analyze DNA, amino acid sequences, or other biological information.
Sabek is co-lead author of the paper with electrical engineering and computer science (EECS) graduate student Kapil Vaidya. They are joined by co-authors Dominick Horn, a graduate student at the Technical University of Munich; Andreas Kipf, an MIT postdoc; Michael Mitzenmacher, professor of computer science at the Harvard John A. Paulson School of Engineering and Applied Sciences; and senior author Tim Kraska, associate professor of EECS at MIT and co-director of the Data Systems and AI Lab.
Given a data input, or key, a traditional hash function generates a random number, or code, that corresponds to the slot where that key will be stored. To use a simple example, if there are 10 keys to be put into 10 slots, the function would generate a random integer between 1 and 10 for each input. It is highly probable that two keys will end up in the same slot, causing collisions.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
From Text to Value: Pairing Text Analytics and Generative AI
21 May 2024
5 PM CET – 6 PM CET
Read More