The truth about in-memory computing
- by 7wData
A few weeks back one of my favourite analysts, Merv Adrian tweeted the following:
"'Just move it to memory and it will speed up.' Not so fast (pun intended). Serious engineering required – even for a KV store."
I could not help but smile when I saw this. I’ve spent years telling anyone who would listen that putting data into memory doesn’t instantly transform software, originally written for disk-based data, to “in-memory”.
In 1988, at White Cross Systems (a pioneer in MPP in-memory systems, which later evolved into Kognitio) we set out to use the concept of MPP to build a database that would support what we now call data analytics, but was then called Decision Support. Most databases at that time were designed and optimised for transaction processing rather than decision support, so we were effectively starting from scratch. We wanted to build a system that was fast enough to support train-of-thought analysis and could scale linearly to support large and growing data volumes.
We never set out to build an in-memory system, but it became clear to us early on that if we wanted to exploit massive parallelisation, we could not be limited by disk IO speeds. Reading data from slow physical disks seriously limits the amount of parallelisation you can effectively deploy to any task, as the CPUs very quickly become starved of data and everything becomes disk IO bound.
This is the most basic and important point that is often missed when talking about in-memory. It's not the putting of data "in memory" that makes things faster. Memory, like disk, is just another place to park the data. It's the processors (CPUs) that run the actual data analysis code. Keeping the data in memory gives the CPUs fast access to it, keeping them fed with data and enabling parallelisation.
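The starvation point can be made concrete with a back-of-envelope calculation. All of the figures below are illustrative assumptions, not numbers from the article:

```python
# Hedged sketch with assumed, illustrative figures: how many cores can a
# single data source keep busy during a full-table scan before they starve?
disk_mb_s = 150         # assumed sequential throughput of one spinning disk
ram_mb_s = 20_000       # assumed sustained read bandwidth of main memory
core_demand_mb_s = 500  # assumed rate at which one core consumes scan data

cores_fed_by_disk = disk_mb_s / core_demand_mb_s  # 0.3 -- not even one core
cores_fed_by_ram = ram_mb_s / core_demand_mb_s    # 40.0 -- real parallelism

print(cores_fed_by_disk, cores_fed_by_ram)
```

Under these assumptions a lone disk cannot keep even one core fed, while RAM sustains dozens, which is why adding CPUs to a disk-bound system buys almost nothing.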
For this reason we decided to build a system which kept the data of interest in fast computer memory, or RAM (Random Access Memory). In retrospect this was a brave decision to make in the late 80s. Memory was still very expensive, but because we were rather young and naïve, we believed that the price would fall relatively quickly, making the holding of large data sets in memory an economical proposition. Ultimately we were right, even if it did take a couple of decades longer than we thought!
The point I'm making is this: when we took the decision to go in-memory, it dramatically changed our code design philosophy. Not being disk IO bound meant we became CPU bound, so code efficiency became hugely important. Every CPU cycle was precious and needed to be used as effectively as possible. For example, in the mid 90s we incorporated "dynamic code generation" into the software, a technique that dynamically compiles the execution phase of any query into low-level machine code, which is then distributed across all of the CPUs in the system. This technique reduced code path lengths by a factor of 10 to 100. I am not saying that advanced techniques like machine code generation are essential components of an in-memory system, but I am saying that using an efficient programming language is important when machine cycles matter. So probably not Java.
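The general idea behind dynamic code generation can be sketched as follows. This is not Kognitio's implementation: it uses Python bytecode where an MPP engine would emit machine code, and the toy query and function names are hypothetical:

```python
# Hedged sketch of "dynamic code generation": instead of re-interpreting a
# generic query plan for every row, compile the execution phase of this
# exact query into one specialized function, then run it over in-memory data.

def interpret(rows, threshold):
    # Generic row-at-a-time interpreter: checks the plan's predicate and
    # aggregate on every row via ordinary dispatch.
    total = 0
    for row in rows:
        if row["qty"] > threshold:
            total += row["price"]
    return total

def generate(threshold):
    # "Code generation": build source text for this exact query, compile it
    # once, and return the resulting function. The constant threshold is
    # baked directly into the generated code, shortening the per-row path.
    src = (
        "def q(rows):\n"
        f"    return sum(r['price'] for r in rows if r['qty'] > {threshold})\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["q"]

rows = [{"qty": q, "price": q * 2} for q in range(10)]
query = generate(5)
assert query(rows) == interpret(rows, 5)  # same answer, specialized code path
```

The payoff is the same in spirit as in the article: the generated function does only the work this one query needs, while the interpreter pays generic dispatch costs on every row.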
Designing code specifically for in-memory has another important benefit: besides being faster, RAM is also accessed in a different way to disk.