Startup Dremio emerges from stealth, launches memory-based BI query engine
- by 7wData
When the open source Apache Arrow project was launched early last year, I covered it with great interest. The project's active contributors hailed from 13 other open source projects as wide-ranging as Cassandra, Impala, Pandas, Spark and Hadoop itself. All of these projects have occasion to place data in memory in a column-oriented fashion, and they've all done it their own way. The Arrow project is all about creating a standard that the other projects can share, so that they can also share data between themselves, without having to convert its in-memory representation.
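The core idea behind Arrow's columnar standard can be sketched in plain Python. This is a minimal, hypothetical illustration of row-oriented versus column-oriented in-memory layouts, not Arrow's actual buffer format; the record values and the `to_columnar` helper are invented for the example.

```python
# Row-oriented: one record per entry; scanning a single column touches
# every row object (illustrative data, not from any real system).
rows = [
    {"user_id": 1, "score": 0.5},
    {"user_id": 2, "score": 0.75},
    {"user_id": 3, "score": 0.9},
]

# Column-oriented (the layout style Arrow standardizes): each column is
# one contiguous array, which makes column scans cheap and lets
# Arrow-aware systems share data without converting representations.
columns = {
    "user_id": [1, 2, 3],
    "score": [0.5, 0.75, 0.9],
}

def to_columnar(records, fields):
    """Pivot row-oriented records into a columnar layout."""
    return {f: [r[f] for r in records] for f in fields}

print(to_columnar(rows, ["user_id", "score"]))
```

In the real project, the payoff is that Pandas, Spark, Impala and the rest can hand each other these contiguous column buffers directly instead of each maintaining its own private in-memory format.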
In addition to the many companies, such as Hortonworks, Cisco and LinkedIn, that lent personnel to this project, a new startup called Dremio was the major force behind it. Though the company has been in stealth mode until today, its support of, and focus on, Arrow was explicit. Two of Dremio's founders, Tomer Shiran (Dremio's CEO) and Jacques Nadeau (Dremio's CTO and Program Management Committee Chair of Arrow), both hailed from MapR (where Shiran was VP of product) and, significantly, from the Apache Drill project as well.
Drill acts as a single SQL engine that, in turn, can query and join data from among several other systems. Drill can certainly make use of an in-memory columnar data standard. But while Dremio was still in stealth, it wasn't immediately obvious what Drill's strong intersection with Arrow might be. That made it hard to guess what Dremio was up to.
Introducing Dremio, the product
But with Dremio emerging from stealth today, the association is clearer, because today the company is launching a namesake product that also acts as a single SQL engine that can query and join data from among several other systems, and it accelerates those queries using Apache Arrow.
Let's back off the comparison with Drill though, and understand Dremio in its own right. It all stems from Dremio's credo that BI today involves too many layers. Source systems, via ETL processes, feed into data warehouses, which may then feed into OLAP cubes. BI tools themselves may add another layer, building their own in-memory models in order to accelerate query performance. Dremio thinks that's a huge mess.
Data lingua franca
Dremio disintermediates things by providing a direct bridge between BI tools and the source systems they're querying. The BI tools connect to Dremio as if it were a primary data source, and query it via SQL. Dremio then delegates the query work to the true back-end systems through push-down queries that it issues. Dremio can connect to relational databases (both commercial and open source), NoSQL stores, Hadoop, cloud blob stores and Elasticsearch, among others.
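The push-down pattern described above can be sketched as a thin middle layer that owns no data itself and simply forwards each query to the backend that does. This is a hypothetical illustration of the concept, not Dremio's actual architecture or API; the `PushDownBridge` class is invented for the example, and two in-memory SQLite databases stand in for heterogeneous back-end sources.

```python
import sqlite3

class PushDownBridge:
    """Minimal sketch of a federation layer: it materializes no data of
    its own and pushes each query down to the owning back-end system."""

    def __init__(self):
        self.backends = {}  # source name -> database connection

    def register(self, name, conn):
        self.backends[name] = conn

    def query(self, source, sql, params=()):
        # Delegate ("push down") the query to the backend that owns the data.
        return self.backends[source].execute(sql, params).fetchall()

# An in-memory SQLite database stands in for one back-end source.
sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

bridge = PushDownBridge()
bridge.register("sales", sales)
print(bridge.query("sales", "SELECT SUM(amount) FROM orders"))  # [(29.5,)]
```

The design point is that the bridge stays stateless with respect to the data: aggregation happens inside the backend, and only results cross the wire, which is what lets a BI tool treat the federation layer as if it were a single primary source.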
In an interview last week, Shiran and Nadeau told me that Dremio does not materialize its own data store in between the BI tool and the physical back-end databases, and yet it makes queries against that back-end data -- even when it's true Big Data -- perform like queries against "small data" that a BI tool might have in its own local model.