Rethinking Data Marts in the Cloud
- by 7wData
Many of us are all too familiar with the traditional way enterprises operate when it comes to on-premises data warehousing and data marts: the enterprise data warehouse (EDW) is often the center of the universe. Frequently, the EDW is treated a bit like Fort Knox; it’s a protected resource, with strict regulations and access rules. This setup means long lead times for getting new data sets into an EDW (weeks, if not months). It also makes exploratory analysis on large data sets impractical, because an EDW is an expensive platform whose computational processing is shared and prioritized across all users. The friction associated with getting a data sandbox has also led to the proliferation of spreadmarts, unmanaged data marts, and other data extracts used for siloed data analysis. The good news is that these restrictions can be lifted in the public cloud.
Business intelligence (BI) and analytics in the cloud is an area that has gained the attention of many organizations looking to provide a better user experience for their data analysts and engineers. The reason frequently cited for the consideration of BI in the cloud is that it provides flexibility and scalability. Organizations find they have much more agility with analytics in the cloud and can operate at a lower cost point than has been possible with legacy on-premises solutions.
Flexibility, scalability, and lower cost make for a powerful one-two-three punch.
Because the cloud offers the ability to decouple storage and compute, all of an organization’s data can now live in a single place, which eliminates data silos, while departments and teams provision compute to run analytics for their use cases as needed. This arrangement makes self-service BI and analytics a reality for those who adopt such a model. And with an open architecture, there are no worries about technology lock-in.
Now that we’ve discussed the technology options for BI in the cloud, what should an organization consider?
Generally speaking, there are two common use cases for BI and analytics in the cloud that map to the two main architecture patterns: long-lived clusters and short-lived (or transient) clusters. Let’s discuss each in more detail.
Often, data analysts, data scientists, and data engineers want to investigate new and potentially interesting data sets, but would like to avoid as much friction as possible in doing so. It’s quite common for data sets to originate in the cloud, so storing and analyzing them in the cloud is a no-brainer. Such data sets can easily be brought into S3 or ADLS in their raw form as a first step.
Next, a cluster can easily be provisioned with the instance type and configuration of choice, including potentially using spot instances to reduce cost. Generally, instances for transient clusters need only minimal local disk space, since data processing runs directly on the data in the cloud storage. There are tools, like Cloudera Director, that can assist with the instance provisioning and software deployment, making it as easy as a few clicks to provision and launch a cluster. Once the cluster is ready, data exploration can take place, allowing the data analyst to perform an analysis. If a new data set will be created as part of the work, it can be saved back to the cloud storage.
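To make the cost argument for transient clusters and spot instances concrete, here is a minimal sketch in Python. It does not call any cloud or Cloudera API; the class name, the instance label, the hourly price, and the spot discount are all hypothetical values chosen for illustration, and real spot pricing varies over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransientClusterSpec:
    """Hypothetical description of a short-lived analytics cluster."""
    instance_type: str            # illustrative label, not a real instance name
    node_count: int
    on_demand_hourly_usd: float   # assumed on-demand price per node-hour
    spot_discount: float          # assumed fraction saved with spot, 0.0 to 1.0
    use_spot: bool = True

    def hourly_cost(self) -> float:
        """Cost of running the whole cluster for one hour."""
        rate = self.on_demand_hourly_usd
        if self.use_spot:
            rate *= 1.0 - self.spot_discount
        return rate * self.node_count


def estimate_run_cost(spec: TransientClusterSpec, hours: float) -> float:
    """Estimate the compute cost of one exploration session.

    Storage is not included: in this pattern the data stays in cloud
    storage, so the cluster only pays for compute while it is running.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(spec.hourly_cost() * hours, 2)


# Assumed numbers: 10 nodes, 0.50 USD per node-hour, 50% spot discount.
spot = TransientClusterSpec("example.large", 10, 0.50, 0.5, use_spot=True)
on_demand = TransientClusterSpec("example.large", 10, 0.50, 0.5, use_spot=False)

print(estimate_run_cost(spot, hours=3))       # three-hour session on spot
print(estimate_run_cost(on_demand, hours=3))  # same session at on-demand rates
```

Because the cluster is terminated once the analysis is done and any new data set has been saved back to cloud storage, the session length is the main lever on cost in this sketch.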