Dremio and its eponymous platform have always been focused on high-performance data virtualization. Such platforms are centralized brokers that connect to and query multiple data sources on a user's or application's behalf. A hallmark of data virtualization, and one of its biggest hurdles, is performing federated queries that return a single result set composed of data from more than one of the connected sources. But today, Dremio is turning its attention to a single data source: the cloud data lake.
Jumping in the Lake
Rather than obsessing over the performance of querying multiple sources, Dremio is introducing technology that optimizes access to cloud data lakes. There are good reasons for this. Cloud data lakes are seeing increasing adoption in the enterprise, acting as the de facto collection point for corporate big data. But the very architectural premise that makes the cloud data lake economically efficient, its basis in cloud object storage, also carries an often-unspoken downside: data access is slow.
Rather than leave that flaw as an obscured detail, Dremio is taking it head-on and working to solve the problem. The company has introduced new Data Lake Engine technology, as part of its Dremio 4.0 release, that significantly increases query performance against data in Amazon Web Services (AWS) Simple Storage Service (S3) and Azure Data Lake Storage (ADLS) Gen2.
Think Globally, Cache Locally
In a briefing with ZDNet, Dremio chief executive officer and cofounder Tomer Shiran explained the new architectural approach, which is fairly straightforward (in hindsight). Because Dremio itself is deployed on a cluster of cloud virtual machines (VMs), and because VM instance types that use high-speed Non-Volatile Memory Express (NVMe) solid-state drive (SSD) technology are plentiful, Dremio's Data Lake Engine caches data from the cloud storage-based lake onto those much faster SSDs for better performance. Dremio's name for this technology is "C3" (the Columnar Cloud Cache).
This is a smart approach based on robust precedent: Various cloud data warehouse platforms, including Snowflake and Azure SQL Data Warehouse Gen2, use this very same technique of SSD caching to maximize performance while still leveraging economical cloud object storage as the persistent backing layer. In the cloud data warehouse world, this enables the clusters to be paused and resumed, while keeping bulk storage costs low.
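For readers who want a concrete picture of the general pattern, here is a minimal Python sketch of a read-through cache: data is fetched from slow object storage on a miss, written to a fast local disk, and served from that disk on later reads. It is an illustration only, not Dremio's C3 implementation. The `ReadThroughCache` class, its `fetch` callback, and the least-recently-used eviction policy are all assumptions made for the example, and it caches whole objects rather than columnar chunks.

```python
import hashlib
import os
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Hypothetical read-through cache backed by a local (ideally SSD) directory."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: str, capacity_bytes: int):
        self._fetch = fetch              # assumed callback that reads from object storage
        self._cache_dir = cache_dir      # directory on the fast local disk
        self._capacity = capacity_bytes  # maximum bytes kept on local disk
        self._sizes = OrderedDict()      # key -> size, least recently used first
        self._used = 0
        self.hits = 0
        self.misses = 0

    def _path(self, key: str) -> str:
        # Hash the object key so any key maps to a safe local file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self._cache_dir, name)

    def get(self, key: str) -> bytes:
        if key in self._sizes:
            # Cache hit: serve from local disk and mark as recently used.
            self._sizes.move_to_end(key)
            self.hits += 1
            with open(self._path(key), "rb") as f:
                return f.read()
        # Cache miss: go to the slow backing store, then keep a local copy.
        self.misses += 1
        data = self._fetch(key)
        self._store(key, data)
        return data

    def _store(self, key: str, data: bytes) -> None:
        size = len(data)
        if size > self._capacity:
            return  # object is larger than the whole cache; skip caching it
        # Evict least recently used entries until the new object fits.
        while self._used + size > self._capacity:
            old_key, old_size = self._sizes.popitem(last=False)
            os.remove(self._path(old_key))
            self._used -= old_size
        with open(self._path(key), "wb") as f:
            f.write(data)
        self._sizes[key] = size
        self._used += size
```

In this sketch the object store remains the system of record, so the local cache can be discarded at any time without losing data. That property is what makes it possible to pause and resume compute clusters while bulk data stays in inexpensive object storage.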