What is Snowflake?
Snowflake is a SaaS cloud computing platform that offers storage and analytics, commonly termed “data warehouse-as-a-service”. Snowflake supports workloads such as data warehousing, data lakes, data engineering, data application development, and real-time/shared data. It runs on the three leading cloud infrastructures (Amazon Web Services, Microsoft Azure, and Google Cloud Platform), which means there is no hardware or software to select, install, configure, or manage. This blog focuses on explaining the key concepts of the Snowflake storage layer.
Why Should Organizations Use Snowflake?
Snowflake is a data platform built “in and for” the cloud that can be used as both a data lake and a data warehouse. Because it is built as a cloud-native application, it can scale up and down with computing needs while still meeting performance requirements, so organizations no longer need a separate data lake and data warehouse. Data-driven organizations have used various tools and techniques to collect, process, store, and protect proprietary data for decades. Traditionally, data engineering teams stored data either in a data warehouse (processed data) or in a data lake (raw data). While both options improved on past inefficiencies, keeping data in separate systems makes it harder to retrieve data located in different sources. Snowflake eases the workload of the data engineering team and removes the need to buy two different applications.
Snowflake Architectural Layers
Snowflake has a unique architecture that consists of three layers:
- Storage Layer
- Compute or Processing Layer
- Cloud Services Layer
A few points about how the storage layer handles data internally:
- The Snowflake storage layer uses the PAX (Partition Attributes Across) storage model, which is a hybrid of column-store and row-store.
- PAX is designed to give high data cache performance once the data has been read from disk, so that cache space is fully utilized.
- Once local disk space is exhausted, Snowflake uses S3 to store temporary data generated by query operators (for example, massive joins), as well as large query results.
- Spilling temporary data to S3 allows the system to run large queries without running out of memory or disk space.
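To make the "hybrid of column-store and row-store" idea more concrete, here is a minimal Python sketch of a PAX-style layout. It is only an illustration, not Snowflake's actual implementation: the function names `to_pax_pages` and `read_row` and the `rows_per_page` parameter are hypothetical. Rows are first grouped into pages (as in a row store), and inside each page the values are stored column by column (as in a column store).

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def to_pax_pages(rows: Sequence[Row], rows_per_page: int) -> List[Dict[str, list]]:
    """Group rows into pages, then lay out each page column by column."""
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        # Inside a page, all values of one column sit next to each other.
        page = {col: [row[col] for row in chunk] for col in chunk[0]}
        pages.append(page)
    return pages


def read_row(pages: List[Dict[str, list]], rows_per_page: int, index: int) -> Row:
    """Rebuild one full row; every column of that row lives in the same page."""
    page = pages[index // rows_per_page]
    offset = index % rows_per_page
    return {col: values[offset] for col, values in page.items()}


# Hypothetical sample data.
rows = [{"id": i, "name": f"user{i}"} for i in range(5)]
pages = to_pax_pages(rows, rows_per_page=2)
```

Because all columns of a row stay within one page, rebuilding a row touches a single page, while a scan over one column still reads contiguous values.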
Snowflake Storage Layer
When data is ingested into Snowflake, it is reorganized into multiple micro-partitions that are internally optimized, compressed, and stored in columnar format. Data is stored in the cloud and works as a shared-disk model (data is accessible by all the clusters), which simplifies data management. The data objects stored by Snowflake are neither directly visible nor accessible to the user; they can only be accessed through SQL query operations. Three concepts underpin the table structure that makes retrieval fast in Snowflake: micro-partitions, data clustering, and columnar format.
Micro Partition – In contrast to traditional static partitioning (where a column name must be given manually to partition the data), all data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. A micro-partition is a physical structure in Snowflake. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. The Snowflake storage layer automatically determines the most efficient compression algorithm for the columns in each micro-partition. The rows in a table are then mapped into individual micro-partitions, organized in a columnar fashion. As data is inserted or loaded, tables are transparently partitioned using the ordering of the data. All DML operations (e.g., DELETE, UPDATE) use the underlying micro-partition metadata to facilitate and simplify table maintenance. For example, some operations, such as deleting all rows from a table, are metadata-only.
Data Clustering – Data clustering is a critical factor in queries because table data that is unsorted or only partially sorted may hurt query performance, especially on very large tables. In the Snowflake storage layer, when data is inserted or loaded into a table, clustering metadata is collected and recorded for each micro-partition created during the process. Snowflake then uses this clustering information to avoid unnecessary scanning of micro-partitions during querying, which speeds up queries that reference these columns.
Columnar Format – Data stored in columnar format has significant advantages over row-based formats:
- Data security, since data is not human-readable
- Low storage consumption
- Efficient reads: data is retrieved in less time, which minimizes latency
- Supports advanced nested data structures
- Optimized for queries that process large volumes of data
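The ideas of micro-partitions, per-partition metadata, and columnar layout can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Snowflake's internal format: the `MicroPartition` class, the `build_micro_partitions` function, and the `rows_per_partition` parameter are hypothetical, and the sketch splits by row count, whereas Snowflake sizes micro-partitions by data volume. Rows are split in arrival order, stored column by column, and the minimum and maximum of each column are recorded as metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class MicroPartition:
    columns: Dict[str, list]      # column name -> values (columnar layout)
    min_max: Dict[str, Tuple]     # column name -> (min, max) metadata


def build_micro_partitions(
    rows: Sequence[Dict[str, object]], rows_per_partition: int
) -> List[MicroPartition]:
    """Split rows in arrival order and record min/max metadata per column."""
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    partitions = []
    for start in range(0, len(rows), rows_per_partition):
        chunk = rows[start:start + rows_per_partition]
        columns = {col: [row[col] for row in chunk] for col in chunk[0]}
        min_max = {col: (min(vals), max(vals)) for col, vals in columns.items()}
        partitions.append(MicroPartition(columns=columns, min_max=min_max))
    return partitions


# Hypothetical sample data, loaded roughly in "day" order.
rows = [
    {"day": 1, "amount": 30},
    {"day": 1, "amount": 10},
    {"day": 2, "amount": 25},
    {"day": 3, "amount": 40},
    {"day": 3, "amount": 5},
]
partitions = build_micro_partitions(rows, rows_per_partition=2)
```

Because the sample rows arrive roughly sorted by `day`, the recorded `day` ranges of the partitions barely overlap, which is the property that makes the metadata useful for skipping partitions at query time.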
Physical Structure of Snowflake Storage Layer
Based on the above concepts, let’s take a deeper look at the physical structure of the storage layer. The table in the figure consists of 24 rows stored across 4 micro-partitions, with the rows divided equally between the micro-partitions. Within each micro-partition, the data is sorted and stored in columnar format, which enables Snowflake to perform the following actions for queries on the table:
- Prune micro-partitions that are not needed for the query.
- Within the remaining micro-partitions, prune by column so that only the columns specified in the query are read.
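The two pruning steps above can be sketched as follows. This is a toy Python model rather than Snowflake's query engine: the helper names `make_partition` and `scan`, the dictionary layout, and the sample data are all assumptions made for illustration. Each partition keeps a (min, max) range per column, so a partition whose range cannot contain the filter value is skipped entirely, and only the requested columns are read from the partitions that remain.

```python
from typing import Dict, List, Tuple


def make_partition(columns: Dict[str, list]) -> dict:
    """Build a hypothetical micro-partition with per-column (min, max) metadata."""
    min_max = {col: (min(vals), max(vals)) for col, vals in columns.items()}
    return {"columns": columns, "min_max": min_max}


def scan(
    partitions: List[dict], filter_col: str, value, select_cols: List[str]
) -> Tuple[List[dict], int]:
    """Return matching rows and the number of partitions actually scanned."""
    results, scanned = [], 0
    for part in partitions:
        low, high = part["min_max"][filter_col]
        if value < low or value > high:
            continue  # step 1: partition pruning, decided from metadata alone
        scanned += 1
        # Step 2: column pruning, only the selected columns are read.
        needed = {col: part["columns"][col] for col in select_cols}
        for i, candidate in enumerate(part["columns"][filter_col]):
            if candidate == value:
                results.append({col: needed[col][i] for col in select_cols})
    return results, scanned


# Hypothetical sample table with three micro-partitions.
partitions = [
    make_partition({"day": [1, 1, 2], "name": ["a", "b", "c"], "amount": [5, 7, 9]}),
    make_partition({"day": [2, 3, 3], "name": ["d", "e", "f"], "amount": [1, 2, 3]}),
    make_partition({"day": [4, 5, 5], "name": ["g", "h", "i"], "amount": [8, 6, 4]}),
]
results, scanned = scan(partitions, filter_col="day", value=2, select_cols=["name"])
```

In this example the third partition is never opened, because its `day` range (4 to 5) cannot contain the value 2, and the `amount` column is never read from any partition.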