Yesterday, I attended Snowflake’s World Summit. My experience of working for US companies has taught me some cynicism about the naming of such events, but the CTO and business founder are both French and ex-Oracle employees. They have obviously captured some mind share: the meeting was heaving and very heavily overbooked. I attended the plenary sessions, which consisted of a reference story, and during the break spoke to one of their pre-sales engineers, who was very helpful.
Snowflake offers a cloud-native SQL database-as-a-service optimised for data warehousing applications. Their basic statement of theory is in a white paper presented to SIGMOD 2016, “The Snowflake Elastic Data Warehouse”. Its key insight is that as the cloud service providers innovate in storage design, storage becomes decoupled from the system’s compute. It is now possible to build systems where the compute is based on shared nothing and yet the basic storage for the datasets is shared. Snowflake calls this a multi-cluster, shared-data architecture, although it looks to me like a shared-disk cluster. The clusters exploit the elasticity of the cloud providers, and performance relies on vertical scalability.
On top of this, they implement an independent meta-data services layer, which is managed by FoundationDB and architecturally holds the storage indexes, although we can assume these are structured very differently from classic relational indexes. In fact, they state that Snowflake is a post-b-tree index implementation using min-max pruning techniques, also known as small materialised aggregates[1], zone maps[2] and data skipping[3]. The meta-data services also perform the query plan evaluations. This sort of distributed services design would seem to be borrowed from the industrialised NoSQL challengers such as Cloudera, which itself takes this architecture from Hadoop. Snowflake uses a column-based tuple representation as part of its data warehousing architecture. The column-to-page design map is not documented in the white paper.
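To make the min-max pruning idea concrete, here is a minimal sketch in Python. It is my own illustration of the general zone map technique, not Snowflake's implementation: the names `Partition`, `build_partitions` and `range_scan` are hypothetical, and real systems keep these summaries per column in a separate meta-data store rather than alongside the data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Partition:
    """A block of column values plus its min-max summary (a 'zone map' entry).

    Hypothetical structure for illustration only.
    """
    values: List[int]
    lo: int = field(init=False)
    hi: int = field(init=False)

    def __post_init__(self) -> None:
        # The summary is computed once, when the partition is written.
        self.lo = min(self.values)
        self.hi = max(self.values)


def build_partitions(values: List[int], size: int) -> List[Partition]:
    """Split a column into fixed-size partitions, each with its own min and max."""
    return [Partition(values[i:i + size]) for i in range(0, len(values), size)]


def range_scan(partitions: List[Partition], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the values in [lo, hi] and the number of partitions actually read."""
    matches: List[int] = []
    scanned = 0
    for p in partitions:
        # Pruning step: if the partition's range cannot overlap the
        # predicate's range, skip it without touching its data.
        if p.hi < lo or p.lo > hi:
            continue
        scanned += 1
        matches.extend(v for v in p.values if lo <= v <= hi)
    return matches, scanned


if __name__ == "__main__":
    parts = build_partitions(list(range(100)), 10)
    rows, scanned = range_scan(parts, 25, 34)
    print(f"{len(rows)} rows found, {scanned} of {len(parts)} partitions read")
```

The pruning only pays off when the data is reasonably well clustered on the filtered column; if every partition spans the full value range, nothing can be skipped and the scan degenerates to reading everything.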
On the subject of scalability, there must be an architectural constraint here, which will be aggravated by the multi-tenancy of the solution, but so far no-one seems to be complaining, so we cannot determine whether the bottleneck will be the number of cycles applied to the query, CPU constraints within the meta-data servers or the storage interconnect bandwidth.
The other question I have concerns caching. In my early database days we had problems with multiple caches, and in SMP designs there were problems of cache coherency; these have not gone away, and are often designed out by selecting eventual consistency as the CAP trade-off. As we decouple storage from query processing we are replicating the system instances: storage servers have RAM, CPU and storage, and connect to similarly architected servers. The duplication costs money and will cause delay. It also shows there is no magic in the hardware platform; it’s still a choice about combining SMP and SIMD models, although the CPU designers are sedimenting these architectures onto the chips.
John Ryan was interviewed on the current state and the future of the database market in an article on odbms.org, and made some insightful comments on the changes to the market since Stonebraker’s “The End of an Architectural Era (It’s Time for a Complete Rewrite)” (2007), which looking back can be seen as prescient. Stonebraker argued that the RDBMS implementations had become too generic and that new forms of problem would create a market opportunity for specialised database systems, including one for OLTP, which is where the RDBMS started. I was equally impressed with Ryan’s comments on solutions design focus:
Focus on the problem not the solution. Then (once understood), suggest a dozen solutions and pick the best one. But keep it simple.
A dozen might be a bit too many, and one has to divorce the politics from the solutions selection while remembering that skills supply, both internal and external, is a critical selection criterion.
This was finished in June 2020 and backdated to the date it was started.
I added the description of the new indexing tools and their academic white papers, about min-max pruning techniques i.e. small materialised aggregates, zone maps and data skipping.