Snowflake Bolsters Support for Apache Iceberg Tables
Snowflake today introduced a series of enhancements for Apache Iceberg, the open table format that it added to its data platform last year. The big announcement is that Snowflake customers can treat Iceberg tables just like they treat native internal Snowflake tables, effectively eliminating the two-tiered system.
When Snowflake introduced support for Apache Iceberg last June, the company supported Iceberg tables as external tables. That gave Snowflake customers the ability to query Iceberg data within their Snowflake environment, but it left out a range of capabilities that were only available for native Snowflake database tables.
That officially changes today, said Chris Child, vice president of product management for Snowflake.
“We’re treating them [Iceberg tables] in the exact same way that we treat standard Snowflake tables,” Child said. “If you are using Apache Iceberg, from your perspective, there’s no difference between a Snowflake native table and an Apache Iceberg table.”
Iceberg users now get access to all of the same features and capabilities as customers who store data in native Snowflake tables, Child added. “This applies if you’re reading, if you’re writing,” he said. “You get access to things like dynamic tables and replication. All of this kind of just works on top of Iceberg.”
While Snowflake is keeping its native table format (which still has some performance advantages over Iceberg tables), Snowflake has gotten rid of the internal table versus external table nomenclature, Child said. “They’re all effectively internal tables.”
Reaching this point required lots of development work, which the Snowflake engineering team is sharing with the development teams for Apache Iceberg and Apache Parquet, the underlying data format that Iceberg rides on, as well as the Apache Avro and Apache Arrow teams, Child said.
Specifically, Snowflake is now allowing the same compute engine previously used for native Snowflake tables to be used against Iceberg tables. Early Snowflake adopters are also using Snowflake's search optimization and query acceleration services with Iceberg tables; those services will be generally available soon, Snowflake says.
While Snowflake has done a lot of work with Iceberg, there is still work to do. For instance, it’s working with the Iceberg community to launch support for VARIANT data types within the popular table format, which is something that it has supported in its proprietary data store for years.
Another area of Iceberg development is supporting data replication and syncing of Iceberg tables, which is important for ensuring the availability of data in the event of disruptions. That capability is currently in private preview, and could soon be in the open Iceberg spec too.
“There’s a number of capabilities that we’ve got in our native tables that aren’t available in Apache Iceberg,” Child said. “Some of this comes down to performance. So in certain ways, we’re able to design the files and design the metadata that we’re storing in a slightly different way [so]… we are still able to get better performance out of our native tables. We get, we think, better performance on Iceberg than any other engine, but the native tables are still a little bit faster.”
It’s still faster to write data using Snowflake’s native table format than to write Iceberg tables, Child said. That makes sense considering there is additional overhead that comes with the Iceberg metadata atop the Parquet format. The difference in data reads is not as big.
“We’ve made a lot of optimizations to things like those write paths,” Child said. “What we actually do on the back end is we write our metadata and then we write the Iceberg metadata as a second step. It’s very quick. It allows us to commit the writes very, very fast, but still allow kind of the Iceberg metadata to be fully reflected.”
Now that it’s gotten rid of the internal vs. external table distinction, the biggest decision that Snowflake customers have to make is who manages the metadata. They can use Snowflake’s managed Apache Polaris (incubating) metadata catalog service and let Snowflake optimize the environment, which will bring some benefits in performance and security. Alternatively, Snowflake customers can manage the metadata themselves or use another metadata catalog, such as AWS Glue or Dremio’s Nessie catalog.
Snowflake is working with the Iceberg community to bring the same security and governance capabilities to the open spec that its customers enjoy within the Snowflake environment, Child said.
“Iceberg doesn’t have support for things like row level security or column level masking or things like that today. Snowflake does,” he said. “Today you can’t create kind of the same tightly governed, very fine-grained controls that you can within Snowflake on Iceberg. There’s another thing that just doesn’t exist in the spec yet. We’re working with the Iceberg team and with the Iceberg community to figure out how we can start to bring more of those finer grained capabilities to Iceberg.”
There’s no timeline for the work with fine-grained access control and row-and-column-level masking in Iceberg, Child said.
Customers recognize that there are tradeoffs involved with Iceberg in Snowflake. Customers can query their Iceberg data using any supported query engine, such as Trino, Dremio, Apache Spark, or Apache Flink, which is a net benefit. However, they don’t enjoy the same level of integrated security and governance when using those engines that customers get when they’re using native Snowflake functions, Child said.
“If you’re like, ‘Hey, I’m going to run some workloads in Spark and some in Flink and some in Trino and some in Snowflake,’ there’s going to be some complexity in getting your governance and your security and everything else the way that you want,” he said. “But for a number of our customers, we’re seeing that they’ve decided that’s worth it, that tradeoff. They want to be able to use those different engines and the extra work they have to do to create the consistent environment that they want is worthwhile.”
Adoption of Iceberg has been robust since Snowflake first unveiled support for it last June. A minority of customers are using Iceberg, but the number is growing quickly.
Related Items:
Snowflake, AWS Warm Up to Apache Iceberg
How Apache Iceberg Won the Open Table Wars
Snowflake Embraces Open Data with Polaris Catalog
The post Snowflake Bolsters Support for Apache Iceberg Tables appeared first on BigDATAwire.