What is Hidden Partitioning in Apache Iceberg?
Iceberg, built at Netflix, abstracts partitioning logic away from the users of the data, which has made it a powerful force in the big-data industry. How did they do it?
Apache Iceberg was initially developed at Netflix to address limitations in the Hive table format, particularly for large-scale analytics workloads. It became an Apache top-level project in 2020 and has gained significant adoption across major companies like Netflix, Apple, and LinkedIn due to its innovative approach to table management.
Traditional data lake partitioning schemes expose partitioning to users, forcing them to understand storage layout and explicitly reference partition columns in queries. Apache Iceberg introduces hidden partitioning—a metadata-driven approach that improves query performance without burdening users with partition awareness.
What is the Conventional Data-Partitioning Approach?
In traditional database and data lake systems, partitioning is explicitly visible in both the storage layout and query interface. This model creates several significant challenges:
Directory-Based Physical Organization
Traditional systems like Hive implement partitioning through a hierarchical directory structure, where each partition value creates a new directory in the filesystem.
For example, with date-based partitioning, directories are created at the year, month, and day levels:
/table/year=2025/month=04/day=01/data1.parquet
/table/year=2025/month=04/day=02/data2.parquet
/table/year=2025/month=05/day=01/data3.parquet
*The data files in this example use Parquet, an open source columnar storage format. Hive also supports other formats, such as ORC.
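To make the layout concrete, here is a minimal Python sketch of how a writer could derive such a path from a date. The function name `hive_partition_path` is made up for illustration and is not part of Hive.

```python
from datetime import date


def hive_partition_path(table_root: str, day: date, file_name: str) -> str:
    """Build a Hive-style path: one directory level per partition column."""
    return (
        f"{table_root}/year={day.year}"
        f"/month={day.month:02d}/day={day.day:02d}/{file_name}"
    )


print(hive_partition_path("/table", date(2025, 4, 1), "data1.parquet"))
# prints /table/year=2025/month=04/day=01/data1.parquet
```

The partition values live only in the directory names, which is why readers have to know about them.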
Query-Time Partition Awareness
With traditional partitioning, users must explicitly filter on the partition columns in their queries to get partition pruning [1].
Without a filter on the partition columns, the query scans the entire dataset:
SELECT * FROM events
WHERE event_timestamp = '2025-04-01';
With filters on the partition columns, irrelevant directories are excluded from the scan:
SELECT * FROM events
WHERE year = 2025 AND month = 4 AND day = 1;
This approach pushes the complexity of understanding the physical data layout onto users and application developers, creating a tight coupling between physical organization and logical queries.
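The difference between the two queries can be sketched in a few lines of Python. This is a toy model of directory-based pruning, not Hive's actual planner; the directory list and helper names are assumptions made for the example.

```python
# Hypothetical directory listing of a Hive-style table.
PARTITION_DIRS = [
    "/table/year=2025/month=04/day=01",
    "/table/year=2025/month=04/day=02",
    "/table/year=2025/month=05/day=01",
]
PARTITION_COLUMNS = {"year", "month", "day"}


def parse_partition(path: str) -> dict:
    """Extract {'year': 2025, 'month': 4, 'day': 1} from a partition directory."""
    parts = (segment.split("=") for segment in path.split("/") if "=" in segment)
    return {key: int(value) for key, value in parts}


def prune(dirs: list, filters: dict) -> list:
    """Keep directories that match every filter on a partition column.

    Filters on other columns (such as event_timestamp) cannot be checked
    against directory names, so they prune nothing.
    """
    usable = {k: v for k, v in filters.items() if k in PARTITION_COLUMNS}
    return [
        d for d in dirs
        if all(parse_partition(d)[k] == v for k, v in usable.items())
    ]


# Filtering on partition columns leaves one directory to scan.
print(prune(PARTITION_DIRS, {"year": 2025, "month": 4, "day": 1}))
# Filtering on event_timestamp leaves all three.
print(prune(PARTITION_DIRS, {"event_timestamp": "2025-04-01"}))
```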
Partitioning is a vast topic with numerous considerations and approaches. For a more comprehensive understanding, you can refer to Martin Kleppmann's in-depth treatment in Chapter 6 of "Designing Data-Intensive Applications" and the groundbreaking work on workload-aware partitioning in the Schism paper by Curino et al. [2].
Iceberg’s Partitioning Approach with its Metadata Architecture
Iceberg's hidden partitioning capability is built on a multi-tiered metadata architecture that separates table state from data files. This separation allows Iceberg to track partition information at the metadata level rather than through physical organization.
The architecture consists of three key components [3]:
Metadata Files: Store table schema, partition specifications, snapshots, and configuration
Manifest Lists: Track all manifests for a snapshot, including partition value ranges for each manifest
Manifest Files: Contain data file details, including partition values and column-level statistics
This architecture enables crucial separation between how data is physically stored and how it's logically represented. The query engine handles all partitioning logic using the metadata layer—effectively abstracting it from users.
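One way to picture the three tiers is as nested records. The Python sketch below is a simplified model with invented class and field names; Iceberg's real metadata files carry many more fields.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    partition: dict      # partition values, e.g. {"event_day": 20179}
    column_stats: dict   # per-column (min, max), e.g. {"user_id": (1, 500)}


@dataclass
class Manifest:
    path: str
    data_files: list


@dataclass
class ManifestListEntry:
    manifest: Manifest
    partition_ranges: dict   # per-partition-field (lower, upper) summary


@dataclass
class TableMetadata:
    schema: dict
    partition_specs: dict
    snapshots: dict = field(default_factory=dict)  # snapshot id -> manifest list


def summarize(manifest: Manifest) -> ManifestListEntry:
    """Derive the partition value ranges that a manifest list keeps per manifest."""
    ranges = {}
    for data_file in manifest.data_files:
        for name, value in data_file.partition.items():
            lower, upper = ranges.get(name, (value, value))
            ranges[name] = (min(lower, value), max(upper, value))
    return ManifestListEntry(manifest, ranges)
```

A reader starts at the table metadata, picks a snapshot, and walks down only as far as the summaries at each level allow.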
Partition Transforms
At the heart of hidden partitioning are partition transforms—functions that convert column values into partition values. Unlike traditional partitioning where partition columns must be explicitly included in the data, Iceberg applies transforms to existing data columns when writing and reading.
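Iceberg implements several transform types:
Identity: Uses the column value unchanged.
Bucket: Hashes the value into a fixed number of buckets.
Truncate: Truncates a string to a fixed length, or rounds a number down to a multiple of a fixed width.
Temporal (year, month, day, hour): Reduces a date or timestamp to the chosen granularity.
Each transform is a simple, deterministic function of a single source column.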
Research by Novotny et al. concludes that transform-based partitioning can reduce query execution time by up to 60% compared to traditional approaches when properly aligned with query patterns.
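The transforms are small enough to sketch directly. The Python below is an illustration, not Iceberg's code: the temporal and truncate functions follow the usual definitions (offsets from 1970-01-01, rounding down), while `bucket` uses `zlib.crc32` purely as a stand-in hash so the example stays within the standard library. Iceberg specifies its own hash function, so real bucket numbers will differ.

```python
import zlib
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def identity(value):
    """Identity: the partition value is the column value itself."""
    return value


def truncate(value, width: int):
    """Truncate: first `width` characters of a string, or a number rounded
    down to a multiple of `width`."""
    if isinstance(value, str):
        return value[:width]
    return value - (value % width)


def bucket(value, num_buckets: int) -> int:
    """Bucket: hash into [0, num_buckets). crc32 is a stand-in hash here."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Temporal transforms; `ts` must be a timezone-aware datetime.
def year(ts: datetime) -> int:
    return ts.year - 1970


def month(ts: datetime) -> int:
    return (ts.year - 1970) * 12 + (ts.month - 1)


def day(ts: datetime) -> int:
    return (ts - EPOCH).days


def hour(ts: datetime) -> int:
    return int((ts - EPOCH).total_seconds() // 3600)


ts = datetime(2025, 4, 1, 13, 30, tzinfo=timezone.utc)
print(year(ts), month(ts), day(ts))   # prints 55 663 20179
print(truncate(17, 10), truncate("iceberg", 3))   # prints 10 ice
```

Because the engine can apply the same function to a value in a query filter, a predicate on `event_timestamp` can be turned into a predicate on the partition value without the user ever naming a partition column.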
Multi-Level Filtering Algorithm
Iceberg uses a multi-level filtering algorithm that progressively narrows down the set of files to read.
Manifest List Filtering
The first filtering stage uses partition value ranges stored in the manifest list.
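Because the manifest list for a snapshot is a single file, this stage needs only a constant number of storage requests, O(1), to decide which manifests to open, compared with the O(n) directory listings of a traditional data lake, where n grows with table size [4].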
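A range check of this kind might look like the following Python sketch. The `ManifestSummary` record and its fields are simplifications invented for the example, with a single integer partition field.

```python
from dataclasses import dataclass


@dataclass
class ManifestSummary:
    manifest_path: str
    lower: int   # smallest partition value of any file in the manifest
    upper: int   # largest partition value of any file in the manifest


def filter_manifests(summaries, wanted_lower: int, wanted_upper: int):
    """Keep manifests whose partition range overlaps the requested range."""
    return [
        s.manifest_path
        for s in summaries
        if s.lower <= wanted_upper and s.upper >= wanted_lower
    ]


summaries = [
    ManifestSummary("m1.avro", 20170, 20175),
    ManifestSummary("m2.avro", 20176, 20185),
    ManifestSummary("m3.avro", 20200, 20210),
]
print(filter_manifests(summaries, 20179, 20179))   # prints ['m2.avro']
```

Manifests that fail the overlap test are never opened.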
Manifest-Level Filtering
The second stage applies exact partition value matching.
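As a rough Python sketch, assuming each manifest entry is a plain dictionary with a `partition` mapping (a simplification chosen for this example):

```python
def filter_manifest_entries(entries: list, wanted: dict) -> list:
    """Keep entries whose partition values equal every wanted value."""
    return [
        entry for entry in entries
        if all(entry["partition"].get(name) == value for name, value in wanted.items())
    ]


entries = [
    {"path": "a.parquet", "partition": {"event_day": 20178}},
    {"path": "b.parquet", "partition": {"event_day": 20179}},
    {"path": "c.parquet", "partition": {"event_day": 20179}},
]
matches = filter_manifest_entries(entries, {"event_day": 20179})
print([e["path"] for e in matches])   # prints ['b.parquet', 'c.parquet']
```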
Data File Selection
The final stage uses file-level statistics for further filtering.
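A minimal Python sketch of this check, assuming each file carries a `stats` mapping of column name to (min, max). The record shape is hypothetical.

```python
def may_contain(column_stats: dict, column: str, value) -> bool:
    """Return False only when min/max statistics prove the value is absent."""
    if column not in column_stats:
        return True   # no statistics: the file cannot be ruled out
    lower, upper = column_stats[column]
    return lower <= value <= upper


def select_files(files: list, column: str, value) -> list:
    """Keep the paths of files that might contain rows with column == value."""
    return [f["path"] for f in files if may_contain(f["stats"], column, value)]


files = [
    {"path": "b.parquet", "stats": {"user_id": (1, 500)}},
    {"path": "c.parquet", "stats": {"user_id": (501, 900)}},
    {"path": "d.parquet", "stats": {}},
]
print(select_files(files, "user_id", 42))   # prints ['b.parquet', 'd.parquet']
```

Statistics can only rule files out, so a file without statistics is always kept.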
In a performance study by Wang et al. [5], this multi-level filtering approach reduced data scan volume by up to 95% compared to traditional partitioning schemes for certain query patterns.
Partition Evolution Design
A key advantage of hidden partitioning is the ability to evolve partition schemes without rewriting data—something impossible with traditional partitioning where partition columns are embedded in directory structures.
Using partition evolution, Netflix has been able to migrate partition schemes for multi-petabyte tables without downtime or significant performance impact [6].
Version-Based Partition Specs
Iceberg assigns a unique ID to each partition specification and tracks which spec was used for each data file.
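The bookkeeping can be sketched as follows. This Python model follows the description above and uses invented names (`Table`, `evolve_spec`, `append_file`); it is not Iceberg's API.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    specs: dict = field(default_factory=dict)   # spec ID -> (source column, transform)
    default_spec_id: int = -1
    files: list = field(default_factory=list)   # (path, spec ID) pairs

    def evolve_spec(self, source_column: str, transform: str) -> int:
        """Register a new partition spec and make it the default for new writes."""
        spec_id = len(self.specs)
        self.specs[spec_id] = (source_column, transform)
        self.default_spec_id = spec_id
        return spec_id

    def append_file(self, path: str) -> None:
        """Record a new data file under the current default spec."""
        self.files.append((path, self.default_spec_id))


table = Table()
table.evolve_spec("event_timestamp", "month")
table.append_file("old.parquet")
table.evolve_spec("event_timestamp", "day")
table.append_file("new.parquet")
print(table.files)   # prints [('old.parquet', 0), ('new.parquet', 1)]
```

Existing files keep the spec ID they were written with, so changing the scheme does not require rewriting them.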
Handling Multiple Active Partition Schemes
The query planning process accounts for multiple partition schemes.
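One way to picture this: the planner translates the same user filter once per spec, then compares each file against the translation for its own spec. The Python sketch below assumes two specs on one date column, month and day, and a simplified file record; all names are hypothetical.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

# Transform functions, keyed by name.
TRANSFORMS = {
    "month": lambda d: (d.year - 1970) * 12 + (d.month - 1),
    "day": lambda d: (d - EPOCH).days,
}

# Spec ID -> transform applied to the event date (assumed for this example).
SPECS = {0: "month", 1: "day"}


def plan(files: list, specs: dict, wanted: date) -> list:
    """Select files whose partition value matches `wanted` under their own spec."""
    selected = []
    for f in files:
        transform = TRANSFORMS[specs[f["spec_id"]]]
        if f["partition_value"] == transform(wanted):
            selected.append(f["path"])
    return selected


files = [
    {"path": "april.parquet", "spec_id": 0, "partition_value": TRANSFORMS["month"](date(2025, 4, 1))},
    {"path": "may.parquet", "spec_id": 0, "partition_value": TRANSFORMS["month"](date(2025, 5, 1))},
    {"path": "apr01.parquet", "spec_id": 1, "partition_value": TRANSFORMS["day"](date(2025, 4, 1))},
    {"path": "apr02.parquet", "spec_id": 1, "partition_value": TRANSFORMS["day"](date(2025, 4, 2))},
]
print(plan(files, SPECS, date(2025, 4, 1)))   # prints ['april.parquet', 'apr01.parquet']
```

Files written under the coarser month spec are kept at month granularity and filtered further when they are read, while files under the day spec are pruned exactly.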
Optimizations
Like most data storage systems, Iceberg relies on caching. It also keeps its metadata compact, so that the extra metadata layer it introduces stays cheap to read. Some of these optimizations include:
Manifest lists store partition value ranges in compressed form.
Manifests are grouped by partition to minimize the number of files.
Column statistics use appropriate precision to balance size and effectiveness.
By moving partition management into the metadata layer, Iceberg delivers significant advantages for certain query patterns and suits systems that want a simpler yet more flexible approach to big data.
C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden, "Schism: a workload-driven approach to database replication and partitioning," Proceedings of the VLDB Endowment, Vol. 3, No. 1-2, pp. 48-57, 2010.
Y. Cheng, F. Rusu, "Scan Planning in Iceberg Tables," 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2022.
X. Wang, T. Rabl, et al., "Analyzing and Comparing Open Source Data Lake Table Formats," CIDR 2023, Conference on Innovative Data Systems Research, 2023.
J. Russell, R. Blue, "Evolution of Partition Management for Netflix Data Platform," Data+AI Summit 2022, 2022.