How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?

In this blog post, we quantify the metadata overhead of Apache Parquet files that store thousands of columns, measuring both metadata size and decode time with the Rust implementation parquet-rs. We conclude that while technical concerns about Parquet metadata are valid, the actual overhead is smaller than generally recognized. In fact, optimizing writer settings and simple implementation tweaks can reduce the overhead by 30-40%, and with significant additional implementation optimization, decode speeds could improve by up to 4x.

Figure 1: Metadata decode time for 1000 Parquet Float64 columns using parquet-rs. Configuring the writer to omit statistics improves decode performance by 30% (9.1ms → 6.9ms). Standard software engineering optimization techniques improve decode performance by another 40% (6.9ms → 4.1ms without statistics; 9.1ms → 6.4ms with default statistics).

Introduction

It has recently been asserted that Parquet is not suitable for wide tables with thousands of columns, which are common in machine learning workloads. These assertions often accompany proposals for new file formats, such as BtrBlocks, Lance V2, and Nimble [1].

Usually, the stated rationale is that wide tables have “large” metadata, which takes a “long time” to decode, often longer than reading the data itself. Using Apache Thrift to store the metadata means the entire metadata payload must be decoded for each file, even when only a small subset of columns is required. It also appears to be common (though incorrect) to equate Parquet (the format) with a specific Parquet implementation (e.g., parquet-java) when evaluating performance.

Leaving aside the fact that many query systems cache information from the Parquet metadata in a form suited for faster processing, we wanted quantitative information on how much of the purported metadata overhead is due to limitations in the Parquet format vs. how much is due to less optimized implementations or poorly configured settings of Parquet writers.

Background

Parquet files include the metadata required to interpret the file. This metadata also enables the reader to load only the portions of the file necessary to answer a query. More information on these techniques can be found in Querying Parquet with Millisecond Latency. Typical Parquet files are GBs in size, but many queries read only a small portion, so the metadata is often critical to quickly finding the required data.

Figure 2: Layout of Parquet files. The metadata is stored in the footer (at the end of the file) and contains the location of pages within the file and optional statistics such as min/max/null counts for each column chunk.

As shown in Figure 2, the structure of Parquet metadata mirrors that of the Parquet file: it contains an entry for each row group, and each entry contains information for each column chunk within that row group. This means that the metadata size is O(row_groups × columns) and grows linearly with both the number of row groups and the number of columns.

In addition to the information required to decode each column’s data, such as starting offset and encoding type, the metadata can optionally store min, max, and null counts for each column chunk. Query engines, such as Apache DataFusion and DuckDB, can use these statistics to skip decoding row groups and data pages entirely.

The metadata is encoded in the Apache Thrift format, which is similar to protobuf. Thrift uses variable-length encoding to achieve high space efficiency. However, the variable-length encoding requires Parquet readers to fetch and potentially examine the entire metadata footer before reading any content. For example, it is not possible to jump directly to the location in the metadata required to read a single row group without starting at the beginning.

Reading Parquet metadata in parquet-rs’s ArrowReader requires three steps (sketched in code after the list):

  1. Load the metadata from storage to memory
  2. Decode the Thrift-formatted data into in-memory structures (ParquetMetaData).
  3. Build the Arrow Schema from the Parquet SchemaDescriptor.
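
The following sketch shows one way to perform these three steps explicitly with the parquet crate's public API (the ArrowReader performs the equivalent work internally). The file name and the printed summary are illustrative, and error handling is simplified.

```rust
use std::fs::File;

use parquet::arrow::parquet_to_arrow_schema; // step 3: SchemaDescriptor -> Arrow Schema
use parquet::file::footer::parse_metadata;   // steps 1 and 2: read footer bytes, decode Thrift

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Steps 1 and 2: read the footer from storage and decode the Thrift payload
    // into the in-memory ParquetMetaData structures.
    let file = File::open("wide_table.parquet")?; // illustrative path
    let metadata = parse_metadata(&file)?;
    let file_meta = metadata.file_metadata();

    // Step 3: build the Arrow Schema from the Parquet SchemaDescriptor.
    let arrow_schema =
        parquet_to_arrow_schema(file_meta.schema_descr(), file_meta.key_value_metadata())?;

    println!(
        "decoded metadata for {} row groups and {} columns",
        metadata.num_row_groups(),
        arrow_schema.fields().len()
    );
    Ok(())
}
```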

The time required to load the metadata from storage depends on the storage device and ranges from 100us (local SSD) to 200ms (S3) [2]. As shown in Figures 4 and 6, decoding from Thrift into Rust structures is by far the most time-consuming activity once the data is in memory. This makes sense, as decoding inflates the compact Thrift encoding into point-accessible in-memory Rust structures. Transforming the SchemaDescriptor into an Arrow Schema also requires a small amount of CPU time.

Testbed

Implementation: We experimented with parquet-rs [3], a Rust implementation of Parquet, version 51.0.0. We repeat each experiment five times for each file and report the average time of the last four executions to exclude the impact of caching. You can find the benchmark code here. We ran the benchmark on an AMD 7600X processor clocked at 5.4 GHz with a 32 MB L3 cache.
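
A minimal sketch of this measurement methodology is shown below, assuming the file bytes are already in memory so that storage I/O is excluded. The function name, iteration handling, and file path are illustrative, not the actual benchmark code (which is linked above).

```rust
use std::time::{Duration, Instant};

use bytes::Bytes;
use parquet::file::footer::parse_metadata;

/// Decode the metadata `iterations` times, discard the first (cold) run,
/// and return the average of the remaining runs.
fn time_metadata_decode(file_bytes: &Bytes, iterations: usize) -> Duration {
    let mut timings = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        let metadata = parse_metadata(file_bytes).expect("valid Parquet footer");
        timings.push(start.elapsed());
        std::hint::black_box(metadata); // keep the decoded structures from being optimized away
    }
    let warm = &timings[1..];
    warm.iter().sum::<Duration>() / warm.len() as u32
}

fn main() -> std::io::Result<()> {
    // Load the whole file into memory first so only CPU decode time is measured.
    let bytes = Bytes::from(std::fs::read("wide_table.parquet")?); // illustrative path
    println!("average decode time: {:?}", time_metadata_decode(&bytes, 5));
    Ok(())
}
```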

Workload: We generated several Parquet files with between 10 and 100,000 [4] Float64 columns, mimicking machine learning workloads. As we focus on the metadata, we simply fill the data with the same repeated value. Each Parquet file contains ten row groups, and because each row group includes all columns, the Parquet metadata encodes 10 * column_count individual ColumnChunk structures. To study the impact of including statistics, we tested three configurations: no statistics, chunk-level statistics, and page-level statistics (the default in parquet-rs). A sketch of how such files can be written follows.
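
The sketch below shows one way such a file could be generated with parquet-rs; it is not the exact generator we used. The write_wide_file name, the row counts, and the file path are illustrative, and the statistics level is passed in so all three configurations can be produced from the same code.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn write_wide_file(
    path: &str,
    num_columns: usize,
    stats: EnabledStatistics,
) -> Result<(), Box<dyn std::error::Error>> {
    let rows_per_group = 1_000; // illustrative row count

    // A wide schema of Float64 columns, mimicking ML feature tables.
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("col_{i}"), DataType::Float64, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));

    // Every column holds the same repeated value, since only the metadata is under study.
    let columns: Vec<ArrayRef> = (0..num_columns)
        .map(|_| Arc::new(Float64Array::from(vec![1.0_f64; rows_per_group])) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema.clone(), columns)?;

    let props = WriterProperties::builder()
        .set_statistics_enabled(stats)          // none / chunk / page (see footnote 5)
        .set_max_row_group_size(rows_per_group) // flush one row group per batch
        .set_data_page_row_count_limit(10_000)  // bounds writer memory for very wide files (footnote 4)
        .build();

    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, Some(props))?;
    for _ in 0..10 {
        writer.write(&batch)?; // 10 row groups => 10 * num_columns ColumnChunk entries
    }
    writer.close()?;
    Ok(())
}
```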

Results

Figure 3: Metadata decode time and size for parquet-rs vs the number of Float64 columns in the Parquet file. Note both x and y axes are log scale.

Figure 3 plots the relationship between metadata size and decode time as the number of columns increases from 10 to 100,000. As expected, the metadata size and decode time are linearly proportional to the number of columns in the Parquet file.

Figure 4: Metadata decode time and size for parquet-rs for different statistics levels. The metadata decode time chart (left) also illustrates the time breakdown between Thrift decoding and creating the Arrow Schema (see Figure 6 for a more detailed breakdown).

Figure 5: Average per-column decode time and metadata size. The x-axis shows the stats level; the y-axis shows the time and size per column.

In Figures 4 and 5, we examined the impact of statistics on metadata decode speed and size. Specifically, we configured [5] the parquet-rs writer in one of three modes (see the usage sketch after this list):

  1. none: No statistics (EnabledStatistics::None)
  2. chunk: The writer stores min value, max value, and null count statistics for each column chunk, for each row group (EnabledStatistics::Chunk)
  3. page (the default setting of parquet-rs): In addition to the statistics written at the chunk level, the writer also writes structures from the Parquet Page Index, which can speed up query processing (EnabledStatistics::Page)
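
Concretely, these three modes correspond to the stats argument of the hypothetical write_wide_file helper sketched in the Testbed section above (it must be in scope for this to compile); the file names below are illustrative.

```rust
use parquet::file::properties::EnabledStatistics;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    write_wide_file("f64_1000_none.parquet", 1_000, EnabledStatistics::None)?;   // mode 1: none
    write_wide_file("f64_1000_chunk.parquet", 1_000, EnabledStatistics::Chunk)?; // mode 2: chunk
    write_wide_file("f64_1000_page.parquet", 1_000, EnabledStatistics::Page)?;   // mode 3: page (default)
    Ok(())
}
```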

Figure 5 charts the impact of these settings on average per-column decode time and metadata size. Note that we expect the impact of disabling statistics for string columns to be even more significant than our float-based measurements, as string statistics values are typically larger.

Our findings are as follows:

  1. With no statistics, metadata decodes 30% faster and is 30% smaller than the default level.
  2. Page-level statistics only add minor overhead on top of chunk-level stats [6].
  3. Building the Arrow schema takes negligible time.
  4. Decoding Thrift takes twice as long as transforming Thrift structs to parquet-rs structs.
  5. With minimal metadata (stats level none), each additional column adds 5us to decode time and 700 bytes to storage requirements.
  6. Our measurements are consistent, and the error bars are small.
  7. parquet-rs 51.0.0 can decode Parquet metadata at 100MB/s (10ms to decode each megabyte of metadata).

Our findings suggest that software optimization efforts focused on improving the efficiency of Thrift decoding and Thrift to parquet-rs struct transformation will directly translate to improving overall metadata decode speed.

Figure 6: Detailed analysis of metadata decode time breakdown.

Next, we analyzed decoding using a profiler and plotted the results in Figure 6.

  • 61% of the time is spent decoding and building FileMetaData, which includes the Parquet schema.
  • 31% of the time is spent building RowGroupMetaData, which transforms decoded Thrift data structures into parquet-rs data structures.
  • 7% of the time is spent building an Arrow schema.

Figure 7: Simple software engineering optimizations (e.g., better allocator, optimized in-memory layout, and SIMD acceleration) improved the decoding throughput by up to 75%.

Finally, we spent a few days prototyping simple engineering optimizations (e.g., better allocator, optimized in-memory layout, and SIMD acceleration) to improve decoding performance. Figure 7 shows that with even minor code changes (fewer than 100 lines of code and no API changes), we could improve decode performance by up to 75%. Other community members have also discussed and prototyped several more involved changes, such as reducing allocations (~2x improvement) and a more optimized Thrift decoder (another ~2x improvement).
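
As one illustration of the "better allocator" category of tweaks, a benchmark binary can swap its global allocator with a few lines of code. This is only a sketch of the technique using the mimalloc crate, not the specific change made in our prototype.

```rust
// Route all heap allocations (including those made while inflating Thrift into
// ParquetMetaData) through mimalloc instead of the system allocator.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // ... run the metadata decode benchmark as before ...
}
```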

Conclusion

For workloads where metadata size and decode speed are of utmost concern, configuring the Parquet writer not to write statistics [7] reduces both metadata decode time and metadata size by 30%, with no other software changes.

While the Rust Parquet implementation is already reasonably fast at metadata decoding, significant further speed improvements are within reach: by applying straightforward software engineering techniques, decoding speed can be improved by around a factor of 4. This investment in existing decoders is likely to yield a larger payoff than the creation of entirely new formats.

Finally, in a more extensive overall system, where it is common to read data from object storage such as S3, we believe that metadata fetch and parsing is unlikely to be a significant bottleneck. Given that first-byte access latencies of 100ms-200ms are expected in object stores, by appropriately interleaving fetch and decode, metadata parsing is likely to be a small part of the overall execution time.

Future work

There are several areas that we did not explore that deserve additional attention:

  • A similar performance comparison for other open source Parquet implementations (e.g. parquet-java and parquet-cpp)
  • A similar study of Parquet metadata size and decode speed for String / Binary columns. We expect the benefits from disabling statistics and optimized decoder implementations to be substantially higher for such columns because the values stored in the statistics are significantly larger.
  • A similar study with newly proposed formats like Lance V2 and Nimble would help us understand how much better they are at handling large numbers of columns and what other tradeoffs may exist. In particular, these new formats incorporate lightweight metadata/statistics (e.g., smaller, decoupled metadata) and/or allow partial decoding, i.e., decoding only the metadata for the projected columns rather than the entire metadata, which should permit much faster decode times.

Acknowledgments

We would like to thank Raphael Taylor-Davies, Jörn Horstmann, and Paul Dix for their helpful comments on earlier versions of this post.


  1. For more, see the discussion on the [email protected] mailing list.

  2. See charts from Exploiting Cloud Object Storage for High-Performance Analytics

  3. Caveat: we are biased, being contributors and maintainers of parquet-rs.

  4. Note that with the default writer settings, our testbed ran out of memory when writing the 100,000-column Parquet file. We found that the issue was resolved by setting the data page row count limit (data_page_row_count_limit) to 10,000; with the default (unlimited) data page row count, the Parquet writer consumed over 80GB of memory. We have started a discussion about changing this default, as another common criticism of using Parquet with wide tables is that writers require a large memory buffer, and we think this may be due to the default writer settings.

  5. By setting WriterProperties::statistics_enabled

  6. Note that the parquet-rs reader does not create Rust structs from the PageIndex structures by default, so the decode overhead would likely be higher if we were decoding these as well.

  7. Though of course this may impact query performance for workloads that would benefit from statistics (e.g., queries with predicates on the affected columns).