GH-49392: [C++][Compute] Fix fixed-width gather byte offset overflow in list filtering#49602
Open
zanmato1984 wants to merge 1 commit intoapache:mainfrom
Open
GH-49392: [C++][Compute] Fix fixed-width gather byte offset overflow in list filtering#49602zanmato1984 wants to merge 1 commit intoapache:mainfrom
zanmato1984 wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Issue #49392 reports a user-visible corruption when filtering a table that contains a
list<double>column: slicing the last row returns the expected values, while filtering the same row returns values from a different child span. The corruption only appears once the selected child-value range is large enough, which points to an overflow in the fixed-width gather path used when list filtering materializes the selected child values.What changes are included in this PR?
This patch moves fixed-width gather byte-offset scaling onto an explicit
int64_thelper before thememcpyandmemsetaddress calculations. That fixes the underlying 32-bit byte-offset overflow when auint32gather index is multiplied by the fixed value width. In the source issue's last-row case, the selected child span starts at999998000; fordoublevalues, scaling that index by 8 bytes wrapped in 32-bit arithmetic and redirected the gather to the wrong child span. Keeping the byte-offset arithmetic in 64 bits makes the fixed-width gather path, the childTake()call used under list filtering, and the final filteredTableall address the correct bytes.To validate that this is the same bug reported in the issue, I also used a local near-e2e C++ reproduction that keeps the issue's logical shape (
N=500000,ARRAY_LEN=2000, anidcolumn, and anumbers: list<double>column), filters the last row, and seeds both the true target child span and the pre-fix wrapped span with distinct sentinels. In that setup,Slice()returns the expected row, a replay of the pre-fix gather arithmetic returns the wrapped sentinel span instead, and the fixed childTake()and tableFilter()results both match the sliced row. That ties the user-visible issue and this root-cause fix back to the same overflow boundary.Are these changes tested?
Yes. The patch adds a targeted unit test that checks fixed-width gather byte offsets are computed with 64-bit arithmetic. This is intentionally smaller than an end-to-end filter regression: the original user-visible failure only shows up at very large logical offsets, while the new unit test isolates the exact overflow boundary directly without constructing a huge table or depending on a PyArrow-level reproduction. That makes it smaller, more stable, and more appropriate for regular C++ CI, while the near-e2e local reproduction was used separately to confirm that this root-cause regression test and the reported filtering corruption are exercising the same bug.
Are there any user-facing changes?
Yes. Filtering tables with large list columns backed by fixed-width child values no longer risks returning data from a wrapped byte offset.