GH-49392: [C++][Compute] Fix fixed-width gather byte offset overflow in list filtering #49602

Open
zanmato1984 wants to merge 1 commit into apache:main from zanmato1984:fix/gh-49392-list-filter-overflow
Conversation

@zanmato1984
Contributor

Rationale for this change

Issue #49392 reports a user-visible corruption when filtering a table that contains a list<double> column: slicing the last row returns the expected values, while filtering the same row returns values from a different child span. The corruption only appears once the selected child-value range is large enough, which points to an overflow in the fixed-width gather path used when list filtering materializes the selected child values.

What changes are included in this PR?

This patch moves fixed-width gather byte-offset scaling onto an explicit int64_t helper before the memcpy and memset address calculations. That fixes the underlying 32-bit byte-offset overflow when a uint32 gather index is multiplied by the fixed value width. In the source issue's last-row case, the selected child span starts at 999998000; for double values, scaling that index by 8 bytes wrapped in 32-bit arithmetic and redirected the gather to the wrong child span. Keeping the byte-offset arithmetic in 64 bits makes the fixed-width gather path, the child Take() call used under list filtering, and the final filtered Table all address the correct bytes.
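The arithmetic described above can be sketched as follows. This is a minimal illustration, not the code from the patch: `ScaledByteOffset` and `WrappedByteOffset` are hypothetical names chosen here, and the truncation in the second function is written out explicitly to replay what a 32-bit multiplication would produce.

```cpp
#include <cstdint>

// Hypothetical helper (name assumed, not taken from the patch): widen the
// gather index to int64_t before scaling, so the product is computed in 64 bits.
constexpr int64_t ScaledByteOffset(uint32_t index, int64_t value_width) {
  return static_cast<int64_t>(index) * value_width;
}

// Replay of the failure mode: the same product truncated to 32 bits,
// i.e. reduced modulo 2^32.
constexpr uint32_t WrappedByteOffset(uint32_t index, uint32_t value_width) {
  return static_cast<uint32_t>(static_cast<uint64_t>(index) * value_width);
}

// Index 999998000 from the issue's last row, 8-byte double values.
static_assert(ScaledByteOffset(999998000u, 8) == 7999984000LL,
              "64-bit scaling keeps the full byte offset");
static_assert(WrappedByteOffset(999998000u, 8u) == 3705016704u,
              "32-bit scaling wraps by exactly 2^32");
```

The two results differ by exactly 4294967296 (2^32), which is why the pre-fix gather read valid but unrelated bytes instead of failing outright.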

To validate that this is the same bug reported in the issue, I also used a local near-e2e C++ reproduction that keeps the issue's logical shape (N=500000, ARRAY_LEN=2000, an id column, and a numbers: list<double> column), filters the last row, and seeds both the true target child span and the pre-fix wrapped span with distinct sentinels. In that setup, Slice() returns the expected row, a replay of the pre-fix gather arithmetic returns the wrapped sentinel span instead, and the fixed child Take() and table Filter() results both match the sliced row. That ties the user-visible issue and this root-cause fix back to the same overflow boundary.
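For reference, the location of the two sentinel spans follows directly from the issue's shape. The sketch below only redoes that arithmetic; the constant and function names are illustrative assumptions and the local reproduction itself is not shown here.

```cpp
#include <cstdint>

constexpr int64_t kNumRows = 500000;  // N in the issue
constexpr int64_t kArrayLen = 2000;   // ARRAY_LEN in the issue
constexpr int64_t kValueWidth = 8;    // bytes per double

// First child-value index of a given row when every list has kArrayLen values.
constexpr int64_t ChildStart(int64_t row) { return row * kArrayLen; }

// Child-value index that a 32-bit-truncated byte offset lands on.
constexpr int64_t WrappedChildIndex(int64_t child_index) {
  return static_cast<int64_t>(static_cast<uint32_t>(
             static_cast<uint64_t>(child_index) *
             static_cast<uint64_t>(kValueWidth))) /
         kValueWidth;
}

// The last row starts at child index 999998000 ...
static_assert(ChildStart(kNumRows - 1) == 999998000, "true target span");
// ... but the wrapped offset points at child index 463127088,
// which is element 1088 of row 231563.
static_assert(WrappedChildIndex(ChildStart(kNumRows - 1)) == 463127088,
              "wrapped span start");
static_assert(463127088 / kArrayLen == 231563, "wrapped row");
static_assert(463127088 % kArrayLen == 1088, "offset within wrapped row");
```

Under these assumptions the wrapped span does not align with a row boundary: it begins partway through row 231563 and runs into row 231564, consistent with the filtered row showing values from a different child span.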

Are these changes tested?

Yes. The patch adds a targeted unit test that checks that fixed-width gather byte offsets are computed with 64-bit arithmetic. This is intentionally narrower than an end-to-end filter regression test: the original user-visible failure only shows up at very large logical offsets, while the new unit test isolates the exact overflow boundary directly, without constructing a huge table or depending on a PyArrow-level reproduction. That keeps it small, stable, and appropriate for regular C++ CI. The near-e2e local reproduction was used separately to confirm that this root-cause regression test and the reported filtering corruption exercise the same bug.
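To illustrate what "the exact overflow boundary" means, here is a hedged sketch of a boundary check that needs no large allocation. It is not the unit test added by this patch; `FirstOverflowingIndex` and `ByteOffset64` are hypothetical names used only for this example.

```cpp
#include <cstdint>
#include <limits>

// Largest byte offset representable in 32 bits.
constexpr int64_t kMaxUint32 =
    static_cast<int64_t>(std::numeric_limits<uint32_t>::max());

// Smallest index whose scaled byte offset no longer fits in 32 bits
// for the given fixed value width (hypothetical helper).
constexpr int64_t FirstOverflowingIndex(int64_t value_width) {
  return kMaxUint32 / value_width + 1;
}

// 64-bit scaling, as a stand-in for the helper under test (name assumed).
constexpr int64_t ByteOffset64(uint32_t index, int64_t value_width) {
  return static_cast<int64_t>(index) * value_width;
}

// For 8-byte values the boundary sits at index 2^29 = 536870912.
static_assert(FirstOverflowingIndex(8) == 536870912, "boundary index");
// One below the boundary still fits in 32 bits ...
static_assert(ByteOffset64(536870911u, 8) == 4294967288LL, "last safe offset");
// ... while the boundary index needs 64 bits to represent its offset.
static_assert(ByteOffset64(536870912u, 8) == 4294967296LL, "first wide offset");
```

Checking the two indices on either side of the boundary is enough to distinguish 32-bit from 64-bit scaling, which is why a test of this kind can stay small.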

Are there any user-facing changes?

Yes. Filtering tables with large list columns backed by fixed-width child values no longer risks returning data from a wrapped byte offset.

@github-actions

⚠️ GitHub issue #49392 has been automatically assigned in GitHub to PR creator.
