src/zarr/codecs/sharding.py
Outdated
```diff
     end = "end"


+class SubchunkWriteOrder(Enum):
```
What's the advantage of an enum over `Literal["morton", "unordered", "lexicographic", "colexicographic"]`?
Just copied what was done for ShardingCodecIndexLocation!
I'm not a huge fan of enums in Python (including `ShardingCodecIndexLocation`), so unless you object I think it would be better to use a simple `Literal` plus a final tuple of strings, like:

```python
SubchunkWriteOrder = Literal["morton", "unordered", "lexicographic", "colexicographic"]
SUBCHUNK_WRITE_ORDER: Final[tuple[str, str, str, str]] = ("morton", "unordered", "lexicographic", "colexicographic")
```
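For reference, one way to keep the `Literal` and the runtime tuple in sync is `typing.get_args` (a sketch of the suggested pattern; `parse_order` is a hypothetical helper for illustration, not part of the PR):

```python
from typing import Final, Literal, get_args

SubchunkWriteOrder = Literal["morton", "unordered", "lexicographic", "colexicographic"]
# get_args keeps the runtime tuple automatically in sync with the Literal.
SUBCHUNK_WRITE_ORDER: Final[tuple[str, ...]] = get_args(SubchunkWriteOrder)

def parse_order(order: str) -> SubchunkWriteOrder:
    # Hypothetical helper: validate a runtime string against the allowed values.
    if order not in SUBCHUNK_WRITE_ORDER:
        raise ValueError(f"expected one of {SUBCHUNK_WRITE_ORDER}, got {order!r}")
    return order  # type: ignore[return-value]
```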
Done (hopefully)!
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
```diff
 if self._is_complete_shard_write(indexer, chunks_per_shard):
-    shard_dict = dict.fromkeys(morton_order_iter(chunks_per_shard))
+    shard_dict = dict.fromkeys(np.ndindex(chunks_per_shard))
```
cc @mkitti
Here and below, I don't think there is any need to construct the dict in morton order, right? There should be no correctness or performance hit here?
@d-v-b This now ensures we only shuffle once in the unordered case, so the test is nice and clean: write once and record the order, create a new codec with the same seed, create the iterator from that codec, and match the orders.
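The test idea reads roughly like this sketch (assumed names; `unordered_iter` is a stand-in for the codec's seeded iterator, which is not shown here):

```python
import random

# Keys of a 2x4 chunk grid in row-major order.
keys = [(y, x) for y in range(2) for x in range(4)]

def unordered_iter(keys, seed):
    # Stand-in for the "unordered" iterator: a seeded shuffle of the keys.
    rng = random.Random(seed)
    shuffled = list(keys)
    rng.shuffle(shuffled)
    return shuffled

# Two iterators built from the same seed yield the same order, so a write
# followed by a re-created codec with that seed can reproduce the order.
first = unordered_iter(keys, seed=1234)
second = unordered_iter(keys, seed=1234)
assert first == second
assert sorted(first) == keys  # same keys, just permuted
```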
In Python, dicts are insertion-ordered, and the last time I examined the situation I thought the optimal iteration order might need to be encoded in the dict. I was just trying to preserve the behavior from before my edits.
So this wasn't about dictionary order; rather, in the vectorized case, the order passed to ShardReader.to_dict_vectorized had to match the order ShardReader was generating internally, which turned out to be morton order. I'm glad I caught this, because I think it means the data was being corrupted for the other orders (which weren't getting hypothesis-tested).
So I'm going to add something to the hypothesis tests for this.
I had the same feeling initially that the dictionary order mattered, but it turns out the final call to _encode_shard_dict handles the ordering for us when writing to the output buffer. Writes to the intermediate shard_dict can be done in any order, as long as the final buffer is written in the correct order.
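The point about the intermediate dict can be illustrated with plain dicts (a sketch; `sorted` below is only a stand-in for whatever order the real encoder imposes on the output buffer):

```python
import itertools
import random

chunks_per_shard = (2, 2)
# Row-major keys, equivalent to iterating np.ndindex(chunks_per_shard).
keys = list(itertools.product(*(range(n) for n in chunks_per_shard)))

shuffled = list(keys)
random.shuffle(shuffled)

# The intermediate dict has the same contents whatever order it was filled in.
a = dict.fromkeys(keys)
b = dict.fromkeys(shuffled)
assert a == b  # dict equality ignores insertion order

# Only the final serialization step needs a deterministic order.
final_order = sorted(b)  # stand-in for the ordering applied while encoding
assert final_order == keys
```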
In order to encourage ecosystem compatibility + reserve runtime setting strings/enums (see zarrs/zarrs-python#160), subchunk write order is expanded from `morton` to include `lexicographic`, `colexicographic`, and `unordered` (which is randomized).

TODO:
- `docs/user-guide/*.md`
- `changes/`
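For concreteness, here is a sketch of what the four orders mean on a 4x4 chunk grid (the Morton decode below is a simplified 2-D helper written for this example, not zarr's `morton_order_iter`):

```python
import itertools
import random

shape = (4, 4)

# Lexicographic: row-major, the order produced by np.ndindex.
lexicographic = list(itertools.product(*(range(n) for n in shape)))

# Colexicographic: the first index varies fastest (column-major).
colexicographic = [t[::-1] for t in itertools.product(*(range(n) for n in reversed(shape)))]

def morton_2d(shape):
    # De-interleave the bits of each code into (y, x): Z-order on a square 2-D grid.
    order = []
    for code in range(shape[0] * shape[1]):
        y = x = 0
        for bit in range(8):
            x |= ((code >> (2 * bit)) & 1) << bit
            y |= ((code >> (2 * bit + 1)) & 1) << bit
        order.append((y, x))
    return order

morton = morton_2d(shape)

# Unordered: the same keys in a seeded random permutation.
unordered = list(lexicographic)
random.Random(0).shuffle(unordered)

# All four visit every chunk exactly once; only the order differs.
assert set(morton) == set(lexicographic) == set(colexicographic) == set(unordered)
```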