Replies: 3 comments
-
Thanks @jinqj for raising this. Please check whether anything discussed in #10028 helps here. If the data and internal layout of your files are identical, then the following should show no differences:
$ h5dump file1.nc > dump1.txt
$ h5dump file2.nc > dump2.txt
$ diff dump1.txt dump2.txt
For NetCDF4 files, the underlying HDF5 stores creation/modification times for any objects which
-
Thank you very much @kmuehlbauer! I followed your suggestion and here is what I get:
My WRF file is in NetCDF2 format. And my netcdf related libraries are:
The creation time stored in the .nc files (which are NetCDF4 files), as you suggested, may be the culprit. Do you know how I can check the creation time of an nc file?
-
@jinqj I have to step back a bit here. The modification time etc. of an HDF5 object in the file is optional, and a quick peek into an xarray-generated file (engine="netcdf4") showed that no such times are stored. What I did find is that the data offsets recorded in the HDF5 OHDR headers sometimes point to the "wrong" data position. Since the data is actually the same in this case (we made a copy), it doesn't matter which copy the offset points to. I'm not sure why this happens, but it also leads to different CRC32 checksums at the end of each affected HDF5 OHDR. Here is a simple example to demonstrate the issue:

import numpy as np
import xarray as xr


def create_xarray(num):
    # Write a one-element input file test{num}.nc
    temperature_data = np.array([10*num])
    time = np.array([num])
    ds = xr.Dataset(
        {
            "temperature": ("time", temperature_data),
        },
        coords={
            "time": time,
        },
    )
    ds.to_netcdf(f"test{num}.nc", format="NETCDF4")


def create_test(num, swap=False, engine="netcdf4"):
    # Combine both input files, duplicate the variable and write test{num}.nc
    flist = ["test1.nc", "test2.nc"]
    with xr.open_mfdataset(flist) as ds2:
        out = xr.Dataset()
        data = ds2['temperature'].copy()
        numb = ds2['temperature'].copy().rename('temperature_num')
        if swap:
            # Change the values of the second variable
            numb[:] = np.array([20, 10])
        out = xr.merge([data, numb])
        out.to_netcdf(f"test{num}.nc", engine=engine)


engine = "netcdf4"
create_xarray(1)
create_xarray(2)
# First cycle: the second variable is a plain copy
for i in range(3, 7):
    create_test(i, engine=engine)
# Second cycle: the second variable holds swapped values
for i in range(7, 11):
    create_test(i, swap=True, engine=engine)

Create sha256 and hexdumps for comparison:
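One way to produce checksums and hexdumps from Python is sketched below. This is not necessarily how the dumps in this thread were made; the helper names (sha256_of, hexdump, write_hexdump) are made up for illustration, and the line layout only imitates the xxd-style output shown in the diffs.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def hexdump(data, width=16):
    """Format bytes as xxd-like lines: offset, 2-byte hex groups, printable ASCII."""
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hexed = chunk.hex()
        groups = " ".join(hexed[i:i + 4] for i in range(0, len(hexed), 4))
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start:08x}: {groups:<39}  {text}")
    return "\n".join(lines)


def write_hexdump(nc_path):
    """Write <name>.hex next to the input file and return the new path."""
    src = Path(nc_path)
    dst = src.with_suffix(".hex")
    dst.write_text(hexdump(src.read_bytes()) + "\n")
    return dst


if __name__ == "__main__":
    # Assumes test1.nc ... test10.nc were written by the script above.
    for n in range(1, 11):
        path = Path(f"test{n}.nc")
        if path.exists():
            print(sha256_of(path), path)
            write_hexdump(path)
```

Files with equal digests are byte-identical; for the others, the .hex files can be compared with diff.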
We can observe that our two input files are different. We've got two versions of the first iteration (with the dataset copy) and only one version of the second iteration (with the changed dataset). Let's compare the hexdumps (I've tried to make the differences bold). The a602/b602 are the data offsets and the other 4-byte differences are the CRC32 checksums.
!diff test3.hex test4.hex
175c175
< 00000ae0: 0000 0000 0301 b602 0000 0000 0000 1000 ................
---
> 00000ae0: 0000 0000 0301 a602 0000 0000 0000 1000 ................
186,187c186,187
< 00000b90: 0000 0000 0000 0000 0000 0000 0000 753a ..............u:
< 00000ba0: 8f16 4f48 4452 020d 0001 0114 0000 0000 ..OHDR..........
---
> 00000b90: 0000 0000 0000 0000 0000 0000 0000 bdb9 ................
> 00000ba0: 924c 4f48 4452 020d 0001 0114 0000 0000 .LOHDR..........
192c192
< 00000bf0: 0301 a602 0000 0000 0000 1000 0000 0000 ................
---
> 00000bf0: 0301 b602 0000 0000 0000 1000 0000 0000 ................
203c203
< 00000ca0: 0000 0000 0000 0000 0000 f295 1dcb 4f43 ..............OC
---
> 00000ca0: 0000 0000 0000 0000 0000 1e44 0999 4f43 ...........D..OC
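To read the two offsets as numbers, the bytes following the 0301 marker can be decoded as little-endian integers. The sketch below assumes 8-byte addresses, which is what the dump lines suggest but is not confirmed elsewhere in this thread.

```python
# Bytes copied from the dump lines above, starting right after "0301".
# Assumption: addresses are 8 bytes wide and little-endian.
addr_a = int.from_bytes(bytes.fromhex("a602000000000000"), "little")
addr_b = int.from_bytes(bytes.fromhex("b602000000000000"), "little")

print(hex(addr_a), hex(addr_b))  # 0x2a6 0x2b6
print(addr_b - addr_a)           # 16
```

The two addresses are 16 bytes apart, which would fit two 8-byte values per variable; the two object headers simply have their addresses exchanged between the files.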
This now compares one file from the first cycle with the swapped-data version. We can see that the data differences (0a00/1400) are added to the differences shown above.
!diff test3.hex test7.hex
44c44
< 000002b0: 0000 0000 0000 0a00 0000 0000 0000 1400 ................
---
> 000002b0: 0000 0000 0000 1400 0000 0000 0000 0a00 ................
175c175
< 00000ae0: 0000 0000 0301 b602 0000 0000 0000 1000 ................
---
> 00000ae0: 0000 0000 0301 a602 0000 0000 0000 1000 ................
186,187c186,187
< 00000b90: 0000 0000 0000 0000 0000 0000 0000 753a ..............u:
< 00000ba0: 8f16 4f48 4452 020d 0001 0114 0000 0000 ..OHDR..........
---
> 00000b90: 0000 0000 0000 0000 0000 0000 0000 bdb9 ................
> 00000ba0: 924c 4f48 4452 020d 0001 0114 0000 0000 .LOHDR..........
192c192
< 00000bf0: 0301 a602 0000 0000 0000 1000 0000 0000 ................
---
> 00000bf0: 0301 b602 0000 0000 0000 1000 0000 0000 ................
203c203
< 00000ca0: 0000 0000 0000 0000 0000 f295 1dcb 4f43 ..............OC
---
> 00000ca0: 0000 0000 0000 0000 0000 1e44 0999 4f43 ...........D..OC
If we compare the second version of the first cycle with the swapped data, everything is in place, only the data has changed:
!diff test4.hex test7.hex
44c44
< 000002b0: 0000 0000 0000 0a00 0000 0000 0000 1400 ................
---
> 000002b0: 0000 0000 0000 1400 0000 0000 0000 0a00 ................
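The same check can be done without a hexdump by listing the byte ranges where two files differ. The function below is a minimal sketch (differing_ranges is a made-up helper name), demonstrated on the 16 bytes of the 000002b0 line shown above.

```python
def differing_ranges(a: bytes, b: bytes):
    """Return (start, end) byte ranges, end exclusive, where a and b differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differing bytes begins
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # Any trailing bytes present in only one input count as a difference.
        ranges.append((common, max(len(a), len(b))))
    return ranges


# The 16 bytes at 0x2b0 from the two dumps in the diff above.
old = bytes.fromhex("0000000000000a000000000000001400")
new = bytes.fromhex("00000000000014000000000000000a00")
print(differing_ranges(old, new))  # [(6, 7), (14, 15)]
```

Adding the line offset 0x2b0 gives 0x2b6 and 0x2be as the positions of the changed bytes. For whole files, the inputs could be read with open(path, "rb").read().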
If we use engine="h5netcdf", this strange behaviour (offsets pointing to the wrong location) does not happen. But we still might get differing files, since the modification times (Access Time, Modification Time, Change Time, Birth Time) are encoded into the HDF5 OHDR. (The internal file order also differs between the two engines.) As the times are encoded as seconds since the epoch, this only has an effect if the files/objects are not written within the same second. You can use the low level debug tool
Just to mention, in message 3 of the two temperature objects you can find the data address. This data address is what differs in the raw files.
For engine="h5netcdf":
Here we can observe the modification times in the object header:
OK, I hope this little digression into the depths of HDF5 was not too exhausting and we all now have a somewhat deeper understanding of why and how the binary files are sometimes different (even when the data contents are exactly the same).
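As a small illustration of the timestamp effect, the sketch below packs a time as whole seconds since the epoch and compares the resulting bytes. The 4-byte little-endian field and the helper names are assumptions made for this example, and the timestamps are invented, not read from any file.

```python
import struct
from datetime import datetime, timezone


def pack_mtime(ts: float) -> bytes:
    """Pack a timestamp as whole seconds since the epoch (4 bytes, little-endian)."""
    return struct.pack("<I", int(ts))


def unpack_mtime(raw: bytes) -> datetime:
    """Decode 4 little-endian bytes back into a UTC datetime."""
    (seconds,) = struct.unpack("<I", raw)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical write times (not taken from any real file).
t0 = 1_700_000_000.10
same_second = pack_mtime(t0) == pack_mtime(t0 + 0.5)  # both truncate to the same second
next_second = pack_mtime(t0) == pack_mtime(t0 + 1.0)  # falls into the following second
print(same_second, next_second)
print(unpack_mtime(pack_mtime(t0)))
```

Two writes that land in the same second produce identical time bytes, while a write one second later does not. This would fit the observation that files differ more often when more time passes between runs.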
-
What is your issue?
Hello, I have the following Python code to write out a variable from WRF output. I accidentally found that if I run this script twice without changing anything (renaming the output file after each run, e.g. test1.nc and test2.nc) and then run "diff test1.nc test2.nc", it reports that the two files differ. Based on my understanding, the diff command should output nothing, i.e., the two files should be exactly the same. After digging deeper by running the script multiple times, I found that it sometimes produces two identical files and sometimes different ones. The chance of getting two different files is higher when the time between the two executions of the script is longer. Moreover, when I subtract the variable in the two files, the differences are zero over the entire model domain. I tested the script on two different servers and got the same result. I am not sure if this is the right place to report such weird behavior of xarray. Any suggestions/hints that could help me debug my code are appreciated.
Thanks a lot.
--Qinjian Jin