ZipArchive creates corrupted ZIP when writing large dataset with many repeated files #122489

@arekpalinski

Description

RavenDB snapshot backups produced with ZipArchive can be unrecoverable due to ZIP header corruption. A snapshot backup is a ZIP archive written with System.IO.Compression.ZipArchive; for a specific dataset, the resulting ZIP fails to open correctly:

  • 7‑Zip shows Extra_ERROR Zip64_ERROR: UTF8 (for entry Documents\Raven.voron), and the Packed Size looks capped at 4GB.
  • System.IO.Compression.ZipFile.OpenRead(...).Entries[i].Open() throws System.IO.InvalidDataException: A local file header is corrupt.

Writing the exact same dataset and order using SharpZipLib’s ZipOutputStream produces a valid ZIP that both 7‑Zip and ZipFile.OpenRead can read.

This started affecting us after introducing a feature that creates many per-index journal files that are hard links to the same underlying file content (so multiple distinct file paths share the exact same bytes on disk). Our dataset also includes a large 30GB file (Raven.voron). The combination seems to trigger a bug.
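
For context: classic ZIP headers store sizes and offsets in 32-bit fields, and the value 0xFFFFFFFF signals that the real value lives in a ZIP64 extra field. A ~30GB entry therefore needs ZIP64 records, and when it is stored without compression, every entry written after it starts at an offset above 4GB as well. The sketch below is not part of the repro. `NeedsZip64` is a hypothetical helper name introduced here, and the listing only shows which files of a dataset cross the 32-bit limit; it makes no claim about where ZipArchive goes wrong.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: classic ZIP fields are 32-bit and 0xFFFFFFFF is reserved
// as the "see ZIP64 extra field" marker, so any value at or above it needs ZIP64.
static bool NeedsZip64(long sizeOrOffset) => sizeOrOffset >= uint.MaxValue;

// Optional demo: pass a dataset folder to list the files that need ZIP64 size fields.
if (args.Length > 0)
{
    var largeFiles = Directory.EnumerateFiles(args[0], "*", SearchOption.AllDirectories)
        .Select(p => new FileInfo(p))
        .Where(f => NeedsZip64(f.Length));

    foreach (var f in largeFiles)
        Console.WriteLine($"{f.FullName}: {f.Length} bytes (needs ZIP64 size fields)");
}
```

On the dataset described here, only Raven.voron would be expected to show up in that listing.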

Reproduction Steps

Repro dataset

> $RootPath = (Get-Item .).FullName; Get-ChildItem -Path . -Include *.journal -Recurse -File | Get-FileHash | Select-Object @{Name='Path'; Expression={ $_.Path.Replace($RootPath + "\", "") }}, Hash, Algorithm

Path                                                         Hash                                                             Algorithm
----                                                         ----                                                             ---------
Configuration\Journals\0000000000000000001.journal           96F77B06EBF13895A297B7182BC162B42A05CC9B444D488A87FA541CD9962516 SHA256
Indexes\@SharedJournals\Journals\0000000000000000107.journal 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Activity_ByMonth\Journals\0000000000000000008.jou... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Questions_Search\Journals\0000000000000000004.jou... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Questions_Tags\Journals\0000000000000000007.journal  16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Questions_Tags_ByMonths\Journals\0000000000000000... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Users_Registrations_ByMonth\Journals\000000000000... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Users_Search\Journals\0000000000000000005.journal    16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
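
The same check can be sketched in C# for anyone without PowerShell at hand. This is an illustrative equivalent, not part of the repro app: `Sha256Hex` and `GroupJournalsByHash` are hypothetical helper names, and `SHA256.HashData(Stream)` assumes .NET 7 or later. It groups the `*.journal` files by content hash so that paths sharing the same bytes (the hard links) end up in one group.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper: SHA-256 of a file as an uppercase hex string.
static string Sha256Hex(string path)
{
    using var fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
    return Convert.ToHexString(SHA256.HashData(fs));
}

// Hypothetical helper: map each content hash to the journal paths that share it.
static Dictionary<string, List<string>> GroupJournalsByHash(string root) =>
    Directory.EnumerateFiles(root, "*.journal", SearchOption.AllDirectories)
        .GroupBy(path => Sha256Hex(path))
        .ToDictionary(
            g => g.Key,
            g => g.OrderBy(x => x, StringComparer.OrdinalIgnoreCase).ToList());

// Optional demo: pass the dataset folder to print hashes shared by more than one path.
if (args.Length > 0)
{
    foreach (var (hash, paths) in GroupJournalsByHash(Path.GetFullPath(args[0])))
    {
        if (paths.Count < 2) continue;
        Console.WriteLine($"{hash} ({paths.Count} paths)");
        foreach (var journalPath in paths)
            Console.WriteLine($"  {journalPath}");
    }
}
```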

Repro app

Single‑file console app (targets net8.0 or net10.0). It copies files from the dataset into a ZIP using ZipArchive, in the exact order RavenDB snapshot backup uses:

  • Indexes (excluding any @* folder such as @SharedJournals), then
  • Documents (root storage env), then
  • Configuration folder

// Add package: SharpZipLib (namespace: ICSharpCode.SharpZipLib)
//
// Example csproj snippet:
// <ItemGroup>
//   <PackageReference Include="SharpZipLib" Version="1.4.2" />
// </ItemGroup>
//
// Usage:
//   ZipArchiveIssue <sourceDbFolder> <outputDir> [options]
//
// Options:
//   --ziparchive             Generate ZIP using System.IO.Compression.ZipArchive
//   --sharpzip               Generate ZIP using SharpZipLib ZipOutputStream
//   --level=<Optimal|Fastest|NoCompression>   Compression level (default: Optimal)
//   --nonseekable            Wrap output stream to simulate non-seekable sink (ZipArchive data-descriptor path)
//   --outname=<baseName>     Base file name (default: derived from folder name)
//   --verify                 After writing, attempt to open/read entries via ZipFile.OpenRead
//
// Mapping mirrors RavenDB snapshot shape, copying from disk:
// - Order: Indexes -> Documents -> Configuration (matches RavenDB snapshot backup)
// - Root DB env  -> Documents/
// - Configuration/ -> Configuration/
// - Indexes/<IndexName>/ -> Indexes/<IndexName>/
// - Include files: Raven.voron, headers.one, headers.two, database.metadata, Journals/*.journal
// - Skip: any Temp/ folders, and all Indexes/@* folders (e.g. @SharedJournals)

#nullable enable
using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using ICSharpCode.SharpZipLib.Zip;

internal static class Program
{
    private static int Main(string[] args)
    {
        try
        {
            if (args.Length < 2)
            {
                PrintHelp();
                return 2;
            }

            var sourceRoot = Path.GetFullPath(args[0]);
            var outDir = Path.GetFullPath(args[1]);
            var opts = ParseOptions(args.Skip(2));

            if (!Directory.Exists(sourceRoot))
            {
                Console.Error.WriteLine($"Source folder not found: {sourceRoot}");
                return 3;
            }
            Directory.CreateDirectory(outDir);

            var baseName = opts.OutName ?? new DirectoryInfo(sourceRoot).Name;

            // Enumerate entries strictly in RavenDB order: Indexes -> Documents -> Configuration
            var entries = EnumerateBackupEntriesInRavenOrder(sourceRoot).ToList();

            if (entries.Count == 0)
                Console.Error.WriteLine("No entries to add based on current mapping (check input path).");
            else
                Console.WriteLine($"Enumerated {entries.Count} entries to zip.");

            var createdAny = false;

            if (opts.UseZipArchive)
            {
                var path = Path.Combine(outDir, baseName + "-ziparchive.zip");
                Console.WriteLine($"[ZipArchive] Writing {path} ...");
                using (var fs = File.Create(path))
                {
                    // Declared as Stream because the two branches have different static types.
                    using Stream target = opts.NonSeekable
                        ? new NonSeekableWriteStream(fs)
                        : fs;
                    WriteWithZipArchive(target, entries, opts.Level);
                    Console.WriteLine("[ZipArchive] Done");
                }
                if (opts.Verify) VerifyZip(path);
                createdAny = true;
            }

            if (opts.UseSharpZip)
            {
                var path = Path.Combine(outDir, baseName + "-sharpzip.zip");
                Console.WriteLine($"[SharpZipLib] Writing {path} ...");
                using (var fs = File.Create(path))
                {
                    // Declared as Stream because the two branches have different static types.
                    using Stream target = opts.NonSeekable
                        ? new NonSeekableWriteStream(fs)
                        : fs;
                    WriteWithSharpZip(target, entries, opts.Level);
                    Console.WriteLine("[SharpZipLib] Done");
                }

                if (opts.Verify) VerifyZip(path);
                createdAny = true;
            }

            if (!createdAny)
            {
                Console.WriteLine("No writer selected; defaulting to both.");
                var zipArchivePath = Path.Combine(outDir, baseName + "-ziparchive.zip");
                var sharpZipPath = Path.Combine(outDir, baseName + "-sharpzip.zip");

                using (var fs = File.Create(zipArchivePath))
                {
                    using (Stream target = opts.NonSeekable ? new NonSeekableWriteStream(fs) : fs)
                    {
                        Console.WriteLine($"[ZipArchive] Writing {zipArchivePath} ...");
                        WriteWithZipArchive(target, entries, opts.Level);
                        Console.WriteLine("[ZipArchive] Done");
                    }
                }

                if (opts.Verify) VerifyZip(zipArchivePath);

                using (var fs = File.Create(sharpZipPath))
                {
                    using (Stream target = opts.NonSeekable ? new NonSeekableWriteStream(fs) : fs)
                    {
                        Console.WriteLine($"[SharpZipLib] Writing {sharpZipPath} ...");
                        WriteWithSharpZip(target, entries, opts.Level);
                        Console.WriteLine("[SharpZipLib] Done");
                    }
                }

                if (opts.Verify) VerifyZip(sharpZipPath);
            }

            Console.WriteLine("All done.");
            return 0;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex);
            return 1;
        }
    }

    private static void PrintHelp()
    {
        Console.WriteLine(@"Usage: ZipArchiveIssue <sourceDbFolder> <outputDir> [options]
  --ziparchive               Use System.IO.Compression ZipArchive
  --sharpzip                 Use SharpZipLib ZipOutputStream
  --level=<Optimal|Fastest|NoCompression>
  --nonseekable              Wrap output stream so ZipArchive uses data descriptors
  --outname=<baseName>       Output base file name (without extension)
  --verify                   After writing, open the ZIP and iterate entries
");
    }

    private static void VerifyZip(string path)
    {
        try
        {
            using var zip = System.IO.Compression.ZipFile.OpenRead(path);
            Console.WriteLine($"[Verify] Opened {path}, entries: {zip.Entries.Count}");
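            long total = 0;
            // Allocate the probe buffer once; stackalloc inside a loop grows the stack on every iteration.
            Span<byte> buf = stackalloc byte[8192];
            foreach (var e in zip.Entries)
            {
                // Open() is where a corrupt local file header surfaces as InvalidDataException.
                using var s = e.Open();
                // Only the first chunk of each entry is read; this is a smoke test, not a full extraction.
                int read = s.Read(buf);
                total += read;
            }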
            Console.WriteLine($"[Verify] Read a total of {total} bytes across entries");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"[Verify] FAILED for {path}: {ex.GetType().Name}: {ex.Message}");
        }
    }

    private static void WriteWithZipArchive(Stream output, List<BackupEntry> entries, CompressionLevel level)
    {
        using var archive = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true, entryNameEncoding: Encoding.UTF8);
        foreach (var e in entries)
        {
            Console.WriteLine($"[ZipArchive] + {e.ZipPath}");
            var entry = archive.CreateEntry(e.ZipPath, level);
            using var es = entry.Open();
            using var fs = File.Open(e.SourcePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            fs.CopyTo(es);
        }
    }

    private static void WriteWithSharpZip(Stream output, List<BackupEntry> entries, CompressionLevel level)
    {
        using var zipStream = new ZipOutputStream(output) { IsStreamOwner = false };
        zipStream.SetLevel(MapSharpZipLevel(level));
        foreach (var e in entries)
        {
            Console.WriteLine($"[SharpZipLib] + {e.ZipPath}");
            var ze = new ZipEntry(e.ZipPath);
            zipStream.PutNextEntry(ze);
            using (var fs = File.Open(e.SourcePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                fs.CopyTo(zipStream);
            }
            zipStream.CloseEntry();
        }
        zipStream.Finish();
    }

    private static int MapSharpZipLevel(CompressionLevel level) => level switch
    {
        CompressionLevel.NoCompression => 0,
        CompressionLevel.Fastest => 1,
        _ => 9,
    };

    private sealed class Options
    {
        public bool UseZipArchive { get; set; }
        public bool UseSharpZip { get; set; }
        public CompressionLevel Level { get; set; } = CompressionLevel.Optimal;
        public bool NonSeekable { get; set; }
        public string? OutName { get; set; }
        public bool Verify { get; set; }
    }

    private static Options ParseOptions(IEnumerable<string> args)
    {
        var o = new Options();
        foreach (var a in args)
        {
            if (a.Equals("--ziparchive", StringComparison.OrdinalIgnoreCase)) o.UseZipArchive = true;
            else if (a.Equals("--sharpzip", StringComparison.OrdinalIgnoreCase)) o.UseSharpZip = true;
            else if (a.StartsWith("--level=", StringComparison.OrdinalIgnoreCase))
            {
                var v = a.Substring("--level=".Length);
                o.Level = v.Equals("NoCompression", StringComparison.OrdinalIgnoreCase) ? CompressionLevel.NoCompression :
                         v.Equals("Fastest", StringComparison.OrdinalIgnoreCase) ? CompressionLevel.Fastest :
                         CompressionLevel.Optimal;
            }
            else if (a.Equals("--nonseekable", StringComparison.OrdinalIgnoreCase)) o.NonSeekable = true;
            else if (a.StartsWith("--outname=", StringComparison.OrdinalIgnoreCase)) o.OutName = a.Substring("--outname=".Length);
            else if (a.Equals("--verify", StringComparison.OrdinalIgnoreCase)) o.Verify = true;
        }
        return o;
    }

    private static IEnumerable<BackupEntry> EnumerateBackupEntriesInRavenOrder(string sourceRoot)
    {
        // 1) Indexes (skip @*). Use alphabetical order for determinism.
        var indexesDir = Path.Combine(sourceRoot, "Indexes");
        if (Directory.Exists(indexesDir))
        {
            foreach (var indexDir in Directory.EnumerateDirectories(indexesDir).OrderBy(Path.GetFileName, StringComparer.OrdinalIgnoreCase))
            {
                var name = Path.GetFileName(indexDir);
                if (name.StartsWith('@')) // skip @SharedJournals and any @* (char overload is an ordinal comparison)
                    continue;
                foreach (var e in EnumerateEnv(indexDir, Path.Combine("Indexes", name)))
                    yield return e;
            }
        }

        // 2) Documents (root env)
        foreach (var e in EnumerateEnv(sourceRoot, Path.Combine("Documents")))
            yield return e;

        // 3) Configuration
        var cfgDir = Path.Combine(sourceRoot, "Configuration");
        if (Directory.Exists(cfgDir))
        {
            foreach (var e in EnumerateEnv(cfgDir, Path.Combine("Configuration")))
                yield return e;
        }
    }

    private static IEnumerable<BackupEntry> EnumerateEnv(string envDir, string zipBase)
    {
        // Include env root files (Temp is excluded by not traversing it here)
        foreach (var f in Directory.EnumerateFiles(envDir))
        {
            var name = Path.GetFileName(f);
            if (!ShouldIncludeFile(name))
                continue;
            yield return new BackupEntry(f, Path.Combine(zipBase, name).Replace('\\', '/'));
        }

        // Include journals
        var journalsDir = Path.Combine(envDir, "Journals");
        if (Directory.Exists(journalsDir))
        {
            foreach (var jf in Directory.EnumerateFiles(journalsDir, "*.journal"))
            {
                var name = Path.GetFileName(jf);
                yield return new BackupEntry(jf, Path.Combine(zipBase, name).Replace('\\', '/'));
            }
        }

        // Temp is always skipped per requirements
    }

    private static bool ShouldIncludeFile(string name)
    {
        if (name.Equals("Raven.voron", StringComparison.OrdinalIgnoreCase)) return true;
        if (name.Equals("headers.one", StringComparison.OrdinalIgnoreCase)) return true;
        if (name.Equals("headers.two", StringComparison.OrdinalIgnoreCase)) return true;
        if (name.Equals("database.metadata", StringComparison.OrdinalIgnoreCase)) return true;
        if (name.EndsWith(".journal", StringComparison.OrdinalIgnoreCase)) return true; // if any at env root
        return false;
    }

    private readonly record struct BackupEntry(string SourcePath, string ZipPath);

    private sealed class NonSeekableWriteStream : Stream
    {
        private readonly Stream _inner;
        public NonSeekableWriteStream(Stream inner) => _inner = inner;
        public override bool CanRead => false;
        public override bool CanSeek => false;
        public override bool CanWrite => true;
        public override long Length => throw new NotSupportedException();
        public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
        public override void Flush() => _inner.Flush();
        public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
        public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
        public override void SetLength(long value) => throw new NotSupportedException();
        public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
#if NETSTANDARD2_1_OR_GREATER || NET5_0_OR_GREATER
        public override void Write(ReadOnlySpan<byte> buffer)
        {
            var arr = ArrayPool<byte>.Shared.Rent(buffer.Length);
            try
            {
                buffer.CopyTo(arr);
                _inner.Write(arr, 0, buffer.Length);
            }
            finally
            {
                ArrayPool<byte>.Shared.Return(arr);
            }
        }
#endif
        protected override void Dispose(bool disposing)
        {
            // do not own _inner
            base.Dispose(disposing);
        }
    }
}

Repro steps

Command to reproduce, after unzipping the dataset to D:\raven-so-database:

ZipArchiveIssue.exe "D:\raven-so-database" "D:\temp" --ziparchive --level=NoCompression --verify

This creates D:\temp\raven-so-database-ziparchive.zip, then attempts to open it with ZipFile.OpenRead and read a small portion from each entry. On affected versions, it fails with:

System.IO.InvalidDataException: A local file header is corrupt.

Opening the same ZIP in 7‑Zip shows Extra_ERROR Zip64_ERROR: UTF8 next to Documents\Raven.voron.
Packed Size also appears capped at 4GB for that entry, even though the file is ~30GB.
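
To look at the sizes without 7‑Zip, the entries can be listed from .NET. The sketch below is a diagnostic aid under assumptions, not part of the repro app: `ReadEntrySizes` is a hypothetical helper name, and it relies on ZipFile.OpenRead and the Entries collection still working on the affected archive, which matches the report above (the exception comes from Open() on an entry). It prints the uncompressed and compressed sizes that .NET reports for each entry without opening any entry data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper: (name, uncompressed size, compressed size) per entry,
// as reported by ZipArchiveEntry. No entry stream is opened.
static List<(string Name, long Length, long CompressedLength)> ReadEntrySizes(string zipPath)
{
    using var zip = ZipFile.OpenRead(zipPath);
    return zip.Entries
        .Select(e => (Name: e.FullName, e.Length, e.CompressedLength))
        .ToList();
}

// Optional demo: pass the path of a ZIP file to print one line per entry.
if (args.Length > 0)
{
    foreach (var (name, length, compressed) in ReadEntrySizes(args[0]))
        Console.WriteLine($"{name}\tlength={length}\tcompressed={compressed}");
}
```

Comparing the reported length of Documents/Raven.voron with the size of the source file on disk would show whether the value was truncated or replaced by a 32-bit marker.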

For completeness, writing with SharpZipLib succeeds:

ZipArchiveIssue.exe "D:\raven-so-database" "D:\temp" --sharpzip --level=NoCompression --verify

The resulting ZIP opens fine in both 7‑Zip and ZipFile.OpenRead.

Expected behavior

ZipArchive produces a valid ZIP64 archive that all standard tools can open.

Actual behavior

  • ZipFile.OpenRead(...) throws InvalidDataException: A local file header is corrupt.
  • 7‑Zip shows Extra_ERROR Zip64_ERROR: UTF8 on Documents\Raven.voron and reports an incorrect Packed Size (appears limited to 4GB) for that entry.

Regression?

No response

Known Workarounds

Use SharpZipLib's ZipOutputStream instead of ZipArchive to write the backup.

Configuration

  • Reproduces on .NET 8 and .NET 10
  • Windows 11

Other information

No response
