Replica recovery fails after using Solr Encryption Plugin in multi-sharded Solr collection #114

ManishGitHub · 2025-01-06T07:54:14Z

I am using the Solr encryption plugin for data and index encryption. It
works fine on single-tenant systems. On a distributed system with two or
more tenants, when a collection has two or more replicas in a shard, the
follower replica fails to start replication: replica recovery fails, is
retried, and fails again in a continuous loop.

I have tested this behaviour in a multi-sharded Solr collection with two
replicas per shard.

The Solr log shows this error:
org.apache.solr.update.processor.DistributedUpdateProcessor Ignoring commit while not ACTIVE - state: BUFFERING replay: false

The replica type is NRT, and the directory factory in use is EncryptionDirectoryFactory (which extends MMapDirectoryFactory).

bruno-roustant · 2025-01-09T14:11:46Z

Thanks for this issue. I will try to write a test to reproduce and then fix.

danielsason112 · 2025-01-20T07:40:34Z

Hey,

I came across this issue as well.
After triggering encryption in a distributed SolrCloud, for a collection with a replication factor greater than 1, all follower replicas gradually fail to sync index files from their leaders and remain stuck in recovery indefinitely. Newly created replicas also cannot recover and fail to sync files from the leader.

When a leader replica tries to read any of the encrypted index files into the buffer, a java.io.EOFException: Read beyond EOF is thrown from DecryptingIndexInput.readBytes, and the following log entry appears on the leader replica's node:

[WARN]  org.apache.solr.handler.ReplicationHandler Exception while writing response for params: generation=8&qt=/replication&file=_9.cfs&checksum=true&wt=filestream&command=filecontent
java.io.EOFException: Read beyond EOF (position=0, arrayLength=46333, fileLength=46309) in Decrypting MemorySegmentIndexInput(path="/data/test_shard2_replica_n6/data/index/_9.cfs")
        at org.apache.solr.encryption.crypto.DecryptingIndexInput.readBytes(DecryptingIndexInput.java:223) ~[solr-encryption-plugin-1.0.0.jar:?]
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1635) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
        at org.apache.solr.core.SolrCore$3.write(SolrCore.java:3056) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
...

Follower replica logs show that the download has failed.

My investigation led me to believe that the root cause is the ReplicationHandler trying to read files from the EncryptionDirectory using the "full" length of the file (including the encryption header, footer, etc.), while the DecryptingIndexInput actually expects to read only up to the "logical" length of the file, resulting in a read-beyond-EOF exception.
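
To make the suspected mismatch concrete, here is a minimal, self-contained sketch. It does not use Lucene or Solr classes: `LogicalInput` and `copy` are hypothetical stand-ins for the decrypting input and for a reader that trusts a caller-supplied length. The two lengths are taken from the log entry above; reading the 24-byte difference as encryption overhead is an assumption, not something confirmed in this thread.

```java
import java.io.EOFException;

public class LengthMismatchSketch {

  /** Hypothetical stand-in for a decrypting input: it exposes only the logical bytes. */
  static final class LogicalInput {
    private final byte[] logical;
    private int position;

    LogicalInput(byte[] logical) {
      this.logical = logical;
    }

    long length() {
      return logical.length;
    }

    void readBytes(byte[] dest, int offset, int len) throws EOFException {
      // Refuse to read past the logical end, as an encrypted input would.
      if (position + len > logical.length) {
        throw new EOFException("Read beyond EOF (position=" + position
            + ", arrayLength=" + len + ", fileLength=" + logical.length + ")");
      }
      System.arraycopy(logical, position, dest, offset, len);
      position += len;
    }
  }

  /** Hypothetical reader that trusts whatever length the caller reports. */
  static boolean copy(LogicalInput in, long lengthToRead) {
    byte[] buf = new byte[(int) lengthToRead];
    try {
      in.readBytes(buf, 0, buf.length);
      return true;
    } catch (EOFException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    long physicalLength = 46333; // "arrayLength" in the log above
    int logicalLength = 46309;   // "fileLength" in the log above

    // Asking for the on-disk length overshoots the logical content and fails.
    System.out.println(copy(new LogicalInput(new byte[logicalLength]), physicalLength));
    // Asking for the logical length succeeds.
    System.out.println(copy(new LogicalInput(new byte[logicalLength]), logicalLength));
  }
}
```

In this sketch the first call fails and the second succeeds, which mirrors the behaviour described: the read only works when the length used by the caller matches the length the input itself reports.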

While digging into it, I noticed that ReplicationHandler uses the fileLength method inherited from the EncryptionDirectory superclass to get the actual file size.
To read the "logical" length of the file during replication file sync instead, I overrode that method in EncryptionDirectory as follows:

@Override
public long fileLength(String name) throws IOException {
  // Open the file through this directory so that the returned IndexInput
  // reports the "logical" length rather than the on-disk length.
  try (IndexInput indexInput = openInput(name, IOContext.READONCE)) {
    return indexInput.length();
  }
}

The above patch seems to solve the EOF error: replicas that were stuck in recovery became active, and creating new replicas also succeeds.

I was just following a hunch, and I do not deeply understand the issue and the appropriate fix for it.
@bruno-roustant I will be glad to get your input on this one.

BTW working on a test to reproduce the issue.

Many thanks!

bruno-roustant · 2025-01-22T11:00:05Z

Nice investigation @danielsason112, this seems to be a good lead!

It reminds me of the file length complexity I had to solve when initially developing the EncryptionDirectory in Lucene, to be compatible with the compound file format and other Lucene file length checks.
I'll look deeper at this file length issue to confirm.

Thanks

bruno-roustant self-assigned this Jan 9, 2025
