Replica recovery fails after using Solr Encryption Plugin in multi-sharded Solr collection #114

Open
ManishGitHub opened this issue Jan 6, 2025 · 3 comments

ManishGitHub commented Jan 6, 2025

I am using the Solr encryption plugin for data and index encryption. It
works fine on single-tenant systems. On a distributed system with two or
more tenants, when a collection has two or more replicas in a shard, the
follower replica fails to start replication: replica recovery fails, and
the replica retries and fails continuously.

I have tested this behaviour in a multi-sharded Solr collection with two
replicas per shard.

The Solr log shows this error:
org.apache.solr.update.processor.DistributedUpdateProcessor Ignoring commit while not ACTIVE - state: BUFFERING replay: false

The replica type is NRT, and the directory factory in use is EncryptionDirectoryFactory, which extends MMapDirectoryFactory.

bruno-roustant (Contributor) commented

Thanks for this issue. I will try to write a test to reproduce and then fix.

bruno-roustant self-assigned this Jan 9, 2025

danielsason112 commented Jan 20, 2025

Hey,

I came across this issue as well.
After triggering the encryption in a distributed SolrCloud, for a collection with a replication factor greater than 1, all follower replicas gradually fail to sync index files from their leaders and remain stuck in recovery indefinitely. Newly created replicas also cannot recover and fail to sync files from the leader.

When leader replicas try to read any of the encrypted index files into the buffer, a java.io.EOFException: Read beyond EOF is thrown from DecryptingIndexInput.readBytes, and the following log entry appears on the leader replica's node:

[WARN]  org.apache.solr.handler.ReplicationHandler Exception while writing response for params: generation=8&qt=/replication&file=_9.cfs&checksum=true&wt=filestream&command=filecontent
java.io.EOFException: Read beyond EOF (position=0, arrayLength=46333, fileLength=46309) in Decrypting MemorySegmentIndexInput(path="/data/test_shard2_replica_n6/data/index/_9.cfs")
        at org.apache.solr.encryption.crypto.DecryptingIndexInput.readBytes(DecryptingIndexInput.java:223) ~[solr-encryption-plugin-1.0.0.jar:?]
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1635) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
        at org.apache.solr.core.SolrCore$3.write(SolrCore.java:3056) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
...

Follower replica logs show that the download has failed.

My investigation led me to believe that the root cause is that the ReplicationHandler tries to read files from the EncryptionDirectory using the "full" length of the file (including the encryption header, footer, etc.), while the DecryptingIndexInput actually expects to read only up to the "logical" length of the file, resulting in the read-beyond-EOF exception.
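To make the suspected mismatch concrete, here is a minimal, self-contained sketch that does not use Lucene or the plugin at all. `LogicalInput` and `copy` are hypothetical stand-ins, not real Solr or plugin classes. The two lengths are the ones from the log entry above, read under the assumption that `arrayLength` is the raw length requested by the replication code and `fileLength` is the logical (decrypted) length.

```java
import java.io.EOFException;
import java.io.IOException;

public class LengthMismatchDemo {

  /** Hypothetical stand-in for a decrypting input: it exposes only the logical bytes. */
  static final class LogicalInput {
    private final byte[] logical;
    private int position;

    LogicalInput(byte[] logical) {
      this.logical = logical;
    }

    long length() {
      return logical.length;
    }

    void readBytes(byte[] dest, int offset, int len) throws IOException {
      // Refuse to read past the logical end, as the stack trace above suggests.
      if (len > logical.length - position) {
        throw new EOFException("Read beyond EOF (position=" + position
            + ", arrayLength=" + len + ", fileLength=" + logical.length + ")");
      }
      System.arraycopy(logical, position, dest, offset, len);
      position += len;
    }
  }

  /** Reads declaredLength bytes in one call, roughly like a file streamer for a small file. */
  static byte[] copy(LogicalInput in, long declaredLength) throws IOException {
    byte[] buf = new byte[(int) declaredLength];
    in.readBytes(buf, 0, buf.length);
    return buf;
  }

  public static void main(String[] args) throws IOException {
    long physicalLength = 46333; // assumed raw on-disk size (arrayLength in the log)
    long logicalLength = 46309;  // assumed decrypted size (fileLength in the log)
    LogicalInput in = new LogicalInput(new byte[(int) logicalLength]);

    try {
      copy(in, physicalLength); // length taken from the raw file: too long
    } catch (EOFException e) {
      System.out.println("physical length fails: " + e.getMessage());
    }

    // Length taken from the input itself: the read stays within bounds.
    System.out.println("logical length copies " + copy(in, in.length()).length + " bytes");
  }
}
```

In this toy model the two lengths differ by 24 bytes, and asking for the larger one is enough to trigger the exception on the very first read.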

While digging into it, I noticed that ReplicationHandler uses the fileLength method of EncryptionDirectory's superclass to get the actual file size.
To read the "logical" length of the file during replication file sync instead, I overrode that method in EncryptionDirectory with the following:

@Override
public long fileLength(String name) throws IOException {
  // Open the file through this directory so the reported length is the
  // "logical" (decrypted) length rather than the raw on-disk length.
  // try-with-resources closes the input even if length() throws.
  try (IndexInput indexInput = openInput(name, IOContext.READONCE)) {
    return indexInput.length();
  }
}

The above patch seems to solve the EOF error: replicas that were stuck in recovery went active, and creation of new replicas also succeeds.

I was just following a hunch, and I do not deeply understand the issue and the appropriate fix for it.
@bruno-roustant I will be glad to get your input on this one.

BTW working on a test to reproduce the issue.

Many thanks!

bruno-roustant (Contributor) commented Jan 22, 2025

Nice investigation @danielsason112, this seems to be a good lead!

It reminds me of the file length complexity I had to solve when initially developing the EncryptionDirectory in Lucene, to be compatible with the compound file format and other Lucene file length checks.
I'll look deeper at this file length issue to confirm.

Thanks
