Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#620 Add support for shards - SolrSpout #1343

Merged
merged 16 commits into from
Nov 10, 2024
Merged

Conversation

mvolikas
Copy link
Contributor

@mvolikas mvolikas commented Oct 5, 2024

This PR (work in progress) employs the following strategy for supporting shards in status:

  1. Add a script to start Solr in cloud mode and create the collections using configsets. This script configures the status collection to use the number of shards defined in solr-conf.yaml (solr.status.routing.shards) and sets the sharding field to be the same as solr.status.routing.fieldname.
  2. Each SolrSpout instance fetches documents from the corresponding shard.
  3. Unlike OpenSearch, the StatusUpdaterBolt does not need further changes to route to the correct shard. This is done implicitly by Solr because of (1).

Pending tasks:

  • Tests should be updated to use Solr cloud and verify the shards are populated correctly.
  • Add test (spout) for the case of multiple (2) Solr shards.
  • Test with Storm 2.7.0

Copy link
Contributor Author

@mvolikas mvolikas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike OpenSearch, in SolrSpout we do not call markQueryReceivedNow(). Should we add this?

@mvolikas mvolikas self-assigned this Oct 5, 2024
external/solr/setup-solr.sh Outdated Show resolved Hide resolved
@jnioche
Copy link
Contributor

jnioche commented Oct 6, 2024

Unlike OpenSearch, in SolrSpout we do not call markQueryReceivedNow(). Should we add this?

yes and also set isInQuery.set(true);

external/solr/setup-solr.sh Outdated Show resolved Hide resolved
componentToTasks,
new HashMap<>(),
null,
null,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing the test with the 2 spouts, I manually created this Storm TopologyContext object with most of the parameters set to null. Is this ok? Should we set anything else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea to be honest but it feels more complicated than what we've had to do for the other tests.
What about reusing TestUtil.getMockedTopologyContext()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't work, since what I wanted to test involved the SolrSpout calling context.getComponentTasks() which in turn reads the componentToTasks for example. I started from the FileSpoutTopologyContextMock and we could in principle have something similar if we want to have more such SolrSpout tests in the future.

@rzo1
Copy link
Contributor

rzo1 commented Oct 24, 2024

@mvolikas Do you aim to include this PR in 3.1.1 - if so, do you think you can work on the open comments or would it be ok to move it to 3.1.2?

@mvolikas
Copy link
Contributor Author

@mvolikas Do you aim to include this PR in 3.1.1 - if so, do you think you can work on the open comments or would it be ok to move it to 3.1.2?

From my side, a safe estimate for having this ready and tested would be the first week of November. If the 3.1.1 release is planned earlier please move this to 3.1.2. Thanks!

@jnioche
Copy link
Contributor

jnioche commented Oct 25, 2024

@mvolikas Do you aim to include this PR in 3.1.1 - if so, do you think you can work on the open comments or would it be ok to move it to 3.1.2?

From my side, a safe estimate for having this ready and tested would be the first week of November. If the 3.1.1 release is planned earlier please move this to 3.1.2. Thanks!

I am sure we can wait. This will be a great addition to the next release

Copy link
Contributor

@jnioche jnioche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently doesn't pass the tests because of missing license headers

Running the archetype generation with

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT

fails

Caused by: org.apache.maven.plugin.MojoFailureException: java.io.IOException: No such file or directory
    at org.apache.maven.archetype.mojos.CreateProjectFromArchetypeMojo.execute (CreateProjectFromArchetypeMojo.java:216)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)

external/solr/README.md Outdated Show resolved Hide resolved
@mvolikas
Copy link
Contributor Author

mvolikas commented Nov 2, 2024

Running the archetype generation with

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT

fails

@jnioche This is strange; I cannot reproduce it locally. In my case, it runs as expected by first running:
mvn clean install for incubator-stormcrawler project
Then running the mvn archetype command you wrote.

mvn --version
Apache Maven 3.8.7
Maven home: /usr/share/maven
Java version: 17.0.12, vendor: Ubuntu, runtime: /usr/lib/jvm/java-17-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.8.0-47-generic", arch: "amd64", family: "unix"

Could it be a directory permissions issue?
Can you provide the full exception trace?

@jnioche
Copy link
Contributor

jnioche commented Nov 2, 2024

archetype generated successfully, no idea why it had failed

@jnioche
Copy link
Contributor

jnioche commented Nov 2, 2024

@mvolikas compiling the project generated from the archetype fails with

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project crawlzob: Compilation failure
[ERROR] /tmp/crawl/src/main/java/com/dipe/CrawlTopology.java:[34,8] class SolrCrawlTopology is public, should be declared in a file named SolrCrawlTopology.java

@jnioche
Copy link
Contributor

jnioche commented Nov 2, 2024

@mvolikas The Java based topologies could actually go. We don't have them in the OpenSearch module and I think the huge majority of people just rely on the Flux files.
@rzo1 any thoughts?

The README generated by the archetype mentions an injection.flux which is currently missing

@mvolikas
Copy link
Contributor Author

mvolikas commented Nov 3, 2024

@mvolikas compiling the project generated from the archetype fails with

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project crawlzob: Compilation failure
[ERROR] /tmp/crawl/src/main/java/com/dipe/CrawlTopology.java:[34,8] class SolrCrawlTopology is public, should be declared in a file named SolrCrawlTopology.java

Yes I have not yet tested with the java topologies. One thing to note is that after generating the default StormCrawler archetype with

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.1.0

and then running mvn clean package I also get an error:

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
  symbol: class ConfigurableTopology
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  15.617 s
[INFO] Finished at: 2024-11-03T15:46:48+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project test: Compilation failure
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
[ERROR]   symbol: class ConfigurableTopology
[ERROR] 
[ERROR] -> [Help 1]

Can you confirm that?

@rzo1
Copy link
Contributor

rzo1 commented Nov 3, 2024

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
  symbol: class ConfigurableTopology
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  15.617 s
[INFO] Finished at: 2024-11-03T15:46:48+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project test: Compilation failure
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
[ERROR]   symbol: class ConfigurableTopology
[ERROR] 
[ERROR] -> [Help 1]

Can you confirm that?

The actually issue is, that the template misses an import for import org.apache.stormcrawler.ConfigurableTopology; -> #1389

@jnioche
Copy link
Contributor

jnioche commented Nov 3, 2024

The actually issue is, that the template misses an import for import org.apache.stormcrawler.ConfigurableTopology; -> #1389

The fact that it has been broken for ever and no one reported it kind of suggests that hardly anyone uses the core archetype.
I would be in favour of getting rid of duplicate functionality and remove the Java topologies and haev only the Flux files.
We need one for the injection @mvolikas

@jnioche
Copy link
Contributor

jnioche commented Nov 3, 2024

Yes I have not yet tested with the java topologies

The compilation fails whether you use the Java topologies or not...

@rzo1
Copy link
Contributor

rzo1 commented Nov 3, 2024

Bascially, the Java topologies are only good for testing in local mode (IMHO) and are actually only usable, if the IDE is configured to include the provided scoped dependencies in the run configuration (if started from within an IDE). For this reason, I am fine with dropping ;-) (but from ASF perspective, this needs to be discussed on the dev@ list)

@mvolikas
Copy link
Contributor Author

mvolikas commented Nov 3, 2024

We need one for the injection @mvolikas

Ok, so I guess I will add this back.

@jnioche
Copy link
Contributor

jnioche commented Nov 3, 2024

We need one for the injection @mvolikas

Ok, so I guess I will add this back.

Sorry if I wasn't clear - we need a Flux for the injection, not the Java topology

@mvolikas
Copy link
Contributor Author

mvolikas commented Nov 3, 2024

An update from my side:

  • I have now tested in local mode with 1 and 4 shards.

  • I have updated the SolrSpout code so that the query param for shards is not added for just one shard.

  • I tested that the compilation with the Java topologies included succeeds.

  • Improved the scripts to handle the case of commented-out properties.

  • After the latest commit, the following workflow worked locally without any issues for me:

    • mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT
    • cd test && mvn clean compile
    • /opt/apache-storm-2.6.4/bin/storm local target/test-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux --local-ttl 3600 - the provided flux topology starts from the StormCrawler webpage URL and indexes documents in Solr.

This makes running the injection topology first unnecessary if someone is getting started and wants to have a basic topology up and running out of the box.

Still to do/decide:

  • Should we keep any of the Java topologies? (probably just the SeedInjector.java one) no
  • Testing with Storm 2.7.0. (I have tested with Storm 2.6.4 and Solr 9.7.0) done
  • Review the changes and READMEs one more time after the previous 2 are done. done

@mvolikas
Copy link
Contributor Author

mvolikas commented Nov 3, 2024

We need one for the injection @mvolikas

Ok, so I guess I will add this back.

Sorry if I wasn't clear - we need a Flux for the injection, not the Java topology

No worries; I will add this too.

@jnioche
Copy link
Contributor

jnioche commented Nov 5, 2024

@mvolikas, latest comments

storm jar target/crawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux injection.flux --local-ttl 3600

needs changing to
storm local target/crawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux injection.flux --local-ttl 3600

  • delete the java topology classes -> there is a discussion on whether they should be removed in the core topology but here I would advocate that we shouldn't add them in the first place. The opensearch module doesn't have them.
  • remove the MemorySpout from the crawl.flux, rely on the separate injection and make sure the README reflects this, like we do in the OpenSearch module

Thanks!

@mvolikas
Copy link
Contributor Author

mvolikas commented Nov 9, 2024

Hi there!

@jnioche I think I have made the changes; also pushed some comments and minor fixes to the readme files.

I ran some more tests with a greater number of shards (e.g. 10), and everything seems ok.
From my side we could merge the changes. What do you think?

Copy link
Contributor

@jnioche jnioche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked with a crawl in local mode, works fine
Thanks a lot @mvolikas, this is a great contribution

@jnioche jnioche merged commit f53e89a into apache:main Nov 10, 2024
3 checks passed
@jnioche jnioche added this to the 3.1.1 milestone Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SOLR Status Updater - configure byDomain or byIP Add support for shards in SOLR
3 participants