#620 Add support for shards - SolrSpout #1343

mvolikas · 2024-10-05T14:34:25Z

This PR (work in progress) employs the following strategy for supporting shards in status:

1. Add a script to start Solr in cloud mode and create the collections using configsets. This script configures the status collection to use the number of shards defined in solr-conf.yaml (solr.status.routing.shards) and sets the sharding field to be the same as solr.status.routing.fieldname.
2. Each SolrSpout instance fetches documents from the corresponding shard.
3. Unlike OpenSearch, the StatusUpdaterBolt does not need further changes to route to the correct shard. This is done implicitly by Solr because of (1).
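The idea that each SolrSpout instance reads from its own shard can be pictured as a mapping from a spout task's position to a shard name. The sketch below is only an illustration, not the code from this PR: the class `ShardAssignment` and its method are hypothetical, and it assumes that shards carry Solr's default names (`shard1`, `shard2`, ...) and that the spout parallelism equals the number of shards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of StormCrawler: maps spout task positions to shard names. */
final class ShardAssignment {

    private ShardAssignment() {}

    /**
     * Returns the shard name for the spout task at position taskIndex (0-based).
     * Assumes the default SolrCloud shard naming: shard1, shard2, ...
     */
    static String shardFor(int taskIndex, int totalShards) {
        if (totalShards < 1) {
            throw new IllegalArgumentException("totalShards must be >= 1");
        }
        if (taskIndex < 0 || taskIndex >= totalShards) {
            throw new IllegalArgumentException(
                    "taskIndex " + taskIndex + " has no matching shard out of " + totalShards);
        }
        return "shard" + (taskIndex + 1);
    }

    public static void main(String[] args) {
        int totalShards = 4;
        List<String> assigned = new ArrayList<>();
        for (int task = 0; task < totalShards; task++) {
            assigned.add(shardFor(task, totalShards));
        }
        // One shard name per spout task, in task order.
        System.out.println(assigned);
    }
}
```

Rejecting an out-of-range task index makes a mismatch between spout parallelism and shard count fail loudly instead of silently leaving a shard unread.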

Pending tasks:

Tests should be updated to use Solr cloud and verify the shards are populated correctly.
Add test (spout) for the case of multiple (2) Solr shards.
Test with Storm 2.7.0

mvolikas

Unlike OpenSearch, in SolrSpout we do not call markQueryReceivedNow(). Should we add this?

external/solr/setup-solr.sh

jnioche · 2024-10-06T14:40:34Z

Unlike OpenSearch, in SolrSpout we do not call markQueryReceivedNow(). Should we add this?

Yes, and also call isInQuery.set(true);
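As a rough illustration of the bookkeeping discussed here, the sketch below flags a query as in flight before it is sent and records when the response arrives. It is a standalone model, not StormCrawler's spout base class: `QueryTracker`, `beforeQuery()` and the field `lastQueryReceivedMillis` are hypothetical names, and only `isInQuery` and `markQueryReceivedNow()` come from the discussion above. It also assumes that marking a query as received clears the in-flight flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical standalone model of the query bookkeeping; not StormCrawler code. */
final class QueryTracker {

    // True while a query has been sent and its response has not arrived yet.
    final AtomicBoolean isInQuery = new AtomicBoolean(false);

    // Wall-clock time of the last response, in milliseconds since the epoch.
    volatile long lastQueryReceivedMillis = 0L;

    /** Call right before sending a query; returns false if one is already in flight. */
    boolean beforeQuery() {
        return isInQuery.compareAndSet(false, true);
    }

    /** Call when the response (or an error) comes back. Assumed to clear the flag. */
    void markQueryReceivedNow() {
        lastQueryReceivedMillis = System.currentTimeMillis();
        isInQuery.set(false);
    }

    public static void main(String[] args) {
        QueryTracker tracker = new QueryTracker();
        if (tracker.beforeQuery()) {
            // ... send the query to Solr here ...
            tracker.markQueryReceivedNow();
        }
        System.out.println("in query: " + tracker.isInQuery.get());
    }
}
```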

external/solr/setup-solr.sh

external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/SolrSpout.java

mvolikas · 2024-10-14T17:06:40Z

external/solr/src/test/java/org/apache/stormcrawler/solr/persistence/SpoutTest.java

+                componentToTasks,
+                new HashMap<>(),
+                null,
+                null,


When writing the test with the 2 spouts, I manually created this Storm TopologyContext object with most of the parameters set to null. Is this ok? Should we set anything else?

no idea to be honest but it feels more complicated than what we've had to do for the other tests.
What about reusing TestUtil.getMockedTopologyContext()?

This didn't work, since what I wanted to test involved the SolrSpout calling context.getComponentTasks(), which in turn reads componentToTasks, for example. I started from the FileSpoutTopologyContextMock, and we could in principle have something similar if we want more such SolrSpout tests in the future.
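To make the dependency concrete: the spout needs to know which tasks belong to its own component in order to find its position among them. The sketch below models only that lookup with plain Java collections. `FakeContext` and `positionOf` are hypothetical stand-ins and do not extend Storm's `TopologyContext`; the only detail taken from the thread is that `getComponentTasks()` reads from a `componentToTasks` map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the part of a topology context this test relies on. */
final class FakeContext {

    private final Map<String, List<Integer>> componentToTasks;

    FakeContext(Map<String, List<Integer>> componentToTasks) {
        this.componentToTasks = componentToTasks;
    }

    /** Returns the task ids registered for a component, or an empty list if unknown. */
    List<Integer> getComponentTasks(String componentId) {
        return componentToTasks.getOrDefault(componentId, Collections.emptyList());
    }

    /** Returns the 0-based position of taskId among the component's sorted tasks, or -1. */
    int positionOf(String componentId, int taskId) {
        List<Integer> tasks = new ArrayList<>(getComponentTasks(componentId));
        Collections.sort(tasks);
        return tasks.indexOf(taskId);
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> componentToTasks = new HashMap<>();
        componentToTasks.put("spout", List.of(7, 3)); // two spout tasks, ids are arbitrary
        FakeContext context = new FakeContext(componentToTasks);
        System.out.println(context.positionOf("spout", 3));
        System.out.println(context.positionOf("spout", 7));
    }
}
```

A mocked context that returns nothing for the component lookup would give every spout the same position, which is why a test with two spouts needs the map populated.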

rzo1 · 2024-10-24T08:14:30Z

@mvolikas Do you aim to include this PR in 3.1.1 - if so, do you think you can work on the open comments or would it be ok to move it to 3.1.2?

mvolikas · 2024-10-24T14:40:53Z

@mvolikas Do you aim to include this PR in 3.1.1 - if so, do you think you can work on the open comments or would it be ok to move it to 3.1.2?

From my side, a safe estimate for having this ready and tested would be the first week of November. If the 3.1.1 release is planned earlier please move this to 3.1.2. Thanks!

jnioche · 2024-10-25T13:13:38Z

@mvolikas Do you aim to include this PR in 3.1.1 - if so, do you think you can work on the open comments or would it be ok to move it to 3.1.2?

From my side, a safe estimate for having this ready and tested would be the first week of November. If the 3.1.1 release is planned earlier please move this to 3.1.2. Thanks!

I am sure we can wait. This will be a great addition to the next release

jnioche

Currently doesn't pass the tests because of missing license headers

Running the archetype generation with

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT

fails

Caused by: org.apache.maven.plugin.MojoFailureException: java.io.IOException: No such file or directory
    at org.apache.maven.archetype.mojos.CreateProjectFromArchetypeMojo.execute (CreateProjectFromArchetypeMojo.java:216)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)

external/solr/archetype/src/main/resources/archetype-resources/pom.xml

external/solr/archetype/src/main/resources/archetype-resources/README.md

external/solr/README.md

mvolikas · 2024-11-02T12:27:16Z

Running the archetype generation with

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT

fails

@jnioche This is strange; I cannot reproduce it locally. In my case, it runs as expected when I:
- first run mvn clean install for the incubator-stormcrawler project
- then run the mvn archetype command you wrote.

mvn --version
Apache Maven 3.8.7
Maven home: /usr/share/maven
Java version: 17.0.12, vendor: Ubuntu, runtime: /usr/lib/jvm/java-17-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.8.0-47-generic", arch: "amd64", family: "unix"

Could it be a directory permissions issue?
Can you provide the full exception trace?


jnioche · 2024-11-02T18:00:20Z

archetype generated successfully, no idea why it had failed

jnioche · 2024-11-02T18:02:57Z

@mvolikas compiling the project generated from the archetype fails with

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project crawlzob: Compilation failure
[ERROR] /tmp/crawl/src/main/java/com/dipe/CrawlTopology.java:[34,8] class SolrCrawlTopology is public, should be declared in a file named SolrCrawlTopology.java

jnioche · 2024-11-02T21:21:03Z

@mvolikas The Java-based topologies could actually go. We don't have them in the OpenSearch module and I think the vast majority of people just rely on the Flux files.
@rzo1 any thoughts?

The README generated by the archetype mentions an injection.flux which is currently missing

mvolikas · 2024-11-03T13:50:50Z

@mvolikas compiling the project generated from the archetype fails with

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project crawlzob: Compilation failure
[ERROR] /tmp/crawl/src/main/java/com/dipe/CrawlTopology.java:[34,8] class SolrCrawlTopology is public, should be declared in a file named SolrCrawlTopology.java

Yes I have not yet tested with the java topologies. One thing to note is that after generating the default StormCrawler archetype with

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.1.0

and then running mvn clean package I also get an error:

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
  symbol: class ConfigurableTopology
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  15.617 s
[INFO] Finished at: 2024-11-03T15:46:48+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project test: Compilation failure
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
[ERROR]   symbol: class ConfigurableTopology
[ERROR] 
[ERROR] -> [Help 1]

Can you confirm that?

external/solr/archetype/src/main/resources/archetype-resources/src/main/java/CrawlTopology.java

rzo1 · 2024-11-03T14:19:38Z

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
  symbol: class ConfigurableTopology
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  15.617 s
[INFO] Finished at: 2024-11-03T15:46:48+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project test: Compilation failure
[ERROR] /home/markos/apache/test/src/main/java/test/CrawlTopology.java:[35,36] cannot find symbol
[ERROR]   symbol: class ConfigurableTopology
[ERROR] 
[ERROR] -> [Help 1]

Can you confirm that?

The actual issue is that the template is missing an import for org.apache.stormcrawler.ConfigurableTopology; -> #1389

jnioche · 2024-11-03T14:30:20Z

The actual issue is that the template is missing an import for org.apache.stormcrawler.ConfigurableTopology; -> #1389

The fact that it has been broken forever and no one reported it kind of suggests that hardly anyone uses the core archetype.
I would be in favour of getting rid of duplicate functionality by removing the Java topologies and having only the Flux files.
We need one for the injection @mvolikas

jnioche · 2024-11-03T14:31:15Z

Yes I have not yet tested with the java topologies

The compilation fails whether you use the Java topologies or not...

rzo1 · 2024-11-03T14:37:28Z

Basically, the Java topologies are only good for testing in local mode (IMHO) and are only usable if the IDE is configured to include the provided-scope dependencies in the run configuration (if started from within an IDE). For this reason, I am fine with dropping them ;-) (but from an ASF perspective, this needs to be discussed on the dev@ list)


mvolikas · 2024-11-03T15:41:34Z

We need one for the injection @mvolikas

Ok, so I guess I will add this back.

jnioche · 2024-11-03T15:50:06Z

We need one for the injection @mvolikas

Ok, so I guess I will add this back.

Sorry if I wasn't clear - we need a Flux for the injection, not the Java topology

mvolikas · 2024-11-03T16:02:30Z

An update from my side:

I have now tested in local mode with 1 and 4 shards.
I have updated the SolrSpout code so that the query param for shards is not added for just one shard.
I tested that the compilation with the Java topologies included succeeds.
Improved the scripts to handle the case of commented-out properties.
After the latest commit, the following workflow worked locally without any issues for me:
- mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT
- cd test && mvn clean compile
- /opt/apache-storm-2.6.4/bin/storm local target/test-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux --local-ttl 3600 - the provided flux topology starts from the StormCrawler webpage URL and indexes documents in Solr.

This makes running the injection topology first unnecessary if someone is getting started and wants to have a basic topology up and running out of the box.
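The single-shard behaviour mentioned in the update can be sketched as follows. This is a minimal illustration, not the actual SolrSpout code: `ShardParams` and `build` are hypothetical, plain maps stand in for a Solr query object, and only the parameter name `shards` is Solr's own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds query parameters, restricting to one shard only when needed. */
final class ShardParams {

    private ShardParams() {}

    static Map<String, String> build(String query, String shardName, int totalShards) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        // With a single shard there is nothing to restrict, so the parameter is left out.
        if (totalShards > 1) {
            params.put("shards", shardName);
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(build("*:*", "shard1", 1));
        System.out.println(build("*:*", "shard3", 4));
    }
}
```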

Still to do/decide:

~~Should we keep any of the Java topologies? (probably just the SeedInjector.java one)~~ no
~~Testing with Storm 2.7.0. (I have tested with Storm 2.6.4 and Solr 9.7.0)~~ done
~~Review the changes and READMEs one more time after the previous 2 are done.~~ done

mvolikas · 2024-11-03T16:04:41Z

We need one for the injection @mvolikas

Ok, so I guess I will add this back.

Sorry if I wasn't clear - we need a Flux for the injection, not the Java topology

No worries; I will add this too.

jnioche · 2024-11-05T07:02:51Z

@mvolikas, latest comments

- need to bring in the change from "Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.0 to 3.3.1" #1390
- storm jar target/crawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux injection.flux --local-ttl 3600
  needs changing to
  storm local target/crawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux injection.flux --local-ttl 3600
- delete the Java topology classes: there is a discussion on whether they should be removed in the core topology, but here I would advocate that we shouldn't add them in the first place. The OpenSearch module doesn't have them.
- remove the MemorySpout from the crawl.flux, rely on the separate injection and make sure the README reflects this, like we do in the OpenSearch module

Thanks!

mvolikas · 2024-11-09T14:06:41Z

Hi there!

@jnioche I think I have made the changes; also pushed some comments and minor fixes to the readme files.

I ran some more tests with a greater number of shards (e.g. 10), and everything seems ok.
From my side we could merge the changes. What do you think?

jnioche

Checked with a crawl in local mode, works fine
Thanks a lot @mvolikas, this is a great contribution

mvolikas added 2 commits October 5, 2024 17:03

apache#620 update spout to fetch from the corresponding shard

7cd06f9

apache#620 add Solr scripts

e33c2bb

mvolikas commented Oct 5, 2024

mvolikas self-assigned this Oct 5, 2024

mvolikas commented Oct 6, 2024

external/solr/setup-solr.sh

jnioche reviewed Oct 6, 2024

external/solr/setup-solr.sh

jnioche reviewed Oct 6, 2024

external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/SolrSpout.java

mvolikas added 5 commits October 13, 2024 16:33

apache#620 fix tests to operate in cloud mode

8ebd0d1

apache#620 fix code format

b5bd556

apache#620 add Solr spout test

e1ee25d

apache#620 add license

27d71c9

apache#620 improve the Solr related scripts

d22c325

mvolikas commented Oct 14, 2024

external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/SolrSpout.java

mvolikas commented Oct 14, 2024

This was linked to issues Oct 18, 2024

Add support for shards in SOLR #620

Closed

SOLR Status Updater - configure byDomain or byIP #626

Closed

apache#620 add solr archetype, update readmes

e7add5e

jnioche requested changes Oct 27, 2024

external/solr/archetype/src/main/resources/archetype-resources/pom.xml

external/solr/archetype/src/main/resources/archetype-resources/README.md

jnioche reviewed Oct 27, 2024

external/solr/README.md

apache#620 minor fixes

49f4556

apache#620 do not set the 'shard' query parameter when we have a single shard

686a83a

rzo1 reviewed Nov 3, 2024

external/solr/archetype/src/main/resources/archetype-resources/src/main/java/CrawlTopology.java

apache#620 fix archetype includes, improve scripts and configuration files

c581b91

apache#620 fix java topologies

291013e

apache#620 add 'injection.flux' topology

99b01cd

jnioche mentioned this pull request Nov 4, 2024

Bump org.apache.maven.archetype:archetype-packaging from 3.3.0 to 3.3.1 #1395

Merged

mvolikas added 3 commits November 9, 2024 15:17

apache#620 bring in change from apache#1390

6cdffd6

apache#620 update sample flux topologies and readme

6532dc8

apache#620 minor comments and readme changes

7640e01

jnioche approved these changes Nov 10, 2024

jnioche merged commit f53e89a into apache:main Nov 10, 2024
3 checks passed

jnioche added SOLR archetype enhancement labels Nov 10, 2024

jnioche added this to the 3.1.1 milestone Nov 10, 2024
