Writing Data to Elasticsearch Storage Engine #225
Using Either[SolrClient,RestHighLevelClient] leads to "Overriding type String => SolrClient does not conform to base type String => Either[SolrClient, RestHighLevelClient]" Co-Authored-By: Kevin Yan <[email protected]>
Extract SolrRDD and SolrDeepRDD; Cast getClient() result to SolrClient in 2 RDDs and SolrUpsert; Add getRDD and getDeepRDD to StorageProxyFactory; Add 3 add resource methods to StorageProxy and cast parameter to SolrInputDocument in SolrProxy; Add 2 dummy Elasticsearch RDDs
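The conformance error quoted in the commit messages above follows from Scala's override rules: an overriding member may narrow its result type only to a subtype of what the base declares, and `SolrClient` is not a subtype of `Either[SolrClient, RestHighLevelClient]`. Below is a minimal sketch of the cast-based workaround those commits describe. It assumes the base trait exposes the client as `Any`; that return type, the `Fake*` stand-in classes, `ElasticsearchProxy`, and `CastDemo` are hypothetical and not taken from the Sparkler sources.

```scala
// Stand-in client types (hypothetical); the PR itself uses SolrClient and RestHighLevelClient.
final class FakeSolrClient(val url: String)
final class FakeEsClient(val url: String)

// Assumption: the base trait promises only the widest type to its callers.
trait StorageProxy {
  def getClient(): Any
}

// Narrowing the result type in an override is allowed here because
// FakeSolrClient is a subtype of Any. It is not a subtype of
// Either[FakeSolrClient, FakeEsClient], which is why an Either-typed
// base member cannot be overridden this way.
class SolrProxy(url: String) extends StorageProxy {
  private val client = new FakeSolrClient(url)
  override def getClient(): FakeSolrClient = client
}

class ElasticsearchProxy(url: String) extends StorageProxy {
  private val client = new FakeEsClient(url)
  override def getClient(): FakeEsClient = client
}

object CastDemo {
  // Code that only sees a StorageProxy has to cast to reach the Solr client.
  def solrUrl(proxy: StorageProxy): String =
    proxy.getClient().asInstanceOf[FakeSolrClient].url

  // A pattern match avoids a ClassCastException when the backend differs.
  def describe(proxy: StorageProxy): String = proxy.getClient() match {
    case s: FakeSolrClient => s"solr:${s.url}"
    case e: FakeEsClient   => s"elasticsearch:${e.url}"
    case other             => s"unknown:$other"
  }
}
```

The cast keeps the base trait free of backend-specific types, at the cost of moving the type check to run time. A type parameter on the trait would be another option, but the commits do not indicate that this was tried.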
We’ve been duplicating the *RDD.scala files (in this directory) and modifying them into Elasticsearch variants. Is this the correct approach? NOTE: this is still very much a work in progress. We would just like to confirm that we're working in the right direction and to hear any suggestions you have. Thanks.
Co-authored-by: Felix Loesing <[email protected]> Co-authored-by: Mingyu Cui <[email protected]>
…c testing Co-authored-by: Felix Loesing <[email protected]> Co-authored-by: Mingyu Cui <[email protected]>
This is looking great, folks. Please see my comments and keep up the good work.
sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/CrawlDbRDD.scala
sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/model/SparklerJob.scala
sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/pipeline/Crawler.scala
sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/pipeline/CrawlerRunner.scala
sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/storage/StorageProxy.scala
...ler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/storage/StorageProxyFactory.scala
[Docker] Update run script with relative paths and docker-compose file
Co-authored-by: Felix Loesing <[email protected]> Co-authored-by: Mingyu Cui <[email protected]> Co-authored-by: Nikhil Handyal <[email protected]>
Co-authored-by: Miles Phan <[email protected]>
…es but not tested
Co-authored-by: Miles Phan <[email protected]> Co-authored-by: Mingyu Cui <[email protected]>
…pdatetransformer classes. Update crawler as well
…into ISSUE-224
Co-authored-by: Felix Loesing <[email protected]> Co-authored-by: Mingyu Cui <[email protected]>
…error but data in db is not updated
Co-authored-by: Miles Phan <[email protected]>
Issue 224 2
Co-authored-by: Felix Loesing <[email protected]> Co-authored-by: Mingyu Cui <[email protected]> Co-authored-by: Miles Phan <[email protected]> Co-authored-by: Nikhil Handyal <[email protected]>
Co-authored-by: Felix Loesing <[email protected]> Co-authored-by: Miles Phan <[email protected]>
Co-authored-by: Felix Loesing <[email protected]>
Co-authored-by: Miles Phan <[email protected]>
I need ES support, and I need to merge this into the mainline because GitHub is giving dodgy merge instructions and I'd rather not lose it. So I'm going to merge this in, sync it with my mammoth mvn2sbt dev branch, clean up the integration, and then merge the whole lot back into master.
What changes were proposed in this pull request?
We have implemented the Factory Pattern to extract the storage components (Solr and Elasticsearch) from the Sparkler implementation. Currently, the Elasticsearch classes are placeholders, and we are starting to implement them. We are also testing to make sure Solr still runs when created through the Factory.
We moved the Solr-related classes into sparkler-app/src/main/scala/edu/usc/irds/sparkler/storage/solr, including the original MemexDeepCrawlDbRDD and MemexCrawlDbRDD, and renamed those two to SolrDeepRDD and SolrRDD to reflect that they are Solr-specific. Let us know if you think the new names are misleading and should be changed again.
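To illustrate the Factory Pattern described above, here is a minimal, self-contained sketch of how a factory could pick a storage backend by name. Only `StorageProxy`, `StorageProxyFactory`, and `SolrProxy` are names from this pull request. The method names (`getProxy`, `addResource`, `pending`, `backend`), the `ElasticsearchProxy` class, and the in-memory buffers are assumptions made for illustration and do not reflect the actual Sparkler signatures.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical common interface; the real trait's members may differ.
trait StorageProxy {
  def backend: String
  def addResource(doc: Map[String, Any]): Unit
  def pending: Int
}

// In-memory stand-in for a Solr-backed proxy (no SolrJ dependency here).
final class SolrProxy extends StorageProxy {
  private val buffer = ArrayBuffer.empty[Map[String, Any]]
  val backend = "solr"
  def addResource(doc: Map[String, Any]): Unit = buffer.append(doc)
  def pending: Int = buffer.size
}

// Placeholder Elasticsearch variant, mirroring the PR's placeholder classes.
final class ElasticsearchProxy extends StorageProxy {
  private val buffer = ArrayBuffer.empty[Map[String, Any]]
  val backend = "elasticsearch"
  def addResource(doc: Map[String, Any]): Unit = buffer.append(doc)
  def pending: Int = buffer.size
}

object StorageProxyFactory {
  // Assumption: the engine name comes from configuration as a plain string.
  def getProxy(engine: String): StorageProxy = engine.trim.toLowerCase match {
    case "solr"          => new SolrProxy
    case "elasticsearch" => new ElasticsearchProxy
    case other =>
      throw new IllegalArgumentException(s"Unsupported storage engine: $other")
  }
}
```

With this shape, calling code depends only on the `StorageProxy` trait, so adding another backend means adding a class and one `case` in the factory.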
Is this related to an already existing issue on sparkler?
#224
#229