HtmlCollectionScraper gives 500 error #1266

Open
OmnipotentEntity opened this issue Dec 27, 2024 · 5 comments

@OmnipotentEntity

I get the following error:

{"message":"Uncaught PHP Exception ArgumentCountError: \"Too few arguments to function App\\Service\\Scraper\\HtmlScraper::extract(), 3 passed in /var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php on line 18 and exactly 4 expected\" at HtmlScraper.php line 48","context":{"exception":{"class":"ArgumentCountError","message":"Too few arguments to function App\\Service\\Scraper\\HtmlScraper::extract(), 3 passed in /var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php on line 18 and exactly 4 expected","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlScraper.php:48"}},"level":500,"level_name":"CRITICAL","channel":"request","datetime":"2024-12-27T03:27:24.772326-06:00","extra":{}}

I tracked it down to commit 432f476, which seems to have added image scraping. That required an API change, but the change wasn't applied to HtmlCollectionScraper.php at line 18 (the line named in the error) or at line 22.

The $scraping variable appears to be available in this context, so the fix might be as simple as passing it as the fourth argument in both locations. However, I'm not familiar enough with the project to feel confident creating a PR.

Thank you for your hard work!

@OmnipotentEntity
Author

OmnipotentEntity commented Dec 27, 2024

I modified these files in place and restarted the service, and I now get a new error that seems to be related to the image not being scraped properly. I only barely attempted to understand what's going on here, so there are probably a few other changes needed to match the referenced commit.

The new error is:

{"message":"Warning: file_get_contents(): SSL operation failed with code 1. OpenSSL Error messages:\nerror:0A000086:SSL routines::certificate verify failed","context":{"exception":{"class":"ErrorException","message":"Warning: file_get_contents(): SSL operation failed with code 1. OpenSSL Error messages:\nerror:0A000086:SSL routines::certificate verify failed","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":400,"level_name":"ERROR","channel":"php","datetime":"2024-12-27T03:51:51.994482-06:00","extra":{}}
{"message":"Warning: file_get_contents(): Failed to enable crypto","context":{"exception":{"class":"ErrorException","message":"Warning: file_get_contents(): Failed to enable crypto","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":400,"level_name":"ERROR","channel":"php","datetime":"2024-12-27T03:51:51.994630-06:00","extra":{}}
{"message":"Warning: file_get_contents(https://s4.anilist.co/file/anilistcdn/media/manga/cover/large/bx30703-iRLjKRnSwCFP.jpg): Failed to open stream: operation failed","context":{"exception":{"class":"ErrorException","message":"Warning: file_get_contents(https://s4.anilist.co/file/anilistcdn/media/manga/cover/large/bx30703-iRLjKRnSwCFP.jpg): Failed to open stream: operation failed","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":400,"level_name":"ERROR","channel":"php","datetime":"2024-12-27T03:51:51.994706-06:00","extra":{}}
{"message":"Uncaught PHP Exception TypeError: \"base64_encode(): Argument #1 ($string) must be of type string, false given\" at HtmlCollectionScraper.php line 23","context":{"exception":{"class":"TypeError","message":"base64_encode(): Argument #1 ($string) must be of type string, false given","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":500,"level_name":"CRITICAL","channel":"request","datetime":"2024-12-27T03:51:51.994961-06:00","extra":{}}
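The final TypeError in the log above follows from the earlier warnings: file_get_contents() returns false when the download fails, and base64_encode() only accepts a string. Below is a minimal sketch of guarding that call. The helper name fetchImageAsDataUri is hypothetical and not part of Koillection, and the hard-coded image/png prefix simply mirrors the existing line 23. It only avoids the crash; it does not address the certificate verification failure itself.

```php
<?php

/**
 * Hypothetical helper (not part of Koillection): download an image and
 * return it as a data URI, or null when there is no URL or the download fails.
 */
function fetchImageAsDataUri(?string $url): ?string
{
    if ($url === null || $url === '') {
        return null;
    }

    // file_get_contents() returns false on failure and emits a warning;
    // "@" suppresses that warning so the caller only has to handle null.
    $contents = @file_get_contents($url);
    if ($contents === false) {
        return null;
    }

    // The MIME type is hard-coded here only to mirror the original code.
    return 'data:image/png;base64,' . base64_encode($contents);
}
```

With a guard like this, a failed download would leave the image empty instead of aborting the whole scrape with a 500.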

For completeness, here is my scraper:

Name: Anilist - Manga Series
Url Pattern: https://anilist.co/manga/
Name Path: #//div[@class="type"][text()="English"]/following-sibling::div/text()#
Image Path: #//img[@class="cover"]/@src#
Volume Count: (Text) #//div[@class="type"][text()="Volumes"]/following-sibling::div/text()#
Status: (Text) #//div[@class="type"][text()="Status"]/following-sibling::div/text()#

@OmnipotentEntity
Author

With this patch the scrape finishes successfully, but the thumbnail isn't scraped properly, so it's not a full solution yet.

--- HtmlCollectionScraper.php.old       2024-12-27 09:49:20.107123727 +0000
+++ HtmlCollectionScraper.php.new       2024-12-27 19:36:08.045680868 +0000
@@ -15,12 +15,12 @@
         $crawler = $this->getCrawler($scraping);
         $scraper = $scraping->getScraper();
 
-        $image = $scraping->getScrapImage() ? $this->extract($scraper->getImagePath(), DatumTypeEnum::TYPE_TEXT, $crawler) : null;
+        $image = $scraping->getScrapImage() ? $this->extract($scraper->getImagePath(), DatumTypeEnum::TYPE_TEXT, $crawler, $scraper) : null;
         $image = $this->guessHost($image, $scraping);
 
         return [
-            'name' => $scraping->getScrapName() ? $this->extract($scraper->getNamePath(), DatumTypeEnum::TYPE_TEXT, $crawler) : null,
-            'base64Image' => 'data:image/png;base64,' . base64_encode(file_get_contents($image)),
+            'name' => $scraping->getScrapName() ? $this->extract($scraper->getNamePath(), DatumTypeEnum::TYPE_TEXT, $crawler, $scraper) : null,
+            'image' => $image,
             'data' => $this->scrapData($scraping, $crawler, ScraperTypeEnum::TYPE_COLLECTION),
             'scrapedUrl' => $scraping->getUrl()
         ];

@benjaminjonard
Owner

I had a quick look today and made a quick fix, but as you noticed, the image can't be scraped properly.

I'm looking into new ways to scrape URLs, such as the method suggested in #1263.
While it works better than the current implementation, I still can't make it work with your example: the website returns a blank page saying JavaScript is required.

I may have another solution (https://github.com/symfony/panther), but I'm having a hard time making it work with Docker.

It's going to take some time, but I hope I can push a better implementation of the scraper in the next release.

@OmnipotentEntity
Author

OmnipotentEntity commented Dec 27, 2024

That's interesting, because the same scraper seems to work as an item scraper rather than a collection scraper, unless something changed with the website overnight (which is possible).

@TaylanTatli

I've only tried the Wish scraper, and it gives the same error. I tried the patch, but it didn't solve my problem.
