FTS document fetching reliability #5
iansinnott started this conversation in Ideas
Replies: 1 comment 1 reply
-
I wonder, could you read the cookie stores of each browser and pass them to the Electron worker (ideally only the cookies for the domain, and subdomains, being read)? Would that fix the paywall issues with browser extension?
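To make the "only the cookies for the domain" idea concrete, here is a minimal sketch of the filtering step. It assumes the cookies have already been read out of a browser's store into plain objects. The `StoredCookie` shape and both function names are hypothetical, and the rule used is the usual cookie one: a cookie applies when the page's host equals the cookie's domain or is a subdomain of it. Path, expiry and secure flags are ignored here.

```typescript
// Hypothetical shape for a cookie read from a browser's cookie store.
interface StoredCookie {
  name: string;
  value: string;
  domain: string; // e.g. "example.com" or ".example.com"
}

// True when `host` is the cookie's domain or one of its subdomains.
function domainMatches(cookieDomain: string, host: string): boolean {
  const domain = cookieDomain.replace(/^\./, "").toLowerCase();
  const h = host.toLowerCase();
  if (domain === "") return false;
  return h === domain || h.endsWith("." + domain);
}

// Keep only the cookies relevant to the page being fetched.
function cookiesForUrl(cookies: StoredCookie[], url: string): StoredCookie[] {
  const host = new URL(url).hostname;
  return cookies.filter((c) => domainMatches(c.domain, host));
}
```

Only the filtered list would then be handed to the worker, so cookies for unrelated sites never leave the store.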
-
(working title... )
FTS doesn't always get the full text of a page. Notably:
The current system will go out and fetch full documents from the internet. This is necessary because no browser stores full page content for you, so we have to go fetch it. We currently fetch it via a simple Clojure fetch, akin to using curl. However, this approach leads to the drawbacks listed above.
Electron loading
The app runs inside Electron, a full web browser, so... it seems reasonable to do a full browser load of a web page via Electron. I've not yet confirmed whether we can modify things like the user agent string or other headers, but even without such modification the first two points above (bot blocking and JS-based pages) should see a big improvement with this approach.
This would also mean many more web requests and more memory usage, though, since each page would load its resources. Maybe the Electron load could be used as a fallback in cases where a plain fetch didn't return HTML content.
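A rough sketch of that fallback, in TypeScript rather than the app's own code. Everything named here (`FetchedPage`, `PageLoader`, `looksLikeHtml`, `loadWithFallback`) is hypothetical, and the two loaders are passed in as functions, so the sketch makes no claim about how the plain fetch or the Electron load would actually be implemented.

```typescript
// Result of any loader: the body plus the Content-Type header, if known.
interface FetchedPage {
  body: string;
  contentType: string | null;
}

// Both the cheap fetch and a browser-backed load would fit this type.
type PageLoader = (url: string) => Promise<FetchedPage>;

// Heuristic: HTML if the header says so, or the body starts like an HTML document.
function looksLikeHtml(page: FetchedPage): boolean {
  if (page.contentType !== null &&
      /text\/html|application\/xhtml\+xml/i.test(page.contentType)) {
    return true;
  }
  return /^\s*(<!doctype html|<html[\s>])/i.test(page.body);
}

// Try the cheap loader first; only pay for a full browser load when the
// cheap one fails or returns something that is not HTML.
async function loadWithFallback(
  url: string,
  plainFetch: PageLoader,
  browserLoad: PageLoader,
): Promise<FetchedPage> {
  try {
    const page = await plainFetch(url);
    if (looksLikeHtml(page)) return page;
  } catch {
    // Network error or block: fall through to the browser load.
  }
  return browserLoad(url);
}
```

One caveat with this check: a JS-based page usually still returns an HTML shell to a plain fetch, so it would pass `looksLikeHtml` and never reach the fallback. Catching those would need a stricter test, for example on how much visible text the response contains.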
Browser plugin
Another approach, and potentially a better one, is to have a browser plugin. The plugin would grab full text from pages as you browse and ping the data over to bp for indexing. This would solve all issues listed above, including pages that require login.
The downside is you'd have to install a browser plugin. We aim to require as little friction as possible.
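For a sense of how small the plugin side could be, here is a sketch of the data it might send. The `PagePayload` shape, the function names, and the local endpoint mentioned in the trailing comment are all assumptions made up for illustration. Nothing here reflects an existing bp API.

```typescript
// Hypothetical payload the plugin would hand to the app for indexing.
interface PagePayload {
  url: string;
  title: string;
  text: string;
  capturedAt: string; // ISO 8601 timestamp
}

// Collapse runs of whitespace so the indexed text stays compact.
function normalizeText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Build the payload from values a content script can read off the page.
function buildPayload(
  url: string,
  title: string,
  rawText: string,
  now: Date = new Date(),
): PagePayload {
  return {
    url,
    title: title.trim(),
    text: normalizeText(rawText),
    capturedAt: now.toISOString(),
  };
}

// In a content script this might be used roughly like so (endpoint is made up):
//
//   const payload = buildPayload(location.href, document.title, document.body.innerText);
//   fetch("http://localhost:<port>/index", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(payload),
//   });
```

Because the text is read from the already-rendered page, it includes whatever the logged-in user sees, which is why this route covers pages that need a login.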