FTS document fetching reliability #5

iansinnott · 2022-06-24T22:05:56Z

iansinnott
Jun 24, 2022
Maintainer

(working title... )

FTS doesn't always get the full text of a page. Notably:

some websites block bp b/c they think it's a bot (it is, but a friendly one 🤗)
JS-based sites do not get indexed
content behind auth walls does not get indexed (whether or not this is desirable is open for discussion)

The current system goes out and fetches full documents from the internet. This is necessary because no browser stores full page content for you, so we have to go fetch it. We currently fetch it via a simple Clojure fetch, akin to using curl. However, this approach leads to the drawbacks listed above.

Electron loading

The app runs inside Electron, a full web browser, so... it seems reasonable to do a full browser load of a web page via Electron. I've not confirmed yet whether or not we can modify things like the user agent string or other headers, but even without such modification the first two points above (bot blocking and JS-based pages) should see big improvement with this approach.

This would also mean many more web requests and more memory usage though, since each page would load its resources. Maybe the Electron load could be used as a fallback in cases where a plain fetch didn't return HTML content.
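The fallback idea above could be sketched roughly like this. This is only a minimal sketch, not bp's actual code: `PageResult`, `Loader`, `looksLikeHtml`, and `fetchWithFallback` are hypothetical names, and the "usable HTML" heuristic (2xx status, HTML content type, body of at least 200 characters) is an assumption. Both loaders are passed in as functions, so the plain fetch and the Electron load can be whatever the app already has.

```typescript
// Result of a page load, regardless of which loader produced it.
interface PageResult {
  status: number;
  contentType: string; // value of the Content-Type header, may be empty
  body: string;
}

// A loader turns a URL into a PageResult. Both loaders are injected,
// so this sketch does not depend on Electron or on any HTTP client.
type Loader = (url: string) => Promise<PageResult>;

// Assumed heuristic: the response is usable when it succeeded, declares an
// HTML content type, and has a non-trivial body (JS-only shells tend to be tiny).
function looksLikeHtml(result: PageResult, minLength = 200): boolean {
  const ok = result.status >= 200 && result.status < 300;
  const type = result.contentType.toLowerCase();
  const isHtml =
    type.indexOf("text/html") !== -1 ||
    type.indexOf("application/xhtml+xml") !== -1;
  return ok && isHtml && result.body.trim().length >= minLength;
}

// Try the cheap plain fetch first; only pay for a full browser load
// when the plain fetch fails or does not return usable HTML.
async function fetchWithFallback(
  url: string,
  plainFetch: Loader,
  browserLoad: Loader,
): Promise<{ result: PageResult; via: "plain" | "browser" }> {
  try {
    const plain = await plainFetch(url);
    if (looksLikeHtml(plain)) {
      return { result: plain, via: "plain" };
    }
  } catch {
    // Network errors fall through to the heavier loader.
  }
  return { result: await browserLoad(url), via: "browser" };
}
```

With this shape, the extra requests and memory of a browser load are only spent on pages where the plain fetch was blocked or came back without real content.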

Browser plugin

Another approach, and potentially a better one, is to have a browser plugin. The plugin would grab full text from pages as you browse and ping the data over to bp for indexing. This would solve all issues listed above, including pages that require login.

The downside is you'd have to install a browser plugin. We aim to require as little friction as possible.
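To make the plugin idea a bit more concrete, here is a rough sketch of the part that packages a page for indexing. Everything here is an assumption for illustration: the payload shape, the names `IndexPayload`, `buildIndexPayload`, `shouldCapture`, and `capturePage`, and the size cap are all hypothetical, and nothing in bp defines a receiving endpoint yet. In a real extension the raw text would presumably come from something like `document.body.innerText` in a content script, and `send` would deliver the payload to wherever bp listens.

```typescript
// Hypothetical shape of what the plugin would hand to bp for indexing.
interface IndexPayload {
  url: string;
  title: string;
  text: string;
  capturedAt: string; // ISO 8601 timestamp
}

// Delivery is injected, since how the plugin reaches bp is undecided.
type Sender = (payload: IndexPayload) => void;

// Only http(s) pages are captured; browser-internal pages are skipped.
function shouldCapture(url: string): boolean {
  return /^https?:\/\//i.test(url);
}

// Collapse whitespace runs and cap the length so payloads stay bounded.
function buildIndexPayload(
  url: string,
  title: string,
  rawText: string,
  now: Date,
  maxChars = 500000,
): IndexPayload {
  const text = rawText.replace(/\s+/g, " ").trim().slice(0, maxChars);
  return { url, title: title.trim(), text, capturedAt: now.toISOString() };
}

// Returns true when the page was packaged and handed to the sender.
function capturePage(
  url: string,
  title: string,
  rawText: string,
  send: Sender,
  now: Date = new Date(),
): boolean {
  if (!shouldCapture(url)) {
    return false;
  }
  send(buildIndexPayload(url, title, rawText, now));
  return true;
}
```

Because the text is read from the page the user already has open, this path would cover logged-in pages without bp ever making its own request.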

nfcampos · 2022-06-25T06:30:23Z

nfcampos
Jun 25, 2022

I wonder, could you read the cookie stores of each browser and pass that to the Electron worker (ideally only the cookies for the domain, and subdomains, being read)? Would that fix the paywall issues the browser extension is meant to solve?

1 reply

iansinnott Jun 25, 2022
Maintainer Author

It might, and that is another option, I think. I saw at least one project in the past that did something like export all your cookies from the browser profile on disk. This would be seamless, and if combined with the Electron rendering mentioned above it would probably work almost all the time.

Running a headless tab via Electron would likely run afoul of some bot detection still, but less than the current approach.
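For the domain scoping suggested earlier in the thread (only pass along cookies for the domain being read and its subdomains), a minimal sketch might look like the following. The `StoredCookie` shape and the function names are hypothetical, reading cookies out of a browser profile is not shown, and the caller is assumed to already know the site domain. Note that this follows the scoping described in the comment, which is simpler than the host-based matching browsers apply when sending cookies.

```typescript
// Hypothetical minimal view of a cookie exported from a browser profile.
interface StoredCookie {
  domain: string; // e.g. "example.com" or ".example.com"
  name: string;
  value: string;
}

// Lowercase and drop the leading dot some stores put on domain cookies.
function normalizeDomain(domain: string): string {
  return domain.trim().toLowerCase().replace(/^\./, "");
}

// True when the cookie's domain is the site domain itself or a subdomain of it.
// The "." prefix check keeps "notexample.com" from matching "example.com".
function belongsToSite(cookieDomain: string, siteDomain: string): boolean {
  const cookie = normalizeDomain(cookieDomain);
  const site = normalizeDomain(siteDomain);
  if (cookie === "" || site === "") {
    return false;
  }
  const suffix = "." + site;
  return cookie === site || cookie.slice(-suffix.length) === suffix;
}

// Keep only the cookies that should be handed to the page load for this site.
function cookiesForSite(
  cookies: StoredCookie[],
  siteDomain: string,
): StoredCookie[] {
  return cookies.filter((cookie) => belongsToSite(cookie.domain, siteDomain));
}
```

Filtering before anything is handed to the Electron worker keeps unrelated sessions (banking, email, and so on) out of the page load entirely.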
