.actor/input_schema.json
+9 lines changed: 9 additions & 0 deletions
@@ -136,6 +136,15 @@
 "description": "If enabled, the Actor attempts to close or remove cookie consent dialogs to improve the quality of extracted text. Note that this setting increases the latency.",
 "default": true
 },
+"scrapingTool": {
+"title": "Which scraping tool to use",
+"type": "string",
+"description": "Choose which scraping tool to use for extracting the target web pages. The Browser tool is more powerful and can handle JavaScript-heavy websites, while the Plain HTML tool is about two times faster.",
README.md
+1 line changed: 1 addition & 0 deletions
@@ -116,6 +116,7 @@ The `/search` GET HTTP endpoint accepts the following query parameters:
 |`dynamicContentWaitSecs`| number |`10`| The maximum time in seconds to wait for dynamic page content to load. The Actor considers the web page as fully loaded once this time elapses or when the network becomes idle. |
 |`removeCookieWarnings`| boolean |`true`| If enabled, removes cookie consent dialogs to improve text extraction accuracy. This might increase latency. |
 |`removeElementsCssSelector`| string |`see input`| A CSS selector matching HTML elements that will be removed from the DOM, before converting it to text, Markdown, or saving as HTML. This is useful to skip irrelevant page content. The value must be a valid CSS selector as accepted by the `document.querySelectorAll()` function. \n\nBy default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline images. You can disable the removal by setting this value to some non-existent CSS selector like `dummy_keep_everything`. |
+|`scrapingTool`| string |`browser-playwright`| Selects which scraping tool is used to extract the target websites. `browser-playwright` uses a browser and can handle complex, JavaScript-heavy websites. `raw-http` uses a simple HTTP request to fetch the HTML at the given URL; it cannot handle websites that rely on JavaScript, but it is about two times faster. |
 |`debugMode`| boolean |`false`| If enabled, the Actor will store debugging information in the dataset's debug field. |

<!-- TODO: we should probably add proxyConfiguration -->
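To illustrate how the new `scrapingTool` query parameter might be passed to the `/search` endpoint, here is a minimal sketch. The base URL `http://localhost:3000`, the `query` parameter name, and the helper `build_search_url` are assumptions made for this example and do not appear in the diff. Only the two `scrapingTool` values come from the README row above.

```python
from urllib.parse import urlencode

# The two values documented in the README row for `scrapingTool`.
SCRAPING_TOOLS = ("browser-playwright", "raw-http")


def build_search_url(base_url: str, query: str,
                     scraping_tool: str = "browser-playwright") -> str:
    """Build a /search URL that sets the scrapingTool query parameter.

    `base_url` and the `query` parameter name are assumptions for this sketch.
    """
    if scraping_tool not in SCRAPING_TOOLS:
        raise ValueError(
            f"scrapingTool must be one of {SCRAPING_TOOLS}, got {scraping_tool!r}"
        )
    params = {"query": query, "scrapingTool": scraping_tool}
    # urlencode percent-encodes values and turns spaces into '+'.
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


if __name__ == "__main__":
    # Request the faster plain-HTTP tool for a page that needs no JavaScript.
    print(build_search_url("http://localhost:3000", "web scraping", "raw-http"))
```

Rejecting unknown values on the client side is a design choice for this sketch; how the Actor itself handles an unrecognized `scrapingTool` value is not shown in the diff.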