12 changes: 6 additions & 6 deletions sources/academy/tutorials/apify_scrapers/cheerio_scraper.md
@@ -47,14 +47,14 @@ Before we start, let's do a quick recap of the data we chose to scrape:
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)
+![Actor detail page in Apify Store showing the data points to scrape](./images/scraping-practice.jpg)

We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](/academy/apify-scrapers/getting-started)
tutorial, so let's get to the next one on the list: title.

### Title

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)
+![DevTools showing the title h1 element inside the header tag](./images/title.jpg)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be.
Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking.
@@ -84,7 +84,7 @@ async function pageFunction(context) {
Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<p>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within
the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)
+![DevTools showing the description span element inside the header](./images/description.jpg)

```js
async function pageFunction(context) {
@@ -275,7 +275,7 @@ the Network tab of the Chrome DevTools.
We want to know what happens when we click the **Show more** button, so we open the DevTools **Network** tab and clear it.
Then we click the **Show more** button and wait for incoming requests to appear in the list.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/inspect-network.webp)
+![DevTools Network tab showing requests after clicking Show more](./images/inspect-network.jpg)

Now, this is interesting. It seems that we've only received two images after clicking the button and no additional
data. This means that the data about Actors must already be available in the page and the **Show more** button only displays it. This is good news.
@@ -288,7 +288,7 @@ few hits do not provide any interesting information, but in the end, we find our
with the ID `__NEXT_DATA__` that seems to hold a lot of information about Web Scraper. In DevTools,
you can right click an element and click **Store as global variable** to make this element available in the **Console**.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/find-data.webp)
+![DevTools Elements tab showing the NEXT_DATA script tag with Actor data](./images/find-data.jpg)

A `temp1` variable is now added to your console. We're mostly interested in its contents and we can get that using
the `temp1.textContent` property. You can see that it's a rather large JSON string. How do we know?
@@ -302,7 +302,7 @@ const data = JSON.parse(temp1.textContent);
After entering the above command into the console, we can inspect the `data` variable and see that all the information
we need is there, in the `data.props.pageProps.items` array. Great!
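
The same idea can be tried outside the browser. The following is only a minimal sketch, assuming a heavily trimmed, made-up HTML sample: the `extractNextData` helper and the `title` and `username` fields are hypothetical illustrations, not part of the tutorial or of any Apify API. Only the `__NEXT_DATA__` ID and the `props.pageProps.items` path come from the text above. It shows how the text of the `<script>` tag becomes a regular object once it goes through `JSON.parse`.

```js
// Hypothetical, heavily trimmed HTML standing in for the store page.
// The item fields (title, username) are made up for illustration.
const html = `
<html>
  <body>
    <header><h1>Store</h1></header>
    <script id="__NEXT_DATA__" type="application/json">
      {"props":{"pageProps":{"items":[
        {"title":"Web Scraper","username":"apify"},
        {"title":"Cheerio Scraper","username":"apify"}
      ]}}}
    </script>
  </body>
</html>`;

// Hypothetical helper: grab the text between the opening and closing tags of
// the <script id="__NEXT_DATA__"> element and parse it as JSON.
// A regular expression is enough for this fixed sample; in a real scraper
// you would read the element's text with the scraper's own HTML tools.
function extractNextData(source) {
    const match = source.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
    if (!match) return null; // no such script tag in the page
    return JSON.parse(match[1]);
}

const data = extractNextData(html);
const items = data.props.pageProps.items;

console.log(items.length);
console.log(items.map((item) => item.title));
```

In the browser console, `temp1.textContent` plays the role of the matched text here, so `JSON.parse(temp1.textContent)` produces the same kind of object.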

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/inspect-data.webp)
+![DevTools Console showing parsed Actor data from the NEXT_DATA object](./images/inspect-data.jpg)

> It's obvious that all the information we set to scrape is available in this one data object,
so you might already be wondering, can I make one request to the store to get this JSON
6 changes: 3 additions & 3 deletions sources/academy/tutorials/apify_scrapers/getting_started.md
@@ -27,7 +27,7 @@ Depending on how you arrived at this tutorial, you may already have your first t

> This tutorial covers the use of **Web**, **Cheerio**, and **Puppeteer** scrapers, but a lot of the information here can be used with all Actors. For this tutorial, we will select **Web Scraper**.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/actor-selection.webp)
+![Selecting the Web Scraper Actor in Apify Store](./images/actor-selection.jpg)

### Running a task

@@ -47,7 +47,7 @@ After clicking **Save & Run**, the window will change to the run detail. Here, y

Now that the run has `SUCCEEDED`, click on the glowing **Results** card to see the scrape's results. This takes you to the **Dataset** tab, where you can display or download the results in various formats. For now, click the **Preview** button. Voila, the scraped data!

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/the-run-detail.webp)
+![Run detail page showing scraped results in the Dataset tab](./images/the-run-detail.jpg)

Good job! We've run our first task and got some results. Let's learn how to change the default configuration to scrape something more interesting than the page's `<title>`.

@@ -204,7 +204,7 @@ The DevTools window will pop up and display a lot of, perhaps unfamiliar, inform

You'll see that the Element tab jumps to the first `<title>` element of the current page and that the title is **Store · Apify**. It's always good practice to do your research using the DevTools before writing the `pageFunction` and running your task.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/using-devtools.webp)
+![Chrome DevTools Elements tab with search results for title](./images/using-devtools.jpg)

> For the sake of brevity, we won't go into the details of using the DevTools in this tutorial. If you're just starting out with DevTools, this [Google tutorial](https://developer.chrome.com/docs/devtools/) is a good place to begin.

10 changes: 5 additions & 5 deletions sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md
@@ -62,14 +62,14 @@ Before we start, let's do a quick recap of the data we chose to scrape:
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)
+![Actor detail page in Apify Store showing the data points to scrape](./images/scraping-practice.jpg)

We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](/academy/apify-scrapers/getting-started)
tutorial, so let's get to the next one on the list: title.

### Title

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)
+![DevTools showing the title h1 element inside the header tag](./images/title.jpg)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be.
Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking.
@@ -108,7 +108,7 @@ is automatically passed back to the Node.js context, so we receive an actual `st
Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<p>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within
the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)
+![DevTools showing the description span element inside the header](./images/description.jpg)

```js
async function pageFunction(context) {
@@ -426,7 +426,7 @@ div.show-more > button

> Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F).

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/waiting-for-the-button.webp)
+![DevTools confirming the Show more button selector](./images/waiting-for-the-button.jpg)

Now that we know what to wait for, we plug it into the `waitFor()` function.

@@ -579,7 +579,7 @@ through all the Actors and then scrape all of their data. After it succeeds, ope
You've successfully scraped Apify Store. And if not, no worries, go through the code examples again,
it's probably just a typo.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp)
+![Dataset preview showing all scraped Actor data](./images/plugging-it-into-the-pagefunction.jpg)

## Downloading the scraped data

10 changes: 5 additions & 5 deletions sources/academy/tutorials/apify_scrapers/web_scraper.md
@@ -45,14 +45,14 @@ Before we start, let's do a quick recap of the data we chose to scrape:
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)
+![Actor detail page in Apify Store showing the data points to scrape](./images/scraping-practice.jpg)

We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](/academy/apify-scrapers/getting-started)
tutorial, so let's get to the next one on the list: title.

### Title

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)
+![DevTools showing the title h1 element inside the header tag](./images/title.jpg)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be.
Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking.
@@ -83,7 +83,7 @@ async function pageFunction(context) {
Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<p>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within
the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)
+![DevTools showing the description span element inside the header](./images/description.jpg)

```js
async function pageFunction(context) {
@@ -322,7 +322,7 @@ div.show-more > button

> Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F).

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/waiting-for-the-button.webp)
+![DevTools confirming the Show more button selector](./images/waiting-for-the-button.jpg)

Now that we know what to wait for, we plug it into the `waitFor()` function.

@@ -455,7 +455,7 @@ through all the Actors and then scrape all of their data. After it succeeds, ope
You've successfully scraped Apify Store. And if not, no worries, go through the code examples again,
it's probably just a typo.

-![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp)
+![Dataset preview showing all scraped Actor data](./images/plugging-it-into-the-pagefunction.jpg)

## Downloading the scraped data
