Pinterest data source (#478)
* Pinterest data source

* Upd8s

* Squashed commit of the following:

commit 513a589
Author: Dale Wahl <[email protected]>
Date:   Mon Jan 27 17:01:52 2025 +0100

    bsky: ensure interrupt

commit 0badee7
Author: Dale Wahl <[email protected]>
Date:   Mon Jan 27 15:19:42 2025 +0100

    bsky: no progress bar if no max_posts

commit 115a3c1
Author: Dale Wahl <[email protected]>
Date:   Mon Jan 27 14:18:12 2025 +0100

    bsky datasource

commit 836a235
Author: Dale Wahl <[email protected]>
Date:   Thu Jan 23 11:47:51 2025 +0100

    post_topic_matrix: rename column when tokenizer created multiple documents per post

commit 977d887
Author: Dale Wahl <[email protected]>
Date:   Thu Jan 23 11:37:52 2025 +0100

    rank_attribute: convert to str to lower()

commit a1cdd4c
Author: Dale Wahl <[email protected]>
Date:   Wed Jan 22 09:40:29 2025 +0100

    fix: allow None for columns.default; remove debug log statement

    fix occasional error that appears particularly on new processors with no expected default, i.e.:
    <option value="{{ choice }}"{% if choice in option_settings.default %} selected="selected"{% endif %}>{{ option_settings.options[choice] }}</option>
    TypeError: argument of type 'NoneType' is not iterable

* Handle both JSON and HTML-sourced Pinterest data

* Squashed commit of the following:

commit dd2ab72
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 18:16:59 2025 +0100

    Highlight missing fields in CSV preview

commit 204ab8a
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 18:16:47 2025 +0100

    Add a 'missing fields' key to mapped dataset items

commit 11457e0
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 18:16:26 2025 +0100

    Add a 'missing_fields' column to mapped objects

commit a3e4f77
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 18:15:06 2025 +0100

    Prevent tooltips from falling (partially) outside the viewport

commit 16be136
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 17:18:27 2025 +0100

    docker build action possible fix

commit b330339
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 16:48:38 2025 +0100

    Parse Markdown in dataset status

commit 79cb297
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 15:43:27 2025 +0100

    Indicate whether like amount is hidden for Instagram posts

commit 3c62f37
Author: Stijn Peeters <[email protected]>
Date:   Tue Feb 4 15:43:00 2025 +0100

    Do not consider missing geotags in Instagram posts 'missing' fields

commit 6d3f9d4
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 13:45:47 2025 +0100

    consolidate_urls: better logging/status, better url split

commit a84c63b
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 13:45:05 2025 +0100

    revert 0983a36

commit bf7fe14
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 12:36:26 2025 +0100

    consolidate_urls: hide unused settings based on requirements

commit ccaf114
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 12:27:57 2025 +0100

    consolidate_urls: validate URL before parsing

commit 638413a
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 11:08:17 2025 +0100

    check results exist then delete; error message include dataset key when unable to delete

    sometimes log files are left behind because FileNotFoundError was raised on the results_path

commit 0983a36
Author: Dale Wahl <[email protected]>
Date:   Tue Feb 4 10:28:51 2025 +0100

    possibly address github action build fail issue

commit 855d34e
Author: Stijn Peeters <[email protected]>
Date:   Mon Feb 3 18:44:14 2025 +0100

    Fix Gephi Lite link

commit 4e5752d
Author: Stijn Peeters <[email protected]>
Date:   Mon Feb 3 17:57:16 2025 +0100

    Nicer numbers in network processor statuses

commit 1e0a24c
Author: Stijn Peeters <[email protected]>
Date:   Mon Feb 3 17:43:20 2025 +0100

    Never assume fields are non-null in Telegram data...

commit 66d60e9
Author: Stijn Peeters <[email protected]>
Date:   Mon Feb 3 17:40:09 2025 +0100

    Fix forward username mapping in some cases for Telegram

commit 8034d1c
Merge: 59a1546 9bccdf1
Author: Stijn Peeters <[email protected]>
Date:   Mon Feb 3 12:05:48 2025 +0100

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit 59a1546
Author: Stijn Peeters <[email protected]>
Date:   Mon Feb 3 12:05:44 2025 +0100

    Add progress indicators to 'Count values' and 'Thread metadata' processors

commit 9bccdf1
Author: Dale Wahl <[email protected]>
Date:   Fri Jan 31 13:05:29 2025 +0100

    Update docker_latest.yml 6.13.0?

commit e826283
Author: Dale Wahl <[email protected]>
Date:   Fri Jan 31 13:01:18 2025 +0100

    Same but different

commit 54d10cb
Author: Dale Wahl <[email protected]>
Date:   Fri Jan 31 12:59:41 2025 +0100

    Update GitHub action to use latest docker

commit 2600e55
Author: Dale Wahl <[email protected]>
Date:   Fri Jan 31 12:34:59 2025 +0100

    python 3.11 for Docker

    Have been using this all winter and have had no issues. Enjoying the better error messages too.

commit 513a589
Author: Dale Wahl <[email protected]>
Date:   Mon Jan 27 17:01:52 2025 +0100

    bsky: ensure interrupt

commit 0badee7
Author: Dale Wahl <[email protected]>
Date:   Mon Jan 27 15:19:42 2025 +0100

    bsky: no progress bar if no max_posts

commit 115a3c1
Author: Dale Wahl <[email protected]>
Date:   Mon Jan 27 14:18:12 2025 +0100

    bsky datasource

commit 836a235
Author: Dale Wahl <[email protected]>
Date:   Thu Jan 23 11:47:51 2025 +0100

    post_topic_matrix: rename column when tokenizer created multiple documents per post

commit 977d887
Author: Dale Wahl <[email protected]>
Date:   Thu Jan 23 11:37:52 2025 +0100

    rank_attribute: convert to str to lower()

commit a1cdd4c
Author: Dale Wahl <[email protected]>
Date:   Wed Jan 22 09:40:29 2025 +0100

    fix: allow None for columns.default; remove debug log statement

    fix occasional error that appears particularly on new processors with no expected default, i.e.:
    <option value="{{ choice }}"{% if choice in option_settings.default %} selected="selected"{% endif %}>{{ option_settings.options[choice] }}</option>
    TypeError: argument of type 'NoneType' is not iterable
stijn-uva authored Feb 5, 2025
1 parent d3fec55 commit 4205610
Showing 4 changed files with 186 additions and 0 deletions.
13 changes: 13 additions & 0 deletions datasources/pinterest/DESCRIPTION.md
@@ -0,0 +1,13 @@
The Pinterest data source can be used to manipulate data collected from [Pinterest](https://pinterest.com/) with
[Zeeschuimer](https://github.com/digitalmethodsinitiative/zeeschuimer). Data is collected with the browser extension; 4CAT cannot collect Pinterest data on its own. After
collecting data with Zeeschuimer, you can upload it to 4CAT for further processing and analysis. See the Zeeschuimer
documentation for more information on how to collect data with it.

Data is collected as it is formatted internally by Pinterest's website. Posts are stored as (large) JSON objects; it
will usually be easier to make sense of the data by downloading it as a CSV file from 4CAT instead. The JSON structure
is relatively straightforward and contains some data not included in the CSV exports.
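
As a rough illustration of working with the raw format, the sketch below parses NDJSON (one JSON object per line) and lists each post's top-level keys. The helper `read_ndjson`, the sample posts and the file name mentioned afterwards are assumptions made for this example; they are not part of 4CAT.

```python
import json


def read_ndjson(lines):
    """Parse an iterable of NDJSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical two-post sample; real Pinterest objects are much larger.
sample = [
    '{"id": "1", "title": "A pin", "dominant_color": "#ff0099"}',
    '',
    '{"id": "2", "title": "Another pin"}',
]

posts = read_ndjson(sample)
for post in posts:
    # Show which fields each post actually carries
    print(post["id"], sorted(post.keys()))
```

With a downloaded file, the same helper could be fed an open file handle, for instance `read_ndjson(open("pinterest-export.ndjson", encoding="utf-8"))`, where the file name is again only a placeholder.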

## Missing data

Pinterest does not always include all metadata in its JSON objects; on some pages, for example, the time a post was
made is missing. 4CAT will warn you about this when importing data.
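
If you work with the raw JSON yourself, a missing timestamp has to be handled explicitly. The following is a minimal sketch, assuming the timestamp is stored under `created_at` or `createdAt` in the format that this data source's mapping code expects; the helper name `parse_pin_timestamp` and the sample posts are hypothetical.

```python
from datetime import datetime

# Timestamp format assumed from this data source's JSON mapping code
PIN_TIME_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def parse_pin_timestamp(post):
    """Return the post's creation time as a datetime, or None if missing or malformed."""
    raw = post.get("created_at", post.get("createdAt"))
    try:
        return datetime.strptime(raw, PIN_TIME_FORMAT)
    except (ValueError, TypeError):
        # TypeError: the field is absent (raw is None); ValueError: unexpected format
        return None


with_time = {"id": "1", "created_at": "Tue, 04 Feb 2025 18:16:59 +0000"}
without_time = {"id": "2"}

print(parse_pin_timestamp(with_time))
print(parse_pin_timestamp(without_time))
```

Returning `None` rather than raising keeps posts without a timestamp in the dataset, so they can still be counted or filtered out later.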
12 changes: 12 additions & 0 deletions datasources/pinterest/__init__.py
@@ -0,0 +1,12 @@
"""
Initialize Pinterest data source
"""

# An init_datasource function is expected to be available to initialize this
# data source. A default function that does this is available from the
# backend helpers library.
from common.lib.helpers import init_datasource

# Internal identifier for this data source
DATASOURCE = "pinterest"
NAME = "Pinterest"
142 changes: 142 additions & 0 deletions datasources/pinterest/search_pinterest.py
@@ -0,0 +1,142 @@
"""
Import scraped Pinterest data
It's prohibitively difficult to scrape data from Pinterest within 4CAT itself due
to its aggressive rate limiting. Instead, import data collected elsewhere.
"""
from datetime import datetime

from backend.lib.search import Search
from common.lib.item_mapping import MappedItem, MissingMappedField


class SearchPinterest(Search):
    """
    Import scraped Pinterest data
    """
    type = "pinterest-search"  # job ID
    category = "Search"  # category
    title = "Import scraped Pinterest data"  # title displayed in UI
    description = "Import Pinterest data collected with an external tool such as Zeeschuimer."  # description displayed in UI
    extension = "ndjson"  # extension of result file, used internally and in UI
    is_from_zeeschuimer = True

    # not available as a processor for existing datasets
    accepts = [None]
    references = [
        "[Zeeschuimer browser extension](https://github.com/digitalmethodsinitiative/zeeschuimer)",
        "[Worksheet: Capturing TikTok data with Zeeschuimer and 4CAT](https://tinyurl.com/nmrw-zeeschuimer-tiktok)"
    ]

    def get_items(self, query):
        """
        Run custom search

        Not available for Pinterest
        """
        raise NotImplementedError("Pinterest datasets can only be created by importing data from elsewhere")

    @staticmethod
    def map_item(post):
        """
        Map Pinterest object to 4CAT item

        Depending on whether the object was captured from JSON or HTML, treat it
        differently. A lot of data is missing from HTML objects.

        :param dict post: Pinterest object
        :return MappedItem: Mapped item
        """
        if post.get("_zs-origin") == "html":
            return SearchPinterest.map_item_from_html(post)
        else:
            return SearchPinterest.map_item_from_json(post)

    @staticmethod
    def map_item_from_json(post):
        """
        Map Pinterest object to 4CAT item

        Pretty simple, except posts sometimes don't have timestamps :| but at
        least these objects are more complete than the HTML data usually

        :param dict post: Pinterest object
        :return MappedItem: Mapped item
        """
        try:
            # there are often no timestamps :'(
            # a missing field yields None, which makes strptime raise TypeError
            timestamp = datetime.strptime(post.get("created_at", post.get("createdAt")), "%a, %d %b %Y %H:%M:%S %z")
            unix_timestamp = int(timestamp.timestamp())
            str_timestamp = timestamp.strftime("%Y-%m-%d %H:%M:%S")
        except (ValueError, TypeError):
            unix_timestamp = str_timestamp = MissingMappedField("")

        # only fall back to "id" when "entityId" is absent; post.get("entityId", post["id"])
        # would evaluate post["id"] eagerly and raise KeyError if "id" is missing
        post_id = post["entityId"] if "entityId" in post else post["id"]

        if "imageSpec_orig" in post:
            image_url = post["imageSpec_orig"]["url"]
        else:
            image_url = post["images"]["orig"]["url"]

        return MappedItem({
            "id": post_id,
            "author": post["pinner"]["username"],
            "author_fullname": post["pinner"].get("fullName", post["pinner"].get("full_name", "")),
            "author_original": post["nativeCreator"]["username"] if post.get("nativeCreator") else post["pinner"]["username"],
            "body": post["description"].strip(),
            "subject": post["title"].strip(),
            "ai_description": post.get("auto_alt_text", ""),
            "pinner_original": post["originPinner"]["fullName"] if post.get("originPinner") else "",
            "pinner_via": post["viaPinner"]["fullName"] if post.get("viaPinner") else "",
            "board": post["board"]["name"],
            "board_pins": post["board"].get("pinCount", post["board"].get("pin_count")),
            "board_url": f"https://www.pinterest.com{post['board']['url']}",
            "timestamp": str_timestamp,
            "idea_tags": ",".join(post["pinJoin"]["visualAnnotation"]) if post.get("pinJoin") else "",
            "url": f"https://www.pinterest.com/pin/{post_id}",
            # these are not always available (shame)
            # "is_repin": "yes" if post["isRepin"] else "no",
            # "is_unsafe": "yes" if post["isUnsafe"] else "no",
            # "total_saves": post["aggregatedPinData"]["aggregatedStats"]["saves"],
            "is_video": "yes" if post.get("isVideo", post.get("videos")) else "no",
            "image_url": image_url,
            "dominant_colour": post.get("dominantColor", post.get("dominant_color")),
            "unix_timestamp": unix_timestamp
        })

    @staticmethod
    def map_item_from_html(post):
        """
        Map Pinterest object to 4CAT item

        These are from the HTML and have even less data than JSON objects...
        but enough to be useful in some cases.

        :param dict post: Pinterest object
        :return MappedItem: Mapped item
        """
        return MappedItem({
            "id": int(post["id"]),
            "author": MissingMappedField(""),
            "author_fullname": MissingMappedField(""),
            "author_original": MissingMappedField(""),
            "body": post["body"].strip(),
            "subject": post["title"].strip(),
            "ai_description": MissingMappedField(""),
            "pinner_original": MissingMappedField(""),
            "pinner_via": MissingMappedField(""),
            "board": MissingMappedField(""),
            "board_pins": MissingMappedField(""),
            "board_url": MissingMappedField(""),
            "timestamp": MissingMappedField(""),  # there are no timestamps :(
            "idea_tags": ",".join(post["tags"]),
            "url": f"https://www.pinterest.com/pin/{post['id']}",
            # these are not always available (shame)
            # "is_repin": "yes" if post["isRepin"] else "no",
            # "is_unsafe": "yes" if post["isUnsafe"] else "no",
            # "total_saves": post["aggregatedPinData"]["aggregatedStats"]["saves"],
            "is_video": MissingMappedField(""),
            "image_url": post["image"],
            "dominant_colour": MissingMappedField(""),
            "unix_timestamp": MissingMappedField("")
        })
19 changes: 19 additions & 0 deletions webtool/lib/template_filters.py
@@ -121,6 +121,25 @@ def _jinja2_filter_httpquery(data):
    except TypeError:
        return ""

@app.template_filter("add_colour")
def _jinja2_add_colours(data):
    """
    Add colour preview to hexadecimal colour values.

    Cute little preview for #FF0099-like strings. Used (at time of writing) for
    Pinterest data, which has a "dominant colour" field.

    Only works on strings that are *just* the value, to avoid messing up HTML
    etc

    :param str data: String
    :return str: HTML
    """
    # fullmatch enforces the "just the value" rule from the docstring; re.match
    # only anchors at the start and would also accept e.g. "#FF0099 and more"
    if not isinstance(data, str) or not re.fullmatch(r"#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})", data):
        return data

    return f'<span class="colour-preview"><i style="background:{data}" aria-hidden="true"></i> {data}</span>'

@app.template_filter("add_ahref")
def _jinja2_filter_add_ahref(content, ellipsiate=0):
    """