@Arrmlet Arrmlet commented Jul 15, 2025

Unifies and enhances the X/Twitter scraping system so that it collects a more comprehensive set of data fields, and adds strict validation suited to a trustless Web3 environment.

Enhanced Data Fields Added

Static Tweet Metadata

  • language - Tweet language code (e.g. "en")
  • full_text - Complete untruncated tweet text
  • is_retweet - Retweet flag
  • possibly_sensitive - Content sensitivity flag
  • in_reply_to_username - Username of the account being replied to
  • quoted_tweet_id - Quoted tweet ID

Dynamic Engagement Metrics

  • like_count - Number of likes
  • retweet_count - Number of retweets
  • reply_count - Number of replies
  • quote_count - Number of quote tweets
  • view_count - Number of views
  • bookmark_count - Number of bookmarks

Enhanced User Profile Data

  • user_blue_verified - Blue checkmark status
  • user_description - User bio text
  • user_location - User location string
  • profile_image_url - Avatar URL
  • cover_picture_url - Banner image URL
  • user_followers_count - Follower count
  • user_following_count - Following count

Enhanced Media Support

  • media_urls - List of media URLs
  • media_types - List of media types (photo/video/gif)
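
As a rough illustration of how the fields listed above could be grouped, here is a minimal sketch. The class name `EnhancedTweetFields` and the helper `from_raw` are hypothetical and not taken from this PR; the field types are also assumptions. Every field is optional so that older records without the new fields still load.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class EnhancedTweetFields:
    """Hypothetical container for the new fields; not the PR's actual model."""

    # Static tweet metadata
    language: Optional[str] = None
    full_text: Optional[str] = None
    is_retweet: Optional[bool] = None
    possibly_sensitive: Optional[bool] = None
    in_reply_to_username: Optional[str] = None
    quoted_tweet_id: Optional[str] = None

    # Dynamic engagement metrics
    like_count: Optional[int] = None
    retweet_count: Optional[int] = None
    reply_count: Optional[int] = None
    quote_count: Optional[int] = None
    view_count: Optional[int] = None
    bookmark_count: Optional[int] = None

    # User profile data
    user_blue_verified: Optional[bool] = None
    user_description: Optional[str] = None
    user_location: Optional[str] = None
    profile_image_url: Optional[str] = None
    cover_picture_url: Optional[str] = None
    user_followers_count: Optional[int] = None
    user_following_count: Optional[int] = None

    # Media
    media_urls: List[str] = field(default_factory=list)
    media_types: List[str] = field(default_factory=list)


def from_raw(raw: dict) -> EnhancedTweetFields:
    """Build the container from a raw dict, ignoring keys it does not know."""
    known = {f.name for f in fields(EnhancedTweetFields)}
    return EnhancedTweetFields(**{k: v for k, v in raw.items() if k in known})
```

Dropping unknown keys in `from_raw` is one way to stay tolerant of upstream schema changes; the real parser may handle this differently.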

Trustless Validation Rules

Field Categories

  • REQUIRED_FIELDS: Perfect match required (username, text, url, hashtags)
  • STATIC_IMMUTABLE: Must match exactly (tweet_id, conversation_id, media_urls)
  • DYNAMIC_BOUNDED: Range validation prevents exploits:
    • like_count: 0-100M (prevents fake engagement)
    • view_count: 0-1B (realistic view limits)
    • user_followers_count: 0-500M (prevents follower inflation)
  • PROFILE_FLEXIBLE: Collected but validation-flexible (display names, descriptions)
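
The DYNAMIC_BOUNDED ranges above could be checked along these lines. This is a sketch only: the names `DYNAMIC_BOUNDS` and `out_of_range_fields` are hypothetical, the limits are the three quoted in this description, and skipping missing fields is an assumption rather than confirmed behaviour.

```python
from typing import Dict, List, Mapping, Tuple

# Inclusive (min, max) limits copied from the description above.
DYNAMIC_BOUNDS: Dict[str, Tuple[int, int]] = {
    "like_count": (0, 100_000_000),
    "view_count": (0, 1_000_000_000),
    "user_followers_count": (0, 500_000_000),
}


def out_of_range_fields(tweet: Mapping[str, object]) -> List[str]:
    """Return the names of bounded fields whose values fall outside their range."""
    bad = []
    for name, (low, high) in DYNAMIC_BOUNDS.items():
        value = tweet.get(name)
        if value is None:
            # Assumption: a missing dynamic field is tolerated, not rejected.
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not (low <= value <= high):
            bad.append(name)
    return bad
```

For example, `out_of_range_fields({"like_count": -1})` returns `["like_count"]`, while a tweet with 10 likes returns an empty list.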

Exploit Protection

  • Range validation prevents astronomical fake numbers
  • Variance checking allows up to a 10x difference between scraped and validated values, to accommodate viral content
  • Profile fields can change between scraping and validation
  • Static immutable facts must match perfectly
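
A minimal sketch of the last three rules, under stated assumptions: "10x" is read here as a ratio in either direction between the miner's value and the validator's value, and a count of zero is treated as one so that a fresh tweet is not rejected. The function and constant names are hypothetical and not taken from this PR.

```python
from typing import Mapping

# Field names taken from the STATIC_IMMUTABLE category in this description.
STATIC_IMMUTABLE = ("tweet_id", "conversation_id", "media_urls")


def within_variance(miner_value: int, validator_value: int,
                    max_ratio: float = 10.0) -> bool:
    """True if the two counts differ by at most max_ratio in either direction."""
    if miner_value < 0 or validator_value < 0:
        return False
    low, high = sorted((miner_value, validator_value))
    # Assumption: treat a zero count as 1 to avoid an undefined ratio.
    return high <= max(low, 1) * max_ratio


def static_fields_match(miner: Mapping[str, object],
                        validator: Mapping[str, object]) -> bool:
    """True only if every static immutable field is identical on both sides."""
    return all(miner.get(k) == validator.get(k) for k in STATIC_IMMUTABLE)
```

Under this reading, 100 likes at scrape time and 1,000 at validation time passes, while 100 versus 1,001 does not. Profile fields such as descriptions are simply left out of the comparison.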

Code Consolidation

  • ✅ Removed EnhancedApiDojoTwitterScraper (6 usages replaced)
  • ✅ Unified into single ApiDojoTwitterScraper with enhanced parsing
  • ✅ Updated _best_effort_parse_dataset to extract all available fields
  • ✅ Maintained backward compatibility with existing miners

@Arrmlet Arrmlet requested a review from ewekazoo July 16, 2025 15:32

3 participants