Improve extract_event.mjs with new features#95
krataratha wants to merge 1 commit into browserbase:main
Enhanced event extraction script with deduplication, CSV export, and statistics reporting.
Pull request overview
This PR aims to enhance the event-prospecting extraction script by adding deduplication, CSV export, and basic extraction statistics, alongside some data normalization (LinkedIn/image URLs) to improve downstream processing.
Changes:
- Add LinkedIn normalization and relative-image URL resolution.
- Add speaker deduplication, CSV export (people.csv), and stats reporting.
- Extend extraction mapping logic for Next.js __NEXT_DATA__ eval and markdown fallback.
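For readers who want a concrete picture of the first bullet, here is a minimal sketch of LinkedIn normalization and relative-image URL resolution. It is not taken from the PR: the function names normalizeLinkedin and resolveImageUrl, the choice of https://www.linkedin.com as the canonical host, and the decision to drop query strings are all assumptions made for illustration.

```javascript
// Hypothetical helper (not from the PR): resolve a possibly relative image
// path against the URL of the page it was scraped from.
function resolveImageUrl(src, pageUrl) {
  if (!src) return null;
  try {
    // The WHATWG URL constructor resolves relative paths against a base.
    return new URL(src, pageUrl).href;
  } catch {
    return null; // unparseable input or base: treat as "no image"
  }
}

// Hypothetical helper (not from the PR): reduce a LinkedIn profile link to
// one canonical form so that equivalent links compare equal.
function normalizeLinkedin(raw) {
  if (!raw) return null;
  const text = String(raw).trim();
  let url;
  try {
    // Assumption: scheme-less input such as "linkedin.com/in/x" is https.
    url = new URL(/^https?:\/\//i.test(text) ? text : `https://${text}`);
  } catch {
    return null;
  }
  // Accept linkedin.com and its subdomains only.
  if (!/(^|\.)linkedin\.com$/i.test(url.hostname)) return null;
  // Drop the query string, fragment, and trailing slashes.
  const path = url.pathname.replace(/\/+$/, '');
  return `https://www.linkedin.com${path}`;
}
```

Normalizing before deduplication matters because the dedup key in this PR uses p.linkedin when present, so two spellings of the same profile link would otherwise count as two people.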
| console.error(`Stats:`, stats); | ||
| console.log(JSON.stringify({ peopleCount: people.length, companyCount: companies.length, stats, peopleFile }, null, 2)); else cur = cur[parseInt(t.slice(1, -1), 10)]; | ||
| } | ||
| return cur; | ||
| } | ||
| function pickImage(s) { | ||
| // Detect image fields by KEY NAME regex (across Next.js / Sanity / Sessionize / custom CMS shapes). | ||
| // Matches anything containing portrait/headshot/photo/image/picture/avatar/thumbnail (case-insensitive). |
| // 🔥 NEW: Deduplicate | ||
| const seen = new Set(); | ||
| people = people.filter(p => { | ||
| const key = p.linkedin || p.name; | ||
| if (seen.has(key)) return false; | ||
| seen.add(key); | ||
| return true; | ||
| }); |
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 9d0c250.
| // 🔥 NEW: CSV Export | ||
| const csv = ['name,title,company,linkedin,image'].concat( | ||
| people.map(p => `"${p.name}","${p.title}","${p.company}","${p.linkedin}","${p.image}"`) | ||
| ).join('\n'); |
CSV export doesn't escape quotes or handle nulls
Medium Severity
The CSV export interpolates field values directly into double-quoted strings without escaping embedded double quotes (per RFC 4180, " must be escaped as ""). Additionally, null values from fields like p.title, p.company, p.linkedin, and p.image are coerced to the literal string "null" in the CSV, making them indistinguishable from actual data.
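One possible fix, sketched under assumptions: the helper names csvField and toCsv and the CSV_COLUMNS constant are hypothetical and do not appear in the PR. Embedded double quotes are doubled as RFC 4180 requires, and null or undefined values become empty quoted fields instead of the literal string "null".

```javascript
// Column order matches the header row used in the PR's CSV export.
const CSV_COLUMNS = ['name', 'title', 'company', 'linkedin', 'image'];

// Hypothetical helper: render one value as a quoted CSV field.
function csvField(value) {
  // Missing values become an empty field rather than "null"/"undefined".
  if (value === null || value === undefined) return '""';
  // RFC 4180: a double quote inside a quoted field is escaped by doubling it.
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Hypothetical helper: build the whole CSV document from the people array.
function toCsv(people) {
  const header = CSV_COLUMNS.join(',');
  const rows = people.map(p =>
    CSV_COLUMNS.map(col => csvField(p[col])).join(',')
  );
  // The PR joins rows with '\n'; RFC 4180 specifies CRLF, so switch to
  // '\r\n' if a strict consumer needs it.
  return [header, ...rows].join('\n');
}
```

Because every field stays quoted, commas and line breaks inside values remain safe, which matches the quoting style the PR already uses.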
| if (seen.has(key)) return false; | ||
| seen.add(key); | ||
| return true; | ||
| }); |
Dedup key uses name alone, dropping distinct people
Medium Severity
The deduplication key falls back to p.name alone when p.linkedin is absent. Two genuinely different speakers who share the same name but work at different companies (e.g., "David Chen" at Company A and "David Chen" at Company B) will be collapsed — the second is silently dropped. Using a composite key that incorporates p.company would prevent this data loss.
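A minimal sketch of the composite-key approach suggested here. The helper names dedupKey and dedupePeople are hypothetical, and trimming and lowercasing the name and company before comparison is an added assumption, not something the PR or the review specifies.

```javascript
// Hypothetical helper: build the deduplication key for one speaker record.
function dedupKey(p) {
  // Prefer the LinkedIn URL when present, as the PR already does.
  if (p.linkedin) return `li:${p.linkedin}`;
  // Otherwise combine name and company, so two people who share a name
  // but work at different companies stay distinct.
  // Assumption: case and surrounding whitespace are not significant.
  const name = (p.name || '').trim().toLowerCase();
  const company = (p.company || '').trim().toLowerCase();
  return `nc:${name}|${company}`;
}

// Hypothetical helper: keep the first record seen for each key.
function dedupePeople(people) {
  const seen = new Set();
  return people.filter(p => {
    const key = dedupKey(p);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

The li: and nc: prefixes keep the two key spaces apart, so a name can never collide with a LinkedIn URL. A speaker listed once with a LinkedIn URL and once without will still appear twice; merging those cases would need a separate pass.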


Note
High Risk
High risk because extract_event.mjs now contains duplicated/garbled code (e.g., an else cur = ... fragment appended after a console.log), which likely introduces syntax/runtime failures and changes output artifacts/filters used downstream.
Overview
Updates extract_event.mjs to enrich extracted speaker records (LinkedIn normalization, image picking/URL resolution, per-person slug disambiguation), add deduplication, and extend outputs with people.csv plus basic extraction stats in the JSON/console output.
Also adjusts filtering to drop host/user-company matches via slugified comparison. Note: the file appears to have an accidental duplicate/partial re-paste of earlier logic appended at the end, which will likely break execution and should be resolved before merge.
Reviewed by Cursor Bugbot for commit 9d0c250. Bugbot is set up for automated code reviews on this repo.