
Improve extract_event.mjs with new features #95

Open
krataratha wants to merge 1 commit into browserbase:main from krataratha:patch-1

Conversation

krataratha commented May 5, 2026

Enhanced event extraction script with deduplication, CSV export, and statistics reporting.


Note

High Risk
High risk: extract_event.mjs now contains duplicated, garbled code (for example, an else cur = ... fragment appended after a console.log call), which likely causes syntax or runtime failures. The PR also changes the output artifacts and filters that downstream steps rely on.

Overview
Updates extract_event.mjs to enrich extracted speaker records (LinkedIn normalization, image picking/URL resolution, per-person slug disambiguation), add deduplication, and extend outputs with people.csv plus basic extraction stats in the JSON/console output.

Also adjusts filtering to drop host/user-company matches via slugified comparison. Note: the file appears to have an accidental duplicate/partial re-paste of earlier logic appended at the end, which will likely break execution and should be resolved before merge.

Reviewed by Cursor Bugbot for commit 9d0c250. Bugbot is set up for automated code reviews on this repo. Configure here.

Copilot AI review requested due to automatic review settings May 5, 2026 05:19

Copilot AI left a comment


Pull request overview

This PR aims to enhance the event-prospecting extraction script by adding deduplication, CSV export, and basic extraction statistics, alongside some data normalization (LinkedIn/image URLs) to improve downstream processing.

Changes:

  • Add LinkedIn normalization and relative-image URL resolution.
  • Add speaker deduplication, CSV export (people.csv), and stats reporting.
  • Extend extraction mapping logic for Next.js __NEXT_DATA__ eval and markdown fallback.


Comment on lines +166 to +168
console.error(`Stats:`, stats);
console.log(JSON.stringify({ peopleCount: people.length, companyCount: companies.length, stats, peopleFile }, null, 2)); else cur = cur[parseInt(t.slice(1, -1), 10)];
}
Comment on lines +167 to 173
console.log(JSON.stringify({ peopleCount: people.length, companyCount: companies.length, stats, peopleFile }, null, 2)); else cur = cur[parseInt(t.slice(1, -1), 10)];
}
return cur;
}
function pickImage(s) {
// Detect image fields by KEY NAME regex (across Next.js / Sanity / Sessionize / custom CMS shapes).
// Matches anything containing portrait/headshot/photo/image/picture/avatar/thumbnail (case-insensitive).
Comment on lines +149 to +151
// 🔥 NEW: CSV Export
const csv = ['name,title,company,linkedin,image'].concat(
people.map(p => `"${p.name}","${p.title}","${p.company}","${p.linkedin}","${p.image}"`)
Comment on lines +116 to +123
// 🔥 NEW: Deduplicate
const seen = new Set();
people = people.filter(p => {
const key = p.linkedin || p.name;
if (seen.has(key)) return false;
seen.add(key);
return true;
});

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


// 🔥 NEW: CSV Export
const csv = ['name,title,company,linkedin,image'].concat(
people.map(p => `"${p.name}","${p.title}","${p.company}","${p.linkedin}","${p.image}"`)
).join('\n');

CSV export doesn't escape quotes or handle nulls

Medium Severity

The CSV export interpolates field values directly into double-quoted strings without escaping embedded double quotes (per RFC 4180, " must be escaped as ""). Additionally, null values from fields like p.title, p.company, p.linkedin, and p.image are coerced to the literal string "null" in the CSV, making them indistinguishable from actual data.
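
One possible way to address both points is sketched below. It is only an illustration, not the PR author's code: the helper names csvField and toCsv are hypothetical, the column list is taken from the header row in the diff, and writing null or undefined as an empty field is an assumption about what downstream consumers expect.

```javascript
// Hypothetical helper: quote one CSV field following RFC 4180.
// Assumption: null/undefined should become an empty field rather than
// the literal string "null".
function csvField(value) {
  if (value === null || value === undefined) return '';
  const text = String(value);
  // Double every embedded quote, then wrap the whole field in quotes.
  return `"${text.replace(/"/g, '""')}"`;
}

// Hypothetical helper: build the CSV text from an array of person records.
// Rows are joined with '\n' to match the separator used in the PR.
function toCsv(people) {
  const columns = ['name', 'title', 'company', 'linkedin', 'image'];
  const header = columns.join(',');
  const rows = people.map(p => columns.map(c => csvField(p[c])).join(','));
  return [header, ...rows].join('\n');
}

// Made-up sample record containing a quote, a comma, and null fields.
const samplePeople = [
  { name: 'Ann "AJ" Lee', title: null, company: 'Acme, Inc.', linkedin: null, image: null },
];
console.log(toCsv(samplePeople));
```

With this shape, a missing title produces two adjacent commas instead of "null", so a spreadsheet or parser sees an empty cell.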


if (seen.has(key)) return false;
seen.add(key);
return true;
});

Dedup key uses name alone, dropping distinct people

Medium Severity

The deduplication key falls back to p.name alone when p.linkedin is absent. Two genuinely different speakers who share the same name but work at different companies (e.g., "David Chen" at Company A and "David Chen" at Company B) will be collapsed — the second is silently dropped. Using a composite key that incorporates p.company would prevent this data loss.
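
A minimal sketch of such a composite key follows. The helper names dedupKey and dedupePeople are hypothetical, and the sketch assumes p.linkedin has already been normalized earlier in the script, so identical profiles compare equal as strings. Lowercasing and trimming the name and company is an extra assumption intended to catch trivial formatting differences.

```javascript
// Hypothetical helper: build a dedup key for one person record.
// Prefers the LinkedIn URL; otherwise falls back to name plus company.
function dedupKey(p) {
  if (p.linkedin) return `li:${p.linkedin}`;
  const name = (p.name || '').trim().toLowerCase();
  const company = (p.company || '').trim().toLowerCase();
  return `nc:${name}|${company}`;
}

// Keep the first record seen for each key and drop later repeats.
function dedupePeople(people) {
  const seen = new Set();
  return people.filter(p => {
    const key = dedupKey(p);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Made-up sample: two different speakers with the same name, plus one
// true duplicate of the first.
const speakers = [
  { name: 'David Chen', company: 'Company A', linkedin: null },
  { name: 'David Chen', company: 'Company B', linkedin: null },
  { name: 'david chen ', company: 'company a', linkedin: null },
];
console.log(dedupePeople(speakers).map(p => `${p.name} @ ${p.company}`));
```

The key prefixes (li: and nc:) keep a LinkedIn URL from ever colliding with a name-based key. Two same-named speakers who both lack a company would still be merged, so this reduces the data loss rather than eliminating it.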

