Skip to content

(EAI-1044) Add sourceType to all sources #756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 2, 2025
Merged

(EAI-1044) Add sourceType to all sources #756

merged 14 commits into from
Jun 2, 2025

Conversation

hschawe
Copy link
Collaborator

@hschawe hschawe commented May 29, 2025

Jira: https://jira.mongodb.org/browse/EAI-1044

Changes

  • Adds sourceType for all sources (except for mongodb-corp)

Notes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I should add sourceType: "marketing" to the web-misc web source (lines 403-418)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that marketing doesn't make a ton of sense for those URLs. We might want to break them out into one or two separate buckets with appropriate sourceType values. e.g. learn.mongodb.com/ could be university-content or perhaps a more general new one like university-info.

Let's discuss with the team tomorrow and agree on a path forward.

cbush
cbush previously requested changes May 29, 2025
Copy link
Collaborator

@cbush cbush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this change. What's wrong with the flexibility of string?

/**
Source type indicating the type of content the web page contains.
*/
sourceType?: SourceTypeName;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really optional?

Copy link
Collaborator Author

@hschawe hschawe May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sort of. we wanted to leave the sourceType for the mongodb-corp source undefined because it's a single page that doesn't fit into any category that we might be interested in. mongodb-corp is not a web source though, so we could make this required.
edit: and, also, I did not add a sourceType to the web-misc source, since not all of those are marketing pages


/**
Arbitrary metadata for page.
*/
metadata?: PageMetadata;
};

export type SourceTypeName =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are very specific to our instance of the bot, which means it should not be in core.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - we should keep mongodb-rag-core agnostic to the specifics of ingested sources, etc so that it can fit many use cases.

I think this type would make the most sense in ingest-mongodb-public since that is specific to our use case. Can we move it somewhere in that package and update imports throughout this PR?

Note that we may need to change some package-level config stuff to make that work. If you hit issues with this let me know and we can work through it.

@hschawe
Copy link
Collaborator Author

hschawe commented May 29, 2025

I don't really understand this change. What's wrong with the flexibility of string?

We know what strings we want to use for the sourceTypes, so this allows us to ensure we're using the correct strings. For reference, this is the mapping

Snooty docs: "tech-docs"
Devcenter: "devcenter"
MongoDB DotCom: "marketing"
MongoDB University: "university-content"
Mongoose: "tech-docs-external"
Prisma: "tech-docs-external"
Mongodb-Corp (Chatbot): undefined
University-meta (Chatbot): "university-content"
Practical Aggregations: "book-external"
Terraform: "tech-docs-external"
Wired Tiger "tech-docs-external"

@cbush
Copy link
Collaborator

cbush commented May 29, 2025

But you don't know every future possible source name and the source types you listed aren't relevant to every user of the chatbot framework.

@hschawe
Copy link
Collaborator Author

hschawe commented May 29, 2025

But you don't know every future possible source name and the source types you listed aren't relevant to every user of the chatbot framework.

My intent was that we could update the strings in SourceTypeName if needed. I'd consider moving SourceTypeName type into our ingest implementation, but that might be a bit odd to implement since the Page type is directly used everywhere... so looks like I'll be reverting sourceType to a string

@cbush cbush self-requested a review May 29, 2025 20:08
@cbush cbush dismissed their stale review May 29, 2025 20:08

Looks good, thanks for keeping core tidy 👍

Copy link
Collaborator

@nlarew nlarew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM nice work!

@hschawe hschawe merged commit cb01231 into main Jun 2, 2025
1 check passed
@hschawe hschawe deleted the EAI-1044 branch June 2, 2025 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants