(EAI-1044) Add sourceType to all sources #756
Conversation
I'm not sure if I should add `sourceType: "marketing"` to the `web-misc` web source (lines 403-418).
Agreed that marketing doesn't make a ton of sense for those URLs. We might want to break them out into one or two separate buckets with appropriate `sourceType` values, e.g. `learn.mongodb.com/` could be `university-content` or perhaps a more general new one like `university-info`.
Let's discuss with the team tomorrow and agree on a path forward.
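To make the "separate buckets" idea concrete for that discussion, here is a minimal sketch of assigning a `sourceType` by URL prefix. Everything in it is an assumption for illustration: the `SourceTypeRule` shape, the `webMiscRules` table, the `sourceTypeForUrl` helper, and the `https://` scheme on the prefix are hypothetical and not taken from the PR. Only the `learn.mongodb.com/` to `university-content` pairing comes from the suggestion above, and it has not been agreed on.

```typescript
// Hypothetical rule shape: pages whose URL starts with `urlPrefix`
// get the given `sourceType`.
type SourceTypeRule = { urlPrefix: string; sourceType: string };

// Hypothetical rule table for the web-misc URLs. The single entry mirrors
// the suggestion in this thread; the scheme is an assumption.
const webMiscRules: SourceTypeRule[] = [
  { urlPrefix: "https://learn.mongodb.com/", sourceType: "university-content" },
];

// Returns the sourceType of the first matching rule, or undefined when no
// rule matches, so uncategorized pages simply carry no sourceType.
function sourceTypeForUrl(
  url: string,
  rules: SourceTypeRule[]
): string | undefined {
  const match = rules.find((rule) => url.startsWith(rule.urlPrefix));
  return match ? match.sourceType : undefined;
}
```

With this shape, URLs that fit no bucket stay untyped instead of being forced into `marketing`.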
I don't really understand this change. What's wrong with the flexibility of string?
/**
  Source type indicating the type of content the web page contains.
*/
sourceType?: SourceTypeName;
Is it really optional?
Yes, sort of. We wanted to leave the `sourceType` for the `mongodb-corp` source undefined because it's a single page that doesn't fit into any category that we might be interested in. `mongodb-corp` is not a web source though, so we could make this required.
Edit: also, I did not add a `sourceType` to the `web-misc` source, since not all of those are marketing pages.

/**
  Arbitrary metadata for page.
*/
metadata?: PageMetadata;
};

export type SourceTypeName =
These are very specific to our instance of the bot, which means it should not be in core.
+1 - we should keep `mongodb-rag-core` agnostic to the specifics of ingested sources, etc. so that it can fit many use cases.
I think this type would make the most sense in `ingest-mongodb-public` since that is specific to our use case. Can we move it somewhere in that package and update imports throughout this PR?
Note that we may need to change some package-level config stuff to make that work. If you hit issues with this, let me know and we can work through it.
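One possible way to split this, as a rough sketch only: the core keeps `sourceType` as an open string, and the ingest package narrows it to the literals it cares about. The `Page` shape, the `MongoDbPage` alias, the `sourceTypeNames` array, and the `isSourceTypeName` guard are all hypothetical stand-ins, not the actual types in `mongodb-rag-core` or `ingest-mongodb-public`. The three literals are the ones mentioned in this thread, and the real list may differ.

```typescript
// Hypothetical stand-in for the core page type: the core only knows that a
// source type is some string, so any consumer can define its own values.
type Page<SourceType extends string = string> = {
  url: string;
  body: string;
  /** Source type indicating the type of content the page contains. */
  sourceType?: SourceType;
};

// Hypothetical stand-in for what could live in the ingest package.
// These literals are taken from this thread; the real list may differ.
const sourceTypeNames = ["tech-docs", "marketing", "university-content"] as const;
type SourceTypeName = (typeof sourceTypeNames)[number];

// Pages in the MongoDB-specific ingest config get the narrowed type,
// so a misspelled sourceType fails at compile time.
type MongoDbPage = Page<SourceTypeName>;

// Runtime guard for values that arrive as plain strings (e.g. from config).
function isSourceTypeName(value: string): value is SourceTypeName {
  return sourceTypeNames.some((name) => name === value);
}

const docsPage: MongoDbPage = {
  url: "https://example.com/docs",
  body: "...",
  sourceType: "tech-docs",
};
```

This keeps the compile-time check on known strings for our sources while other users of the framework can still pass any string.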
We know what strings we want to use for the sourceTypes, so this allows us to ensure we're using the correct strings. For reference, this is the mapping: Snooty docs: "tech-docs"
But you don't know every possible future source name, and the source types you listed aren't relevant to every user of the chatbot framework.
My intent was that we could update the strings in
LGTM nice work!
Jira: https://jira.mongodb.org/browse/EAI-1044
Changes
Notes