# Add Github issues and PRs scanner #8
## Conversation
Review comment on `kibble/scanners/base.py` (outdated):
```python
self.log = logging.getLogger(__name__)

def _persist(self, payload: Any):  # pylint: disable=no-self-use
    """Persists data to database. Should be implemented per scanner."""
```
One scanner, one database/table?
> One scanner, one database/table?
I assume there will be only one database with multiple collections / document types. We need to evaluate ES vs. some other NoSQL DB.
But the general idea here is that each data type provides information on how to persist it and how to read it.
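The "each data type knows how to persist and read itself" idea could be sketched roughly as below. This is a hypothetical illustration, not code from this PR: the class names (`DataType`, `PrIssues`), the `collection` attribute, and the dict-backed store are all assumptions standing in for whatever database ends up being chosen.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataType(ABC):
    """Hypothetical sketch: a data type bundles its persistence logic."""

    #: name of the collection/index this data type is stored in (assumed attribute)
    collection: str

    @abstractmethod
    def persist(self, store: Dict[str, List[Any]], payload: Any) -> None:
        """Write a payload of this type into the backing store."""

    @abstractmethod
    def read(self, store: Dict[str, List[Any]]) -> List[Any]:
        """Read all payloads of this type back from the store."""


class PrIssues(DataType):
    """Example data type; a dict stands in for the real database here."""

    collection = "kibble_issue"

    def persist(self, store, payload):
        store.setdefault(self.collection, []).append(payload)

    def read(self, store):
        return store.get(self.collection, [])
```

A scanner's `_persist` could then simply delegate to the data type of the payload it collected, keeping the scanner itself storage-agnostic.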
As of current ES versions, you can't have different document types in the same index.
So it would have to be one document type per index, which is what we also did with Kibble 1: for instance, you have kibble_mail, kibble_code_commit, kibble_issue etc., one per general data type. You can do more generalized queries across indices, but usually we just deal with one type at a time, so having multiple indices is not an issue...
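The one-index-per-document-type layout described above can be sketched as a simple mapping. This is only an illustration of the naming convention mentioned in the comment (`kibble_mail`, `kibble_code_commit`, `kibble_issue`); the function name and the mapping's exact keys are assumptions.

```python
# Hypothetical sketch of the "one document type per index" layout:
# each general data type gets its own ES index, Kibble 1 style.
INDEX_FOR_TYPE = {
    "mail": "kibble_mail",
    "code_commit": "kibble_code_commit",
    "issue": "kibble_issue",
}


def index_for(doc_type: str) -> str:
    """Return the index a document of the given type would be written to."""
    try:
        return INDEX_FOR_TYPE[doc_type]
    except KeyError:
        raise ValueError(f"unknown document type: {doc_type}") from None

# Cross-index queries could target a pattern like "kibble_*", while
# single-type reads and writes always address exactly one index.
```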
@kaxil @sharanf @Humbedooh @michalslowikowski00 happy to get your opinions. As suggested on the dev list, I introduced the concept of data_sources:

```yaml
data_sources:
  - name: github_kibble
    class: kibble.data_sources.github.GithubDataSource
    config:
      repo_owner: apache
      repo_name: kibble
    enabled_data_types:
      - pr_issues
```

This form allows users to specify any external data source, as long as the class path points to an importable object. The role of
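Resolving a dotted `class` path like the one in the config above could look roughly like this. A minimal sketch, assuming a plain `importlib` lookup; the function name is hypothetical, and real config loading would add validation and error reporting.

```python
import importlib


def import_data_source_class(path: str):
    """Resolve a dotted path such as
    'kibble.data_sources.github.GithubDataSource' to the object it names."""
    module_path, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_path)  # import the module part
    return getattr(module, attr_name)             # fetch the class/attribute
```

For example, `import_data_source_class("collections.OrderedDict")` returns the `OrderedDict` class itself, which the loader could then instantiate with the source's `config` block.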
@turbaszek Thanks for working on this. My initial thought is that this looks a lot more granular than what we have in place now, which is good, as we have sometimes not been able to get to the right level of granularity. For GitHub the data types seem fairly organised and can pretty much already be allocated. How do you see this working, for example, for our project mailing lists? Would each list then be a data source and the conversations the data type?
That's a very good question @sharanf! I would lean toward what you've written. A data source does not only represent an "external service" entity but an "account/organization within an external service". So, in the case of mailing lists, each Apache project would require configuring its own data source. For example:

```yaml
data_sources:
  - name: asf_mails_kibble
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: kibble
    enabled_data_types:
      - mails
  - name: asf_mails_kafka
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: kafka
    enabled_data_types:
      - mails
  - name: asf_mails_pulsar
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: pulsar
    enabled_data_types:
      - mails
```

While there's a bit of duplication in the configuration, it allows more granularity. In the case of the ASF the config will be big and repetitive, but for smaller Kibble deployments it would be smaller, and more explicit configuration may be an advantage. Additionally, this extra granularity is useful for sources that need authorisation: in such cases we may want to store the credentials in a different way or use different auth methods.
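If the per-project repetition above ever becomes painful, the entries could also be generated programmatically. A hypothetical helper, assuming the config schema shown in the YAML above; the function name is an invention for illustration:

```python
def pony_mail_source(project: str) -> dict:
    """Build one data_sources entry for a Pony Mail project list."""
    return {
        "name": f"asf_mails_{project}",
        "class": "kibble.data_sources.pony_mail.PonyMailDataSource",
        "config": {"project_name": project},
        "enabled_data_types": ["mails"],
    }


# One entry per project, mirroring the hand-written YAML example.
data_sources = [pony_mail_source(p) for p in ("kibble", "kafka", "pulsar")]
```

This keeps the explicit one-source-per-project granularity (and per-source credentials) while avoiding copy-paste in large deployments like the ASF's.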
Review comment on the docs:

```rst
under the License.

Apache Kibble Overview
======================
```
@sharanf @Humbedooh @kaxil @michalslowikowski00 I added some docs/notes about the current status and how things are. Let me know what you think.
Thanks Tomek! This is good work. I have added some minor text changes.
@kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to keep this moving.
@turbaszek From my side I am happy to keep things moving, so I have no problem with starting to merge your new code once the tests are fixed.
Review comment on the CI workflow:

```yaml
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- run: pip install '.[devel]'
with:
```
In this PR I tried to build some basic tooling around scanning data sources. Currently the data is only collected and not persisted anywhere; that's still something I plan to do in this PR.
Although I decided to use the old approach with configuration in a YAML file, it should be treated only as a temporary solution until we have a database.
Why didn't I use the well-known PyGithub? It's LGPL.
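Without PyGithub, scanning issues and PRs means talking to the GitHub REST API directly. A minimal sketch of what that could look like, using only the standard library; the function names are hypothetical, and a real scanner would paginate until an empty page and handle rate limiting. (The `/issues` listing includes pull requests, which carry a `pull_request` key.)

```python
import json
import urllib.request

API = "https://api.github.com"


def issues_url(owner: str, repo: str, state: str = "all",
               page: int = 1, per_page: int = 100) -> str:
    """Build the REST endpoint for listing a repository's issues and PRs."""
    return (f"{API}/repos/{owner}/{repo}/issues"
            f"?state={state}&page={page}&per_page={per_page}")


def fetch_issues_page(owner: str, repo: str, page: int = 1, token: str = None):
    """Fetch one page of issues (not executed here; requires network access)."""
    req = urllib.request.Request(issues_url(owner, repo, page=page))
    if token:  # an optional token raises the unauthenticated rate limit
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

PRs can then be separated from plain issues client-side by checking for the `pull_request` key in each returned item.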