turbaszek
Member

In this PR I tried to build some basic tooling around scanning data sources. Currently the data is only collected and not persisted anywhere. That's still something I plan to do in this PR.

Although I decided to use the old approach with configuration in a YAML file, it should be treated only as a temporary solution until we have a database.

Why didn't I use the well-known PyGithub? It's licensed under the LGPL.

@github-actions github-actions bot added the area:scanners Scanners related issues label May 22, 2021
@turbaszek turbaszek added this to the Kibble 0.0.1 milestone May 22, 2021
@turbaszek turbaszek force-pushed the add-github-scanners branch from 579d529 to 70e7105 Compare May 22, 2021 12:40
@turbaszek turbaszek force-pushed the add-github-scanners branch from 70e7105 to 75346a9 Compare May 22, 2021 12:58
@github-actions github-actions bot added the area:dev Development related issues label May 22, 2021
        self.log = logging.getLogger(__name__)

    def _persist(self, payload: Any):  # pylint: disable=no-self-use
        """Persists data to database. Should be implemented per scanner."""

One scanner, one database/table?

Member Author

@turbaszek turbaszek May 29, 2021

> One scanner, one database/table?

I assume there will be only one database with multiple collections / document types. We need to evaluate ES vs. some other NoSQL DB.

But the general idea here is that each data type provides the information on how to persist it and how to read it.

Member

As of current ES versions, you can't have different document types in the same DB.
So it would have to be one document type per index, which is what we also did with Kibble 1: for instance, you have kibble_mail, kibble_code_commit, kibble_issue, etc., one per general data type. You can run more generalized queries across indices, but usually we just deal with one type at a time, so having multiple indices is not an issue.
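
A minimal sketch of the one-index-per-data-type convention described above. The helper names are hypothetical, and the comma-separated form assumes Elasticsearch's usual multi-index target syntax; nothing here is taken from the Kibble code base.

```python
from typing import Iterable

INDEX_PREFIX = "kibble"  # prefix taken from the index names mentioned above


def index_name(data_type: str) -> str:
    """Return the index that holds documents of one data type."""
    return f"{INDEX_PREFIX}_{data_type}"


def multi_index_target(data_types: Iterable[str]) -> str:
    """Build a comma-separated target for a query spanning several indices."""
    return ",".join(index_name(data_type) for data_type in data_types)


print(index_name("mail"))
print(multi_index_target(["mail", "code_commit", "issue"]))
```

Keeping the naming in one helper means a query normally targets a single index, while a cross-type query only needs a longer target string.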

@github-actions github-actions bot removed the area:scanners Scanners related issues label May 28, 2021
@turbaszek
Member Author

turbaszek commented May 30, 2021

@kaxil @sharanf @Humbedooh @michalslowikowski00 happy to get your opinion.

As suggested on the dev list, I introduced the concepts of DataSource and DataType. For now, those can be configured in a YAML configuration file:

data_sources:
   - name: github_kibble
     class: kibble.data_sources.github.GithubDataSource
     config:
       repo_owner: apache
       repo_name: kibble
       enabled_data_types:
         - pr_issues

This form allows users to specify any external data source, as long as the class path points to an importable object.
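
As a rough illustration of how a `class:` entry could be resolved, here is a sketch using the standard library's `importlib`. The helper name `import_from_path` is an assumption, not the loader used in this PR, and the demo imports a standard-library class because the Kibble class may not be installed.

```python
import importlib
from typing import Any


def import_from_path(class_path: str) -> Any:
    """Import an object from a dotted path such as 'package.module.ClassName'."""
    module_path, _, attr_name = class_path.rpartition(".")
    if not module_path:
        raise ImportError(f"{class_path!r} is not a dotted path to an object")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr_name)
    except AttributeError as err:
        raise ImportError(
            f"Module {module_path!r} has no attribute {attr_name!r}"
        ) from err


# Stand-in for a value like kibble.data_sources.github.GithubDataSource.
ordered_dict_cls = import_from_path("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

Raising `ImportError` for both a missing module and a missing attribute gives a single error type to report when a configured class path is wrong.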

The role of a DataSource is to provide authentication methods for the external service it represents. A DataType represents a single type of information we can get from this source; in the case of this PR, those are GitHub issues (which also include PRs). The role of a DataType is to define:

  • how to process the raw data from the external source and how to persist it in the database (to be done)
  • how to read the data from the database, including aggregations, filters, etc.
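
The split of responsibilities above could look roughly like the following sketch. All class and method names are assumptions made for illustration (they are not the interfaces in this PR), the token header is only one possible auth scheme, and the store is an in-memory list standing in for the database that does not exist yet.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Optional


class DataSource:
    """One account/organization within an external service (hypothetical shape)."""

    def __init__(self, name: str, config: Dict[str, Any]) -> None:
        self.name = name
        self.config = config

    def auth_headers(self) -> Dict[str, str]:
        """Return request headers for the service; empty when no token is set."""
        token: Optional[str] = self.config.get("token")
        return {"Authorization": f"token {token}"} if token else {}


class DataType(ABC):
    """One kind of information that can be fetched from a data source."""

    name: str

    def __init__(self, data_source: DataSource) -> None:
        self.data_source = data_source

    @abstractmethod
    def persist(self, payload: Iterable[Dict[str, Any]]) -> None:
        """Process raw records and store them."""

    @abstractmethod
    def read(self, **filters: Any) -> List[Dict[str, Any]]:
        """Return stored records matching all given field filters."""


class InMemoryPrIssues(DataType):
    """Toy data type that keeps records in a list instead of a database."""

    name = "pr_issues"

    def __init__(self, data_source: DataSource) -> None:
        super().__init__(data_source)
        self._store: List[Dict[str, Any]] = []

    def persist(self, payload: Iterable[Dict[str, Any]]) -> None:
        for raw in payload:
            # Processing step: keep only the fields this sketch cares about.
            self._store.append({"number": raw["number"], "state": raw["state"]})

    def read(self, **filters: Any) -> List[Dict[str, Any]]:
        return [
            record
            for record in self._store
            if all(record.get(key) == value for key, value in filters.items())
        ]


source = DataSource("github_kibble", {"repo_owner": "apache", "repo_name": "kibble"})
issues = InMemoryPrIssues(source)
issues.persist(
    [
        {"number": 1, "state": "open", "title": "made-up record"},
        {"number": 2, "state": "closed", "title": "made-up record"},
    ]
)
print(issues.read(state="open"))
```

Because each data type owns both `persist` and `read`, swapping the list for a real database would only touch the data type classes, not the data sources.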

In general, this is the rough idea I have in mind:
Kibble-2

@sharanf

sharanf commented May 31, 2021

@turbaszek Thanks for working on this. My initial thought is that this looks a lot more granular than what we have in place now, which is good, as we have sometimes missed being able to get to the right level of granularity. For GitHub, the data types seem fairly organised and can pretty much already be allocated. How do you see this working for, say, our project mailing lists? Would each list then be a data source and the conversations the data type?

@turbaszek
Member Author

> How do you see this working for, say, our project mailing lists? Would each list then be a data source and the conversations the data type?

That's a very good question @sharanf!

I would lean towards what you've written. A data source represents not only an "external service" entity but an "account/organization within an external service". So, in the case of mailing lists, each Apache project would require configuring its own data source.

For example:

data_sources:
   - name: asf_mails_kibble
     class: kibble.data_sources.pony_mail.PonyMailDataSource
     config:
       project_name: kibble
       enabled_data_types:
         - mails
   - name: asf_mails_kafka
     class: kibble.data_sources.pony_mail.PonyMailDataSource
     config:
       project_name: kafka
       enabled_data_types:
         - mails
   - name: asf_mails_pulsar
     class: kibble.data_sources.pony_mail.PonyMailDataSource
     config:
       project_name: pulsar
       enabled_data_types:
         - mails

While there's a bit of duplication in the configuration, it allows more granularity. In the case of the ASF the config will be big and repetitive, but for smaller Kibble deployments it would be smaller, and more configuration may be an advantage.
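
For large deployments, the repetitive entries could be generated rather than written by hand. A sketch under that assumption follows; the helper `mail_data_sources` is hypothetical, and the field names simply mirror the example configuration above.

```python
from typing import Any, Dict, Iterable, List

PONY_MAIL_CLASS = "kibble.data_sources.pony_mail.PonyMailDataSource"


def mail_data_sources(projects: Iterable[str]) -> List[Dict[str, Any]]:
    """Build one mailing-list data source entry per project."""
    return [
        {
            "name": f"asf_mails_{project}",
            "class": PONY_MAIL_CLASS,
            "config": {"project_name": project, "enabled_data_types": ["mails"]},
        }
        for project in projects
    ]


config = {"data_sources": mail_data_sources(["kibble", "kafka", "pulsar"])}
for entry in config["data_sources"]:
    print(entry["name"], entry["config"]["project_name"])
```

The resulting dictionary has the same structure as the YAML above, so it could be serialized to a config file while each project's entry stays individually editable.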

This extra granularity is also useful for sources that need authorisation. In such cases we may want to store the credentials in a different way or use different auth methods.
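
One way per-source credentials might work is sketched below. The `auth` block, its `method` values, and the function `resolve_token` are all assumptions made for illustration; the PR does not define a credential format.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_token(
    config: Dict[str, Any], env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the token for one data source, based on its own auth settings."""
    auth = config.get("auth", {"method": "none"})
    method = auth.get("method", "none")
    if method == "none":
        # Public source: no credentials needed.
        return None
    if method == "env":
        # Token is read from an environment variable; None if it is unset.
        return env.get(auth["variable"])
    if method == "inline":
        # Token is stored directly in the configuration.
        return auth["token"]
    raise ValueError(f"Unknown auth method: {method!r}")


public_source = {"project_name": "kibble"}
private_source = {
    "project_name": "kafka",
    "auth": {"method": "env", "variable": "EXAMPLE_TOKEN"},
}
print(resolve_token(public_source))
print(resolve_token(private_source, env={"EXAMPLE_TOKEN": "secret"}))
```

Since every data source carries its own `auth` block in this sketch, two sources for the same service can use different credential stores.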

@github-actions github-actions bot added area:docs Documentation related issues aread:database labels Jun 11, 2021
@turbaszek turbaszek requested a review from kaxil June 11, 2021 19:12
under the License.
Apache Kibble Overview
======================
Member Author

@sharanf @Humbedooh @kaxil @michalslowikowski00 I added some docs/notes about the current status and how things work. Let me know what you think.

Thanks Tomek! This is good work. I have added some minor text changes.

Minor text changes
@turbaszek
Member Author

@kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to keep this moving.

@sharanf

sharanf commented Jun 20, 2021

> @kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to keep this moving.

@turbaszek From my side, I am happy to keep things moving, so I have no problem with starting to merge your new code once the tests are fixed.

- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- run: pip install '.[devel]'
with:

Labels
area:dev Development related issues area:docs Documentation related issues aread:database type:feature

6 participants