Add Github issues and PRs scanner #8

turbaszek · 2021-05-22T11:49:09Z

In this PR I tried to build some basic tooling around scanning data sources. Currently the data is only collected and not persisted anywhere. That's still something I plan to do in this PR.

Although I decided to use the old approach with configuration in a YAML file, it should be treated only as a temporary solution until we have a database.

Why didn't I use the well-known PyGithub? It's LGPL-licensed.

kibble/configuration/yaml_config.py

michalslowikowski00 · 2021-05-24T19:21:17Z

kibble/scanners/base.py

+        self.log = logging.getLogger(__name__)
+
+    def _persist(self, payload: Any):  # pylint: disable=no-self-use
+        """Persists data to database. Should be implemented per scanner."""


One scanner, one database/table?

turbaszek · 2021-05-29

> One scanner, one database/table?

I assume there will be only one database with multiple collections / document types. We need to evaluate ES vs some other NoSQL DB.

But the general idea here is that each data type provides information on how to persist it and how to read it.
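A minimal sketch of that idea, assuming a hypothetical `DataType` base class with `persist` and `read` methods. None of these names come from the PR, and the in-memory list only stands in for whichever database is eventually chosen:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataType(ABC):
    """Hypothetical base class: one kind of data fetched from a data source."""

    @abstractmethod
    def persist(self, payload: List[Dict[str, Any]]) -> None:
        """Store already-processed records."""

    @abstractmethod
    def read(self, **filters: Any) -> List[Dict[str, Any]]:
        """Return stored records whose fields equal all given filters."""


class InMemoryIssues(DataType):
    """Toy data type that keeps records in a list instead of a real database."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def persist(self, payload: List[Dict[str, Any]]) -> None:
        self._records.extend(payload)

    def read(self, **filters: Any) -> List[Dict[str, Any]]:
        return [
            record
            for record in self._records
            if all(record.get(key) == value for key, value in filters.items())
        ]


issues = InMemoryIssues()
issues.persist([{"id": 1, "state": "open"}, {"id": 2, "state": "closed"}])
open_issues = issues.read(state="open")
```

Keeping both directions on the data type means the code that knows a record's shape is also the code that stores and queries it.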

Humbedooh · 2021-06-01

As of current ES versions, you can't have different document types in the same DB.
So it would have to be one document type per index, which is what we also did with Kibble 1. For instance, you have kibble_mail, kibble_code_commit, kibble_issue etc., one per general data type. You can do more generalized queries across indices, but usually we just deal with one type at a time, so having multiple indices is not an issue...
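The one-index-per-data-type layout can be sketched as below. This is an illustration only: the `kibble_` prefix follows the index names mentioned above, while `index_for`, `group_by_index`, and the `data_type` field are assumptions, and no Elasticsearch client is involved.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def index_for(data_type: str) -> str:
    """Map a general data type to its own index name."""
    return f"kibble_{data_type}"


def group_by_index(docs: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group documents by target index using a hypothetical "data_type" field."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for doc in docs:
        grouped[index_for(doc["data_type"])].append(doc)
    return dict(grouped)


batches = group_by_index(
    [
        {"data_type": "issue", "title": "Add scanner"},
        {"data_type": "mail", "subject": "Kibble 2 design"},
        {"data_type": "issue", "title": "Fix tests"},
    ]
)
```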

turbaszek · 2021-05-30T20:34:35Z

@kaxil @sharanf @Humbedooh @michalslowikowski00 happy to get your opinion.

As suggested on the dev list, I introduced the concepts of DataSource and DataType. For now, these can be configured in a YAML configuration file:

data_sources:
   - name: github_kibble
     class: kibble.data_sources.github.GithubDataSource
     config:
       repo_owner: apache
       repo_name: kibble
       enabled_data_types:
         - pr_issues

This form allows users to specify any external data source, as long as the class path points to an importable object.
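Resolving such a class path can be sketched with the standard library's `importlib`. The helper name `import_object` is an assumption, not the function used in this PR, and the example resolves a standard-library class so it runs without Kibble installed:

```python
from importlib import import_module
from typing import Any


def import_object(path: str) -> Any:
    """Import the object named by a dotted path such as "package.module.Class"."""
    module_path, _, attribute = path.rpartition(".")
    if not module_path:
        raise ImportError(f"{path!r} is not a dotted path")
    module = import_module(module_path)
    try:
        return getattr(module, attribute)
    except AttributeError as err:
        raise ImportError(
            f"Module {module_path!r} has no attribute {attribute!r}"
        ) from err


# Stand-in for a value such as kibble.data_sources.github.GithubDataSource.
ordered_dict_class = import_object("collections.OrderedDict")
```

Raising `ImportError` for both failure modes gives a configuration loader a single exception to catch when a class path is wrong.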

The role of a DataSource is to provide authentication methods for the external service it represents. A DataType represents a single type of information we can get from this source; in this PR those are GitHub issues (which also include PRs). The role of a DataType is to define:

* how to process the raw data from the external source and how to persist it into the database (to be done)
* how to read the data from the database, including aggregation, filters etc.
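The data source side of this split could look roughly like the sketch below. The class name `GithubSourceSketch`, the `token` field, and both method names are assumptions made for illustration; only the `repo_owner`, `repo_name`, and `enabled_data_types` keys come from the configuration above. The URL is GitHub's REST endpoint for listing repository issues, which also returns pull requests.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GithubSourceSketch:
    """Hypothetical data source: holds connection details and authentication."""

    repo_owner: str
    repo_name: str
    enabled_data_types: List[str] = field(default_factory=list)
    token: Optional[str] = None  # assumed: an optional personal access token

    def auth_headers(self) -> Dict[str, str]:
        """Headers a data type would attach to its requests."""
        return {"Authorization": f"token {self.token}"} if self.token else {}

    def issues_url(self) -> str:
        """Endpoint listing issues; GitHub includes pull requests in this list."""
        return (
            f"https://api.github.com/repos/{self.repo_owner}/{self.repo_name}/issues"
        )


source = GithubSourceSketch("apache", "kibble", ["pr_issues"])
```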

In general, this is the rough idea I have in mind:

sharanf · 2021-05-31T17:35:01Z

> @kaxil @sharanf @Humbedooh @michalslowikowski00 happy to get your opinion.
>
> As suggested on the dev list, I introduced the concepts of DataSource and DataType. For now, these can be configured in a YAML configuration file:
>
> data_sources:
>    - name: github_kibble
>      class: kibble.data_sources.github.GithubDataSource
>      config:
>        repo_owner: apache
>        repo_name: kibble
>        enabled_data_types:
>          - pr_issues
>
> This form allows users to specify any external data source, as long as the class path points to an importable object.
>
> The role of a DataSource is to provide authentication methods for the external service it represents. A DataType represents a single type of information we can get from this source; in this PR those are GitHub issues (which also include PRs). The role of a DataType is to define:
>
> * how to process the raw data from the external source and how to persist it into the database (to be done)
> * how to read the data from the database, including aggregation, filters etc.
>
> In general, this is the rough idea I have in mind:

@turbaszek Thanks for working on this. My initial thought is that this looks a lot more granular than what we have in place now, which is good, as we have sometimes missed being able to get to the right level of granularity. For GitHub the datatypes seem fairly organised and are pretty much already allocated. How do you see this working, for example, for our project mailing lists? Would each list then be a datasource and the conversations the datatype?

turbaszek · 2021-05-31T17:53:44Z

> how do you see this working, for example, for our project mailing lists? Would each list then be a datasource and the conversations the datatype?

That's a very good question @sharanf!

I would lean towards what you've written. A data source represents not only an "external service" entity but an "account/organization within an external service". So, in the case of mailing lists, each Apache project would require configuring its own data source.

For example:

data_sources:
   - name: asf_mails_kibble
     class: kibble.data_sources.pony_mail.PonyMailDataSource
     config:
       project_name: kibble
       enabled_data_types:
         - mails
   - name: asf_mails_kafka
     class: kibble.data_sources.pony_mail.PonyMailDataSource
     config:
       project_name: kafka
       enabled_data_types:
         - mails
   - name: asf_mails_pulsar
     class: kibble.data_sources.pony_mail.PonyMailDataSource
     config:
       project_name: pulsar
       enabled_data_types:
         - mails

While there's a bit of duplication in the configuration, it allows more granularity. In the case of the ASF the config will be big and repetitive, but for smaller Kibble deployments it would be smaller, and more configuration may be an advantage.

Additionally, this granularity is useful for sources that need authorisation. In such cases we may want to store the credentials in a different way or use different auth methods.
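For large deployments, the repeated entries in the example above could be generated rather than written by hand. The sketch below assumes a hypothetical helper, `pony_mail_sources`, that builds the same structure as the YAML in this comment; the class path and key names are copied from that example.

```python
from typing import Any, Dict, List

PONY_MAIL_CLASS = "kibble.data_sources.pony_mail.PonyMailDataSource"


def pony_mail_sources(projects: List[str]) -> List[Dict[str, Any]]:
    """Build one mailing-list data source entry per project."""
    return [
        {
            "name": f"asf_mails_{project}",
            "class": PONY_MAIL_CLASS,
            "config": {"project_name": project, "enabled_data_types": ["mails"]},
        }
        for project in projects
    ]


data_sources = pony_mail_sources(["kibble", "kafka", "pulsar"])
```

Each entry stays an independent data source, so per-project credentials or auth methods can still be set on individual entries after generation.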

turbaszek · 2021-06-11T19:17:42Z

docs/architecture.rst

+    under the License.
+
+Apache Kibble Overview
+======================


@sharanf @Humbedooh @kaxil @michalslowikowski00 I added some docs/notes about the current status and how things are. Let me know what you think.

sharanf · 2021-06-13

Thanks Tomek! This is good work. I have added some minor text changes.

Minor text changes

turbaszek · 2021-06-18T14:42:07Z

@kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to get this moving.

sharanf · 2021-06-20T18:49:51Z

> @kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to get this moving.

@turbaszek From my side I am happy to keep things moving so have no problems with starting to merge your new code once the tests are fixed.

amondirvin · 2024-07-05T04:52:45Z

.github/workflows/ci.yaml

      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
-      - run: pip install '.[devel]'
+        with:


Add Github issues and PRs scanner

0bc940d

turbaszek added the type:feature label May 22, 2021

turbaszek requested review from Humbedooh, kaxil, sharanf and michalslowikowski00 May 22, 2021 11:49

github-actions bot added the area:scanners Scanners related issues label May 22, 2021

turbaszek added this to the Kibble 0.0.1 milestone May 22, 2021

kaxil reviewed May 22, 2021

kibble/configuration/yaml_config.py (outdated, resolved)

turbaszek added 2 commits May 22, 2021 14:06

fixup! Add Github issues and PRs scanner

5fdca3b

fixup! fixup! Add Github issues and PRs scanner

6b2228d

turbaszek force-pushed the add-github-scanners branch from 579d529 to 70e7105 Compare May 22, 2021 12:40

fixup! fixup! fixup! Add Github issues and PRs scanner

75346a9

turbaszek force-pushed the add-github-scanners branch from 70e7105 to 75346a9 Compare May 22, 2021 12:58

github-actions bot added the area:dev Development related issues label May 22, 2021

michalslowikowski00 reviewed May 24, 2021

fixup! fixup! fixup! fixup! Add Github issues and PRs scanner

454233e

github-actions bot removed the area:scanners Scanners related issues label May 28, 2021

Add Elasticsearch

b0359a0

github-actions bot added area:docs Documentation related issues aread:database labels Jun 11, 2021

turbaszek requested a review from kaxil June 11, 2021 19:12

turbaszek commented Jun 11, 2021

Update architecture.rst

0fb16e6

Minor text changes

amondirvin approved these changes Jul 5, 2024

turbaszek May 29, 2021 (edited)

Humbedooh Jun 1, 2021

sharanf Jun 13, 2021
