Add Web3Insights-data data processing, Neo4j graph database processing, and documentation support #3
Open
justinzjj wants to merge 39 commits into web3insight-ai:main from justinzjj:Justin_dev_graph
+9,185
−1
Added a .gitignore file to exclude common IDE and system files. Introduced a new documentation file describing the current relational database schema, a proposed graph database model, and analysis considerations for developer and repository relationships.
- Created `repo_rank_2.sql` to analyze ecosystem contributions and developer activity over the past year, excluding bot users.
- Added `simple.sql` to include graph traversal queries for active developers.
- Introduced `getDataBase.sql` to retrieve database schema information for key tables.
- Added `graphDesign.json` to define graph structure for repositories and actors.
- Updated documentation in Chinese to reflect edge definitions and project analysis methods.
- Modified `web3_graph.json` to enhance graph configuration and include event attributes.
…PostgreSQL to Neo4j
- Created `0-install_neo4j.sh` for Neo4j installation with customizable parameters.
- Added `1-export_gharchive.sh` to export data from PostgreSQL, including actors, repos, and events.
- Implemented `2-data_process.py` to clean and format exported data for Neo4j compatibility.
- Developed `3-import_neo4j.sh` to import cleaned data into Neo4j, ensuring proper configurations and checks.
- Included a README file to guide users through the installation and data import process.
- Added the Neo4j RPM package for installation.
- Updated Jupyter notebook for data processing with detailed markdown explanations.
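As an illustration of what "clean and format exported data for Neo4j compatibility" can involve, here is a minimal Python sketch. It is not the contents of `2-data_process.py`, which this page does not show. The input field names (`actor_id`, `actor_login`, `repo_id`, `repo_name`, `type`), the output file names, and the `eventType` property are assumptions chosen for the example. The header syntax (`:ID(...)`, `:LABEL`, `:START_ID(...)`, `:END_ID(...)`, `:TYPE`) follows the CSV header convention used by the `neo4j-admin` bulk importer.

```python
import csv
import io


def _to_csv(header, rows):
    """Serialize a header and rows to CSV text with Unix line endings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def build_import_csvs(events):
    """Turn exported event rows into node and relationship CSV text.

    `events` is an iterable of dicts; the key names are hypothetical
    and stand in for whatever the PostgreSQL export actually produces.
    """
    actors, repos, rels = {}, {}, []
    for ev in events:
        actor_id, repo_id = ev.get("actor_id"), ev.get("repo_id")
        # Drop rows that cannot form a valid edge.
        if not actor_id or not repo_id:
            continue
        # Dict keys deduplicate nodes; the last seen name wins.
        actors[actor_id] = (ev.get("actor_login") or "").strip()
        repos[repo_id] = (ev.get("repo_name") or "").strip()
        rels.append((actor_id, repo_id, ev.get("type") or "", "INTERACTS_WITH"))
    return {
        "actors.csv": _to_csv(
            ["actorId:ID(Actor)", "login", ":LABEL"],
            [(i, login, "Actor") for i, login in actors.items()],
        ),
        "repos.csv": _to_csv(
            ["repoId:ID(Repo)", "name", ":LABEL"],
            [(i, name, "Repo") for i, name in repos.items()],
        ),
        "interacts_with.csv": _to_csv(
            [":START_ID(Actor)", ":END_ID(Repo)", "eventType", ":TYPE"],
            rels,
        ),
    }


if __name__ == "__main__":
    sample = [
        {"actor_id": "1", "actor_login": "alice", "repo_id": "10",
         "repo_name": "org/repo", "type": "PushEvent"},
        {"actor_id": "1", "actor_login": "alice", "repo_id": "11",
         "repo_name": "org/other", "type": "WatchEvent"},
        {"actor_id": "2", "actor_login": "bob", "repo_id": None,
         "repo_name": None, "type": "PushEvent"},  # skipped: no repo
    ]
    for name, text in build_import_csvs(sample).items():
        print(name)
        print(text)
```

Using separate ID spaces (`Actor`, `Repo`) in the headers lets actor and repository IDs overlap numerically without colliding during import.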
…re normalization
- Created `rank_with_weight.sql` to calculate ecosystem rankings based on weighted scores derived from total participants, active participants, and new developers.
- Introduced `rank_with_zscore.sql` for ecosystem ranking using z-score normalization, incorporating metrics for active and non-active contributors, and applying weights to various metrics for scoring.
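The idea behind a weighted z-score ranking can be sketched in a few lines of Python. This is only an illustration of the technique named in the commit message, not a translation of `rank_with_zscore.sql`. The metric names, the weights, and the sample figures are all made up for the example.

```python
from statistics import mean, pstdev

# Hypothetical weights; the real weights live in the SQL files and are
# not shown on this page.
WEIGHTS = {"total": 0.2, "active": 0.5, "new": 0.3}


def zscores(values):
    """Standardize values to mean 0 and (population) std dev 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: the metric carries no ranking signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_ecosystems(metrics, weights=WEIGHTS):
    """Return (ecosystem, score) pairs sorted by weighted z-score sum.

    `metrics` maps an ecosystem name to a dict of raw metric values.
    """
    names = list(metrics)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        column = [metrics[name][metric] for name in names]
        for name, z in zip(names, zscores(column)):
            scores[name] += weight * z
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "eco_a": {"total": 300, "active": 120, "new": 40},
        "eco_b": {"total": 100, "active": 30, "new": 10},
        "eco_c": {"total": 200, "active": 60, "new": 10},
    }
    for name, score in rank_ecosystems(sample):
        print(f"{name}: {score:.3f}")
```

Standardizing each metric before weighting keeps a metric with a large raw range (for example total participants) from dominating one with a small range (for example new developers), which is the usual motivation for preferring z-scores over raw weighted sums.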
Additional commits have been pushed with some changes: the parsing logic was optimized and the documentation was improved.
Thank you for your contribution!
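Summary
This PR introduces the complete data processing pipeline and the graph database support module for the Web3Insights-data project. It covers the full flow for GHArchive data: download, decompression, cleaning, structured extraction, import into PostgreSQL, and export to Neo4j. It also adds several documentation files in a unified style to improve reproducibility and extensibility.
Changes in this PR
I. Data processing module (Data_process/)
1. `gharchive_downloader.sh`: downloads GHArchive `.json.gz` data by day, month, or year
2. `decompress.sh`: decompresses `.json.gz` files in parallel, automatically skips files that are already decompressed, and supports logging
3. `data_clean.py`: formats the raw JSON, removes invalid fields, and keeps the core interaction information
4. `data_extract.py`: extracts structured CSV data (actors / repos / events)
5. `data_import_pgsql.sh`: imports the CSV data into PostgreSQL (supports automatic table creation)
- `Data_schema/event_payload.json`: reference structure for event fields
- Options: `-f` single file, `-d` specific day, `-m` month, `-y` year, plus parallel job configuration
II. Graph database module (for_neo4j/)
- `0-install_neo4j.sh`: installation script for Neo4j Community Edition
- `1-export_gharchive.sh`: exports the structured data in PostgreSQL to a format Neo4j can read
- `2-data_process.py`: generates nodes (Actor/Repo) and edges (INTERACTS_WITH)
- `3-import_neo4j.sh`: calls `neo4j-admin import` to run a full import of the graph data
- `cypher/`: common Cypher queries for graph analysis
  - `PageRank.cypher`: ranks repositories by importance using PageRank
  - `repo_community_detection.cypher`: repository-based community detection
  - `Dijkstra.cypher`: example shortest-path query
  - `new_dege.cypher`: builds new repository-to-repository edges based on shared developers
III. Documentation (doc/)
- `doc/数据处理.md`: detailed description of the five ETL steps
- `doc/图设计.md`: graph model design, explanation of node and edge attributes, and analysis directions
- `README.md`: describes the directory structure, usage, and dependencies
Testing and verification
- Tested with the GHArchive sample file for 2025-09-01; all steps ran successfully
📌 Next steps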