@justinzjj

Summary

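This PR introduces the complete data processing pipeline and the graph database support module for the Web3Insights-data project. It covers the full flow for GHArchive data: download, decompression, cleaning, structured extraction, import into PostgreSQL, and export to Neo4j. It also adds several documentation files in a unified style to improve reproducibility and extensibility.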

Changes in this PR

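1. Data processing module (Data_process/)

  • 1.gharchive_downloader.sh: downloads GHArchive .json.gz data by day, month, or year
  • 2.decompress.sh: decompresses .json.gz files in parallel, skips files that are already decompressed, and supports logging
  • 3.data_clean.py: normalizes the raw JSON, drops invalid fields, and keeps the core interaction information
  • 4.data_extract.py: extracts structured CSV data (actors / repos / events)
  • 5.data_import_pgsql.sh: imports the CSV data into PostgreSQL (supports automatic table creation)
  • Data_schema/event_payload.json: reference structure for event fields
  • All steps support -f (single file), -d (a given day), -m (month), -y (year), and a configurable number of parallel jobs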
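
For readers unfamiliar with the GHArchive format, the clean and extract steps can be pictured roughly as follows. This is a minimal sketch, not the code in 3.data_clean.py or 4.data_extract.py: it assumes each input line is one JSON event with the usual GHArchive fields (`id`, `type`, `actor`, `repo`, `created_at`), and the function names and CSV column names are hypothetical.

```python
import csv
import io
import json


def extract_rows(lines):
    """Split GHArchive event lines into actor, repo, and event rows.

    Hypothetical helper: the real scripts may keep more fields.
    """
    actors, repos, events = {}, {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the file
        actor = event.get("actor") or {}
        repo = event.get("repo") or {}
        if "id" not in actor or "id" not in repo:
            continue  # an interaction needs both endpoints
        actors[actor["id"]] = actor.get("login", "")
        repos[repo["id"]] = repo.get("name", "")
        events.append(
            (event.get("id"), event.get("type"), actor["id"], repo["id"],
             event.get("created_at"))
        )
    return actors, repos, events


def to_csv(header, rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


sample = [
    '{"id": "1", "type": "PushEvent", "actor": {"id": 10, "login": "alice"}, '
    '"repo": {"id": 20, "name": "org/repo"}, '
    '"created_at": "2025-09-01T00:00:00Z", "payload": {}}',
    "not json",
]
actors, repos, events = extract_rows(sample)
print(to_csv(["actor_id", "login"], sorted(actors.items())))
```

Deduplicating actors and repos by `id` keeps the two lookup tables small, while every event stays as its own row.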

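2. Graph database module (for_neo4j/)

  • 0-install_neo4j.sh: installation script for Neo4j Community Edition
  • 1-export_gharchive.sh: exports the structured data in PostgreSQL to a format Neo4j can read
  • 2-data_process.py: generates nodes (Actor/Repo) and edges (INTERACTS_WITH)
  • 3-import_neo4j.sh: runs neo4j-admin import for a full graph data import
  • cypher/: common Cypher queries for graph analysis
    • PageRank.cypher: ranks repositories by importance using PageRank
    • repo_community_detection.cypher: repository-based community detection
    • Dijkstra.cypher: shortest-path query example
    • new_dege.cypher: builds new repo-to-repo edges from shared developers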
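
To illustrate the edge-generation step, here is a minimal sketch of turning (actor, repo) event pairs into an INTERACTS_WITH relationship file. It is not the code in 2-data_process.py. The header fields `:START_ID(...)`, `:END_ID(...)`, and `:TYPE` follow the neo4j-admin import CSV convention; the `weight` property (an event count per pair) and the function names are assumptions made for this example.

```python
import csv
import io
from collections import Counter


def build_interactions(events):
    """Collapse (actor_id, repo_id) pairs into one weighted edge per pair."""
    counts = Counter((actor_id, repo_id) for actor_id, repo_id in events)
    return [(a, r, n) for (a, r), n in sorted(counts.items())]


def edges_csv(edges):
    """Render edges as a relationship file for neo4j-admin import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # "weight" is a hypothetical property; the real export may differ.
    writer.writerow([":START_ID(Actor)", ":END_ID(Repo)", "weight:int", ":TYPE"])
    for actor_id, repo_id, weight in edges:
        writer.writerow([actor_id, repo_id, weight, "INTERACTS_WITH"])
    return buf.getvalue()


edges = build_interactions([(10, 20), (10, 20), (11, 20)])
print(edges_csv(edges))
```

Aggregating before import keeps the graph to one edge per actor-repo pair, which is usually easier to query than one edge per raw event.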

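3. Documentation (doc/)

  • doc/数据处理.md: detailed description of the five ETL steps
  • doc/图设计.md: graph model design, explanation of node and edge properties, and analysis directions
  • New README.md in the project root: describes the directory structure, usage, and dependencies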

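Testing and verification

  • Tested with GHArchive sample files from 2025-09-01; every step ran successfully
  • Verified that the extracted structured CSV files match the expected format
  • Verified that the PostgreSQL import works and that the table schema matches the data
  • Verified that the Neo4j import works as expected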
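
The CSV format check above could be automated along these lines. This is a sketch under stated assumptions, not the procedure used for this PR: the expected header and the rule that the first column is a non-empty id are hypothetical.

```python
import csv
import io


def validate_csv(text, expected_header):
    """Return a list of problems found in CSV text; empty means it passed."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != expected_header:
        return [f"header mismatch: {header!r}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected_header):
            problems.append(
                f"line {line_no}: expected {len(expected_header)} fields, "
                f"got {len(row)}"
            )
        elif not row[0]:
            problems.append(f"line {line_no}: empty id")
    return problems


# Hypothetical header for the actors file.
ACTOR_HEADER = ["actor_id", "login"]
good = validate_csv("actor_id,login\n10,alice\n", ACTOR_HEADER)
bad = validate_csv("actor_id,login\n10\n", ACTOR_HEADER)
print(good, bad)
```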

📌 Follow-up plans

  • Extend the Neo4j data mapping and Cypher analysis support
  • Optimize the scripts for large data volumes

pseudoyu and others added 30 commits May 3, 2025 11:00
Add existing SQL statements
Add revised SQL statements with different normalization approaches
Added a .gitignore file to exclude common IDE and system files. Introduced a new documentation file describing the current relational database schema, a proposed graph database model, and analysis considerations for developer and repository relationships.
- Created `repo_rank_2.sql` to analyze ecosystem contributions and developer activity over the past year, excluding bot users.
- Added `simple.sql` to include graph traversal queries for active developers.
- Introduced `getDataBase.sql` to retrieve database schema information for key tables.
- Added `graphDesign.json` to define graph structure for repositories and actors.
- Updated documentation in Chinese to reflect edge definitions and project analysis methods.
- Modified `web3_graph.json` to enhance graph configuration and include event attributes.
…PostgreSQL to Neo4j

- Created `0-install_neo4j.sh` for Neo4j installation with customizable parameters.
- Added `1-export_gharchive.sh` to export data from PostgreSQL, including actors, repos, and events.
- Implemented `2-data_process.py` to clean and format exported data for Neo4j compatibility.
- Developed `3-import_neo4j.sh` to import cleaned data into Neo4j, ensuring proper configurations and checks.
- Included a README file to guide users through the installation and data import process.
- Added the Neo4j RPM package for installation.
- Updated Jupyter notebook for data processing with detailed markdown explanations.
Update graph design JSON
Update data processing scripts; ignore concrete data files in the repository
…re normalization

- Created `rank_with_weight.sql` to calculate ecosystem rankings based on weighted scores derived from total participants, active participants, and new developers.
- Introduced `rank_with_zscore.sql` for ecosystem ranking using z-score normalization, incorporating metrics for active and non-active contributors, and applying weights to various metrics for scoring.
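
As a rough illustration of the z-score ranking idea in the last commit, the sketch below standardizes each metric across ecosystems and sums the weighted z-scores. It is not a translation of `rank_with_zscore.sql`: the metric names, weights, and sample numbers are made up for the example, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant metric carries no signal
    return [(v - mu) / sigma for v in values]


def rank(ecosystems, weights):
    """Order ecosystem names by the weighted sum of their metric z-scores."""
    names = list(ecosystems)
    scores = dict.fromkeys(names, 0.0)
    for metric, weight in weights.items():
        standardized = zscores([ecosystems[n][metric] for n in names])
        for name, z in zip(names, standardized):
            scores[name] += weight * z
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Hypothetical metrics and weights, for illustration only.
ecosystems = {
    "a": {"total": 100, "active": 10},
    "b": {"total": 50, "active": 30},
    "c": {"total": 10, "active": 5},
}
ranking = rank(ecosystems, {"total": 0.5, "active": 0.5})
print(ranking)
```

Standardizing first keeps a metric with a large raw range, such as total participants, from dominating the weighted score.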
@justinzjj
Author

Pushed some additional changes: optimized the parsing logic and improved the documentation.

@pseudoyu
Member

Thank you for your contribution!

2 participants