codefuse-ai
diff --git a/.github/workflows/check_gdl_workflow.yml b/.github/workflows/check_gdl_workflow.yml
Lines changed: 55 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 76 additions & 114 deletions
diff --git a/README_cn.md b/README_cn.md
Lines changed: 114 additions & 0 deletions
diff --git a/assets/wechat_qrcode.JPG b/assets/wechat_qrcode.JPG
-61.6 KB
diff --git a/doc/1_abstract.en.md b/doc/1_abstract.en.md
Lines changed: 17 additions & 0 deletions
diff --git a/doc/1_abstract.md b/doc/1_abstract.md
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1,55 @@
+name: GDL script file checker
+on:
+  push:
+    branches-ignore:
+      - 'none'  
+  pull_request:
+    branches: [ "main" ]
+
+jobs:
+  checking-job:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out the repository to the runner
+        uses: actions/checkout@v4
+
+      - name: Install locales
+        run: |
+          sudo apt-get update && sudo apt-get install -y locales
+          sudo locale-gen zh_CN.UTF-8
+        env:
+          LANG: zh_CN.UTF-8
+          LANGUAGE: zh_CN:zh:en_US:en
+
+      - name: Download and extract the latest sparrow-cli release
+        run: |
+          ASSET_NAME="sparrow-cli.*.linux.tar.gz" # This pattern should match only the tar.gz file
+          mkdir -p $HOME/sparrow-cli
+      
+          # Use GitHub API to get the latest release information
+          RELEASE_INFO=$(curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/codefuse-ai/CodeFuse-Query/releases/latest")
+      
+          # Extract the download URL of the asset whose name matches the pattern
+          # `test` treats the name as a regular expression; the content_type filter then keeps only the gzip archive (e.g. not a checksum file)
+          ASSET_URL=$(echo "$RELEASE_INFO" | jq --arg asset_name "$ASSET_NAME" -r '.assets[] | select(.name | test($asset_name)) | select(.content_type == "application/x-gzip").browser_download_url')
+      
+          # Check if the asset URL is empty or not
+          if [ -z "$ASSET_URL" ]; then
+            echo "Error: Asset URL is empty."
+            exit 1
+          fi
+      
+          # Download the asset and extract it directly into the target directory
+          echo "Downloading $ASSET_URL and extracting it to $HOME/sparrow-cli"
+          curl -fL --retry 5 "$ASSET_URL" | tar -xz -C "$HOME/sparrow-cli"
+        env:
+          # The GitHub token is needed for API requests to avoid rate limits
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Set execute permissions for script
+        run: chmod +x ./tool/aci/check_gdl.sh
+        
+      - name: Run GDL script checking
+        run: ./tool/aci/check_gdl.sh .
+        env:
+          LC_ALL: zh_CN.UTF-8
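The asset selection in the `Download and extract the latest sparrow-cli release` step is packed into a one-line `jq` filter. The following is a minimal Python sketch of the same selection logic, not part of the PR: it keeps assets whose name matches the pattern as a regular expression and whose `content_type` is `application/x-gzip`. The function name `select_asset_urls`, the sample payload, and the `example.invalid` URLs are hypothetical; only the field names (`assets`, `name`, `content_type`, `browser_download_url`) and the pattern come from the workflow.

```python
import re


def select_asset_urls(release, name_pattern, content_type="application/x-gzip"):
    """Approximate the workflow's jq filter.

    Keeps assets whose name matches `name_pattern` (unanchored regex search,
    like jq's `test`) and whose content type equals `content_type`.
    Python's `re` is used here; jq uses a different regex engine, so this is
    only an approximation for simple patterns.
    """
    return [
        asset["browser_download_url"]
        for asset in release.get("assets", [])
        if re.search(name_pattern, asset["name"])
        and asset.get("content_type") == content_type
    ]


# Hypothetical payload shaped like a "latest release" API response.
sample_release = {
    "assets": [
        {
            "name": "sparrow-cli.2.0.0.linux.tar.gz",
            "content_type": "application/x-gzip",
            "browser_download_url": "https://example.invalid/sparrow-cli.2.0.0.linux.tar.gz",
        },
        {
            "name": "sparrow-cli.2.0.0.linux.tar.gz.sha256",
            "content_type": "application/octet-stream",
            "browser_download_url": "https://example.invalid/sparrow-cli.2.0.0.linux.tar.gz.sha256",
        },
    ]
}

urls = select_asset_urls(sample_release, r"sparrow-cli.*.linux.tar.gz")
print(urls)
```

Because the pattern is unanchored, the name check alone also accepts the hypothetical checksum file in this sample; the content-type check is what narrows the result to the archive.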
@@ -0,0 +1,114 @@
+# CodeFuse-Query: A Code Big Data Analysis Platform
+<p align="center">
+  <img src="https://github.com/codefuse-ai/MFTCoder/blob/main/assets/github-codefuse-logo-update.jpg" width="50%" />
+</p>
+
+<div align="center">
+  <p>
+    <a href="https://github.com/codefuse-ai/CodeFuse-Query">
+        <img alt="stars" src="https://img.shields.io/github/stars/codefuse-ai/CodeFuse-Query?style=social" />
+    </a>
+    <a href="https://github.com/codefuse-ai/CodeFuse-Query">
+        <img alt="forks" src="https://img.shields.io/github/forks/codefuse-ai/CodeFuse-Query?style=social" />
+    </a>
+    <a href="https://github.com/codefuse-ai/CodeFuse-Query/LICENCE">
+      <img alt="License: Apache 2.0" src="https://badgen.net/badge/license/apache2.0/blue" />
+    </a>
+    <a href="https://github.com/codefuse-ai/CodeFuse-Query/issues">
+      <img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/CodeFuse-Query" />
+    </a>
+    <a href="https://github.com/codefuse-ai/CodeFuse-Query/releases">
+      <img alt="Release Download" src="https://img.shields.io/github/downloads/codefuse-ai/CodeFuse-Query/total" />
+    </a>
+    <a href="https://marketplace.visualstudio.com/items?itemName=CodeFuse-Query.codefuse-query-extension">
+      <img alt="VSCode Plugin" src="https://img.shields.io/visual-studio-marketplace/i/CodeFuse-Query.codefuse-query-extension?style=social&logo=visualstudiocode&logoColor=%23007ACC" />
+    </a>
+    <a href="https://github.com/codefuse-ai/CodeFuse-Query/actions/workflows/check_gdl_workflow.yml">
+      <img alt="GDL script file checker" src="https://github.com/codefuse-ai/CodeFuse-Query/actions/workflows/check_gdl_workflow.yml/badge.svg" />
+    </a>
+  </p>
+</div>
+<div align="center">
+
+  **Chinese** | [English](README.md)
+</div>
+
+
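+## What is CodeFuse-Query?
+CodeFuse-Query is a powerful static code analysis platform suited to analyzing large, complex codebases. Its data-centric approach and high scalability give it a unique advantage in modern software development environments. As static code analysis technology continues to evolve, CodeFuse-Query is expected to play an increasingly important role in this field.
+
+Overall, the CodeFuse-Query code data platform consists of three parts: the code data model, the code query DSL, and platform productization services.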
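+### Code Data Model: COREF
+We define COREF, a model for turning code into standardized data, and require that all code can be converted into this model by language-specific extractors. 
+COREF mainly contains the following kinds of information:
+**COREF** = AST (abstract syntax tree) + ASG (abstract semantic graph) + CFG (control flow graph) + PDG (program dependency graph) + Call Graph + Class Hierarchy + Documentation (documentation and comment information)
+Note: Because these kinds of information differ in how hard they are to compute, COREF does not include all of them for every language. The basic information consists mainly of AST, ASG, Call Graph, Class Hierarchy, and Documentation; the rest (CFG and PDG) is still under construction and will be supported gradually.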
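+### Code Query DSL
+On top of the generated COREF code data, CodeFuse-Query uses a custom DSL, **Gödel**, to run queries and thereby carry out code analysis tasks.
+Gödel is a logic-based reasoning language whose underlying implementation builds on the logic language Datalog: by describing "facts" and "rules", a program can keep deriving new facts. Gödel is also a declarative language; compared with imperative programming, declarative programming focuses on describing what is wanted and leaves how to achieve it to the computation engine.
+Since the code has already been converted into relational data (COREF data is stored as relational tables), you may wonder why not use SQL or an SDK directly rather than learn a new DSL. The reason is that Datalog computation is monotonic and guaranteed to terminate. Put simply, Datalog trades expressive power for higher performance, and Gödel inherits this characteristic. 
+
+- Compared with an SDK, Gödel's main advantage is that it is easy to learn and use. Because it is declarative, users do not need to care about the intermediate computation; they only need to describe the requirement clearly, much as in SQL.
+- Compared with SQL, Gödel's main advantages are stronger descriptive power and faster computation, for example when expressing recursive algorithms and multi-table joins, both of which are difficult in SQL.
+### Platformization and Productization
+CodeFuse-Query includes the **Sparrow CLI** and the CodeFuse-Query online service, the **Query Center**. The Sparrow CLI bundles all components and dependencies, such as extractors, data models, and the compiler, so users can generate and query code data entirely locally with it (for usage, see Section 3, Installation, Configuration, and Running). Users who need online queries can experiment with the Query Center.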
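+## Supported Programming Languages 
+To date, CodeFuse-Query supports data analysis for 11 programming languages. Support for 5 of them (Java, JavaScript, TypeScript, XML, Go) is mature; support for the remaining 6 (Objective-C, C++, Python3, Swift, SQL, Properties) is in beta, with room for further improvement. The table below gives the details:
+
+| Language | Status | Number of COREF model nodes |
+| --- | --- | --- |
+| Java | Mature | 162 |
+| XML | Mature | 12 |
+| TS/JS | Mature | 392 |
+| Go | Mature | 40 |
+| OC/C++ | beta | 53/397 |
+| Python3 | beta | 93 |
+| Swift | beta | 248 |
+| SQL | beta | 750 |
+| Properties | beta | 9 |
+
+Note: The maturity status above is judged by the kinds of information COREF contains and by real-world adoption. Except for OC/C++, all languages support complete AST and Documentation information. Taking Java as an example, COREF for Java also supports ASG, Call Graph, Class Hierarchy, and partial CFG information.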
+
+## Quick Start
+[Installation, Configuration, and Running](./doc/3_install_and_run.md)
+
+## Documentation
+- [Abstract](./doc/1_abstract.md)
+- [Introduction](./doc/2_introduction.md)
+- [Use Cases](./doc/user_case.md)
+- [Installation, Configuration, and Running](./doc/3_install_and_run.md)
+- [Introduction to the Gödel Query Language](./doc/4_godelscript_language.md)
+- [VSCode Development Plugin](./doc/5_toolchain.md)
+- [COREF API](https://codefuse-ai.github.io/CodeFuse-Query/godel-api/coref_library_reference.html)
+
+## Tutorial
+- [Online Tutorial](./tutorial/README.md)
+
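+## Directory Structure
+- `cli`: Entry point of the command-line tool; provides a unified command-line interface and calls the other modules to do the actual work
+- `language`: The extraction core (extractor) and data modeling (lib) for each language. For how much is open-sourced, see the section "Notes on the Scope of Open Source"
+- `doc`: Reference documentation
+- `examples`: Gödel query language examples
+- `tutorial`: Tutorial for using the CodeFuse-Query development container
+
+## Notes on the Scope of Open Source
+At present, an executable program **cannot** be built from source, because this open-source release does not include all modules; the missing modules will be open-sourced gradually over the following year. Even so, to ensure a complete experience, we provide the **complete installation package** for download; see the Release page.
+The table below shows how much has been opened for each language:
+
+| Language | Data modeling open-sourced | Extraction core open-sourced | Maturity |
+| --- | --- | --- | --- |
+| Python | Y | Y | RELEASE |
+| Java | Y | N | RELEASE |
+| JavaScript | Y | N | RELEASE |
+| Go | Y | N | RELEASE |
+| XML | Y | N | RELEASE |
+| Cfamily | N | N | BETA |
+| SQL | N | N | BETA |
+| Swift | N | N | BETA |
+| Properties | N | N | BETA |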
+
+## Contact Us
+![WeChat user group image](./assets/wechat_qrcode.JPG)
+## Star History
+
+[![Star History Chart](https://api.star-history.com/svg?repos=codefuse-ai/CodeFuse-Query&type=Date)](https://star-history.com/#codefuse-ai/CodeFuse-Query&Date)
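The query DSL section of README_cn.md in the hunk above says that a Datalog-style program keeps deriving new facts from "facts" and "rules", and that recursion is one place where this helps. As an illustration only, here is a minimal Python sketch of naive fixpoint evaluation for one recursive rule, transitive call reachability. It is not Gödel syntax and not how the CodeFuse-Query engine is implemented; the function name `transitive_calls` and the sample call edges are hypothetical.

```python
def transitive_calls(direct_calls):
    """Derive all (caller, callee) pairs reachable through chains of calls.

    Facts: the pairs in `direct_calls`.
    Rule:  reaches(a, d) holds if reaches(a, b) and calls(b, d).
    The rule is applied repeatedly until no new fact appears (a fixpoint).
    """
    reachable = set(direct_calls)
    while True:
        derived = {
            (a, d)
            for (a, b) in reachable
            for (c, d) in direct_calls
            if b == c
        }
        new_facts = derived - reachable
        if not new_facts:
            return reachable
        reachable |= new_facts


# Hypothetical call edges standing in for rows of a call-graph table.
calls = {("main", "parse"), ("parse", "lex"), ("lex", "read")}
closure = transitive_calls(calls)
print(sorted(closure))
```

The loop only ever adds pairs, and the set of possible pairs is finite, which is the intuition behind the monotonicity and termination properties the README mentions.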
@@ -0,0 +1,17 @@
+# Abstract
+With the increasing popularity of large-scale software development, the demand for scalable and adaptable static code analysis techniques is growing. Traditional static analysis tools such as Clang Static Analyzer (CSA) or PMD have shown good results in checking programming rules or style issues. However, these tools are often designed for specific objectives and are unable to meet the diverse and changing needs of modern software development environments. These needs may relate to Quality of Service (QoS), various programming languages, different algorithmic requirements, and various performance needs. For example, a security team might need sophisticated algorithms like context-sensitive taint analysis to review smaller codebases, while project managers might need a lighter algorithm, such as one that calculates cyclomatic complexity, to measure developer productivity on larger codebases.  
+
+These diversified needs, coupled with the common computational resource constraints in large organizations, pose a significant challenge. Traditional tools, with their problem-specific computation methods, often fail to scale in such environments. This is why we introduced CodeQuery, a centralized data platform specifically designed for large-scale static analysis.  
+In implementing CodeQuery, we treat source code and analysis results as data, and the execution process as big data processing, a significant departure from traditional tool-centric approaches. We leverage common systems in large organizations, such as data warehouses, data computation facilities like MaxCompute and Hive, OSS object storage, and flexible computing resources like Kubernetes, allowing CodeQuery to integrate seamlessly into these systems. This approach makes CodeQuery highly maintainable and scalable, capable of supporting diverse needs and effectively addressing changing demands. Furthermore, CodeQuery's open architecture encourages interoperability between various internal systems, facilitating seamless interaction and data exchange. This level of integration and interaction not only increases the degree of automation within the organization but also improves efficiency and reduces the likelihood of manual errors. By breaking down information silos and fostering a more interconnected, automated environment, CodeQuery significantly enhances the overall productivity and efficiency of the software development process.  
+Moreover, CodeQuery's data-centric approach offers unique advantages when addressing domain-specific challenges in static source code analysis. For instance, source code is typically a highly structured and interconnected dataset, with strong informational and relational ties to other code and configuration files. By treating code as data, CodeQuery can handle these relationships directly, making it especially suitable for large organizations where codebases evolve continuously but incrementally: most code changes only slightly from day to day and is otherwise stable. CodeQuery also supports use cases like code-data based Business Intelligence (BI), generating reports and dashboards to aid in monitoring and decision-making processes. Additionally, CodeQuery plays an important role in analyzing training data for large language models (LLMs), providing deep insights to enhance the overall effectiveness of these models.  
+
+In the current field of static analysis, CodeQuery introduces a new paradigm. It not only meets the needs of analyzing large, complex codebases but is also adaptable to the ever-changing and diversified scenarios of static analysis. CodeQuery's data-centric approach gives it a unique advantage in dealing with code analysis issues in big data environments. Designed to address static analysis problems in large-scale software development settings, it views both source code and analysis results as data, allowing it to integrate flexibly into various systems within large organizations. This approach not only enables efficient handling of large codebases but can also accommodate various complex analysis needs, thereby making static analysis work more effective and accurate.  
+
+The characteristics and advantages of CodeQuery can be summarized as follows:
+
+- **Highly Scalable**: CodeQuery can handle large codebases and adapt to different analysis needs. This high level of scalability makes CodeQuery particularly valuable in large organizations.  
+- **Data-Centric**: By treating source code and analysis results as data, CodeQuery's data-centric approach gives it a distinct edge in addressing code analysis problems in big data environments.  
+- **Highly Integrated**: CodeQuery can integrate seamlessly into various systems within large organizations, including data warehouses, data computation facilities, object storage, and flexible computing resources. This high level of integration makes the use of CodeQuery in large organizations more convenient and efficient.  
+- **Supports Diverse Needs**: CodeQuery can process large codebases and accommodate various complex analysis needs, including QoS analysis, cross-language analysis, algorithmic needs, and performance requirements.  
+
+CodeQuery is a powerful static code analysis platform, suitable for large-scale, complex codebase analysis scenarios. Its data-centric approach and high scalability give it a unique advantage in the modern software development environment. As static code analysis technology continues to evolve, CodeQuery is expected to play an increasingly important role in this field.  
@@ -1,3 +1,4 @@
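+# Abstract
 With the increasing popularity of large-scale software development, the demand for scalable and adaptable static code analysis techniques is growing. Traditional static analysis tools such as Clang Static Analyzer (CSA) or PMD have shown good results in checking programming rules or style issues. However, these tools are often designed for specific objectives and are unable to meet the diverse and changing needs of modern software development environments. These needs may relate to Quality of Service (QoS), various programming languages, different algorithmic requirements, and various performance needs. For example, a security team might need sophisticated algorithms like context-sensitive taint analysis to review smaller codebases, while project managers might need a lighter algorithm, such as one that calculates cyclomatic complexity, to measure developer productivity on larger codebases.
 
 These diversified needs, coupled with the common computational resource constraints in large organizations, pose a significant challenge. Traditional tools, with their problem-specific computation methods, often fail to scale in such environments. This is why we introduced CodeQuery, a centralized data platform specifically designed for large-scale static analysis.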