Commit c86311e

Merge pull request github#6502 from github/dataflow-tutorial
Add data flow debugging guide to CodeQL docs
2 parents d0563c8 + 8c37e90 commit c86311e

3 files changed

+120
-0
lines changed

docs/codeql/codeql-for-visual-studio-code/analyzing-your-projects.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -79,6 +79,8 @@ You can see all quick queries that you've run in the current session in the Quer
7979

8080
Once you're happy with your quick query, you should save it in a QL pack so you can access it later. For more information, see ":ref:`About QL packs <about-ql-packs>`."
8181

82+
.. _running-a-specific-part-of-a-query-or-library:
83+
8284
Running a specific part of a query or library
8385
----------------------------------------------
8486

docs/codeql/writing-codeql-queries/codeql-queries.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ CodeQL queries are used in code scanning analyses to find problems in source cod
1616
about-data-flow-analysis
1717
creating-path-queries
1818
troubleshooting-query-performance
19+
debugging-data-flow-queries-using-partial-flow
1920

2021
- :doc:`About CodeQL queries <about-codeql-queries>`: CodeQL queries are used to analyze code for issues related to security, correctness, maintainability, and readability.
2122
- :doc:`Metadata for CodeQL queries <metadata-for-codeql-queries>`: Metadata tells users important information about CodeQL queries. You must include the correct query metadata in a query to be able to view query results in source code.
@@ -25,3 +26,4 @@ CodeQL queries are used in code scanning analyses to find problems in source cod
2526
- :doc:`About data flow analysis <about-data-flow-analysis>`: Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.
2627
- :doc:`Creating path queries <creating-path-queries>`: You can create path queries to visualize the flow of information through a codebase.
2728
- :doc:`Troubleshooting query performance <troubleshooting-query-performance>`: Improve the performance of your CodeQL queries by following a few simple guidelines.
29+
- :doc:`Debugging data-flow queries using partial flow <debugging-data-flow-queries-using-partial-flow>`: If a data-flow query doesn't produce the results you expect to see, you can use partial flow to debug the problem.
Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,116 @@
1+
.. _debugging-data-flow-queries-using-partial-flow:
2+
3+
Debugging data-flow queries using partial flow
4+
==============================================
5+
6+
If a data-flow query doesn't produce the results you expect to see, you can use partial flow to debug the problem.
7+
8+
In CodeQL, you can use :ref:`data flow analysis <about-data-flow-analysis>` to compute the possible values that a variable can hold at various points in a program.
9+
A typical data-flow query looks like this:
10+
11+
.. code-block:: ql

    class MyConfig extends TaintTracking::Configuration {
      MyConfig() { this = "MyConfig" }

      override predicate isSource(DataFlow::Node node) { node instanceof MySource }

      override predicate isSink(DataFlow::Node node) { node instanceof MySink }
    }

    from MyConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
    where config.hasFlowPath(source, sink)
    select sink.getNode(), source, sink, "Sink is reached from $@.", source.getNode(), "here"
26+
27+
The same query can be slightly simplified by rewriting it without :ref:`path explanations <creating-path-queries>`:
28+
29+
.. code-block:: ql

    from MyConfig config, DataFlow::Node source, DataFlow::Node sink
    where config.hasFlow(source, sink)
    select sink, "Sink is reached from $@.", source, "here"
34+
35+
If a data-flow query that you have written doesn't produce the results you expect it to, there may be a problem with your query.
36+
You can try to debug the potential problem by following the steps described below.
37+
38+
Checking sources and sinks
39+
--------------------------
40+
41+
Initially, you should make sure that the source and sink definitions contain what you expect. If either the source or sink is empty then there can never be any data flow. The easiest way to check this is using quick evaluation in CodeQL for VS Code. Select the text ``node instanceof MySource``, right-click, and choose "CodeQL: Quick Evaluation". This will evaluate the highlighted text, which in this case means the set of sources. For more information, see :ref:`Analyzing your projects <running-a-specific-part-of-a-query-or-library>` in the CodeQL for VS Code help.
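
As an alternative to selecting text inside the configuration, you can add small helper predicates and run quick evaluation on them. The following is a minimal sketch that assumes the ``MyConfig`` class shown earlier; the predicate names ``debugSources`` and ``debugSinks`` are hypothetical and exist only for debugging.

.. code-block:: ql

    // Hypothetical helper: quick-evaluate to list every node that MyConfig treats as a source.
    predicate debugSources(DataFlow::Node n) { any(MyConfig c).isSource(n) }

    // Hypothetical helper: quick-evaluate to list every node that MyConfig treats as a sink.
    predicate debugSinks(DataFlow::Node n) { any(MyConfig c).isSink(n) }

If either predicate has no results, fix that definition before looking for missing flow steps.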
42+
43+
If both source and sink definitions look good then we will need to look for missing flow steps.
44+
45+
``fieldFlowBranchLimit``
46+
------------------------
47+
48+
Data-flow configurations contain a parameter called ``fieldFlowBranchLimit``. If the value is set too high, you may experience performance degradation, but if it's too low, you may miss results. When debugging data flow, try setting ``fieldFlowBranchLimit`` to a high value and see whether your query generates more results. For example, try adding the following to your configuration:
49+
50+
.. code-block:: ql

    override int fieldFlowBranchLimit() { result = 5000 }
53+
54+
If there are still no results and performance is still usable, it is best to leave this set to a high value while you continue debugging.
55+
56+
Partial flow
57+
------------
58+
59+
A naive next step could be to change the sink definition to ``any()``. This would mean that we would get a lot of flow to all the places that are reachable from the sources. While this approach may work in some cases, you might find that it produces so many results that it's very hard to explore the findings. It can also dramatically affect query performance. More importantly, you might not even see all the partial flow paths. This is because the data-flow library tries very hard to prune impossible paths and, since field stores and reads must be evenly matched along a path, we will never see paths going through a store that fail to reach a corresponding read. This can make it hard to see where flow actually stops.
60+
61+
To avoid these problems, a data-flow ``Configuration`` comes with a mechanism for exploring partial flow that tries to deal with these caveats. This is the ``Configuration.hasPartialFlow`` predicate:
62+
63+
.. code-block:: ql

    /**
     * Holds if there is a partial data flow path from `source` to `node`. The
     * approximate distance between `node` and the closest source is `dist` and
     * is restricted to be less than or equal to `explorationLimit()`. This
     * predicate completely disregards sink definitions.
     *
     * This predicate is intended for dataflow exploration and debugging and may
     * perform poorly if the number of sources is too big and/or the exploration
     * limit is set too high without using barriers.
     *
     * This predicate is disabled (has no results) by default. Override
     * `explorationLimit()` with a suitable number to enable this predicate.
     *
     * To use this in a `path-problem` query, import the module `PartialPathGraph`.
     */
    final predicate hasPartialFlow(PartialPathNode source, PartialPathNode node, int dist) {
81+
82+
As noted in the documentation for ``hasPartialFlow`` (for example, in the `CodeQL for Java documentation <https://codeql.github.com/codeql-standard-libraries/java/semmle/code/java/dataflow/internal/DataFlowImpl2.qll/predicate.DataFlowImpl2$Configuration$hasPartialFlow.3.html>`__), you must first enable this predicate by adding an override of ``explorationLimit``. For example:
83+
84+
.. code-block:: ql

    override int explorationLimit() { result = 5 }
87+
88+
This defines the exploration radius within which ``hasPartialFlow`` returns results.
89+
90+
It is also useful to focus on a single source at a time as the starting point for the flow exploration. This is most easily done by adding a temporary restriction in the ``isSource`` predicate.
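
One way to add such a temporary restriction is to pin the source to a known location. This is a sketch only: it assumes the CodeQL libraries for Java, and the file name ``Example.java`` and line number ``42`` are hypothetical placeholders for the source you want to investigate.

.. code-block:: ql

    override predicate isSource(DataFlow::Node node) {
      node instanceof MySource and
      // Temporary debugging restriction: keep only the source at one location.
      // "Example.java" and 42 are placeholders, not values from a real project.
      node.getLocation().getFile().getBaseName() = "Example.java" and
      node.getLocation().getStartLine() = 42
    }

Remember to remove the extra conjuncts once you have finished debugging.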
91+
92+
To do quick evaluations of partial flow it is often easiest to add a predicate to the query that is solely intended for quick evaluation (right-click the predicate name and choose "CodeQL: Quick Evaluation"). A good starting point is something like:
93+
94+
.. code-block:: ql

    predicate adhocPartialFlow(Callable c, PartialPathNode n, Node src, int dist) {
      exists(MyConfig conf, PartialPathNode source |
        conf.hasPartialFlow(source, n, dist) and
        src = source.getNode() and
        c = n.getNode().getEnclosingCallable()
      )
    }
103+
104+
If you are focusing on a single source then the ``src`` column is superfluous. You may of course also add other columns of interest based on ``n``, but including the enclosing callable and the distance to the source at the very least is generally recommended, as they can be useful columns to sort on to better inspect the results.
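
The documentation comment for ``hasPartialFlow`` also mentions using it in a ``path-problem`` query. The following sketch shows roughly what that could look like. It assumes the CodeQL libraries for Java, that ``PartialPathGraph`` and ``PartialPathNode`` are reachable through the ``DataFlow`` module, and that ``MyConfig`` (defined as above, with an ``explorationLimit`` override) is in scope. The alert message is illustrative.

.. code-block:: ql

    /**
     * @kind path-problem
     */

    import java
    import semmle.code.java.dataflow.TaintTracking
    // Assumption: the partial-flow graph is exposed as DataFlow::PartialPathGraph.
    import DataFlow::PartialPathGraph

    from MyConfig conf, DataFlow::PartialPathNode source, DataFlow::PartialPathNode node
    where conf.hasPartialFlow(source, node, _)
    select node.getNode(), source, node, "Partial flow from $@.", source.getNode(), "this source"

Viewing the results as paths can make it easier to see the last node that flow reaches before it stops.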
105+
106+
107+
If you see a large number of partial flow results, you can focus them in a couple of ways:
108+
109+
- If flow travels a long distance following an expected path, that can result in a lot of uninteresting flow being included in the exploration radius. To reduce the amount of uninteresting flow, you can replace the source definition with a suitable ``node`` that appears along the path and restart the partial flow exploration from that point.
110+
- Creative use of barriers and sanitizers can be used to cut off flow paths that are uninteresting. This also reduces the number of partial flow results to explore while debugging.
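
As an illustration of the second approach, the sketch below adds a temporary sanitizer to a ``TaintTracking::Configuration`` such as ``MyConfig``. It assumes the CodeQL libraries for Java, and the package name ``org.example.logging`` is a hypothetical stand-in for code you are not interested in.

.. code-block:: ql

    override predicate isSanitizer(DataFlow::Node node) {
      // Temporary debugging barrier: stop exploring flow inside a package that is
      // not of interest. The package name is a placeholder.
      node.getEnclosingCallable().getDeclaringType().getPackage().getName() =
        "org.example.logging"
    }

For a configuration that extends ``DataFlow::Configuration`` rather than ``TaintTracking::Configuration``, the corresponding predicate to override is ``isBarrier``.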
111+
112+
Further reading
113+
----------------
114+
115+
- :ref:`About data flow analysis <about-data-flow-analysis>`
116+
- :ref:`Creating path queries <creating-path-queries>`
