@@ -13,17 +13,21 @@ DESCRIPTION

  :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
  generates IR2Vec embeddings for LLVM IR and supports triplet generation
- for vocabulary training. It provides two main operation modes:
+ for vocabulary training. It provides three main operation modes:

- 1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+ 1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
     training from LLVM IR.

- 2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+ 2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
+    training.
+
+ 3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
     at different granularity levels (instruction, basic block, or function).

  The tool is designed to facilitate machine learning applications that work with
  LLVM IR by converting the IR into numerical representations that can be used by
- ML models.
+ ML models. The triplet mode generates numeric IDs directly instead of string
+ triplets, streamlining the training data preparation workflow.

  .. note::

@@ -34,18 +38,46 @@ ML models.
  OPERATION MODES
  ---------------

+ Triplet Generation and Entity Mapping Modes are used for preparing
+ vocabulary and training data for knowledge graph embeddings. The Embedding Mode
+ is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
+
+ The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
+ by modeling the relationships between opcodes, types, and operands as a knowledge
+ graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
+ triplets and entity mappings in the standard format used for knowledge graph
+ embedding training (see
+ <https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
+ for details).
+
  Triplet Generation Mode
  ~~~~~~~~~~~~~~~~~~~~~~~

- In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
- consisting of opcodes, types, and operands. These triplets can be used to train
- vocabularies for embedding generation.
+ In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
+ triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
+ are generated in train2id format. The tool outputs numeric IDs directly using
+ the ir2vec::Vocabulary mapping infrastructure, eliminating the need for
+ string-to-ID preprocessing.
+
+ Usage:
+
+ .. code-block:: bash
+
+    llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
+
+ Entity Mapping Generation Mode
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
+ IR2Vec in entity2id format. This mode outputs all supported entities (opcodes,
+ types, and operands) with their corresponding numeric IDs, and is not specific for
+ an LLVM IR file.

  Usage:

  .. code-block:: bash

-    llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+    llvm-ir2vec --mode=entities -o entity2id.txt

  Embedding Generation Mode
  ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -67,6 +99,7 @@ OPTIONS
   Specify the operation mode. Valid values are:

   * ``triplets`` - Generate triplets for vocabulary training
+  * ``entities`` - Generate entity mappings for vocabulary training
   * ``embeddings`` - Generate embeddings using trained vocabulary (default)

  .. option:: --level=<level>
@@ -115,7 +148,7 @@ OPTIONS

  ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
  ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
- mode. These options are ignored in triplet mode.
+ mode. These options are ignored in triplet and entity modes.

  INPUT FILE FORMAT
  -----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
  Triplet Mode Output
  ~~~~~~~~~~~~~~~~~~~

- In triplet mode, the output consists of lines containing space-separated triplets:
+ In triplet mode, the output consists of numeric triplets in train2id format with
+ metadata headers. The format includes:
+
+ .. code-block:: text
+
+    MAX_RELATIONS=<max_relations_count>
+    <head_entity_id> <tail_entity_id> <relation_id>
+    <head_entity_id> <tail_entity_id> <relation_id>
+    ...
+
+ Each line after the metadata header represents one instruction relationship,
+ with numeric IDs for head entity, relation, and tail entity. The metadata
+ header (MAX_RELATIONS) provides counts for post-processing and training setup.
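
As a rough illustration of consuming the triplet output, the sketch below parses text in the layout described above into a relation count and a list of (head, tail, relation) tuples. The function name ``parse_train2id`` and the sample IDs are hypothetical and not part of llvm-ir2vec; the parser assumes a single ``MAX_RELATIONS=`` header line followed by whitespace-separated integer rows.

```python
from typing import List, Tuple

Triplet = Tuple[int, int, int]  # (head_entity_id, tail_entity_id, relation_id)


def parse_train2id(text: str) -> Tuple[int, List[Triplet]]:
    """Parse triplet-mode output (hypothetical helper, not shipped with LLVM).

    Assumes the first non-empty line is MAX_RELATIONS=<n> and every later
    non-empty line holds three whitespace-separated integers.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("MAX_RELATIONS="):
        raise ValueError("expected a MAX_RELATIONS=<n> header on the first line")
    max_relations = int(lines[0].split("=", 1)[1])

    triplets: List[Triplet] = []
    for ln in lines[1:]:
        # Column order follows the layout above: head, tail, relation.
        head, tail, relation = (int(tok) for tok in ln.split())
        triplets.append((head, tail, relation))
    return max_relations, triplets


# Made-up sample data, only to show the shape of the result.
sample = "MAX_RELATIONS=3\n0 5 0\n5 9 1\n"
max_rel, rows = parse_train2id(sample)
```

Splitting on arbitrary whitespace keeps the sketch indifferent to whether the columns are separated by spaces or tabs.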
+
+ Entity Mode Output
+ ~~~~~~~~~~~~~~~~~~
+
+ In entity mode, the output consists of entity mapping in the format:

  .. code-block:: text

-    <opcode> <type> <operand1> <operand2> ...
+    <total_entities>
+    <entity_string> <numeric_id>
+    <entity_string> <numeric_id>
+    ...

- Each line represents the information of one instruction, with the opcode, type,
- and operands.
+ The first line contains the total number of entities, followed by one entity
+ mapping per line with tab-separated entity string and numeric ID.
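
A minimal sketch of reading an entity mapping in this layout is shown below, assuming a count on the first line and one tab-separated ``<entity_string>`` / ``<numeric_id>`` pair on each following line. The helper name ``parse_entity2id`` and the entity names in the sample are hypothetical placeholders, not actual llvm-ir2vec output.

```python
from typing import Dict


def parse_entity2id(text: str) -> Dict[str, int]:
    """Parse entity-mode output (hypothetical helper, not shipped with LLVM).

    Assumes the first non-empty line is the entity count and each later
    non-empty line is '<entity_string>\t<numeric_id>'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty entity mapping")
    total = int(lines[0].strip())

    mapping: Dict[str, int] = {}
    for ln in lines[1:]:
        # Split on the last tab so the numeric ID is always the final field.
        name, ident = ln.rsplit("\t", 1)
        mapping[name.strip()] = int(ident)

    # Cross-check the declared count against the rows actually read.
    if len(mapping) != total:
        raise ValueError(f"header declares {total} entities, found {len(mapping)}")
    return mapping


# Placeholder entity names, chosen only for illustration.
sample = "3\nAdd\t0\nFloatTy\t1\nFunction\t2\n"
entities = parse_entity2id(sample)
```

Checking the declared count against the parsed rows is a cheap way to catch a truncated file before it is used for training.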

  Embedding Mode Output
  ~~~~~~~~~~~~~~~~~~~~~