Commit 528ac7b

committed
revamp-triplet-gen
1 parent 9e17794 commit 528ac7b

5 files changed, +627 -89 lines changed

llvm/docs/CommandGuide/llvm-ir2vec.rst

Lines changed: 66 additions & 13 deletions
@@ -13,17 +13,21 @@ DESCRIPTION
 
 :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
 generates IR2Vec embeddings for LLVM IR and supports triplet generation
-for vocabulary training. It provides two main operation modes:
+for vocabulary training. It provides three main operation modes:
 
-1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
    training from LLVM IR.
 
-2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
+   training.
+
+3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
    at different granularity levels (instruction, basic block, or function).
 
 The tool is designed to facilitate machine learning applications that work with
 LLVM IR by converting the IR into numerical representations that can be used by
-ML models.
+ML models. The triplet mode generates numeric IDs directly instead of string
+triplets, streamlining the training data preparation workflow.
 
 .. note::
 
@@ -34,18 +38,46 @@ ML models.
 OPERATION MODES
 ---------------
 
+Triplet Generation and Entity Mapping Modes are used for preparing
+vocabulary and training data for knowledge graph embeddings. The Embedding Mode
+is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
+
+The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
+by modeling the relationships between opcodes, types, and operands as a knowledge
+graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
+triplets and entity mappings in the standard format used for knowledge graph
+embedding training (see
+<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
+for details).
+
 Triplet Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
-consisting of opcodes, types, and operands. These triplets can be used to train
-vocabularies for embedding generation.
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
+triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
+are generated in train2id format. The tool outputs numeric IDs directly using
+the ir2vec::Vocabulary mapping infrastructure, eliminating the need for
+string-to-ID preprocessing.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
+
+Entity Mapping Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
+IR2Vec in entity2id format. This mode outputs all supported entities (opcodes,
+types, and operands) with their corresponding numeric IDs, and is not specific to
+any particular LLVM IR file.
 
 Usage:
 
 .. code-block:: bash
 
-   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+   llvm-ir2vec --mode=entities -o entity2id.txt
 
 Embedding Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -67,6 +99,7 @@ OPTIONS
 Specify the operation mode. Valid values are:
 
 * ``triplets`` - Generate triplets for vocabulary training
+* ``entities`` - Generate entity mappings for vocabulary training
 * ``embeddings`` - Generate embeddings using trained vocabulary (default)
 
 .. option:: --level=<level>
@@ -115,7 +148,7 @@ OPTIONS
 
 ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
 ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
-mode. These options are ignored in triplet mode.
+mode. These options are ignored in triplet and entity modes.
 
 INPUT FILE FORMAT
 -----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
 Triplet Mode Output
 ~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, the output consists of lines containing space-separated triplets:
+In triplet mode, the output consists of numeric triplets in train2id format with
+a metadata header. The format is:
+
+.. code-block:: text
+
+   MAX_RELATION=<max_relation_id>
+   <head_entity_id> <tail_entity_id> <relation_id>
+   <head_entity_id> <tail_entity_id> <relation_id>
+   ...
+
+Each line after the metadata header represents one instruction relationship,
+with numeric IDs for the head entity, tail entity, and relation, in that order.
+The metadata header (MAX_RELATION) gives the largest relation ID in the output,
+for use in post-processing and training setup.
+
+Entity Mode Output
+~~~~~~~~~~~~~~~~~~
+
+In entity mode, the output consists of an entity mapping in the format:
 
 .. code-block:: text
 
-   <opcode> <type> <operand1> <operand2> ...
+   <total_entities>
+   <entity_string> <numeric_id>
+   <entity_string> <numeric_id>
+   ...
 
-Each line represents the information of one instruction, with the opcode, type,
-and operands.
+The first line contains the total number of entities, followed by one entity
+mapping per line with tab-separated entity string and numeric ID.
 
 Embedding Mode Output
 ~~~~~~~~~~~~~~~~~~~~~
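
The train2id layout described in the hunk above can be consumed with a few lines of Python. The following is a minimal sketch, not part of the commit: the function name `parse_train2id` and the sample text are assumptions for illustration. It treats any `KEY=VALUE` line as metadata, so it does not depend on the exact spelling of the header.

```python
from typing import Dict, List, Tuple

# (head_entity_id, tail_entity_id, relation_id), in the order the tool prints them.
Triplet = Tuple[int, int, int]


def parse_train2id(text: str) -> Tuple[Dict[str, int], List[Triplet]]:
    """Split triplet-mode output into header metadata and numeric triplets.

    Hypothetical helper: any line containing '=' is treated as a
    KEY=VALUE metadata header; every other non-empty line must hold
    exactly three whitespace-separated integers.
    """
    metadata: Dict[str, int] = {}
    triplets: List[Triplet] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            metadata[key] = int(value)
            continue
        head, tail, relation = (int(field) for field in line.split())
        triplets.append((head, tail, relation))
    return metadata, triplets


# Illustrative input shaped like the documented format.
sample = "MAX_RELATION=3\n12 79 0\n12 91 2\n12 91 3\n12 0 1\n"
metadata, triplets = parse_train2id(sample)
print(metadata)
print(triplets)
```

Keeping the header in a separate dictionary lets a training script read the relation bound without scanning every triplet.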
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+; RUN: llvm-ir2vec --mode=entities | FileCheck %s
+
+CHECK: 92
+CHECK-NEXT: Ret 0
+CHECK-NEXT: Br 1
+CHECK-NEXT: Switch 2
+CHECK-NEXT: IndirectBr 3
+CHECK-NEXT: Invoke 4
+CHECK-NEXT: Resume 5
+CHECK-NEXT: Unreachable 6
+CHECK-NEXT: CleanupRet 7
+CHECK-NEXT: CatchRet 8
+CHECK-NEXT: CatchSwitch 9
+CHECK-NEXT: CallBr 10
+CHECK-NEXT: FNeg 11
+CHECK-NEXT: Add 12
+CHECK-NEXT: FAdd 13
+CHECK-NEXT: Sub 14
+CHECK-NEXT: FSub 15
+CHECK-NEXT: Mul 16
+CHECK-NEXT: FMul 17
+CHECK-NEXT: UDiv 18
+CHECK-NEXT: SDiv 19
+CHECK-NEXT: FDiv 20
+CHECK-NEXT: URem 21
+CHECK-NEXT: SRem 22
+CHECK-NEXT: FRem 23
+CHECK-NEXT: Shl 24
+CHECK-NEXT: LShr 25
+CHECK-NEXT: AShr 26
+CHECK-NEXT: And 27
+CHECK-NEXT: Or 28
+CHECK-NEXT: Xor 29
+CHECK-NEXT: Alloca 30
+CHECK-NEXT: Load 31
+CHECK-NEXT: Store 32
+CHECK-NEXT: GetElementPtr 33
+CHECK-NEXT: Fence 34
+CHECK-NEXT: AtomicCmpXchg 35
+CHECK-NEXT: AtomicRMW 36
+CHECK-NEXT: Trunc 37
+CHECK-NEXT: ZExt 38
+CHECK-NEXT: SExt 39
+CHECK-NEXT: FPToUI 40
+CHECK-NEXT: FPToSI 41
+CHECK-NEXT: UIToFP 42
+CHECK-NEXT: SIToFP 43
+CHECK-NEXT: FPTrunc 44
+CHECK-NEXT: FPExt 45
+CHECK-NEXT: PtrToInt 46
+CHECK-NEXT: IntToPtr 47
+CHECK-NEXT: BitCast 48
+CHECK-NEXT: AddrSpaceCast 49
+CHECK-NEXT: CleanupPad 50
+CHECK-NEXT: CatchPad 51
+CHECK-NEXT: ICmp 52
+CHECK-NEXT: FCmp 53
+CHECK-NEXT: PHI 54
+CHECK-NEXT: Call 55
+CHECK-NEXT: Select 56
+CHECK-NEXT: UserOp1 57
+CHECK-NEXT: UserOp2 58
+CHECK-NEXT: VAArg 59
+CHECK-NEXT: ExtractElement 60
+CHECK-NEXT: InsertElement 61
+CHECK-NEXT: ShuffleVector 62
+CHECK-NEXT: ExtractValue 63
+CHECK-NEXT: InsertValue 64
+CHECK-NEXT: LandingPad 65
+CHECK-NEXT: Freeze 66
+CHECK-NEXT: FloatTy 67
+CHECK-NEXT: FloatTy 68
+CHECK-NEXT: FloatTy 69
+CHECK-NEXT: FloatTy 70
+CHECK-NEXT: FloatTy 71
+CHECK-NEXT: FloatTy 72
+CHECK-NEXT: FloatTy 73
+CHECK-NEXT: VoidTy 74
+CHECK-NEXT: LabelTy 75
+CHECK-NEXT: MetadataTy 76
+CHECK-NEXT: UnknownTy 77
+CHECK-NEXT: TokenTy 78
+CHECK-NEXT: IntegerTy 79
+CHECK-NEXT: FunctionTy 80
+CHECK-NEXT: PointerTy 81
+CHECK-NEXT: StructTy 82
+CHECK-NEXT: ArrayTy 83
+CHECK-NEXT: VectorTy 84
+CHECK-NEXT: VectorTy 85
+CHECK-NEXT: PointerTy 86
+CHECK-NEXT: UnknownTy 87
+CHECK-NEXT: Function 88
+CHECK-NEXT: Pointer 89
+CHECK-NEXT: Constant 90
+CHECK-NEXT: Variable 91
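
The expected output above repeats some entity strings under different IDs (for example, `FloatTy` covers IDs 67 through 73), so a name-to-ID dictionary would silently drop entries. A minimal sketch of a loader that indexes by ID instead is shown below. The function name `parse_entity2id` and the shortened sample are assumptions for illustration and are not part of the commit.

```python
from typing import List


def parse_entity2id(text: str) -> List[str]:
    """Return entity names indexed by numeric ID.

    Hypothetical helper for the entity2id layout: the first line is the
    entity count, and each following line is '<entity_string> <numeric_id>'
    separated by a tab (any whitespace is accepted here).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    total = int(lines[0])
    if len(lines) - 1 != total:
        raise ValueError(f"header says {total} entities, found {len(lines) - 1}")
    names: List[str] = [""] * total
    for line in lines[1:]:
        # Split on the last run of whitespace so the ID is always the final field.
        name, raw_id = line.rsplit(maxsplit=1)
        names[int(raw_id)] = name
    return names


# Made-up four-entry sample; real output lists 92 entities.
sample = "4\nRet\t0\nBr\t1\nFloatTy\t2\nFloatTy\t3\n"
names = parse_entity2id(sample)
print(names)
```

Indexing by ID keeps both `FloatTy` rows, which a `dict` keyed on the entity string would collapse into one.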

llvm/test/tools/llvm-ir2vec/triplets.ll

Lines changed: 39 additions & 12 deletions
@@ -24,15 +24,42 @@ entry:
 ret i32 %result
 }
 
-; TRIPLETS: Add IntegerTy Variable Variable
-; TRIPLETS-NEXT: Ret VoidTy Variable
-; TRIPLETS-NEXT: Mul IntegerTy Variable Variable
-; TRIPLETS-NEXT: Ret VoidTy Variable
-; TRIPLETS-NEXT: Alloca PointerTy Constant
-; TRIPLETS-NEXT: Alloca PointerTy Constant
-; TRIPLETS-NEXT: Store VoidTy Variable Pointer
-; TRIPLETS-NEXT: Store VoidTy Variable Pointer
-; TRIPLETS-NEXT: Load IntegerTy Pointer
-; TRIPLETS-NEXT: Load IntegerTy Pointer
-; TRIPLETS-NEXT: Add IntegerTy Variable Variable
-; TRIPLETS-NEXT: Ret VoidTy Variable
+; TRIPLETS: MAX_RELATION=3
+; TRIPLETS-NEXT: 12 79 0
+; TRIPLETS-NEXT: 12 91 2
+; TRIPLETS-NEXT: 12 91 3
+; TRIPLETS-NEXT: 12 0 1
+; TRIPLETS-NEXT: 0 74 0
+; TRIPLETS-NEXT: 0 91 2
+; TRIPLETS-NEXT: 16 79 0
+; TRIPLETS-NEXT: 16 91 2
+; TRIPLETS-NEXT: 16 91 3
+; TRIPLETS-NEXT: 16 0 1
+; TRIPLETS-NEXT: 0 74 0
+; TRIPLETS-NEXT: 0 91 2
+; TRIPLETS-NEXT: 30 81 0
+; TRIPLETS-NEXT: 30 90 2
+; TRIPLETS-NEXT: 30 30 1
+; TRIPLETS-NEXT: 30 81 0
+; TRIPLETS-NEXT: 30 90 2
+; TRIPLETS-NEXT: 30 32 1
+; TRIPLETS-NEXT: 32 74 0
+; TRIPLETS-NEXT: 32 91 2
+; TRIPLETS-NEXT: 32 89 3
+; TRIPLETS-NEXT: 32 32 1
+; TRIPLETS-NEXT: 32 74 0
+; TRIPLETS-NEXT: 32 91 2
+; TRIPLETS-NEXT: 32 89 3
+; TRIPLETS-NEXT: 32 31 1
+; TRIPLETS-NEXT: 31 79 0
+; TRIPLETS-NEXT: 31 89 2
+; TRIPLETS-NEXT: 31 31 1
+; TRIPLETS-NEXT: 31 79 0
+; TRIPLETS-NEXT: 31 89 2
+; TRIPLETS-NEXT: 31 12 1
+; TRIPLETS-NEXT: 12 79 0
+; TRIPLETS-NEXT: 12 91 2
+; TRIPLETS-NEXT: 12 91 3
+; TRIPLETS-NEXT: 12 0 1
+; TRIPLETS-NEXT: 0 74 0
+; TRIPLETS-NEXT: 0 91 2
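
Comparing the removed string triplets with the new numeric lines suggests how to read the relation column: relation 0 links an opcode to its type, relation 1 links it to the next instruction's opcode, and relations 2 and up link it to its operands in order. The sketch below decodes a few of the expected lines under that reading. The reading is inferred from this test rather than stated by the commit, and `ENTITY_NAMES`, `relation_name`, and `describe` are hypothetical names. The ID-to-name table is a hand-copied subset of the entity list checked by the entity-mode test.

```python
# Subset of the entity IDs that appear in the expected triplets.
ENTITY_NAMES = {
    0: "Ret", 12: "Add", 16: "Mul", 30: "Alloca", 31: "Load", 32: "Store",
    74: "VoidTy", 79: "IntegerTy", 81: "PointerTy",
    89: "Pointer", 90: "Constant", 91: "Variable",
}


def relation_name(relation_id: int) -> str:
    """Map a relation ID to a label (assumed meaning, inferred from the test)."""
    if relation_id == 0:
        return "type"
    if relation_id == 1:
        return "next"
    return f"arg{relation_id - 2}"


def describe(head: int, tail: int, relation: int) -> str:
    """Render one '<head> <tail> <relation>' line in readable form."""
    return f"{ENTITY_NAMES[head]} --{relation_name(relation)}--> {ENTITY_NAMES[tail]}"


# The first four expected lines after the header.
for triplet in [(12, 79, 0), (12, 91, 2), (12, 91, 3), (12, 0, 1)]:
    print(describe(*triplet))
```

Under this reading, the four lines restate the removed `Add IntegerTy Variable Variable` line and add a link from `Add` to the `Ret` that follows it.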
