<a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
computation. Here is the list of
[all methods implemented](https://pydvl.org/devel/getting-started/methods/).

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
performance of some model trained on it. pyDVL focuses on model-dependent
methods.

<div align="center" style="text-align:center;">
  <img
    width="60%"
    align="center"
    style="display: block; margin-left: auto; margin-right: auto;"
    src="https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg"
  />
</div>

pyDVL also implements methods to compute the influence
of training samples over individual test points.

<div align="center" style="text-align:center;">
  <img
    width="60%"
    align="center"
    style="display: block; margin-left: auto; margin-right: auto;"
    src="https://pydvl.org/devel/examples/img/influence_functions_example.png"
  />
</div>

# Installation

```shell
$ pip install pyDVL[influence]
```

For more instructions and information refer to [Installing
pyDVL](https://pydvl.org/stable/getting-started/#installation) in the
documentation.

# Usage

Please read [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in the
documentation for more instructions. We provide several examples for data
valuation and for influence functions in our [Example
Gallery](https://pydvl.org/stable/examples/).

## Influence Functions

1. Import the necessary packages (the exact ones depend on your specific use
   case).
2. Create PyTorch data loaders for your train and test splits.
3. Instantiate your neural network model and define your loss function.
4. Instantiate an `InfluenceFunctionModel` and fit it to the training data.
5. For small input data, you can call the `influences()` method on the fitted
   instance. The result is a tensor of shape `(training samples, test samples)`
   that contains at index `(i, j)` the influence of training sample `i` on
   test sample `j`.
6. For larger datasets, wrap the model into a "calculator" and call methods on
   it. This splits the computation into smaller chunks and allows for lazy
   evaluation and out-of-core computation.

The higher the absolute value of the influence of a training sample on a test
sample, the more influential it is for the chosen test sample, model and data
loaders. The sign of the influence determines whether it is useful (positive)
or harmful (negative).

> **Note** pyDVL currently only supports PyTorch for Influence Functions. We
> plan to add support for Jax next.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch import DirectInfluence
from pydvl.influence.torch.util import (
    NestedTorchCatAggregator,
    TorchNumpyConverter,
)

input_dim = (5, 5, 5)
output_dim = 3
train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
model = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
loss = nn.MSELoss()

infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_data_loader)

# For small datasets, instantiate the full influence matrix:
influences = infl_model.influences(test_x, test_y, train_x, train_y)

# For larger datasets, use the influence calculators:
infl_calc = SequentialInfluenceCalculator(infl_model)

# Lazy object providing arrays batch-wise in a sequential manner
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

# Trigger computation and pull results to memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

# Trigger computation and write results batch-wise to disk
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
```
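
Once the influence matrix is in memory, you can, for example, rank training
points by their average influence over the test set. The snippet below is a
minimal sketch, not part of the pyDVL API, assuming the `influences` tensor
from the example above with the shape `(training samples, test samples)`
described in step 5:

```python
# Rank training samples by average influence over all test samples.
# `torch` and `influences` come from the example above.
avg_influence = influences.mean(dim=1)  # one score per training sample
ranking = torch.argsort(avg_influence)  # ascending: most harmful first
print(f"Most harmful training sample: {ranking[0].item()}")
print(f"Most useful training sample: {ranking[-1].item()}")
```
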
## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact ones will depend on your specific
   use case).
2. Create a `Dataset` object with your train and test splits.
3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible
   predictor), and wrap it in a `Utility` object together with the data and a
   scoring function.
4. Use one of the methods defined in the library to compute the values. In the
   example below, we will use *Permutation Montecarlo Shapley*, an approximate
   method for computing Data Shapley values. The result is a variable of type
   `ValuationResult` that contains the indices and their values as well as
   other attributes.
5. Convert the valuation result to a dataframe, and analyze and visualize the
   values (see the sketch after the example below).

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer. Reciprocally, low-value points could be mislabelled,
or out-of-distribution, and dropping them can improve the model's performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (
    MaxUpdates,
    RelativeTruncation,
    permutation_montecarlo_shapley,
)

data = Dataset.from_sklearn(
    load_breast_cancer(),
    train_size=10,
    stratify_by_target=True,
    random_state=16,
)
model = LogisticRegression()
u = Utility(
    model,
    data,
    Scorer("accuracy", default=0.0),
)
values = permutation_montecarlo_shapley(
    u,
    truncation=RelativeTruncation(u, 0.05),
    done=MaxUpdates(1000),
    seed=16,
    progress=True,
)
df = values.to_dataframe(column="data_value")
```
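
Step 5 can be as simple as a bar plot of the sorted values. The following is a
minimal sketch using matplotlib (not a pyDVL dependency, assumed installed
separately); the column name `data_value` matches the `to_dataframe` call
above:

```python
import matplotlib.pyplot as plt

# Bar plot of the data values computed above, sorted from
# least to most valuable training sample.
sorted_df = df.sort_values(by="data_value")
plt.bar(range(len(sorted_df)), sorted_df["data_value"])
plt.xlabel("Training samples, sorted by value")
plt.ylabel("Data value")
plt.show()
```
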
# Contributing