Commit 123d01f

Merge branch 'release/v0.9.1'
2 parents 786458c + f5bc6c9 commit 123d01f

11 files changed: +169 -180 lines changed

.bumpversion.cfg

+1 -1
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.9.0
+current_version = 0.9.1
 commit = False
 tag = False
 allow_dirty = False

CHANGELOG.md

+9 -1
@@ -1,6 +1,14 @@
 # Changelog
 
-## 0.9.0 🆕 New methods, better docs and bugfixes 📚🐞
+## Unreleased
+
+### Fixed
+
+- `FutureWarning` for `ParallelConfig` constantly raised without actually
+instantiating the object
+[PR #562](https://github.com/aai-institute/pyDVL/pull/562)
+
+## 0.9.0 - 🆕 New methods, better docs and bugfixes 📚🐞
 
 ### Added
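Note on the changelog entry above: the diff does not show how PR #562 changed `ParallelConfig`, so the following is only a minimal sketch of one common way to make a deprecation warning fire on construction rather than on import. It assumes the deprecated object is a dataclass; `LegacyConfig`, its fields and `count_future_warnings` are hypothetical names for illustration, not pyDVL's API.

```python
import warnings
from dataclasses import dataclass


@dataclass
class LegacyConfig:
    """Hypothetical stand-in for a deprecated configuration object."""

    backend: str = "joblib"
    n_jobs: int = 1

    def __post_init__(self) -> None:
        # Runs once per constructed instance, so importing or merely
        # referencing the class stays silent.
        # stacklevel=3 skips __post_init__ and the generated __init__,
        # attributing the warning to the code that built the instance.
        warnings.warn(
            "LegacyConfig is deprecated and will be removed in a future release",
            FutureWarning,
            stacklevel=3,
        )


def count_future_warnings(action) -> int:
    """Run `action` and return how many FutureWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        action()
    return sum(issubclass(w.category, FutureWarning) for w in caught)


# Referencing the class emits nothing; constructing it emits one warning.
silent = count_future_warnings(lambda: LegacyConfig)
noisy = count_future_warnings(lambda: LegacyConfig(n_jobs=2))
```

Placing the call inside `__post_init__` ties the warning to actual use of the object, which is the behaviour the changelog entry says was missing.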

CITATION.cff

+3 -3
@@ -27,6 +27,6 @@ keywords:
 - Banzhaf index
 license: LGPL-3.0
 commit: 0e929ae121820b0014bf245da1b21032186768cb
-version: v0.7.0
-doi: 10.5281/zenodo.8311583
-date-released: '2023-09-02'
+version: v0.9.0
+doi: 10.5281/zenodo.10966754
+date-released: '2024-04-12'

README.md

+109 -158
@@ -16,10 +16,8 @@
 <a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
 </p>
 
-**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
-
-Refer to the [Methods](https://pydvl.org/devel/getting-started/methods/)
-page of our documentation for a list of all implemented methods.
+**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
+computation. Here is the list of [all methods implemented](https://pydvl.org/devel/getting-started/methods/).
 
 **Data Valuation** for machine learning is the task of assigning a scalar
 to each element of a training set which reflects its contribution to the final
@@ -29,7 +27,7 @@ pyDVL focuses on model-dependent methods.
 
 <div align="center" style="text-align:center;">
 <img
-width="70%"
+width="60%"
 align="center"
 style="display: block; margin-left: auto; margin-right: auto;"
 src="https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg"
@@ -48,7 +46,7 @@ of training samples over individual test points.
 
 <div align="center" style="text-align:center;">
 <img
-width="70%"
+width="60%"
 align="center"
 style="display: block; margin-left: auto; margin-right: auto;"
 src="https://pydvl.org/devel/examples/img/influence_functions_example.png"
@@ -82,180 +80,133 @@ $ pip install pyDVL[influence]
 ```
 
 For more instructions and information refer to [Installing pyDVL
-](https://pydvl.org/stable/getting-started/#installation) in the
-documentation.
+](https://pydvl.org/stable/getting-started/#installation) in the documentation.
 
 # Usage
 
-In the following subsections, we will showcase the usage of pyDVL
-for Data Valuation and Influence Functions using simple examples.
-
-For more instructions and information refer to [Getting
-Started](https://pydvl.org/stable/getting-started/first-steps/) in
-the documentation.
-We provide several examples for data valuation
-(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
-and for influence functions
-(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
-with details on the algorithms and their applications.
+Please read [Getting
+Started](https://pydvl.org/stable/getting-started/first-steps/) in the
+documentation for more instructions. We provide several examples for data
+valuation and for influence functions in our [Example
+Gallery](https://pydvl.org/stable/examples/).
 
 ## Influence Functions
 
-For influence computation, follow these steps:
-
-1. Import the necessary packages (The exact packages depend on your specific use case).
-
-```python
-import torch
-from torch import nn
-from torch.utils.data import DataLoader, TensorDataset
-
-from pydvl.influence.torch import DirectInfluence
-from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
-from pydvl.influence import SequentialInfluenceCalculator
-```
-
+1. Import the necessary packages (the exact ones depend on your specific use case).
 2. Create PyTorch data loaders for your train and test splits.
-
-```python
-input_dim = (5, 5, 5)
-output_dim = 3
-train_x = torch.rand((10, *input_dim))
-train_y = torch.rand((10, output_dim))
-test_x = torch.rand((5, *input_dim))
-test_y = torch.rand((5, output_dim))
-
-train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
-test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
-```
-
-3. Instantiate your neural network model.
-
-```python
-nn_architecture = nn.Sequential(
-    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
-    nn.Flatten(),
-    nn.Linear(27, 3),
+3. Instantiate your neural network model and define your loss function.
+4. Instantiate an `InfluenceFunctionModel` and fit it to the training data
+5. For small input data, you can call the `influences()` method on the fitted
+instance. The result is a tensor of shape `(training samples, test samples)`
+that contains at index `(i, j`) the influence of training sample `i` on
+test sample `j`.
+6. For larger datasets, wrap the model into a "calculator" and call methods on
+it. This splits the computation into smaller chunks and allows for lazy
+evaluation and out-of-core computation.
+
+The higher the absolute value of the influence of a training sample
+on a test sample, the more influential it is for the chosen test sample, model
+and data loaders. The sign of the influence determines whether it is
+useful (positive) or harmful (negative).
+
+> **Note** pyDVL currently only support PyTorch for Influence Functions. We plan
+> to add support for Jax next.
+
+```python
+import torch
+from torch import nn
+from torch.utils.data import DataLoader, TensorDataset
+
+from pydvl.influence import SequentialInfluenceCalculator
+from pydvl.influence.torch import DirectInfluence
+from pydvl.influence.torch.util import (
+    NestedTorchCatAggregator,
+    TorchNumpyConverter,
 )
-```
-
-4. Define your loss:
-
-```python
-loss = nn.MSELoss()
-```
-
-5. Instantiate an `InfluenceFunctionModel` and fit it to the training data
 
-```python
-infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
-infl_model = infl_model.fit(train_data_loader)
-```
+input_dim = (5, 5, 5)
+output_dim = 3
+train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
+test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
+train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
+test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
+model = nn.Sequential(
+    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
+    nn.Flatten(),
+    nn.Linear(27, 3),
+)
+loss = nn.MSELoss()
 
-6. For small input data call influence method on the fitted instance.
-
-```python
-influences = infl_model.influences(test_x, test_y, train_x, train_y)
-```
-The result is a tensor of shape `(training samples x test samples)`
-that contains at index `(i, j`) the influence of training sample `i` on
-test sample `j`.
+infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
+infl_model = infl_model.fit(train_data_loader)
 
-7. For larger data, wrap the model into a
-calculator and call methods on the calculator.
-```python
-infl_calc = SequentialInfluenceCalculator(infl_model)
-
-# Lazy object providing arrays batch-wise in a sequential manner
-lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
+# For small datasets, instantiate the full influence matrix:
+influences = infl_model.influences(test_x, test_y, train_x, train_y)
 
-# Trigger computation and pull results to memory
-influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
+# For larger datasets, use the Influence calculators:
+infl_calc = SequentialInfluenceCalculator(infl_model)
 
-# Trigger computation and write results batch-wise to disk
-lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
-```
-
+# Lazy object providing arrays batch-wise in a sequential manner
+lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
 
-The higher the absolute value of the influence of a training sample
-on a test sample, the more influential it is for the chosen test sample, model
-and data loaders. The sign of the influence determines whether it is
-useful (positive) or harmful (negative).
+# Trigger computation and pull results to memory
+influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
 
-> **Note** pyDVL currently only support PyTorch for Influence Functions.
-> We are planning to add support for Jax and perhaps TensorFlow or even Keras.
+# Trigger computation and write results batch-wise to disk
+lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
+```
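Note on reading the influence matrix described in the README text above: the sketch below uses a nested list in place of the tensor so that it runs without PyTorch. The numbers and the helper `most_influential` are made up for illustration and are not part of pyDVL. Rows are training samples `i`, columns are test samples `j`, matching the `(training samples, test samples)` shape stated in the README.

```python
from typing import List, Tuple

# Hypothetical influence values: rows are training samples (i),
# columns are test samples (j).
influences: List[List[float]] = [
    [0.8, -0.1],
    [-1.5, 0.3],
    [0.2, 0.9],
]


def most_influential(matrix: List[List[float]], j: int) -> Tuple[int, float]:
    """Return (training index, influence) with the largest absolute
    influence on test sample j."""
    column = [(i, row[j]) for i, row in enumerate(matrix)]
    return max(column, key=lambda pair: abs(pair[1]))


for j in range(len(influences[0])):
    i, value = most_influential(influences, j)
    # Per the README: positive influence is useful, negative is harmful.
    effect = "useful" if value > 0 else "harmful"
    print(f"test sample {j}: training sample {i} ({value:+.1f}, {effect})")
```

Ranking by absolute value and then reading the sign separately mirrors the two statements in the README: magnitude says how influential a training sample is, the sign says in which direction.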
 
 ## Data Valuation
 
 The steps required to compute data values for your samples are:
 
-1. Import the necessary packages (The exact packages depend on your specific use case).
-
-```python
-import matplotlib.pyplot as plt
-from sklearn.datasets import load_breast_cancer
-from sklearn.linear_model import LogisticRegression
-from pydvl.utils import Dataset, Scorer, Utility
-from pydvl.value import (
-    compute_shapley_values,
-    ShapleyMode,
-    MaxUpdates,
-)
-```
-
+1. Import the necessary packages (the exact ones will depend on your specific
+use case).
 2. Create a `Dataset` object with your train and test splits.
-
-```python
-data = Dataset.from_sklearn(
-    load_breast_cancer(),
-    train_size=10,
-    stratify_by_target=True,
-    random_state=16,
-)
-```
-
 3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
-predictor).
-
-```python
-model = LogisticRegression()
-```
-
-4. Create a `Utility` object to wrap the Dataset, the model and a scoring
-function.
-
-```python
-u = Utility(
-    model,
-    data,
-    Scorer("accuracy", default=0.0)
-)
-```
-
-5. Use one of the methods defined in the library to compute the values.
-In our example, we will use *Permutation Montecarlo Shapley*,
-an approximate method for computing Data Shapley values.
-
-```python
-values = compute_shapley_values(
-    u,
-    mode=ShapleyMode.PermutationMontecarlo,
-    done=MaxUpdates(100),
-    seed=16,
-    progress=True
-)
-```
-The result is a variable of type `ValuationResult` that contains
-the indices and their values as well as other attributes.
-
-The higher the value for an index, the more important it is for the chosen
-model, dataset and scorer.
-
-6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values.
-
-```python
-df = values.to_dataframe(column="data_value")
-```
+predictor), and wrap it in a `Utility` object together with the data and a
+scoring function.
+4. Use one of the methods defined in the library to compute the values. In the
+example below, we will use *Permutation Montecarlo Shapley*, an approximate
+method for computing Data Shapley values. The result is a variable of type
+`ValuationResult` that contains the indices and their values as well as other
+attributes.
+5. Convert the valuation result to a dataframe, and analyze and visualize the
+values.
+
+The higher the value for an index, the more important it is for the chosen
+model, dataset and scorer. Reciprocally, low-value points could be mislabelled,
+or out-of-distribution, and dropping them can improve the model's performance.
+
+```python
+from sklearn.datasets import load_breast_cancer
+from sklearn.linear_model import LogisticRegression
+
+from pydvl.utils import Dataset, Scorer, Utility
+from pydvl.value import (MaxUpdates, RelativeTruncation,
+                         permutation_montecarlo_shapley)
+
+data = Dataset.from_sklearn(
+    load_breast_cancer(),
+    train_size=10,
+    stratify_by_target=True,
+    random_state=16,
+)
+model = LogisticRegression()
+u = Utility(
+    model,
+    data,
+    Scorer("accuracy", default=0.0)
+)
+values = permutation_montecarlo_shapley(
+    u,
+    truncation=RelativeTruncation(u, 0.05),
+    done=MaxUpdates(1000),
+    seed=16,
+    progress=True
+)
+df = values.to_dataframe(column="data_value")
+```
 
 # Contributing
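Note on the "Permutation Montecarlo Shapley" step in the README text above: the sketch below shows the general idea of permutation sampling on a toy problem, in plain Python. It is not pyDVL's implementation and omits the truncation used in the README example; `toy_utility`, its weights and `permutation_shapley` are hypothetical names chosen for illustration. Each sampled permutation adds points one at a time and credits every point with the change in utility it causes; averaging over permutations estimates the Shapley value.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def toy_utility(subset: FrozenSet[int]) -> float:
    """Hypothetical utility: per-point weights plus a bonus when
    points 0 and 1 appear together."""
    weights = {0: 1.0, 1: 2.0, 2: -0.5}
    bonus = 1.0 if {0, 1} <= subset else 0.0
    return sum(weights[i] for i in subset) + bonus


def permutation_shapley(
    utility: Callable[[FrozenSet[int]], float],
    indices: List[int],
    n_permutations: int,
    seed: int,
) -> Dict[int, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled permutations of the indices."""
    rng = random.Random(seed)
    totals = {i: 0.0 for i in indices}
    for _ in range(n_permutations):
        permutation = indices[:]
        rng.shuffle(permutation)
        subset: FrozenSet[int] = frozenset()
        previous = utility(subset)
        for i in permutation:
            # Marginal contribution of i given the points added before it.
            subset = subset | {i}
            current = utility(subset)
            totals[i] += current - previous
            previous = current
    return {i: total / n_permutations for i, total in totals.items()}


values = permutation_shapley(toy_utility, [0, 1, 2], n_permutations=1000, seed=16)
```

Within each permutation the marginal contributions telescope, so the estimated values always sum to the utility of the full set minus that of the empty set. Point 2 lowers the utility in every ordering and therefore receives a negative value, which is the kind of low-value point the README suggests inspecting or dropping.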

requirements-notebooks.txt

+1 -1
@@ -3,5 +3,5 @@ distributed==2023.4.0
 pillow==10.3.0
 torch==2.0.1
 torchvision==0.15.2
-transformers==4.36.0
+transformers==4.38.0
 zarr==2.16.1

setup.py

+1 -1
@@ -12,7 +12,7 @@
     package_data={"pydvl": ["py.typed"]},
     packages=find_packages(where="src"),
     include_package_data=True,
-    version="0.9.0",
+    version="0.9.1",
     description="The Python Data Valuation Library",
     install_requires=[
 line

src/pydvl/__init__.py

+1 -1
@@ -7,4 +7,4 @@
 The two main modules you will want to look at are [value][pydvl.value] and
 [influence][pydvl.influence].
 """
-__version__ = "0.9.0"
+__version__ = "0.9.1"
