'''
This module contains the main code for the deadringer module.
'''
import typing

from . import pd


class NoisyData:
    '''Holds the clean input data together with a copy to which noise is
    added. Noisy data is added as additional rows rather than by mutating
    the original DataFrame.

    Parameters
    ----------
    data: pd.DataFrame
        The input pandas DataFrame that contains the clean input data.
    '''
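
    def __init__(self, data: pd.DataFrame):
        # Keep a reference to the clean input and work on a separate copy.
        self.raw_data = data
        self.noisy_data = self.raw_data.copy()

    def __repr__(self) -> str:
        # __repr__ must return a string, not the DataFrame itself.
        return repr(self.noisy_data)

    def perturb(
        self, probabilities: typing.Dict[str, typing.Union[str, float]]
    ):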
        '''Takes a dictionary of perturbation probabilities for a field
        (the probabilities of transposition, substitution, insertion, etc.).
        The supported probabilities are:

        - selection: Probability of selecting a field for introducing one or
          more modifications (set this to 0.0 if no modifications should ever
          be introduced into this field). Note: The sum of these selection
          probabilities over all defined fields must be 1. The default is
          computed from the values provided for the other fields. For
          instance, if FIELD_A has a selection probability of 0.8, FIELD_B
          has a selection probability of 0.1, and FIELD_C has no assigned
          selection probability, then the default for FIELD_C is
          1 - (0.8 + 0.1) = 0.1. If another field, FIELD_D, also has no
          assigned selection probability, then the default for each of the
          two is (1 - (0.8 + 0.1))/2 = 0.05, i.e. the remaining probability
          mass is split evenly between the two fields. TODO: Instead of
          averaging, would it be possible to share based on the base rates
          from the literature?
        - misspelling: Probability of swapping an original value with a
          randomly chosen misspelling from the corresponding misspelling
          dictionary (can only be set to larger than 0.0 if such a
          misspellings dictionary is defined for the given field).
          TODO: Decide the API for the misspellings dictionary.
        - insertion: Probability of inserting a character into a field value.
        - deletion: Probability of deleting a character from a field value.
        - substitution: Probability of substituting a character in a field
          value with another character.
        - transposition: Probability of transposing two characters in a
          field value.
        - swap_fields: Probability of swapping the value in a field with
          another (randomly selected) value for this field. TODO: Are some
          fields more liable to be swapped than others? FirstName-LastName
          swaps might be more probable than swapping names with addresses.
        - swap_words: Probability of swapping two words in a field (given
          there are at least two words in the field).
        - space_insertion: Probability of inserting a space into a field
          value (thus splitting a word).
        - space_deletion: Probability of deleting a space (if available) in
          a field (and thus merging two words).
        - missing: Probability of a missing value in the field.

        Assuming that only one kind of error can occur per field per record,
        the sum over the probabilities must be either 1.0 or 0.0 (meaning
        none of them). Concretely, that means sum([misspelling, insertion,
        deletion, substitution, transposition, swap_fields, space_insertion,
        space_deletion, missing]) must equal either 1 or 0. However, the
        validity of this assumption is suspect because two kinds of errors
        can (and do) occur in a single field.

        Parameters
        ----------
        probabilities: Dict[str, Union[str, float]]
            A dict with the kind of perturbation and its probability as
            key-value pairs. A `field` key with the name of the column whose
            values are to be modified is mandatory.

        Examples
        --------
        >>> df = pd.DataFrame({'uid': [1, 2], 'name': ['Tom', 'Jerry']})
        >>> noisy_df = NoisyData(df)
        >>> noisy_df.perturb({'field': 'name', 'selection': 0.5})
        '''
        pass
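

# ---------------------------------------------------------------------------
# Illustrative sketch, not part of the original module. It shows one possible
# way to implement two rules described in the `perturb` docstring: the even
# split of the unassigned selection probability mass, and the check that the
# error probabilities sum to either 1 or 0. The names
# `fill_selection_defaults`, `check_error_probabilities` and `ERROR_KINDS`,
# the use of None for "no assigned probability", and the tolerances are
# assumptions made for this example only.
# ---------------------------------------------------------------------------
import math
import typing

# Error kinds named in the sum constraint of the `perturb` docstring.
ERROR_KINDS = (
    'misspelling', 'insertion', 'deletion', 'substitution', 'transposition',
    'swap_fields', 'space_insertion', 'space_deletion', 'missing',
)


def fill_selection_defaults(
    selection: typing.Dict[str, typing.Optional[float]]
) -> typing.Dict[str, float]:
    '''Return a copy of `selection` in which every None entry receives an
    equal share of the probability mass left over by the assigned entries.
    '''
    assigned = sum(p for p in selection.values() if p is not None)
    if assigned > 1.0 + 1e-9:
        raise ValueError('assigned selection probabilities exceed 1')
    unassigned = [field for field, p in selection.items() if p is None]
    if not unassigned:
        return dict(selection)
    # Split the remaining mass evenly between the unassigned fields.
    share = (1.0 - assigned) / len(unassigned)
    return {
        field: share if p is None else p for field, p in selection.items()
    }


def check_error_probabilities(
    probabilities: typing.Dict[str, typing.Union[str, float]]
) -> bool:
    '''Return True if the error probabilities sum to 1 or to 0 (within a
    small floating point tolerance). Keys not in ERROR_KINDS, such as
    `field` and `selection`, are ignored.
    '''
    total = sum(float(probabilities.get(kind, 0.0)) for kind in ERROR_KINDS)
    return (
        math.isclose(total, 1.0)
        or math.isclose(total, 0.0, abs_tol=1e-9)
    )


# Usage with the numbers from the docstring: FIELD_C and FIELD_D have no
# assigned selection probability, so each receives roughly 0.05.
#
#     fill_selection_defaults(
#         {'FIELD_A': 0.8, 'FIELD_B': 0.1, 'FIELD_C': None, 'FIELD_D': None}
#     )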