docs/data_integrity.rst (+36 -24)
@@ -2,31 +2,37 @@ Data integrity
 ==============


-Collecting mobile phone metadata can lead to corrupted data: wrong format, faulty files, empty periods of time or missing users, *etc.* bandicoot will not try to fix corrupted data, but however let you handle such situations by:
+Occasionally, records in CDR and collected mobile phone metadata can be
+corrupted: wrong format, faulty files, empty periods of time, missing users,
+*etc.* bandicoot will not attempt to correct errors, as this might lead to
+incorrect analysis. It will instead:

-1. warning you when importing data,
-2. removing faulty records,
-3. adding more than 30 reporting variables when exporting indicators.
+1. warn you when you attempt to import corrupted data,
+2. remove faulty records,
+3. report more than 30 variables warning you of potential issues when exporting
+   indicators.



 Warnings at import
 ------------------

-By default, :meth:`~bandicoot.io.read_csv` logs six warnings to the standard output:
+By default, :meth:`~bandicoot.io.read_csv` reports six warnings to the standard output:

-1. when an attribute path is given, but no attributes are loaded (which can occur when the path is wrong, or the attribute file empty),
-2. a recharges path given, but no recharges loaded,
-3. the percentage of records missing a location when positive,
-4. the number of antennas missing a location (when an antenna file was provided)
-5. the percentage of duplicated records (which can happen when databases are mixed together)
-6. the percentage of calls with an overlap of more than 5 minutes
+1. when an *attributes_path* is given but no attributes could be loaded, e.g.
+   because the path is wrong or because the attribute file is empty,
+2. when a *recharges_path* is given but no recharges could be loaded,
+3. the percentage of records that do not contain location information,
+4. the number of antennas missing location information (when an antenna file is provided),
+5. the percentage of duplicated records,
+6. the percentage of calls with an overlap of more than 5 minutes.
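
The last warning in the list can be illustrated with a small, self-contained sketch. The text does not say how bandicoot measures an overlap, so the definition used here is an assumption: a call overlaps when it starts more than 5 minutes before an earlier call has ended. The function name ``share_of_overlapping_calls`` and the ``(start, duration_in_seconds)`` record layout are hypothetical and are not part of bandicoot.

```python
from datetime import datetime, timedelta


def share_of_overlapping_calls(calls, tolerance=timedelta(minutes=5)):
    """Return the fraction of calls that overlap an earlier call.

    `calls` is a list of (start, duration_in_seconds) tuples.
    Assumption (not taken from bandicoot): a call overlaps when it starts
    more than `tolerance` before the latest end time seen so far.
    """
    if not calls:
        return 0.0

    ordered = sorted(calls)  # chronological order by start time
    start, duration = ordered[0]
    latest_end = start + timedelta(seconds=duration)

    overlapping = 0
    for start, duration in ordered[1:]:
        if latest_end - start > tolerance:
            overlapping += 1
        # Keep track of the furthest end time among the calls seen so far.
        latest_end = max(latest_end, start + timedelta(seconds=duration))

    return overlapping / len(ordered)
```

Multiplying the result by 100 gives a percentage comparable to the one reported in the warning.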


 Removal of faulty records
 -------------------------

-When loading a CSV file containing records, bandicoot filters out lines with wrong values, and keeps the count of ignored lines in the :class:`~bandicoot.core.User` object:
+bandicoot will automatically remove faulty records and will report the number
+of ignored records (also available in the :class:`~bandicoot.core.User` object):

 .. code-block:: python

@@ -39,24 +45,30 @@ When loading a CSV file containing records, bandicoot filters out lines with wro
 'interaction': 0,
 'location': 0}

-The previous example means that six records were removed because:
+In this example, records were removed for the following reasons:

-- three records had wrong call durations,
-- two records had wrong dates and times,
-- four records had wrong with directions.
+- three records had incorrect call durations,
+- two records had incorrect dates and times,
+- four records had incorrect incoming or outgoing directions.

-.. warning:: An ignored record with multiple faulty fields will be counted for all field, and not only for the first detected. The sum of all ignored fields in ``my_user.ignored_records`` is not equal to 5, the number of ignored records.
+.. warning:: An ignored record with multiple faulty fields is counted once for
+   each faulty field, so the per-field counts can sum to more than the number
+   of ignored records. The total is reported under ``all``, here 5.
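
The counting rule in the warning can be sketched in a few lines of plain Python. This is an illustration only, not bandicoot's implementation: the function ``count_ignored`` and the ``validators`` mapping are hypothetical names, and the only key taken from the text is ``all``.

```python
from collections import Counter


def count_ignored(records, validators):
    """Split records into kept ones and a per-field count of faults.

    `validators` maps a field name to a function returning True when the
    value is acceptable. A record with several faulty fields adds one to
    each of those fields, but only one to "all".
    """
    ignored = Counter({"all": 0})
    for field in validators:
        ignored[field] = 0  # report every field, even when its count is zero

    kept = []
    for record in records:
        faulty = [name for name, is_valid in validators.items()
                  if not is_valid(record.get(name))]
        if faulty:
            ignored["all"] += 1
            ignored.update(faulty)  # one increment per faulty field
        else:
            kept.append(record)

    return kept, dict(ignored)
```

With one record that has both a bad duration and a bad direction, and one record that has only a bad direction, the sketch reports 2 under ``all`` but 3 across the individual fields.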


-bandicoot can also remove duplicated records, if the option ``drop_duplicates=True`` is provided to :meth:`bandicoot.core.read_csv`. This functionality is not activated by default, as one user can send multiple text messages in less than one minute (or less, depending on the granularity of the data set), yet they should not count as duplicated.
+bandicoot also offers the option to remove "duplicated records" (same
+correspondent, direction, date and time). The option ``drop_duplicates=True``
+in :meth:`~bandicoot.io.read_csv` is not activated by default, as one user
+might send multiple text messages within the same minute (or a shorter
+interval, depending on the granularity of the data set).
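
To see why dropping duplicates is opt-in, here is a minimal sketch of the duplicate rule as the text describes it (same correspondent, direction, date and time). It is not bandicoot's code: the function ``drop_duplicate_records`` and the dictionary keys ``correspondent_id``, ``direction`` and ``datetime`` are assumptions made for the illustration.

```python
def drop_duplicate_records(records):
    """Keep the first record for each (correspondent, direction, datetime).

    Records are plain dictionaries; any other keys are ignored when
    deciding whether two records are duplicates.
    """
    seen = set()
    unique = []
    for record in records:
        key = (record["correspondent_id"],
               record["direction"],
               record["datetime"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

If timestamps are stored with minute granularity, two genuine text messages sent to the same contact within one minute share the same key, and the second one would be discarded. That is the situation the default setting avoids.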

 Reporting variables
 -------------------

-The function :meth:`~bandicoot.utils.all` returns a nested dictionnary containing all indicators, but also 31 reporting variables:
+The function :meth:`~bandicoot.utils.all` returns a nested dictionary containing all indicators, but also 39 reporting variables:

-1. concerning the data loading (``antennas_path``, ``attributes_path``, ``recharges_path``),
-2. about the user (``start_time``, ``end_time``, ``night_start``, ``night_end``, ``weekend`` with a list of days defining a weekend, ``number_of_records``, ``number_of_antennas``, ``number_of_recharges``, ``bins``, ``bins_with_data``, ``bins_without_data``, ``has_call``, ``has_home``, ``has_recharges``, ``has_attributes``, ``has_network``),
-3. on records missing information (``percent_records_missing_location``, ``antennas_missing_locations``, and ``ignored_records`` mentioned previously),
-4. on the user's ego network (``percent_outofnetwork_calls``, ``percent_outofnetwork_texts``, ``percent_outofnetwork_contacts``, ``percent_outofnetwork_call_durations``),
-5. on the computation (``groupby``, ``split_week``, ``split_day``).
+1. information on the files: ``antennas_path``, ``attributes_path``, ``recharges_path``,
+2. information about the data: ``start_time``, ``end_time``, ``night_start``, ``night_end``, ``weekend`` with a list of days defining a weekend, ``number_of_records``, ``number_of_antennas``, ``number_of_recharges``…,
+3. information on records for which information is missing: ``percent_records_missing_location``, ``antennas_missing_locations``, and ``ignored_records`` mentioned previously,
+4. information on the user's ego network: ``percent_outofnetwork_calls``, ``percent_outofnetwork_texts``, ``percent_outofnetwork_contacts``, ``percent_outofnetwork_call_durations``,
+5. and finally, information on the grouping: ``groupby``, ``split_week``, ``split_day``.