Handle duplicated column names (#49)

V 0.4.0
=======

- New features:
  - New features in existing functions:
    - To avoid issues based on column names, we will check and rename columns that have the same name.
    - In *aggregateByKey*, generated column names are changed to be more explicit.
    - In *aggregateByKey*, the column generated from a character column with more than \code{thresh} values is now a count of unique values instead of a count.
    - Added missing *auto* default values on cols.
- Bug fixes:
  - *whichAreBijection* and *whichAreInDouble* use *bi_col_test*, which was not working with a 2-column data set. It is fixed.
  - *prepareSet* optional argument *factor_date_type* was not working. It is fixed.
- Other changes:
  - Changed the *whichAreIncluded* example since it was too slow for CRAN. It might also be a little more explicit now.
  - Changed the *aggregateByKey* example since it was too slow for CRAN.
- Integration:
  - Rewrote all tests to make them more readable.
  - Code coverage is improved, and dependencies on the *messy_adult* set are lowered.

WARNING:

- In *aggregateByKey*, generated column names are changed.
- In *aggregateByKey*, the generated column for character columns is different.
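To illustrate the column-renaming idea mentioned in the commit message, here is a minimal base-R sketch. It is not the package's actual implementation: the helper name `rename_duplicated_cols` is hypothetical, and the suffix scheme comes from base R's `make.unique`, which may differ from what dataPreparation does.

```r
# Hypothetical helper (not part of dataPreparation): make column names unique
# by appending a numeric suffix to every repeated name.
rename_duplicated_cols <- function(dataSet, verbose = TRUE) {
  old_names <- names(dataSet)
  new_names <- make.unique(old_names, sep = "_")
  n_renamed <- sum(old_names != new_names)
  if (verbose && n_renamed > 0) {
    message(n_renamed, " duplicated column name(s) renamed.")
  }
  names(dataSet) <- new_names
  dataSet
}

# Build a data.frame with a repeated column name; check.names = FALSE keeps it.
dataSet <- data.frame(a = 1:3, a = 4:6, b = 7:9, check.names = FALSE)
result <- rename_duplicated_cols(dataSet, verbose = FALSE)
print(names(result))
```

Renaming up front means later column-by-name operations never have to guess which of two identically named columns is meant.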
1 parent 30dd044 commit 3612726
Showing 36 changed files with 5,993 additions and 3,866 deletions.
@@ -1,129 +1,133 @@
============
Contributing
============

Contributions are welcome, and they are greatly appreciated! Every
little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions
----------------------

Report Bugs
~~~~~~~~~~~

Report bugs at https://github.com/ELToulemonde/dataPreparation/issues.

If you are reporting a bug, please include:

* Your operating system name and version.
* Any details about your local setup that might be helpful in troubleshooting.
* Detailed steps to reproduce the bug.

Fix Bugs
~~~~~~~~

Look through the GitHub issues for bugs. Anything tagged with "bug"
and "help wanted" is open to whoever wants to implement it.

Implement Features
~~~~~~~~~~~~~~~~~~

Look through the GitHub issues for features. Anything tagged with "enhancement"
and "help wanted" is open to whoever wants to implement it.

If you have ideas for new features, you can also open an issue and tag it with
"enhancement".

Every implementation should:

- respect the code conventions listed below,
- be tested (using the same testing schema as the other tests).

Write Documentation
~~~~~~~~~~~~~~~~~~~

dataPreparation could always use more documentation, whether as part of the
official dataPreparation docs, vignettes, or even on the web in blog posts,
articles, and such.

Submit Feedback
~~~~~~~~~~~~~~~

The best way to send feedback is to file an issue at https://github.com/ELToulemonde/dataPreparation/issues.

If you are proposing a feature:

* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.
* Remember that this is a volunteer-driven project, and that contributions
  are welcome :)


Controlling your developments
-----------------------------

1. Build the new functionality in the most appropriate R file.

2. Document the function. Make sure every param is documented.

3. Change the package version in DESCRIPTION and add what's new in NEWS.md.

4. Build unit tests in tests/testthat/test_.....R (one test file per R code file). Unit tests should make sure that your function works exactly the way you want it to.

5. Generate the documentation using devtools::document() and run the unit tests using devtools::test().

6. Build and install the package using devtools::build() and devtools::install().

7. Check that everything conforms to CRAN requirements using devtools::check(pkg = "dataPreparation"). You should have no errors, no warnings and no notes.

8. Check code coverage with cov <- covr::package_coverage() and then covr::zero_coverage(cov). Your new lines of code shouldn't appear.

9. Push to GitHub; Travis will check that everything is OK. There will also be a code coverage check.

10. If everything passed, create a Pull Request.

11. Thank you very much for your contributions! :)
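The local checks listed above could be chained in one helper, sketched below. This assumes devtools and covr are installed and that the working directory is the root of the package sources. The function name `run_local_checks` is hypothetical and is not shipped with dataPreparation.

```r
# Hypothetical helper: run the local checks from the list above, in order.
# Assumes devtools and covr are installed; `pkg` is the package source folder.
run_local_checks <- function(pkg = ".") {
  devtools::document(pkg)   # regenerate the documentation
  devtools::test(pkg)       # run the unit tests
  devtools::build(pkg)      # build the package
  devtools::install(pkg)    # install it locally
  devtools::check(pkg)      # CRAN-style check: aim for no error, warning or note
  cov <- covr::package_coverage(pkg)  # measure code coverage
  covr::zero_coverage(cov)  # lines never run by the tests; new code shouldn't appear
}
```

Calling `run_local_checks()` from the package root returns the table of uncovered lines, which makes it easy to see whether new code still lacks tests before pushing.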

Code conventions
-----------------

+---------------+-------------+----------------------------------------+
|Use            | convention  | interpretation                         |
+===============+=============+========================================+
|Function names | set...      | Change into ...                        |
+               +-------------+----------------------------------------+
|               | which...    | Identify ...                           |
+               +-------------+----------------------------------------+
|               | fast...     | Perform in an efficient way            |
|               |             | (col by col or by exponential search)  |
+               +-------------+----------------------------------------+
|               | is.XXX      | Check if is XXX                        |
+               +-------------+----------------------------------------+
|               | generate... | Create new columns                     |
+---------------+-------------+----------------------------------------+
|Arguments      | drop        | Should original columns be dropped     |
+               +-------------+----------------------------------------+
|               | verbose     | Boolean: should the algorithm talk?    |
+               +-------------+----------------------------------------+
|               | dataSet     | Input data set                         |
+               +-------------+----------------------------------------+
|               | cols        | A list of column names                 |
+---------------+-------------+----------------------------------------+
|Variables      | data_sample | Slice of data set copied for computing |
+               +-------------+----------------------------------------+
|               | result      | The result that will be returned       |
+               +-------------+----------------------------------------+
|               | ..._tmp     | Partially built object                 |
+               +-------------+----------------------------------------+
|               | ...s        | Iterable (list, array, ...)            |
+               +-------------+----------------------------------------+
|               | n...        | Number of ...                          |
+               +-------------+----------------------------------------+
|               | col         | A column name                          |
+               +-------------+----------------------------------------+
|               | I           | A list of indices                      |
+               +-------------+----------------------------------------+
|               | n_test      | Number of rows on which we perform test|
+               +-------------+----------------------------------------+
|               | args        | Arguments from "..."                   |
+               +-------------+----------------------------------------+
|               | start_time  | From proc.time() to keep track of time |
+---------------+-------------+----------------------------------------+
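As a rough illustration of how these conventions might look in practice, here is a small base-R sketch. The function `whichAreAllNA` is hypothetical and exists only to show the naming scheme. It is not part of dataPreparation, and unlike the package it uses a plain data.frame rather than data.table.

```r
# Hypothetical function, written only to illustrate the naming conventions.
whichAreAllNA <- function(dataSet, cols = "auto", verbose = TRUE) {
  start_time <- proc.time()              # start_time: keep track of time
  if (identical(cols, "auto")) {
    cols <- names(dataSet)               # cols: a list of column names
  }
  n_cols <- length(cols)                 # n...: number of ...
  result <- character(0)                 # result: what will be returned
  for (col in cols) {                    # col: a column name
    if (all(is.na(dataSet[[col]]))) {
      result <- c(result, col)
    }
  }
  if (verbose) {                         # verbose: should the algorithm talk?
    elapsed <- (proc.time() - start_time)[["elapsed"]]
    message("whichAreAllNA: checked ", n_cols, " column(s) in ",
            round(elapsed, 2), "s.")
  }
  result
}

dataSet <- data.frame(x = c(1, NA, 3), y = c(NA, NA, NA), z = c("a", "b", NA))
print(whichAreAllNA(dataSet, verbose = FALSE))
```

The `which...` prefix signals that the function only identifies columns and leaves `dataSet` untouched. A `set...` or `generate...` function would be expected to change or add columns instead.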
@@ -1,6 +1,6 @@
 Package: dataPreparation
 Title: Automated Data Preparation
-Version: 0.3.9
+Version: 0.4.0
 Authors@R: person("Emmanuel-Lin", "Toulemonde", email = "[email protected]", role = c("aut", "cre"))
 Description: Do most of the painful data preparation for a data science project with a minimum amount of code; Take advantages of data.table efficiency and use some algorithmic trick in order to perform data preparation in a time and RAM efficient way.
 Depends: