Handle duplicated column names (#49)
V 0.4.0
=======
- New features:
  - New features in existing functions:
    - To avoid issues caused by column names, columns that share the same name are now detected and renamed.
    - In *aggregateByKey*, generated column names are now more explicit.
    - In *aggregateByKey*, the column generated from a character column with more than \code{thresh} values is now a count of unique values instead of a count.
    - Added missing *auto* default values on cols.

- Bug fixes:
  - *whichAreBijection* and *whichAreInDouble* use *bi_col_test*, which was not working with a 2-column data set. It is fixed.
  - *prepareSet* optional argument *factor_date_type* was not working. It is fixed.

- Other changes: 
  - Changed the *whichAreIncluded* example since it was too slow for CRAN. It may also be a little more explicit now.
  - Changed the *aggregateByKey* example since it was too slow for CRAN.

- Integration:
  - Rewrote all tests to make them more readable.
  - Code coverage is improved; dependencies on the *messy_adult* set are reduced.

WARNING:
- In *aggregateByKey*, generated column names have changed.
- In *aggregateByKey*, the column generated for character columns is different.
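
The duplicate-name handling described in this message could look roughly like the sketch below. It is only an illustration: `rename_duplicated_columns` is a hypothetical helper, not the package's actual function, and the `_` separator is an assumption. It relies on base R's `make.unique()` to give every repeated column name a distinct suffix.

```r
# Hypothetical helper (not the package's real implementation):
# make every column name unique by suffixing repeated names.
rename_duplicated_columns <- function(data_set, verbose = TRUE) {
  old_names <- names(data_set)
  # make.unique() keeps the first occurrence and suffixes later ones
  new_names <- make.unique(old_names, sep = "_")
  changed <- old_names != new_names
  if (verbose && any(changed)) {
    message("Renamed duplicated columns: ",
            paste(new_names[changed], collapse = ", "))
  }
  names(data_set) <- new_names
  data_set
}

# Build a small data set with two columns sharing the name "value".
messy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
names(messy) <- c("id", "value", "value")

clean <- rename_duplicated_columns(messy)
print(names(clean))
```

Renaming up front means later steps that select columns by name never have to guess which of two identically named columns is meant.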
ELToulemonde authored Mar 28, 2019
1 parent 30dd044 commit 3612726
Showing 36 changed files with 5,993 additions and 3,866 deletions.
260 changes: 132 additions & 128 deletions CONTRIBUTING.rst
@@ -1,129 +1,133 @@
============
Contributing
============

Contributions are welcome, and they are greatly appreciated! Every
little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions
----------------------

Report Bugs
~~~~~~~~~~~

Report bugs at https://github.com/ELToulemonde/dataPreparation/issues.

If you are reporting a bug, please include:

* Your operating system name and version.
* Any details about your local setup that might be helpful in troubleshooting.
* Detailed steps to reproduce the bug.

Fix Bugs
~~~~~~~~

Look through the GitHub issues for bugs. Anything tagged with "bug"
and "help wanted" is open to whoever wants to implement it.

Implement Features
~~~~~~~~~~~~~~~~~~

Look through the GitHub issues for features. Anything tagged with "enhancement"
and "help wanted" is open to whoever wants to implement it.

If you have new feature ideas, you can also open an issue and tag it with
"enhancement".

Write Documentation
~~~~~~~~~~~~~~~~~~~

dataPreparation could always use more documentation, whether as part of the
official dataPreparation docs, vignettes, or even on the web in blog posts,
articles, and such.

Submit Feedback
~~~~~~~~~~~~~~~

The best way to send feedback is to file an issue at https://github.com/ELToulemonde/dataPreparation/issues.

If you are proposing a feature:

* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.
* Remember that this is a volunteer-driven project, and that contributions
  are welcome :)


Controlling your developments
-----------------------------

1. Build new functionality in the most appropriate R file.

2. Document the function. Make sure every param is commented.

3. Change the package version in DESCRIPTION and add what's new in NEWS.md.

4. Build unit tests in tests\testthat\test_.....R (one test file per R code file). Unit tests should make sure that your function works exactly the way you want it to.

5. Generate documentation using devtools::document() and run unit tests using devtools::test().

6. Build and install the package using devtools::build() and devtools::install().

7. Check that everything conforms to CRAN requirements using devtools::check(pkg = "dataPreparation"). You should have no errors, no warnings and no notes.

8. Check code coverage with cov <- covr::package_coverage() and then covr::zero_coverage(cov). Your new lines of code shouldn't appear.

9. Push to GitHub; Travis will check that everything is OK. Code coverage is also checked there.

10. If everything passed, create a Pull Request.

11. Thank you very much for your contributions! :)


Code conventions
-----------------

+---------------+-------------+----------------------------------------+
|Use            | convention  | interpretation                         |
+===============+=============+========================================+
|Function names | set...      | Change into ...                        |
+               +-------------+----------------------------------------+
|               | which...    | Identify ...                           |
+               +-------------+----------------------------------------+
|               | fast...     | Perform in an efficient way            |
|               |             | (col by col or by exponential search)  |
+               +-------------+----------------------------------------+
|               | is.XXX      | Check if it is XXX                     |
+               +-------------+----------------------------------------+
|               | generate... | Create new columns                     |
+---------------+-------------+----------------------------------------+
|Arguments      | drop        | Should original columns be dropped     |
+               +-------------+----------------------------------------+
|               | verbose     | Boolean: print progress messages       |
+               +-------------+----------------------------------------+
|               | dataSet     | Input data set                         |
+               +-------------+----------------------------------------+
|               | cols        | A list of column names                 |
+---------------+-------------+----------------------------------------+
|Variables      | data_sample | Slice of data set copied for computing |
+               +-------------+----------------------------------------+
|               | result      | The result that will be returned       |
+               +-------------+----------------------------------------+
|               | ..._tmp     | Partially built object                 |
+               +-------------+----------------------------------------+
|               | ...s        | Iterable (list, array, ...)            |
+               +-------------+----------------------------------------+
|               | n...        | Number of ...                          |
+               +-------------+----------------------------------------+
|               | col         | A column name                          |
+               +-------------+----------------------------------------+
|               | I           | A list of indexes                      |
+               +-------------+----------------------------------------+
|               | n_test      | Number of rows on which we perform test|
+               +-------------+----------------------------------------+
|               | args        | Arguments from "..."                   |
+               +-------------+----------------------------------------+
|               | start_time  | From proc.time() to keep track of time |
============
Contributing
============

Contributions are welcome, and they are greatly appreciated! Every
little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions
----------------------

Report Bugs
~~~~~~~~~~~

Report bugs at https://github.com/ELToulemonde/dataPreparation/issues.

If you are reporting a bug, please include:

* Your operating system name and version.
* Any details about your local setup that might be helpful in troubleshooting.
* Detailed steps to reproduce the bug.

Fix Bugs
~~~~~~~~

Look through the GitHub issues for bugs. Anything tagged with "bug"
and "help wanted" is open to whoever wants to implement it.

Implement Features
~~~~~~~~~~~~~~~~~~

Look through the GitHub issues for features. Anything tagged with "enhancement"
and "help wanted" is open to whoever wants to implement it.

If you have new feature ideas, you can also open an issue and tag it with
"enhancement".

All implementations should:

- respect the code conventions listed below,
- be tested (using the same testing scheme as the other tests).

Write Documentation
~~~~~~~~~~~~~~~~~~~

dataPreparation could always use more documentation, whether as part of the
official dataPreparation docs, vignettes, or even on the web in blog posts,
articles, and such.

Submit Feedback
~~~~~~~~~~~~~~~

The best way to send feedback is to file an issue at https://github.com/ELToulemonde/dataPreparation/issues.

If you are proposing a feature:

* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.
* Remember that this is a volunteer-driven project, and that contributions
  are welcome :)


Controlling your developments
-----------------------------

1. Build new functionality in the most appropriate R file.

2. Document the function. Make sure every param is commented.

3. Change the package version in DESCRIPTION and add what's new in NEWS.md.

4. Build unit tests in tests\testthat\test_.....R (one test file per R code file). Unit tests should make sure that your function works exactly the way you want it to.

5. Generate documentation using devtools::document() and run unit tests using devtools::test().

6. Build and install the package using devtools::build() and devtools::install().

7. Check that everything conforms to CRAN requirements using devtools::check(pkg = "dataPreparation"). You should have no errors, no warnings and no notes.

8. Check code coverage with cov <- covr::package_coverage() and then covr::zero_coverage(cov). Your new lines of code shouldn't appear.

9. Push to GitHub; Travis will check that everything is OK. Code coverage is also checked there.

10. If everything passed, create a Pull Request.

11. Thank you very much for your contributions! :)


Code conventions
-----------------

+---------------+-------------+----------------------------------------+
|Use            | convention  | interpretation                         |
+===============+=============+========================================+
|Function names | set...      | Change into ...                        |
+               +-------------+----------------------------------------+
|               | which...    | Identify ...                           |
+               +-------------+----------------------------------------+
|               | fast...     | Perform in an efficient way            |
|               |             | (col by col or by exponential search)  |
+               +-------------+----------------------------------------+
|               | is.XXX      | Check if it is XXX                     |
+               +-------------+----------------------------------------+
|               | generate... | Create new columns                     |
+---------------+-------------+----------------------------------------+
|Arguments      | drop        | Should original columns be dropped     |
+               +-------------+----------------------------------------+
|               | verbose     | Boolean: print progress messages       |
+               +-------------+----------------------------------------+
|               | dataSet     | Input data set                         |
+               +-------------+----------------------------------------+
|               | cols        | A list of column names                 |
+---------------+-------------+----------------------------------------+
|Variables      | data_sample | Slice of data set copied for computing |
+               +-------------+----------------------------------------+
|               | result      | The result that will be returned       |
+               +-------------+----------------------------------------+
|               | ..._tmp     | Partially built object                 |
+               +-------------+----------------------------------------+
|               | ...s        | Iterable (list, array, ...)            |
+               +-------------+----------------------------------------+
|               | n...        | Number of ...                          |
+               +-------------+----------------------------------------+
|               | col         | A column name                          |
+               +-------------+----------------------------------------+
|               | I           | A list of indexes                      |
+               +-------------+----------------------------------------+
|               | n_test      | Number of rows on which we perform test|
+               +-------------+----------------------------------------+
|               | args        | Arguments from "..."                   |
+               +-------------+----------------------------------------+
|               | start_time  | From proc.time() to keep track of time |
+---------------+-------------+----------------------------------------+
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: dataPreparation
Title: Automated Data Preparation
Version: 0.3.9
Version: 0.4.0
Authors@R: person("Emmanuel-Lin", "Toulemonde", email = "[email protected]", role = c("aut", "cre"))
Description: Do most of the painful data preparation for a data science project with a minimum amount of code; Take advantages of data.table efficiency and use some algorithmic trick in order to perform data preparation in a time and RAM efficient way.
Depends:
25 changes: 25 additions & 0 deletions NEWS
@@ -1,3 +1,28 @@
V 0.4.0
=======
- New features:
  - New features in existing functions:
    - To avoid issues caused by column names, columns that share the same name are now detected and renamed.
    - In *aggregateByKey*, generated column names are now more explicit.
    - In *aggregateByKey*, the column generated from a character column with more than \code{thresh} values is now a count of unique values instead of a count.
    - Added missing *auto* default values on cols.

- Bug fixes:
  - *whichAreBijection* and *whichAreInDouble* use *bi_col_test*, which was not working with a 2-column data set. It is fixed.
  - *prepareSet* optional argument *factor_date_type* was not working. It is fixed.

- Other changes:
  - Changed the *whichAreIncluded* example since it was too slow for CRAN. It may also be a little more explicit now.
  - Changed the *aggregateByKey* example since it was too slow for CRAN.

- Integration:
  - Rewrote all tests to make them more readable.
  - Code coverage is improved; dependencies on the *messy_adult* set are reduced.

WARNING:
- In *aggregateByKey*, generated column names have changed.
- In *aggregateByKey*, the column generated for character columns is different.

V 0.3.9
=======
- Integration:
25 changes: 25 additions & 0 deletions NEWS.md
@@ -1,3 +1,28 @@
V 0.4.0
=======
- New features:
  - New features in existing functions:
    - To avoid issues caused by column names, columns that share the same name are now detected and renamed.
    - In *aggregateByKey*, generated column names are now more explicit.
    - In *aggregateByKey*, the column generated from a character column with more than \code{thresh} values is now a count of unique values instead of a count.
    - Added missing *auto* default values on cols.

- Bug fixes:
  - *whichAreBijection* and *whichAreInDouble* use *bi_col_test*, which was not working with a 2-column data set. It is fixed.
  - *prepareSet* optional argument *factor_date_type* was not working. It is fixed.

- Other changes:
  - Changed the *whichAreIncluded* example since it was too slow for CRAN. It may also be a little more explicit now.
  - Changed the *aggregateByKey* example since it was too slow for CRAN.

- Integration:
  - Rewrote all tests to make them more readable.
  - Code coverage is improved; dependencies on the *messy_adult* set are reduced.

WARNING:
- In *aggregateByKey*, generated column names have changed.
- In *aggregateByKey*, the column generated for character columns is different.

V 0.3.9
=======
- Integration: