Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip whitespace from beginning and end of all records #31

Open
dlebauer opened this issue Jan 15, 2020 · 10 comments
Open

Strip whitespace from beginning and end of all records #31

dlebauer opened this issue Jan 15, 2020 · 10 comments

Comments

@dlebauer
Copy link
Member

remove leading and trailing whitespace,

eg. this: ABC DEF ,Lactuca sativa,, should become ABC DEF,Lactuca sativa,,

for all fields

@gsrohde
Copy link

gsrohde commented Jan 16, 2020

@dlebauer et al.: If you want normalized white space to be a database invariant for any particular columns, I strongly recommend implementing this at the database level with a trigger function. Then, no matter from what pathway the value of a column is inserted or updated, one can be assured that the white space adheres to some normal form.

For an example of how this might be done, you can look at the PL/pgSQL trigger function normalize_name_whitespace() which is called whenever a cultivar is inserted or updated: it takes the value for name given in the insert or update statement and strips leading and trailing whitespace and converts any internal sequence of whitespace characters (including tabs and newlines) to a single space character.

A major motivation for this was to give the uniqueness constraint unique_name_per_species more teeth: it doesn't do much good to ensure names are unique if we can still have what are essentially duplicates that vary only in the distribution of white space.

@dlebauer dlebauer added the good first issue Good for newcomers label Jan 17, 2020
@shivanshu1086
Copy link

Hi @dlebauer,
Can I work on this?

@dlebauer
Copy link
Member Author

dlebauer commented Mar 2, 2020

@shivanshu1086 that would be great - step 1:

@gsrohde - based on your comment above, my understanding is that if you try to put a cultivars.csv with a white space added, the white space should automatically be stripped when it is inserted. (is that correct?). I am pretty sure that this didn't happen, which is why I wrote up this issue. please let me know if the following sounds reasonable:

@shivanshu1086 you could start by creating a reproducible error - see if you can use curl to insert this file cultivars_whitespace_test.csv into the database. If the strip whitespace function is working, it should fail because it violates a uniqueness constraint. If it works, you can query the cultivars table to ensure that the whitespace was stripped (e.g. the record should have the trailing whitespace removed and be converted from DP_075 to DP_075 in the database).

If that works, you can also check that the uniqueness constraint works by trying to upload this file cultivars_uniqueness_test.csv and expect that it will fail with an error that says it violates uniqueness constraint.

All of that just confirms the current behavior.

Next step would be to follow the pattern here:
https://github.com/PecanProject/bety/blob/266dea4cd0e40446c0cdba09e1fd8318a3e341b2/db/migrate/20150202220519_add_uniqueness_constraints.rb#L52
image

to implement normalize_name_whitespace() on additional fields. If you can submit a pull request that applies this pattern to the name field of the treatments table as a first step, then during the pull request review we can discuss internally which fields and tables we want to apply this function to.

also, I've removed the "good first issue" tag since it was originally concieved of as working in the python / Flask API rather than as a database trigger function.

@dlebauer dlebauer removed the good first issue Good for newcomers label Mar 2, 2020
@gsrohde
Copy link

gsrohde commented Mar 2, 2020

@dlebauer I tested this out on the bety7 machine with

update cultivars set name = '           Caddo  lots   of spaces        more spaces        ' where id = 7000000503;

(I probably don't have access to whatever machine you use.) The resulting row looked like this

7000000503 |      2588 | Caddo lots of spaces more spaces   |         |       | 2020-03-02 20:03:57.120819 | 2020-03-02 20:19:39.908616 |

You might want to try doing \d cultivars on whatever database you are using. Check for the three relevant triggers and constraints:

Indexes:
...
    "unique_name_per_species" UNIQUE CONSTRAINT, btree (name, specie_id)
...
Check constraints:
    "normalized_names" CHECK (is_whitespace_normalized(name::text))
...
Triggers:
    normalize_cultivar_names BEFORE INSERT OR UPDATE ON cultivars FOR EACH ROW EXECUTE PROCEDURE normalize_name_whitespace()
...

@shivanshu1086
Copy link

ok @dlebauer , working on that!

@shivanshu1086
Copy link

Hey @dlebauer @gsrohde ,
I have run these scripts by making a separate request for the cultivars_whitespace_test.csv.
Screenshot 2020-03-04 at 3 50 25 PM
When I am trying a curl request I am getting this error... Can anyone help me out with this?

@shivanshu1086
Copy link

I have done with that error @dlebauer, Please have a look at what I have got with that.

Screenshot 2020-03-04 at 5 13 58 PM

That works, it has successfully inserted the cultivars_whitespace_test.csv into the database.

Then I tried to check the uniqueness constrains by adding the rows into cultivars_whitespace_test.csv from cultivars_uniqueness_test.csv. Then I came up with this.

Screenshot 2020-03-04 at 5 40 08 PM

@shivanshu1086
Copy link

This is what we get from the database:
Screenshot 2020-03-04 at 7 12 15 PM

@dlebauer
Copy link
Member Author

dlebauer commented Mar 4, 2020

@gsrohde the database I was using has those triggers installed.

Very mysterious. One could try debugging it by putting output statements in the function called by the trigger function, making sure it gets called upon a qualifying event and making sure it does what it is supposed to do.

@shivanshu1086
Copy link

@dlebauer please check #37 that can be helpful for checking the whitespaces for the database.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants