Working across different levels of nesting? #69

Open
cboettig opened this issue May 24, 2018 · 7 comments

Comments

@cboettig

This package is very exciting, great work! Our dataspice team was thrilled to see how nicely it already handles common requests on a dataspice.json file, e.g.:

library(roomba)
json <- jsonlite::read_json("https://raw.githubusercontent.com/ropenscilabs/dataspice/master/inst/metadata-tables/dataspice.json")

## Works nicely when all columns come from same level of nesting:
json %>% roomba(c("givenName", "familyName"))
json %>% roomba(c("value", "unitText", "description"))

But it's not clear how to get a repeated column from a different level of nesting:

json %>% roomba(c("value", "unitText", "description", "box"))

This probably requires some kind of notation to indicate the different levels. In this case, would something like the following be possible?

json %>% roomba(c("variableMeasured.value",
                  "variableMeasured.unitText",
                  "variableMeasured.description",
                  "spatialCoverage.geo.box"))
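To make the dotted-path idea concrete, here is a minimal base R sketch. It is not roomba's API: `get_path()`, `pluck_path()`, and the toy `meta` list are hypothetical names invented for illustration, and the sketch assumes that unnamed lists hold repeated records while named lists hold fields.

```r
# Hypothetical helpers (not part of roomba): walk a nested list along a
# sequence of keys, mapping over unnamed lists as repeated records.
get_path <- function(x, keys) {
  if (length(keys) == 0) return(x)      # path exhausted: return what we found
  if (!is.list(x)) return(NULL)         # cannot descend into an atomic value
  if (is.null(names(x))) {
    # Unnamed list: assume repeated records and apply the same path to each
    return(lapply(x, get_path, keys = keys))
  }
  get_path(x[[keys[1]]], keys[-1])      # named list: descend one level
}

# Split a dot-separated path such as "spatialCoverage.geo.box" into keys
pluck_path <- function(x, path) {
  get_path(x, strsplit(path, ".", fixed = TRUE)[[1]])
}

# Toy stand-in for a dataspice-like structure (values invented for illustration)
meta <- list(
  spatialCoverage = list(geo = list(box = "10 20 30 40")),
  variableMeasured = list(
    list(value = "temp", unitText = "celsius"),
    list(value = "depth", unitText = "m")
  )
)

pluck_path(meta, "spatialCoverage.geo.box")
unlist(pluck_path(meta, "variableMeasured.unitText"))
```

A path that does not exist returns NULL rather than raising an error, which may or may not be the behaviour a real implementation would want.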

@jimhester
Collaborator

Perhaps it could use variableMeasured$unitText? We had also briefly discussed using tidyeval so you wouldn't have to quote variable names, which would go along with this idea.

@aedobbyn
Collaborator

aedobbyn commented May 25, 2018

@cboettig glad it's making the tidying of dataspice JSON a bit easier! I think you've hit upon what is currently the main deficiency of the package. I like @jimhester's solution of allowing for unquoted cols inputs along with the $ syntax.

Our original thinking was to provide a function that returns all names in the list at any level (e.g. variableMeasured$unitText, variableMeasured$description, spatialCoverage$geo$box, etc.) so that the user can select some subset of those for cols.

Since that vector of names will be quite long, we might also want to provide a way for the user to ask for all levels of nesting under a certain name, maybe in a way like dplyr::starts_with(variableMeasured).
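A rough sketch of such a name-listing function in base R follows. `list_paths()` and the toy `meta` list are hypothetical and not part of roomba, and base `startsWith()` stands in for a starts_with-style helper here.

```r
# Hypothetical helper (not part of roomba): collect every "$"-joined name
# path found anywhere in a nested list.
list_paths <- function(x, prefix = NULL) {
  if (!is.list(x)) return(character(0))
  paths <- character(0)
  nms <- names(x)
  for (i in seq_along(x)) {
    has_name <- !is.null(nms) && !is.na(nms[i]) && nzchar(nms[i])
    # Unnamed elements (repeated records) do not add a path component
    here <- if (has_name) paste(c(prefix, nms[i]), collapse = "$") else prefix
    paths <- c(paths, if (has_name) here, list_paths(x[[i]], here))
  }
  unique(paths)
}

# Toy stand-in for a dataspice-like structure (values invented for illustration)
meta <- list(
  spatialCoverage = list(geo = list(box = "10 20 30 40")),
  variableMeasured = list(
    list(value = "temp", unitText = "celsius"),
    list(value = "depth", unitText = "m")
  )
)

all_paths <- list_paths(meta)
all_paths

# Everything nested under variableMeasured
all_paths[startsWith(all_paths, "variableMeasured$")]
```

The user could then pass a subset of `all_paths` as cols, assuming the package learned to interpret such paths.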

@cstawitz
Owner

I like @aedobbyn's suggestion to use dplyr::starts_with() to keep the variable argument from getting too clunky.

Maybe this deserves its own issue, but another complexity of multiple levels of nesting is different numbers of rows being returned. For example, the twitter data:

library(roomba)
data("twitter_data")
> twitter_data[[1]]$entities$urls[[1]]$indices
[[1]]
[1] 117

[[2]]
[1] 140

In this case there are two values at the urls level that correspond to one value at the entities level. Right now this happens:

> roomba(twitter_data, c("name","indices"))
Error in bind_rows_(x, .id) : Argument 2 must be length 1, not 2

My thought would be that we want some kind of long format, i.e.:

> roomba(twitter_data, c("name","indices"))
# A tibble: 2 x 2
                                 name          entities$urls$indices
                                <chr>          <chr>
1                    Code for America            117
2                    Code for America            140
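As a rough illustration of that long format, the sketch below builds the table by hand in base R. The `tweet` list is a toy stand-in: only the `entities$urls[[1]]$indices` part mirrors the output shown above, while the location of `name` under `user` is an assumption made for this example.

```r
# Toy stand-in for one element of twitter_data. The position of `name`
# (under `user`) is assumed for illustration only.
tweet <- list(
  user = list(name = "Code for America"),
  entities = list(urls = list(list(indices = list(117L, 140L))))
)

name <- tweet$user$name
indices <- unlist(tweet$entities$urls[[1]]$indices)

# data.frame() recycles the length-1 `name` across both `indices` values,
# giving one row per value at the deeper level of nesting
long <- data.frame(name = name, indices = indices, stringsAsFactors = FALSE)
long
```

Recycling the shallower value is only safe when the deeper values all belong to the same parent record, so a real implementation would need to track which parent each value came from.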

@alistaire47

Thanks to the authors; this is a great idea for a package! Another nested geodata example from the Google Places API returns both the location lat/lon and the viewport (which itself has two lat/lon pairs):

response <- list(
    html_attributions = list(),
    results = list(
        list(formatted_address = "4900 Georgia Ave NW, Washington, DC 20011, USA", 
             geometry = list(location = list(lat = 38.9499476, lng = -77.0274465), 
                             viewport = list(northeast = list(lat = 38.9513095798927, lng = -77.0259576201073), 
                                             southwest = list(lat = 38.9486099201073, lng = -77.0286572798927))), 
             icon = "https://maps.gstatic.com/mapfiles/place_api/icons/shopping-71.png", 
             id = "af15897f117bc7e03dfdfbd42d728b49f1e89d9e", 
             name = "Carquest Auto Parts - CQ of Washington DC", 
             opening_hours = list(open_now = TRUE, weekday_text = list()), 
             photos = list(list(
                 height = 399L, 
                 html_attributions = list("<a href=\"https://maps.google.com/maps/contrib/114756074981885904642/photos\">CARQUEST Auto Parts # 6360</a>"), 
                 photo_reference = "CmRaAAAAn9x83EO46LLgExGB549kblIzfcUsr0YMfvesTcb2wypsF4AItXPTgOj8CsmSm93H7AZTZhkbcq27-_VzSGmsKRK1jcQxwyQ5waTJD4WgH5uQR2OnzVmMJEhTvBkBpD5cEhCISyoZwUC_2cZ9mPN7eCZuGhRVDN2IS7dbVqNVf55RABimFGaneA", 
                 width = 600L)), 
             place_id = "ChIJr9kFdGzIt4kRUfOVZB-HkpI", 
             rating = 4.1, 
             reference = "CmRbAAAAhdTteLADmBP9aoOayLlTVi2uF_Q7Vjv5txpuvtiKpkziA9z_5wKMdjM1kK6hNDEJBLIaHlyDnWZfKdT5X_Fsy6B9B_niHLxeQqbqF5jfV9snXpAxks57TmpJiRgAXSorEhCJFRfMPTis725paCKJHgMXGhSDxyEK0-SvzzBooeKcuW20CHJOpg", 
             types = list("car_repair", "store", "point_of_interest", 
                          "establishment")), 
        list(formatted_address = "3908 Pennsylvania Ave SE, Washington, DC 20020, USA", 
             geometry = list(location = list(lat = 38.8658696, lng = -76.9502563), 
                             viewport = list(northeast = list(lat = 38.8669400798927, lng = -76.9490863201073), 
                                             southwest = list(lat = 38.8642404201073, lng = -76.9517859798927))), 
             icon = "https://maps.gstatic.com/mapfiles/place_api/icons/shopping-71.png", 
             id = "933f517d7634fff163e2564804c6ff88cf4e7816", 
             name = "Addison Auto Parts", 
             opening_hours = list(open_now = TRUE, weekday_text = list()), 
             place_id = "ChIJr1N7fxC5t4kRLABe0OVcF4E", rating = 4.7, 
             reference = "CmRbAAAAw7N5mVqBN4-0RXwoZ38VEvwXLXxZIih__1vR3J7zdr0dBxVMOw-V5EMB0YoRFVNbsaa7AfiE_YJyVP8q8JT_hsk0FHBvKDu3ONGc3Bm8C38DNk8rmzQhLKFJeoYs5_FCEhCB26vfpK6ZXonWG3LNHR3VGhQDlMQm6xDdQ55J0GAADzhmsIQBuA", 
             types = list("car_repair", "store", "point_of_interest", "establishment"))
        ), 
    status = "OK")

roomba::roomba(response, cols = c('name', 'place_id', 'lat', 'lng'), keep = any)
#> # A tibble: 8 x 4
#>   name                                      place_id             lat   lng
#>   <chr>                                     <chr>              <dbl> <dbl>
#> 1 Carquest Auto Parts - CQ of Washington DC ChIJr9kFdGzIt4kRU…  NA    NA  
#> 2 <NA>                                      <NA>                38.9 -77.0
#> 3 <NA>                                      <NA>                39.0 -77.0
#> 4 <NA>                                      <NA>                38.9 -77.0
#> 5 Addison Auto Parts                        ChIJr1N7fxC5t4kRL…  NA    NA  
#> 6 <NA>                                      <NA>                38.9 -77.0
#> 7 <NA>                                      <NA>                38.9 -76.9
#> 8 <NA>                                      <NA>                38.9 -77.0

This case can currently be hacked into shape with tidyr, but only by assuming that the locations returned are in the order they appear in the response and that the order in the response is consistent. These are pretty safe assumptions, but subsetting notation would make it all moot.

I see a few options for the API:

  1. It'd be nice to just be able to specify that I want location and either
    1. have that expanded to two columns or viewport to four, but particularly in the latter case, some auto-naming is necessary to avoid identically-named columns. Concatenating the element names with . or _ (location.lat, viewport_northeast_lat) would probably be fine (it can always be cleaned up later).
    2. just leave it as a list column that can be cleaned up manually (maybe as an option?), or
    3. in the particular case of geodata, coerce it to an sf column. This would be super-cool, but probably a bit of work to implement properly.
  2. Alternatively, or in addition, subsetting could achieve the same thing (albeit with more typing). Naming, whether automatic (location$lng to location_lng), manual (loc_lng = location$lng), or both, still needs to be addressed with this interface.
  3. I'm hesitant about tidyselect, as its helpers would be operating on combinations of names, not just one, and thus wouldn't mean quite the same thing as in dplyr. Maybe that's ok? Just stating location would already be unambiguous above, and would plausibly be expected to return a list column, or given what roomba does, such a column's unnested data. There's certainly room for more sophisticated selection, though, e.g. contains('lat') or ~purrr::has_element(.x, c(TRUE, FALSE)). An API for list subsetting is a pretty deep rabbit hole, though, which I'm assuming is why purrr::pluck hasn't been extended yet despite demand. It could easily be a package of its own if it could be organized comprehensibly.
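For the auto-naming in option 1, base R's `unlist()` already concatenates nested element names with `.`, so a minimal sketch could look like the following. The `geometry` list is copied from the first result in the response above; this is not how roomba works today, only one way the naming could be done.

```r
# Geometry of the first result in the example response above
geometry <- list(
  location = list(lat = 38.9499476, lng = -77.0274465),
  viewport = list(
    northeast = list(lat = 38.9513095798927, lng = -77.0259576201073),
    southwest = list(lat = 38.9486099201073, lng = -77.0286572798927)
  )
)

# unlist() joins nested names with ".", e.g. "viewport.northeast.lat"
flat <- unlist(geometry)
names(flat)

# Switch to "_" as the separator if parent_child names are preferred
names(flat) <- gsub(".", "_", names(flat), fixed = TRUE)

# One row, six uniquely named columns
geo_row <- as.data.frame(as.list(flat))
geo_row
```

Applying this to each element of `response$results` and row-binding the results would give one row per place, with the caveat that element names which themselves contain `.` would make the joined names ambiguous.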

@aedobbyn
Collaborator

@alistaire47 thanks for your comment and response example!

For what it's worth I like the idea of allowing the user to specify whether they want the data in nested or unnested format and, if unnested, representing column names in parent_child fashion.

@ChrisCioffi

Has this ever been resolved?

@aedobbyn
Collaborator

It has not -- feel free to submit a PR or throw out an idea for tackling it if you have one!

6 participants