Working across different levels of nesting? #69

cboettig · 2018-05-24T18:34:58Z

This package is very exciting, great work! Our dataspice team was thrilled to see how nicely it already handles common requests on a dataspice.json file, e.g.:

library(roomba)
json <- jsonlite::read_json("https://raw.githubusercontent.com/ropenscilabs/dataspice/master/inst/metadata-tables/dataspice.json")

## Works nicely when all columns come from same level of nesting:
json %>% roomba(c("givenName", "familyName"))
json %>% roomba(c("value", "unitText", "description"))

But it is not clear how to get a repeated column from a different level of nesting:

json %>% roomba(c("value", "unitText", "description", "box"))

This probably requires some kind of notation to indicate the different levels. In this case, would something like the following be possible?

json %>% roomba(c("variableMeasured.value",
                  "variableMeasured.unitText",
                  "variableMeasured.description",
                  "spatialCoverage.geo.box"))
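A minimal sketch of how a dotted path like the ones proposed above could be resolved against a nested list. The helper `pluck_path()` and the toy `meta` list are hypothetical and not part of roomba. The sketch only follows named elements, so it does not handle unnamed JSON arrays such as the repeated variableMeasured entries.

```r
# Hypothetical helper (not part of roomba): walk a nested list along a dotted path.
pluck_path <- function(x, path, sep = ".") {
  keys <- strsplit(path, sep, fixed = TRUE)[[1]]
  for (key in keys) {
    # Stop with NULL as soon as a segment cannot be followed
    if (!is.list(x) || is.null(x[[key]])) return(NULL)
    x <- x[[key]]
  }
  x
}

# Toy stand-in for a nested metadata list (values are made up)
meta <- list(
  spatialCoverage = list(geo = list(box = "toy-box-value")),
  variableMeasured = list(unitText = "toy-unit")
)

found <- pluck_path(meta, "spatialCoverage.geo.box")
missing_path <- pluck_path(meta, "variableMeasured.missing")
```

Here `found` holds the box value and `missing_path` is NULL. One drawback of `.` as a separator is that it clashes with element names that themselves contain dots.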

jimhester · 2018-05-24T18:50:13Z

Perhaps it could use variableMeasured$unitText? We had also briefly discussed using tidyeval so you wouldn't have to quote variable names, which would go along with this idea.

aedobbyn · 2018-05-25T15:34:13Z

@cboettig glad it's making the tidying of dataspice JSON a bit easier! I think you've hit upon what is currently the main deficiency of the package. I like @jimhester's solution of allowing for unquoted cols inputs along with the $ syntax.

Our original thinking was to provide a function that returns all names in the list at any level (e.g. variableMeasured$unitText, variableMeasured$description, spatialCoverage$geo$box, etc.) so that the user can select some subset of those for cols.

Since that vector of names will be quite long, we might also want to provide a way for the user to ask for all levels of nesting under a certain name, maybe in a way like dplyr::starts_with(variableMeasured).
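A rough sketch of what such a name-listing function and prefix filter might look like, in base R. `list_paths()` and the toy `meta` list are hypothetical, not existing roomba functions, and the filter uses base `startsWith()` rather than a tidyselect helper.

```r
# Hypothetical helper (not part of roomba): collect the $-separated path of every named leaf.
list_paths <- function(x, prefix = NULL) {
  if (!is.list(x)) return(prefix)
  nms <- names(x)
  out <- character(0)
  for (i in seq_along(x)) {
    # Unnamed elements (JSON arrays) do not add a path segment
    has_name <- !is.null(nms) && nzchar(nms[i])
    child <- if (has_name) paste(c(prefix, nms[i]), collapse = "$") else prefix
    out <- c(out, list_paths(x[[i]], child))
  }
  unique(out)
}

# Toy stand-in for a nested metadata list (values are made up)
meta <- list(
  variableMeasured = list(
    list(unitText = "m", description = "depth"),
    list(unitText = "C", description = "temperature")
  ),
  spatialCoverage = list(geo = list(box = "toy-box-value"))
)

paths <- list_paths(meta)

# Ask for everything nested under one name
under_vm <- paths[startsWith(paths, "variableMeasured")]
```

With this toy input, `paths` contains variableMeasured$unitText, variableMeasured$description, and spatialCoverage$geo$box, and `under_vm` keeps only the first two.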

cstawitz · 2018-05-25T17:36:05Z

I like @aedobbyn's suggestion to use dplyr::starts_with() to keep the variable argument from getting too clunky.

Maybe this deserves its own issue, but another complexity of multiple levels of nesting is different numbers of rows being returned. For example, the twitter data:

library(roomba)
data("twitter_data")
> twitter_data[[1]]$entities$urls[[1]]$indices
[[1]]
[1] 117

[[2]]
[1] 140

In this case, two values at the urls level correspond to one value at the entities level. Right now this happens:

> roomba(twitter_data, c("name","indices"))
Error in bind_rows_(x, .id) : Argument 2 must be length 1, not 2

My thought is that we want some kind of long format, i.e.:

> roomba(twitter_data, c("name","indices"))
# A tibble: 2 x 2
                                 name          entities$urls$indices
                                <chr>          <chr>
1                    Code for America            117
2                    Code for America            140
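One way the long format could be produced, sketched in base R under the assumption that every field in a record has either one value or the same larger number of values. `to_long()` and the toy `record` are hypothetical and not how roomba works today.

```r
# Toy record shaped like the example above: one name, two index values
record <- list(name = "Code for America", indices = list(117, 140))

# Hypothetical helper (not roomba's implementation):
# repeat length-1 fields so every column matches the longest field.
to_long <- function(rec) {
  cols <- lapply(rec, unlist)
  n <- max(lengths(cols))
  # Only lengths 1 and n are supported; anything else is ambiguous
  ok <- vapply(cols, function(v) length(v) %in% c(1L, n), logical(1))
  stopifnot(all(ok))
  as.data.frame(lapply(cols, rep_len, length.out = n), stringsAsFactors = FALSE)
}

long <- to_long(record)
```

`long` has two rows, with the name repeated next to each index value. Fields of differing lengths greater than one would need an explicit rule (for example a cross join or padding with NA), which this sketch deliberately refuses to guess.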

alistaire47 · 2018-06-09T16:03:55Z

Thanks to the authors; this is a great idea for a package! Another nested geodata example from the Google Places API returns both the location lat/lon and the viewport (which itself has two lat/lon pairs):

response <- list(
    html_attributions = list(),
    results = list(
        list(formatted_address = "4900 Georgia Ave NW, Washington, DC 20011, USA", 
             geometry = list(location = list(lat = 38.9499476, lng = -77.0274465), 
                             viewport = list(northeast = list(lat = 38.9513095798927, lng = -77.0259576201073), 
                                             southwest = list(lat = 38.9486099201073, lng = -77.0286572798927))), 
             icon = "https://maps.gstatic.com/mapfiles/place_api/icons/shopping-71.png", 
             id = "af15897f117bc7e03dfdfbd42d728b49f1e89d9e", 
             name = "Carquest Auto Parts - CQ of Washington DC", 
             opening_hours = list(open_now = TRUE, weekday_text = list()), 
             photos = list(list(
                 height = 399L, 
                 html_attributions = list("<a href=\"https://maps.google.com/maps/contrib/114756074981885904642/photos\">CARQUEST Auto Parts # 6360</a>"), 
                 photo_reference = "CmRaAAAAn9x83EO46LLgExGB549kblIzfcUsr0YMfvesTcb2wypsF4AItXPTgOj8CsmSm93H7AZTZhkbcq27-_VzSGmsKRK1jcQxwyQ5waTJD4WgH5uQR2OnzVmMJEhTvBkBpD5cEhCISyoZwUC_2cZ9mPN7eCZuGhRVDN2IS7dbVqNVf55RABimFGaneA", 
                 width = 600L)), 
             place_id = "ChIJr9kFdGzIt4kRUfOVZB-HkpI", 
             rating = 4.1, 
             reference = "CmRbAAAAhdTteLADmBP9aoOayLlTVi2uF_Q7Vjv5txpuvtiKpkziA9z_5wKMdjM1kK6hNDEJBLIaHlyDnWZfKdT5X_Fsy6B9B_niHLxeQqbqF5jfV9snXpAxks57TmpJiRgAXSorEhCJFRfMPTis725paCKJHgMXGhSDxyEK0-SvzzBooeKcuW20CHJOpg", 
             types = list("car_repair", "store", "point_of_interest", 
                          "establishment")), 
        list(formatted_address = "3908 Pennsylvania Ave SE, Washington, DC 20020, USA", 
             geometry = list(location = list(lat = 38.8658696, lng = -76.9502563), 
                             viewport = list(northeast = list(lat = 38.8669400798927, lng = -76.9490863201073), 
                                             southwest = list(lat = 38.8642404201073, lng = -76.9517859798927))), 
             icon = "https://maps.gstatic.com/mapfiles/place_api/icons/shopping-71.png", 
             id = "933f517d7634fff163e2564804c6ff88cf4e7816", 
             name = "Addison Auto Parts", 
             opening_hours = list(open_now = TRUE, weekday_text = list()), 
             place_id = "ChIJr1N7fxC5t4kRLABe0OVcF4E", rating = 4.7, 
             reference = "CmRbAAAAw7N5mVqBN4-0RXwoZ38VEvwXLXxZIih__1vR3J7zdr0dBxVMOw-V5EMB0YoRFVNbsaa7AfiE_YJyVP8q8JT_hsk0FHBvKDu3ONGc3Bm8C38DNk8rmzQhLKFJeoYs5_FCEhCB26vfpK6ZXonWG3LNHR3VGhQDlMQm6xDdQ55J0GAADzhmsIQBuA", 
             types = list("car_repair", "store", "point_of_interest", "establishment"))
        ), 
    status = "OK")

roomba::roomba(response, cols = c('name', 'place_id', 'lat', 'lng'), keep = any)
#> # A tibble: 8 x 4
#>   name                                      place_id             lat   lng
#>   <chr>                                     <chr>              <dbl> <dbl>
#> 1 Carquest Auto Parts - CQ of Washington DC ChIJr9kFdGzIt4kRU…  NA    NA  
#> 2 <NA>                                      <NA>                38.9 -77.0
#> 3 <NA>                                      <NA>                39.0 -77.0
#> 4 <NA>                                      <NA>                38.9 -77.0
#> 5 Addison Auto Parts                        ChIJr1N7fxC5t4kRL…  NA    NA  
#> 6 <NA>                                      <NA>                38.9 -77.0
#> 7 <NA>                                      <NA>                38.9 -76.9
#> 8 <NA>                                      <NA>                38.9 -77.0

This case can currently be hacked into shape with tidyr, but only by assuming that the locations returned are in the order they appear in the response and that the order in the response is consistent. These are pretty safe assumptions, but subsetting notation would make it all moot.

I see a few options for the API:

It'd be nice to just be able to specify that I want location and either
1. have that expanded to two columns or viewport to four, but particularly in the latter case, some auto-naming is necessary to avoid identically named columns. Concatenating the element names with . or _ (location.lat, viewport_northeast_lat) would probably be fine (it can always be cleaned up later),
2. just leave it as a list column that can be cleaned up manually (maybe as an option?), or
3. in the particular case of geodata, coerce it to an sf column. This would be super-cool, but probably a bit of work to implement properly.

Alternatively/in addition, subsetting could achieve the same thing (albeit with more typing). Naming, whether automatic (location$lng to location_lng), manual (loc_lng = location$lng), or both, still needs to be addressed with this interface.

I'm hesitant about tidyselect, as they're operating on combinations of names, not just one, and thus wouldn't mean quite the same thing as in dplyr. Maybe that's ok? Just stating location would already be unambiguous above, and would plausibly be expected to return a list column, or given what roomba does, such a column's unnested data. There's certainly room for more sophisticated selection, though, e.g. contains('lat') or ~purrr::has_element(.x, c(TRUE, FALSE)). An API for list subsetting is a pretty deep rabbit hole, though, which I'm assuming is why purrr::pluck hasn't been extended yet despite demand. It could easily be a package of its own if it could be organized comprehensibly.
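For the auto-naming option, base R's `unlist()` already concatenates nested element names with `.`, so a sketch of the expanded columns needs very little code. The `geometry` list below copies the first result's geometry from the response above. Turning the result into a one-row data frame is an illustration only, not roomba's behaviour.

```r
geometry <- list(
  location = list(lat = 38.9499476, lng = -77.0274465),
  viewport = list(
    northeast = list(lat = 38.9513095798927, lng = -77.0259576201073),
    southwest = list(lat = 38.9486099201073, lng = -77.0286572798927)
  )
)

# unlist() flattens the nesting and joins the names with "."
flat <- unlist(geometry)
dot_names <- names(flat)

# Switch to parent_child style names; note this also rewrites any
# dots that were part of an original element name
names(flat) <- gsub(".", "_", dot_names, fixed = TRUE)

# One row, one column per leaf
row <- as.data.frame(as.list(flat))
```

`dot_names` starts with location.lat and location.lng, and `row` ends up with six uniquely named columns such as viewport_northeast_lat, which avoids the clash between the three lat/lng pairs.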

aedobbyn · 2018-06-10T00:50:25Z

@alistaire47 thanks for your comment and response example!

For what it's worth I like the idea of allowing the user to specify whether they want the data in nested or unnested format and, if unnested, representing column names in parent_child fashion.

ChrisCioffi · 2019-04-10T18:50:05Z

Has this ever been resolved?

aedobbyn · 2019-04-10T20:35:56Z

It has not -- feel free to submit a PR or throw out an idea for tackling it if you have one!
