generate fuller (no enhancements) transform map for CAP organizations #5
I'm still at a bit of a loss as to how to get Traject to parse objects of the shape I have (a tree, where every object may have 0 or more children):

```json
{
  "name": "whatever",
  "children": [
    {
      "name": "foo",
      "children": [ ... ]
    }
  ]
}
```

It's unclear to me what seam Traject provides that lets me say "descend into the children."
One option would be to define a reader class that recursively walks through each object. A reader has a simple interface (and probably shouldn't have any interface other than being an iterator -- I've been thinking along those lines). For now, you could do something like this:

```ruby
require 'json'

class MyJsonHierarchyReader
  # @param [#each] input_stream Probably an open file, but any iterator will do
  #   so long as it returns a valid JSON object from #each
  def initialize(input_stream, settings)
    # ... whatever you need to do. Let's pretend it's
    # a newline-delimited JSON file, since you didn't
    # specify anything
    @json_lines = input_stream
  end

  def each(&blk)
    @json_lines.each do |jline|
      h = JSON.parse(jline)
      recursive_yield(h, blk)
    end
  end

  # Do a depth-first crawl
  def recursive_yield(h, blk)
    doc = build_a_doc(h)
    blk.call doc
    Array(h['children']).each do |child|
      recursive_yield(child, blk)
    end
  end

  def build_a_doc(h)
    { "my_id" => h['id'] }
  end
end
```

I made some sample data at this gist. Of course, if your children need to know something about the parents, you'll have to inject that data as you crawl down the tree, but that shouldn't be hard. If you wanna post a real record or two, it might help.
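The "inject parent data as you crawl" idea can be sketched by threading the parent's name through the recursive call. This is my own illustrative variant, not code from the thread; the class name and doc keys are made up:

```ruby
require 'json'

# Hypothetical variant of the reader above: each yielded doc carries its
# parent's name, so children keep context from further up the tree.
class ParentAwareHierarchyReader
  def initialize(input_stream)
    @json_lines = input_stream # newline-delimited JSON, one tree per line
  end

  def each(&blk)
    @json_lines.each do |jline|
      recursive_yield(JSON.parse(jline), nil, blk)
    end
  end

  # Depth-first crawl, injecting the parent's name as we descend
  def recursive_yield(h, parent_name, blk)
    blk.call({ 'name' => h['name'], 'parent' => parent_name })
    Array(h['children']).each do |child|
      recursive_yield(child, h['name'], blk)
    end
  end
end

docs = []
input = ['{"name":"root","children":[{"name":"foo","children":[{"name":"bar"}]}]}']
ParentAwareHierarchyReader.new(input).each { |d| docs << d }
# docs now holds root, then foo (parent "root"), then bar (parent "foo")
```

The same pattern works for any ancestor data (ids, full paths); just widen what gets passed down in the recursive call.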
And here's a working traject setup.
That's your input? And you aren't using it at all with traject yet, but would like to? Yep, you need a custom reader, and bill's outline seems about right, if that is doing what you want.

If your input files will be at all large, I'd try to parse them in a streaming fashion (can you do that with JSON?); if you yield them streaming, traject will keep everything streaming and avoid loading everything into memory at once. If you need to access multiple levels at once... it might get trickier, and also involve a custom writer that knows it'll be writing more than one record (maybe just taking an array of output and delegating each one to an existing writer).

I hadn't known TrajectPlus was a thing; just seeing it now for the first time. Looks like some pretty good stuff. Would it make sense to link to it in the Traject README? And/or merge some of it into traject, as optional classes/modules to use to get the behavior?
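The array-delegating writer idea above can be sketched roughly like this. The names are hypothetical (not Traject's actual classes); the only contract assumed is that a writer responds to `#put` and `#close`:

```ruby
# Hypothetical wrapper writer: when a pipeline emits an Array of records
# instead of a single one, hand each element to an existing writer.
class ArrayDelegatingWriter
  def initialize(inner_writer)
    @inner = inner_writer
  end

  def put(record_or_records)
    if record_or_records.is_a?(Array)
      record_or_records.each { |rec| @inner.put(rec) }
    else
      @inner.put(record_or_records)
    end
  end

  def close
    @inner.close
  end
end

# A trivial in-memory writer to demonstrate the delegation
class CollectingWriter
  attr_reader :records

  def initialize
    @records = []
  end

  def put(rec)
    @records << rec
  end

  def close; end
end

inner = CollectingWriter.new
writer = ArrayDelegatingWriter.new(inner)
writer.put({ 'id' => 1 })
writer.put([{ 'id' => 2 }, { 'id' => 3 }])
writer.close
# inner.records now holds three individual records
```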
TrajectPlus is relatively new and is a convenient place to experiment with some new(-to-us) patterns, and we'll definitely continue to push the useful ideas upstream (like traject#145). I suspect some of the macros (…)
@mjgiarlo I liked option 3 as a "just get the proof-of-concept stuff done to test ourselves" approach. But with bill's recommendations above, it may be better to go with step 4 (starting with the JSON reader extension, maybe specific to the CAP output, like a CapJsonReader) until we get a bit further with this and other data sources and can look toward generalization.
@billdueber we're not sure we can share the data from the sources yet (see #7), but thanks for the rec -- we had spitballed something like this, but it's great to see your input.
Awesome, thanks @cbeer, that makes sense. It looks like for most of them you don't need to monkey-patch existing traject; they are just new readers/writers, or new macros (which could be provided as options you …). I agree it makes sense to try em out separately until they are proven, but I'd def be interested in incorporating em upstream. In the meantime, maybe a link in the traject README to the experimental/in-development options in TrajectPlus might make sense.

Eventually, if we have solid support for XML and other non-MARC features, I've been thinking for a while of a refactor of traject where the MARC-specific stuff is not in the base class, so you don't have it if you don't need it. That would probably require a major-version bump, and some thought about how to still make the command line super convenient to use (perhaps sub-commands like …).
Wow, thanks all! I wrote up the three options above, and then about ten minutes later I wound up going down the road @billdueber suggested, though to start I'm just using …
It's definitely totally fine to write a special-purpose, project-specific reader; nothing wrong with that. It's an intended seam, and should be forwards-compatible for a long time.
Just recording here a convo from Slack: @mjgiarlo I really like that approach here for the modularity of RIALTO source loaders.
I've got this working, thanks to y'all. Pushing my branch momentarily. Not closing this issue until we also incorporate the work in #1.
Closes #4. Refs #5.

* Updates TrajectPlus to fix #4
* Adds the first transformer, taking StanfordOrganizations data and transforming it into mapped JSON (not 100% complete: see also #10, #12)
* Adds a custom JSON reader for StanfordOrganizations data
* Adds a translation map for Stanford organization types, mapping them onto VIVO types
* Updates README to document the transform command
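For context, a Traject translation map in Ruby form is just a file whose last expression is a Hash, looked up by filename from a `translation_maps` directory. A sketch of what the organization-type map might look like -- the keys and VIVO class URIs below are illustrative placeholders, not the actual mapping from this commit:

```ruby
# lib/translation_maps/stanford_organization_types.rb
# Illustrative entries only -- real keys/URIs would come from the CAP
# data and the VIVO ontology.
{
  'DEPARTMENT' => 'http://vivoweb.org/ontology/core#AcademicDepartment',
  'SCHOOL'     => 'http://vivoweb.org/ontology/core#School',
  'DIVISION'   => 'http://vivoweb.org/ontology/core#Division'
}
```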
I'm going to look now, add stuff from #1, and get this wrapped up.
@jrochkind @billdueber @cmh2166 Here's what I wound up with, btw: da12a15#diff-db585871001355a5ca6af51c1c774eba
@mjgiarlo I don't entirely understand what's going on there, but I like that it's not very many lines of code and no monkey-patching or anything, so that suggests it's looking good to me! :) I don't totally understand how you use traject to write to RDF, which is what seems to be going on here?

In general, I'd like to understand y'all's use cases better to see if I can make core traject support them better in a 3.0 release. @mjgiarlo would you have time to chat synchronously/voice some time about your use cases?
@mjgiarlo Great. I'll try to review my work on the fuller MAP and tests from before break. Need to catch up on the RIALTO work status here sometime this weekend.
Since the blocking issues are closed and I've done some work to update the mapping, I'm going to close this issue. Let's open new issues for specific mapping implementation changes as we go. Thanks, all. |
blocked by #1 and by #4