generate fuller (no enhancements) transform map for CAP organizations #5
I'm still at a bit of a loss as to how to get Traject to parse objects of the shape I have (a tree, where every object may have 0 or more children):

```json
{
  "name": "whatever",
  "children": [
    {
      "name": "foo",
      "children": [ ... ]
    }
  ]
}
```

It's unclear to me what seam Traject provides that lets me say "descend into the children."
One option would be to define a reader class that recursively walks through each object. A reader has a simple interface (and probably shouldn't have any interface other than being an iterator -- I've been thinking along those lines). For now, you could do something like this:

```ruby
require 'json'

class MyJsonHierarchyReader
  # @param [#each] input_stream Probably an open file, but any iterator will do
  #   so long as it returns a valid JSON object from #each
  def initialize(input_stream, settings)
    # ... whatever you need to do. Let's pretend it's
    # a newline-delimited JSON file, since you didn't
    # specify anything
    @json_lines = input_stream
  end

  def each(&blk)
    @json_lines.each do |jline|
      h = JSON.parse(jline)
      recursive_yield(h, blk)
    end
  end

  # Do a depth-first crawl
  def recursive_yield(h, blk)
    doc = build_a_doc(h)
    blk.call doc
    Array(h['children']).each do |child|
      recursive_yield(child, blk)
    end
  end

  def build_a_doc(h)
    { "my_id" => h['id'] }
  end
end
```

I made some sample data at this gist. Of course, if your children need to know something about the parents, you'll have to inject that data as you crawl down the tree, but that shouldn't be hard. If you wanna post a real record or two, it might help.
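The "inject parent data as you crawl" idea can be sketched by threading the parent's name through the recursive call. This is my own illustrative variant, not code from the thread; the class name and doc keys are made up:

```ruby
require 'json'

# Hypothetical variant of the reader above: each yielded doc carries its
# parent's name, so children keep context from further up the tree.
class ParentAwareHierarchyReader
  def initialize(input_stream)
    @json_lines = input_stream # newline-delimited JSON, one tree per line
  end

  def each(&blk)
    @json_lines.each do |jline|
      recursive_yield(JSON.parse(jline), nil, blk)
    end
  end

  # Depth-first crawl, injecting the parent's name as we descend
  def recursive_yield(h, parent_name, blk)
    blk.call({ 'name' => h['name'], 'parent' => parent_name })
    Array(h['children']).each do |child|
      recursive_yield(child, h['name'], blk)
    end
  end
end

docs = []
input = ['{"name":"root","children":[{"name":"foo","children":[{"name":"bar"}]}]}']
ParentAwareHierarchyReader.new(input).each { |d| docs << d }
# docs now holds root, then foo (parent "root"), then bar (parent "foo")
```

The same pattern works for any ancestor data (ids, full paths); just widen what gets passed down in the recursive call.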
And here's a working traject setup.
That's your input? And you aren't using it at all with traject yet, but would like to? Yep, you need a custom reader, and bill's outline seems about right, if that is doing what you want.

If your input files will be at all large, I'd try to parse them in a streaming fashion (can you do that with JSON?); if you yield them streaming, traject will keep everything streaming and avoid loading everything into memory at once. If you need to access multiple levels at once... it might get trickier, and also involve a custom writer that knows it'll be writing more than one record (maybe just taking an array of output and delegating each one to an existing writer).

I hadn't known TrajectPlus was a thing; just seeing it now for the first time. Looks like some pretty good stuff. Would it make sense to link to it in the Traject README? And/or merge some of it into traject, as optional classes/modules to use to get the behavior?
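The array-delegating writer idea above can be sketched roughly like this. The names are hypothetical (not Traject's actual classes); the only contract assumed is that a writer responds to `#put` and `#close`:

```ruby
# Hypothetical wrapper writer: when a pipeline emits an Array of records
# instead of a single one, hand each element to an existing writer.
class ArrayDelegatingWriter
  def initialize(inner_writer)
    @inner = inner_writer
  end

  def put(record_or_records)
    if record_or_records.is_a?(Array)
      record_or_records.each { |rec| @inner.put(rec) }
    else
      @inner.put(record_or_records)
    end
  end

  def close
    @inner.close
  end
end

# A trivial in-memory writer to demonstrate the delegation
class CollectingWriter
  attr_reader :records

  def initialize
    @records = []
  end

  def put(rec)
    @records << rec
  end

  def close; end
end

inner = CollectingWriter.new
writer = ArrayDelegatingWriter.new(inner)
writer.put({ 'id' => 1 })
writer.put([{ 'id' => 2 }, { 'id' => 3 }])
writer.close
# inner.records now holds three individual records
```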
TrajectPlus is relatively new and is a convenient place to experiment with some new(-to-us) patterns, and we'll definitely continue to push the useful ideas upstream (like traject#145). I suspect some of the macros (…)
@mjgiarlo I liked option 3 as a "just get the proof-of-concept stuff done to test ourselves" approach. But with bill's recommendations above, it may be better to go with step 4 (starting with the JSON reader extension, maybe specific to the CAP output, like a CapJsonReader) until we get a bit further with this and other data sources and can look toward generalization.
@billdueber we're not sure we can share the data from the sources yet (see #7), but thanks for the rec -- we had spitballed something like this, but it's great to see your input.
Awesome, thanks @cbeer, that makes sense. It looks like for most of them you don't need to monkey-patch existing traject; they are just new readers/writers, or new macros (which could be provided as options you …). I agree it makes sense to try em out separately until they are proven, but I'd def be interested in incorporating em upstream. In the meantime, maybe a link in the traject README to the experimental/in-development options in TrajectPlus might make sense.

Eventually, if we have solid support for XML and other non-MARC features, I've been thinking for a while of a refactor of traject where the MARC-specific stuff is not in the base class, so you don't have it if you don't need it. That would probably require a major-version bump, and some thought about how to still make the command line super convenient to use (perhaps sub-commands like …).
Wow, thanks all! I wrote up the three options above, and then about ten minutes later I wound up going down the road @billdueber suggested, though to start I'm just using …
It's definitely totally fine to write a special-purpose, project-specific reader; nothing wrong with that. It's an intended seam, and should be forwards-compatible for a long time.
Just recording here a convo from Slack: @mjgiarlo I really like that approach here for the modularity of RIALTO source loaders.
I've got this working, thanks to y'all. Pushing my branch momentarily. Not closing this issue until we also incorporate the work in #1.
Closes #4. Refs #5.

* Updates TrajectPlus to fix #4
* Adds the first transformer, taking StanfordOrganizations data and transforming it into mapped JSON (not 100% complete: see also #10, #12)
* Adds a custom JSON reader for StanfordOrganizations data
* Adds a translation map for Stanford organization types, mapping them onto VIVO types
* Updates README to document the transform command
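For context, a Traject translation map in Ruby form is just a file whose last expression is a Hash, looked up by filename from a `translation_maps` directory. A sketch of what the organization-type map might look like -- the keys and VIVO class URIs below are illustrative placeholders, not the actual mapping from this commit:

```ruby
# lib/translation_maps/stanford_organization_types.rb
# Illustrative entries only -- real keys/URIs would come from the CAP
# data and the VIVO ontology.
{
  'DEPARTMENT' => 'http://vivoweb.org/ontology/core#AcademicDepartment',
  'SCHOOL'     => 'http://vivoweb.org/ontology/core#School',
  'DIVISION'   => 'http://vivoweb.org/ontology/core#Division'
}
```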
I'm going to look now, add stuff from #1, and get this wrapped up.
@jrochkind @billdueber @cmh2166 Here's what I wound up with, btw: da12a15#diff-db585871001355a5ca6af51c1c774eba
@mjgiarlo I don't entirely understand what's going on there, but I like that it's not very many lines of code and no monkey-patching or anything, so that suggests it's looking good to me! :) I don't totally understand how you use traject to write to RDF, which is what seems to be going on here?

In general, I'd like to understand y'all's use cases better to see if I can make core traject support them better in a 3.0 release. @mjgiarlo would you have time to chat synchronously/voice some time about your use cases?
@mjgiarlo Great. I'll try to review my work on the fuller MAP and tests from before break. Need to catch up on the RIALTO work status here sometime this weekend.
Since the blocking issues are closed and I've done some work to update the mapping, I'm going to close this issue. Let's open new issues for specific mapping implementation changes as we go. Thanks, all. |
blocked by #1 and by #4