Archive channel tree command [DRAFT] #2654

Open
wants to merge 6 commits into
base: master

ivanistheone
Contributor

Description

This is a POC for a "channel archiving" command that exports the complete channel tree as JSON.

Steps to Test

  • Run ./contentcuration/manage.py archivechanneltree {channel_id} for a {channel_id} that exists in the local DB.
  • Look at the output JSON file produced

Implementation Notes

At a high level, how did you implement this?

  • Added the helper function archive_channel_tree(channel_id, tree='main') in contentcuration/contentcuration/utils/archive.py
  • Added a management command archivechanneltree that calls this function.
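
To make the idea concrete, here is a minimal sketch of exporting a tree as JSON. It is not the code in this PR: the `Node` class, its fields, and `archive_tree` are hypothetical stand-ins for the Studio models and the new serializer.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical stand-in for a Studio content node; the real model has many more fields.
    node_id: str
    title: str
    kind: str
    children: List["Node"] = field(default_factory=list)


def serialize_node(node):
    """Recursively convert a node and its descendants into plain dicts."""
    return {
        "node_id": node.node_id,
        "title": node.title,
        "kind": node.kind,
        "children": [serialize_node(child) for child in node.children],
    }


def archive_tree(root, channel_id, out_dir):
    """Write the serialized tree to <out_dir>/<channel_id>.json and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, channel_id + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serialize_node(root), f, indent=2, ensure_ascii=False)
    return path


# Small in-memory tree standing in for a channel's main tree
root = Node("n0", "Sample channel", "topic", [
    Node("n1", "Intro video", "video"),
    Node("n2", "Exercises", "topic", [Node("n3", "Quiz 1", "exercise")]),
])
archive_path = archive_tree(root, "abc123", tempfile.mkdtemp())
```

The nested `children` lists mirror the tree structure directly, so the JSON file can be loaded back and walked without any DB access.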

Does this introduce any tech-debt items?

Since we're using a new serializer for this task, the fields of that serializer would have to be kept up to date as Studio data models evolve.

Checklist

  • Is the code clean and well-commented?
  • Are there tests for this change?
  • Are there any new ways this uses user data that needs to be factored into our Privacy Policy?
    • Maybe
      • Archived channels, if stored long term, will not be deleted when a user deletes their account.
      • If we serialize usernames (editors/viewers) this would mean they will persist even after deleting an account.
  • Need to consider long term storage requirements for these archives (e.g. save to new content/archives/ dir in a GCP bucket)

Comments

This is strictly a POC and not finished; it would need to be continued in order to make sure channel archives contain all the info needed for all possible use cases (e.g. is the info enough to "restore" a channel from archive?).

Reviewers

  • Jordan @jayoshih please take a look and see if it makes sense
  • Kevin @kollivier the channel archive json could potentially be used to "decouple" the Kolibri db publishing (exportchannel command) from the need to access studio DB (assuming all the necessary info is present in the archived

@codecov

codecov bot commented Dec 9, 2020

Codecov Report

Merging #2654 (a3f3cdb) into master (fb16568) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2654   +/-   ##
=======================================
  Coverage   85.39%   85.39%           
=======================================
  Files         298      298           
  Lines       15767    15767           
=======================================
  Hits        13465    13465           
  Misses       2302     2302           

ivanistheone added a commit to learningequality/treediffer that referenced this pull request Dec 9, 2020
@rtibbles
Member

rtibbles commented Feb 1, 2021

What's needed to help push this forward, @ivanistheone?

@ivanistheone ivanistheone changed the base branch from develop to master February 4, 2021 20:31
@ivanistheone
Contributor Author

For context, this PR was due to a misunderstanding on my part. When I heard Jordan was working on channel diff, I rushed to get the archive channel command and the associated detailed diff code ready so she could use it, but then I realized "channel diff" meant just the simpler "channel counts diff" and that detailed diff wasn't in scope, hence the pause on it.

That being said, it would be good to start archiving channel data, even if there is no frontend for it yet.

@rtibbles Here is a mini-list of possible next steps:

  • A. Confirm need/usefulness (will post a separate comment with use cases)
  • B. Finish TODOs (Ivan can support, provided not urgent)
  • C. Need in-house champion to review/supervise/land this PR
  • D. Decisions needed:
    • D1/ Where to store archives? (in GCP, but same bucket or new?)
    • D2/ Under what path do we store archives?
      Maybe content/archives/jsontrees/{channel_id}/{version}/{channel_id}.json ?
    • D3/ Should we store thumbnail_encoding in JSON jsontree archive?
      (recommendation: no, because it will require ~10x more storage)
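
For D2 and D3, a small sketch of what the path layout and the thumbnail stripping could look like. The path template is the one proposed above and still undecided; the function names and the nested-dict tree shape with a `children` key are assumptions for illustration.

```python
import copy


def jsontree_archive_path(channel_id, version):
    """Build the archive path proposed in D2 (layout not yet decided)."""
    return "content/archives/jsontrees/{cid}/{ver}/{cid}.json".format(
        cid=channel_id, ver=version
    )


def strip_thumbnail_encoding(node):
    """Return a copy of a nested-dict tree without any thumbnail_encoding fields."""
    cleaned = {
        key: copy.deepcopy(value)
        for key, value in node.items()
        if key not in ("thumbnail_encoding", "children")
    }
    if "children" in node:
        cleaned["children"] = [strip_thumbnail_encoding(c) for c in node["children"]]
    return cleaned


# Hypothetical archived tree with base64 thumbnails inlined on each node
tree = {
    "node_id": "n0",
    "title": "Sample channel",
    "thumbnail_encoding": {"base64": "AAAA"},
    "children": [
        {"node_id": "n1", "title": "Intro video", "thumbnail_encoding": {"base64": "BBBB"}},
    ],
}
slim_tree = strip_thumbnail_encoding(tree)
path = jsontree_archive_path("abc123", 7)
```

Stripping on a copy keeps the in-memory tree intact, so the same serialized data could still feed other consumers that want the thumbnails.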

Other related dev work:

  • Potential to refactor the export code to use the archived jsontrees channel instead of accessing the DB directly
  • Archive DB command: before publishing a new version .sqlite3 file,
    save a backup copy of the old version DB to content/archives/dbs/{channel_id}/{old_version}/{channel_id}.sqlite3
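
The DB backup step could look roughly like this. It is a sketch only: `backup_channel_db` is a hypothetical name, and a local temp directory stands in for the GCP bucket, which would need the storage backend's own copy call instead of `shutil`.

```python
import os
import shutil
import tempfile


def backup_channel_db(db_path, archive_root, channel_id, old_version):
    """Copy the old published DB to <archive_root>/dbs/<channel_id>/<old_version>/."""
    dest_dir = os.path.join(archive_root, "dbs", channel_id, str(old_version))
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, channel_id + ".sqlite3")
    shutil.copy2(db_path, dest_path)  # copy2 also preserves file timestamps
    return dest_path


# Demo: a placeholder file stands in for the previously published .sqlite3
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "abc123.sqlite3")
with open(db_path, "wb") as f:
    f.write(b"placeholder for the old published DB")

archive_root = os.path.join(workdir, "content", "archives")
backup_path = backup_channel_db(db_path, archive_root, "abc123", 3)
```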

I'm a bit out of the loop so cannot speak as to priority/timeframes, but happy to help out in free time on B. after A. (confirm this mgmt command is needed).

@ivanistheone
Contributor Author

Use cases

These were discussed a bit with Jordan and @kollivier as useful, but not sure if/when they would fit in roadmap:

1/ channeldiff task + command

import json

# diff() is assumed to come from treediffer (see the POC referenced below)
maintree_path = archive_channel_tree(channel_id, tree="main")
stagingtree_path = archive_channel_tree(channel_id, tree="staging")
with open(maintree_path) as f:
    maintree = json.load(f)
with open(stagingtree_path) as f:
    stagingtree = json.load(f)
main_staging_diff = diff(maintree, stagingtree, preset="studio")
# save to GCP bucket with public URL
# download diff from public URL (for content integration debug for Kevins and Vahids)

See standalone POC command-line code for this here: treediffer/examples/studiodiffferpoc.py

2/ channeldiff UI

Run the channeldiff task, then the frontend loads the diff from the GCP public URL and renders it nicely.

3/ archival

Not sure if we need to tackle this right now, since it requires consideration of scalability and long term user data retention. It would be nice to have a combined command archivechannel that does both archivechanneltree and archivechanneldb.

  • APPLICATION 3A: diff between vM and vN of a channel based on JSON-archived versions created after each version increment. Pitch: "See what new nodes have been added to channel X since you last viewed/edited/imported_from it"
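
A toy version of the "what's new since vM" question, assuming each archived node is a dict with `node_id`, `title`, and an optional `children` list (an assumed shape, not the PR's actual serializer output). This is not how treediffer works; it only finds added nodes and ignores moves, deletions, and edits.

```python
def collect_nodes(tree):
    """Map node_id -> title for every node in a nested-dict tree."""
    found = {tree["node_id"]: tree["title"]}
    for child in tree.get("children", []):
        found.update(collect_nodes(child))
    return found


def added_nodes(old_tree, new_tree):
    """Return the nodes present in new_tree whose node_id is absent from old_tree."""
    old_nodes = collect_nodes(old_tree)
    new_nodes = collect_nodes(new_tree)
    return {nid: title for nid, title in new_nodes.items() if nid not in old_nodes}


# Two hypothetical archived versions of the same channel
v1 = {"node_id": "n0", "title": "Sample channel", "children": [
    {"node_id": "n1", "title": "Intro video"},
]}
v2 = {"node_id": "n0", "title": "Sample channel", "children": [
    {"node_id": "n1", "title": "Intro video"},
    {"node_id": "n2", "title": "Exercises", "children": [
        {"node_id": "n3", "title": "Quiz 1"},
    ]},
]}
new_since_v1 = added_nodes(v1, v2)
```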

4/ PUBLISH/EXPORT Kolibri DB from studio JSON archive tree

Instead of export.py being based on direct access to the DB, Kolibri-DB creation can be an independent task with input studio_tree_archive.json --> Kolibri DB (plus fetching perseus files if needed).

  • BONUS 4R: possibility to reuse same PUBLISH code in Ricecooker (previously discussed)
  • BONUS 4S: possibility to PUBLISH a "preview" Kolibri channel from staging or ricecooker trees of a channel (without ACTIVATing it first)
  • BONUS 4K: possibility to reuse same PUBLISH code in Kolibri (?)
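
To illustrate the decoupling, here is a sketch that builds a SQLite DB from an archived tree with no Studio DB access. The `demo_node` table is a made-up, simplified schema for illustration only; it is not Kolibri's actual content DB schema, and the nested-dict tree shape is likewise assumed.

```python
import sqlite3


def tree_to_db(tree, conn):
    """Flatten a nested-dict tree into rows of a simplified demo table."""
    conn.execute(
        "CREATE TABLE demo_node ("
        "id TEXT PRIMARY KEY, parent_id TEXT, title TEXT, kind TEXT, sort_order INTEGER)"
    )

    def insert(node, parent_id, sort_order):
        conn.execute(
            "INSERT INTO demo_node VALUES (?, ?, ?, ?, ?)",
            (node["node_id"], parent_id, node["title"], node["kind"], sort_order),
        )
        for index, child in enumerate(node.get("children", [])):
            insert(child, node["node_id"], index)

    insert(tree, None, 0)
    conn.commit()


# Hypothetical archived tree, as it might be loaded from studio_tree_archive.json
tree = {"node_id": "n0", "title": "Sample channel", "kind": "topic", "children": [
    {"node_id": "n1", "title": "Intro video", "kind": "video"},
    {"node_id": "n2", "title": "Exercises", "kind": "topic", "children": [
        {"node_id": "n3", "title": "Quiz 1", "kind": "exercise"},
    ]},
]}
conn = sqlite3.connect(":memory:")
tree_to_db(tree, conn)
node_count = conn.execute("SELECT COUNT(*) FROM demo_node").fetchone()[0]
quiz_parent = conn.execute(
    "SELECT parent_id FROM demo_node WHERE id = ?", ("n3",)
).fetchone()[0]
```

Because the only input is the JSON tree, the same function could in principle run against a staging or ricecooker tree, which is what makes the 4R and 4S ideas plausible.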

5/ content provenance

All the expensive "graph analytics", like finding which channels a given channel imports from, can be done easily based on the channel archive JSON
https://github.com/fle-internal/content-provenance/blob/master/scripts/import_provenance.py#L23-L76
and generate "what is in this channel" visualizations
see http://design-sprint.learningequality.org/importcounts/
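
As a rough example of this kind of analysis, the sketch below counts imported nodes per source channel from an archived tree. It assumes each archived node carries a `source_channel_id` field and uses a nested `children` list; both are assumptions about the archive format, and `count_imports` is a hypothetical helper, not the linked script.

```python
from collections import Counter


def count_imports(tree, channel_id):
    """Count nodes per source channel, skipping nodes that originate in channel_id itself."""
    counts = Counter()

    def visit(node):
        source = node.get("source_channel_id")
        if source and source != channel_id:
            counts[source] += 1
        for child in node.get("children", []):
            visit(child)

    visit(tree)
    return counts


# Hypothetical archived tree for channel "abc123" with nodes imported from two other channels
tree = {"node_id": "n0", "source_channel_id": "abc123", "children": [
    {"node_id": "n1", "source_channel_id": "khan"},
    {"node_id": "n2", "source_channel_id": "abc123", "children": [
        {"node_id": "n3", "source_channel_id": "khan"},
        {"node_id": "n4", "source_channel_id": "ck12"},
    ]},
]}
import_counts = count_imports(tree, "abc123")
```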

6/ ROC data importer

Not needed for ROC prototype, but good to have full Studio data (including provenance)
https://rocdata.readthedocs.io/en/latest/importers/kolibri_studio.html
