Archive channel tree command [DRAFT] #2654

ivanistheone · 2020-12-09T15:45:20Z

Description

This is a POC for a "channel archiving" command that exports the complete channel tree as JSON.

Steps to Test

Run ./contentcuration/manage.py archivechanneltree {channel_id} for a {channel_id} that exists in the local DB.
Inspect the JSON file the command produces.

Implementation Notes

At a high level, how did you implement this?

Added the helper function archive_channel_tree(channel_id, tree='main') in contentcuration/contentcuration/utils/archive.py
Added a management command archivechanneltree that calls this function.
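As a rough illustration of the approach described above, the sketch below walks a tree recursively and writes it out as JSON. It does not use Studio's models or the new serializer from this PR: `Node`, `serialize_node`, and `archive_tree_to_json` are hypothetical stand-ins, and the field names are assumptions.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical stand-in for a Studio content node; field names are assumptions.
    node_id: str
    title: str
    kind: str
    children: List["Node"] = field(default_factory=list)


def serialize_node(node: Node) -> dict:
    """Recursively convert a node and all of its descendants to plain dicts."""
    return {
        "node_id": node.node_id,
        "title": node.title,
        "kind": node.kind,
        "children": [serialize_node(child) for child in node.children],
    }


def archive_tree_to_json(channel_id: str, root: Node, out_dir: str) -> str:
    """Write the serialized tree to <out_dir>/<channel_id>.json and return the path."""
    path = os.path.join(out_dir, f"{channel_id}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serialize_node(root), f, indent=2, ensure_ascii=False)
    return path


# Build a tiny example tree and archive it to a temporary directory.
root = Node("n0", "Sample channel", "topic", [
    Node("n1", "Intro video", "video"),
    Node("n2", "Unit 1", "topic", [Node("n3", "Quiz", "exercise")]),
])
archive_path = archive_tree_to_json("abc123", root, tempfile.mkdtemp())
with open(archive_path, encoding="utf-8") as f:
    archived = json.load(f)
```

Returning the output path mirrors how the helper is used in the use cases later in this thread, where the result is loaded back with `json`.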

Does this introduce any tech-debt items?

Since we're using a new serializer for this task, the fields of that serializer will have to be kept up to date as Studio data models evolve.

Checklist

Is the code clean and well-commented?
Are there tests for this change?
Are there any new ways this uses user data that needs to be factored into our Privacy Policy?
- Maybe
  - Archived channels, if stored long term, will not be deleted when a user deletes their account.
  - If we serialize usernames (editors/viewers) this would mean they will persist even after deleting an account.
Need to consider long term storage requirements for these archives (e.g. save to new content/archives/ dir in a GCP bucket)

Comments

This is strictly a POC and not finished; it would need to be continued in order to make sure channel archives contain all the info needed for all possible use cases (e.g. is the info enough to "restore" a channel from an archive?).

Reviewers

Jordan @jayoshih please take a look and see if it makes sense
Kevin @kollivier the channel archive json could potentially be used to "decouple" the Kolibri db publishing (exportchannel command) from the need to access studio DB (assuming all the necessary info is present in the archived

codecov · 2020-12-09T15:58:12Z

Codecov Report

Merging #2654 (a3f3cdb) into master (fb16568) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2654   +/-   ##
=======================================
  Coverage   85.39%   85.39%           
=======================================
  Files         298      298           
  Lines       15767    15767           
=======================================
  Hits        13465    13465           
  Misses       2302     2302

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 407e90a...a3f3cdb. Read the comment docs.

rtibbles · 2021-02-01T21:14:55Z

What's needed to help push this forward, @ivanistheone?

ivanistheone · 2021-02-04T20:52:54Z

For context, this PR was due to a misunderstanding on my part: when I heard Jordan was working on channel diff, I rushed to get the archive channel command and the associated detailed diff code ready so she could use it. I then realized "channel diff" meant just the simpler "channel counts diff" and that the detailed diff wasn't in scope, hence the pause on it.

That being said, it would be good to start archiving channel data, even if there is no frontend for it yet.

@rtibbles Here is a mini-list of possible next steps:

A. Confirm need/usefulness (will post a separate comment with use cases)
B. Finish TODOs (Ivan can support, provided not urgent)
C. Need in-house champion to review/supervise/land this PR
D. Decisions needed:
- D1/ Where to store archives? (in GCP, but same bucket or new?)
- D2/ Under what path do we store archives?
  Maybe content/archives/jsontrees/{channel_id}/{version}/{channel_id}.json ?
- D3/ Should we store thumbnail_encoding in JSON jsontree archive?
  (recommendation: no, because it will require ~10x more storage)
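To make decisions D2 and D3 concrete, here is a minimal sketch of a path builder for the proposed layout and a helper that drops `thumbnail_encoding` before saving. Both function names are hypothetical, and the sketch assumes each node in the JSON tree is a dict with an optional `children` list; the actual archive format may differ.

```python
import copy
import posixpath


def jsontree_archive_path(channel_id: str, version: int) -> str:
    """Build the bucket-relative path proposed in D2 (forward slashes, as in bucket keys)."""
    return posixpath.join(
        "content", "archives", "jsontrees", channel_id, str(version), f"{channel_id}.json"
    )


def strip_thumbnail_encoding(tree: dict) -> dict:
    """Return a deep copy of the tree with thumbnail_encoding removed from every node."""
    pruned = copy.deepcopy(tree)
    stack = [pruned]
    while stack:
        node = stack.pop()
        node.pop("thumbnail_encoding", None)  # no-op if the key is absent
        stack.extend(node.get("children", []))
    return pruned


example_tree = {
    "node_id": "n0",
    "thumbnail_encoding": "data:image/png;base64,AAAA",
    "children": [{"node_id": "n1", "thumbnail_encoding": "data:image/png;base64,BBBB"}],
}
pruned_tree = strip_thumbnail_encoding(example_tree)
example_path = jsontree_archive_path("abc123", 7)
```

Working on a deep copy keeps the in-memory tree intact, so the same data could still be archived with thumbnails if the D3 decision goes the other way.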

Other related dev work:

Potential to refactor the export code to use the archived jsontrees channel instead of accessing the DB directly
Archive DB command: before publishing a new version's .sqlite3 file,
save a backup copy of the old version DB to content/archives/dbs/{channel_id}/{old_version}/{channel_id}.sqlite3
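A minimal sketch of the "archive DB" idea, assuming a local directory stands in for the GCP bucket. `archive_channel_db` and its parameters are hypothetical names, not an existing Studio command.

```python
import os
import shutil
import tempfile


def archive_channel_db(db_path: str, archive_root: str, channel_id: str, old_version: int) -> str:
    """Copy the currently published DB file into the archive layout and return the new path."""
    dest_dir = os.path.join(
        archive_root, "content", "archives", "dbs", channel_id, str(old_version)
    )
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, f"{channel_id}.sqlite3")
    shutil.copy2(db_path, dest)  # copy2 also preserves file timestamps
    return dest


# Demo with a placeholder file standing in for the old published database.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "abc123.sqlite3")
with open(db_path, "wb") as f:
    f.write(b"old published db")
backup_path = archive_channel_db(db_path, workdir, "abc123", 3)
```

In a real deployment the copy would go through whatever storage backend Studio uses for the bucket; the point here is only that the backup happens before the new version overwrites the old file.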

I'm a bit out of the loop so cannot speak as to priority/timeframes, but happy to help out in free time on B. after A. (confirm this mgmt command is needed).

ivanistheone · 2021-02-04T21:02:38Z

Use cases

These were discussed a bit with Jordan and @kollivier as useful, but not sure if/when they would fit in roadmap:

1/ channeldiff task + command

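import json

# archive_channel_tree is the helper added in contentcuration/contentcuration/utils/archive.py
maintree_path = archive_channel_tree(channel_id, tree="main")
stagingtree_path = archive_channel_tree(channel_id, tree="staging")
# json.load expects an open file object, not a path
with open(maintree_path) as f:
    maintree = json.load(f)
with open(stagingtree_path) as f:
    stagingtree = json.load(f)
main_staging_diff = diff(maintree, stagingtree, preset="studio")
# save to GCP bucket with public URL
# download diff from public URL (for content integration debugging by the Kevins and Vahids)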

See standalone POC command-line code for this here: treediffer/examples/studiodiffferpoc.py
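For readers without treediffer at hand, the sketch below shows the general idea of diffing two archived trees by node id. It is a simplified stand-in, not treediffer's algorithm or its `preset="studio"` behavior; the `node_id`, `title`, and `children` keys are assumptions about the archive format.

```python
def index_by_node_id(tree: dict) -> dict:
    """Flatten a nested tree into a {node_id: node} mapping."""
    index = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        index[node["node_id"]] = node
        stack.extend(node.get("children", []))
    return index


def simple_tree_diff(old_tree: dict, new_tree: dict, fields=("title",)) -> dict:
    """Report node ids that were added, deleted, or changed in any of the given fields."""
    old_index = index_by_node_id(old_tree)
    new_index = index_by_node_id(new_tree)
    added = sorted(set(new_index) - set(old_index))
    deleted = sorted(set(old_index) - set(new_index))
    modified = sorted(
        node_id
        for node_id in set(old_index) & set(new_index)
        if any(old_index[node_id].get(f) != new_index[node_id].get(f) for f in fields)
    )
    return {"added": added, "deleted": deleted, "modified": modified}


main_tree = {"node_id": "root", "title": "Channel", "children": [
    {"node_id": "a", "title": "Lesson 1"},
    {"node_id": "b", "title": "Lesson 2"},
]}
staging_tree = {"node_id": "root", "title": "Channel", "children": [
    {"node_id": "a", "title": "Lesson 1 (revised)"},
    {"node_id": "c", "title": "Lesson 3"},
]}
example_diff = simple_tree_diff(main_tree, staging_tree)
```

The same comparison works for any two archived trees, for example two published versions of a channel rather than main and staging.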

2/ channeldiff UI

Run the channeldiff task, then
the frontend loads the diff from the GCP public URL and renders it nicely.

3/ archival

Not sure if we need to tackle this right now, since it requires consideration of scalability and long-term user data retention. It would be nice to have a combined command archivechannel that does both archivechanneltree and archivechanneldb.

APPLICATION 3A: diff between vM and vN of a channel based on the JSON-archived versions created after each version increment. Pitch: "See what new nodes have been added to channel X since you last viewed/edited/imported_from it."

4/ PUBLISH/EXPORT Kolibri DB from studio JSON archive tree

Instead of export.py relying on direct access to the DB, Kolibri-DB creation can be an independent task that takes studio_tree_archive.json as input and produces the Kolibri DB (plus fetching perseus files if needed).

BONUS 4R: possibility to reuse same PUBLISH code in Ricecooker (previously discussed)
BONUS 4S: possibility to PUBLISH a "preview" Kolibri channel from staging or ricecooker trees of a channel (without ACTIVATing it first)
BONUS 4K: possibility to reuse same PUBLISH code in Kolibri (?)
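The decoupling idea can be illustrated with a toy exporter that reads only a JSON tree and writes rows to SQLite. This is not the logic in export.py, and the single `node` table below is a hypothetical, heavily simplified schema; the real Kolibri content DB has many more tables and fields.

```python
import sqlite3


def export_tree_to_sqlite(tree: dict, conn: sqlite3.Connection) -> None:
    """Write every node of a JSON tree into a simplified, hypothetical node table."""
    conn.execute(
        "CREATE TABLE node ("
        "id TEXT PRIMARY KEY, parent_id TEXT, title TEXT, kind TEXT, sort_order INTEGER)"
    )

    def insert(node: dict, parent_id, sort_order: int) -> None:
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
            (node["node_id"], parent_id, node["title"], node["kind"], sort_order),
        )
        # Children keep their position in the list as sort_order.
        for position, child in enumerate(node.get("children", [])):
            insert(child, node["node_id"], position)

    insert(tree, None, 0)
    conn.commit()


archive_tree = {"node_id": "root", "title": "Channel", "kind": "topic", "children": [
    {"node_id": "v1", "title": "Intro", "kind": "video"},
    {"node_id": "e1", "title": "Quiz", "kind": "exercise"},
]}
conn = sqlite3.connect(":memory:")
export_tree_to_sqlite(archive_tree, conn)
row_count = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
```

Because the function only needs a dict and a SQLite connection, the same code path could in principle run in Studio, Ricecooker, or elsewhere, which is the reuse the bonus items point at.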

5/ content provenance

All the expensive "graph analytics", like which channel imports from which, can be done easily based on the channel archive JSON
https://github.com/fle-internal/content-provenance/blob/master/scripts/import_provenance.py#L23-L76
and generate "what is in this channel" visualizations
see http://design-sprint.learningequality.org/importcounts/
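A small sketch of the kind of import counting this would enable, working purely from an archived tree. It assumes each node carries an `original_channel_id` key naming the channel it was first created in; that key and the function name are assumptions about the archive format, not confirmed by this PR.

```python
from collections import Counter


def count_imports_by_source(tree: dict, channel_id: str) -> Counter:
    """Count nodes in the tree that originate from a channel other than channel_id."""
    counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        source = node.get("original_channel_id")
        if source and source != channel_id:
            counts[source] += 1
        stack.extend(node.get("children", []))
    return counts


provenance_tree = {"node_id": "root", "original_channel_id": "chanA", "children": [
    {"node_id": "x", "original_channel_id": "chanB"},
    {"node_id": "y", "original_channel_id": "chanB"},
    {"node_id": "z", "original_channel_id": "chanC"},
    {"node_id": "w", "original_channel_id": "chanA"},
]}
import_counts = count_imports_by_source(provenance_tree, "chanA")
```

Running this over a directory of archives would give the edges of an import graph without touching the Studio DB.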

6/ ROC data importer

Not needed for ROC prototype, but good to have full Studio data (including provenance)
https://rocdata.readthedocs.io/en/latest/importers/kolibri_studio.html

ivanistheone added a commit to learningequality/treediffer that referenced this pull request Dec 9, 2020

Update studio preset to match learningequality/studio#2654

16f4020

ivanistheone added 6 commits February 4, 2021 15:22

Will it serialize? Yes it will!

36e284e

Will it recurse? Yes it will!

1aaaab5

Include channel metadata in archive json

dbca1ea

Added archivechanneltree command

ab75cff

Added files and assessment_items that look like API

d348470

These are minimal additions to make sure the JSON archive format really works with treediffer preset="studio" defined in https://github.com/learningequality/treediffer/blob/master/src/treediffer/presets.py#L39-L80

Black strings

a3f3cdb

ivanistheone force-pushed the archive_channel branch from 7fefb6c to a3f3cdb Compare February 4, 2021 20:30

ivanistheone changed the base branch from develop to master February 4, 2021 20:31
