This repository has been archived by the owner on May 19, 2021. It is now read-only.

make using roxygen-like documentation for analysis directories #77

AliciaSchep opened this issue May 11, 2017 · 11 comments

Comments

@AliciaSchep

I've been thinking about whether it would be possible and useful to have roxygen-like tags for documenting the inputs and outputs of analysis scripts, which could then be used to create a makefile easily when needed. This idea is closely related to the first part of thread #5, particularly the second comment (from @njtierney) about the struggle to go from exploratory analysis to something reproducible, and the subsequent discussion of make. But as that thread has moved on a bit into testing/CI/pkg issues, I figured I'd start a new thread.

The idea would be that in a given R script (or Rmarkdown) you might at some point read in inputs and at other points write outputs. You could tag inputs and outputs:

#' myfile.csv
#' A really cool data file!
#' @source coolwebsite.com
#' @input myfile.csv
mytable <- read_csv("myfile.csv")

myoutput <- do_stuff(mytable)

#' myoutput.rds
#' My awesome calculated result
#' @output myoutput.rds
saveRDS(myoutput, "myoutput.rds")

Then another script might have:

#' @input myoutput.rds
myinput <- readRDS("myoutput.rds")

Within the directory containing all these scripts, you could run a command that reads through all the scripts and their input and output files and creates a makefile. If there are any circular dependencies those would get flagged. The command would also create man pages for each input and output object, as well as an overall workflow documentation with a dependency graph linking to individual input/output documentation.
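
As a rough illustration of that command, here is a minimal base R sketch that pulls `@input` and `@output` tags out of one script and writes a single makefile rule for it. The helper names `extract_io_tags()` and `make_rule()` and the script name `analysis.R` are made up for this example; none of this is an existing package API.

```r
# Hypothetical helper: collect @input/@output tags from the lines of one script.
extract_io_tags <- function(lines) {
  m <- regmatches(
    lines,
    regexec("^#'\\s*@(input|output)\\s+(\\S+)", lines, perl = TRUE)
  )
  m <- m[lengths(m) == 3]                      # keep only lines that matched
  tag  <- vapply(m, `[`, character(1), 2)      # "input" or "output"
  file <- vapply(m, `[`, character(1), 3)      # the tagged file name
  list(input = file[tag == "input"], output = file[tag == "output"])
}

# Hypothetical helper: one makefile rule per script. A script with several
# outputs would need more care than this single-rule sketch gives it.
make_rule <- function(script, io) {
  if (length(io$output) == 0) return(character(0))
  c(
    paste0(paste(io$output, collapse = " "), ": ",
           paste(c(script, io$input), collapse = " ")),
    paste0("\tRscript ", script)
  )
}

script_lines <- c(
  "#' @input myfile.csv",
  "mytable <- read.csv(\"myfile.csv\")",
  "#' @output myoutput.rds",
  "saveRDS(mytable, \"myoutput.rds\")"
)
io <- extract_io_tags(script_lines)
rule <- make_rule("analysis.R", io)
cat(rule, sep = "\n")
```

Running the same two helpers over every script in the directory and concatenating the rules would give a first draft of the makefile; the documentation pages and the dependency graph are not covered here.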

There is already an R package that automatically creates makefiles from R scripts: easyMake. It tries to detect automatically when a script reads an input file or writes an output file. I think roxygen-like tags might be a bit more flexible and transparent, since you could specify each input and output file without relying on every input and output function being recognized. This roxygen-like system would also enable better documentation of the workflow and its inputs/outputs than just the makefile or a dependency graph of filenames.

Perhaps rather than creating a new roxygen-like system, roxygen itself could also be adapted for this purpose?

@stephlocke

I like the idea of enhanced metadata & documentation for my work

@bzkrouse

Nice idea, I'm also interested in giving more attention to the struggle of organizing and keeping track of exploratory analysis. The concept of collecting metadata on analysis was also discussed in #23 - although also with emphasis on collecting information about results.

@MilesMcBain
Contributor

I only just noticed this issue in the midst of cleaning up mine. I think what you're describing here is a REALLY great idea. How about a name: makedown? 😉

@hadley
Member

hadley commented May 18, 2017

I like this idea but I think generating a makefile will be error prone. Will be more robust (if more work) to manage the dependency graph in R itself.
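
To make "manage the dependency graph in R itself" a little more concrete, here is one possible base R sketch. It assumes each script's inputs and outputs are already known as a named list (the shape of `io` and the function name `order_scripts()` are assumptions for illustration), derives which scripts depend on which, and returns a run order or stops on a circular dependency.

```r
# Hypothetical sketch: order scripts so that producers run before consumers.
order_scripts <- function(io) {
  scripts <- names(io)
  # deps[[s]] holds the scripts that write at least one of s's inputs.
  deps <- lapply(io, function(x) {
    scripts[vapply(io, function(y) any(y$output %in% x$input), logical(1))]
  })
  done <- character(0)
  while (length(done) < length(scripts)) {
    remaining <- setdiff(scripts, done)
    # A script is ready once everything it depends on has been scheduled.
    ready <- remaining[
      vapply(deps[remaining], function(d) all(d %in% done), logical(1))
    ]
    if (length(ready) == 0) {
      stop("Circular dependency among: ", paste(remaining, collapse = ", "))
    }
    done <- c(done, ready)
  }
  done
}

io <- list(
  "02_model.R" = list(input = "myoutput.rds", output = "model.rds"),
  "01_clean.R" = list(input = "myfile.csv",   output = "myoutput.rds")
)
run_order <- order_scripts(io)
print(run_order)
```

Because the graph lives in an ordinary R list, it could also be handed to a plotting package, which a generated makefile would not allow directly.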

@AliciaSchep
Author

Thanks @bzkrouse for linking this to thread #23. I hadn't read through that one yet, and some of the goals are certainly shared, although I think this idea is more limited in scope. Compared to some of the fairly comprehensive systems discussed in that thread, the idea here is for something fairly minimal and very easy to incorporate into existing script-based analyses.

@MilesMcBain makedown sounds like a great name! Even if ultimately make itself isn't actually used...

As for using make versus managing things in R itself, I think the main benefit of using make is less work 😁 Although perhaps generating the makefile in a reliable way may prove harder than I am anticipating...

@AliciaSchep AliciaSchep changed the title roxygen-like documentation for analysis directories make using roxygen-like documentation for analysis directories May 19, 2017
@hadley
Member

hadley commented May 19, 2017

Generating the makefile will allow you to get a quick proof of concept up and running, and that's a great goal for the unconf. However, code generation in general is hard, and having the dependency graph in another environment means you can't do cool visualisations in R etc.

@hadley
Member

hadley commented May 19, 2017

Another thing worth considering is whether you could automatically detect inputs/outputs for many common situations - i.e. in your example above, you could parse the file and detect read.csv() and saveRDS() and automatically generate the input/output annotations. You'd still need manual annotations for non-standard functions, but you might be able to give people a fairly comprehensive solution for free.
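
A minimal sketch of that kind of detection, using only base R's `parse()` and a walk over the resulting calls. The lookup tables `readers` and `writers` and the function `detect_io()` are assumptions made for this example and are far from exhaustive; it only sees literal string paths and plain function names (not `pkg::fn` calls).

```r
# Assumed, non-exhaustive lists of functions that read or write files.
readers <- c("read.csv", "read_csv", "readRDS")
writers <- c("write.csv", "saveRDS", "ggsave")

detect_io <- function(code) {
  found <- list(input = character(0), output = character(0))
  walk <- function(expr) {
    if (!is.call(expr)) return(invisible(NULL))
    fn <- if (is.symbol(expr[[1]])) as.character(expr[[1]]) else ""
    if (fn %in% c(readers, writers)) {
      # Keep literal string arguments only; computed paths are not resolved.
      # Taking the first string as the file path is a simplifying assumption.
      strings <- Filter(is.character, as.list(expr)[-1])
      if (length(strings) > 0) {
        if (fn %in% readers) found$input  <<- c(found$input,  strings[[1]])
        if (fn %in% writers) found$output <<- c(found$output, strings[[1]])
      }
    }
    # Recurse into sub-expressions, skipping empty arguments such as x[, 1].
    for (arg in as.list(expr)) if (!missing(arg)) walk(arg)
  }
  for (expr in parse(text = code)) walk(expr)
  found
}

code <- c(
  'df <- read_csv("my_csv.csv", col_types = types)',
  'ggplot(df, aes(x, y)) + geom_point()',
  'ggsave("my_plot.pdf")'
)
io <- detect_io(code)
str(io)
```

The code is only parsed, never evaluated, so the packages it mentions do not need to be installed for the detection to run.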

@hadley
Member

hadley commented May 19, 2017

It would also be handy to be able to have this work inline, although you'd need some way to represent that the output was new R objects:

if_needed(
  input = c(object("types"), "my_csv.csv"),
  output = c(object("df"), "my_plot.pdf"),
  {
    df <- read_csv("my_csv.csv", col_types = types)
    ggplot(df, aes(x, y)) + geom_point()
    ggsave("my_plot.pdf")
  }
)

And in that case you could determine the inputs and outputs from the code, so you could just write:

if_needed({
  df <- read_csv("my_csv.csv", col_types = types)
  ggplot(df, aes(x, y)) + geom_point()
  ggsave("my_plot.pdf")
})
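
For the file-only case, one way `if_needed()` could behave is sketched below: run the block only when an output is missing or older than an input, judged by file modification times. This is an assumption about the semantics, not the proposal's actual design, and it leaves out `object()` and the automatic detection of inputs and outputs entirely.

```r
# Hypothetical sketch: `input` and `output` are character vectors of file paths.
if_needed <- function(input, output, code) {
  stopifnot(all(file.exists(input)))
  stale <- !all(file.exists(output)) ||
    max(file.mtime(input)) > min(file.mtime(output))
  # `code` is a lazily evaluated argument, so forcing it runs the block
  # in the caller's environment, and only when the outputs are stale.
  if (stale) force(code)
  invisible(stale)
}

in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".rds")
write.csv(data.frame(x = 1:3), in_file, row.names = FALSE)

runs <- 0
step <- function() {
  if_needed(in_file, out_file, {
    runs <<- runs + 1
    saveRDS(read.csv(in_file), out_file)
  })
}
first  <- step()  # output missing, so the block runs
second <- step()  # output is up to date, so the block is skipped
```

Modification times are a coarse test; comparing content hashes would be one alternative, at the cost of reading every file.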

@hadley
Member

hadley commented May 22, 2017

I hope you don't mind but I've taken your basic idea and run with it: https://docs.google.com/document/d/1avYAqjTS7zSZn7JAAOZhFPkhkPvYwaPVrSpo31Cu0Yc/edit#. I'd love your thoughts!

@AliciaSchep
Author

Definitely don't mind, looks great! In terms of my original idea, there were two related goals. One was linking dependencies across R files (without having to create your own makefile), and the other was to enable documentation of inputs and outputs so as to be able to create a documented dependency graph. The proposal for lazyr seems like a great solution for the first goal, but doesn't necessarily help with the second, although perhaps those goals shouldn't have been linked anyway.

@coatless

This feels like the merging of CodeDepends and YesWorkflow / Live Demo, which would be very useful.


7 participants