sjkdenny edited this page Jul 28, 2015 · 30 revisions

Welcome to the array_image_tools_SKD wiki!

Part 1: Make CPsignal files

Starting from where the array_tools pipeline leaves off, this step finds the CPfluor directories and CPseq directories and joins them for future processing. This culminates in CPsignal files, which have all of the information of a CPseq file with the fit fluorescence data appended. It will also make a 'reduced' CPsignal file, which has all of the tiles concatenated but contains only the subset you want to process further. This subset is defined by filterPos, a set of filter names that you wish to keep. See array_tools for more information about filters.

Inputs:

  • filtered CPseq files
  • CPfluor directories
    • This is in the form of a ‘.map’ file (tab delimited)
      • row 1: root directory
      • row 2: all RNA CPfluor directory (if none, leave line blank)
      • rows 3 to N+2: first column, binding series images CPfluor directories
      • rows 3 to N+2: second column, associated concentrations (in nM)
      • where N is the number of binding points
    • For off rates and on rates, please provide the same format. This program assumes all of your on-rate/off-rate data will be in a single CPfluor directory. This directory is given in row 3, with a dummy value for the concentration (e.g. 0).
  • optional inputs:
    • "Positive filter": filterPos
      • list of filter names for all clusters you plan to fit.
      • If not given, all clusters are fit (more time consuming)
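The '.map' layout described above can be read with a few lines of Python. This is only a minimal sketch, not the pipeline's own reader: the function name `parse_map_file`, the returned dictionary keys, and the example paths are all hypothetical, and it assumes one directory/concentration pair per row from row 3 onward.

```python
import io

def parse_map_file(handle):
    """Parse a tab-delimited '.map' file (hypothetical helper, layout as described above)."""
    lines = [line.rstrip("\n") for line in handle]
    root_dir = lines[0].strip()              # row 1: root directory
    all_rna_dir = lines[1].strip() or None   # row 2: blank means no all-RNA directory
    series = []
    for line in lines[2:]:                   # rows 3 onward: directory <tab> concentration
        if not line.strip():
            continue                         # tolerate trailing blank lines
        directory, concentration = line.split("\t")[:2]
        series.append((directory.strip(), float(concentration)))
    return {"root": root_dir, "all_rna": all_rna_dir, "series": series}

# Made-up example with no all-RNA directory and two binding points (in nM).
example = io.StringIO(
    "/data/chip1\n"
    "\n"
    "binding/0_5nM\t0.5\n"
    "binding/2nM\t2\n"
)
parsed = parse_map_file(example)
```

For on-rate/off-rate data, the same sketch would return a single entry in `series` whose concentration is the dummy value.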

Outputs:

  • CPsignal file has columns:
    • cluster ID
    • filter
    • read1 seq
    • read1 quality
    • read2 seq
    • read2 quality
    • index1 seq
    • index1 quality
    • index2 seq
    • index2 quality
    • allRNA signal
    • comma-separated binding series/off rate/onrate signal
  • directory "CPsignal" contains tile-separated CPsignal files.
  • directory "CPfitted" contains the reduced, concatenated CPsignal file.
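As a rough illustration of the column layout listed above, the sketch below splits one CPsignal line into named fields and checks it against filterPos. Several things here are assumptions rather than documented behavior: that the file is tab-delimited with no header, that multiple filter names in the filter column are separated by ':', and all column names other than read2_seq and index1_seq. The helper names are hypothetical.

```python
import math

# Column order follows the list above; most names are illustrative.
CPSIGNAL_COLUMNS = [
    "cluster_id", "filter", "read1_seq", "read1_quality",
    "read2_seq", "read2_quality", "index1_seq", "index1_quality",
    "index2_seq", "index2_quality", "all_rna_signal", "series_signal",
]

def parse_cpsignal_line(line):
    """Split one tab-delimited CPsignal line into a dict (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(CPSIGNAL_COLUMNS, fields))
    record["all_rna_signal"] = float(record["all_rna_signal"])
    # The last column holds the comma-separated binding series / rate signal.
    record["series_signal"] = [float(x) for x in record["series_signal"].split(",")]
    return record

def has_any_filter(record, filter_pos, sep=":"):
    """True if the cluster carries any filter name in filter_pos (sep is an assumption)."""
    return bool(set(record["filter"].split(sep)) & set(filter_pos))

# Made-up example line; 'nan' marks a point with no fit signal.
line = "\t".join([
    "M1:1101:10:20", "anyRNA:tecto", "ACGT", "IIII", "TTGA", "IIII",
    "GG", "II", "", "", "512.3", "10.1,nan,55.0",
])
record = parse_cpsignal_line(line)
keep = has_any_filter(record, ["tecto"])
```

A 'reduced' file in this picture would simply be the concatenation of all records for which `has_any_filter` is true.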

Part 2: Make CPannot file

Given a sequenced chip, this step annotates the clusters with a unique variant number. This culminates in a file with two columns: one for the tileID and one for the unique variant number. There are several options for making this file, or you can supply it yourself.

Inputs:

  • library characterization file
    • File that lists the unique sequences. This file can have any number of columns; however, it must have a header and it must contain the column 'sequence'. This column will be used to annotate clusters with the unique variant information.
    • The 'sequence' column will be uniqued, so it can contain duplicate sequences.
    • The final 'variant_number' will correspond to the index within the sequence column at which that sequence was first found.
  • Optional: unique_barcodes file.
    • A file containing the barcode and associated consensus sequence.
    • This file can have any number of columns; however it must have a header and it must contain the columns 'sequence' and 'barcode'.
    • Can be generated using the script compressBarcodes.py.
    • Should be filtered to contain only those barcodes that are 'good'.
  • Other options
    • barcodeCol
      • name of column in CPsignal file containing barcodes (if unique_barcodes file was supplied). Default is index1_seq
    • seqCol
      • name of column in CPsignal files containing the sequences to search (if unique_barcodes file not supplied). Default is read2_seq
    • noReverseComplement
      • Default is to look for the reverse complement of sequences in seqCol (or sequence column of unique_barcodes file). Flag this option if you would like to look for the forward sequence as well.
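The variant numbering and the default reverse-complement lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: indices are taken to be zero-based, matching is simplified to an exact whole-sequence match, and the function names are hypothetical. The noReverseComplement option is not modeled.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))

def variant_numbers(sequences):
    """Map each unique sequence to the index at which it first appears."""
    first_seen = {}
    for index, seq in enumerate(sequences):
        first_seen.setdefault(seq, index)  # later duplicates keep the first index
    return first_seen

def annotate(cluster_seqs, library_seqs):
    """Assign a variant number to each cluster via its reverse complement.

    Exact matching is a simplification; unmatched clusters get None.
    """
    lookup = variant_numbers(library_seqs)
    return {
        cluster_id: lookup.get(reverse_complement(seq))
        for cluster_id, seq in cluster_seqs.items()
    }

# Made-up library with a duplicate, and three made-up clusters.
library = ["AACG", "AAAA", "AACG"]
clusters = {"c1": "CGTT", "c2": "TTTT", "c3": "GGGG"}
annotations = annotate(clusters, library)
```

Because duplicates keep the index of their first occurrence, variant numbers need not be consecutive when the 'sequence' column contains repeats.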

Description

If unique_barcodes file is found, the program will look for the

  • Option 1: Given a list of seq

    • “Negative filter”: filterNeg
      • list of filter names for set of clusters that should not have any fluorescence and represent nonspecific binding or spurious fits to the image data.
      • If not given, the complement of filterPos is assumed to represent the background clusters. If filterPos is not set, background is found heuristically by ranking the fluorescence in one binding point.
  • settings/flags

  • -nc, --null_column: which point of the binding series to use for the initial estimate of binders/nonbinders

  • default is -1

Description:

Concatenation of fluorescence and sequencing information

  • This will concatenate the sequence information with the fit fluorescence information. Final columns are:
  • Files are still separated by tile.
  • If a tile is missing for any concentration/time point, these are set to nan.

Fitting of single clusters in filterPos

  • Clusters that are labeled with any of the filters in filterPos are fit to binding curves/off rates/on rates.

Binding Curves

  • To find optimal binding constraints, clusters are divided into a set that likely binds and a set that likely doesn't bind.
  • This is done using the set of clusters labeled with filterNeg. The distribution of fluorescence in these 'null' clusters is compared to the fluorescence in the clusters to fit.
    • Those that are significantly different from the null distribution are designated probable binders.
    • Those that are not different are probable nonbinders.
  • For binding curves, the default point of comparison is the last point of the binding series.

Normalization

  • The all cluster signal is used to normalize the binding images if it was initially provided and the flag --no_normalization is not enforced.
  • This all cluster signal is 'trimmed' to avoid

Binding constraints

  • To find fmin: the distribution of fluorescence in the first point of the binding series of the probable nonbinders is used to estimate constraints on fmin. The initial point is the maximum
  • To find fmax: fit 10000 top clusters (or however many have qvalue < 0.01).
    • If this is a really small fraction, might need to do it a different way, i.e. way of no filters.
    • What to do if filterPos is all? Find max and min at concentration point.

Part 2: Label clusters

  • Give it a barcode column and a unique barcodes file.
  • For me: use index read1, which is either NaN (not tecto), the first 16 bases of the read1 sequence, or the first N bases before finding the RNAP promoter site.
  • Can also do just finding of unique sequences? Will probably be at a much lower rate.
  • In the end, make cluster -> annotation file.
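The split into probable binders and nonbinders described under Binding Curves can be sketched as below. The notes do not say which statistical test is used, so this example swaps in a one-sided empirical p-value against the null clusters followed by a Benjamini-Hochberg q-value; that choice, the function names, and the treatment of the 0.01 cutoff are all assumptions. The sketch also assumes no nan values at the comparison point.

```python
from bisect import bisect_left

def empirical_pvalues(values, null_values):
    """One-sided p-value: fraction of null clusters at least as bright."""
    null_sorted = sorted(null_values)
    n = len(null_sorted)
    pvalues = []
    for v in values:
        n_ge = n - bisect_left(null_sorted, v)   # null values >= v
        pvalues.append((n_ge + 1) / (n + 1))     # add-one avoids p = 0
    return pvalues

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    qvalues = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top                 # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def probable_binders(series_signals, null_signals, null_column=-1, cutoff=0.01):
    """Flag clusters whose signal at null_column differs from the null set."""
    values = [s[null_column] for s in series_signals]
    null_values = [s[null_column] for s in null_signals]
    qvalues = bh_qvalues(empirical_pvalues(values, null_values))
    return [q < cutoff for q in qvalues]

# Made-up data: 999 null clusters and two clusters to classify.
null_signals = [[0.0, float(i)] for i in range(999)]
series_signals = [[1.0, 5000.0], [2.0, 10.0]]
flags = probable_binders(series_signals, null_signals)
```

The `null_column=-1` default mirrors the -nc setting, which selects the last point of the binding series. With an empirical p-value the smallest attainable value is 1/(n+1), so a small null set cannot produce q-values below the cutoff.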
