sjkdenny edited this page Aug 18, 2016 · 50 revisions

This page describes what the findFmaxDist script does. See Fitting a binding series for a description of how this fits into the overall pipeline and an example of how to run it.

Inputs

This step requires the following inputs:

  1. concentrations (in nM)
  2. single cluster fits (CPfitted)
  3. annotated clusters (CPannot)

How it works

  1. For each variant, find the median of fit parameters of the single cluster fits (fmin_init, fmax_init, dG_init)
  2. Filter variants for good fitters.
    • For the set of single cluster fits for a given variant, decide whether each fit was good, based on:
      • fmax_stde < fmax
      • rsq > 0.5
      • dG_stde < 1 kcal/mol. Note: this parameter has been too stringent in the past, e.g. for the puf data.
    • If a variant has significantly more 'good' clusters than expected by chance (I assume a background rate of 25% good fits per variant), then this variant was fit well.
      • This filters out variants with few measurements as well as variants with many bad fits.
      • By default, this cutoff rejects variants with binomial p value > 0.01, but this can be adjusted with the -p or --pvalue_cutoff flag, e.g. -p 0.05 for a less stringent cutoff.
  3. Filter variants for tight binders.
    • A threshold on the Kd is applied based on the input concentrations and the median fit parameters.
    • Variants are kept only if the median dG_init corresponds to at least 95% bound at the final concentration (assuming Kd = x/f - x, where x is the final concentration and f is 0.95).
    • This cutoff can also be set explicitly with the -k or --kd_cutoff flag, e.g. -k 10 to filter for variants that have a median Kd of less than or equal to 10 nM.
  4. Decide whether to fit the relationship between the standard deviation of median fmax's and the number of measurements N directly, or whether resampling is required.
    • If there are at least 20 different N's with at least 10 variants each, fit the relationship directly. Otherwise, simulate the relationship by resampling clusters.
    • If resampling is required, construct distributions of fmax's by resampling individual clusters for different N's, rather than using median fmax's.
  5. Fit the distribution for each value of N.
    • Distributions are assumed to be gamma distributions.
    • Initially the distribution of median fmax's of good fitting, tight binding variants is fit to a gamma distribution. The fit mean of this distribution is then fixed for subsequent fits.
    • For each number of measurements N for which there are sufficient tight binders, the corresponding subset of fmax's is fit to a gamma distribution, with the mean fixed but the standard deviation and an offset parameter allowed to float.
    • The set of standard deviations (σ's) versus the number of measurements N is fit to the analytic function: c1/sqrt(N) + c2.
    • The set of offsets per N is assumed to be zero in the final representation of the fmax dist.
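
The fit of σ versus N in the last step is linear in 1/sqrt(N), so it can be sketched with ordinary least squares. This is only an illustration under that assumption: the function names are hypothetical, and the script itself may use a different optimizer or weight the points differently.

```python
from math import sqrt

def fit_sigma_vs_n(n_values, sigmas):
    """Unweighted least-squares fit of sigma = c1 / sqrt(N) + c2.

    Substituting u = 1 / sqrt(N) turns the model into a straight line,
    so the closed-form simple linear regression solution applies.
    """
    u = [1 / sqrt(n) for n in n_values]
    m = len(u)
    u_mean = sum(u) / m
    s_mean = sum(sigmas) / m
    sxx = sum((ui - u_mean) ** 2 for ui in u)
    sxy = sum((ui - u_mean) * (si - s_mean) for ui, si in zip(u, sigmas))
    c1 = sxy / sxx
    c2 = s_mean - c1 * u_mean
    return c1, c2

def sigma_for_n(n, c1, c2):
    """Predicted standard deviation of fmax for a variant with n clusters."""
    return c1 / sqrt(n) + c2
```

Once c1 and c2 are known, `sigma_for_n` gives the standard deviation to use for any number of measurements, including values of N that had too few variants to be fit on their own.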

Outputs

  1. Fmaxdist.p
    • Fmax mean (constant for all variants) and std (depends on number of measurements)
  2. The initial fit parameters per variant.
  3. A bunch of figures

Figures

Initial fit parameters

A. B.

The relationship between initial median fit parameters for variants that fit well.

A) Fmax vs Kd. The red dashed line indicates the threshold of Kd below which variants should achieve >= 95% saturation at the final concentration.

B) Fmin vs Kd. Note that the fmin's obtained from the single cluster fits for tight binders appear systematically high. This is dealt with in the next stage of the pipeline by fixing the fmin to a single value for all clusters.

Note: the practice of fixing fmin seems to work well for tectoRNAs, but ideally should be dealt with by having a zero concentration image that defines fmin per variant. We've seen issues in the puf datasets, especially if a chip was reused and things that had bound in a previous experiment started out with significantly higher fluorescence than other clusters.
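
The Kd threshold drawn as the red dashed line in panel A follows from the relation Kd = x/f - x given in the tight-binder filter. A minimal sketch of that arithmetic is below. The function names are hypothetical, and the 2000 nM final concentration in the usage note is an arbitrary example, not a value taken from the pipeline.

```python
def kd_cutoff(final_concentration, fraction_bound=0.95):
    """Largest Kd (same units as the concentration) for which a variant is
    at least `fraction_bound` saturated at the final concentration.

    Rearranged from f = x / (x + Kd), i.e. Kd = x / f - x.
    """
    return final_concentration / fraction_bound - final_concentration

def fraction_bound_at(concentration, kd):
    """Fraction bound at a given concentration for a simple binding isotherm."""
    return concentration / (concentration + kd)
```

For a hypothetical final concentration of 2000 nM, `kd_cutoff(2000)` is about 105 nM, so only variants with a median Kd at or below roughly 105 nM would be treated as tight binders.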

How many clusters pass cutoffs

A. <img src="https://github.com/GreenleafLab/array_fitting_tools/wiki/figures/histogram_fraction_fit.png" width="300"> B. <img src="https://github.com/GreenleafLab/array_fitting_tools/wiki/figures/fraction_passing_cutoff_in_affinity_bins.png" width="300">

Some information on the variants that pass cutoffs.
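
Whether a variant passes the good-fit cutoff can be sketched as a one-sided binomial test on its count of good clusters. This is an illustration rather than the script's own code: the function names are hypothetical, and the p value is assumed to be P(X >= number of good clusters). The 25% background rate and the 0.01 cutoff are the defaults described on this page.

```python
from math import comb

def binomial_sf(n_good, n_total, p_background=0.25):
    """P(X >= n_good) for X ~ Binomial(n_total, p_background)."""
    return sum(
        comb(n_total, k) * p_background ** k * (1 - p_background) ** (n_total - k)
        for k in range(n_good, n_total + 1)
    )

def variant_fit_well(n_good, n_total, pvalue_cutoff=0.01, p_background=0.25):
    """True if the variant has significantly more good clusters than chance."""
    return binomial_sf(n_good, n_total, p_background) <= pvalue_cutoff
```

Under these assumptions a variant with 3 good clusters out of 3 has p = 0.25^3 ≈ 0.016 and is rejected at the default cutoff, while 4 out of 4 (p ≈ 0.004) passes. This is how variants with very few measurements end up being filtered out.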

Number of clusters per variant

Distribution of number of clusters/variant for variants that fit well and were tight binders.
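
This distribution also determines whether the relationship between the standard deviation and N is fit directly or simulated by resampling. The rule stated above (at least 20 different N's with at least 10 variants each) can be sketched as follows. The function and argument names are hypothetical, and the input is assumed to be one cluster count per good-fitting, tight-binding variant.

```python
from collections import Counter

def can_fit_directly(n_clusters_per_variant, min_distinct_n=20, min_variants_per_n=10):
    """Decide between a direct fit and resampling.

    `n_clusters_per_variant` holds one integer per variant: its number of
    clusters. Returns True if enough distinct values of N are each
    represented by enough variants.
    """
    variants_per_n = Counter(n_clusters_per_variant)
    usable_n = [n for n, count in variants_per_n.items() if count >= min_variants_per_n]
    return len(usable_n) >= min_distinct_n
```

If this returns False, the distributions for each N would instead be built by resampling individual clusters, as described in the steps above.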

Fit of Fmax versus number of measurements

A.

B. C.

Fit of fmax values to distributions.

A) All fmax values are fit to a gamma distribution to obtain the mean fmax.

Variants with different total numbers of measurements are individually fit to gamma distributions, with the mean parameter fixed, but the offset and standard deviation values allowed to float. B) The standard deviations are fit to c1/sqrt(N) + c2. C) The offsets are assumed to be zero in the final distributions.

Example fits for different number of clusters

A. B. C.

Example distributions using the fixed mean, offset=0, and the standard deviation found for N clusters. A) 4 clusters, B) 18 clusters, C) 73 clusters.
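
With the offset set to zero, a gamma distribution is fully determined by its mean and standard deviation, so distributions like these can be rebuilt from the fixed mean and the standard deviation for a given N. The sketch below uses the standard shape/scale parameterization. The function names are hypothetical, and how the script itself stores or evaluates the distribution is not shown here.

```python
from math import exp, lgamma, log

def gamma_shape_scale(mean, std):
    """Shape k and scale theta of a zero-offset gamma distribution.

    Uses mean = k * theta and variance = k * theta**2.
    """
    shape = (mean / std) ** 2
    scale = std ** 2 / mean
    return shape, scale

def gamma_pdf(x, mean, std):
    """Gamma density at x for the given mean and standard deviation."""
    if x <= 0:
        return 0.0
    k, theta = gamma_shape_scale(mean, std)
    # Evaluate in log space to avoid overflow for large shape values.
    return exp((k - 1) * log(x) - x / theta - lgamma(k) - k * log(theta))
```

Here `std` would be the value of c1/sqrt(N) + c2 for the number of clusters N, so the distribution narrows as N grows, as in panels A through C.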
