Fitting a binding series

sjkdenny edited this page Jan 21, 2019 · 36 revisions

This page walks through an example of quantifying the binding affinity of a ligand, starting from quantified images.

Input files

This pipeline requires the following input files:

  1. Fluorescence values.
    • A CPseries file: a pandas DataFrame, indexed on the clusterID, with columns corresponding to the fluorescence values at the different points in the binding series.
    • Ideally this file only includes the clusters you would like to fit.
  2. Concentrations.
    • A text file giving the concentration for each column of the fluorescence values, one value per line.
    • Units are nM.
  3. Library members.
    • A CPannot file: a pandas DataFrame, indexed on the clusterID, with a single column (variant_number in the example below) that indicates which library member is at that cluster.

If you don't yet have these files, refer to the page on processing CPfluor files and mapping barcodes: [data preprocessing](Processing quantified images into an experimental series).

An example CPseries file:

                                                      0          1          2           3           4           5           6            7
clusterID
M00653:72:000000000-AKPP5:1:2101:19526:15124   0.263915        NaN  18.680380   28.731330  113.997074  309.156274  392.594983   406.253575
M00653:72:000000000-AKPP5:1:2102:4005:7276    12.443187   6.235821  19.070643   70.348760  167.753083  309.825965  546.657635   863.191553
M00653:72:000000000-AKPP5:1:2102:14626:8517   20.914235  10.280927  66.042941  196.194123  235.140475  389.937945  550.857002   693.632958
M00653:72:000000000-AKPP5:1:2102:19935:11152  13.561834  24.205506  62.365845  105.381084  331.450972  579.017631  986.909676  1147.608517
M00653:72:000000000-AKPP5:1:2102:23710:15626  13.483145   8.294032  44.503121   48.735191  213.691357  364.032730  632.511202   665.162297

An example CPannot file:

                                             variant_number
clusterID
M00653:72:000000000-AKPP5:1:2101:10000:1184             NaN
M00653:72:000000000-AKPP5:1:2101:10000:14640          24793
M00653:72:000000000-AKPP5:1:2101:10000:18114           5378
M00653:72:000000000-AKPP5:1:2101:10000:20592           7446
M00653:72:000000000-AKPP5:1:2101:10000:21349           7577

An example concentrations file (units are nM):

0.914494742
2.743484225
8.230452675
24.69135802
74.07407407
222.2222222
666.6666667
2000
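
As a quick consistency check before fitting, the sketch below builds a small stand-in for a CPseries table from two of the example rows above and confirms that there is one concentration per fluorescence column. The DataFrame is constructed inline purely for illustration; how the real CPseries and concentrations files are read from disk is not shown here.

```python
import numpy as np
import pandas as pd

# Two rows copied from the example CPseries above (NaN marks a missing measurement).
cp_series = pd.DataFrame(
    [
        [0.263915, np.nan, 18.680380, 28.731330, 113.997074, 309.156274, 392.594983, 406.253575],
        [12.443187, 6.235821, 19.070643, 70.348760, 167.753083, 309.825965, 546.657635, 863.191553],
    ],
    index=pd.Index(
        [
            "M00653:72:000000000-AKPP5:1:2101:19526:15124",
            "M00653:72:000000000-AKPP5:1:2102:4005:7276",
        ],
        name="clusterID",
    ),
)

# The example concentrations (nM), one value per column of the series.
concentrations = np.array([
    0.914494742, 2.743484225, 8.230452675, 24.69135802,
    74.07407407, 222.2222222, 666.6666667, 2000.0,
])

# There must be exactly one concentration per fluorescence column.
assert cp_series.shape[1] == len(concentrations)

# Ratio between consecutive concentrations (3 for the example dilution series).
fold_steps = concentrations[1:] / concentrations[:-1]
print("fold change between points:", fold_steps.round(3))

# Count clusters that are missing at least one point in the series.
n_incomplete = int(cp_series.isna().any(axis=1).sum())
print("clusters with missing points:", n_incomplete)
```

A mismatch between the number of columns and the number of concentrations is worth catching here, before any fitting is attempted.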

Normalization

Normalize the binding fluorescence (green channel) by the all-RNA fluorescence (red channel).

# normalize green channel CPseries file by red channel CPseries file.
basename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced
python -m normalizeSeries -b $basename.CPseries.gz -a $basename"_red.CPseries.gz"

This produces a CPseries file in which the fluorescence values have been normalized (see [normalization](Normalization of cluster fluorescence intensities) for more information).
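
The idea behind this step can be illustrated with a simplified sketch: divide each cluster's binding-series values by that cluster's all-RNA (red channel) signal, matching rows on clusterID. The toy values, the single `red_signal` value per cluster, and the `min_red` cutoff are assumptions for illustration; the normalizeSeries script may treat the red channel and low-signal clusters differently (see the normalization page).

```python
import pandas as pd

# Toy green-channel binding series: rows are clusters, columns are binding points.
green = pd.DataFrame(
    {0: [10.0, 20.0, 5.0], 1: [100.0, 400.0, 50.0]},
    index=pd.Index(["cluster_a", "cluster_b", "cluster_c"], name="clusterID"),
)

# Toy red-channel signal, one value per cluster, deliberately in a different row order.
red_signal = pd.Series(
    [200.0, 400.0, 1.0],
    index=pd.Index(["cluster_b", "cluster_a", "cluster_c"], name="clusterID"),
)

# Hypothetical cutoff: clusters with very little RNA signal are masked, not divided.
min_red = 10.0
usable_red = red_signal.where(red_signal >= min_red)  # values below the cutoff become NaN

# Row-wise division; pandas aligns the two objects on clusterID, not on position.
normalized = green.div(usable_red, axis=0)
print(normalized)
```

Dividing with index alignment means the two files do not need to list clusters in the same order, and a cluster without a usable red signal ends up as NaN instead of an inflated value.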

Fit single clusters

Fit single clusters with minimal constraints.

# fit single clusters
c=concentrations.txt
normbasename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced_normalized
python -m singleClusterFits -b $normbasename".CPseries.gz" -c $c -n 20

Note: to check that the script is working, use the --subset flag to fit only a subset of clusters; this run should take less than a minute.

This produces two outputs:

  1. Single cluster fits.
    • A CPfitted file: a pandas DataFrame, indexed on the clusterID, with columns corresponding to the fit parameters (fmax, dG, fmin) and their associated standard errors (fmax_stde, etc.), as well as the coefficient of determination (rsq), the exit flag (from lmfit), and the root mean squared error (rmse).
  2. Fit parameters.
    • A fitParameters file giving the initial guesses and the upper and lower bounds for the three fit parameters (fmax, dG, fmin). Note: these are the values shared across all clusters. The initial guess for fmax is estimated per cluster (from the maximum fluorescence value), so it is set to NaN in this file.

Example CPfitted file:

                                                   fmax        dG          fmin   fmax_stde     dG_stde    fmin_stde        rsq exit_flag       rmse
clusterID
M00653:72:000000000-AKPP5:1:2116:4535:17179   0.9501852 -11.66725     0.4678564   0.3332177   0.5082402    0.3451283  0.8971255         1  0.2126636
M00653:72:000000000-AKPP5:1:2109:12728:13104   1.930328 -9.517453    0.05344163  0.08412358   0.1137484   0.05735404  0.9910591         1  0.1903357
M00653:72:000000000-AKPP5:1:2115:24878:15052   2.366877 -9.283474    0.02620492  0.06113703  0.06477567    0.0361533  0.9971191         1   0.128555
M00653:72:000000000-AKPP5:1:2104:19291:13131   2.298948 -9.742383  2.860502e-07   0.1420009   0.1525948   0.09507577  0.9833778         1  0.3186123
M00653:72:000000000-AKPP5:1:2102:10109:8541     1.99432 -9.335828  1.521232e-08   0.1096474   0.1393694  0.008049328  0.9867835         1  0.2351829

Example fitParameters file:

                fmax         dG      fmin
lowerbound  0.000000 -14.787322  0.000000
initial          NaN  -7.637215  0.179725
upperbound       inf  -4.962856       inf
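
To make the three fit parameters concrete, the sketch below fits one simulated cluster with a two-state binding curve, f = fmax * c / (c + Kd) + fmin, where Kd = exp(dG / RT). The functional form, the value RT = 0.582 kcal/mol, the simulated data, and the use of scipy.optimize.curve_fit (the pipeline itself reports an lmfit exit flag) are all assumptions made for illustration. The bounds and initial guesses are copied from the example fitParameters file, and the initial fmax is set to the largest observed value, as described above.

```python
import numpy as np
from scipy.optimize import curve_fit

RT = 0.582  # kcal/mol; assumed value, roughly 20 degrees C


def binding_curve(conc_nM, fmax, dG, fmin):
    """Two-state binding curve (assumed form); dG in kcal/mol, concentrations in nM."""
    kd_nM = np.exp(dG / RT) * 1e9  # Kd in M, converted to nM
    return fmax * conc_nM / (conc_nM + kd_nM) + fmin


# Example concentration series from this page (nM).
conc_nM = np.array([
    0.914494742, 2.743484225, 8.230452675, 24.69135802,
    74.07407407, 222.2222222, 666.6666667, 2000.0,
])

# Simulate one cluster with known parameters plus a little noise.
rng = np.random.default_rng(0)
true_fmax, true_dG, true_fmin = 2.0, -9.5, 0.05
signal = binding_curve(conc_nM, true_fmax, true_dG, true_fmin)
signal = signal + rng.normal(0.0, 0.02, conc_nM.size)

# Bounds and initial guesses from the example fitParameters file;
# the per-cluster initial fmax is the maximum observed fluorescence.
lower = [0.0, -14.787322, 0.0]
upper = [np.inf, -4.962856, np.inf]
initial = [signal.max(), -7.637215, 0.179725]

popt, pcov = curve_fit(binding_curve, conc_nM, signal, p0=initial, bounds=(lower, upper))
stde = np.sqrt(np.diag(pcov))  # standard errors from the covariance estimate

for name, value, err in zip(("fmax", "dG", "fmin"), popt, stde):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

Under the assumed RT, the dG bounds in the example file correspond to Kd values roughly 100-fold below the lowest and 100-fold above the highest concentration in the series.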

Find fmax distribution

Find the distribution of fmax for good variants.

c=concentrations.txt
an=anyRNA.CPannot.pkl
normbasename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced_normalized
python -m findFmaxDist -cf $normbasename.CPfitted.gz -a $an -c $c

This produces two outputs:

  • fmax distribution
    • A Python class that stores a gamma distribution whose mean is fixed, but whose standard deviation depends on the number of measurements (i.e. the number of clusters with the same variant ID).
  • initial per-variant fit parameters
    • A CPvariant file: a pandas DataFrame, indexed on the variant ID (variant_number in the example below), with columns corresponding to the median fit parameters from the single cluster fits (fmax_init, dG_init, fmin_init), the number of clusters per variant (numTests), the fraction of clusters that fit well (fitFraction), the p-value for observing this fitFraction given a background rate of 0.25 (pvalue), as well as columns that will be filled in during the next step.
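
One way to picture the fmax distribution is a gamma distribution specified by its mean and standard deviation, where the standard deviation shrinks as more clusters are measured. The sketch below assumes the standard deviation scales as std_single / sqrt(n); this scaling, the numeric values, and the helper name `fmax_gamma` are illustrative assumptions, not the relationship that findFmaxDist actually estimates.

```python
import math

from scipy import stats


def fmax_gamma(mean, std):
    """Frozen gamma distribution with the requested mean and standard deviation."""
    shape = (mean / std) ** 2   # gamma mean = shape * scale
    scale = std ** 2 / mean     # gamma variance = shape * scale**2
    return stats.gamma(a=shape, scale=scale)


mean_fmax = 2.0    # hypothetical fixed mean of fmax
std_single = 0.6   # hypothetical standard deviation for a variant with one cluster

for n_clusters in (1, 10, 50):
    dist = fmax_gamma(mean_fmax, std_single / math.sqrt(n_clusters))
    lower, upper = dist.ppf([0.025, 0.975])
    print(
        f"n={n_clusters:2d}  mean={dist.mean():.3f}  std={dist.std():.3f}  "
        f"95% interval=({lower:.3f}, {upper:.3f})"
    )
```

The mean stays the same for every n while the interval narrows, which is the behavior described for the stored distribution.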

Example CPvariant file:

                fmax_init   dG_init  fmin_init  numTests  fitFraction        pvalue  numClusters  fmax_lb  fmax  fmax_ub  dG_lb  dG  dG_ub  fmin_lb  fmin  fmin_ub  rsq  numIter  flag
variant_number
0                2.522423 -8.596066   0.052727        54     0.981481  5.022825e-31          NaN      NaN   NaN      NaN    NaN NaN    NaN      NaN   NaN      NaN  NaN      NaN   NaN
1                2.179792 -8.474715   0.026333        78     0.948718  1.287681e-39          NaN      NaN   NaN      NaN    NaN NaN    NaN      NaN   NaN      NaN  NaN      NaN   NaN
2                1.880778 -8.400040   0.023979        46     0.978261  2.807083e-26          NaN      NaN   NaN      NaN    NaN NaN    NaN      NaN   NaN      NaN  NaN      NaN   NaN
3                1.884838 -8.467149   0.027602        44     1.000000  3.231174e-27          NaN      NaN   NaN      NaN    NaN NaN    NaN      NaN   NaN      NaN  NaN      NaN   NaN
4                2.085465 -8.133927   0.036511        50     0.980000  1.191180e-28          NaN      NaN   NaN      NaN    NaN NaN    NaN      NaN   NaN      NaN  NaN      NaN   NaN

See Finding fmax distribution for more details on what this program does and for descriptions of the output figures.
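
The pvalue column can be reproduced with a one-tailed binomial test: the probability of seeing at least the observed number of well-fit clusters out of numTests if each cluster fit well with probability 0.25. The sketch below uses only the standard library; the helper name `binomial_pvalue` is hypothetical. The two calls use rows from the example CPvariant file above.

```python
from math import comb


def binomial_pvalue(num_success, num_tests, background_rate=0.25):
    """P(X >= num_success) for X ~ Binomial(num_tests, background_rate)."""
    return sum(
        comb(num_tests, k)
        * background_rate ** k
        * (1 - background_rate) ** (num_tests - k)
        for k in range(num_success, num_tests + 1)
    )


# Variant 3 in the example: 44 clusters, fitFraction 1.0 (44 fit well).
p_variant3 = binomial_pvalue(44, 44)
print(f"{p_variant3:.6e}")

# Variant 0 in the example: 54 clusters, fitFraction 0.981481 (53 fit well).
p_variant0 = binomial_pvalue(53, 54)
print(f"{p_variant0:.6e}")
```

These agree with the tabulated values for variants 3 and 0 (3.231174e-27 and 5.022825e-31).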

Bootstrap fits

Using the fmax distribution, the fluorescence values, and the median values per variant of the single cluster fits, bootstrap the fits per variant.

c=concentrations.txt
an=anyRNA.CPannot.pkl
normbasename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced_normalized
python -m bootStrapFits -v $normbasename.init.CPvariant.gz -a $an -b $normbasename.CPseries.gz -c $c -f $normbasename.fmaxdist.p

This script saves another CPvariant file. The first six columns are identical to those in the initial file.

column       description
fmax_init    median fmax from the single cluster fits
dG_init      median ΔG from the single cluster fits (kcal/mol)
fmin_init    median fmin from the single cluster fits
numTests     number of clusters per variant
fitFraction  fraction of clusters associated with that variant that fit well
pvalue       one-tailed binomial p-value for observing this fitFraction given a background rate of 0.25
numClusters  same as numTests
fmax_lb      lower bound of 95% confidence interval on fmax
fmax         fit fmax
fmax_ub      upper bound of 95% confidence interval on fmax
dG_lb        lower bound of 95% confidence interval on ΔG
dG           fit ΔG (kcal/mol)
dG_ub        upper bound of 95% confidence interval on ΔG
fmin_lb      lower bound of 95% confidence interval on fmin
fmin         fit fmin
fmin_ub      upper bound of 95% confidence interval on fmin
rsq          median coefficient of determination for the resampled fits
numIter      number of iterations of the resampling
flag         whether fmax was enforced (1) or allowed to float during the fit (0)
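
The resampling idea behind the confidence intervals can be sketched as follows: draw the clusters of one variant with replacement, take the median fluorescence at each concentration, refit, and read the 95% interval from the 2.5th and 97.5th percentiles of the refit ΔG values. The binding model, RT = 0.582 kcal/mol, the simulated clusters, the choice of the median as the point estimate, and the use of scipy.optimize.curve_fit are assumptions for illustration. Unlike bootStrapFits, this sketch does not use the fmax distribution to constrain fmax.

```python
import numpy as np
from scipy.optimize import curve_fit

RT = 0.582  # kcal/mol; assumed value


def binding_curve(conc_nM, fmax, dG, fmin):
    """Two-state binding curve (assumed form); dG in kcal/mol, concentrations in nM."""
    kd_nM = np.exp(dG / RT) * 1e9
    return fmax * conc_nM / (conc_nM + kd_nM) + fmin


# 3-fold dilution series ending at 2000 nM, as in the example concentrations file.
conc_nM = 2000.0 / 3.0 ** np.arange(7, -1, -1)

# Simulate 30 clusters of one variant: rows are clusters, columns are binding points.
rng = np.random.default_rng(1)
n_clusters = 30
clusters = binding_curve(conc_nM, 2.0, -9.5, 0.05)
clusters = clusters + rng.normal(0.0, 0.2, (n_clusters, conc_nM.size))

# Bounds and shared initial guesses from the example fitParameters file.
bounds = ([0.0, -14.787322, 0.0], [np.inf, -4.962856, np.inf])

n_iter = 200
dG_samples = np.empty(n_iter)
for i in range(n_iter):
    # Resample cluster rows with replacement.
    rows = rng.integers(0, n_clusters, size=n_clusters)
    median_signal = np.median(clusters[rows], axis=0)
    # Initial fmax from the data, kept positive so it lies inside the bounds.
    p0 = [max(median_signal.max(), 1e-3), -7.637215, 0.179725]
    popt, _ = curve_fit(binding_curve, conc_nM, median_signal, p0=p0, bounds=bounds)
    dG_samples[i] = popt[1]

# Point estimate taken here as the median of the resampled fits (an assumption).
dG_lb, dG, dG_ub = np.percentile(dG_samples, [2.5, 50, 97.5])
print(f"dG = {dG:.2f} kcal/mol (95% CI {dG_lb:.2f} to {dG_ub:.2f})")
```

Variants with more clusters give narrower intervals under this scheme, because each resampled median is less noisy.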