Fitting a binding series
This page walks through an example of how to quantify the binding affinity of a ligand, starting from quantified images.
This pipeline requires the following input files:
- Fluorescence values.
  - A CPseries file: a pandas DataFrame, indexed on the clusterID, with columns corresponding to the fluorescence values at the different points in the binding series.
  - Ideally, this file includes only the clusters you would like to fit.
- Concentrations.
  - A text file giving the concentration for each column of the fluorescence values.
  - Units are nM.
- Library members.
  - A CPannot file: a pandas DataFrame, indexed on the clusterID, with a single variant column (variant_number in the example below) that indicates which library member is at that cluster.
If you don't yet have these files, refer to this page on processing CPfluor files and mapping barcodes: ([data preprocessing](Processing quantified images into an experimental series)).
Example CPseries file:

clusterID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
M00653:72:000000000-AKPP5:1:2101:19526:15124 | 0.263915 | NaN | 18.680380 | 28.731330 | 113.997074 | 309.156274 | 392.594983 | 406.253575 |
M00653:72:000000000-AKPP5:1:2102:4005:7276 | 12.443187 | 6.235821 | 19.070643 | 70.348760 | 167.753083 | 309.825965 | 546.657635 | 863.191553 |
M00653:72:000000000-AKPP5:1:2102:14626:8517 | 20.914235 | 10.280927 | 66.042941 | 196.194123 | 235.140475 | 389.937945 | 550.857002 | 693.632958 |
M00653:72:000000000-AKPP5:1:2102:19935:11152 | 13.561834 | 24.205506 | 62.365845 | 105.381084 | 331.450972 | 579.017631 | 986.909676 | 1147.608517 |
M00653:72:000000000-AKPP5:1:2102:23710:15626 | 13.483145 | 8.294032 | 44.503121 | 48.735191 | 213.691357 | 364.032730 | 632.511202 | 665.162297 |
Example CPannot file:

clusterID | variant_number |
---|---|
M00653:72:000000000-AKPP5:1:2101:10000:1184 | NaN |
M00653:72:000000000-AKPP5:1:2101:10000:14640 | 24793 |
M00653:72:000000000-AKPP5:1:2101:10000:18114 | 5378 |
M00653:72:000000000-AKPP5:1:2101:10000:20592 | 7446 |
M00653:72:000000000-AKPP5:1:2101:10000:21349 | 7577 |
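
Because the CPseries file should ideally contain only the clusters you want to fit, one option is to keep only the clusters that carry a variant annotation. The sketch below shows that filtering step with pandas on toy data. The cluster names, the values, and the helper `clusters_to_fit` are hypothetical and are not part of the pipeline.

```python
import pandas as pd

# Toy stand-ins for a CPseries and a CPannot DataFrame (hypothetical values).
series = pd.DataFrame(
    {0: [0.26, 12.44, 20.91], 1: [float("nan"), 6.24, 10.28]},
    index=pd.Index(["cluster_a", "cluster_b", "cluster_c"], name="clusterID"),
)
annot = pd.DataFrame(
    {"variant_number": [float("nan"), 24793.0, 5378.0]},
    index=pd.Index(["cluster_a", "cluster_b", "cluster_d"], name="clusterID"),
)


def clusters_to_fit(series, annot, column="variant_number"):
    """Keep only the rows of `series` whose cluster has a non-NaN variant."""
    annotated = annot[column].dropna().index
    return series.loc[series.index.intersection(annotated)]


subset = clusters_to_fit(series, annot)
print(subset)
```

Here only `cluster_b` is both measured and annotated, so it is the only row kept.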
Example concentrations file (units of nM):

0.914494742
2.743484225
8.230452675
24.69135802
74.07407407
222.2222222
666.6666667
2000
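
The example concentrations above form a 3-fold dilution series ending at 2000 nM. A minimal sketch for generating such a file follows, assuming one value per line as in the example. The helper names are illustrative and not part of the pipeline.

```python
def dilution_series(top_nM=2000.0, fold=3.0, n_points=8):
    """Return concentrations in ascending order, ending at `top_nM`."""
    return [top_nM / fold**k for k in range(n_points - 1, -1, -1)]


def format_concentrations(concentrations):
    """Render one concentration per line (assumed file layout)."""
    return "\n".join(f"{c:.10g}" for c in concentrations) + "\n"


def write_concentrations(path, concentrations):
    """Write the concentrations to a text file such as concentrations.txt."""
    with open(path, "w") as handle:
        handle.write(format_concentrations(concentrations))


concentrations = dilution_series()
print(format_concentrations(concentrations), end="")
```

The number of values should match the number of columns in the CPseries file (eight in the example).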
Normalize the binding fluorescence (green channel) by the all-RNA fluorescence (red channel).
# normalize green channel CPseries file by red channel CPseries file.
basename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced
python -m normalizeSeries -b $basename.CPseries.gz -a $basename"_red.CPseries.gz"
This produces a CPseries file in which the fluorescence values have been normalized (see [normalization](Normalization of cluster fluorescence intensities) for more info).
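
To give a feel for what normalization does, here is a rough sketch. It assumes each cluster's green series is divided by a single red-channel value for that cluster, with very low red values raised to a floor to avoid inflated ratios. This is an assumption for illustration only; the function name, the floor, and the toy data are hypothetical, and the normalization page describes what `normalizeSeries` actually does.

```python
import pandas as pd


def normalize_series(green, red_signal, floor):
    """Divide each cluster's green series by its (floored) red signal.

    `green` is a DataFrame indexed on cluster, `red_signal` a Series with
    one value per cluster. Both the per-cluster division and the floor are
    assumptions of this sketch.
    """
    bounded = red_signal.clip(lower=floor)
    return green.div(bounded, axis=0)


# Toy data (hypothetical values).
green = pd.DataFrame(
    {0: [10.0, 20.0], 1: [100.0, 400.0]},
    index=["cluster_a", "cluster_b"],
)
red = pd.Series([200.0, 5.0], index=["cluster_a", "cluster_b"])

normalized = normalize_series(green, red, floor=50.0)
print(normalized)
```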
Fit single clusters with minimal constraints.
# fit single clusters
c=concentrations.txt
normbasename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced_normalized
python -m singleClusterFits -b $normbasename".CPseries.gz" -c $c -n 20
Note: if you'd like to fit only a subset to make sure the script is working, use the `--subset` flag, which should take less than a minute to run.
This produces two outputs:
- Single cluster fits.
  - A CPfitted file: a pandas DataFrame, indexed on the clusterID, with columns corresponding to the fit parameters (fmax, dG, fmin) and their associated standard errors (fmax_stde, etc.), as well as the coefficient of determination (rsq), the exit flag (from lmfit), and the root mean squared error (rmse).
- Fit parameters.
  - A fitParameters file giving the initial guesses and the upper and lower bounds for the three fit parameters (fmax, dG, fmin). Note: these are the parameters that are shared across all clusters. The initial guess for fmax is estimated per cluster (from the maximum value of the fluorescence), and so is set to NaN.
Example CPfitted file:

clusterID | fmax | dG | fmin | fmax_stde | dG_stde | fmin_stde | rsq | exit_flag | rmse |
---|---|---|---|---|---|---|---|---|---|
M00653:72:000000000-AKPP5:1:2116:4535:17179 | 0.9501852 | -11.66725 | 0.4678564 | 0.3332177 | 0.5082402 | 0.3451283 | 0.8971255 | 1 | 0.2126636 |
M00653:72:000000000-AKPP5:1:2109:12728:13104 | 1.930328 | -9.517453 | 0.05344163 | 0.08412358 | 0.1137484 | 0.05735404 | 0.9910591 | 1 | 0.1903357 |
M00653:72:000000000-AKPP5:1:2115:24878:15052 | 2.366877 | -9.283474 | 0.02620492 | 0.06113703 | 0.06477567 | 0.0361533 | 0.9971191 | 1 | 0.128555 |
M00653:72:000000000-AKPP5:1:2104:19291:13131 | 2.298948 | -9.742383 | 2.860502e-07 | 0.1420009 | 0.1525948 | 0.09507577 | 0.9833778 | 1 | 0.3186123 |
M00653:72:000000000-AKPP5:1:2102:10109:8541 | 1.99432 | -9.335828 | 1.521232e-08 | 0.1096474 | 0.1393694 | 0.008049328 | 0.9867835 | 1 | 0.2351829 |
Example fitParameters file:

 | fmax | dG | fmin |
---|---|---|---|
lowerbound | 0.000000 | -14.787322 | 0.000000 |
initial | NaN | -7.637215 | 0.179725 |
upperbound | inf | -4.962856 | inf |
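
To see how the three fit parameters relate to a binding curve, the sketch below evaluates a standard single-site binding isotherm, f = fmin + fmax * c / (c + Kd), with Kd = exp(dG / RT). Both the functional form and the value RT = 0.582 kcal/mol (roughly 20 °C) are assumptions of this sketch and are not stated on this page; the function names are illustrative.

```python
import math

RT_KCAL_PER_MOL = 0.582  # assumed value of R*T near 20 degrees C


def kd_nM(dG, rt=RT_KCAL_PER_MOL):
    """Convert dG (kcal/mol) to Kd in nM, assuming Kd = exp(dG/RT) in molar."""
    return math.exp(dG / rt) * 1e9


def binding_curve(conc_nM, fmax, dG, fmin, rt=RT_KCAL_PER_MOL):
    """Single-site binding isotherm (assumed model) at concentration conc_nM."""
    kd = kd_nM(dG, rt)
    return fmin + fmax * conc_nM / (conc_nM + kd)


# Parameters taken from one row of the example CPfitted output.
fmax, dG, fmin = 1.930328, -9.517453, 0.05344163
concentrations = [2000 / 3**k for k in range(7, -1, -1)]
for conc in concentrations:
    print(f"{conc:10.3f} nM -> {binding_curve(conc, fmax, dG, fmin):.3f}")
```

Under this model the signal equals fmin at zero concentration, reaches fmin + fmax / 2 when the concentration equals Kd, and approaches fmin + fmax at saturation.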
Find the distribution of fmax for good variants.
c=concentrations.txt
an=anyRNA.CPannot.pkl
normbasename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced_normalized
python -m findFmaxDist -cf $normbasename.CPfitted.gz -a $an -c $c
This produces two outputs:
- fmax distribution.
  - A Python object that stores a gamma distribution whose mean is fixed, but whose standard deviation depends on the number of measurements (i.e. the number of clusters with the same variant number).
- Initial per-variant fit parameters.
  - A CPvariant file: a pandas DataFrame, indexed on the variant number, with columns corresponding to the median fit parameters from the single cluster fits (fmax_init, dG_init, fmin_init), the number of clusters per variant (numTests), the fraction of clusters that fit well (fitFraction), the p-value for observing this fitFraction given a background rate of 0.25 (pvalue), as well as columns that will be filled out in the next step.
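Example initial CPvariant file:

variant_number | fmax_init | dG_init | fmin_init | numTests | fitFraction | pvalue | numClusters | fmax_lb | fmax | fmax_ub | dG_lb | dG | dG_ub | fmin_lb | fmin | fmin_ub | rsq | numIter | flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.522423 | -8.596066 | 0.052727 | 54 | 0.981481 | 5.022825e-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2.179792 | -8.474715 | 0.026333 | 78 | 0.948718 | 1.287681e-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1.880778 | -8.400040 | 0.023979 | 46 | 0.978261 | 2.807083e-26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1.884838 | -8.467149 | 0.027602 | 44 | 1.000000 | 3.231174e-27 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2.085465 | -8.133927 | 0.036511 | 50 | 0.980000 | 1.191180e-28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |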
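
The pvalue column in the example output is consistent with a one-tailed binomial test: the probability of seeing at least the observed number of well-fit clusters if each cluster fit well with probability 0.25. The sketch below computes that tail probability with the standard library; the function name is illustrative and this is not the pipeline's own code.

```python
from math import comb


def fit_fraction_pvalue(num_good, num_tests, background=0.25):
    """P(X >= num_good) for X ~ Binomial(num_tests, background)."""
    return sum(
        comb(num_tests, k) * background**k * (1 - background) ** (num_tests - k)
        for k in range(num_good, num_tests + 1)
    )


# Variant 3 in the example: 44 clusters, all of which fit well.
print(fit_fraction_pvalue(44, 44))  # about 3.23e-27, as in the example
# Variant 0 in the example: 53 of 54 clusters fit well (fitFraction 0.981481).
print(fit_fraction_pvalue(53, 54))  # about 5.02e-31, as in the example
```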
See Finding fmax distribution for more details on what happens during this program and output figure descriptions.
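
As a rough illustration of a gamma distribution with a fixed mean and a standard deviation that depends on the number of clusters, the sketch below converts a mean and standard deviation into gamma shape and scale parameters and draws a sample. The 1/sqrt(n) scaling of the standard deviation and the value of `sigma_single` are purely hypothetical choices; the actual dependence is determined by `findFmaxDist`.

```python
import random


def gamma_params(mean, std):
    """Shape and scale of a gamma distribution with the given mean and std."""
    shape = (mean / std) ** 2
    scale = std**2 / mean
    return shape, scale


def fmax_std(n_clusters, sigma_single=0.5):
    """Hypothetical rule: the std shrinks as 1/sqrt(number of clusters)."""
    return sigma_single / n_clusters**0.5


def sample_fmax(mean, n_clusters, rng):
    """Draw one fmax value for a variant measured in n_clusters clusters."""
    shape, scale = gamma_params(mean, fmax_std(n_clusters))
    return rng.gammavariate(shape, scale)


rng = random.Random(0)
for n in (1, 5, 50):
    print(n, fmax_std(n), sample_fmax(2.0, n, rng))
```

With this parameterization the mean (shape times scale) stays fixed while the spread narrows as more clusters are available.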
Using the fmax distribution, the fluorescence values, and the median values per variant of the single cluster fits, bootstrap the fits per variant.
c=concentrations.txt
an=anyRNA.CPannot.pkl
normbasename=bindingCurves/AKPP5_ALL_Bottom_filtered_reduced_normalized
python -m bootStrapFits -v $normbasename.init.CPvariant.gz -a $an -b $normbasename.CPseries.gz -c $c -f $normbasename.fmaxdist.p
This script saves another CPvariant file. The first six columns (fmax_init through pvalue) are identical to those in the initial file.
column | description |
---|---|
fmax_init | median fmax from the single cluster fits |
dG_init | median ΔG from the single cluster fits (kcal/mol) |
fmin_init | median fmin from the single cluster fits |
numTests | number of clusters per variant |
fitFraction | the fraction of clusters associated with that variant that fit well |
pvalue | the one-tailed binomial p-value for observing this fitFraction given a background rate of 0.25 |
numClusters | same as numTests |
fmax_lb | lower bound of 95% confidence interval on fmax |
fmax | fit fmax |
fmax_ub | upper bound of 95% confidence interval on fmax |
dG_lb | lower bound of 95% confidence interval on ΔG |
dG | fit ΔG (kcal/mol) |
dG_ub | upper bound of 95% confidence interval on ΔG |
fmin_lb | lower bound of 95% confidence interval on fmin |
fmin | fit fmin |
fmin_ub | upper bound of 95% confidence interval on fmin |
rsq | median coefficient of determination for the resampled fits |
numIter | number of iterations of the resampling |
flag | whether fmax was enforced (1) or whether it was allowed to float during the fit (0) |
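
The lower and upper bound columns come from resampling. As a simplified sketch of the idea, the code below resamples a set of per-cluster values with replacement and reports the 2.5th and 97.5th percentiles of the resampled medians. The real script refits the binding curve on each resample; using the median as the statistic, the function name, and the input values are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median of `values`.

    Returns (lower bound, point estimate, upper bound).
    """
    rng = random.Random(seed)
    estimates = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_iter)
    )
    lower = estimates[int((alpha / 2) * n_iter)]
    upper = estimates[int((1 - alpha / 2) * n_iter) - 1]
    return lower, statistics.median(values), upper


# Illustrative per-cluster dG values (kcal/mol), treated here as one variant.
dG_values = [-11.66725, -9.517453, -9.283474, -9.742383, -9.335828]
dG_lb, dG, dG_ub = bootstrap_ci(dG_values)
print(dG_lb, dG, dG_ub)
```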