Commit

updated documentation
pwrose committed Oct 5, 2018
1 parent 4cc44c1 commit b28ecc4
Showing 11 changed files with 56 additions and 54 deletions.
14 changes: 7 additions & 7 deletions example1/0-Workflow.html
Original file line number Diff line number Diff line change
Expand Up @@ -11783,12 +11783,12 @@ <h1 id="Predict-Fold-Type-of-a-Protein-from-Protein-Sequence">Predict Fold Type
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<p><strong>The notebooks in this directory demonstrate the "Ten Rules for Reproducible Research in Jupyter Notebooks". Throughout the notebooks we refer to some the rules we applied.</strong></p>
+<p><strong>The notebooks in this directory demonstrate and apply the "Ten Rules for Reproducible Research in Jupyter Notebooks". Throughout the notebooks we refer to some the rules we applied.</strong></p>
<p><strong>For example, this notebook demonstrates:</strong></p>
<hr>
-<p><strong>Rule 1: Tell a Story for an Audience.</strong> This notebook was developed for biologists to learn how to apply a simple machine learning model to protein sequences.</p>
+<p><strong>Rule 1: Tell a Story for an Audience.</strong> This notebook was developed to learn how to apply a simple machine learning model to predict protein features based on protein sequences.</p>
<p><strong>Rule 3: Build a Pipeline.</strong> This notebook describes the entire workflow from data preparation, feature calculation, model fitting, to prediction. The modularity makes it easy to replace one of the steps, for example, use a different method to calculate features or apply a different machine learning model.</p>
-<p><strong>Rule 5: Use Cell, Section adn Notebook Divisions to Make Steps Clear.</strong> We broke the workflow into separate notebooks and use this top-level notebook to explain and orchestrate the workflow.</p>
+<p><strong>Rule 5: Use Cell, Section and Notebook Divisions to Make Steps Clear.</strong> We broke the workflow into separate notebooks and use this top-level notebook to explain and organize the workflow.</p>
<hr>

</div>
Expand All @@ -11806,7 +11806,7 @@ <h2 id="Introduction">Introduction<a class="anchor-link" href="#Introduction">&#
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<p>Protein chains fold in regular patterns. Secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are</p>
+<p>Proteins have four different levels of structure – primary, secondary, tertiary and quaternary. Secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are:</p>
<ul>
<li>Alpha helices</li>
<li>Beta sheets</li>
Expand All @@ -11819,7 +11819,7 @@ <h2 id="Introduction">Introduction<a class="anchor-link" href="#Introduction">&#
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<p>We can classify proteins into three major fold classes based on their predominant secondary structure content</p>
+<p>We can classify proteins into three major fold classes based on their predominant secondary structure content:</p>
<ul>
<li>alpha: contains predominantly alpha helices</li>
<li>beta: contains predominantly beta sheets</li>
Expand All @@ -11833,7 +11833,7 @@ <h2 id="Introduction">Introduction<a class="anchor-link" href="#Introduction">&#
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Goal">Goal<a class="anchor-link" href="#Goal">&#182;</a></h2><p>This notebook demonstrates how to create a reproducible record to create a machine learning model. We train a simple model to predict the fold class of a protein given its protein sequence using a representative set of 3D structures from the Protein Data Bank.</p>
+<h2 id="Goal">Goal<a class="anchor-link" href="#Goal">&#182;</a></h2><p>This notebook demonstrates how to create a reproducible record using a machine learning model. We train the model to predict the fold class of a protein given its amino acid sequence using a representative set of 3D structures from the Protein Data Bank.</p>
<p><strong>Run the following notebooks and explore how we applied the Ten Simple Rules.</strong></p>

</div>
Expand Down Expand Up @@ -11887,7 +11887,7 @@ <h2 id="2.-Calculate-Features">2. Calculate Features<a class="anchor-link" href=
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<p>Protein sequences cannot be directly used for machine learning. Here use the Word2vec method to calculate a fixed-sized feature vector for each protein sequence.</p>
+<p>Protein sequences cannot be directly used for machine learning. Here we use the Word2vec method to calculate a fixed-sized feature vector for each protein sequence.</p>
<p>Run the following notebook to calculate feature vectors.</p>

</div>
Expand Down
14 changes: 7 additions & 7 deletions example1/0-Workflow.ipynb
Expand Up @@ -11,17 +11,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"**The notebooks in this directory demonstrate the \"Ten Rules for Reproducible Research in Jupyter Notebooks\". Throughout the notebooks we refer to some the rules we applied.**\n",
+"**The notebooks in this directory demonstrate and apply the \"Ten Rules for Reproducible Research in Jupyter Notebooks\". Throughout the notebooks we refer to some the rules we applied.**\n",
"\n",
"**For example, this notebook demonstrates:**\n",
"\n",
"---\n",
"\n",
-"**Rule 1: Tell a Story for an Audience.** This notebook was developed for biologists to learn how to apply a simple machine learning model to protein sequences.\n",
+"**Rule 1: Tell a Story for an Audience.** This notebook was developed to learn how to apply a simple machine learning model to predict protein features based on protein sequences.\n",
"\n",
"**Rule 3: Build a Pipeline.** This notebook describes the entire workflow from data preparation, feature calculation, model fitting, to prediction. The modularity makes it easy to replace one of the steps, for example, use a different method to calculate features or apply a different machine learning model.\n",
"\n",
-"**Rule 5: Use Cell, Section adn Notebook Divisions to Make Steps Clear.** We broke the workflow into separate notebooks and use this top-level notebook to explain and orchestrate the workflow.\n",
+"**Rule 5: Use Cell, Section and Notebook Divisions to Make Steps Clear.** We broke the workflow into separate notebooks and use this top-level notebook to explain and organize the workflow.\n",
"\n",
"---"
]
Expand All @@ -37,7 +37,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Protein chains fold in regular patterns. Secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are\n",
+"Proteins have four different levels of structure – primary, secondary, tertiary and quaternary. Secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are:\n",
"* Alpha helices\n",
"* Beta sheets"
]
Expand All @@ -46,7 +46,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"We can classify proteins into three major fold classes based on their predominant secondary structure content\n",
+"We can classify proteins into three major fold classes based on their predominant secondary structure content:\n",
"* alpha: contains predominantly alpha helices\n",
"* beta: contains predominantly beta sheets\n",
"* alpha+beta: contains alpha helices and beta sheets"
Expand All @@ -57,7 +57,7 @@
"metadata": {},
"source": [
"## Goal\n",
-"This notebook demonstrates how to create a reproducible record to create a machine learning model. We train a simple model to predict the fold class of a protein given its protein sequence using a representative set of 3D structures from the Protein Data Bank.\n",
+"This notebook demonstrates how to create a reproducible record using a machine learning model. We train the model to predict the fold class of a protein given its amino acid sequence using a representative set of 3D structures from the Protein Data Bank.\n",
"\n",
"**Run the following notebooks and explore how we applied the Ten Simple Rules.**"
]
Expand Down Expand Up @@ -103,7 +103,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Protein sequences cannot be directly used for machine learning. Here use the Word2vec method to calculate a fixed-sized feature vector for each protein sequence.\n",
+"Protein sequences cannot be directly used for machine learning. Here we use the Word2vec method to calculate a fixed-sized feature vector for each protein sequence.\n",
"\n",
"Run the following notebook to calculate feature vectors. "
]
Expand Down
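The changed cell above says a Word2vec model turns each variable-length protein sequence into a fixed-size feature vector. The sketch below illustrates one common way that can work, and it is an assumption rather than the notebook's actual code: the sequence is split into overlapping n-grams ("words"), and the per-word vectors are averaged. The function names, the n-gram size and the toy embedding table are all hypothetical; a real model would learn the embeddings from data.

```python
from typing import Dict, List


def ngram_words(sequence: str, n: int = 2) -> List[str]:
    """Split a sequence into overlapping n-grams (hypothetical tokenization)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def average_vector(words: List[str],
                   embeddings: Dict[str, List[float]],
                   size: int) -> List[float]:
    """Average the vectors of all known words into one fixed-size vector."""
    total = [0.0] * size
    count = 0
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # skip words that have no embedding
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    # With no known words, fall back to the zero vector.
    return [t / count for t in total] if count else total


# Toy embedding table (assumption: a trained model would supply these).
toy_embeddings = {"MK": [1.0, 0.0], "KV": [0.0, 1.0]}
features = average_vector(ngram_words("MKV"), toy_embeddings, size=2)
```

Whatever the sequence length, the result always has `size` entries, which is what lets a standard classifier consume it.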
15 changes: 8 additions & 7 deletions example1/1-CreateDataset.html
Expand Up @@ -11775,7 +11775,7 @@
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<h1 id="Create-Dataset">Create Dataset<a class="anchor-link" href="#Create-Dataset">&#182;</a></h1><p>This notebook extracts protein secondary structure information from the Protein Data Bank for a set of representative protein chains and assigns a fold classification.</p>
+<h1 id="Create-Dataset">Create Dataset<a class="anchor-link" href="#Create-Dataset">&#182;</a></h1><p>This notebook extracts from the Protein Data Bank information about the secondary structure of proteins. The ultimate goal is to assign a fold classification for a set of representative proteins.</p>

</div>
</div>
Expand Down Expand Up @@ -11826,7 +11826,7 @@ <h1 id="Create-Dataset">Create Dataset<a class="anchor-link" href="#Create-Datas
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Read-Representative-Set-of-Protein-Chains">Read Representative Set of Protein Chains<a class="anchor-link" href="#Read-Representative-Set-of-Protein-Chains">&#182;</a></h2><p>Protein chains in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the <a href="http://dunbrack.fccc.edu/PISCES.php">PISCES website</a> with a maximum of 25% sequence identity and an X-ray resolution of &lt;= 3.0 &#197;.</p>
+<h2 id="Read-a-Representative-Set-of-Protein-Chains">Read a Representative Set of Protein Chains<a class="anchor-link" href="#Read-a-Representative-Set-of-Protein-Chains">&#182;</a></h2><p>Protein sequences in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the <a href="http://dunbrack.fccc.edu/PISCES.php">PISCES website</a> with a maximum of 25% sequence identity and an X-ray resolution of &lt;= 3.0 &#197;.</p>
<p>Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server, Nucleic Acids Res. 33, W94-8. <a href="https://doi.org/10.1093/nar/gki402">doi: 10.1093/nar/gki402</a></p>

</div>
Expand Down Expand Up @@ -11956,7 +11956,7 @@ <h2 id="Read-Representative-Set-of-Protein-Chains">Read Representative Set of Pr
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Read-Protein-Sequence-and-Secondary-Structure-Data">Read Protein Sequence and Secondary Structure Data<a class="anchor-link" href="#Read-Protein-Sequence-and-Secondary-Structure-Data">&#182;</a></h2><p>Protein secondary structure is most commonly assigned with the DSSP method.</p>
+<h2 id="Read-Protein-Sequence-and-Secondary-Structure-Data">Read Protein Sequence and Secondary Structure Data<a class="anchor-link" href="#Read-Protein-Sequence-and-Secondary-Structure-Data">&#182;</a></h2><p>Secondary structure of proteins is most commonly assigned with the DSSP method.</p>
<p>Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–637. <a href="https://doi.org/10.1002/bip.360221211">doi 0.1002/bip.360221211</a></p>
<p>DSSP defines <a href="https://en.wikipedia.org/wiki/Protein_secondary_structure">8 classes of secondary structure</a>:</p>
<table>
Expand Down Expand Up @@ -12000,8 +12000,8 @@ <h2 id="Read-Protein-Sequence-and-Secondary-Structure-Data">Read Protein Sequenc
</tr>
</tbody>
</table>
-<p>The <a href="https://www.rcsb.org">RCSB Protein Data Bank</a> provides protein sequence and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe.</p>
-<p>The <strong>secondary_structure</strong> string below is the secondary strucuture assignment for each amino acid residue in the <strong>sequence</strong>.</p>
+<p>The <a href="https://www.rcsb.org">RCSB Protein Data Bank</a> provides protein sequences and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe.</p>
+<p>The <strong>secondary_structure</strong> string below is the secondary structure assignment for each amino acid residue in the <strong>sequence</strong>.</p>

</div>
</div>
Expand Down Expand Up @@ -12414,7 +12414,7 @@ <h2 id="Calculate-Secondary-Structure-Content">Calculate Secondary Structure Con
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Classify-Chains-by-Secondary-Structure-Content">Classify Chains by Secondary Structure Content<a class="anchor-link" href="#Classify-Chains-by-Secondary-Structure-Content">&#182;</a></h2><p>Next we classify each protein chain into one of four classes. We use a threshold of 25% to define a predominant class.</p>
+<h2 id="Classify-Sequences-by-Secondary-Structure-Content">Classify Sequences by Secondary Structure Content<a class="anchor-link" href="#Classify-Sequences-by-Secondary-Structure-Content">&#182;</a></h2><p>Next we classify each protein chain into one of four classes. We use a threshold of 25% to define a predominant class.</p>
<ul>
<li>alpha: predominantly alpha (&gt;=25%)</li>
<li>beta: predominantly beta (&gt;=25%)</li>
Expand Down Expand Up @@ -12457,7 +12457,8 @@ <h2 id="Classify-Chains-by-Secondary-Structure-Content">Classify Chains by Secon
<div class="prompt input_prompt">In&nbsp;[8]:</div>
<div class="inner_cell">
<div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="n">value_col</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">protein_fold_class</span><span class="p">,</span> <span class="n">minThreshold</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">maxThreshold</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># assign protein fold class</span>
+<span class="n">df</span><span class="p">[</span><span class="n">value_col</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">protein_fold_class</span><span class="p">,</span> <span class="n">minThreshold</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">maxThreshold</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># exclude protein chains without a dominant classification from further analysis.</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="n">value_col</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;other&#39;</span><span class="p">]</span>
Expand Down
15 changes: 8 additions & 7 deletions example1/1-CreateDataset.ipynb
Expand Up @@ -5,7 +5,7 @@
"metadata": {},
"source": [
"# Create Dataset\n",
-"This notebook extracts protein secondary structure information from the Protein Data Bank for a set of representative protein chains and assigns a fold classification."
+"This notebook extracts from the Protein Data Bank information about the secondary structure of proteins. The ultimate goal is to assign a fold classification for a set of representative proteins."
]
},
{
Expand Down Expand Up @@ -47,8 +47,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Read Representative Set of Protein Chains\n",
-"Protein chains in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the [PISCES website](http://dunbrack.fccc.edu/PISCES.php) with a maximum of 25% sequence identity and an X-ray resolution of <= 3.0 &#197;. \n",
+"## Read a Representative Set of Protein Chains\n",
+"Protein sequences in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the [PISCES website](http://dunbrack.fccc.edu/PISCES.php) with a maximum of 25% sequence identity and an X-ray resolution of <= 3.0 &#197;. \n",
"\n",
"Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server, Nucleic Acids Res. 33, W94-8. [doi: 10.1093/nar/gki402](https://doi.org/10.1093/nar/gki402)"
]
Expand Down Expand Up @@ -170,7 +170,7 @@
"metadata": {},
"source": [
"## Read Protein Sequence and Secondary Structure Data\n",
-"Protein secondary structure is most commonly assigned with the DSSP method.\n",
+"Secondary structure of proteins is most commonly assigned with the DSSP method.\n",
"\n",
"Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–637. [doi 0.1002/bip.360221211](https://doi.org/10.1002/bip.360221211)\n",
"\n",
Expand All @@ -187,9 +187,9 @@
"| Bend (the only non-hydrogen-bond based assignment) | S |\n",
"| Coil (residues which are not in any of the above conformations) | C |\n",
"\n",
-"The [RCSB Protein Data Bank](https://www.rcsb.org) provides protein sequence and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe. \n",
+"The [RCSB Protein Data Bank](https://www.rcsb.org) provides protein sequences and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe. \n",
"\n",
-"The **secondary_structure** string below is the secondary strucuture assignment for each amino acid residue in the **sequence**."
+"The **secondary_structure** string below is the secondary structure assignment for each amino acid residue in the **sequence**."
]
},
{
Expand Down Expand Up @@ -614,7 +614,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Classify Chains by Secondary Structure Content\n",
+"## Classify Sequences by Secondary Structure Content\n",
"Next we classify each protein chain into one of four classes. We use a threshold of 25% to define a predominant class.\n",
"\n",
"* alpha: predominantly alpha (>=25%)\n",
Expand Down Expand Up @@ -803,6 +803,7 @@
}
],
"source": [
+"# assign protein fold class\n",
"df[value_col] = df.apply(protein_fold_class, minThreshold=0.05, maxThreshold=0.25, axis=1)\n",
"\n",
"# exclude protein chains without a dominant classification from further analysis.\n",
Expand Down
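The hunks above call `protein_fold_class` with `minThreshold=0.05` and `maxThreshold=0.25`, but its body is not part of this diff. The sketch below is a hypothetical reconstruction of how such a classification could work, not the repository's implementation. It assumes that DSSP codes `H`, `G` and `I` count as alpha and `E` and `B` count as beta, and that a class is "predominant" when its fraction reaches the upper threshold while the other stays below the lower one. The function names here are made up for illustration.

```python
from typing import Tuple


def secondary_structure_content(dssp: str) -> Tuple[float, float]:
    """Return the (alpha, beta) fractions of a DSSP assignment string.

    Assumption: H/G/I are grouped as alpha and E/B as beta.
    """
    if not dssp:
        return 0.0, 0.0
    alpha = sum(dssp.count(code) for code in "HGI") / len(dssp)
    beta = sum(dssp.count(code) for code in "EB") / len(dssp)
    return alpha, beta


def fold_class(alpha: float, beta: float,
               min_threshold: float = 0.05,
               max_threshold: float = 0.25) -> str:
    """Assign one of four classes: alpha, beta, alpha+beta, or other."""
    if alpha >= max_threshold and beta < min_threshold:
        return "alpha"
    if beta >= max_threshold and alpha < min_threshold:
        return "beta"
    if alpha >= max_threshold and beta >= max_threshold:
        return "alpha+beta"
    return "other"  # no dominant class; the notebook drops these rows


alpha, beta = secondary_structure_content("HHHHCCCC")
label = fold_class(alpha, beta)
```

Under these assumptions, a chain with moderate amounts of both elements (for example 10% alpha and 10% beta) falls into `other` and would be excluded by the `df[value_col] != 'other'` filter shown in the diff.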