:construction: add enrichment example to docs

Multiomics-Analytics-Group · Nov 25, 2024 · 7f51e02
1 parent 3961d84
commit 7f51e02

Showing 2 changed files with 332 additions and 0 deletions.
diff --git a/docs/api_examples/enrichment_analysis.ipynb b/docs/api_examples/enrichment_analysis.ipynb
@@ -0,0 +1,331 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f79a8051",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "source": [
+    "# Enrichment analysis\n",
+    "\n",
+    "- we need some groups of genes to compute clusters\n",
+    "- we need functional annotations, i.e. a category summarizing a set of genes.\n",
+    "\n",
+    "You can start by watching Lars Juhl Jensen's brief introduction to enrichment analysis\n",
+    "on [YouTube](https://www.youtube.com/watch?v=2NC1QOXmc5o).\n",
+    "\n",
+    "Use example data for ovarian cancer\n",
+    "([PXD010372](https://github.com/Multiomics-Analytics-Group/acore/tree/main/example_data/PXD010372))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "956ed7b7",
+   "metadata": {
+    "lines_to_next_cell": 2,
+    "tags": [
+     "hide-output"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "%pip install acore"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3030d08",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "import pandas as pd\n",
+    "\n",
+    "import acore\n",
+    "import acore.differential_regulation\n",
+    "import acore.enrichment_analysis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fddd607c",
+   "metadata": {},
+   "source": [
+    "Parameters of this notebook"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6af9349a",
+   "metadata": {
+    "tags": [
+     "parameters"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "base_path: str = (\n",
+    "    \"https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/main/\"\n",
+    "    \"example_data/PXD010372/processed\"\n",
+    ")\n",
+    "omics: str = f\"{base_path}/omics.csv\"\n",
+    "meta_pgs: str = f\"{base_path}/meta_pgs.csv\"\n",
+    "meta: str = f\"{base_path}/meta_patients.csv\"\n",
+    "N_to_sample: int = 1_000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10ed1830",
+   "metadata": {},
+   "source": [
+    "# Load processed data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d70ef4c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_omics = pd.read_csv(omics, index_col=0)\n",
+    "df_meta_pgs = pd.read_csv(meta_pgs, index_col=0)\n",
+    "df_meta = pd.read_csv(meta, index_col=0)\n",
+    "df_omics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b3897e1f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_omics.notna().sum().sort_values(ascending=True).plot()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ce47108",
+   "metadata": {},
+   "source": [
+    "Keep only features with at least 18 non-NaN values and sample from these so that\n",
+    "`N_to_sample` features are used in total for illustration. This total includes the\n",
+    "ones which were differentially regulated in the ANOVA using all the protein groups."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3a8ab49",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "idx_always_included = [\"Q5HYN5\", \"P39059\", \"O43432\", \"O43175\"]\n",
+    "df_omics[idx_always_included]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1145a2cd",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "df_omics = (\n",
+    "    df_omics\n",
+    "    # .dropna(axis=1)\n",
+    "    .drop(idx_always_included, axis=1)\n",
+    "    .dropna(thresh=18, axis=1)\n",
+    "    .sample(\n",
+    "        N_to_sample - len(idx_always_included),\n",
+    "        axis=1,\n",
+    "        random_state=42,\n",
+    "    )\n",
+    "    .join(df_omics[idx_always_included])\n",
+    ")\n",
+    "df_omics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aea77e80",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "df_meta"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4bbf5dc4",
+   "metadata": {},
+   "source": [
+    "## Compute up- and downregulated genes\n",
+    "These will be used to find enrichments in the set of both up- and downregulated genes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "231bb6da",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "group = \"Status\"\n",
+    "covariates = [\"PlatinumValue\"]\n",
+    "diff_reg = acore.differential_regulation.run_anova(\n",
+    "    df_omics.join(df_meta[[group]]),\n",
+    "    drop_cols=[],\n",
+    "    subject=None,\n",
+    "    group=group,\n",
+    ")\n",
+    "diff_reg.describe(exclude=[\"float\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1e347b06",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "diff_reg.query(\"rejected == True\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d6c0a225",
+   "metadata": {},
+   "source": [
+    "## Find functional annotations, here GO terms\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2668415",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from acore.io.uniprot import (\n",
+    "    check_id_mapping_results_ready,\n",
+    "    get_id_mapping_results_link,\n",
+    "    get_id_mapping_results_search,\n",
+    "    submit_id_mapping,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def fetch_annotations(\n",
+    "    ids: pd.Index | list,\n",
+    "    fields: str = \"accession,go_p,go_c,go_f\",\n",
+    ") -> pd.DataFrame:\n",
+    "    \"\"\"Fetch annotations for UniProt IDs. Combines several calls to the API of UniProt's\n",
+    "    knowledgebase (KB).\n",
+    "\n",
+    "    Parameters\n",
+    "    ----------\n",
+    "    ids : pd.Index | list\n",
+    "        Iterable of UniProt IDs. Fetches annotations as specified by `fields`.\n",
+    "    fields : str, optional\n",
+    "        Fields to fetch, by default \"accession,go_p,go_c,go_f\". See for available fields:\n",
+    "        https://www.uniprot.org/help/return_fields\n",
+    "\n",
+    "    Returns\n",
+    "    -------\n",
+    "    pd.DataFrame\n",
+    "        DataFrame with annotations of the UniProt IDs.\n",
+    "    \"\"\"\n",
+    "    job_id = submit_id_mapping(from_db=\"UniProtKB_AC-ID\", to_db=\"UniProtKB\", ids=ids)\n",
+    "\n",
+    "    # fail early instead of running into an undefined `results` below\n",
+    "    if not check_id_mapping_results_ready(job_id):\n",
+    "        raise RuntimeError(f\"No results available for ID mapping job {job_id}.\")\n",
+    "    link = get_id_mapping_results_link(job_id)\n",
+    "    # add fields to the link to get more information\n",
+    "    # From and Entry (accession) are the same for UniProt IDs.\n",
+    "    results = get_id_mapping_results_search(\n",
+    "        f\"{link}?fields={fields}&format=tsv\"\n",
+    "    )\n",
+    "    # first line is the tab-separated header, the remaining lines are the rows\n",
+    "    header = results.pop(0).split(\"\\t\")\n",
+    "    results = [line.split(\"\\t\") for line in results]\n",
+    "    df = pd.DataFrame(results, columns=header)\n",
+    "    return df\n",
+    "\n",
+    "\n",
+    "fname_annotations = \"downloaded/annotations.csv\"\n",
+    "fname = Path(fname_annotations)\n",
+    "try:\n",
+    "    annotations = pd.read_csv(fname, index_col=0)\n",
+    "    print(f\"Loaded annotations from {fname}\")\n",
+    "except FileNotFoundError:\n",
+    "    print(f\"Fetching annotations for {df_omics.columns.size} UniProt IDs.\")\n",
+    "    annotations = fetch_annotations(df_omics.columns)\n",
+    "    annotations = (\n",
+    "        annotations.set_index(\"Entry\")\n",
+    "        .rename_axis(\"identifier\")\n",
+    "        .drop(\"From\", axis=1)\n",
+    "        .rename_axis(\"source\", axis=1)\n",
+    "        .stack()\n",
+    "        .to_frame(\"annotation\")\n",
+    "        .replace(\"\", pd.NA)\n",
+    "        .dropna()\n",
+    "        .sort_values([\"source\", \"annotation\"])\n",
+    "        .reset_index()\n",
+    "    )\n",
+    "    fname.parent.mkdir(exist_ok=True, parents=True)\n",
+    "    annotations.to_csv(fname, index=True)\n",
+    "\n",
+    "annotations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4165bc94",
+   "metadata": {},
+   "source": [
+    "## Enrichment analysis\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f300c5b5",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "ret = acore.enrichment_analysis.run_regulation_enrichment(\n",
+    "    regulation_data=diff_reg,\n",
+    "    annotation=annotations,\n",
+    "    correction_alpha=0.01,\n",
+    ")\n",
+    "ret"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6dd57b99",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "tags,-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
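
Note on the enrichment step of the notebook above: the following is a minimal sketch of an over-representation test, not a description of what `acore.enrichment_analysis.run_regulation_enrichment` does internally. It assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The function name `hypergeom_enrichment_pvalue`, the protein identifiers and the annotation term are hypothetical and only serve as illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(
    n_hits_in_term: int, n_hits: int, n_term: int, n_background: int
) -> float:
    """One-sided p-value P(X >= n_hits_in_term) of a hypergeometric distribution.

    n_background   : number of all measured proteins
    n_term         : number of background proteins annotated with the term
    n_hits         : number of regulated proteins (the foreground)
    n_hits_in_term : number of regulated proteins annotated with the term
    """
    # number of ways to draw the foreground from the background
    total = comb(n_background, n_hits)
    # the overlap cannot exceed the foreground or the term size
    upper = min(n_hits, n_term)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_hits - k)
        for k in range(n_hits_in_term, upper + 1)
    )
    return tail / total


# toy data: hypothetical identifiers and one hypothetical annotation term
background = {f"P{i:02d}" for i in range(20)}
regulated = {"P00", "P01", "P02", "P03", "P10"}
term_members = {"P00", "P01", "P02", "P03", "P04"}

p_value = hypergeom_enrichment_pvalue(
    n_hits_in_term=len(regulated & term_members),
    n_hits=len(regulated),
    n_term=len(term_members),
    n_background=len(background),
)
print(f"{p_value:.4f}")  # 0.0049
```

In this toy case 4 of the 5 regulated proteins carry the term while only 5 of the 20 background proteins do, which gives a p-value of 76/15504. When many terms are tested, the p-values would typically be corrected for multiple testing, which is presumably what the `correction_alpha` argument in the notebook relates to.
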
diff --git a/docs/index.rst b/docs/index.rst
@@ -16,6 +16,7 @@
 
    api_examples/exploratory_analysis
    api_examples/normalization_analysis
+   api_examples/enrichment_analysis
 
 .. toctree::
    :maxdepth: 1
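
Note on the annotation cell of the notebook added in this commit: it stacks the UniProt columns into a long table with the columns `identifier`, `source` and `annotation`. The sketch below shows the same reshaping with the standard library only. It assumes that several GO terms of one protein are joined by `; ` within a single TSV field, in which case each term may need its own row before enrichment is tested per term. The helper `tsv_to_long` and the toy rows (including the accessions) are hypothetical.

```python
def tsv_to_long(
    lines: list[str], id_column: str = "Entry", sep: str = "; "
) -> list[dict]:
    """Turn TSV lines (header first) into one record per identifier and term."""
    header = lines[0].split("\t")
    records = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        identifier = row[id_column]
        for source, value in row.items():
            # skip the identifier column itself and empty annotation fields
            if source == id_column or not value:
                continue
            # assumption: multiple terms in one field are separated by `sep`
            for term in value.split(sep):
                records.append(
                    {"identifier": identifier, "source": source, "annotation": term}
                )
    return records


# toy input mimicking a tab-separated response (hypothetical accessions)
lines = [
    "Entry\tGene Ontology (biological process)\tGene Ontology (cellular component)",
    "X00001\tapoptotic process [GO:0006915]; DNA repair [GO:0006281]\tnucleus [GO:0005634]",
    "X00002\t\tcytosol [GO:0005829]",
]

records = tsv_to_long(lines)
for record in records:
    print(record["identifier"], record["source"], record["annotation"], sep=" | ")
```

The resulting records could be passed to `pd.DataFrame(records)` to obtain a table with the same three columns as in the notebook, but with one GO term per row.
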