diff --git a/hw/hw8/eva-congrats.png b/hw/hw8/eva-congrats.png new file mode 100644 index 0000000..ecae959 Binary files /dev/null and b/hw/hw8/eva-congrats.png differ diff --git a/hw/hw8/hw8.ipynb b/hw/hw8/hw8.ipynb new file mode 100644 index 0000000..d8066c1 --- /dev/null +++ b/hw/hw8/hw8.ipynb @@ -0,0 +1,1889 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e33c5b0c-d0f1-40c0-b794-3b3471ac73d2", + "metadata": {}, + "source": [ + "# CPSC 330 - Applied Machine Learning \n", + "\n", + "## Homework 8: Word embeddings, time series, and communication\n", + "### Associated lectures: Lectures 17, 19, 20, and ML communication \n", + "\n", + "**Due date: April 12, 2022 at 11:59pm**" + ] + }, + { + "cell_type": "markdown", + "id": "e851ce52-f981-4c4e-9e9e-a740edd12e41", + "metadata": {}, + "source": [ + "## Table of Contents\n", + "\n", + "- [Submission instructions](#sg) (4%)\n", + "- [Exercise 1 - Exploring pre-trained word embeddings](#1) (24%)\n", + "- [Exercise 2 - Exploring time series data](#2) (16%)\n", + "- [Exercise 3 - Short answer questions](#4) (10%)\n", + "- [Exercise 4 - Communication](#4) (46%)\n", + "- (Optional)[Exercise 5 - Course take away](#5)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "a4651888-484b-42a0-95e1-d273e5069205", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "%matplotlib inline\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "from sklearn.cluster import DBSCAN, KMeans\n", + "from sklearn.compose import ColumnTransformer, make_column_transformer\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import r2_score\n", + "from sklearn.model_selection import (\n", + " GridSearchCV,\n", + " 
RandomizedSearchCV,\n", + " cross_validate,\n", + " train_test_split,\n", + ")\n", + "from sklearn.pipeline import Pipeline, make_pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler\n", + "\n", + "pd.set_option(\"display.max_colwidth\", 0)" + ] + }, + { + "cell_type": "markdown", + "id": "086914c2-5de1-414a-8770-23bef9f312d0", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "afe22cb5-f825-4dba-b5e3-3538f4afe703", + "metadata": {}, + "source": [ + "## Instructions \n", + "
\n", + "rubric={points:4}\n", + "\n", + "Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). \n", + "\n", + "**You may work on this homework in a group and submit your assignment as a group.** Below are some instructions on working as a group. \n", + "- The maximum group size is 2. \n", + "- Use group work as an opportunity to collaborate and learn new things from each other. \n", + "- Be respectful to each other and make sure you understand all the concepts in the assignment well. \n", + "- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. \n", + "- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members)." + ] + }, + { + "cell_type": "markdown", + "id": "69be5b2d-1854-4c63-bcc6-9b6258b7293a", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "859b3f00-a3e5-45d8-b504-22a84ace38cd", + "metadata": {}, + "source": [ + "## Exercise 1: Exploring pre-trained word embeddings \n", + "
\n", + "\n", + "In lecture 17, we talked about natural language processing (NLP). Using pre-trained word embeddings is very common in NLP. It has been shown that pre-trained word embeddings [work well on a variety of text classification tasks](http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf). These embeddings are created by training a model like Word2Vec on a huge corpus of text such as a dump of Wikipedia or a web crawl. \n", + "\n", + "A number of pre-trained word embeddings are available. Some popular ones are: \n", + "\n", + "- [GloVe](https://nlp.stanford.edu/projects/glove/)\n", + " * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) \n", + " * published by Stanford University \n", + "- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) \n", + " * trained using the fastText algorithm\n", + " * published by Facebook\n", + " \n", + "In this exercise, you will be exploring GloVe Wikipedia pre-trained embeddings. The code below loads the word vectors trained on Wikipedia using an algorithm called GloVe. You'll need the `gensim` package installed in your cpsc330 conda environment for that. 
\n", + "\n", + "```\n", + "> conda activate cpsc330\n", + "> conda install -c anaconda gensim\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "b4823523-ca44-48a3-94bb-f6e453d27f1c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']\n" + ] + } + ], + "source": [ + "import gensim\n", + "import gensim.downloader\n", + "\n", + "print(list(gensim.downloader.info()[\"models\"].keys()))" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "83e4717e-215b-4c1b-b08a-9f5adbb52467", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[==================================================] 100.0% 128.1/128.1MB downloaded\n" + ] + } + ], + "source": [ + "# This will take a while to run when you run it for the first time.\n", + "import gensim.downloader as api\n", + "\n", + "glove_wiki_vectors = api.load(\"glove-wiki-gigaword-100\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "76ec38c4-ce89-4372-b015-035f4d682132", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "400000" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(glove_wiki_vectors)" + ] + }, + { + "cell_type": "markdown", + "id": "8c78dafb-f712-447a-b870-1fac6c249e5f", + "metadata": {}, + "source": [ + "There are 400,000 word vectors in this pre-trained model. " + ] + }, + { + "cell_type": "markdown", + "id": "ce2a75ac-fd18-4a53-89d3-26f1051c4ef3", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "8119fb78-d2be-4ccf-8c8d-31026563e072", + "metadata": {}, + "source": [ + "### 1.1 Word similarity using pre-trained embeddings\n", + "rubric={points:4}\n", + "\n", + "Now that we have GloVe Wiki vectors (`glove_wiki_vectors`) loaded, let's explore the embeddings. \n", + "\n", + "**Your tasks:**\n", + "\n", + "1. Calculate the cosine similarity for the following word pairs (`word_pairs`) using the [`similarity`](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) method of the model.\n", + "2. Do the similarities make sense? " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "aadd1aa6-6bb8-48d7-a959-691e19d411ec", + "metadata": {}, + "outputs": [], + "source": [ + "word_pairs = [\n", + " (\"coast\", \"shore\"),\n", + " (\"clothes\", \"closet\"),\n", + " (\"old\", \"new\"),\n", + " (\"smart\", \"intelligent\"),\n", + " (\"dog\", \"cat\"),\n", + " (\"tree\", \"lawyer\"),\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "b0331ffe-bf58-4198-bd1e-3b62a5d34319", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "4d528120-c1ff-4203-82b8-404b62ba6bb0", + "metadata": {}, + "source": [ + "### 1.2 Bias in embeddings\n", + "rubric={points:10}\n", + "\n", + "**Your tasks:**\n", + "1. In Lecture 17 we saw that our pre-trained word embedding model output an analogy that reinforced a gender stereotype. Give an example of how using such a model could cause harm in the real world.\n", + "2. Here we are using pre-trained embeddings which are built using a dump of Wikipedia data. Explore whether there are any worrisome biases present in these embeddings or not by trying out some examples. You can use the following two methods or other methods of your choice to explore what kind of stereotypes and biases are encoded in these embeddings. \n", + " - You can use the `analogy` function below which gives words analogies. \n", + " - You can also use [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) or [distance](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=distance#gensim.models.keyedvectors.KeyedVectors.distances) methods. (An example is shown below.) \n", + "3. Discuss your observations. Do you observe the gender stereotype we observed in class with GloVe Wikipedia embeddings?\n", + "\n", + "> Note that most of the recent embeddings are de-biased. But you might still observe some biases in the embeddings. Also, not all stereotypes present in pre-trained embeddings are necessarily bad. But you should be aware of them when you use embeddings in your models. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "59f7ef2b-d6b4-4338-a153-77fd4fc9847e", + "metadata": {}, + "outputs": [], + "source": [ + "def analogy(word1, word2, word3, model=glove_wiki_vectors):\n", + " \"\"\"\n", + " Returns analogy word using the given model.\n", + "\n", + " Parameters\n", + " --------------\n", + " word1 : (str)\n", + " word1 in the analogy relation\n", + " word2 : (str)\n", + " word2 in the analogy relation\n", + " word3 : (str)\n", + " word3 in the analogy relation\n", + " model :\n", + " word embedding model\n", + "\n", + " Returns\n", + " ---------------\n", + " pd.dataframe\n", + " \"\"\"\n", + " print(\"%s : %s :: %s : ?\" % (word1, word2, word3))\n", + " sim_words = model.most_similar(positive=[word3, word2], negative=[word1])\n", + " return pd.DataFrame(sim_words, columns=[\"Analogy word\", \"Score\"])" + ] + }, + { + "cell_type": "markdown", + "id": "a0d900c0-3027-4b9f-a750-bca946cf7bdb", + "metadata": {}, + "source": [ + "An example of using similarity between words to explore biases and stereotypes. " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "13f0bacc-ba70-43e3-a7be-07c263f48048", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.447236" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "glove_wiki_vectors.similarity(\"white\", \"rich\")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "2d2e6671-c6db-4cc9-9652-faba038e8e50", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.51745194" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "glove_wiki_vectors.similarity(\"black\", \"rich\")" + ] + }, + { + "cell_type": "markdown", + "id": "19f04b87-5fa0-4eb4-bb50-cb21aa7ffd1c", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "9e6ccd3e-25c1-413d-b079-043fb2949090", + "metadata": {}, + "source": [ + "### 1.3 Representation of all words in English\n", + "rubric={reasoning:2}\n", + "\n", + "**Your tasks:**\n", + "1. The vocabulary size of Wikipedia embeddings is quite large. Do you think it contains **all** words in English language? What would happen if you try to get a word vector that's not in the vocabulary (e.g., \"cpsc330\"). " + ] + }, + { + "cell_type": "markdown", + "id": "464a0fbf-3a9c-42b3-b9cc-5dc474bc2804", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "f1d3cd04-30c9-43f4-9b07-6443ab4ecd7d", + "metadata": {}, + "source": [ + "### 1.4 Classification with pre-trained embeddings\n", + "rubric={points:8}\n", + "\n", + "In lecture 16, we saw that you can conveniently get word vectors with `spaCy` with `en_core_web_md` model. In this exercise, you'll use word embeddings in multi-class text classification task. We will use [HappyDB](https://www.kaggle.com/ritresearch/happydb) corpus which contains about 100,000 happy moments classified into 7 categories: *affection, exercise, bonding, nature, leisure, achievement, enjoy_the_moment*. The data was crowd-sourced via [Amazon Mechanical Turk](https://www.mturk.com/). The ground truth label is not available for all examples, and in this lab, we'll only use the examples where ground truth is available (~15,000 examples). \n", + "\n", + "- Download the data from [here](https://www.kaggle.com/ritresearch/happydb).\n", + "- Unzip the file and copy it in the lab directory.\n", + "\n", + "The code below reads the data CSV (assuming that it's present in the current directory as *cleaned_hm.csv*), cleans it up a bit, and splits it into train and test splits. \n", + "\n", + "**Your tasks:**\n", + "\n", + "1. Train a logistic regression with bag-of-words features and show the classification report on the test set. \n", + "2. Train logistic regression with average embedding representation extracted using spaCy and classification report on the test set. (You can refer to lecture 17 notes for this. Hint: you may want to consider using different [solvers](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) if you see convergence issues). \n", + "3. Discuss your results. Which model would be more interpretable? \n", + "4. Are you observing any benefits of transfer learning here? Briefly discuss. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "b7b35845-7976-4cda-b798-0c0700868fea", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " wid reflection_period \\\n", + "hmid \n", + "27676 206 24h \n", + "27678 45 24h \n", + "27697 498 24h \n", + "27705 5732 24h \n", + "27715 2272 24h \n", + "\n", + " original_hm \\\n", + "hmid \n", + "27676 We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out. \n", + "27678 I meditated last night. \n", + "27697 My grandmother start to walk from the bed after a long time. \n", + "27705 I picked my daughter up from the airport and we have a fun and good conversation on the way home. \n", + "27715 when i received flowers from my best friend \n", + "\n", + " cleaned_hm \\\n", + "hmid \n", + "27676 We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out. \n", + "27678 I meditated last night. \n", + "27697 My grandmother start to walk from the bed after a long time. \n", + "27705 I picked my daughter up from the airport and we have a fun and good conversation on the way home. 
\n", + "27715 when i received flowers from my best friend \n", + "\n", + " modified num_sentence ground_truth_category predicted_category \n", + "hmid \n", + "27676 True 2 bonding bonding \n", + "27678 True 1 leisure leisure \n", + "27697 True 1 affection affection \n", + "27705 True 1 bonding affection \n", + "27715 True 1 bonding bonding " + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_csv(\"cleaned_hm.csv\", index_col=0)\n", + "sample_df = df.dropna()\n", + "sample_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "384fa13b-83a5-4e23-9280-c4e529937143", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yXKa7qfQXYPD", + "outputId": "8bbf5eeb-0151-4853-a49c-3876279bbeb7" + }, + "outputs": [], + "source": [ + "sample_df = sample_df.rename(\n", + " columns={\"cleaned_hm\": \"moment\", \"ground_truth_category\": \"target\"}\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "de04d594-7174-409d-a9fa-c2fd738e2208", + "metadata": {}, + "outputs": [], + "source": [ + "train_df, test_df = train_test_split(sample_df, test_size=0.3, random_state=123)\n", + "X_train, y_train = train_df[\"moment\"], train_df[\"target\"]\n", + "X_test, y_test = test_df[\"moment\"], test_df[\"target\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "73f63a73-0d13-4276-9e79-7ad60575a198", + "metadata": {}, + "outputs": [], + "source": [ + "import spacy\n", + "\n", + "nlp = spacy.load(\"en_core_web_md\")" + ] + }, + { + "cell_type": "markdown", + "id": "161a6ab6-62ef-4fdd-ba0d-5e7e920154a3", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "ec620e19-016a-4476-bb7e-de0c402078d2", + "metadata": {}, + "source": [ + "## Exercise 2: Exploring time series data \n", + "
\n", + "\n", + "In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "b22471fe-942e-49aa-8fb8-9d4fbf6b00b3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Date AveragePrice Total Volume 4046 4225 4770 \\\n", + "0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 \n", + "1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 \n", + "2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 \n", + "3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 \n", + "4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 \n", + "\n", + " Total Bags Small Bags Large Bags XLarge Bags type year region \n", + "0 8696.87 8603.62 93.25 0.0 conventional 2015 Albany \n", + "1 9505.56 9408.07 97.49 0.0 conventional 2015 Albany \n", + "2 8145.35 8042.21 103.14 0.0 conventional 2015 Albany \n", + "3 5811.16 5677.40 133.76 0.0 conventional 2015 Albany \n", + "4 6183.95 5986.26 197.69 0.0 conventional 2015 Albany " + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_csv(\"avocado.csv\", parse_dates=[\"Date\"], index_col=0)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "3425e59d-9580-4512-8bd2-c8c9b836df8f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(18249, 13)" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "6d0aa8d9-34b6-4401-8b91-77e1342510ab", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Timestamp('2015-01-04 00:00:00')" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"Date\"].min()" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "3b64f1d1-9614-44df-b625-52a77c00af9d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Timestamp('2018-03-25 00:00:00')" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"Date\"].max()" + ] + }, + { + "cell_type": 
"markdown", + "id": "9dae5238-ac94-4b1b-b368-d972b6582d8a", + "metadata": {}, + "source": [ + "It looks like the data ranges from the start of 2015 to March 2018 (~5 years ago), for a total of 3.25 years or so. Let's split the data so that we have a 6 months of test data." + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "5d4d6fb5-1cbb-47e0-a5c4-34b66d026d1a", + "metadata": {}, + "outputs": [], + "source": [ + "split_date = \"20170925\"\n", + "train_df = df[df[\"Date\"] <= split_date]\n", + "test_df = df[df[\"Date\"] > split_date]" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "26a3b6ae-1406-48c3-8b5c-e7b65a041852", + "metadata": {}, + "outputs": [], + "source": [ + "assert len(train_df) + len(test_df) == len(df)" + ] + }, + { + "cell_type": "markdown", + "id": "2f5ed401-224b-42d3-a4bd-f67b60c3625b", + "metadata": {}, + "source": [ + "### 2.1\n", + "rubric={points:4}\n", + "\n", + "In the Rain in Australia dataset from lecture, we had different measurements for each Location. What about this dataset: for which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset." + ] + }, + { + "cell_type": "markdown", + "id": "ac17a187-bf66-4339-b5f4-3f79cb6948cc", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "a2b56b13-e2ff-45d9-b800-fb7006f92653", + "metadata": {}, + "source": [ + "### 2.2\n", + "rubric={points:4}\n", + "\n", + "In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset." + ] + }, + { + "cell_type": "markdown", + "id": "16dc1348-c2c6-46f6-bbdf-a77810930ac1", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "d3ab120c-6b2b-4dfb-8765-7367a9577482", + "metadata": {}, + "source": [ + "### 2.3\n", + "rubric={points:4}\n", + "\n", + "In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are all distinct, or are there overlapping regions? Justify your answer by referencing the data." + ] + }, + { + "cell_type": "markdown", + "id": "331ec42a-3093-46c5-b708-ca7716f939dc", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "49c9e680-d2b3-432b-884f-f05c3ed5e761", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "id": "eeb32828-c8f6-4344-90bc-7ec214e448e7", + "metadata": {}, + "source": [ + "## Preparation for models\n", + "\n", + "We will use the entire dataset despite any location-based weirdness uncovered in the previous part.\n", + "\n", + "We would like to forecast the avocado price, which is the `AveragePrice` column. The function below is adapted from Lecture 19, with some improvements." + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "87ef9e53-9170-4a0c-a249-7cf0715496b8", + "metadata": {}, + "outputs": [], + "source": [ + "def create_lag_feature(\n", + " df, orig_feature, lag, groupby, new_feature_name=None, clip=False\n", + "):\n", + " \"\"\"\n", + " Creates a new feature that's a lagged version of an existing one.\n", + "\n", + " NOTE: assumes df is already sorted by the time columns and has unique indices.\n", + "\n", + " Parameters\n", + " ----------\n", + " df : pandas.core.frame.DataFrame\n", + " The dataset.\n", + " orig_feature : str\n", + " The column name of the feature we're copying\n", + " lag : int\n", + " The lag; negative lag means values from the past, positive lag means values from the future\n", + " groupby : list\n", + " Column(s) to group by in case df contains multiple time series\n", + " new_feature_name : str\n", + " Override the default name of the newly created column\n", + " clip : bool\n", + " If True, remove rows with a NaN values for the new feature\n", + "\n", + " Returns\n", + " -------\n", + " pandas.core.frame.DataFrame\n", + " A new dataframe with the additional column added.\n", + "\n", + " TODO: could/should simplify this function by using `df.shift()`\n", + " \"\"\"\n", + "\n", + " if new_feature_name is None:\n", + " if lag < 0:\n", + " new_feature_name = \"%s_lag%d\" % (orig_feature, -lag)\n", + " else:\n", + " new_feature_name = \"%s_ahead%d\" % 
(orig_feature, lag)\n", + "\n", + " new_df = df.assign(**{new_feature_name: np.nan})\n", + " for name, group in new_df.groupby(groupby):\n", + " if lag < 0: # take values from the past\n", + " new_df.loc[group.index[-lag:], new_feature_name] = group.iloc[:lag][\n", + " orig_feature\n", + " ].values\n", + " else: # take values from the future\n", + " new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][\n", + " orig_feature\n", + " ].values\n", + "\n", + " if clip:\n", + " new_df = new_df.dropna(subset=[new_feature_name])\n", + "\n", + " return new_df" + ] + }, + { + "cell_type": "markdown", + "id": "89cbe62a-05a6-4bed-8ae9-b1bb7299c16a", + "metadata": {}, + "source": [ + "We first sort our dataframe properly:" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "c0049d46-d7e2-40a4-9d41-74ce2e875a05", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Date AveragePrice Total Volume 4046 4225 4770 \\\n", + "0 2015-01-04 1.22 40873.28 2819.50 28287.42 49.90 \n", + "1 2015-01-11 1.24 41195.08 1002.85 31640.34 127.12 \n", + "2 2015-01-18 1.17 44511.28 914.14 31540.32 135.77 \n", + "3 2015-01-25 1.06 45147.50 941.38 33196.16 164.14 \n", + "4 2015-02-01 0.99 70873.60 1353.90 60017.20 179.32 \n", + "... ... ... ... ... ... ... \n", + "18244 2018-02-25 1.57 18421.24 1974.26 2482.65 0.00 \n", + "18245 2018-03-04 1.54 17393.30 1832.24 1905.57 0.00 \n", + "18246 2018-03-11 1.56 22128.42 2162.67 3194.25 8.93 \n", + "18247 2018-03-18 1.56 15896.38 2055.35 1499.55 0.00 \n", + "18248 2018-03-25 1.62 15303.40 2325.30 2171.66 0.00 \n", + "\n", + " Total Bags Small Bags Large Bags XLarge Bags type year \\\n", + "0 9716.46 9186.93 529.53 0.0 conventional 2015 \n", + "1 8424.77 8036.04 388.73 0.0 conventional 2015 \n", + "2 11921.05 11651.09 269.96 0.0 conventional 2015 \n", + "3 10845.82 10103.35 742.47 0.0 conventional 2015 \n", + "4 9323.18 9170.82 152.36 0.0 conventional 2015 \n", + "... ... ... ... ... ... ... \n", + "18244 13964.33 13698.27 266.06 0.0 organic 2018 \n", + "18245 13655.49 13401.93 253.56 0.0 organic 2018 \n", + "18246 16762.57 16510.32 252.25 0.0 organic 2018 \n", + "18247 12341.48 12114.81 226.67 0.0 organic 2018 \n", + "18248 10806.44 10569.80 236.64 0.0 organic 2018 \n", + "\n", + " region \n", + "0 Albany \n", + "1 Albany \n", + "2 Albany \n", + "3 Albany \n", + "4 Albany \n", + "... ... 
\n", + "18244 WestTexNewMexico \n", + "18245 WestTexNewMexico \n", + "18246 WestTexNewMexico \n", + "18247 WestTexNewMexico \n", + "18248 WestTexNewMexico \n", + "\n", + "[18249 rows x 13 columns]" + ] + }, + "execution_count": 86, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sort = df.sort_values(by=[\"region\", \"type\", \"Date\"]).reset_index(drop=True)\n", + "df_sort" + ] + }, + { + "cell_type": "markdown", + "id": "fbaee71e-d45c-48dc-81a3-ebf7195cbea4", + "metadata": {}, + "source": [ + "We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing." + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "90736df7-04b7-40d4-a835-e8eb899da7f3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
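|  | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | AveragePriceNextWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 1.22 | 40873.28 | 2819.50 | 28287.42 | 49.90 | 9716.46 | 9186.93 | 529.53 | 0.0 | conventional | 2015 | Albany | 1.24 |
| 1 | 2015-01-11 | 1.24 | 41195.08 | 1002.85 | 31640.34 | 127.12 | 8424.77 | 8036.04 | 388.73 | 0.0 | conventional | 2015 | Albany | 1.17 |
| 2 | 2015-01-18 | 1.17 | 44511.28 | 914.14 | 31540.32 | 135.77 | 11921.05 | 11651.09 | 269.96 | 0.0 | conventional | 2015 | Albany | 1.06 |
| 3 | 2015-01-25 | 1.06 | 45147.50 | 941.38 | 33196.16 | 164.14 | 10845.82 | 10103.35 | 742.47 | 0.0 | conventional | 2015 | Albany | 0.99 |
| 4 | 2015-02-01 | 0.99 | 70873.60 | 1353.90 | 60017.20 | 179.32 | 9323.18 | 9170.82 | 152.36 | 0.0 | conventional | 2015 | Albany | 0.99 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18243 | 2018-02-18 | 1.56 | 17597.12 | 1892.05 | 1928.36 | 0.00 | 13776.71 | 13553.53 | 223.18 | 0.0 | organic | 2018 | WestTexNewMexico | 1.57 |
| 18244 | 2018-02-25 | 1.57 | 18421.24 | 1974.26 | 2482.65 | 0.00 | 13964.33 | 13698.27 | 266.06 | 0.0 | organic | 2018 | WestTexNewMexico | 1.54 |
| 18245 | 2018-03-04 | 1.54 | 17393.30 | 1832.24 | 1905.57 | 0.00 | 13655.49 | 13401.93 | 253.56 | 0.0 | organic | 2018 | WestTexNewMexico | 1.56 |
| 18246 | 2018-03-11 | 1.56 | 22128.42 | 2162.67 | 3194.25 | 8.93 | 16762.57 | 16510.32 | 252.25 | 0.0 | organic | 2018 | WestTexNewMexico | 1.56 |
| 18247 | 2018-03-18 | 1.56 | 15896.38 | 2055.35 | 1499.55 | 0.00 | 12341.48 | 12114.81 | 226.67 | 0.0 | organic | 2018 | WestTexNewMexico | 1.62 |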
\n", + "

18141 rows × 14 columns

\n", + "
" + ], + "text/plain": [ + " Date AveragePrice Total Volume 4046 4225 4770 \\\n", + "0 2015-01-04 1.22 40873.28 2819.50 28287.42 49.90 \n", + "1 2015-01-11 1.24 41195.08 1002.85 31640.34 127.12 \n", + "2 2015-01-18 1.17 44511.28 914.14 31540.32 135.77 \n", + "3 2015-01-25 1.06 45147.50 941.38 33196.16 164.14 \n", + "4 2015-02-01 0.99 70873.60 1353.90 60017.20 179.32 \n", + "... ... ... ... ... ... ... \n", + "18243 2018-02-18 1.56 17597.12 1892.05 1928.36 0.00 \n", + "18244 2018-02-25 1.57 18421.24 1974.26 2482.65 0.00 \n", + "18245 2018-03-04 1.54 17393.30 1832.24 1905.57 0.00 \n", + "18246 2018-03-11 1.56 22128.42 2162.67 3194.25 8.93 \n", + "18247 2018-03-18 1.56 15896.38 2055.35 1499.55 0.00 \n", + "\n", + " Total Bags Small Bags Large Bags XLarge Bags type year \\\n", + "0 9716.46 9186.93 529.53 0.0 conventional 2015 \n", + "1 8424.77 8036.04 388.73 0.0 conventional 2015 \n", + "2 11921.05 11651.09 269.96 0.0 conventional 2015 \n", + "3 10845.82 10103.35 742.47 0.0 conventional 2015 \n", + "4 9323.18 9170.82 152.36 0.0 conventional 2015 \n", + "... ... ... ... ... ... ... \n", + "18243 13776.71 13553.53 223.18 0.0 organic 2018 \n", + "18244 13964.33 13698.27 266.06 0.0 organic 2018 \n", + "18245 13655.49 13401.93 253.56 0.0 organic 2018 \n", + "18246 16762.57 16510.32 252.25 0.0 organic 2018 \n", + "18247 12341.48 12114.81 226.67 0.0 organic 2018 \n", + "\n", + " region AveragePriceNextWeek \n", + "0 Albany 1.24 \n", + "1 Albany 1.17 \n", + "2 Albany 1.06 \n", + "3 Albany 0.99 \n", + "4 Albany 0.99 \n", + "... ... ... 
\n", + "18243 WestTexNewMexico 1.57 \n", + "18244 WestTexNewMexico 1.54 \n", + "18245 WestTexNewMexico 1.56 \n", + "18246 WestTexNewMexico 1.56 \n", + "18247 WestTexNewMexico 1.62 \n", + "\n", + "[18141 rows x 14 columns]" + ] + }, + "execution_count": 87, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_hastarget = create_lag_feature(\n", + " df_sort, \"AveragePrice\", +1, [\"region\", \"type\"], \"AveragePriceNextWeek\", clip=True\n", + ")\n", + "df_hastarget" + ] + }, + { + "cell_type": "markdown", + "id": "5dbfa73e-48d7-49be-af1c-dba397d9f946", + "metadata": {}, + "source": [ + "I will now split the data:" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "8e952769-dbd1-4052-855f-2d2f2b4101d0", + "metadata": {}, + "outputs": [], + "source": [ + "train_df = df_hastarget[df_hastarget[\"Date\"] <= split_date]\n", + "test_df = df_hastarget[df_hastarget[\"Date\"] > split_date]" + ] + }, + { + "cell_type": "markdown", + "id": "65fb19b2-0a17-4133-bf78-3dea9aa61526", + "metadata": {}, + "source": [ + "

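For intuition, here is a minimal sketch of what a lag target like `AveragePriceNextWeek` amounts to, followed by a date-based split. It is not the `create_lag_feature` implementation used in this notebook: the `toy` DataFrame, its values, and `toy_split_date` are made up for illustration, and dropping rows with a missing target is only assumed to be analogous to `clip=True`.

```python
import pandas as pd

# Hypothetical toy data: two regions, three weekly observations each,
# already sorted by region and then by date.
toy = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18"] * 2),
        "AveragePrice": [1.22, 1.24, 1.17, 0.99, 1.05, 1.10],
    }
)

# shift(-1) within each region copies the following week's price onto the current row.
toy["AveragePriceNextWeek"] = toy.groupby("region")["AveragePrice"].shift(-1)

# The last week of each region has no following week, so its target is NaN; drop it.
toy_hastarget = toy.dropna(subset=["AveragePriceNextWeek"]).reset_index(drop=True)

# Temporal split: everything up to the (made-up) cutoff date is training data.
toy_split_date = pd.Timestamp("2015-01-04")
toy_train = toy_hastarget[toy_hastarget["Date"] <= toy_split_date]
toy_test = toy_hastarget[toy_hastarget["Date"] > toy_split_date]

print(toy_hastarget)
```

Grouping before shifting matters: without it, the last row of one region would pick up the first price of the next region as its "next week" value.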
" + ] + }, + { + "cell_type": "markdown", + "id": "848e951f-8dde-4a34-bf4e-b2c3a6270bb0", + "metadata": {}, + "source": [ + "### 2.4 Baseline\n", + "rubric={points:4}\n", + "\n", + "Let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of \"AveragePriceNextWeek\" exactly equal to \"AveragePrice\", assuming no change. That is kind of like saying, \"If it's raining today then I'm guessing it will be raining tomorrow\". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good. \n", + "\n", + "Using this baseline approach, what $R^2$ do you get?" + ] + }, + { + "cell_type": "markdown", + "id": "a0d241cd-4669-46d9-bb21-1fd8006a39e8", + "metadata": {}, + "source": [ + "

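The "no change" idea can be sketched on a handful of rows. The numbers below are the first five Albany rows from the table above, so the resulting score says nothing about the full dataset. Here $R^2$ is computed by hand with NumPy as an illustration; `sklearn.metrics.r2_score` should give the same value for the same inputs.

```python
import numpy as np

# First five Albany rows from the table above.
price_this_week = np.array([1.22, 1.24, 1.17, 1.06, 0.99])  # AveragePrice
price_next_week = np.array([1.24, 1.17, 1.06, 0.99, 0.99])  # AveragePriceNextWeek

# Persistence baseline: predict that next week's price equals this week's price.
baseline_pred = price_this_week

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((price_next_week - baseline_pred) ** 2)
ss_tot = np.sum((price_next_week - price_next_week.mean()) ** 2)
baseline_r2 = 1 - ss_res / ss_tot

print(f"Baseline R^2 on these five rows: {baseline_r2:.3f}")
```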
" + ] + }, + { + "cell_type": "markdown", + "id": "6d0615dc-63c2-458f-844b-f17d30758ab1", + "metadata": {}, + "source": [ + "### (Optional) 2.5 Modeling\n", + "rubric={points:2}\n", + "\n", + "Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approachs for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.\n", + "\n", + "> because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data." + ] + }, + { + "cell_type": "markdown", + "id": "51fe2285-38d1-409a-9f4c-3f5dfd50622a", + "metadata": {}, + "source": [ + "



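As a rough sketch of how `TimeSeriesSplit` behaves, the snippet below runs it on twelve made-up, time-ordered observations. The array `X_toy` and the choice of `n_splits=3` are assumptions for illustration only. On real data the rows would need to be sorted by date first, and with several regions sharing each date a fold boundary may still fall inside a single week.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical observations, assumed to be in chronological order.
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

# Each fold trains on an initial stretch of the series and validates on what follows.
for train_idx, test_idx in folds:
    print("train:", train_idx, "validate:", test_idx)
```

In every fold the training indices come strictly before the validation indices, so the model is never evaluated on data that precedes what it was trained on. The splitter can be passed as the `cv` argument of `cross_validate`.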
" + ] + }, + { + "cell_type": "markdown", + "id": "15ad9e61-e5f6-4025-8356-f48f651a8e1e", + "metadata": {}, + "source": [ + "## Exercise 3: Short answer questions \n", + "\n", + "Each question is worth 2 points." + ] + }, + { + "cell_type": "markdown", + "id": "972060d9-742d-47ae-82c5-74b258de16e7", + "metadata": {}, + "source": [ + "### 3.1\n", + "rubric={points:4}\n", + "\n", + "The following questions pertain to Lecture 18 on time series data:\n", + "\n", + "1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.\n", + "2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer." + ] + }, + { + "cell_type": "markdown", + "id": "ef053d93-f20e-417f-81ae-35edb701020c", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "bc3703ae-7d56-4987-b70b-4e222f1d8b15", + "metadata": {}, + "source": [ + "### 3.2\n", + "rubric={points:6}\n", + "\n", + "The following questions pertain to Lecture 19 on survival analysis. We'll consider the use case of customer churn analysis.\n", + "\n", + "1. What is the problem with simply labeling customers as \"churned\" or \"not churned\" and using standard supervised learning techniques, as we did in hw4?\n", + "2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer?\n", + "3. If a customer's survival function is almost flat during a certain period, how do we interpret that?" + ] + }, + { + "cell_type": "markdown", + "id": "9d8ffdd0-8c1e-4042-8414-ac8e57caf980", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "ed594c68-91c3-45d9-baec-6ac38a02c971", + "metadata": {}, + "source": [ + "## Exercise 4: Communication \n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "5767926d-c17d-4a93-b59a-7242a4c76ff0", + "metadata": {}, + "source": [ + "### Exercise 4.1 Blog post \n", + "rubric={points:40}\n", + "\n", + "Write up your analysis from hw6 or any other assignment or your side machine learning related project in a \"blog post\" or report format. **You can write the post in Markdown in the notebook**, no need to write a real blog post (though you can if you want too!).\n", + "\n", + "The target audience for your blog post is someone like yourself right before you took this course. They don't necessarily have ML knowledge, but they have a solid foundation in technical matters. The post should focus on explaining **your results and what you did** in a way that's understandable to such a person, **not** a lesson trying to teach someone about machine learning. Again: focus on the results and why they are interesting; avoid pedagogical content.\n", + "\n", + "Your post must include the following elements (not necessarily in this order):\n", + "\n", + "- Description of the problem/decision.\n", + "- Description of the dataset (the raw data and/or some EDA).\n", + "- Description of the model.\n", + "- Description your results, both quantitatively and qualitatively. Make sure to refer to the original problem/decision.\n", + "- A section on caveats, describing at least 3 reasons why your results might be incorrect, misleading, overconfident, or otherwise problematic. Make reference to your specific dataset, model, approach, etc. To check that your reasons are specific enough, make sure they would not make sense, if left unchanged, to most students' submissions; for example, do not just say \"overfitting\" without explaining why you might be worried about overfitting in your specific case.\n", + "- At least 3 visualizations. These visualizations must be embedded/interwoven into the text, not pasted at the end. The text must refer directly to each visualization. 
For example \"as shown below\" or \"the figure demonstrates\" or \"take a look at Figure 1\", etc. It is **not** sufficient to put a visualization in without referring to it directly.\n", + "\n", + "A reasonable length for your entire post would be **800 words**. The maximum allowed is **1000 words**." + ] + }, + { + "cell_type": "markdown", + "id": "6169eefb-18a8-4e13-b7ca-1b3539a79215", + "metadata": {}, + "source": [ + "#### Example blog posts\n", + "\n", + "Here are some examples of applied ML blog posts that you may find useful as inspiration. The target audiences of these posts aren't necessarily the same as yours, and these posts are longer than yours, but they are well-structured and engaging. You are **not required to read these** posts as part of this assignment - they are here only as examples if you'd find that useful.\n", + "\n", + "From the UBC Master of Data Science blog, written by a past student:\n", + "\n", + "- https://ubc-mds.github.io/2019-07-26-predicting-customer-probabilities/\n", + "\n", + "This next one uses R instead of Python, but that might be good in a way, as you can see what it's like for a reader that doesn't understand the code itself (the target audience for your post here):\n", + "\n", + "- https://rpubs.com/RosieB/taylorswiftlyricanalysis\n", + "\n", + "Finally, here are a couple interviews with winners from Kaggle competitions. The format isn't quite the same as a blog post, but you might find them interesting/relevant:\n", + "\n", + "- https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded\n", + "- https://medium.com/kaggle-blog/winner-interview-with-shivam-bansal-data-science-for-good-challenge-city-of-los-angeles-3294c0ed1fb2\n" + ] + }, + { + "cell_type": "markdown", + "id": "8bfdd094-eeb6-4f00-a3bc-5a6105eedb12", + "metadata": {}, + "source": [ + "#### A note on plagiarism\n", + "\n", + "You may **NOT** include text or visualizations that were not written/created by you. 
If you are in any doubt as to what constitutes plagiarism, please just ask. For more information see the [UBC Academic Misconduct policies](http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,959). Please don't copy this from somewhere 🙏. If you can't do it." + ] + }, + { + "cell_type": "markdown", + "id": "4052395d-a695-4063-97b6-46c4e13016d8", + "metadata": {}, + "source": [ + "

" + ] + }, + { + "cell_type": "markdown", + "id": "d59667c9-db6a-4c12-a556-5b9815ef3564", + "metadata": {}, + "source": [ + "### Exercise 4.2\n", + "rubric={points:6}\n", + "\n", + "Describe one effective communication technique that you used in your post, or an aspect of the post that you are particularly satisfied with.\n", + "\n", + "Max 3 sentences" + ] + }, + { + "cell_type": "markdown", + "id": "87ea9c37-34c9-4b3e-a2df-e00cedd3e8ae", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "04cefc8e-cf76-4c27-9aa6-c560cbc0fc2b", + "metadata": {}, + "source": [ + "### (Optional) Exercise 5 \n", + "rubric={points:1}\n", + "\n", + "**Your tasks:**\n", + "\n", + "What is your biggest takeaway from this course? \n", + "\n", + "> I'm looking forward to reading your answers. " + ] + }, + { + "cell_type": "markdown", + "id": "a2fb9e2f-a2d2-4f56-9fb4-c91492e4801b", + "metadata": {}, + "source": [ + "



" + ] + }, + { + "cell_type": "markdown", + "id": "ab723dc5-4ea6-4c44-ace9-bf345bf8c120", + "metadata": {}, + "source": [ + "## Submission instructions \n", + "\n", + "**PLEASE READ:** When you are ready to submit your assignment do the following:\n", + "\n", + "1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. \n", + "2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).\n", + "3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. " + ] + }, + { + "cell_type": "markdown", + "id": "1b4e160c-d947-4123-8e67-fa3c89c9aa8f", + "metadata": {}, + "source": [ + "### Congratulations on finishing all homework assignments! :clap: :clap: " + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "15f3bee4-0171-4465-838f-e5ac8a943e10", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import Image\n", + "\n", + "Image(\"eva-congrats.png\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:cpsc330]", + "language": "python", + "name": "conda-env-cpsc330-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}