
Commit 323a7b3

Created using Colaboratory
1 parent c7cf551 commit 323a7b3

File tree

1 file changed: +264 −0 lines


TF_IDF.ipynb

Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "TF-IDF.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyO9FenMhAIC41DvapEM6kFf",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/DataMinati/NLP-Legion/blob/main/TF_IDF.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zW9eFT2NP_zs"
},
"source": [
"### Downloading the packages"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "M8JglsAuNYT7",
"outputId": "6e147693-84d1-4c4e-b20c-e84f589b7991"
},
"source": [
"import nltk  # nltk must be imported before nltk.download can be called\n",
"nltk.download('punkt')\n",
"nltk.download('stopwords')\n",
"nltk.download('wordnet')"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n",
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
"[nltk_data] Package wordnet is already up-to-date!\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oTYLK6Z2QDkH"
},
"source": [
"### Importing the libraries"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cef79IiWNyXX"
},
"source": [
"import nltk\n",
"import re\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem.porter import PorterStemmer\n",
"from nltk.stem import WordNetLemmatizer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "MopqatR1QGPr"
},
"source": [
"### Storing the text"
]
},
{
"cell_type": "code",
"metadata": {
"id": "zItIfRnJPQJd"
},
"source": [
"paragraph = \"\"\"I have three visions for India. In 3000 years of our history, people from all over \n",
" the world have come and invaded us, captured our lands, conquered our minds. \n",
" From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n",
" the French, the Dutch, all of them came and looted us, took over what was ours. \n",
" Yet we have not done this to any other nation. We have not conquered anyone. \n",
" We have not grabbed their land, their culture, \n",
" their history and tried to enforce our way of life on them. \n",
" Why? Because we respect the freedom of others. That is why my \n",
" first vision is that of freedom. I believe that India got its first vision of \n",
" this in 1857, when we started the War of Independence. It is this freedom that\n",
" we must protect and nurture and build on. If we are not free, no one will respect us.\n",
" My second vision is for India’s development. For fifty years we have been a developing nation.\n",
" It is time we see ourselves as a developed nation. We are among the top 5 nations of the world\n",
" in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.\n",
" Our achievements are being globally recognised today. Yet we lack the self-confidence to\n",
" see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?\n",
" I have a third vision. India must stand up to the world. Because I believe that unless India \n",
" stands up to the world, no one will respect us. Only strength respects strength. We must be \n",
" strong not only as a military power but also as an economic power. Both must go hand-in-hand. \n",
" My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of \n",
" space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.\n",
" I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. \n",
" I see four milestones in my career\"\"\"\n",
" "
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "sG8etbOJQIyH"
},
"source": [
"### Cleaning the texts"
]
},
{
"cell_type": "code",
"metadata": {
"id": "DCU-SRAuPRjx"
},
"source": [
"# Cleaning the texts\n",
"ps = PorterStemmer()\n",
"wordnet = WordNetLemmatizer()  # created but unused below; swap in wordnet.lemmatize to lemmatise instead of stem\n",
"sentences = nltk.sent_tokenize(paragraph)\n",
"stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word\n",
"corpus = []\n",
"for i in range(len(sentences)):\n",
"    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only\n",
"    review = review.lower()\n",
"    review = review.split()\n",
"    review = [ps.stem(word) for word in review if word not in stop_words]\n",
"    review = ' '.join(review)\n",
"    corpus.append(review)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2NGHmaRxQMzx"
},
"source": [
"### Creating the TF-IDF model"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cIKWmdNdN7E0"
},
"source": [
"# Creating the TF-IDF model\n",
"cv = TfidfVectorizer()\n",
"X = cv.fit_transform(corpus).toarray()  # dense (n_sentences, n_terms) matrix of TF-IDF weights"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "diGhJq-sQO-t"
},
"source": [
"### Displaying the result"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xPEARKRHPZTQ",
"outputId": "5cf19576-495e-4df2-b6c7-7a2e8b83f052"
},
"source": [
"X"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[0. , 0. , 0. , ..., 0. , 0. ,\n",
" 0. ],\n",
" [0. , 0. , 0. , ..., 0.25057734, 0.29539106,\n",
" 0. ],\n",
" [0. , 0.28201784, 0. , ..., 0. , 0. ,\n",
" 0. ],\n",
" ...,\n",
" [0. , 0. , 0. , ..., 0. , 0. ,\n",
" 0. ],\n",
" [0. , 0. , 0. , ..., 0. , 0. ,\n",
" 0. ],\n",
" [0. , 0. , 0. , ..., 0. , 0. ,\n",
" 0. ]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "GAGZWkLfPaa6"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}
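
For readers who want to see what `TfidfVectorizer` is computing in the notebook above, here is a minimal plain-Python sketch of the same weighting under scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts for tf). The whitespace tokeniser and the tiny two-document corpus are illustrative assumptions, not part of the notebook, and sklearn's own regex tokenisation differs slightly:

```python
import math
from collections import Counter

def tfidf_matrix(corpus):
    """TF-IDF as scikit-learn computes it by default:
    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, tf = raw count,
    and each document row is L2-normalised."""
    docs = [doc.split() for doc in corpus]           # naive whitespace tokeniser
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in docs) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

vocab, rows = tfidf_matrix(["free nation", "nation strong strong"])
print(vocab)    # ['free', 'nation', 'strong']
print(rows[0])  # 'free' (rare) outweighs 'nation' (shared by both documents)
```

Because every row has unit L2 norm, the dot product of two rows is their cosine similarity, which is why TF-IDF vectors feed directly into similarity search or linear classifiers.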
