+ {
+   "nbformat": 4,
+   "nbformat_minor": 0,
+   "metadata": {
+     "colab": {
+       "name": "TF-IDF.ipynb",
+       "provenance": [],
+       "collapsed_sections": [],
+       "authorship_tag": "ABX9TyO9FenMhAIC41DvapEM6kFf",
+       "include_colab_link": true
+     },
+     "kernelspec": {
+       "name": "python3",
+       "display_name": "Python 3"
+     },
+     "language_info": {
+       "name": "python"
+     }
+   },
+   "cells": [
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "view-in-github",
+         "colab_type": "text"
+       },
+       "source": [
+         "<a href=\"https://colab.research.google.com/github/DataMinati/NLP-Legion/blob/main/TF_IDF.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "zW9eFT2NP_zs"
+       },
+       "source": [
+         "### Downloading the packages"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "colab": {
+           "base_uri": "https://localhost:8080/"
+         },
+         "id": "M8JglsAuNYT7",
+         "outputId": "6e147693-84d1-4c4e-b20c-e84f589b7991"
+       },
+       "source": [
+         "import nltk  # needed here: this cell runs before the imports cell\n",
+         "\n",
+         "nltk.download('punkt')\n",
+         "nltk.download('stopwords')\n",
+         "nltk.download('wordnet')"
+       ],
+       "execution_count": null,
+       "outputs": [
+         {
+           "output_type": "stream",
+           "text": [
+             "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
+             "[nltk_data] Package punkt is already up-to-date!\n",
+             "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
+             "[nltk_data] Package stopwords is already up-to-date!\n",
+             "[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
+             "[nltk_data] Package wordnet is already up-to-date!\n"
+           ],
+           "name": "stdout"
+         },
+         {
+           "output_type": "execute_result",
+           "data": {
+             "text/plain": [
+               "True"
+             ]
+           },
+           "metadata": {
+             "tags": []
+           },
+           "execution_count": 9
+         }
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "oTYLK6Z2QDkH"
+       },
+       "source": [
+         "### Importing the libraries"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "id": "cef79IiWNyXX"
+       },
+       "source": [
+         "import nltk\n",
+         "import re\n",
+         "from nltk.corpus import stopwords\n",
+         "from nltk.stem.porter import PorterStemmer\n",
+         "from nltk.stem import WordNetLemmatizer\n",
+         "from sklearn.feature_extraction.text import TfidfVectorizer"
+       ],
+       "execution_count": null,
+       "outputs": []
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "MopqatR1QGPr"
+       },
+       "source": [
+         "### Storing the text"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "id": "zItIfRnJPQJd"
+       },
+       "source": [
+         "paragraph = \"\"\"I have three visions for India. In 3000 years of our history, people from all over \n",
+         "the world have come and invaded us, captured our lands, conquered our minds. \n",
+         "From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n",
+         "the French, the Dutch, all of them came and looted us, took over what was ours. \n",
+         "Yet we have not done this to any other nation. We have not conquered anyone. \n",
+         "We have not grabbed their land, their culture, \n",
+         "their history and tried to enforce our way of life on them. \n",
+         "Why? Because we respect the freedom of others. That is why my \n",
+         "first vision is that of freedom. I believe that India got its first vision of \n",
+         "this in 1857, when we started the War of Independence. It is this freedom that\n",
+         "we must protect and nurture and build on. If we are not free, no one will respect us.\n",
+         "My second vision for India’s development. For fifty years we have been a developing nation.\n",
+         "It is time we see ourselves as a developed nation. We are among the top 5 nations of the world\n",
+         "in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.\n",
+         "Our achievements are being globally recognised today. Yet we lack the self-confidence to\n",
+         "see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?\n",
+         "I have a third vision. India must stand up to the world. Because I believe that unless India \n",
+         "stands up to the world, no one will respect us. Only strength respects strength. We must be \n",
+         "strong not only as a military power but also as an economic power. Both must go hand-in-hand. \n",
+         "My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of \n",
+         "space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.\n",
+         "I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. \n",
+         "I see four milestones in my career\"\"\""
+       ],
+       "execution_count": null,
+       "outputs": []
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "sG8etbOJQIyH"
+       },
+       "source": [
+         "### Cleaning the texts"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "id": "DCU-SRAuPRjx"
+       },
+       "source": [
+         "# Cleaning the texts\n",
+         "ps = PorterStemmer()\n",
+         "wordnet = WordNetLemmatizer()  # unused below; stemming is applied instead\n",
+         "stop_words = set(stopwords.words('english'))  # build the stopword set once\n",
+         "sentences = nltk.sent_tokenize(paragraph)\n",
+         "corpus = []\n",
+         "for sentence in sentences:\n",
+         "    # Keep letters only, lowercase, and split into tokens\n",
+         "    review = re.sub('[^a-zA-Z]', ' ', sentence)\n",
+         "    review = review.lower()\n",
+         "    review = review.split()\n",
+         "    # Drop stopwords and stem the remaining tokens\n",
+         "    review = [ps.stem(word) for word in review if word not in stop_words]\n",
+         "    review = ' '.join(review)\n",
+         "    corpus.append(review)"
+       ],
+       "execution_count": null,
+       "outputs": []
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "2NGHmaRxQMzx"
+       },
+       "source": [
+         "### Creating the TF-IDF model"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "id": "cIKWmdNdN7E0"
+       },
+       "source": [
+         "# Creating the TF-IDF model\n",
+         "cv = TfidfVectorizer()\n",
+         "X = cv.fit_transform(corpus).toarray()"
+       ],
+       "execution_count": null,
+       "outputs": []
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {
+         "id": "diGhJq-sQO-t"
+       },
+       "source": [
+         "### Displaying the result"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "colab": {
+           "base_uri": "https://localhost:8080/"
+         },
+         "id": "xPEARKRHPZTQ",
+         "outputId": "5cf19576-495e-4df2-b6c7-7a2e8b83f052"
+       },
+       "source": [
+         "X"
+       ],
+       "execution_count": null,
+       "outputs": [
+         {
+           "output_type": "execute_result",
+           "data": {
+             "text/plain": [
+               "array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,\n",
+               "        0.        ],\n",
+               "       [0.        , 0.        , 0.        , ..., 0.25057734, 0.29539106,\n",
+               "        0.        ],\n",
+               "       [0.        , 0.28201784, 0.        , ..., 0.        , 0.        ,\n",
+               "        0.        ],\n",
+               "       ...,\n",
+               "       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,\n",
+               "        0.        ],\n",
+               "       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,\n",
+               "        0.        ],\n",
+               "       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,\n",
+               "        0.        ]])"
+             ]
+           },
+           "metadata": {
+             "tags": []
+           },
+           "execution_count": 15
+         }
+       ]
+     },
+     {
+       "cell_type": "code",
+       "metadata": {
+         "id": "GAGZWkLfPaa6"
+       },
+       "source": [
+         ""
+       ],
+       "execution_count": null,
+       "outputs": []
+     }
+   ]
+ }
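
The notebook's "Creating the TF-IDF model" cell delegates the weighting to `TfidfVectorizer()` with default settings. As a rough illustration of what those defaults compute, the sketch below reimplements the weighting in plain Python: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The `tfidf` helper and `toy_corpus` are hypothetical names that are not part of the notebook, and tokens are split on whitespace only, which is a simplification of the vectorizer's own tokenizer (it ignores single-character tokens, for example).

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows.

    Hypothetical helper for illustration; it assumes whitespace tokenisation.
    """
    docs = [doc.split() for doc in corpus]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))

    # Smoothed IDF, as if one extra document contained every term.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalise the row; an empty document stays all zeros.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy stand-in for the cleaned, stemmed sentences held in `corpus`.
toy_corpus = ["three vision india", "vision freedom", "india stand world"]
vocab, rows = tfidf(toy_corpus)

for doc, row in zip(toy_corpus, rows):
    weight, term = max(zip(row, vocab))
    print(f"{doc!r}: highest-weighted term is {term!r} ({weight:.3f})")
```

Terms that occur in fewer documents receive a larger IDF, so within a sentence they outweigh terms shared across many sentences. A sentence that consists only of stopwords is cleaned down to an empty string, which is one way a row of `X` can end up containing only zeros.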