# Generating Text Using LSTM

This repository contains code and resources for generating text using Long Short-Term Memory (LSTM) neural networks. The project demonstrates how to build and train a character-level LSTM model for text generation, using Nietzsche's writings as the sample dataset.

## Repository Structure

```
Generating-Text-Using-LSTM/
│
├── .gitattributes
├── Harshraj_Jadeja_HW3_LSTM_TEXT_GEN.ipynb
└── README.md
```

- `.gitattributes`: Configuration file to ensure consistent handling of files across different operating systems.
- `Harshraj_Jadeja_HW3_LSTM_TEXT_GEN.ipynb`: Jupyter Notebook containing the code for building and training the LSTM model, as well as the text generation process.
- `README.md`: This file. Provides an overview of the project and instructions for getting started.

## Getting Started

To get started with this project, follow the steps below.

### Prerequisites

Make sure you have the following installed:

- Python 3.x
- Jupyter Notebook
- The Python libraries used in the notebook: TensorFlow (which includes Keras), NumPy, pandas, and Matplotlib

### Installation

1. Clone this repository to your local machine:

```bash
git clone https://github.com/Harshraj1301/Generating-Text-Using-LSTM.git
```

2. Navigate to the project directory:

```bash
cd Generating-Text-Using-LSTM
```

3. Install the required Python libraries (the repository does not ship a `requirements.txt`, so install them directly):

```bash
pip install tensorflow numpy pandas matplotlib
```

### Usage

1. Open the Jupyter Notebook:

```bash
jupyter notebook Harshraj_Jadeja_HW3_LSTM_TEXT_GEN.ipynb
```

2. Follow the instructions in the notebook to run the code cells and generate text using the LSTM model.

### Code Explanation

The notebook `Harshraj_Jadeja_HW3_LSTM_TEXT_GEN.ipynb` includes the following steps:

1. **Data Preprocessing**: Loading and preprocessing the text data to make it suitable for training the LSTM model.
2. **Model Building**: Constructing the LSTM model using Keras.
3. **Model Training**: Training the LSTM model on the preprocessed text data.
4. **Text Generation**: Using the trained model to generate new text sequences.

Here are the contents of the notebook:

# Harshraj Jadeja

# Long Short-Term Memory for Text Generation

This notebook uses an LSTM neural network to generate text from Nietzsche's writings.

## Dataset

### Get the data

Nietzsche's writings are available online. The following code downloads the dataset.

### Visualize data

### Clean data

We cut the text into sequences of `maxlen` characters with a jump size of 3 (`step`).
The features for each example form a matrix of size `maxlen × len(chars)`, with one one-hot row per character.
The label for each example is a vector of size `len(chars)`, which represents the next character.

## The model

### Build the model

We need a recurrent layer with input shape `(maxlen, len(chars))` and a dense layer with output size `len(chars)`.

### Inspect the model

Use the `.summary()` method to print a simple description of the model.

### Train the model

## Code Cells

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import random
import sys
import io
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.utils import get_file
```

```python
path = get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
```

```python
print('corpus length:', len(text))
```

```python
print(text[10:513])
```

```python
chars = sorted(list(set(text)))
# total number of unique characters
print('total chars:', len(chars))
```

```python
# create (character, index) and (index, character) dictionaries
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
```

```python
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))
```

```python
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool_)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
```
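
As a quick sanity check (a small example that is not in the original notebook, reusing `x`, `y`, and `indices_char` defined above), you can decode one vectorized training example back into text:

```python
# Decode the first training example back into characters to verify
# that the one-hot encoding round-trips correctly.
i = 0
decoded = ''.join(indices_char[np.argmax(row)] for row in x[i])
print('features:', repr(decoded))
print('label   :', repr(indices_char[np.argmax(y[i])]))
```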

```python
# Define the number of units in the LSTM layer.
# This is a hyperparameter that represents the dimensionality of the output space.
# More units let the model capture more complex patterns but also increase computational cost.
lstm_units = 128  # Adjust based on task complexity and computational constraints.

# Initialize the Sequential model
model = tf.keras.Sequential([
    # The LSTM layer is the first layer of the model, so it needs input_shape
    # to know the shape of the input to expect.
    # input_shape=(maxlen, len(chars)) means each input sequence has length 'maxlen'
    # and each character in the sequence is a one-hot encoded vector of length 'len(chars)'.
    tf.keras.layers.LSTM(lstm_units, input_shape=(maxlen, len(chars))),

    # Dense output layer with len(chars) units: one per unique character,
    # so the model outputs a probability distribution over all possible characters.
    # Softmax activation turns the outputs into probabilities.
    tf.keras.layers.Dense(len(chars), activation='softmax'),
])

# Compile the model.
# 'categorical_crossentropy' is the loss since this is a multi-class classification problem.
# 'adam' provides efficient stochastic gradient descent optimization.
# Accuracy is monitored as a metric to observe performance during training.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Display the model's architecture
model.summary()
```

```python
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # Temperature rescales the distribution before sampling:
    # values < 1 make choices more conservative, values > 1 more diverse.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
```
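
For intuition about the `temperature` argument, here is a small illustrative example (not part of the original notebook) using a toy three-character distribution:

```python
# Illustrative only: a made-up probability distribution over 3 "characters".
toy_preds = np.array([0.1, 0.2, 0.7])

# At low temperature sampling concentrates on the most likely index;
# at high temperature it approaches uniform.
for t in [0.2, 1.0, 2.0]:
    draws = [sample(toy_preds, temperature=t) for _ in range(1000)]
    print('temperature', t, '->', np.bincount(draws, minlength=3))
```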

```python
class PrintLoss(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Invoked at the end of each epoch; prints generated text.
        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.5, 1.0]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()
```

```python
EPOCHS = 60
BATCH = 128

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

history = model.fit(x, y,
                    batch_size=BATCH,
                    epochs=EPOCHS,
                    validation_split=0.2,
                    verbose=1,
                    callbacks=[early_stop, PrintLoss()])
```
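
After training, you can also generate text outside the callback. The following is a minimal sketch (not part of the original notebook) that reuses `model`, `sample`, `char_indices`, `indices_char`, `chars`, `maxlen`, and `text` defined above; the hypothetical helper `generate_text` assumes the seed is exactly `maxlen` characters long:

```python
def generate_text(seed, length=400, temperature=0.8):
    # `seed` must be exactly `maxlen` characters drawn from the corpus alphabet.
    sentence = seed
    generated = seed
    for _ in range(length):
        # One-hot encode the current window of maxlen characters.
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.
        # Predict the next-character distribution and sample from it.
        preds = model.predict(x_pred, verbose=0)[0]
        next_char = indices_char[sample(preds, temperature)]
        generated += next_char
        # Slide the window one character forward.
        sentence = sentence[1:] + next_char
    return generated

# Example usage: seed with a random maxlen-character slice of the corpus.
start = random.randint(0, len(text) - maxlen - 1)
print(generate_text(text[start: start + maxlen]))
```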

## Results

The notebook includes the results of the text generation process, showing the text the trained LSTM model generates after each epoch from a random seed drawn from the corpus, at sampling diversities of 0.5 and 1.0.

## Contributing

If you'd like to contribute to this project, please follow these steps:

1. Fork the repository.
2. Create a new branch: `git checkout -b feature-branch-name`
3. Make your changes and commit them: `git commit -m 'Add some feature'`
4. Push to the branch: `git push origin feature-branch-name`
5. Submit a pull request.

## License

This project is licensed under the MIT License.

## Acknowledgements

- This project was created as part of an assignment by Harshraj Jadeja.
- Thanks to the open-source community for providing valuable resources and libraries for machine learning.