Commit 9aeb625

Merge pull request #45 from PranavBhatP/intel_blog
Create new blog at content/blogs/intel_multimodal_blog.md
2 parents b1445f2 + a720a97 commit 9aeb625

1 file changed: +386 −0 lines changed

Diff for: content/blogs/intel_multimodal_blog.md

@@ -0,0 +1,386 @@
[//]: <> (Authors: Pranav Bhat, Pranav Vinodh, Vanshika Mittal)

![multimodal_ml](https://github.com/user-attachments/assets/055c3836-e185-463a-b10c-6a600e576709)

# MultiModal Magic: Integrating Diverse Data for Smarter AI Systems

## Introduction: What does multimodal mean?

Traditional machine learning often focuses on one specific data type, such as text or images. But what if we could combine these forms of data to give our model a chance at making more complete predictions? Multimodal learning is a type of machine learning in which the model is trained to understand and work with multiple forms of input data, such as text, images, and audio.

![image](https://github.com/user-attachments/assets/4dae9b6b-81d9-418e-9f16-9a39e1821d3d)

## Why even use multiple modalities?

Multimodal machine learning is gaining traction because it imitates the way humans naturally process information from several different sources. These different types of data sources correspond to different modalities of the world: the world can be seen, heard, or described in words. For an ML model to perceive the world in all of its complexity, the ability to understand different modalities is a valuable skill.
It is an approach that is becoming increasingly relevant in fields such as healthcare, where, for example, combining patient records with medical images can lead to more accurate diagnoses.

## Architectures for Multi-modal Learning

### 1. Early Fusion
- The early fusion approach combines raw data from multiple sensors before any high-level processing or decision-making. It captures and processes interactions between modalities at the data level.
- The advantage here is that we don't have to perform dedicated processing for each modality (i.e., it requires only a single learning phase).
- The downside to this approach is that raw input data may not contain rich semantic information, so the model may fail to capture complex interactions between the modalities.

### 2. Late Fusion
- In this approach, all the modalities are learned independently and are combined right before the model makes a decision. In this type of multimodal learning, individual models are trained separately for different modes (text, vision, etc.) to make a local prediction. These individual results are then combined at a higher level to make the final fused prediction.
- The advantage of late fusion is its simplicity and isolation: each model gets to learn rich, modality-specific representations.
- The downside is that the system cannot learn complex cross-modal interactions, and thus does not benefit directly from the complementary information each modality might offer.
- Another downside is the high compute cost of processing data from each mode separately.

A minimal code sketch contrasting early and late fusion follows the figures below.

<div style="display: flex; text-align: center;">
<img src="https://github.com/user-attachments/assets/1ac8c672-625b-4dfc-81b5-54469ac9110e" alt="Image 1" style="margin-bottom: 20px;">
<img src="https://github.com/user-attachments/assets/1bc6402c-a804-44e9-ad2d-13d5e7c001eb" alt="Image 2">
</div>

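To make the contrast concrete, here is a minimal sketch of the two fusion styles in PyTorch. The feature sizes and the max-merge rule are illustrative assumptions, not part of the Fakeddit model built later in this post.

```python
import torch
import torch.nn as nn

# Toy feature sizes for two modalities (made up for illustration)
TEXT_DIM, IMG_DIM, NUM_CLASSES = 32, 64, 6

class EarlyFusionNet(nn.Module):
    """Concatenate the modality features first, then learn a single joint model."""
    def __init__(self):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(TEXT_DIM + IMG_DIM, 128), nn.ReLU(),
            nn.Linear(128, NUM_CLASSES),
        )

    def forward(self, text_feat, img_feat):
        return self.joint(torch.cat([text_feat, img_feat], dim=1))

class LateFusionNet(nn.Module):
    """Learn each modality independently and merge only the per-modality predictions."""
    def __init__(self):
        super().__init__()
        self.text_head = nn.Linear(TEXT_DIM, NUM_CLASSES)
        self.img_head = nn.Linear(IMG_DIM, NUM_CLASSES)

    def forward(self, text_feat, img_feat):
        # Element-wise max merge of the two sets of class scores
        return torch.max(self.text_head(text_feat), self.img_head(img_feat))

text_feat, img_feat = torch.randn(4, TEXT_DIM), torch.randn(4, IMG_DIM)
print(EarlyFusionNet()(text_feat, img_feat).shape)  # torch.Size([4, 6])
print(LateFusionNet()(text_feat, img_feat).shape)   # torch.Size([4, 6])
```

The Fakeddit classifier built later in this post follows the late-fusion pattern, with BERT and ResNet50 standing in for the two small heads.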
## Use cases:

1. **Medical Diagnosis**: Multimodal AI assists in medical diagnosis by combining data from various sources, including patient records, medical scans, and textual reports. This helps doctors and medical professionals diagnose conditions, formulate effective treatment plans, and improve patient care.
2. **Video Summarization**: Multimodal AI facilitates video summarization by extracting both audio and visual features. It speeds up content consumption, improves video content management systems, and makes browsing more efficient.
3. **Sentiment Analysis**: Multimodal AI can detect and understand human emotions from multiple sources, including voice tone, text sentiment, and facial expressions. This assists sentiment analysis on social media and mental health support systems in gauging and responding to users' emotional states.

## Hands-on project using PyTorch

### Problem statement

[Fakeddit](https://fakeddit.netlify.app/) is a fine-grained multimodal fake news detection dataset, designed to advance efforts to combat the spread of misinformation across multiple modalities, namely text and image data. The following model was built to classify the data in Fakeddit into 6 pre-defined classes:

![image](https://github.com/user-attachments/assets/094731e1-5f0d-4c5f-93b0-04b789964a99)

- Authentic/true news content
- Satire/Parody
- Content with false connection
- Imposter content
- Manipulated content
- Misleading content

The model combines features extracted from text (using BERT) and images (using ResNet50). These features are processed through fully connected layers and combined. The combined features are then passed through a softmax layer to predict the probabilities of each class, which indicate which category of (mis)information the post belongs to. The model is initialized and moved to the specified device for training or inference.

### Let's code

Before you proceed with this section, it is expected that you have a decent working knowledge of `pytorch` and its associated deep-learning libraries. For a starting point, you can refer to the [documentation](https://pytorch.org/docs/stable/index.html).

#### Import the required libraries
First, let's ensure you have the right Python and deep-learning libraries ready.
```python
import numpy as np
import pandas as pd
import os
import urllib.request
import sys
import random
from PIL import Image
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import precision_score, recall_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import torch.optim as optim

import torchvision
from torchvision.transforms import v2
from torchvision import models
from torchvision.models import resnet50, ResNet50_Weights
import torch.optim.lr_scheduler as lr_scheduler
from transformers import BertModel, BertTokenizer
```

#### Text Feature Extraction
For the text-feature extractor, we use a pre-trained [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) model. `BERT`, or Bidirectional Encoder Representations from Transformers, is a machine learning framework for natural language processing (NLP) trained on the Toronto BookCorpus (800M words) and English Wikipedia (2,500M words).

##### Why use BERT and BERT Embeddings in this specific use-case?
- `BERT` uses a bi-directional approach, considering both the left and right context of words in a sentence, instead of analyzing the text sequentially.
- We use BERT to extract features, namely word and sentence embedding vectors, from text data.
- These vectors are used as high-quality feature inputs to downstream models. NLP models such as LSTMs or CNNs require inputs in the form of numerical vectors, hence BERT is a good option for encoding variable-length text strings.

##### Model Working
- The model takes the `title_input_ids` and `title_attention_mask` as inputs and processes them using BERT.
- Extracts the `[CLS]` token representation from the last hidden states of BERT, which serves as a summary of the input text.
- Applies dropout to the extracted text features.
- Passes the features through a fully connected layer to map them to the number of classes.

```python
# Load the pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name, output_hidden_states=True)

# Put the model in evaluation mode, which turns off the dropout regularization used during training.
bert_model.eval()
```

```python
def get_bert_embedding(text):
    # Tokenize the input text and get the token IDs and attention mask
    inputs = tokenizer.encode_plus(text, add_special_tokens=True, return_tensors='pt', max_length=80, truncation=True, padding='max_length')

    return inputs['input_ids'].squeeze(0), inputs['attention_mask'].squeeze(0)
```
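
As a quick sanity check (our addition, not part of the original post), you can call the helper on a sample title; both returned tensors should have length 80, matching the `max_length` set above, and the ID sequence should start with BERT's `[CLS]` token (ID 101).

```python
# Hypothetical example title, used only to illustrate the output shapes
ids, mask = get_bert_embedding("Scientists discover new species of deep-sea fish")
print(ids.shape, mask.shape)  # torch.Size([80]) torch.Size([80])
print(ids[:5])                # tensor([101, ...]) -> starts with the [CLS] token
```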

#### Image Feature Extraction Process
For the image-feature extractor, we use a pre-trained [ResNet50](https://huggingface.co/microsoft/resnet-50) model trained on the ImageNet dataset for image classification tasks.

##### Why use ResNet50?
- ResNet50 is a deep learning model introduced in 2015 by Microsoft Research for visual recognition. The model is 50 layers deep.
- ResNet50's architecture (including shortcut connections between layers) significantly mitigates the vanishing gradient problem that arises during backpropagation, which allows for higher accuracy.
- The skip connections in ResNet50 facilitate smoother training and faster convergence, making it easier for the model to learn and update weights during training.

##### Model working
- The model processes the input image using ResNet50 to extract features (a sketch of the assumed image preprocessing that produces this input is shown below).
- Applies dropout to the extracted image features.
- Passes the features through a fully connected layer to map them to the number of classes.

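The original code does not show its image preprocessing, but a typical pipeline for an ImageNet-pre-trained ResNet50, using the `torchvision.transforms.v2` API imported earlier, might look like the sketch below. The 224x224 resize and the normalization statistics are standard ImageNet choices, assumed here rather than taken from the post.

```python
from torchvision.transforms import v2

# Assumed preprocessing for the ResNet50 branch: resize, convert to a float tensor in [0, 1],
# then normalize with the standard ImageNet statistics.
image_transform = v2.Compose([
    v2.Resize((224, 224)),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Usage: `img` is a PIL image loaded from the dataset
# img_tensor = image_transform(img)   # shape: (3, 224, 224)
```
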
#### Feature fusion:
We have implemented a late fusion architecture for the model. This combines the features of the two separate modalities.

##### Model working
- Combines the text and image features using an element-wise maximum operation.
- Applies the softmax function to the combined features to obtain class probabilities (in the implementation below, the explicit softmax is skipped because `nn.CrossEntropyLoss` applies it internally).

This is one of the most crucial steps of building the model and requires careful tuning of the hyperparameters. Refer to this [source](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html#sphx-glr-beginner-hyperparameter-tuning-tutorial-py) to learn more about hyperparameter tuning in deep learning models using PyTorch.
Here we have set the hyperparameters for you to ensure good results.
```python
class BERTResNetClassifier(nn.Module):
    def __init__(self, num_classes=6):
        super(BERTResNetClassifier, self).__init__()

        self.num_classes = num_classes

        # Image processing (ResNet)
        self.image_model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

        # Image processing (Fully Connected Layer)
        self.fc_image = nn.Linear(in_features=1000, out_features=num_classes, bias=True)

        # Dropout layer
        self.drop = nn.Dropout(p=0.3)

        # Text processing (using the 768-dimensional BERT embeddings)
        self.text_model = BertModel.from_pretrained("bert-base-uncased")

        # Text processing (Fully Connected Layer)
        self.fc_text = nn.Linear(in_features=self.text_model.config.hidden_size, out_features=num_classes, bias=True)

        # Fusion and classification
        self.softmax = nn.Softmax(dim=1)

    def forward(self, image, text_input_ids, text_attention_mask):
        # Image branch
        x_img = self.image_model(image)
        x_img = self.drop(x_img)
        x_img = self.fc_image(x_img)

        # Text branch
        x_text_last_hidden_states = self.text_model(
            input_ids=text_input_ids,
            attention_mask=text_attention_mask,
            return_dict=False
        )
        # [CLS] token representation, summarizing the title
        x_text_pooled_output = x_text_last_hidden_states[0][:, 0, :]
        x_text = self.drop(x_text_pooled_output)
        x_text = self.fc_text(x_text)

        # Fusion via element-wise max merge of the two sets of class scores
        x = torch.max(x_text, x_img)

        # Classification
        # x = self.softmax(x)  # not needed: CrossEntropyLoss applies softmax internally

        return x
```
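
Before moving on, a quick shape check (our addition, not in the original post) can confirm that the fused output has one score per class. The batch size, 80-token titles, and 224x224 images below are illustrative assumptions.

```python
# Hypothetical dummy batch, just to verify the output shape
dummy_img = torch.randn(2, 3, 224, 224)           # two RGB images
dummy_ids = torch.randint(0, 30522, (2, 80))      # two tokenized titles (BERT vocab size 30522)
dummy_mask = torch.ones(2, 80, dtype=torch.long)  # no padding positions masked

clf = BERTResNetClassifier(num_classes=6)
with torch.no_grad():
    logits = clf(dummy_img, dummy_ids, dummy_mask)
print(logits.shape)  # torch.Size([2, 6]) -> one score per class for each sample
```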
Now we proceed to the training and evaluation stage of the model that we have created above.

#### Training and Evaluation Loop
The code below first defines an early-stopping utility that halts training once the validation loss stops improving.
```python
class EarlyStopping:
    def __init__(self, patience=4, verbose=False, delta=0):
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
        self.delta = delta

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss > self.best_loss + self.delta:
            self.counter += 1
            if self.verbose:
                print(f"EarlyStopping counter: {self.counter} out of {self.patience}")
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = val_loss
            self.counter = 0
```
Set the `labels` and the `class_weights` for training the model. Here `df_train` is the training split of the Fakeddit metadata, and `device` is assumed to already be defined as in the final snippet below.
```python
labels = df_train['6_way_label'].to_numpy()
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(labels), y=labels)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)
```
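
The post does not include the `Dataset` class behind `train_loader`, `val_loader`, and `test_loader`, so here is a minimal sketch compatible with the loops below, which expect `(input_ids, attention_mask, label, image)` batches. The column names (`clean_title`, `id`, `6_way_label`), the image directory, and the file naming are assumptions about how the Fakeddit files are laid out.

```python
class FakedditDataset(Dataset):
    """Hypothetical dataset returning (input_ids, attention_mask, label, image) tuples."""
    def __init__(self, df, image_dir, transform):
        self.df = df.reset_index(drop=True)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        input_ids, attention_mask = get_bert_embedding(row['clean_title'])  # assumed title column
        image = Image.open(os.path.join(self.image_dir, f"{row['id']}.jpg")).convert('RGB')
        image = self.transform(image)  # e.g. the image_transform sketched earlier
        label = torch.tensor(row['6_way_label'], dtype=torch.long)
        return input_ids, attention_mask, label, image

# Assumed construction of the loaders used below
# train_dataset = FakedditDataset(df_train, 'images/train', image_transform)
# train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```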
The functions which implement the training and evaluation pipeline are given below.
```python
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs):
    early_stopping = EarlyStopping(patience=5, verbose=True)

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0

        for input_ids, attention_mask, label, img in train_loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            label = label.to(device)
            img = img.to(device)

            optimizer.zero_grad()

            # Forward pass
            outputs = model(img, input_ids, attention_mask)
            loss = criterion(outputs, label)

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * img.size(0)

        # Validating the model and ensuring the loss is decreasing
        model.eval()
        val_loss = 0.0
        correct_preds = 0
        with torch.no_grad():
            for input_ids, attention_mask, label, img in val_loader:
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                label = label.to(device)
                img = img.to(device)

                outputs = model(img, input_ids, attention_mask)
                loss = criterion(outputs, label)
                val_loss += loss.item() * img.size(0)

                _, preds = torch.max(outputs, 1)
                correct_preds += torch.sum(preds == label)

        val_loss = val_loss / len(val_loader.dataset)
        accuracy = correct_preds.double() / len(val_loader.dataset)
        scheduler.step(val_loss)
        print(f'Epoch {epoch+1}/{num_epochs}, Training Loss: {running_loss/len(train_loader.dataset):.4f}, Validation Loss: {val_loss:.4f}, Accuracy: {accuracy:.4f}')

        # Early stopping
        early_stopping(val_loss)
        if early_stopping.early_stop:
            print("Early stopping triggered. Stopping training.")
            break
```

Now let's evaluate the model.
```python
def evaluate_model(model, test_loader, criterion):
    model.eval()
    val_losses = []
    correct_preds = 0

    all_preds = []
    all_labels = []

    with torch.no_grad():
        for input_ids, attention_mask, label, img in test_loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            label = label.to(device)
            img = img.to(device)

            outputs = model(
                image=img,
                text_input_ids=input_ids,
                text_attention_mask=attention_mask
            )

            # The model returns one logit per class for each sample in the batch;
            # the index of the highest logit is taken as the class prediction.
            _, preds = torch.max(outputs, dim=1)

            # Loss is calculated by applying cross-entropy loss
            val_loss = criterion(outputs, label)

            # Counting correct model predictions
            correct_preds += torch.sum(preds == label)

            # Appending the current per-batch loss to the list of losses
            val_losses.append(val_loss.item())

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(label.cpu().numpy())

    accuracy = float((correct_preds.double() / len(test_loader.dataset)) * 100)
    precision = precision_score(all_labels, all_preds, average='weighted')
    recall = recall_score(all_labels, all_preds, average='weighted')

    print("\nAccuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)
```
336+
Finally, use this snippet of code to run the entire pipeline of training and evaluating the model.
337+
338+
```python
339+
device = 'cuda' if torch.cuda.is_available() else 'cpu'
340+
model = BERTResNetClassifier(num_classes=6)
341+
model= model.to(device)
342+
343+
criterion = nn.CrossEntropyLoss(weight=class_weights)
344+
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
345+
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, min_lr=1e-6, factor=0.5, patience=1, verbose=True)
346+
num_epochs = 20
347+
348+
train_model(model, train_loader,val_loader, criterion, optimizer, scheduler, num_epochs)
349+
350+
evaluate_model(model, test_loader, criterion)
351+
```
352+
353+
You are setup. Well done!
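
If you want to try the trained model on a single post, a sketch of single-sample inference might look like the following. The example title, image path, class-name ordering, and the `image_transform` helper come from the earlier sketches and are placeholders, not part of the original pipeline.

```python
# Hypothetical single-sample inference
class_names = ['True', 'Satire/Parody', 'False connection', 'Imposter content',
               'Manipulated content', 'Misleading content']  # placeholder label ordering

title = "Example headline to classify"
input_ids, attention_mask = get_bert_embedding(title)
img = image_transform(Image.open("example.jpg").convert('RGB'))

model.eval()
with torch.no_grad():
    logits = model(img.unsqueeze(0).to(device),
                   input_ids.unsqueeze(0).to(device),
                   attention_mask.unsqueeze(0).to(device))
    probs = torch.softmax(logits, dim=1).squeeze(0)

print(class_names[int(probs.argmax())], float(probs.max()))
```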

### Inferences
After training the model on both the image and text modalities of the dataset for 20 epochs, we obtain the following metrics.

![image](https://github.com/user-attachments/assets/45d2b7aa-04e3-4641-92b1-7d4172d45d18)

The model obtains an overall accuracy of 77.47%. If we want to calculate the F1 score as well, we can employ the formula below:

![image](https://github.com/user-attachments/assets/32dd25fb-39d9-4c65-90d1-f9839667cf5a)
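
In text form, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$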

The F1 score is approximately 0.774, which is a decent score for a multimodal task of this scale and underlines both the usefulness of the approach and the scope for future improvement in this field.

## Conclusion

Multimodal machine learning provides powerful tools for integrating diverse data types, enhancing the accuracy and robustness of models. Through the hands-on project with the Fakeddit dataset, we explored how combining visual and textual data can improve fake news detection. As multimodal approaches continue to evolve, they hold the potential to revolutionize industries by enabling more comprehensive and context-aware AI systems.

## References

These are some of the references that we have used to write this blog.

[1] https://keras.io/examples/nlp/multimodal_entailment/

[2] _I. Gallo, G. Ria, N. Landro and R. L. Grassa, "Image and Text fusion for UPMC Food-101 using BERT and CNNs," 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), 2020, pp. 1–6, doi: 10.1109/IVCNZ51579.2020.9290622._

[3] _Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang, "CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval."_

[4] _Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela, "FLAVA: A Foundational Language And Vision Alignment Model."_

[5] _Stuart Miller, Justin Howard, Paul Adams, "Multi-Modal Classification Using Images and Text."_

[6] _Baltrušaitis, Tadas, Chaitanya Ahuja, and Louis-Philippe Morency. "Multimodal machine learning: A survey and taxonomy." IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2 (2018): 423–443._
