Commit 5ef1390 ("model update")
Parent: f1ade46

7 files changed (+24 -15 lines)

README.md (+5 -5)
@@ -1,18 +1,18 @@
 # PicQ
 
-Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions about the content of an image. This project uses one of the popular multimodal models, [**MiniCPM-V-2_6**](https://huggingface.co/openbmb/MiniCPM-V-2_6) from the Hugging Face model hub for visual question answering.
+Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions about the content of an image. This project uses one of the popular multimodal models, [**MiniCPM-o 2.6**](https://huggingface.co/openbmb/MiniCPM-o-2_6) from the Hugging Face model hub.
 
-[**MiniCPM-V-2_6**](https://huggingface.co/openbmb/MiniCPM-V-2_6) is the latest model in the MiniCPM-V series, built on **SigLip-400M** and **Qwen2-7B** with a total of 8B parameters. It introduces new features for multi-image and video understanding. It also supports multilingual capabilities and produces fewer tokens than most models, improving inference speed, first-token latency, memory usage, and power consumption. It is easy to use in various ways, including CPU inference, quantized models, and online demos.
+[**MiniCPM-o 2.6**](https://huggingface.co/openbmb/MiniCPM-o-2_6) is the latest and most capable model in the MiniCPM-o series, built on **SigLip-400M**, **Whisper-medium-300M**, **ChatTTS-200M**, and **Qwen2.5-7B** with a total of 8B parameters. MiniCPM-o 2.6 significantly improves upon its predecessor, boasting advanced real-time speech conversation and multimodal live streaming capabilities. It surpasses proprietary models in visual and speech understanding, offers efficient processing, and provides easy usage options.
 
 ## Project Structure
 
 The project is structured as follows:
 
 - `src\`: The folder that contains the source code for the project.
 
-  - `app\`: The folder containing the source code for the application's main functionality.
+  - `minicpm\`: The folder containing the source code for the application's main functionality.
 
-    - `model.py`: The file that contains the code for loading the model and the tokenizer.
+    - `model.py`: The file that contains the code for loading the model, tokenizer and processor.
     - `response.py`: The file that contains the function for generating the response for the input image and question.
 
   - `config.py`: This file contains the configuration for the used model.
@@ -49,7 +49,7 @@ Now, open up your local host and see the web application running. For more infor
 
 **Note**: You need a Hugging Face access token to run the application. You can get the token by signing up on the Hugging Face website and creating a new token from the settings page. After getting the token, you can set it as an environment variable `ACCESS_TOKEN` in your system by creating a `.env` file in the project's root directory. Check the `.env.example` file for reference.
 
-The application is hosted on Hugging Face Spaces running on a GPU. You are expected to have a GPU for local use when running the application. If you do not have a GPU, you can explore the CPU inference option provided by the model [here](https://huggingface.co/collections/openbmb/minicpm-65d48bf958302b9fd25b698f).
+The application is hosted on Hugging Face Spaces running on a GPU. You are expected to have a GPU for local use when running the application. If you do not have a GPU, you can explore the local inference option provided by the model [here](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md).
 
 ## Usage
 
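Side note on the token setup described above: with python-dotenv (pinned in requirements.txt), reading the token is typically a couple of lines. A minimal sketch, assuming a `.env` entry of the form `ACCESS_TOKEN=<your token>`:

import os

from dotenv import load_dotenv

# Read variables from the .env file in the project root.
load_dotenv()

# ACCESS_TOKEN is the variable name this project uses (see the note above).
access_token = os.getenv("ACCESS_TOKEN")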
app.py (+3 -3)
@@ -3,7 +3,7 @@
 warnings.filterwarnings("ignore")
 
 import gradio as gr
-from src.app.response import describe_image
+from src.minicpm.response import describe_image
 
 
 # Image, text query, and input parameters
@@ -49,8 +49,8 @@
 
 # Title, description, and article for the interface
 title = "Visual Question Answering"
-description = "Gradio Demo for the MiniCPM-V 2.6 Vision Language Understanding and Generation model. This model can answer questions about images in natural language. To use it, upload your image, type a question, select associated parameters, use the default values, click 'Submit', or click one of the examples to load them. You can read more at the links below."
-article = "<p style='text-align: center'><a href='https://github.com/OpenBMB/MiniCPM-V' target='_blank'>Model GitHub Repo</a> | <a href='https://huggingface.co/openbmb/MiniCPM-V-2_6' target='_blank'>Model Page</a></p>"
+description = "Gradio Demo for the MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming. This model can answer questions about images in natural language. To use it, upload your image, type a question, select associated parameters, use the default values, click 'Submit', or click one of the examples to load them. You can read more at the links below."
+article = "<p style='text-align: center'><a href='https://github.com/OpenBMB/MiniCPM-o' target='_blank'>Model GitHub Repo</a> | <a href='https://huggingface.co/openbmb/MiniCPM-o-2_6' target='_blank'>Model Page</a></p>"
 
 
 # Launch the interface
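For orientation, a sketch of how these strings and `describe_image` plausibly come together in app.py. The input components and parameter controls are not part of this hunk, so the wiring below is illustrative only:

import gradio as gr

from src.minicpm.response import describe_image

# Hypothetical wiring; the real app.py also exposes the decoding parameters
# mentioned in the description. title, description, and article are the
# strings assigned in the hunk above.
interface = gr.Interface(
    fn=describe_image,
    inputs=[
        gr.Image(type="filepath", label="Image"),
        gr.Textbox(label="Question"),
    ],
    outputs=gr.Textbox(label="Answer"),
    title=title,
    description=description,
    article=article,
)

interface.launch()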

requirements.txt (+9 -3)
@@ -2,9 +2,15 @@ python-dotenv==1.0.1
 numpy==1.26.4
 Pillow==10.1.0
 torch==2.1.2
+torchaudio==2.1.2
 torchvision==0.16.2
-transformers==4.40.2
+transformers==4.44.2
 sentencepiece==0.1.99
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.2/flash_attn-2.6.2+cu123torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
-gradio
-decord
+decord
+librosa==0.9.0
+soundfile==0.12.1
+vector-quantize-pytorch==1.18.5
+vocos==0.1.0
+moviepy
+gradio

src/config.py (+3 -3)
@@ -1,8 +1,8 @@
 # Model settings
 device = "cuda"
-model_name = "openbmb/MiniCPM-V-2_6"
+model_name = "openbmb/MiniCPM-o-2_6"
 
 # Decoding settings
-sampling = True
-stream = True
+sampling = False
+stream = False
 repetition_penalty = 1.05
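For context, these decoding flags are forwarded to the model's chat interface in response.py: with sampling = False the model decodes greedily, and with stream = False it returns the whole answer at once instead of token chunks. A hedged sketch of that call, following the usage documented for MiniCPM models (the repo's exact response.py code is not shown in this diff):

from PIL import Image

from src.config import device, model_name, sampling, stream, repetition_penalty
from src.minicpm.model import load_model_tokenizer_and_processor

# Assumes the loader returns (model, tokenizer, processor); only part of its
# body appears in the model.py hunk below.
model, tokenizer, processor = load_model_tokenizer_and_processor(model_name, device)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is shown in this image?"]}]

# model.chat is the chat API that MiniCPM checkpoints expose via
# trust_remote_code; extra kwargs are passed through to generation.
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=sampling,
    stream=stream,
    repetition_penalty=repetition_penalty,
)
print(answer)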
File renamed without changes.

src/app/model.py → src/minicpm/model.py (+3)
@@ -38,6 +38,9 @@ def load_model_tokenizer_and_processor(model_name: str, device: str) -> Any:
         trust_remote_code=True,
         attn_implementation="sdpa",
         torch_dtype=torch.bfloat16,
+        init_vision=True,
+        init_audio=False,
+        init_tts=False,
         token=access_token
     )
     model = model.eval().to(device=device)
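The three new init_* flags are specific to MiniCPM-o 2.6, which bundles vision, audio, and TTS modules; turning off the audio and TTS parts avoids loading weights an image-only app never uses. A sketch of the surrounding call, reconstructed from the hunk above (not the file's full body; the access-token argument is omitted here):

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    attn_implementation="sdpa",  # scaled-dot-product attention
    torch_dtype=torch.bfloat16,
    init_vision=True,   # keep the vision encoder for image QA
    init_audio=False,   # skip the audio (Whisper) branch
    init_tts=False,     # skip the TTS (ChatTTS) branch
)
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-o-2_6", trust_remote_code=True
)
model = model.eval().to(device="cuda")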

src/app/response.py → src/minicpm/response.py (+1 -1)
@@ -11,7 +11,7 @@
     stream,
     repetition_penalty,
 )
-from src.app.model import load_model_tokenizer_and_processor
+from src.minicpm.model import load_model_tokenizer_and_processor
 from src.logger import logging
 from src.exception import CustomExceptionHandling
 