Commit 5ef1390 ("model update")
Parent: f1ade46

7 files changed (+24 -15 lines)

README.md (+5 -5)
@@ -1,18 +1,18 @@
 # PicQ
 
-Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions about the content of an image. This project uses one of the popular multimodal models, [**MiniCPM-V-2_6**](https://huggingface.co/openbmb/MiniCPM-V-2_6) from the Hugging Face model hub for visual question answering.
+Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions about the content of an image. This project uses one of the popular multimodal models, [**MiniCPM-o 2.6**](https://huggingface.co/openbmb/MiniCPM-o-2_6) from the Hugging Face model hub.
 
-[**MiniCPM-V-2_6**](https://huggingface.co/openbmb/MiniCPM-V-2_6) is the latest model in the MiniCPM-V series, built on **SigLip-400M** and **Qwen2-7B** with a total of 8B parameters. It introduces new features for multi-image and video understanding. It also supports multilingual capabilities and produces fewer tokens than most models, improving inference speed, first-token latency, memory usage, and power consumption. It is easy to use in various ways, including CPU inference, quantized models, and online demos.
+[**MiniCPM-o 2.6**](https://huggingface.co/openbmb/MiniCPM-o-2_6) is the latest and most capable model in the MiniCPM-o series, built on **SigLip-400M**, **Whisper-medium-300M**, **ChatTTS-200M**, and **Qwen2.5-7B** with a total of 8B parameters. MiniCPM-o 2.6 significantly improves upon its predecessor, boasting advanced real-time speech conversation and multimodal live streaming capabilities. It surpasses proprietary models in visual and speech understanding, offers efficient processing, and provides easy usage options.
 
 ## Project Structure
 
 The project is structured as follows:
 
 - `src\`: The folder that contains the source code for the project.
 
-  - `app\`: The folder containing the source code for the application's main functionality.
+  - `minicpm\`: The folder containing the source code for the application's main functionality.
 
-    - `model.py`: The file that contains the code for loading the model and the tokenizer.
+    - `model.py`: The file that contains the code for loading the model, tokenizer and processor.
     - `response.py`: The file that contains the function for generating the response for the input image and question.
 
   - `config.py`: This file contains the configuration for the used model.
@@ -49,7 +49,7 @@ Now, open up your local host and see the web application running. For more infor
 
 **Note**: You need a Hugging Face access token to run the application. You can get the token by signing up on the Hugging Face website and creating a new token from the settings page. After getting the token, you can set it as an environment variable `ACCESS_TOKEN` in your system by creating a `.env` file in the project's root directory. Check the `.env.example` file for reference.
 
-The application is hosted on Hugging Face Spaces running on a GPU. You are expected to have a GPU for local use when running the application. If you do not have a GPU, you can explore the CPU inference option provided by the model [here](https://huggingface.co/collections/openbmb/minicpm-65d48bf958302b9fd25b698f).
+The application is hosted on Hugging Face Spaces running on a GPU. You are expected to have a GPU for local use when running the application. If you do not have a GPU, you can explore the local inference option provided by the model [here](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md).
 
 ## Usage
 
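Side note on the token setup described above: with python-dotenv (pinned in requirements.txt), reading the token is typically a couple of lines. A minimal sketch, assuming a `.env` entry of the form `ACCESS_TOKEN=<your token>`:

import os

from dotenv import load_dotenv

# Read variables from the .env file in the project root.
load_dotenv()

# ACCESS_TOKEN is the variable name this project uses (see the note above).
access_token = os.getenv("ACCESS_TOKEN")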
app.py (+3 -3)
@@ -3,7 +3,7 @@
 warnings.filterwarnings("ignore")
 
 import gradio as gr
-from src.app.response import describe_image
+from src.minicpm.response import describe_image
 
 
 # Image, text query, and input parameters
@@ -49,8 +49,8 @@
 
 # Title, description, and article for the interface
 title = "Visual Question Answering"
-description = "Gradio Demo for the MiniCPM-V 2.6 Vision Language Understanding and Generation model. This model can answer questions about images in natural language. To use it, upload your image, type a question, select associated parameters, use the default values, click 'Submit', or click one of the examples to load them. You can read more at the links below."
-article = "<p style='text-align: center'><a href='https://github.com/OpenBMB/MiniCPM-V' target='_blank'>Model GitHub Repo</a> | <a href='https://huggingface.co/openbmb/MiniCPM-V-2_6' target='_blank'>Model Page</a></p>"
+description = "Gradio Demo for the MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming. This model can answer questions about images in natural language. To use it, upload your image, type a question, select associated parameters, use the default values, click 'Submit', or click one of the examples to load them. You can read more at the links below."
+article = "<p style='text-align: center'><a href='https://github.com/OpenBMB/MiniCPM-o' target='_blank'>Model GitHub Repo</a> | <a href='https://huggingface.co/openbmb/MiniCPM-o-2_6' target='_blank'>Model Page</a></p>"
 
 
 # Launch the interface
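For orientation, a sketch of how these strings and `describe_image` plausibly come together in app.py. The input components and parameter controls are not part of this hunk, so the wiring below is illustrative only:

import gradio as gr

from src.minicpm.response import describe_image

# Hypothetical wiring; the real app.py also exposes the decoding parameters
# mentioned in the description. title, description, and article are the
# strings assigned in the hunk above.
interface = gr.Interface(
    fn=describe_image,
    inputs=[
        gr.Image(type="filepath", label="Image"),
        gr.Textbox(label="Question"),
    ],
    outputs=gr.Textbox(label="Answer"),
    title=title,
    description=description,
    article=article,
)

interface.launch()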

requirements.txt (+9 -3)
@@ -2,9 +2,15 @@ python-dotenv==1.0.1
 numpy==1.26.4
 Pillow==10.1.0
 torch==2.1.2
+torchaudio==2.1.2
 torchvision==0.16.2
-transformers==4.40.2
+transformers==4.44.2
 sentencepiece==0.1.99
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.2/flash_attn-2.6.2+cu123torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
-gradio
-decord
+decord
+librosa==0.9.0
+soundfile==0.12.1
+vector-quantize-pytorch==1.18.5
+vocos==0.1.0
+moviepy
+gradio

src/config.py (+3 -3)
@@ -1,8 +1,8 @@
 # Model settings
 device = "cuda"
-model_name = "openbmb/MiniCPM-V-2_6"
+model_name = "openbmb/MiniCPM-o-2_6"
 
 # Decoding settings
-sampling = True
-stream = True
+sampling = False
+stream = False
 repetition_penalty = 1.05
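For context, these decoding flags are forwarded to the model's chat interface in response.py: with sampling = False the model decodes greedily, and with stream = False it returns the whole answer at once instead of token chunks. A hedged sketch of that call, following the usage documented for MiniCPM models (the repo's exact response.py code is not shown in this diff):

from PIL import Image

from src.config import device, model_name, sampling, stream, repetition_penalty
from src.minicpm.model import load_model_tokenizer_and_processor

# Assumes the loader returns (model, tokenizer, processor); only part of its
# body appears in the model.py hunk below.
model, tokenizer, processor = load_model_tokenizer_and_processor(model_name, device)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is shown in this image?"]}]

# model.chat is the chat API that MiniCPM checkpoints expose via
# trust_remote_code; extra kwargs are passed through to generation.
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=sampling,
    stream=stream,
    repetition_penalty=repetition_penalty,
)
print(answer)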
File renamed without changes.

src/app/model.py → src/minicpm/model.py (+3)
@@ -38,6 +38,9 @@ def load_model_tokenizer_and_processor(model_name: str, device: str) -> Any:
         trust_remote_code=True,
         attn_implementation="sdpa",
         torch_dtype=torch.bfloat16,
+        init_vision=True,
+        init_audio=False,
+        init_tts=False,
         token=access_token
     )
     model = model.eval().to(device=device)
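The three new init_* flags are specific to MiniCPM-o 2.6, which bundles vision, audio, and TTS modules; turning off the audio and TTS parts avoids loading weights an image-only app never uses. A sketch of the surrounding call, reconstructed from the hunk above (not the file's full body; the access-token argument is omitted here):

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    attn_implementation="sdpa",  # scaled-dot-product attention
    torch_dtype=torch.bfloat16,
    init_vision=True,   # keep the vision encoder for image QA
    init_audio=False,   # skip the audio (Whisper) branch
    init_tts=False,     # skip the TTS (ChatTTS) branch
)
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-o-2_6", trust_remote_code=True
)
model = model.eval().to(device="cuda")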

src/app/response.py → src/minicpm/response.py (+1 -1)
@@ -11,7 +11,7 @@
     stream,
     repetition_penalty,
 )
-from src.app.model import load_model_tokenizer_and_processor
+from src.minicpm.model import load_model_tokenizer_and_processor
 from src.logger import logging
 from src.exception import CustomExceptionHandling
 