Thanks for your interest in our work. The number of users has exceeded our expectations, so we provide alternative demo links here: Link1 Link2 Link3 Link4

Occasionally, we may need to re-run the model due to connection issues; kindly refresh the page to access the new link. We are currently preparing a lighter model that runs on a single 3090 GPU, which you will be able to run on your own machine. Please stay updated by visiting our Github page at Github.


Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as detailed image description generation and website creation from handwritten drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos. In our experiments, we found that pretraining on raw image-text pairs alone can produce unnatural language outputs that lack coherence, including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs.
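As a concrete illustration of the second-stage finetuning, here is a minimal sketch of a conversational template in Python. The ###Human/###Assistant turn markers and the <Img><ImageHere></Img> placeholder follow the style described for the paper, but the instruction wording and function names here are illustrative assumptions, not the verbatim template.

```python
# Minimal sketch of a conversational template for second-stage finetuning.
# Assumptions: ###Human / ###Assistant turn markers and an <Img>...</Img>
# placeholder where the projected visual tokens are spliced in; the
# instruction pool below is illustrative, not the paper's exact set.
import random

INSTRUCTIONS = [
    "Describe this image in detail.",
    "Take a look at this image and describe what you notice.",
]

def build_prompt(instruction: str | None = None) -> str:
    """Wrap an instruction around the image placeholder in conversation style."""
    instruction = instruction or random.choice(INSTRUCTIONS)
    return f"###Human: <Img><ImageHere></Img> {instruction} ###Assistant: "
```

Training on pairs formatted this way, rather than on raw captions, is what the abstract credits with removing the repetition and fragmented sentences seen after the first stage.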

MiniGPT-4 consists of a vision encoder with a pretrained ViT and Q-Former, a single linear projection layer, and the advanced Vicuna large language model. MiniGPT-4 only requires training the linear projection layer to align the visual features with Vicuna.

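A minimal PyTorch sketch of this trainable alignment module is shown below. The dimensions are assumptions for illustration: 768 for the Q-Former output tokens and 4096 for Vicuna-7B's hidden size; the class and variable names are hypothetical, not from the released code.

```python
# Minimal sketch of the single trainable projection layer described above.
# Assumed dimensions: 768 (Q-Former output) -> 4096 (Vicuna-7B hidden size).
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    def __init__(self, qformer_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # The single linear layer is the only component that receives gradients.
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        # qformer_tokens: (batch, num_query_tokens, qformer_dim) produced by
        # the frozen ViT + Q-Former. The output lives in Vicuna's embedding
        # space and is prepended to the text token embeddings.
        return self.proj(qformer_tokens)
```

Because the vision encoder and the LLM both stay frozen (e.g. setting requires_grad = False on their parameters), only this layer is updated, which is why training on roughly 5 million image-text pairs remains computationally cheap.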