- LLaVA: Large Language and Vision Assistant - GitHub
[10/5] 🔥 LLaVA-1.5 is out! It achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report and explore the demo!
- [LLM Multimodal] LLaVA Model Architecture and Training Process | CLIP Model - CSDN Blog
LLaVA's model architecture is very simple: CLIP + an LLM (Vicuna, a LLaMA-style architecture). The vision encoder converts the image into a feature map of shape [N=1, grid_H x grid_W, hidden_dim], followed by a projection layer W that aligns the image features with the text feature dimension.
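To make that shape bookkeeping concrete, here is a minimal PyTorch sketch of the projection step. The dimensions and the `ImageProjector` name are illustrative assumptions (CLIP ViT-L/14 patch features, a Vicuna-7B hidden size), not the reference implementation; LLaVA v1 uses a single linear layer, while LLaVA-1.5 swaps in a two-layer MLP.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration: CLIP ViT-L/14 emits 1024-d
# patch features; Vicuna-7B uses a 4096-d hidden size.
vision_dim, llm_dim = 1024, 4096

class ImageProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # LLaVA v1: a single linear layer W; LLaVA-1.5 uses a 2-layer MLP here.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: [N=1, grid_H * grid_W, vision_dim]
        return self.proj(image_features)  # -> [1, grid_H * grid_W, llm_dim]

# A 24x24 patch grid (e.g. a 336px input to CLIP ViT-L/14), batch of one image.
features = torch.randn(1, 24 * 24, vision_dim)
tokens = ImageProjector(vision_dim, llm_dim)(features)
print(tokens.shape)  # torch.Size([1, 576, 4096])
```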
- LLaVA (Large Language and Vision Assistant) Large Model - Zhihu
The researchers developed a large multimodal model (LMM), LLaVA, by connecting CLIP's open-source vision encoder with the language decoder LLaMA, and fine-tuned it end-to-end on generated visual-language instruction data.
- LLaVA
LLaVA represents a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
- LLaVA: Large Language and Vision Assistant - Microsoft Research
LLaVA represents a cost-efficient approach to building a general-purpose multimodal assistant. It is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
- LLaVa - Hugging Face
Overview: LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multi-modal version of LLMs fine-tuned for chat instructions.
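Since this entry points to the Hugging Face integration, here is a minimal usage sketch with the transformers API; the llava-hf/llava-1.5-7b-hf checkpoint name and the USER/ASSISTANT prompt template follow the library's documented conventions, but treat this as an illustrative sketch rather than canonical code.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Any RGB image works; this COCO val image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 chat template: the <image> token marks where visual tokens go.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```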
- The LLaVA Series: LLaVA, LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision
LLaVA is a family of multimodal large models with an extremely simple architecture. Unlike Flamingo's cross-attention mechanism or the BLIP series' Q-Former, LLaVA directly maps visual features into the text feature space with a simple linear layer, and achieves strong results on a range of multimodal tasks.
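To illustrate the contrast, the sketch below shows LLaVA-style fusion as plain sequence concatenation in the LLM embedding space. All shapes are hypothetical, and the real model splices the projected image tokens in at the `<image>` placeholder position rather than simply prepending them.

```python
import torch

# Illustrative shapes: 576 projected image tokens and 16 text tokens,
# both in the LLM's 4096-d embedding space (hypothetical values).
image_tokens = torch.randn(1, 576, 4096)  # output of the projection layer
text_embeds = torch.randn(1, 16, 4096)    # LLM token embeddings of the prompt

# LLaVA-style fusion: plain concatenation along the sequence axis.
# Flamingo instead injects image features via cross-attention blocks,
# and the BLIP series first compresses them with a Q-Former.
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 592, 4096]) -> fed to the decoder
```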
- GitHub - ictnlp/LLaVA-Mini: LLaVA-Mini is a unified large multimodal . . .
LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Guided by the interpretability within LMMs, LLaVA-Mini significantly improves efficiency while ensuring vision capabilities.