- Qwen-VL: A Versatile Vision-Language Model for Understanding ...
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a ...
- Gated Attention for Large Language Models: Non-linearity, Sparsity, ...
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature ...
- LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8%, using merely 0.3% of the training data and 23% trainable parameters. The results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for developing efficient MLLMs.
- TwinFlow: Realizing One-step Generation on Large Models with ...
Qwen-Image-Lightning is the 1-step leader on the DPG benchmark and should be marked like this in Table 2. Distillation fine-tuning vs. full training method: Qwen-Image-TwinFlow (and possibly also TwinFlow-0.6B and TwinFlow-1.6B, see question below) leverages a pretrained model that is fine-tuned ...
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
(iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of ...
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context ...
Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency losslessly, but the conventional wisdom suggests that its efficacy is limited ...
- Towards Understanding Distilled Reasoning Models: A ...
To explore this, we train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and computation verification.
- LiveVQA: Assessing Models with Live Visual Knowledge
We introduce LiveVQA, an automatically collected dataset of latest visual knowledge from the Internet with synthesized VQA problems. LiveVQA consists of 3,602 single- and multi-hop visual questions from 6 news websites across 14 news categories, featuring high-quality image-text coherence and authentic information. Our evaluation across 15 MLLMs (e.g., GPT-4o, Gemma-3, and Qwen-2.5-VL family) ...