- Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision.
- arXiv.org e-Print archive
This paper explores pre-training models for learning state-of-the-art image representations using natural language captions paired with images.
- Learning Transferable Visual Models From Natural Language Supervision
In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression on the features of many image models, including the best publicly available ImageNet models, self-supervised learning methods, and CLIP itself.
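The snippet above contrasts zero-shot CLIP with few-shot linear probes. As a rough illustration of the few-shot side only, the sketch below fits a logistic regression on frozen image features; the feature arrays, dimensions, and regularization strength are placeholders, not values or code from the paper. In the paper's setting the features would come from a frozen encoder such as CLIP's image tower.

```python
# Minimal sketch of a few-shot linear probe: logistic regression fit on
# frozen image features with only k labeled examples per class.
# All arrays below are random stand-ins for real encoder outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, k_shot, feat_dim = 10, 4, 512

# Placeholder "image features": k_shot training examples per class,
# plus a held-out evaluation set.
train_x = rng.normal(size=(num_classes * k_shot, feat_dim))
train_y = np.repeat(np.arange(num_classes), k_shot)
test_x = rng.normal(size=(200, feat_dim))
test_y = rng.integers(0, num_classes, size=200)

# L2-regularized logistic regression on the frozen features (linear probe).
# The regularization strength here is illustrative, not the paper's setting.
clf = LogisticRegression(C=0.316, max_iter=1000)
clf.fit(train_x, train_y)
print("few-shot probe accuracy:", clf.score(test_x, test_y))
```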
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Abstract: CLIP is a foundational multimodal model that aligns image and text features into a shared space using contrastive learning on large-scale image-text pairs. Its strength lies in leveraging natural language as a rich supervisory signal. With the rapid progress of large language models (LLMs), we explore their potential to further enhance CLIP’s multimodal representation learning.
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Abstract: CLIP is a foundational multimodal model that aligns image and text features into a shared representation space via contrastive learning on large-scale image-text pairs. Its effectiveness primarily stems from the use of natural language as rich supervision. Motivated by the remarkable advancements in large language models (LLMs), this work explores how LLMs’ superior text …
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Abstract: CLIP is one of the most important multimodal foundational models today, aligning visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale image-text pairs. What powers CLIP’s capabilities? The rich supervision signals provided by natural language — the carrier of human knowledge — shape a powerful cross-modal representation …
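The abstracts above repeatedly describe CLIP's objective: a symmetric contrastive loss that pulls matching image-text pairs together in a shared embedding space. The sketch below is a simplified rendering of that idea, not CLIP's actual training code; the batch size, feature dimension, and fixed temperature are illustrative choices (CLIP itself learns the temperature during training).

```python
# Sketch of a CLIP-style symmetric contrastive loss: image and text embeddings
# are L2-normalized, compared by a scaled dot product, and trained so that
# matching pairs score highest in both directions. Encoder outputs here are
# random stand-ins for a real image tower and text tower.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Project both modalities onto the unit sphere so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image should match the i-th text, so the target is the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings for a batch of 8 image-text pairs with 512-d features.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt))
```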
- Scaling Language-Free Visual Representation Learning
Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.