Last updated on October 1, 2024
Context Optimization (CoOp) is one of the first methods in the field of prompt learning for vision-language models. It automates prompt engineering: instead of manually tuning prompt words, CoOp models the context with learnable vectors, making it easier and faster to adapt the model to different image classification tasks.
By keeping the pre-trained model parameters fixed and only learning the prompt context, CoOp requires very few labeled images (shots) to outperform manually crafted prompts.
Unlike zero-shot methods, CoOp learns its context from data as continuous vectors, and it still maintains strong domain generalization, meaning the learned prompts transfer across datasets without a significant loss of performance.
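To make this concrete, here is a minimal PyTorch sketch of the idea: a small set of context vectors is the only trainable parameter, while the image and text encoders stay frozen. The encoders below are simple stand-ins rather than CLIP itself, and all names, shapes, and hyperparameters are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of CoOp-style context optimization.
# Only the context vectors receive gradients; the encoders stay frozen.

class PromptLearner(nn.Module):
    def __init__(self, num_classes, ctx_len=16, embed_dim=512):
        super().__init__()
        # M learnable context vectors [V]1 ... [V]M, shared by all classes
        self.ctx = nn.Parameter(torch.empty(ctx_len, embed_dim))
        nn.init.normal_(self.ctx, std=0.02)
        # Frozen class-name embeddings (in real CoOp these come from
        # CLIP's token embedding of each class name)
        self.register_buffer("class_emb", torch.randn(num_classes, 1, embed_dim))

    def forward(self):
        # Build one prompt per class: [V]1 ... [V]M [CLASS]
        ctx = self.ctx.unsqueeze(0).expand(self.class_emb.size(0), -1, -1)
        return torch.cat([ctx, self.class_emb], dim=1)  # (C, M+1, D)

# Frozen stand-in encoders (placeholders for CLIP's text/image encoders)
text_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(17 * 512, 512))   # 17 = M + 1
image_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 32 * 32, 512))
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad_(False)

prompt_learner = PromptLearner(num_classes=10)
optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=0.002)

# One few-shot training step on dummy data
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

img_feat = F.normalize(image_encoder(images), dim=-1)
txt_feat = F.normalize(text_encoder(prompt_learner()), dim=-1)
logits = 100.0 * img_feat @ txt_feat.t()   # cosine similarity scaled by a temperature
loss = F.cross_entropy(logits, labels)
loss.backward()                            # gradients flow only into the context vectors
optimizer.step()
```

In the real method, the concatenated prompts are fed through CLIP's pre-trained text encoder and compared against image features exactly as in zero-shot inference; the only change is that the context words are now optimized.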
CoOp can be applied to vision-language models like CLIP for various downstream image classification tasks. It comes in two prompt variants: a unified context shared by all classes and a class-specific context. The unified form is:
[V]1 [V]2 ... [V]M [CLASS]

where [V]1 ... [V]M are the learnable context vectors, M is a hyperparameter specifying the number of context tokens, and [CLASS] is the target class name (e.g., "cat").
[V_class1]1 [V_class1]2 ... [V_class1]M [CLASS] for "cat"
[V_class2]1 [V_class2]2 ... [V_class2]M [CLASS] for "dog"

Here, each class (e.g., "cat" and "dog") has its own set of context vectors optimized for that class.
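The difference between the two variants is simply the shape of the learnable context tensor. A short sketch (shapes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

M, D, C = 16, 512, 2  # context length, embedding dim, number of classes ("cat", "dog")

# Unified context: one set of M vectors shared across all classes
unified_ctx = nn.Parameter(torch.empty(M, D))

# Class-specific context: each class gets its own M vectors
class_specific_ctx = nn.Parameter(torch.empty(C, M, D))

nn.init.normal_(unified_ctx, std=0.02)
nn.init.normal_(class_specific_ctx, std=0.02)

class_emb = torch.randn(C, 1, D)  # stand-in for the embedded class names

# Building the prompts [V]1 ... [V]M [CLASS] for every class
unified_prompts = torch.cat([unified_ctx.expand(C, -1, -1), class_emb], dim=1)  # (C, M+1, D)
class_specific_prompts = torch.cat([class_specific_ctx, class_emb], dim=1)      # (C, M+1, D)
```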
An open-source implementation of CoOp is available from the authors.
CoOp has been tested on 11 datasets, covering a wide range of visual tasks including object recognition, fine-grained classification (e.g., flowers, pets), and more specialized tasks like texture and satellite image recognition.
CoOp significantly outperforms other fine-tuning methods on ImageNet when using 16 training examples per class. In particular, it reaches 62.95% accuracy, an absolute gain of 4.77 points over zero-shot CLIP's 58.18%. Other baselines, such as fine-tuning the image encoder or optimizing specific text transformations, show smaller improvements or even large performance drops.
| Method | ImageNet accuracy (%) | ∆ vs. zero-shot |
|---|---|---|
| Zero-shot CLIP | 58.18 | - |
| Linear probe | 55.87 | -2.31 |
| Fine-tuning image encoder | 18.28 | -39.90 |
| Optimizing transformation layer | 58.86 | +0.68 |
| Optimizing bias (text) | 60.93 | +2.75 |
| CoOp | 62.95 | +4.77 |
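For reference, the zero-shot baseline in the first row corresponds to classifying with a hand-crafted template such as "a photo of a {class}." Below is a sketch of that baseline using OpenAI's CLIP package; the image path and class names are placeholders. CoOp's gain comes from replacing the hand-written context words with the learned vectors.

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hand-crafted prompt template used by the zero-shot baseline
class_names = ["cat", "dog", "car"]
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({c: float(p) for c, p in zip(class_names, probs[0])})
```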
Key Takeaways:
- CoOp replaces hand-crafted prompt words with learnable context vectors while keeping the pre-trained vision-language model frozen.
- With only a few labeled images per class (16 shots on ImageNet), it outperforms both manual prompts and other fine-tuning baselines, gaining 4.77 points over zero-shot CLIP.
- The learned prompts generalize well, holding up across the 11 evaluated datasets and under domain shift.
Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337–2348. https://doi.org/10.1007/s11263-022-01653-1