Last updated on October 1, 2024
In vision-language models like CLIP, prompt learning (also called "learning to prompt") is a method for improving visual recognition by optimizing how the model is "prompted" to relate images and text. In other words, it automates the prompt engineering that vision-language models otherwise require. These models align images and text in a shared feature space, which lets them classify new images by comparing them against text descriptions rather than relying on a fixed set of pre-defined categories.
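Classification in a shared feature space can be sketched as follows. This is a toy illustration, not CLIP itself: the embeddings are made-up vectors, and the class names stand in for text descriptions such as "a photo of a cat".

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy embeddings in a shared 4-dimensional feature space
# (illustrative values, not outputs of a real encoder).
image_embedding = normalize(np.array([0.9, 0.1, 0.0, 0.2]))
text_embeddings = normalize(np.array([
    [1.0, 0.0, 0.0, 0.1],   # embedding of "a photo of a cat"
    [0.0, 1.0, 0.2, 0.0],   # embedding of "a photo of a dog"
]))
class_names = ["cat", "dog"]

# Cosine similarity between the image and each text description;
# the most similar description gives the predicted class.
similarities = text_embeddings @ image_embedding
prediction = class_names[int(np.argmax(similarities))]
print(prediction)  # → cat
```

Because classes are represented as text rather than as fixed output units, adding a new class only requires embedding a new description.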
A major challenge with these models is prompt engineering, i.e., finding the right words to describe the image classes. This process is time-consuming and requires expertise, because even small changes in wording can significantly affect performance. Prompt learning sidesteps this by replacing the hand-crafted words with learnable continuous context vectors that are optimized from data while the vision-language model itself stays frozen.
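The idea of optimizing a shared context while the model stays frozen can be sketched in a few lines. This is a deliberately simplified stand-in for the cited method: the "encoders" are random toy embeddings, the context is added to (rather than prepended before) each class-name embedding, and the gradient is taken by finite differences instead of backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Frozen toy embeddings (stand-ins for a real encoder's outputs).
class_embeddings = rng.normal(size=(2, 8))        # two class-name embeddings
images = rng.normal(size=(6, 8))                  # six labeled images
labels = np.array([0, 1, 0, 1, 0, 1])
# Make each image weakly correlated with its class embedding.
images = normalize(images + 0.5 * class_embeddings[labels])

def loss(context):
    # The learnable context modifies every class prompt in the same way
    # (here by addition before normalization), mimicking a soft prompt.
    text = normalize(context + class_embeddings)
    logits = images @ text.T * 10.0               # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Optimize only the context vector; everything else stays frozen.
context = np.zeros(8)
for _ in range(300):
    grad = np.array([
        (loss(context + 1e-5 * np.eye(8)[i])
         - loss(context - 1e-5 * np.eye(8)[i])) / 2e-5
        for i in range(8)
    ])
    context -= 0.01 * grad

print(loss(np.zeros(8)), loss(context))  # loss should decrease after training
```

The key property this sketch preserves is that no model weights change: only the small context vector is updated, which is what makes prompt learning cheap compared to fine-tuning the whole model.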
Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337–2348. https://doi.org/10.1007/s11263-022-01653-1 ↩