Active Learning via Vision-Language Model Adaptation with Open Data

1University of Macau
  2Shanghai AI Lab
  3Institute of Collaborative Innovation

Overview

In this work, we propose leveraging the VLM’s pretraining data by retrieving samples closely related to the downstream task and using them to augment the task-specific data for AL. As expected, incorporating this data into existing AL methods leads to significant performance improvements. Because our method exploits an open-source VLM and open data, we refer to it as Active Learning with Open Resources (ALOR).

We also propose a novel Tail First Sampling (TFS) strategy for AL, an embarrassingly simple yet effective method that prioritizes sampling data from underrepresented classes for labeling. Extensive experiments on standard benchmark datasets demonstrate that ALOR achieves state-of-the-art performance, significantly surpassing existing methods.

Our Contributions


Our work makes three major contributions:

  1. We study AL by embracing both a VLM and its pretraining data (instantiating the open data). In particular, we present retrieval-based data augmentation (RDA), which retrieves the VLM’s pretraining data relevant to the downstream task and greatly enhances existing AL methods.
  2. We observe that the retrieved data follows an imbalanced distribution, revealing how the VLM is biased and suggesting that the unlabeled task-specific data is similarly imbalanced. This insight motivates our simple yet novel Tail First Sampling (TFS) strategy, which prioritizes rare classes in data selection. Extensive experiments show that TFS outperforms prior AL methods.
  3. We rigorously compare different VLM adaptation approaches, including finetuning (FT), contrastive tuning (CT), linear probing (LP), and prompt tuning (PT). We show that CT significantly outperforms the others, even when retrieved data is not used. Our final method (ALOR) combines CT, RDA, and TFS, achieving state-of-the-art performance on five benchmarks.

Active Learning with Open Resources (ALOR)


Active Learning with Open Resources (ALOR) exploits an open-source VLM and open data (especially the VLM’s pretraining data), unlike recent AL methods that use only the former. Specifically, given the task-specific class names, ALOR retrieves relevant pretraining data to augment the limited task-specific labeled data. The retrieved data not only reveals an imbalanced distribution but also unveils how the VLM is biased and how the task-specific unlabeled data is (similarly) imbalanced. Leveraging these insights, ALOR adopts Tail First Sampling (TFS), which prioritizes sampling unlabeled examples from underrepresented classes for labeling.
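The exact retrieval pipeline is detailed in the paper; purely as a rough illustration, below is a minimal Python sketch of retrieval-based data augmentation (RDA), assuming the VLM image embeddings of the open pretraining set and the text embeddings of the task’s class-name prompts are precomputed and L2-normalized. The function name retrieve_open_data and the parameter k_per_class are hypothetical.

import numpy as np

def retrieve_open_data(class_text_embs, pretrain_img_embs, k_per_class=50):
    # class_text_embs:   (C, D) normalized text embeddings of class-name prompts
    # pretrain_img_embs: (N, D) normalized image embeddings of the open pretraining set
    # Returns a dict mapping class index -> indices of the k most similar pretraining
    # images, which are then treated as additional labeled data for that class.
    sims = class_text_embs @ pretrain_img_embs.T          # (C, N) cosine similarities
    return {c: np.argsort(-sims[c])[:k_per_class]         # most similar first
            for c in range(class_text_embs.shape[0])}

# Toy usage with random stand-ins for real VLM embeddings.
rng = np.random.default_rng(0)
txt = rng.normal(size=(10, 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
img = rng.normal(size=(1000, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
augmented = retrieve_open_data(txt, img, k_per_class=20)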

Tail First Sampling (TFS)


Tail First Sampling (TFS) uses the class distribution of the retrieved data to sample unlabeled data for the most under-represented class. Specifically, it first identifies this class over the current set of labeled data (including retrieved examples). It then finds the unlabeled examples classified as this class and, from this set, selects the example with the most uncertain prediction, e.g., the one with the largest entropy.
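As a rough illustration, here is a minimal Python sketch of one TFS query step, assuming per-class counts over the labeled pool and softmax probabilities for the unlabeled pool are available; the fallback to the full pool when no unlabeled example is predicted as the tail class is an assumption, not a detail from the paper.

import numpy as np

def tail_first_sample(labeled_counts, unlabeled_probs):
    # labeled_counts:  (C,)   per-class counts over labeled + retrieved data
    # unlabeled_probs: (M, C) predicted class probabilities for unlabeled examples
    tail_class = int(np.argmin(labeled_counts))           # most under-represented class
    preds = unlabeled_probs.argmax(axis=1)
    candidates = np.flatnonzero(preds == tail_class)      # unlabeled data predicted as that class
    if candidates.size == 0:                              # assumed fallback: use the whole pool
        candidates = np.arange(unlabeled_probs.shape[0])
    p = np.clip(unlabeled_probs[candidates], 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)                # predictive entropy (uncertainty)
    return int(candidates[np.argmax(entropy)])            # most uncertain candidate to label next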

Results

RDA significantly boosts active learning methods


For each method in each round, we report the average accuracy and standard deviation over three random runs across five datasets. In Round-0, methods with RDA achieve 1.7× higher accuracy than without! In the last round (Round-6), RDA helps each method obtain >8% accuracy gains.

TFS achieves state-of-the-art AL performance


Results show that our TFS outperforms existing AL methods when using PT for VLM adaptation. Moreover, our final ALOR method, “TFS w/ CT” (with RDA), significantly boosts performance, achieving gains of ∼7 points (in both accuracy and macro F1, averaged over the five datasets) over “TFS w/ PT” and existing AL methods.

CT consistently outperforms other adaptation approaches


We run each combination of adaptation approach and AL method (without RDA) on the challenging Semi-Aves dataset for three random runs, and report the mean accuracy after Round-6. CT consistently outperforms the other adaptation approaches regardless of which AL method is used.
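This page does not spell out how contrastive tuning is implemented; purely as context, below is a minimal sketch of a CLIP-style contrastive objective for adapting a VLM on labeled data, where each image is pulled toward the text embedding of its class-name prompt. The function name, temperature value, and the choice of which encoder parameters are updated are assumptions, not details from the paper.

import torch
import torch.nn.functional as F

def contrastive_tuning_loss(image_feats, text_feats, labels, temperature=0.07):
    # image_feats: (B, D) image embeddings of a labeled batch
    # text_feats:  (C, D) text embeddings of the C class-name prompts
    # labels:      (B,)   long tensor of class indices for the batch
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # (B, C) similarity logits
    return F.cross_entropy(logits, labels)                # align each image with its class text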

Visualization of per-round (x-axis) per-class (y-axis) accuracies on the Semi-Aves benchmark


For each method, we sort its per-class accuracies in Round-0 and track the accuracies over time. Results of different methods with and without RDA are shown in the top and bottom rows, respectively. When adopting RDA, all AL methods struggle to learn the under-represented classes, which have limited retrieved data compared to common classes. Notably, our TFS quickly improves on these under-represented classes because it prioritizes sampling data for them (see the quick improvements highlighted by the black circle). Without RDA, no method shows a notable pattern in accuracy change, although accuracies improve from round to round. This highlights a benefit of RDA: it indicates how to sample data to improve under-represented classes and eventually enhance overall accuracy.

BibTeX

If you find our work useful, please consider citing our paper:


@misc{wang2025activelearningvisionlanguagemodel,
      title={Active Learning via Vision-Language Model Adaptation with Open Data}, 
      author={Tong Wang and Jiaqi Wang and Shu Kong},
      year={2025},
      eprint={2506.01724},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.01724}, 
}