In this work, we propose leveraging VLM’s pretraining data by retrieving samples closely related to the downstream task,
using them to augment the task-specific data for AL.
As expected, incorporating this data into existing AL methods leads to significant performance improvements.
Given that our method exploits open-source VLM and open data, we refer to it as Active Learning with Open Resources (ALOR).
Further, we propose a novel Tail First Sampling (TFS) strategy for AL, an embarrassingly simple yet effective method that prioritizes sampling data from underrepresented classes for labeling.
Extensive experiments on standard benchmark datasets demonstrate that our ALOR achieves state-of-the-art performance,
significantly surpassing existing methods.
We summarize our major contributions below:
Active Learning with Open Resources (ALOR) exploits an open-source VLM and open data (especially the VLM's pretraining data), unlike recent AL methods that use only the former. Specifically, given the task-specific class names, ALOR retrieves relevant pretraining data to augment the limited task-specific labeled data. The retrieved data not only reveals an imbalanced class distribution but also unveils how the VLM is biased and how the task-specific unlabeled data is (similarly) imbalanced. Leveraging these insights, ALOR adopts Tail First Sampling (TFS), which prioritizes sampling unlabeled data from underrepresented classes to label, as sketched below.
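A minimal sketch of the retrieval step, assuming retrieval is done by matching downstream class names against the captions of the VLM's pretraining data; the names `pretraining_index`, `retrieve_pretraining_data`, and `max_per_class` are hypothetical and only for illustration.

```python
# Hedged sketch: retrieve pretraining (image, caption) pairs relevant to the
# downstream task by caption matching on class names (an assumed criterion).
from collections import defaultdict

def retrieve_pretraining_data(class_names, pretraining_index, max_per_class=500):
    """Collect pretraining samples whose captions mention a downstream class name."""
    retrieved = defaultdict(list)  # class name -> list of retrieved samples
    for name in class_names:
        query = name.lower()
        for sample in pretraining_index:  # each sample: {"image": ..., "caption": ...}
            if query in sample["caption"].lower():
                retrieved[name].append(sample)
            if len(retrieved[name]) >= max_per_class:
                break
    return retrieved

# The per-class counts of `retrieved` expose the imbalanced distribution that
# Tail First Sampling later uses to decide which classes to prioritize.
```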
Tail First Sampling (TFS) utilizes the class distribution computed over the retrieved data to sample unlabeled data for the most under-represented class. Specifically, it first identifies this class over the current set of labeled data (including retrieved examples). Then, it finds the unlabeled examples classified as this class. From this set, it selects the example with the highest prediction uncertainty, e.g., the largest entropy. A sketch of this selection step follows.
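A minimal sketch of one TFS selection round, following the description above. The inputs `labeled_counts` and `unlabeled_probs` are hypothetical (the probabilities would come from the adapted VLM), and the `budget` and fallback behavior are assumptions.

```python
import numpy as np

def tail_first_sampling(labeled_counts, unlabeled_probs, budget=1):
    """One TFS round: pick the most uncertain unlabeled examples predicted as the
    most under-represented class.

    labeled_counts: dict mapping class index -> number of labeled (incl. retrieved) examples
    unlabeled_probs: (N, C) array of predicted class probabilities for unlabeled data
    """
    # 1) Identify the most under-represented class among the current labeled data.
    tail_class = min(labeled_counts, key=labeled_counts.get)

    # 2) Find unlabeled examples whose predicted class is the tail class.
    preds = unlabeled_probs.argmax(axis=1)
    candidates = np.where(preds == tail_class)[0]
    if candidates.size == 0:  # assumed fallback: consider all unlabeled data
        candidates = np.arange(unlabeled_probs.shape[0])

    # 3) Among candidates, select the example(s) with the largest predictive entropy.
    probs = unlabeled_probs[candidates]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    chosen = candidates[np.argsort(-entropy)[:budget]]
    return chosen.tolist()
```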
For each method in each round, we report the averaged accuracy and standard deviation over three random runs across five datasets. In Round-0, methods with RDA achieve 1.7× higher accuracy than those without it. In the last round (Round-6), RDA helps each method obtain >8% accuracy gains.
Results show that our TFS outperforms existing AL methods when using PT for VLM adaptation. Moreover, our final ALOR method, “TFS w/ CT” (with RDA), significantly boosts performance, achieving ∼7-point gains (in both accuracy and macro F1, averaged over the five datasets) over “TFS w/ PT” and existing AL methods.
We run each combination of adaptation approach and AL method (without RDA) on the challenging Semi-Aves dataset over three random runs, and report the mean accuracy after Round-6. CT consistently outperforms the other adaptation approaches regardless of which AL method is used.
For each method, we sort its per-class accuracies in Round-0 and track the accuracies over time. Results of different methods with and without RDA are shown in the top and bottom rows, respectively. When adopting RDA, all AL methods struggle to learn under-represented classes, which have limited retrieved data compared to common classes. However, it is worth noting that our TFS quickly improves on the under-represented classes because it prioritizes sampling data for them (see the rapid gains on under-represented classes marked by the black circle). Without RDA, no method shows a notable pattern in per-class accuracy change, although accuracies improve over rounds. This highlights the benefit of RDA: it indicates how to sample data so as to improve under-represented classes and eventually enhance overall accuracy.
If you find our work useful, please consider citing our paper:
@misc{wang2025activelearningvisionlanguagemodel,
  title={Active Learning via Vision-Language Model Adaptation with Open Data},
  author={Tong Wang and Jiaqi Wang and Shu Kong},
  year={2025},
  eprint={2506.01724},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.01724},
}