
Cream of the Crop

Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

Mengyao Lyu1,2, Yan Li3, Huasong Zhong3, Wenhao Yang3, Hui Chen1,2, Jungong Han1,2, Guiguang Ding1,2†, Zhenheng Yang3
1Tsinghua University, 2BNRist, 3Bytedance
†Corresponding author

Abstract

The rapid yet inefficient expansion of multi-modal data, combined with the sheer token volume and increased heterogeneity of sources, amplifies both the significance and complexity of multi-modal data selection at scale.
We redefine the granularity of data valuation by decomposing quality into 14 VL capabilities and formulating diversity into superficial interaction styles, such that multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms.
mmSSR is the first to scale to the 2.6M open data pool of LLaVA-OVSI, achieving 99.1% of full performance with only 30% of the data. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements with varying budget constraints, general or specific capability customization and acquisition, and training-free generalization to new domains for curation.

Our Method

Framework image.

(L) We prompt GPT-4o to rate rich capabilities on a scale from 0 to 5, while simultaneously identifying the user-model interaction style.
(R-top) The small set of derived sample-score-style triplets is used to instruct the pretrained task model into rich capability scorers and a styler, i.e., our mmSSR. This enables the analysis and sampling of candidate data points at the scale of millions, ensuring a subset that is both high-quality and diverse while keeping time and resource expenditure minimal.
(R-down) The fine-grained mmSSR also generalizes directly to other data domains and supports efficient scaling in data quantity and capabilities.
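The selection stage sketched above can be illustrated with a minimal style-stratified, score-ranked sampler. This is an illustrative sketch only: the sample fields (`scores`, `style`), the mean-score quality proxy, and the proportional style quotas are our assumptions, not the exact mmSSR sampling strategy.

```python
from collections import defaultdict

def select_subset(samples, budget):
    """Style-stratified, score-ranked selection (illustrative sketch).

    Each sample is assumed to carry:
      - "scores": capability ratings in [0, 5] from the mmSSR scorers
      - "style":  interaction-style label from the mmSSR styler
    Quality is approximated by the mean capability score; the actual
    mmSSR sampling strategy may weight capabilities and styles differently.
    """
    by_style = defaultdict(list)
    for sample in samples:
        by_style[sample["style"]].append(sample)

    selected = []
    for style, pool in by_style.items():
        # Give each style a budget share proportional to its pool share,
        # then keep the highest-quality samples within that style.
        quota = max(1, round(budget * len(pool) / len(samples)))
        pool.sort(key=lambda s: sum(s["scores"].values()) / len(s["scores"]),
                  reverse=True)
        selected.extend(pool[:quota])
    return selected[:budget]
```

Stratifying by style before ranking by score is what keeps the subset diverse in form while still high-scoring, matching the "high-scoring information in diversified forms" goal.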

Experiments

Baseline Comparison

>Rand counts the benchmarks on which a method surpasses Random sampling; /FULL is the average performance relative to full-data training (%).

5% Budget

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 73.74 | 47.98 | 43.70 | 42.34 | 50.61 | 58.87 | 2004.50 | 73.07 | 81.52 | 45.47 | - | 89.29 |
| PPL-mid | 67.34 | 45.27 | 38.98 | 30.18 | 45.27 | 54.33 | 1887.71 | 66.74 | 74.76 | 31.40 | 0/10 | 78.31 |
| PPL-si | 71.98 | 44.67 | 38.48 | 35.14 | 54.10 | 57.98 | 1856.79 | 67.84 | 78.24 | 36.50 | 1/10 | 83.10 |
| Deita | 72.91 | 47.47 | 41.28 | 40.23 | 52.59 | 56.57 | 1956.50 | 70.76 | 79.57 | 36.10 | 1/10 | 85.79 |
| CLIP | 74.23 | 47.27 | 40.08 | 35.73 | 52.96 | 56.73 | 1902.65 | 73.61 | 78.63 | 39.80 | 3/10 | 85.41 |
| E5-V | 70.90 | 43.00 | 38.78 | 38.44 | 49.94 | 54.65 | 1810.47 | 66.58 | 77.54 | 37.40 | 0/10 | 81.87 |
| COINCIDE | 72.76 | 48.33 | 43.17 | 45.60 | 49.43 | 57.50 | 1852.66 | 73.15 | 79.62 | 45.40 | 3/10 | 88.47 |
| mmSSR | 77.79 | 53.33 | 43.27 | 43.53 | 51.83 | 59.16 | 1938.68 | 77.66 | 88.45 | 52.00 | 8/10 | 93.20 |

10% Budget

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 74.57 | 51.57 | 44.72 | 42.91 | 52.59 | 58.99 | 2033.28 | 74.42 | 84.33 | 47.80 | - | 91.70 |
| PPL-mid | 63.54 | 46.87 | 39.08 | 36.93 | 45.90 | 54.30 | 1831.03 | 67.23 | 73.87 | 39.50 | 0/10 | 80.72 |
| PPL-si | 74.69 | 49.80 | 41.28 | 40.60 | 53.09 | 57.95 | 1841.11 | 75.16 | 80.71 | 40.40 | 3/10 | 87.63 |
| Deita | 75.39 | 48.80 | 43.77 | 42.25 | 54.48 | 57.40 | 1996.34 | 71.60 | 78.33 | 40.80 | 2/10 | 88.72 |
| CLIP | 75.23 | 49.87 | 40.38 | 37.16 | 53.59 | 59.35 | 1921.04 | 76.62 | 80.07 | 41.00 | 4/10 | 87.69 |
| E5-V | 70.51 | 45.13 | 38.78 | 39.59 | 50.57 | 55.10 | 1787.94 | 68.94 | 77.54 | 37.20 | 0/10 | 82.76 |
| COINCIDE | 75.23 | 49.73 | 44.77 | 42.52 | 50.69 | 58.71 | 2027.58 | 74.77 | 82.05 | 47.00 | 3/10 | 90.66 |
| mmSSR | 77.32 | 53.27 | 45.06 | 42.98 | 54.10 | 59.61 | 2045.00 | 78.76 | 89.94 | 52.40 | 10/10 | 94.75 |

30% Budget

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 78.25 | 54.60 | 44.40 | 46.10 | 55.23 | 59.61 | 2092.60 | 78.28 | 88.32 | 52.57 | - | 95.82 |
| PPL-mid | 73.99 | 54.93 | 43.97 | 41.01 | 53.09 | 58.78 | 2036.54 | 77.20 | 87.01 | 56.40 | 2/10 | 93.77 |
| PPL-si | 72.52 | 48.33 | 42.57 | 43.62 | 51.83 | 55.07 | 1976.46 | 76.55 | 78.48 | 42.20 | 0/10 | 88.22 |
| Deita | 76.93 | 54.13 | 43.67 | 44.04 | 55.11 | 59.66 | 2042.63 | 79.50 | 83.54 | 50.30 | 2/10 | 94.05 |
| CLIP | 74.30 | 53.80 | 43.07 | 45.87 | 51.95 | 59.16 | 2039.14 | 80.02 | 83.99 | 48.80 | 1/10 | 93.07 |
| E5-V | 74.30 | 46.07 | 43.27 | 47.80 | 50.32 | 57.85 | 1955.13 | 74.45 | 81.61 | 43.70 | 1/10 | 89.52 |
| COINCIDE | 78.02 | 55.47 | 45.66 | 46.24 | 52.84 | 59.80 | 2047.37 | 79.73 | 84.33 | 55.10 | 6/10 | 95.82 |
| mmSSR | 79.57 | 57.53 | 44.87 | 48.49 | 56.24 | 59.83 | 2132.93 | 81.25 | 92.46 | 57.40 | 10/10 | 99.11 |

FULL

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-OVSI | 80.57 | 59.40 | 45.16 | 47.16 | 56.87 | 60.73 | 2117.56 | 81.87 | 92.76 | 59.60 | - | 100 |
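The two summary columns of the table can be reproduced from the per-benchmark scores. A minimal sketch, assuming >Rand counts benchmarks where a method beats Random sampling and /FULL averages each benchmark's score relative to full-data training:

```python
def rand_full_metrics(method, random_ref, full_ref):
    """Summarize per-benchmark scores against the Random and FULL rows.

    Assumed column definitions (our reading of the table, not stated
    explicitly on the page):
      - >Rand: number of benchmarks where the method beats Random sampling
      - /FULL: mean per-benchmark score relative to full-data training (%)
    """
    beats_rand = sum(m > r for m, r in zip(method, random_ref))
    rel_full = 100.0 * sum(m / f for m, f in zip(method, full_ref)) / len(method)
    return beats_rand, rel_full
```

Under this reading, mmSSR at the 30% budget reaches a /FULL of 99.11, matching the "99.1% of full performance" claim in the abstract.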

Why our mmSSR for Scoring and Styling?


Why rich capabilities over a straightforward quality metric? The comparison between mmSSP(oor) and mmSSR(ich) (proposed) demonstrates that the latter is more effective at capturing the richness and diversity of multi-modal data, which is crucial for selection.

Can we directly adopt the task model for rich scoring and styling? Not recommended. The comparison between mmSSR + LLaVA-OVSI (finetuned) and mmSSR(ich) (proposed) shows that the open-source model is not as effective as proprietary models (i.e., the GPT-4o we used) at following the scoring and styling instructions.

Can we use open multi-modal models for rich scoring and styling? Acceptable when the time/computation/API budget is very constrained. Note that this approach still lags noticeably behind the specialized models we developed from a state-of-the-art proprietary model, as indicated by mmSSR + Qwen2-VL vs mmSSR(ich).


5% Budget

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 73.74 | 47.98 | 43.70 | 42.34 | 50.61 | 58.87 | 2004.50 | 73.07 | 81.52 | 45.47 | - | 89.29 |
| mmSSP(oor) | 75.85 | 51.27 | 42.97 | 44.27 | 51.95 | 58.14 | 1940.27 | 73.61 | 81.46 | 45.00 | 5/10 | 90.14 |
| mmSSR + LLaVA-OVSI | 77.40 | 50.60 | 44.77 | 41.10 | 54.35 | 58.62 | 1952.97 | 75.81 | 87.75 | 40.40 | 6/10 | 90.68 |
| mmSSR + Qwen2-VL | 75.08 | 51.00 | 45.16 | 42.57 | 52.71 | 57.37 | 1955.78 | 74.74 | 84.88 | 48.90 | 8/10 | 91.37 |
| mmSSR(ich) | 77.79 | 53.33 | 43.27 | 43.53 | 51.83 | 59.16 | 1938.68 | 77.66 | 88.45 | 52.00 | 8/10 | 93.20 |

10% Budget

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 74.57 | 51.57 | 44.72 | 42.91 | 52.59 | 58.99 | 2033.28 | 74.42 | 84.33 | 47.80 | - | 91.70 |
| mmSSP(oor) | 77.24 | 50.40 | 44.27 | 42.52 | 53.47 | 59.48 | 2084.39 | 76.07 | 81.36 | 46.10 | 5/10 | 91.73 |
| mmSSR + LLaVA-OVSI | 77.79 | 54.40 | 44.67 | 42.02 | 54.98 | 58.23 | 2013.74 | 78.85 | 89.59 | 42.00 | 5/10 | 92.72 |
| mmSSR + Qwen2-VL | 76.24 | 53.33 | 44.87 | 45.60 | 55.11 | 59.16 | 2012.94 | 76.75 | 87.11 | 52.70 | 9/10 | 94.59 |
| mmSSR(ich) | 77.32 | 53.27 | 45.06 | 42.98 | 54.10 | 59.61 | 2045.00 | 78.76 | 89.94 | 52.40 | 10/10 | 94.75 |

30% Budget

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 78.25 | 54.60 | 44.40 | 46.10 | 55.23 | 59.61 | 2092.60 | 78.28 | 88.32 | 52.57 | - | 95.82 |
| mmSSP(oor) | 77.86 | 53.13 | 45.76 | 48.03 | 54.85 | 58.78 | 2050.69 | 78.92 | 86.91 | 55.80 | 4/10 | 96.31 |
| mmSSR + LLaVA-OVSI | 77.55 | 54.53 | 43.37 | 44.72 | 55.23 | 58.59 | 1980.48 | 81.02 | 91.87 | 49.60 | 2/10 | 94.73 |
| mmSSR + Qwen2-VL | 78.02 | 57.13 | 43.07 | 47.39 | 55.49 | 60.89 | 2096.60 | 81.64 | 90.28 | 57.40 | 8/10 | 97.91 |
| mmSSR(ich) | 79.57 | 57.53 | 44.87 | 48.49 | 56.24 | 59.83 | 2132.93 | 81.25 | 92.46 | 57.40 | 10/10 | 99.11 |

FULL

| Method | MMBench-en-v1.1 | MMStar | MMMU | MMVet | BLINK | MMT-Bench | MME | AI2D | ScienceQA | MathVista-MINI | >Rand | /FULL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-OVSI | 80.57 | 59.40 | 45.16 | 47.16 | 56.87 | 60.73 | 2117.56 | 81.87 | 92.76 | 59.60 | - | 100 |

Scalability in Data Budget

mmSSR in Cold, Warm and Hot Settings

Data selection methods, from CNNs to LLMs, are often sensitive to experimental settings such as benchmarks, target dataset, data volume and model architecture.

Following the main experiments, we additionally validate mmSSR under varying data budgets: colder (1%) and hotter (40%, 50%) scenarios, achieving consistently superior performance as the data budget scales.

Scalability in Capability

mmSSR for Specialized Capability Acquisition

We consider a data expansion scenario commonly encountered in real-world applications, scaling up the capability dimension within the existing data pool.

The OCR-favored samples newly acquired by our pipeline lead to steady improvements on specialized benchmarks, while also improving general benchmarks or sustaining their advantageous positions.
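Capability-specific acquisition of this kind reduces to filtering a new pool by one capability's rating. A hypothetical sketch: `scorer` stands in for the mmSSR scorers, and the `"OCR"` key, threshold, and budget are illustrative assumptions rather than the paper's actual interface.

```python
def acquire_ocr_samples(new_pool, scorer, threshold=4.0, budget=10000):
    """Filter a new pool for OCR-favored samples (hypothetical sketch).

    `scorer(sample)` is assumed to return capability ratings as a dict;
    samples whose "OCR" rating clears `threshold` are kept, best-first,
    up to `budget`. Names and defaults are illustrative.
    """
    rated = [(scorer(sample)["OCR"], sample) for sample in new_pool]
    kept = [pair for pair in rated if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in kept[:budget]]
```

Because the scorers emit per-capability ratings rather than a single quality number, the same filter applies unchanged to any other capability dimension.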

Single concept removal (transfer) image.


Model Transfer to Larger Data Pool


You might want to expand your data pool by adding new subdomains or sources to an existing dataset. We simulate this scenario by first training mmSSR on a subset of LLaVA-665K (12 sources) and then generalizing it directly to LLaVA-OVSI (90+ sources) for both inference and sampling. The models exhibit robust generalization to the larger data pool with open sources and novel knowledge.

Data Transfer to Different Model Arch


We also expect the selected subset to be generally applicable, instead of being dependent on specific architecture or data pool.

To verify the effectiveness of the subset selected by mmSSR models finetuned from LLaVA-OVSI-7B, we use it to train a 0.5B model. Results show that its superiority remains, demonstrating strong robustness.

The Reasons Behind the Strong Generalizability of mmSSR

  • Scores and styles are more generalizable than model responses: While model-based methods rely on their specific model responses (e.g., perplexity and embeddings) for data valuation, our mmSSR is instructed to score and identify instructional styles characterized by general semantics.
  • Rich scores and styles are more generalizable than coarse-grained quality-like descriptors: For pretrained MLLMs to be finetuned, while the understanding of quality might shift, the intrinsic knowledge of fundamental capabilities and styles is more readily shared and transferable.
Thus, the finetuned mmSSR and the selected subsets consistently guarantee strong and robust performance.

Analysis

Source distribution of LLaVA-OVSI vs 10% mmSSR selection



Source distribution of 10% mmSSR selection



Style distribution of 10% mmSSR selection



Score distribution of 10% (1-2) and 30% (3-4) mmSSR selection



BibTeX


      @article{lyu2025mmssr,
        title={Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning},
        author={Lyu, Mengyao and Li, Yan and Zhong, Huasong and Yang, Wenhao and Chen, Hui and Han, Jungong and Ding, Guiguang and Yang, Zhenheng},
        journal={arXiv preprint arXiv:2503.13383},
        year={2025}
      }