International Journal of Innovative Research in Engineering and Management
Year: 2026, Volume: 13, Issue: 2
First page: (135) Last page: (142)
Online ISSN: 2350-0557
Mukthikka V
DOI: 10.55524/ijirem.2026.13.2.18
DOI URL: https://doi.org/10.55524/ijirem.2026.13.2.18
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)
Mukthikka V, Piyush, Gurpreet Singh
Multimodal large language models (MLLMs) are currently dominated by visual instruction tuning (VIT), in which frozen vision and language backbones are bridged by lightweight trainable modules. We curated 30 recent papers (2022–2025) in this direction, with 4,555 cumulative citations as of April 28, 2026, and analyzed their architecture and training trends. The strongest recurring pattern is a shift away from full-model tuning toward parameter-efficient adaptation, token compression, and data selection. Motivated by this, we propose TinyBridge-TriFuse, a frozen-backbone connector family that combines three components: linear pooled-feature alignment, a small MLP expert, and a token-aware bridge expert. We provide a full IEEE-style formulation, a data-efficient training recipe, and measured experimental results. On frozen OpenCLIP ViT-B/32 features for CIFAR-10, TinyBridge-TriFuse reaches 0.9390 test accuracy, exceeding LinearAlign (0.9350) by +0.0040 and zero-shot CLIP by +0.0728. We also report TinyBridge-DynaFuse, a tiny-gate variant (1,035 extra parameters) that improves calibration, reaching an expected calibration error (ECE) of 0.0326.
MSc Scholar, AI & Big Data, Woosong University, Daejeon, South Korea
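
The abstract reports calibration as ECE. For readers unfamiliar with the metric, the following is a minimal sketch of the standard equal-width binned estimator, not the authors' evaluation code. The function name `expected_calibration_error`, the default of 15 bins, and the half-open binning convention are assumptions made for illustration; the abstract does not state which binning the paper uses.

```python
def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    probs  : list of per-class probability lists, one per sample
    labels : list of integer class labels
    n_bins : number of equal-width confidence bins (assumed value)
    """
    n = len(labels)
    conf_sum = [0.0] * n_bins     # sum of confidences in each bin
    correct_sum = [0.0] * n_bins  # number of correct predictions in each bin
    count = [0] * n_bins          # number of samples in each bin

    for p, y in zip(probs, labels):
        conf = max(p)             # confidence = top predicted probability
        pred = p.index(conf)      # predicted class = first argmax
        # Half-open bins [lo, hi); a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        correct_sum[idx] += 1.0 if pred == y else 0.0
        count[idx] += 1

    ece = 0.0
    for c_sum, k_sum, m in zip(conf_sum, correct_sum, count):
        if m == 0:
            continue              # empty bins contribute nothing
        accuracy = k_sum / m
        mean_conf = c_sum / m
        ece += (m / n) * abs(accuracy - mean_conf)
    return ece
```

Under this definition a lower value is better: 0 means that, within every bin, the average confidence matches the observed accuracy. The estimate depends on the number of bins, so values are only comparable when computed with the same binning.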
