Abstract
Image captioning has advanced considerably with deep learning, notably through pipelines that pair Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) networks to generate descriptive captions for images. Even so, capturing fine-grained detail and context remains a challenge. This project introduces an enhanced image captioning model that adds a transformer-based attention mechanism to address these limitations. The model uses a CNN for feature extraction and an LSTM for sequence generation, while the transformer applies attention to the most salient image regions, with the aim of generating more contextually rich and coherent captions. Experimental results indicate that adding transformer attention significantly improves caption accuracy and descriptiveness over traditional CNN-LSTM models. Potential applications include assistive technologies for the visually impaired, content-based image retrieval, automatic image annotation for digital asset management, and human-computer interaction. The approach represents a step toward more precise and detailed image captioning, with potential impact across these fields.
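
To make the idea of attending to image regions concrete, the following is a minimal sketch of scaled dot-product attention, the core operation in transformer attention, written in plain Python. It is illustrative only and is not the project's implementation: the function names `softmax` and `attend`, the 2-dimensional features, and all numeric values are assumptions chosen for the example. In a real model the region features would come from the CNN and the query from the decoder state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, region_features):
    """Scaled dot-product attention of one query over image regions.

    query: list of d floats (hypothetical decoder state)
    region_features: one list of d floats per image region
    Returns (weights, context), where context is the weighted
    average of the region features.
    """
    d = len(query)
    # Similarity of the query to each region, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, region)) / math.sqrt(d)
        for region in region_features
    ]
    weights = softmax(scores)
    # Weighted sum of region features, one value per feature dimension.
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(d)
    ]
    return weights, context

# Toy input: three regions with made-up 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, regions)
```

The weights sum to one, and regions whose features align with the query receive more weight, so the context vector passed to the caption decoder emphasizes those regions.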