The Vision Transformer (ViT) is a deep learning model that applies the transformer architecture, originally designed for NLP, directly to sequences of image patches. When pre-trained on large datasets, it achieves results competitive with state-of-the-art convolutional networks on image recognition tasks.
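To make "sequences of image patches" concrete, here is a minimal NumPy sketch of the patching step only. The function name `image_to_patch_sequence`, the 224×224 input size, and the 16-pixel patch size are illustrative assumptions, not details taken from the description above. A full model would typically go on to project each flattened patch linearly and add position information before the transformer layers.

```python
import numpy as np


def image_to_patch_sequence(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping square patches and
    flatten each one, returning an array of shape (num_patches, patch_dim).

    Hypothetical helper for illustration; not part of any library.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    rows, cols = height // patch_size, width // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Assumed example sizes: a 224x224 RGB image cut into 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patch_sequence(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` plays the role that a word embedding plays in an NLP transformer: the image becomes a sequence of 196 vectors, read left to right and top to bottom.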