Abstract:
Self-attention-based Transformer models, including the Vision Transformer (ViT), have shown strong feature extraction and pattern representation ability in both natural language processing (NLP) and computer vision. Because SAR image features differ markedly from those of natural object images, a ViT-based method is proposed for SAR image target classification in order to explore the feasibility and effectiveness of self-attention models in intelligent SAR image processing. The ViT architecture used in this paper closely follows the Transformer originally developed for NLP, and offers simple configuration, good scalability and out-of-the-box deployment. The model consists of five main components: image splitting, patch embedding, position embedding, a sequence of self-attention modules, and multilayer perceptron (MLP) classification. The public MSTAR dataset is used for the experiments, and its training samples are augmented. The ViT model is trained on the augmented dataset by minimizing the training loss while monitoring classification accuracy on the validation dataset to ensure that the network converges. The trained model is then used to classify the SAR images in the testing dataset. The experimental results show that the ViT model achieves high accuracy and good generalization in SAR image classification, and that self-attention methods can play an important role in automatic SAR image processing.
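
As an illustration of how the five components named above fit together, the following is a minimal pure-Python sketch of a single forward pass. It is not the network used in this paper: the function names, the patch size, the embedding dimension, the number of classes, the single attention head and the random (untrained) weights are all hypothetical choices made only to keep the example short, and the MLP head is reduced to one linear layer.

```python
import math
import random

def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def matvec(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q = [matvec(wq, t) for t in tokens]
    k = [matvec(wk, t) for t in tokens]
    v = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def vit_forward(image, patch=4, dim=8, num_classes=3, seed=0):
    """Toy forward pass; all hyperparameters here are illustrative assumptions."""
    rng = random.Random(seed)
    # 1. Image splitting: cut the image into non-overlapping patches.
    patches = split_into_patches(image, patch)
    # 2. Patch embedding: project each flattened patch to a dim-length token.
    w_embed = rand_matrix(dim, patch * patch, rng)
    tokens = [matvec(w_embed, p) for p in patches]
    # Prepend a class token whose output is used for classification.
    tokens = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]] + tokens
    # 3. Position embedding: add one vector per sequence position.
    pos = rand_matrix(len(tokens), dim, rng)
    tokens = [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]
    # 4. Self-attention (one layer here) with a residual connection.
    wq, wk, wv = (rand_matrix(dim, dim, rng) for _ in range(3))
    attended = self_attention(tokens, wq, wk, wv)
    tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    # 5. Classification head applied to the class token (one linear layer).
    w_head = rand_matrix(num_classes, dim, rng)
    return softmax(matvec(w_head, tokens[0]))
```

Calling `vit_forward` on an 8 x 8 list-of-lists image with the default `patch=4` produces four patch tokens plus one class token and returns one probability per class. A practical model would stack several multi-head attention layers with layer normalization and learn all weights during training.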