
Diabetic Retinopathy Classification with Deep Learning: Swin Transformer, ViT, and YOLOv11m

Writer: Aakash Walavalkar

What is Diabetic Retinopathy and why should we care?

Diabetic retinopathy (DR) is one of the leading preventable causes of blindness today. After prolonged years of elevated blood glucose levels, DR damages the retina, the light-sensitive tissue at the back of the eye. Early detection is essential: in its early stages DR produces no signs visible to the naked eye, so fundus imaging is required.


So, how can we make an early diagnosis? Traditionally, ophthalmologists have examined high-resolution fundus photos to identify DR markers such as cotton wool spots, hemorrhages, and microaneurysms. Manually grading thousands of those images is laborious, challenging, and prone to human error: after staring at such images for hours on end, a reader can easily miss subtle symptoms.


[Figure: diabetic retinopathy class 2 classification, Swin Transformer]

How can Deep Learning help detect Diabetic Retinopathy?

Owing to the progress made in computer vision, deep learning models can be trained to automatically screen fundus images and diagnose DR with high precision. By training on large labeled datasets, the models learn to recognize the patterns and features associated with the various stages of the disease.


This not only speeds up the diagnosis but also provides consistent, scalable, and impartial assessments - something even seasoned experts can get wrong under pressure.


Our goal in this project was to compare the performance of three state-of-the-art deep learning models:

• Swin Transformer

• Vision Transformer (ViT)

• YOLOv11m


Each of the three models brings a particular strength to the task, and we compared them through rigorous training and testing on the same dataset.


Understanding the DR Severity Classes (Class 0 to Class 4)

In this project we worked with the Kaggle diabetic retinopathy dataset, which classifies fundus images into 5 classes:

  1. Class 0: No DR — normal, healthy eye

  2. Class 1: Mild — first signs of DR begin to develop

  3. Class 2: Moderate — more visible hemorrhages and changes

  4. Class 3: Severe — widespread damage, high risk of vision loss

  5. Class 4: Proliferative DR — new blood vessels form, serious risk of blindness


The practical value of a deep learning model depends on its ability to predict these stages correctly.


Will This Really Help Ophthalmologists?

This system is designed to aid ophthalmologists, not replace them. Here is how it helps:

  1. Rapid Screening: Helps triage large volumes of fundus images quickly

  2. Decision Support: Flags severe cases for priority review

  3. Rural Access: Enables screening in areas where access to eye specialists is limited

  4. Consistency: Reduces variability in diagnosis


Imagine an ophthalmologist who needs to work through 500 cases in a day — with an AI partner, they need to deal with only the 10% most critical cases.


How We Preprocessed the Data for Model Training

High-quality data is crucial for training high-performing models. Here is how we prepared ours:

  1. Normalization: Images were normalized to ImageNet standard mean and std

  2. Data Augmentation: Horizontal flips, Gaussian noise, and elastic deformation were applied to simulate real-world variation

  3. Image Cleaning: Corrupt and illegible images were removed

  4. Organized Directory Structure: Images were class-wise arranged for effortless training


These steps gave our models balanced, realistic inputs to learn from effectively.


The Models We Trained (And How)


Swin Transformer

The Swin Transformer adapts the Vision Transformer (ViT) architecture for hierarchical feature extraction. It splits images into non-overlapping patches and processes them through shifted windows — capturing both local and global features.

  1. Optimizer: AdamW

  2. Epochs: 30

  3. Loss Function: CrossEntropyLoss

  4. Scheduler: Cosine Annealing

  5. Training Accuracy: ~87%

  6. Validation Accuracy: ~72.78%


Confusion Matrix:

[Swin Transformer confusion matrix]

Vision Transformer (ViT)

ViT processes an image as a sequence of patches and uses transformer blocks (without CNNs). It performed well, especially in detecting early DR due to its global attention mechanism.

  1. Optimizer: AdamW

  2. Epochs: 30

  3. Class Weights: Yes (for imbalance)

  4. Validation Accuracy: ~70.42%


Confusion Matrix:

[ViT confusion matrix]

YOLOv11m (Adapted for Classification)

Although YOLO is mainly used for object detection, we adapted YOLOv11m for classification tasks by modifying the head layers.

  1. Challenges: Poor sensitivity in early stages

  2. Best Detection: Class 4 (advanced DR)

  3. Validation Accuracy: ~61.7%
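
The head-swap idea itself is easy to illustrate: keep a convolutional backbone and replace the detection head with a global-pool-plus-linear classification head over the 5 DR classes. The tiny backbone below is a stand-in so the sketch runs end to end, not the real YOLOv11m architecture:

```python
import torch
import torch.nn as nn

class ClassifierFromBackbone(nn.Module):
    """Wrap a feature-extracting backbone with a classification head."""
    def __init__(self, backbone, feat_channels, num_classes=5):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # collapse spatial dims
            nn.Flatten(),
            nn.Linear(feat_channels, num_classes),
        )

    def forward(self, x):
        return self.head(self.backbone(x))

# Tiny stand-in backbone (NOT YOLOv11m) so the example is self-contained.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
model = ClassifierFromBackbone(backbone, feat_channels=32)
```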


Confusion Matrix:

[YOLOv11m confusion matrix]

Sensitivity and Specificity Comparison

Sensitivity (Recall) Across Classes

  1. Swin Transformer had excellent recall for Class 3 and 4

  2. ViT was consistent across all classes

  3. YOLOv11m failed in Class 1 (sensitivity = 0)

[Per-class sensitivity comparison chart]

Specificity Across Classes

  1. ViT demonstrated the highest specificity across early stages

  2. Swin showed high specificity in severe cases
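
Both metrics come straight from each model's confusion matrix. A small helper (rows are true classes, columns are predictions; the 3-class matrix is a toy example, not one of our results) shows the arithmetic:

```python
import numpy as np

def sensitivity_specificity(cm):
    """Per-class sensitivity (recall) and specificity from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                      # true positives per class
    fn = cm.sum(axis=1) - tp              # missed cases of that class
    fp = cm.sum(axis=0) - tp              # other classes predicted as it
    tn = cm.sum() - tp - fn - fp          # everything else
    return tp / (tp + fn), tn / (tn + fp)

# Toy 3-class example
cm = [[50, 5, 0],
      [10, 30, 5],
      [0, 5, 45]]
sens, spec = sensitivity_specificity(cm)
```

A sensitivity of 0 for a class (as YOLOv11m showed on Class 1) means its row of the confusion matrix has no diagonal hits at all.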

[Per-class specificity comparison chart]

This study shows that transformer-based models like Swin and ViT are very effective in medical image classification. Among these, Swin Transformer performed best overall — especially in detecting severe and proliferative DR, which are most clinically significant.


YOLOv11m, however, struggled with early-stage detection and showed limited sensitivity for several classes. It can still be useful where inference speed matters more than sensitivity.


Where to Find the Code?

The complete training, inference, and preprocessing code is available on GitHub:

Feel free to fork, experiment, and build on it.

