
Diabetic Retinopathy Classification with Deep Learning: Swin Transformer, ViT, and YOLOv11m

Writer: Aakash Walavalkar

What is Diabetic Retinopathy and why should we care?

Diabetic retinopathy (DR) is one of the leading preventable causes of blindness today. After prolonged years of elevated blood glucose levels, DR damages the retina, the light-sensitive tissue at the back of the eye. Early detection is essential: in its early stages DR produces no signs visible to the naked eye, so fundus imaging is required.


So, how can we make an early diagnosis? Traditionally, ophthalmologists have examined high-resolution fundus photos to identify DR markers such as cotton wool spots, hemorrhages, and microaneurysms. Manually grading thousands of those images is laborious, challenging, and prone to human error: after staring at such images for hours on end, a reader can easily miss subtle symptoms.


[Figure: diabetic retinopathy class 2 classification, Swin Transformer]

How can Deep Learning help detect Diabetic Retinopathy?

Owing to the progress made in computer vision, deep learning models can be trained to automatically screen fundus images and diagnose DR with high precision. By training on large labeled datasets, the models learn to recognize the patterns and features associated with the various stages of the disease.


This not only speeds up the diagnosis but also provides consistent, scalable, and impartial assessments - something even seasoned experts can get wrong under pressure.


Our goal in this project was to compare the performance of three state-of-the-art deep learning models:

• Swin Transformer

• Vision Transformer (ViT)

• YOLOv11m


Each of the three models brings a particular strength to the task, and we compared them through rigorous training and testing on the same dataset.


Understanding the DR Severity Classes (Class 0 to Class 4)

In this project we worked with the Kaggle diabetic retinopathy dataset, which classifies fundus images into 5 classes:

  1. Class 0: No DR — normal, healthy eye

  2. Class 1: Mild — first signs of DR begin to develop

  3. Class 2: Moderate — more visible hemorrhages and changes

  4. Class 3: Severe — widespread damage, high risk of vision loss

  5. Class 4: Proliferative DR — new blood vessels form, serious risk of blindness


The practical value of a deep learning model depends on its ability to predict these stages correctly.


Will This Really Help Ophthalmologists?

This system is designed to aid ophthalmologists, not replace them. Here is how it helps:

  1. Rapid Screening: Helps triage large volumes of fundus images quickly

  2. Decision Support: Flags severe cases for priority review

  3. Rural Access: Enables screening in areas where access to eye specialists is limited

  4. Consistency: Reduces variability in diagnosis


Imagine an ophthalmologist who needs to work through 500 cases in a day — with an AI partner, they need to deal with only the 10% most critical cases.


How We Preprocessed the Data for Model Training

High-quality data is crucial for training high-performing models. Here is how we prepared ours:

  1. Normalization: Images were normalized to ImageNet standard mean and std

  2. Data Augmentation: Horizontal flips, Gaussian noise, and elastic deformation were applied to simulate real-world variation

  3. Image Cleaning: Corrupt and illegible images were removed

  4. Organized Directory Structure: Images were class-wise arranged for effortless training


These steps gave our models balanced, realistic inputs to learn from effectively.


The Models We Trained (And How)


Swin Transformer

The Swin Transformer adapts the Vision Transformer (ViT) architecture for hierarchical feature extraction. It splits images into non-overlapping patches and processes them through shifted windows — capturing both local and global features.

  1. Optimizer: AdamW

  2. Epochs: 30

  3. Loss Function: CrossEntropyLoss

  4. Scheduler: Cosine Annealing

  5. Training Accuracy: ~87%

  6. Validation Accuracy: ~72.78%


Confusion Matrix:

[Swin Transformer confusion matrix]

Vision Transformer (ViT)

ViT processes an image as a sequence of patches and uses transformer blocks (without CNNs). It performed well, especially in detecting early DR due to its global attention mechanism.

  1. Optimizer: AdamW

  2. Epochs: 30

  3. Class Weights: Yes (for imbalance)

  4. Validation Accuracy: ~70.42%


Confusion Matrix:

[ViT confusion matrix]

YOLOv11m (Adapted for Classification)

Although YOLO is mainly used for object detection, we adapted YOLOv11m for classification tasks by modifying the head layers.

  1. Challenges: Poor sensitivity in early stages

  2. Best Detection: Class 4 (advanced DR)

  3. Validation Accuracy: ~61.7%
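
The head-swap idea itself is easy to illustrate: keep a convolutional backbone and replace the detection head with a global-pool-plus-linear classification head over the 5 DR classes. The tiny backbone below is a stand-in so the sketch runs end to end, not the real YOLOv11m architecture:

```python
import torch
import torch.nn as nn

class ClassifierFromBackbone(nn.Module):
    """Wrap a feature-extracting backbone with a classification head."""
    def __init__(self, backbone, feat_channels, num_classes=5):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # collapse spatial dims
            nn.Flatten(),
            nn.Linear(feat_channels, num_classes),
        )

    def forward(self, x):
        return self.head(self.backbone(x))

# Tiny stand-in backbone (NOT YOLOv11m) so the example is self-contained.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
model = ClassifierFromBackbone(backbone, feat_channels=32)
```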


Confusion Matrix:

[YOLOv11m confusion matrix]

Sensitivity and Specificity Comparison

Sensitivity (Recall) Across Classes

  1. Swin Transformer had excellent recall for Class 3 and 4

  2. ViT was consistent across all classes

  3. YOLOv11m failed in Class 1 (sensitivity = 0)

[Per-class sensitivity comparison chart]

Specificity Across Classes

  1. ViT demonstrated the highest specificity across early stages

  2. Swin showed high specificity in severe cases
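
Both metrics come straight from each model's confusion matrix. A small helper (rows are true classes, columns are predictions; the 3-class matrix is a toy example, not one of our results) shows the arithmetic:

```python
import numpy as np

def sensitivity_specificity(cm):
    """Per-class sensitivity (recall) and specificity from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                      # true positives per class
    fn = cm.sum(axis=1) - tp              # missed cases of that class
    fp = cm.sum(axis=0) - tp              # other classes predicted as it
    tn = cm.sum() - tp - fn - fp          # everything else
    return tp / (tp + fn), tn / (tn + fp)

# Toy 3-class example
cm = [[50, 5, 0],
      [10, 30, 5],
      [0, 5, 45]]
sens, spec = sensitivity_specificity(cm)
```

A sensitivity of 0 for a class (as YOLOv11m showed on Class 1) means its row of the confusion matrix has no diagonal hits at all.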

[Per-class specificity comparison chart]

This study shows that transformer-based models like Swin and ViT are very effective in medical image classification. Among these, Swin Transformer performed best overall — especially in detecting severe and proliferative DR, which are most clinically significant.


YOLOv11m, however, struggled with early-stage detection and showed limited sensitivity for several classes. It can still be useful where inference speed matters more than sensitivity.


Where to Find the Code?

The complete training, inference, and preprocessing code is available on GitHub:

Feel free to fork, experiment, and build on it.

