As of the preparation of this module, the official CLIP repo is mainly structured for inference. This module adds the changes required for training, keeping in mind the tricks from the paper and the discussions in the GitHub issues.

Algorithm

CLIP

Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

Tokenizer

class ClipTokenizer[source]

ClipTokenizer(context_length=77) :: DisplayedTransform

Tokenizer from https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
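
A minimal usage sketch (assuming the transform accepts a raw caption string, as it does in the Example Usage at the end of this page); the exact output type follows fastai's transform conventions:

clip_tokenizer = ClipTokenizer(context_length=77)
tokens = clip_tokenizer("a photo of a dog")   # fixed-length sequence of BPE token ids (length 77)
vocab_size = clip_tokenizer.vocab_size        # used below when building the model config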

Model

vitb32_config[source]

vitb32_config(input_res, context_length, vocab_size)

ViT-B/32 configuration, uses 32x32 patches

vitl14_config[source]

vitl14_config(input_res, context_length, vocab_size)

ViT-L/14 configuration, uses 14x14 patches
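
A small sketch of how a config feeds into the CLIP model documented below (the values are illustrative; context_length and vocab_size would normally come from ClipTokenizer):

cfg = vitb32_config(input_res=224, context_length=77, vocab_size=49408)
clip_model = CLIP(**cfg)   # see the CLIP class and the Example Usage section below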

class Bottleneck[source]

Bottleneck(inplanes, planes, stride=1) :: Module

ResNet bottleneck block used by ModifiedResNet below. All convolutions keep stride 1; when stride > 1 is requested, an average pool performs the downsampling after the second convolution (and in the shortcut branch) instead of a strided convolution, which is the anti-aliasing trick described under ModifiedResNet.

class AttentionPool2d[source]

AttentionPool2d(spacial_dim:int, embed_dim:int, num_heads:int, output_dim:int=None) :: Module

Attention-based pooling over a 2D feature map, used as the final pooling layer of ModifiedResNet. The spatial positions are flattened into a token sequence, a learned positional embedding is added, and multi-head QKV attention (queried by the spatial mean) produces a single pooled vector of dimension output_dim (which defaults to embed_dim when None).
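
A simplified, hypothetical sketch of the mechanism (the actual implementation differs in its details):

import torch
import torch.nn as nn

class SimpleAttentionPool2d(nn.Module):
    # Hypothetical sketch: pool an (N, C, H, W) feature map with QKV attention,
    # querying with the spatial mean so a single pooled vector comes out.
    def __init__(self, spacial_dim, embed_dim, num_heads, output_dim=None):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.proj = nn.Linear(embed_dim, output_dim or embed_dim)

    def forward(self, x):                                    # x: (N, C, H, W)
        x = x.flatten(2).permute(2, 0, 1)                    # -> (H*W, N, C)
        x = torch.cat([x.mean(dim=0, keepdim=True), x], 0)   # prepend the mean token
        x = x + self.pos_embed[:, None, :]                   # learned positional embedding
        pooled, _ = self.attn(x[:1], x, x)                   # query = mean token, keys/values = all tokens
        return self.proj(pooled.squeeze(0))                  # (N, output_dim)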

class ModifiedResNet[source]

ModifiedResNet(layers, output_dim, heads, input_resolution=224, width=64) :: Module

A ResNet class that is similar to torchvision's but contains the following changes:

  • There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
  • Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
  • The final pooling layer is a QKV attention instead of an average pool
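
An illustrative sketch of the second bullet (anti-aliased strided convolutions); the helper name is hypothetical:

import torch.nn as nn

def blurpool_conv3x3(in_ch, out_ch, stride=1):
    # Hypothetical helper: the convolution itself keeps stride 1; when downsampling is
    # needed, an AvgPool2d in front of it performs the (anti-aliased) resolution reduction.
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False)
    return nn.Sequential(nn.AvgPool2d(stride), conv) if stride > 1 else conv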

class LayerNorm[source]

LayerNorm(normalized_shape:Union[int, List[int], Size], eps:float=1e-05, elementwise_affine:bool=True, device=None, dtype=None) :: LayerNorm

Subclass torch's LayerNorm to handle fp16.
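
The usual way to do this (and what the upstream CLIP code does) is to run the normalization in float32 and cast the result back to the input dtype; a sketch:

import torch
import torch.nn as nn

class Fp16SafeLayerNorm(nn.LayerNorm):
    # Run LayerNorm in float32 for numerical stability, then return the original dtype.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        return super().forward(x.float()).to(orig_dtype)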

class QuickGELU[source]

QuickGELU() :: Module

A fast, sigmoid-based approximation of the GELU activation used throughout the original CLIP model: forward(x) returns x * sigmoid(1.702 * x).
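
For reference, the activation itself is only one line; a sketch of an equivalent module:

import torch
import torch.nn as nn

class QuickGELUSketch(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)   # sigmoid-based GELU approximation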

class ResidualAttentionBlock[source]

ResidualAttentionBlock(d_model:int, n_head:int, attn_mask:Tensor=None) :: Module

A single pre-LayerNorm transformer block: multi-head self-attention followed by a 2-layer MLP (with QuickGELU), each wrapped in a residual connection. The optional attn_mask is used by the text encoder for causal masking.
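
A minimal sketch of such a block (using nn.MultiheadAttention and the standard GELU in place of QuickGELU; the class name is hypothetical):

import torch
import torch.nn as nn

class PreNormAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),                                   # the real block uses QuickGELU
            nn.Linear(d_model * 4, d_model),
        )
        self.ln_2 = nn.LayerNorm(d_model)
        self.attn_mask = attn_mask

    def forward(self, x):                                # x: (seq_len, batch, d_model)
        mask = None if self.attn_mask is None else self.attn_mask.to(dtype=x.dtype, device=x.device)
        y = self.ln_1(x)
        x = x + self.attn(y, y, y, need_weights=False, attn_mask=mask)[0]
        return x + self.mlp(self.ln_2(x))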

class Transformer[source]

Transformer(width:int, layers:int, heads:int, attn_mask:Tensor=None, checkpoint=False, checkpoint_nchunks=2) :: Module

A stack of layers ResidualAttentionBlock modules, each of width width with heads attention heads and an optional attn_mask. In addition to the original CLIP implementation, this version can run the blocks with gradient checkpointing (checkpoint=True, split into checkpoint_nchunks segments) to reduce activation memory during training.
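
A sketch of how the checkpoint / checkpoint_nchunks options can be implemented with torch.utils.checkpoint.checkpoint_sequential, reusing the PreNormAttentionBlock sketch above (an illustration of the technique, not the module's exact code):

import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

class CheckpointedTransformer(nn.Module):
    # Hypothetical sketch: a plain stack of attention blocks that can optionally be run
    # with gradient checkpointing (trading recompute for activation memory).
    def __init__(self, width, layers, heads, attn_mask=None, checkpoint=False, checkpoint_nchunks=2):
        super().__init__()
        # PreNormAttentionBlock is the sketch from the ResidualAttentionBlock entry above.
        self.resblocks = nn.Sequential(*[PreNormAttentionBlock(width, heads, attn_mask) for _ in range(layers)])
        self.checkpoint, self.checkpoint_nchunks = checkpoint, checkpoint_nchunks

    def forward(self, x):
        if self.checkpoint and self.training:
            return checkpoint_sequential(self.resblocks, self.checkpoint_nchunks, x)
        return self.resblocks(x)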

class VisualTransformer[source]

VisualTransformer(input_resolution:int, patch_size:int, width:int, layers:int, heads:int, output_dim:int, **kwargs) :: Module

The ViT image encoder: a patch_size x patch_size convolution embeds the image into patch tokens, a class token and learned positional embeddings are added, the sequence is processed by a Transformer with the given width, layers and heads, and the class-token output is projected to output_dim. Extra **kwargs (such as the checkpointing options) are passed down to the Transformer.
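
A compact, hypothetical sketch of what such an encoder computes, reusing the CheckpointedTransformer sketch above:

import torch
import torch.nn as nn

class MiniViT(nn.Module):
    # Hypothetical sketch: conv patch embedding, class token, positional embedding,
    # transformer, then a projection into the joint embedding space.
    def __init__(self, input_resolution, patch_size, width, layers, heads, output_dim):
        super().__init__()
        n_patches = (input_resolution // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, width, kernel_size=patch_size, stride=patch_size, bias=False)
        self.cls_token = nn.Parameter(torch.randn(width) * width ** -0.5)
        self.pos_embed = nn.Parameter(torch.randn(n_patches + 1, width) * width ** -0.5)
        self.ln_pre, self.ln_post = nn.LayerNorm(width), nn.LayerNorm(width)
        self.transformer = CheckpointedTransformer(width, layers, heads)
        self.proj = nn.Parameter(torch.randn(width, output_dim) * width ** -0.5)

    def forward(self, x):                                     # x: (N, 3, H, W)
        x = self.patch_embed(x).flatten(2).permute(0, 2, 1)   # (N, n_patches, width)
        cls = self.cls_token.expand(x.shape[0], 1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.ln_pre(x).permute(1, 0, 2)                   # (seq, N, width) for the transformer
        x = self.transformer(x).permute(1, 0, 2)
        return self.ln_post(x[:, 0]) @ self.proj              # class-token embedding -> output_dim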

class CLIP[source]

CLIP(embed_dim:int, image_resolution:int, vision_layers:Union[Tuple[int, int, int, int], int], vision_width:int, vision_patch_size:int, context_length:int, vocab_size:int, transformer_width:int, transformer_heads:int, transformer_layers:int, **kwargs) :: Module

The full CLIP model: an image encoder (VisualTransformer when vision_layers is an int, ModifiedResNet when it is a 4-tuple of layer counts) and a text Transformer, each projecting into a shared embed_dim-dimensional space. Image and text features are L2-normalized and compared with a learned temperature (logit_scale) to produce the contrastive logits.

                    Type                                    Default  Details
embed_dim           int
image_resolution    int                                              vision
vision_layers       Union[Tuple[int, int, int, int], int]
vision_width        int
vision_patch_size   int
context_length      int                                              text
vocab_size          int
transformer_width   int
transformer_heads   int
transformer_layers  int
kwargs
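
At a high level, the forward pass maps both modalities into the shared embedding space, L2-normalizes the features, and scales the cosine similarities by a learned temperature; a sketch assuming the upstream encode_image / encode_text / logit_scale interface:

import torch
import torch.nn.functional as F

def clip_logits(model, images, texts):
    img = F.normalize(model.encode_image(images), dim=-1)   # (bs, embed_dim), unit norm
    txt = F.normalize(model.encode_text(texts), dim=-1)
    scale = model.logit_scale.exp()                          # temperature stored as a log value
    logits_per_image = scale * img @ txt.t()                 # (bs, bs) cosine-similarity logits
    return logits_per_image, logits_per_image.t()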

Metric

A useful proxy metric for tracking training performance and convergence.

class RetrievalAtK[source]

RetrievalAtK(k=20, **kwargs) :: AccumMetric

Stores predictions and targets on CPU in accumulate to perform final calculations with func.
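
A hypothetical sketch of the quantity being tracked: rank each image's paired text among all texts by similarity, then report the top-k hit rate (or the mean/median rank when k is "mean"/"median"):

import torch

def retrieval_at_k(img_feats, txt_feats, k=20):
    # Assumes L2-normalized features; sims[i, j] = similarity of image i and text j.
    sims = img_feats @ txt_feats.t()
    order = sims.argsort(dim=-1, descending=True)
    target = torch.arange(sims.size(0), device=sims.device)
    ranks = (order == target[:, None]).float().argmax(dim=-1)   # 0-based rank of the true pair
    if k == "mean":   return ranks.float().mean()
    if k == "median": return ranks.float().median()
    return (ranks < k).float().mean()                            # fraction retrieved within top k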

CLIP Callback

Training Tip: In my own experiments, using CLIPTrainer() leads to faster convergence than DistributedCLIPTrainer(). For the best speed and performance, combine CLIPTrainer with DistributedDataParallel, fp16, and the ZeRO optimizer, using the maximum batch size that fits in your memory.

Important

To train with gradient checkpointing + fp16 you need to add 2 lines of code to the PyTorch source, until fastai moves to torch version >= 1.8.

class CLIPTrainer[source]

CLIPTrainer(after_create=None, before_fit=None, before_epoch=None, before_train=None, before_batch=None, after_pred=None, after_loss=None, before_backward=None, before_step=None, after_cancel_step=None, after_step=None, after_cancel_batch=None, after_batch=None, after_cancel_train=None, after_train=None, before_validate=None, after_cancel_validate=None, after_validate=None, after_cancel_epoch=None, after_epoch=None, after_cancel_fit=None, after_fit=None) :: Callback

Can be used with or without DistributedDataParallel
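
For reference, a sketch of the symmetric InfoNCE objective such a trainer optimizes (assuming L2-normalized features and a learned temperature logit_scale stored as a log value, as in the CLIP model above):

import torch
import torch.nn.functional as F

def clip_infonce(img_feats, txt_feats, logit_scale):
    logits = logit_scale.exp() * img_feats @ txt_feats.t()   # (bs, bs) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_t = F.cross_entropy(logits.t(), targets)  # match each caption to its image
    return (loss_i + loss_t) / 2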

class DistributedCLIPTrainer[source]

DistributedCLIPTrainer(after_create=None, before_fit=None, before_epoch=None, before_train=None, before_batch=None, after_pred=None, after_loss=None, before_backward=None, before_step=None, after_cancel_step=None, after_step=None, after_cancel_batch=None, after_batch=None, after_cancel_train=None, after_train=None, before_validate=None, after_cancel_validate=None, after_validate=None, after_cancel_epoch=None, after_epoch=None, after_cancel_fit=None, after_fit=None) :: Callback

Distributed implementation of InfoNCE loss, should be used with DistributedDataParallel
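
A rough sketch of the distributed variant's idea: each GPU gathers features from all ranks so its local batch is contrasted against the global batch (details such as gradient flow through the gathered copies are handled differently in the real callback):

import torch
import torch.distributed as dist
import torch.nn.functional as F

def distributed_clip_infonce(img_feats, txt_feats, logit_scale):
    world, rank, bs = dist.get_world_size(), dist.get_rank(), img_feats.size(0)
    all_img = [torch.zeros_like(img_feats) for _ in range(world)]
    all_txt = [torch.zeros_like(txt_feats) for _ in range(world)]
    dist.all_gather(all_img, img_feats)                    # gathered copies carry no gradients
    dist.all_gather(all_txt, txt_feats)
    all_img[rank], all_txt[rank] = img_feats, txt_feats    # keep gradients for the local chunk
    all_img, all_txt = torch.cat(all_img), torch.cat(all_txt)
    logits_i = logit_scale.exp() * img_feats @ all_txt.t()   # local images vs all texts
    logits_t = logit_scale.exp() * txt_feats @ all_img.t()   # local texts  vs all images
    targets = torch.arange(bs, device=img_feats.device) + rank * bs
    return (F.cross_entropy(logits_i, targets) + F.cross_entropy(logits_t, targets)) / 2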

Example Usage

from fastai.vision.all import *  # fastai primitives used below (Datasets, Learner, Resize, ...)
# plus the pieces documented above -- ClipTokenizer, vitb32_config, CLIP, CLIPTrainer,
# RetrievalAtK -- imported from this module
num2txt = {'3': 'three', '7': 'seven'}
def num_to_txt(o): return num2txt[o]
def dummy_targ(o): return 0 # loss func is not called without it
path = untar_data(URLs.MNIST_TINY)
items = get_image_files(path)
clip_tokenizer = ClipTokenizer()
tds = Datasets(items, [PILImage.create, [parent_label, num_to_txt], dummy_targ], n_inp=2, splits=GrandparentSplitter()(items))
dls = tds.dataloaders(bs=2, after_item=[Resize(224), clip_tokenizer, ToTensor()], after_batch=[IntToFloatTensor()], device='cpu')
vitb32_config_dict = vitb32_config(224, clip_tokenizer.context_length, clip_tokenizer.vocab_size)
clip_model = CLIP(**vitb32_config_dict, checkpoint=False, checkpoint_nchunks=0)
learner = Learner(dls, clip_model, loss_func=noop, cbs=[CLIPTrainer(), ShortEpochCallback(0.001)],
                  metrics=[RetrievalAtK(k=5), 
                           RetrievalAtK(k=20), 
                           RetrievalAtK(k="mean"),
                           RetrievalAtK(k="median")])
learner.fit(1)
epoch  train_loss  valid_loss  retrieval_at_5  retrieval_at_20  mean_retrieval_ranking  median_retrieval_ranking  time
0                                                                                                                 00:17
learner.recorder.losses
[TensorImage(0.6933)]