MoCo ImageWang Tutorial

First import fastai for training and other helpers, you can choose not to use wandb by setting WANDB=False.

from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback
import wandb

torch.backends.cudnn.benchmark = True
WANDB = False

Then import self_supervised augmentations module for creating augmentations pipeline, layers module for creating encoder and model, and finally moco for self-supervised training.

from self_supervised.augmentations import *
from self_supervised.layers import *
from self_supervised.vision.moco import *

In this notebook we will take a look at ImageWang benchmark, how to train a self-supervised model using MoCo algorithm and then how to use this pretrained model for finetuning on the given downstream task.

Pretraining

def get_dls(size, bs, workers=None):
    path = URLs.IMAGEWANG_160 if size <= 160 else URLs.IMAGEWANG
    source = untar_data(path)
    
    files = get_image_files(source)
    tfms = [[PILImage.create, ToTensor, RandomResizedCrop(size, min_scale=1.)], 
            [parent_label, Categorize()]]
    
    dsets = Datasets(files, tfms=tfms, splits=RandomSplitter(valid_pct=0.1)(files))
    
    batch_tfms = [IntToFloatTensor]
    dls = dsets.dataloaders(bs=bs, num_workers=workers, after_batch=batch_tfms)
    return dls

ImageWang has several benchmarks for different image sizes, in this tutorial we will go for size=224 and also demonstrate how effectively you can utilize GPU memory.

Define batch size, resize resolution before batching and size for random cropping during self-supervised training. It's always good to use a batch size as high as it can fit the GPU memory or you can use the extra memory to run parallel experiments :)

bs, resize, size = 256, 256, 224

Select architecture to train on, remember all timm and fastai models are available! We need to set pretrained=False here because using imagenet weights for ImageWang data would be cheating.

arch = "xresnet34"
encoder = create_encoder(arch, pretrained=False, n_in=3)

if WANDB:
    xtra_config = {"Arch":arch, "Resize":resize, "Size":size, "Algorithm":"MOCO"}
    wandb.init(project="self-supervised-imagewang", config=xtra_config);

Initialize the Dataloaders using the function above.

dls = get_dls(resize, bs)

Set queue size, it needs to be a multiple of batch size but needs to be less than total dataset size.

K = bs*2**4
assert K < len(dls.train_ds)

Create MoCo model. You can change values of hidden_size, projection_size. For this problem, defaults work just fine so we don't do any changes.

model = create_moco_model(encoder)

Next step is perhaps the most critical step for achieving good results on a custom problem - data augmentation. For this, we will use utility function from self_supervised.vision.moco.get_moco_aug_pipelines but you can also use your own list of Pipeline augmentations. self_supervised.vision.moco.get_moco_aug_pipelinesshould be enough for most of the cases since under the hood it uses self_supervised.augmentations.get_multi_aug_pipelines and self_supervised.augmentations.get_batch_augs. You can do shift+tab and see all the arguments that can be passed to get_moco_aug_pipelines. You can simply pass anything that you could pass to get_batch_augs including custom xtra_tfms.

get_moco_aug_pipelines excepts size for random resized cropping of the 2 views of a given image and the rest of the arguments are coming from get_batch_augs()

aug_pipelines = get_moco_aug_pipelines(size, rotate=True, rotate_deg=10, jitter=True, bw=True, blur=False)

Here, we will feed the augmentation pipelines and queue size, and leave momentum and temperature parameters as default.

cbs=[MOCO(aug_pipelines, K=K)]
if WANDB: cbs += [WandbCallback(log_preds=False,log_model=False)]

learn = Learner(dls, model, cbs=cbs)

Before starting training let's check whether our augmentations makes sense or not. Since this step consumes GPU memory, once you are done with inspection, restart the notebook and skip this step. We can see that 2 views of the same image side by side and indeed augmentations look pretty good. Now, it's time restart the notebook and skip this step.

b = dls.one_batch()
learn._split(b)
learn.moco.show(n=5);

Use mixed precision with to_fp16() for more GPU memory, larger batch size and faster training . We could also use gradient checkpointing wrapper models from self_supervised.layers to save even more memory, e.g. CheckpointSequential().

learn.to_fp16();

Learning good representations via contrastive learning usually takes a lot of epochs. So here number epochs are set to 100. This might change depending on your data distribution and dataset size.

lr,wd,epochs=1e-2,1e-2,100

learn.unfreeze()
learn.fit_flat_cos(epochs, lr, wd=wd, pct_start=0.5)

if WANDB: wandb.finish()

save_name = f'moco_iwang_sz{size}_epc{epochs}'
learn.save(save_name)
torch.save(learn.model.encoder.state_dict(), learn.path/learn.model_dir/f'{save_name}_encoder.pth')

learn.recorder.plot_loss()

Downstream Task

optdict = dict(sqr_mom=0.99,mom=0.95,beta=0.,eps=1e-4)
opt_func = partial(ranger, **optdict)

bs, size

def get_dls(size, bs, workers=None):
    path = URLs.IMAGEWANG_160 if size <= 160 else URLs.IMAGEWANG
    source = untar_data(path)
    files = get_image_files(source, folders=['train', 'val'])
    splits = GrandparentSplitter(valid_name='val')(files)
    
    item_aug = [RandomResizedCrop(size, min_scale=0.35), FlipItem(0.5)]
    tfms = [[PILImage.create, ToTensor, *item_aug], 
            [parent_label, Categorize()]]
    
    dsets = Datasets(files, tfms=tfms, splits=splits)
    
    batch_tfms = [IntToFloatTensor, Normalize.from_stats(*imagenet_stats)]
    dls = dsets.dataloaders(bs=bs, num_workers=workers, after_batch=batch_tfms)
    return dls

def split_func(m): return L(m[0], m[1]).map(params)

def create_learner(size=size, arch='xresnet34', encoder_path="models/moco_iwang_sz128_epc100_encoder.pth"):
    
    dls = get_dls(size, bs=bs//2)
    pretrained_encoder = torch.load(encoder_path)
    encoder = create_encoder(arch, pretrained=False, n_in=3)
    encoder.load_state_dict(pretrained_encoder)
    nf = encoder(torch.randn(2,3,224,224)).size(-1)
    classifier = create_cls_module(nf, dls.c)
    model = nn.Sequential(encoder, classifier)
    learn = Learner(dls, model, opt_func=opt_func, splitter=split_func,
                metrics=[accuracy,top_k_accuracy], loss_func=LabelSmoothingCrossEntropy())
    return learn

def finetune(size, epochs, arch, encoder_path, lr=1e-2, wd=1e-2):
    learn = create_learner(size, arch, encoder_path)
    learn.unfreeze()
    learn.fit_flat_cos(epochs, lr, wd=wd)
    final_acc = learn.recorder.values[-1][-2]
    return final_acc

5 epochs

acc = []
runs = 5
for i in range(runs): acc += [finetune(size, epochs=5, arch='xresnet34', encoder_path=f'models/moco_iwang_sz{size}_epc100_encoder.pth')]

np.mean(acc)

20 epochs

acc = []
runs = 3
for i in range(runs): acc += [finetune(size, epochs=20, arch='xresnet34', encoder_path=f'models/moco_iwang_sz{size}_epc100_encoder.pth')]

np.mean(acc)

80 epochs

acc = []
runs = 1
for i in range(runs): acc += [finetune(size, epochs=80, arch='xresnet34',encoder_path=f'models/moco_iwang_sz{size}_epc100_encoder.pth')]

np.mean(acc)

200 epochs

acc = []
runs = 1
for i in range(runs): acc += [finetune(size, epochs=200, arch='xresnet34', encoder_path=f'models/moco_iwang_sz{size}_epc100_encoder.pth')]
np.mean(acc)

np.mean(acc)