CVPR 2023: Survey of CLIP-Based Algorithms

Extends CLIP to video with a temporal modeling module ("a simple and effective temporal modeling mechanism")
Reaches SOTA on two video action recognition benchmarks, Kinetics-400 and Something-Something-v2

Code: https://github.com/farewellthree/STAN

ByteDance
Reports zero-shot results on text-to-image retrieval, image-to-text retrieval, product classification, and product retrieval
Also transfers reasonably well to object detection

No source code!!

CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data

Applies CLIP to point clouds

No source code!!

Object Detection

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

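Applies CLIP to open-vocabulary detection (OVD), an object detection task

(My understanding: it carries CLIP's generalization ability over to object detection)

Combines a DETR-style detection framework with prompt engineering and uses a region classifier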

Code: https://github.com/tgxs002/CORA (a DETR-style framework for open-vocabulary detection (OVD), CVPR 2023)

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment

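HKUST x Huawei Noah's Ark Lab x Sun Yat-sen University

Open-vocabulary object detection (the OVD task)

Learns fine-grained word-region alignment end to end from a large number of image-text pairs

Zhihu: reading notes on "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment" (zhihu.com)
No source code!!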

Image-Language Retrieval (VLP)

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

Uses CLIP for zero-shot sketch-based image retrieval

Code: https://github.com/aneeshan95/Sketch_LVM

CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval

Huawei

Knowledge distillation for video-text retrieval

No source code!!

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

Uses CLIP for unsupervised crowd counting

Zhihu: walkthrough of the paper "CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model" (CVPR 2023) (zhihu.com)

Code: https://github.com/dk-liang/CrowdCLIP

Learning Emotion Representations from Verbal and Nonverbal Communication

Code: https://github.com/Xeaver/EmotionCLIP

Paper: https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_Learning_Emotion_Representations_From_Verbal_and_Nonverbal_Communication_CVPR_2023_paper.pdf

Saw this on May 29. It proposes EmotionCLIP; I have not read it in detail.

Model Training and Tuning

Proposes cross-modal adaptation, a few-shot fine-tuning method that works with CLIP

Analysis on Zhihu: "CVPR 2023 | Cross-modal Adaptation: a new CLIP-based fine-tuning paradigm" (zhihu.com)

Code: https://github.com/linzhiqiu/cross_modal_adaptation

Fine-Tuned CLIP Models Are Efficient Video Learners

Adapts CLIP to video
Shows that plain fine-tuning is enough for CLIP to perform well on video (looks like SOTA to me)

Name (configs)	Input	Base Acc.	Novel Acc.	HM	Model
CLIP image-FT	32x224	9.2	8.5	8.8	seed1/seed2/seed3
CLIP text-FT	32x224	12.4	9.5	10.8	seed1/seed2/seed3
ViFi-CLIP	32x224	16.2	12.1	13.9	seed1/seed2/seed3
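
Note on the HM column: it looks like the harmonic mean of Base Acc. and Novel Acc. (my assumption; the table itself does not define it). A quick Python check, with the numbers copied from the rows above and a hypothetical helper name, reproduces the listed HM values after rounding to one decimal:

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0
    return 2 * base_acc * novel_acc / total


# (Base Acc., Novel Acc.) pairs copied from the table above.
rows = {
    "CLIP image-FT": (9.2, 8.5),
    "CLIP text-FT": (12.4, 9.5),
    "ViFi-CLIP": (16.2, 12.1),
}

# Round to one decimal to match the precision used in the table.
hm = {name: round(harmonic_mean(b, n), 1) for name, (b, n) in rows.items()}
print(hm)
```

The harmonic mean penalizes a gap between the two accuracies, so a model cannot score well by doing well on base classes alone.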

Code: https://github.com/muzairkhattak/ViFi-CLIP

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

A memory-efficient training method for CLIP
Adapted from OpenAI's open-source model
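
For context on why memory matters here: CLIP-style training uses a symmetric contrastive loss over a batch-by-batch similarity matrix, so that matrix grows quadratically with the batch size. Below is a minimal NumPy sketch of that baseline loss only. It is not DisCo-CLIP's distributed formulation, and the function names and the temperature of 0.07 are illustrative assumptions.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (B, B) similarity matrix: this is the part that grows with batch size.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should pick its own column, and vice versa.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage with random embeddings (batch of 8, dimension 16).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = rng.normal(size=(8, 16))
print(clip_contrastive_loss(img, txt))
```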

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

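ByteDance x Rutgers University
Highlights:

Proposes shared Finite Discrete Tokens (FDT) as a multimodal representation with a unified granularity, which strengthens semantic alignment in image-text models

Can improve semantic alignment in image classification and image-text retrieval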
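
A rough sketch of the shared-token idea, as I understand it: image patches and text tokens are each scored against one shared codebook, and the final embedding is a weighted sum of codebook entries, so both modalities are expressed in the same vocabulary. The max-pooling and softmax choices, the function name, and the sizes below are my assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np


def fdt_embedding(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map (N, d) patch or token features onto a (K, d) shared codebook.

    Returns a single (d,) embedding built only from codebook entries.
    """
    scores = features @ codebook.T      # (N, K) relevance of each feature to each token
    pooled = scores.max(axis=0)         # (K,) strongest response per shared token
    weights = np.exp(pooled - pooled.max())
    weights /= weights.sum()            # softmax over the K shared tokens
    return weights @ codebook           # (d,) weighted sum of shared tokens


# Toy usage: one shared codebook serves both modalities (sizes are made up).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 16))        # K=32 shared tokens, d=16
image_patches = rng.normal(size=(49, 16))   # e.g. 7x7 patch embeddings
text_tokens = rng.normal(size=(12, 16))     # e.g. 12 word-piece embeddings

image_vec = fdt_embedding(image_patches, codebook)
text_vec = fdt_embedding(text_tokens, codebook)
print(image_vec.shape, text_vec.shape)
```

Because both vectors are mixtures of the same codebook entries, they can be compared directly with a contrastive objective.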

Code: https://github.com/yuxiaochen1103/FDT

CLIPPO: Image-and-Language Understanding From Pixels Only

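A new architecture

Renders text as images and feeds it through the same encoder as regular images (one shared Transformer model)

Slightly below the original CLIP on image classification and retrieval (by 2%-3%)

Zhihu: "CVPR 2023 | Google proposes CLIPPO: understanding images and language from pixels only" (zhihu.com)

Code (and pretrained models): https://github.com/google-research/big_vision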

Image Generation

CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing

A CLIP-based method, improved by introducing spatial attention in a carefully selected layer of StyleGAN

No source code

Shifted Diffusion for Text-to-Image Generation

ByteDance
Applies shifted diffusion to text-to-image generation
Proposes a model named Corgi for text-to-image generation
A key highlight: it incorporates CLIP into the diffusion process

Code: https://github.com/drboog/Shifted_Diffusion