CVPR 2023 CLIP Algorithm Survey | Mox's Notes

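This is work from a while back; I'm taking some time today to organize it.

Judging by the number of accepted papers, this direction looks like it still has another two or three years of research in it 😘

CVPR 2023 open access site: https://openaccess.thecvf.com/CVPR2023

Reference: "The most complete CVPR 2023 roundup: papers by topic / code / explanations / livestreams / projects (continuously updated) [Computer Vision]", 极市 developer community (cvmart.net)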

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP

Transfers CLIP from 2D images to 3D scene understanding.

Code: https://github.com/runnanchen/CLIP2Scene

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting

Video classification via prompting (multimodal prompts on both the video and the text side).

Code: https://github.com/TalalWasim/Vita-CLIP

Turning a CLIP Model Into a Scene Text Detector

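A CLIP-based scene text detection approach (detection, not scene text recognition); the proposed method is named TCM.

Zhihu: "CVPR 2023 | New work from Xiang Bai's team: scene text detection with the help of CLIP" (zhihu.com)

Uses CLIP as a Text Decoder in the intermediate layers.

Code: https://github.com/wenwenyu/TCM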

Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring

From ByteDance and Peking University.

Proposes a side-branch structure called the Spatial-Temporal Auxiliary Network (STAN).

Extends CLIP to video ("a simple and effective temporal modeling mechanism").

Reaches SOTA on two video action recognition benchmarks, Kinetics-400 and Something-Something-v2.

Code: https://github.com/farewellthree/STAN

Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce

From ByteDance.

Reports zero-shot results on text-to-image retrieval, image-to-text retrieval, product classification, and product retrieval.

It also transfers well to object detection.

No source code!!

CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data

Applies CLIP to point clouds.

No source code!!

Object Detection

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

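Applies CLIP to open-vocabulary detection (OVD), a variant of the object detection task.

(My understanding: it carries CLIP's generalization ability over to object detection.)

Combines the DETR detection framework with prompt engineering and uses a region classifier.

Code: https://github.com/tgxs002/CORA (A DETR-style framework for open-vocabulary detection (OVD). CVPR 2023)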

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment

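HKUST x Huawei Noah's Ark Lab x Sun Yat-sen University

Open-vocabulary object detection (the OVD task).

Learns fine-grained word-region alignment end to end from a large number of image-text pairs.

Zhihu: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment reading notes" (zhihu.com)

No source code!!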
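To make "word-region alignment" concrete, here is a minimal sketch of one common way to score it: each word embedding is matched to its most similar region embedding by cosine similarity, and the per-word best scores are averaged. This is an illustration under my own assumptions, not the paper's confirmed formulation; the function names and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def word_region_alignment(word_embs, region_embs):
    """Match every word to its best region (hypothetical helper).

    Returns a list of (best_region_index, similarity) per word and the
    mean of the best similarities as an overall image-text score.
    """
    matches = []
    for word in word_embs:
        sims = [cosine(word, region) for region in region_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best, sims[best]))
    score = sum(sim for _, sim in matches) / len(matches)
    return matches, score


# Toy example: 2 word embeddings, 3 region embeddings (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
matches, score = word_region_alignment(words, regions)
print(matches, score)
```

In a real detector the embeddings would come from the text encoder and from region features of the image encoder; only the matching step is shown here.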

Image-Language Retrieval (VLP)

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

Uses CLIP for (sketch-based) image retrieval.

Code: https://github.com/aneeshan95/Sketch_LVM

CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval

Huawei

Knowledge distillation for video-text retrieval.

No source code!!

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

Uses CLIP for unsupervised crowd counting.

Zhihu: "CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model paper walkthrough (CVPR 2023)" (zhihu.com)

Code: https://github.com/dk-liang/CrowdCLIP

Learning Emotion Representations from Verbal and Nonverbal Communication

Code: https://github.com/Xeaver/EmotionCLIP

Paper: https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_Learning_Emotion_Representations_From_Verbal_and_Nonverbal_Communication_CVPR_2023_paper.pdf

I came across this one on May 29. It proposes EmotionCLIP; I haven't read it in detail.

Model Training and Tuning

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models

Proposes cross-modal adaptation, a few-shot fine-tuning method that applies to CLIP.

Analysis on Zhihu: "CVPR 2023 | Cross-modal Adaptation: a new fine-tuning paradigm based on CLIP" (zhihu.com)

Code: https://github.com/linzhiqiu/cross_modal_adaptation

Fine-Tuned CLIP Models Are Efficient Video Learners

Adapts CLIP to video.

Shows that CLIP only needs fine-tuning to perform well on video (my impression is that it is SOTA).

Name (configs)	Input	Base Acc.	Novel Acc.	HM	Model
CLIP image-FT	32x224	9.2	8.5	8.8	seed1/seed2/seed3
CLIP text-FT	32x224	12.4	9.5	10.8	seed1/seed2/seed3
ViFi-CLIP	32x224	16.2	12.1	13.9	seed1/seed2/seed3
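
The HM column appears to be the harmonic mean of the base and novel accuracies (an assumption on my part; the table itself does not say so). A quick check in Python reproduces the reported values once rounded to one decimal place:

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base and novel accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Rows copied from the table above: (name, base acc., novel acc., reported HM)
rows = [
    ("CLIP image-FT", 9.2, 8.5, 8.8),
    ("CLIP text-FT", 12.4, 9.5, 10.8),
    ("ViFi-CLIP", 16.2, 12.1, 13.9),
]

for name, base, novel, reported in rows:
    hm = harmonic_mean(base, novel)
    print(f"{name}: computed HM = {hm:.1f}, reported HM = {reported}")
```

The harmonic mean penalizes a large gap between the two accuracies, so a method cannot score well by doing well on base classes alone.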

Code: https://github.com/muzairkhattak/ViFi-CLIP

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

A memory-efficient CLIP training method.

Built by modifying OpenAI's open-source model.
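
For context, here is a minimal pure-Python sketch of the standard symmetric contrastive loss used in CLIP-style training. It is not DisCo-CLIP's distributed decomposition; it only shows the batch-by-batch similarity matrix whose size grows quadratically with the batch size, which is the part a memory-efficient loss has to deal with. The function names and the temperature value of 0.07 are my own illustrative choices.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    # logits[i][j]: cosine similarity of image i and text j, divided by the
    # temperature. This B x B matrix is the quadratic-cost part.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Matched pairs should give a much lower loss than mismatched pairs.
aligned = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned, shuffled)
```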

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

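ByteDance x Rutgers University

Highlights:

Proposes shared discrete tokens (Finite Discrete Tokens, FDT) as a multimodal representation with a unified granularity, which strengthens semantic alignment in image-text models.

This can improve semantic alignment in image classification and image-text retrieval.

Code: https://github.com/yuxiaochen1103/FDT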
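
The idea of a shared token set can be sketched as follows: a feature from either modality is scored against the same small codebook, the scores are turned into weights, and the final embedding is the weighted sum of the tokens. This is a simplified illustration under my own assumptions (plain softmax weights, made-up 2-D tokens, hypothetical function names); the repository above is the authority on how the paper actually computes the weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fdt_embed(feature, tokens):
    """Express a feature as a weighted sum of shared tokens.

    feature: encoder output for an image or a text (list of floats)
    tokens:  the shared codebook, one list of floats per token
    Returns (embedding, weights).
    """
    scores = [sum(f * t for f, t in zip(feature, token)) for token in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    embedding = [
        sum(w * token[d] for w, token in zip(weights, tokens))
        for d in range(dim)
    ]
    return embedding, weights


# Toy codebook with two tokens; both modalities use the same one.
tokens = [[1.0, 0.0], [0.0, 1.0]]
image_emb, image_w = fdt_embed([5.0, 0.0], tokens)
text_emb, text_w = fdt_embed([4.0, 0.5], tokens)
print(image_w, text_w)
```

Because both embeddings are built from the same tokens, an image and a text that activate the same tokens end up close to each other, which is the alignment effect described above.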

CLIPPO: Image-and-Language Understanding From Pixels Only

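A new architecture.

Renders text as images and processes them together with regular images in one model (a single shared Transformer).

On image classification and retrieval it is slightly worse than the original CLIP (by 2%-3%).

Zhihu: "CVPR 2023 | Google proposes CLIPPO: understanding images and language from pixels only" (zhihu.com)

Code (and pretrained models): https://github.com/google-research/big_vision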

Image Generation

CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing

A CLIP-based editing method, improved by introducing spatial attention into a carefully selected StyleGAN layer.

No source code.

Shifted Diffusion for Text-to-Image Generation

ByteDance

Applies Shifted Diffusion to the text-to-image generation task.

Proposes a text-to-image model named Corgi.

A key highlight: CLIP is incorporated into the diffusion process.

Code: https://github.com/drboog/Shifted_Diffusion