
A Guide to Using CLIP from GitHub

news 2026/2/8 13:07:07

CLIP's GitHub repository: https://github.com/openai/CLIP

CLIP

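Blog, Paper, Model Card, Colab
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. We found that CLIP matches the performance of the original ResNet50 on ImageNet "zero-shot", without using any of the original 1.28 million labeled examples, overcoming several major challenges in computer vision.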

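Usage

First, install PyTorch 1.7.1 (or later) and torchvision, along with a few small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following steps will do the job: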

conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the CUDA version that matches your machine, or with cpuonly when installing on a machine without a GPU.
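After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that uses only the standard library. The helper name installed_versions is hypothetical, and the distribution names passed to it are assumed to match the packages installed above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to your environment.
report = installed_versions(["torch", "torchvision", "ftfy", "regex", "tqdm"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

A missing package shows up as "not installed" instead of raising, so the check can be run before and after the install steps.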

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

API

The clip module provides the following methods:

clip.available_models()

Returns the names of the available CLIP models.
[Image: the author's output from running this call]

clip.load(name, device=..., jit=False)

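Returns the model and the TorchVision transform the model needs, for the model name given (one of the names returned by clip.available_models()). The model is downloaded if necessary. The name argument can also be a path to a local checkpoint.
The device to run the model on can optionally be specified. By default, the first CUDA device is used if one is available, otherwise the CPU. When jit is False, a non-JIT version of the model is loaded.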

clip.tokenize(text: Union[str, List[str]], context_length=77)

Returns a LongTensor containing the tokenized sequences of the given text input(s). This can be used as the input to the model.
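To make the shape of that return value concrete, the sketch below mimics the padding step in plain Python. It is not the library's implementation: pad_tokens is a hypothetical helper, the word ids are made up, and the start/end marker ids 49406 and 49407 are assumptions that should be checked against the actual tokenizer.

```python
from typing import List

SOT, EOT = 49406, 49407  # assumed start/end-of-text ids; verify against the tokenizer


def pad_tokens(word_ids: List[List[int]], context_length: int = 77) -> List[List[int]]:
    """Wrap each sequence in SOT/EOT and right-pad with zeros to context_length."""
    batch = []
    for ids in word_ids:
        tokens = [SOT] + ids + [EOT]
        if len(tokens) > context_length:
            raise ValueError("input is too long for the context length")
        batch.append(tokens + [0] * (context_length - len(tokens)))
    return batch


# Hypothetical word ids for two short texts of different lengths.
batch = pad_tokens([[320, 1929], [320, 2368, 539]])
print(len(batch), len(batch[0]), len(batch[1]))  # prints: 2 77 77
```

Every row has the same length regardless of how long the text is, which is what lets a list of strings become a single batch tensor of shape (number of texts, context_length).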

The model returned by clip.load() supports the following methods:

model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

model.encode_text(text: Tensor)

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

model(image: Tensor, text: Tensor)

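Given a batch of images and a batch of text tokens, returns two tensors containing the logit scores corresponding to each image and each text input. The values are the cosine similarities between the corresponding image and text features, multiplied by 100.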
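The relationship between features, logits, and probabilities can be sketched in plain Python. This is an illustration under stated assumptions, not the model's code: the 2-D vectors are toy values, the helper names are hypothetical, and the scale of 100 is taken from the description above.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def logits_for_image(image_vec: List[float], text_vecs: List[List[float]],
                     scale: float = 100.0) -> List[float]:
    """Cosine similarity between one image and each text, times scale."""
    img = normalize(image_vec)
    return [scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_vecs]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D features (made up for illustration, not real CLIP outputs).
image = [1.0, 0.0]
texts = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
logits = logits_for_image(image, texts)
probs = softmax(logits)
print(logits)  # prints: [100.0, 0.0, -100.0]
```

Because the features are normalized first, only the direction of each vector matters. The factor of 100 sharpens the softmax, so the text whose direction matches the image takes almost all of the probability mass.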

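More Examples

Zero-Shot Prediction

The code below performs zero-shot prediction using CLIP, as shown in Appendix B of the paper. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the dataset's 100 textual labels.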

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

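The output looks like the following (the exact numbers may differ slightly depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.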
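The last step of the example, ranking the labels by probability and printing the best few, does not depend on PyTorch. A minimal sketch of that step in plain Python follows. The helper name top_k is hypothetical, and the labels and probabilities are made up for illustration.

```python
from typing import List, Tuple


def top_k(labels: List[str], probs: List[float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up labels and probabilities, for illustration only.
labels = ["apple", "snake", "turtle", "lizard"]
probs = [0.05, 0.70, 0.20, 0.05]
best = top_k(labels, probs, k=2)
for label, p in best:
    # Right-align the label in 16 columns, as the example's print statement does.
    print(f"{label:>16s}: {100 * p:.2f}%")
```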

Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
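One way such a sweep might be organized is sketched below in plain Python. Everything here is an assumption for illustration: log_grid and sweep are hypothetical helpers, the grid bounds are arbitrary, and fake_validation_accuracy is a stand-in for a real scorer that would fit the classifier on a training split and measure accuracy on a held-out validation split.

```python
from typing import Callable, List, Tuple


def log_grid(low_exp: float, high_exp: float, steps: int) -> List[float]:
    """Return `steps` values spaced evenly in log10 from 10**low_exp to 10**high_exp."""
    if steps < 2:
        return [10.0 ** low_exp]
    width = (high_exp - low_exp) / (steps - 1)
    return [10.0 ** (low_exp + i * width) for i in range(steps)]


def sweep(candidates: List[float], evaluate: Callable[[float], float]) -> Tuple[float, float]:
    """Evaluate each candidate C and return (best_C, best_validation_score)."""
    scored = [(evaluate(c), c) for c in candidates]
    best_score, best_c = max(scored)
    return best_c, best_score


# Stand-in scorer that peaks near C = 0.316; a real one would fit on the
# training split and return accuracy on a held-out validation split.
def fake_validation_accuracy(c: float) -> float:
    return 1.0 / (1.0 + abs(c - 0.316))


best_c, best_score = sweep(log_grid(-3, 2, 11), fake_validation_accuracy)
print(f"best C = {best_c:.3f}")  # prints: best C = 0.316
```

In practice the evaluate callback would train LogisticRegression(C=c) on part of the training features and score it on the remainder, keeping the test set out of the sweep so that the reported accuracy stays unbiased.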
