Trying out Japanese Stable CLIP

Stability AI's announcement: "We have released Japanese Stable CLIP, the highest-performing Japanese image-language feature extraction model"

Japanese Stable CLIP is a close relative of Japanese Stable VLM, which I tried the other day. It extracts features from images, and it does so with Japanese text. Let's try it out.

Start a container from my usual Docker image.

docker run -it --gpus=all --rm -p 7860:7860 -v /work:/work nvidia/cuda:11.8.0-base-ubuntu22.04 /bin/bash

Install the required tools.

apt update
apt install -y python3-pip wget
pip install scipy ftfy regex tqdm gradio transformers sentencepiece 'accelerate>=0.12.0' 'bitsandbytes>=0.31.5' protobuf

Log in to Hugging Face. (Create an access token beforehand; the command prompts for it.)

huggingface-cli login

Download the sample images.

wget https://upload.wikimedia.org/wikipedia/commons/thumb/2/29/JAPANPOST-DSC00250.JPG/500px-JAPANPOST-DSC00250.JPG -O sample1.png
wget https://upload.wikimedia.org/wikipedia/commons/thumb/1/1c/Search_and_rescue_at_Unosumai%2C_Kamaishi%2C_-17_Mar._2011_a.jpg/500px-Search_and_rescue_at_Unosumai%2C_Kamaishi%2C_-17_Mar._2011_a.jpg -O sample2.png
wget https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Policeman_at_Tokyo.jpg/500px-Policeman_at_Tokyo.jpg -O sample3.png

From here on, everything runs in a notebook (.ipynb). First, load the model by running the following code.

#@title Load Japanese Stable CLIP
from typing import Union, List
import ftfy, html, re
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor, BatchFeature


device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "stabilityai/japanese-stable-clip-vit-l-16"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).eval().to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoImageProcessor.from_pretrained(model_path)


# taken from https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/tokenizer.py#L65C8-L65C8
def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


def tokenize(
    texts: Union[str, List[str]],
    max_seq_len: int = 77,
):
    """
    Tokenize text the same way the original CLIP code does.
    https://github.com/openai/CLIP/blob/main/clip/clip.py#L195
    """
    if isinstance(texts, str):
        texts = [texts]
    texts = [whitespace_clean(basic_clean(text)) for text in texts]

    inputs = tokenizer(
        texts,
        max_length=max_seq_len - 1,
        padding="max_length",
        truncation=True,
        add_special_tokens=False,
    )
    # add bos token at first place
    input_ids = [[tokenizer.bos_token_id] + ids for ids in inputs["input_ids"]]
    attention_mask = [[1] + am for am in inputs["attention_mask"]]
    position_ids = [list(range(0, len(input_ids[0])))] * len(texts)

    return BatchFeature(
        {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "position_ids": torch.tensor(position_ids, dtype=torch.long),
        }
    )


def compute_text_embeddings(text):
    """Return L2-normalized text embeddings as a CPU tensor."""
    if isinstance(text, str):
        text = [text]
    text = tokenize(texts=text)
    with torch.no_grad():  # inference only, so skip gradient tracking
        text_features = model.get_text_features(**text.to(device))
    text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
    del text
    return text_features.cpu().detach()


def compute_image_embeddings(image):
    """Return L2-normalized image embeddings as a CPU tensor."""
    image = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, so skip gradient tracking
        image_features = model.get_image_features(**image)
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    del image
    return image_features.cpu().detach()
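
The sequence layout that `tokenize` builds is easier to see with toy data. Below is a minimal sketch in plain Python, not the model's actual tokenizer: the helper name `pad_with_bos` and the ID values (`bos_id=1`, `pad_id=0`) are made up for illustration, and right padding is an assumption. The real IDs come from the tokenizer loaded above.

```python
from typing import Dict, List


def pad_with_bos(
    token_ids: List[int],
    max_seq_len: int = 77,
    bos_id: int = 1,  # hypothetical BOS id; the real one is tokenizer.bos_token_id
    pad_id: int = 0,  # hypothetical padding id
) -> Dict[str, List[int]]:
    """Mimic the layout built by tokenize(): BOS + tokens, padded to max_seq_len."""
    body_len = max_seq_len - 1  # reserve one slot for the BOS token
    body = token_ids[:body_len]  # truncate over-long inputs
    n_pad = body_len - len(body)
    input_ids = [bos_id] + body + [pad_id] * n_pad  # right padding is assumed here
    attention_mask = [1] + [1] * len(body) + [0] * n_pad  # BOS is always attended to
    position_ids = list(range(max_seq_len))
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


example = pad_with_bos([101, 102, 103], max_seq_len=8)
print(example["input_ids"])       # [1, 101, 102, 103, 0, 0, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

This is why `tokenize` passes `max_length=max_seq_len - 1` to the tokenizer: once the BOS token is prepended, every sequence ends up exactly `max_seq_len` long.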

Next, prepare the categories to classify against.

#@title Prepare for the demo
#@markdown Please feel free to change `categories` for your usage.

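categories = [
    "配達員",  # delivery worker
    "営業",  # salesperson
    "消防士",  # firefighter
    "救急隊員",  # paramedic
    "自衛隊",  # Self-Defense Forces
    "スポーツ選手",  # athlete
    "警察官",  # police officer
]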
# pre-compute text embeddings
text_embeds = compute_text_embeddings(categories)

Build the web UI.

# @title Launch the demo
import gradio as gr

num_categories = len(categories)
TOP_K = 3


def inference_fn(img):
  image_embeds = compute_image_embeddings(img)
  similarity = (100.0 * image_embeds @ text_embeds.T).softmax(dim=-1)
  similarity = similarity[0].numpy().tolist()
  output_dict = {categories[i]: float(similarity[i]) for i in range(num_categories)}
  del image_embeds
  return output_dict


with gr.Blocks() as demo:
    gr.Markdown("# Japanese Stable CLIP Demo")
    gr.Markdown(
        """[Japanese Stable CLIP](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) is a [CLIP](https://arxiv.org/abs/2103.00020) model by [Stability AI](https://ja.stability.ai/).
                - Blog: https://ja.stability.ai/blog/japanese-stable-clip
                - Twitter: https://twitter.com/StabilityAI_JP
                - Discord: https://discord.com/invite/StableJP"""
    )
    with gr.Row():
      with gr.Column():
        inp = gr.Image(type="pil")
      with gr.Column():
        out = gr.Label(num_top_classes=TOP_K)

    btn = gr.Button("Run")
    btn.click(fn=inference_fn, inputs=inp, outputs=out)
    examples = gr.Examples(
        examples=[
            # https://ja.wikipedia.org/wiki/%E9%83%B5%E4%BE%BF
            "sample1.png",
            # https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E3%81%AE%E6%B6%88%E9%98%B2
            "sample2.png",
            # https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E3%81%AE%E8%AD%A6%E5%AF%9F%E5%AE%98
            "sample3.png",
        ],
        inputs=inp
    )

if __name__ == "__main__":
    demo.launch(debug=True, share=True)

Opening it in a browser shows a web UI like this.

Select a sample and click Run, and the image is classified as shown below.

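If you change the categories, the image is classified against the new ones. In the sample above, the model also gives a small probability to "営業" (salesperson), which feels like a plausible call.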
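
To see where those probabilities come from, here is a small sketch of the scoring step in `inference_fn`: cosine similarity between L2-normalized vectors, scaled by 100, then a softmax across the categories. It uses plain Python and made-up 3-dimensional vectors instead of real model embeddings, so the helper names (`l2_normalize`, `softmax`, `classify`) and all the numbers are illustrative assumptions.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_vec: List[float], text_vecs: List[List[float]], scale: float = 100.0) -> List[float]:
    """Scaled cosine similarity followed by softmax, as in inference_fn."""
    image_vec = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_vec = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_vec, text_vec))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy embeddings (made up): the second label points in nearly the same
# direction as the first, so it receives a share of the probability.
labels = ["配達員", "営業", "警察官"]
text_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
image_vec = [1.0, 0.02, 0.0]

for label, prob in zip(labels, classify(image_vec, text_vecs)):
    print(f"{label}: {prob:.3f}")
```

Because the softmax spreads probability over whatever candidates you supply, the scores are relative to the category list and say nothing about labels that are not in it.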

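Extracting features from images looks like an interesting capability. The sample code checks how well an image matches a fixed set of categories, but there are probably other ways to use it.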
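
One possible direction, which I have not tried with the actual model, is text-to-image search: embed a set of images once, then rank them against the embedding of a text query. The sketch below only shows the ranking step, with made-up 3-dimensional vectors. `rank_images` and `cosine` are hypothetical helpers; in practice the vectors would come from `compute_image_embeddings` and `compute_text_embeddings`.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(
    query_vec: List[float],
    image_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k image names with the highest similarity to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for precomputed image embeddings (values are made up).
image_vecs = {
    "sample1.png": [0.9, 0.1, 0.0],
    "sample2.png": [0.1, 0.9, 0.1],
    "sample3.png": [0.0, 0.2, 0.9],
}
query_vec = [0.0, 0.1, 1.0]  # stand-in for a text query embedding

for name, score in rank_images(query_vec, image_vecs):
    print(f"{name}: {score:.3f}")
```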
