List of Large Language Vision Models

An incomplete list of large language (computer) vision models with a promptable interface you can run locally on your computer

Sung Kim
Dev Genius

--

This is an incomplete list of large language (computer) vision models with a promptable interface that you can run locally on your computer. These models can “cut out” any object in any image with a single click, as well as replace those objects with other content using a prompt.

Updates

  • 04/21/2023: Added samgeo and Anything-3D
  • 04/24/2023: Added Track Anything
  • 04/27/2023: Added Toronto Annotation Suite (TORAS), Caption-Anything, SAD: Segment Any RGBD, and Open-vocabulary-Segment-Anything
  • 04/30/2023: Added Ask-Anything
  • 05/05/2023: Added Personalize Segment Anything with 1 Shot in 10 Seconds
  • 05/09/2023: Added ImageBind and mPLUG-Owl
  • 05/24/2023: Added CoDi
  • 06/08/2023: Added Vocabulary-free Image Classification, Segment Anything in High Quality
  • 06/16/2023: Added Tracking Any Point (TAP)/TAPIR

Segment Anything

SAM is a generalization of two classes of approaches: interactive segmentation and automatic segmentation. It is a single model that can easily perform both. The model’s promptable interface allows it to be used in flexible ways that make a wide range of segmentation tasks possible simply by engineering the right prompt for the model (clicks, boxes, text, and so on).
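
To make the promptable interface concrete, here is a minimal sketch using the official segment-anything package; the checkpoint filename and image path are placeholders you would supply yourself.

```python
# Minimal sketch of SAM's promptable interface: one foreground click "cuts out" an object.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) of a single click
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return several candidate masks
)
best_mask = masks[np.argmax(scores)]      # pick the highest-scoring mask
```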

Here is a list of projects using the Segment Anything Model:

  • Anything-3D
  • Caption-Anything
  • Edit Anything by Segment-Anything
  • Grounded-Segment-Anything
  • Inpaint Anything: Segment Anything Meets Image Inpainting
  • Magic Copy
  • Open-vocabulary-Segment-Anything
  • Personalize Segment Anything with 1 Shot in 10 Seconds
  • Prompt-Segment-Anything
  • SAD: Segment Any RGBD
  • SAM-Adaptor
  • samgeo
  • Segment Anything in High Quality
  • Segment Anything Labelling Tool (SALT)
  • segment-anything-py
  • Toronto Annotation Suite (TORAS)
  • Track Anything

Anything-3D

Here we present a project where we combine Segment Anything with a series of 3D models to create a very interesting demo. This is currently a small project, but we plan to continue improving it and creating more exciting demos.

Anything-3D

Caption-Anything

Caption-Anything is a versatile image processing tool that combines the capabilities of Segment Anything, Visual Captioning, and ChatGPT. Our solution generates descriptive captions for any object within an image, offering a range of language styles to accommodate diverse user preferences. It supports visual controls (mouse click) and language controls (length, sentiment, factuality, and language).

Caption-Anything

Edit Anything by Segment-Anything

This is an ongoing project that aims to edit and generate anything in an image, powered by Segment Anything, ControlNet, BLIP2, Stable Diffusion, etc.

Edit Anything by Segment-Anything

Grounded-Segment-Anything

The core idea behind this project is to combine the strengths of different models in order to build a very powerful pipeline for solving complex problems. It is worth mentioning that this is a workflow for combining strong expert models, where all parts can be used separately or in combination, and can be replaced with similar but different models (e.g., replacing Grounding DINO with GLIP or other detectors, replacing Stable Diffusion with ControlNet or GLIGEN, or combining the pipeline with ChatGPT).
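
As a rough illustration of this kind of pipeline (not the project’s own code), the sketch below pairs a hypothetical open-vocabulary detector helper, standing in for Grounding DINO or GLIP, with SAM’s real box-prompt API.

```python
# Sketch of the text-prompt -> boxes -> masks workflow. detect_boxes() is a
# hypothetical stand-in for an open-vocabulary detector such as Grounding DINO
# or GLIP; only the SAM calls use the real segment-anything API.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def detect_boxes(image: np.ndarray, text_prompt: str) -> list[list[float]]:
    """Hypothetical helper: return [x0, y0, x1, y1] boxes for objects matching the text."""
    raise NotImplementedError

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)      # placeholder image
predictor.set_image(image)

for box in detect_boxes(image, "the running dog"):
    masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
    # masks[0] can now be handed to an inpainting/editing model such as Stable Diffusion.
```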

Speak to edit🎨: Whisper + ChatGPT + Grounded-SAM + SD

Inpaint Anything: Segment Anything Meets Image Inpainting

TL;DR: Users can select any object in an image by clicking on it. With powerful vision models, e.g., SAM, LaMa, and Stable Diffusion (SD), Inpaint Anything can remove the object smoothly (i.e., Remove Anything). Further, prompted by user input text, Inpaint Anything can fill the object with any desired content (i.e., Fill Anything) or arbitrarily replace its background (i.e., Replace Anything).
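
Inpaint Anything wires SAM together with LaMa and Stable Diffusion; as a rough stand-in for the “Fill Anything” step, the sketch below feeds a saved SAM mask to the Hugging Face diffusers inpainting pipeline (an illustration, not the project’s own code; the file names are placeholders).

```python
# Sketch of the "Fill Anything" idea: a SAM mask plus a text prompt drive an
# off-the-shelf inpainting model (diffusers), filling the clicked object region.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg").convert("RGB").resize((512, 512))    # placeholder image
sam_mask = np.load("mask_from_sam.npy")                              # boolean HxW mask from a SAM click
mask_image = Image.fromarray((sam_mask * 255).astype(np.uint8)).resize((512, 512))

result = pipe(prompt="a wooden bench", image=image, mask_image=mask_image).images[0]
result.save("filled.png")
```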

Inpaint Anything

Magic Copy

Magic Copy is a Chrome extension that uses Meta’s Segment Anything Model to extract a foreground object from an image and copy it to the clipboard.

Magic Copy

Open-vocabulary-Segment-Anything

An interesting demo combining Google’s OWL-ViT and Meta’s Segment Anything!

Personalize Segment Anything with 1 Shot in 10 Seconds

In this project, we propose a training-free personalization approach for the Segment Anything Model (SAM), termed PerSAM. Given only a single image with a reference mask, PerSAM can segment specific visual concepts, e.g., your pet dog, within other images or videos without any training. For better performance, we further present an efficient one-shot fine-tuning variant, PerSAM-F. We freeze the entire SAM and introduce two learnable mask weights, training only 2 parameters within 10 seconds.

Prompt-Segment-Anything

This is an implementation of zero-shot instance segmentation using Segment Anything. Thanks to the authors of Segment Anything for their wonderful work! This repository is based on MMDetection and includes some code from H-Deformable-DETR and FocalNet-DINO.

Prompt-Segment-Anything

SAD: Segment Any RGBD

We find that humans can naturally identify objects from the visualization of the depth map, so we first map the depth map ([H, W]) to the RGB space ([H, W, 3]) by a colormap function, and then feed the rendered depth image into SAM. Compared to the RGB image, the rendered depth image ignores the texture information and focuses on the geometry information. In other SAM-based projects like SSA, Anything-3D, and SAM 3D, the input images to SAM are all RGB images; we are the first to use SAM to extract the geometry information directly. The following figures show that depth maps rendered with different colormap functions produce different SAM results.
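
A minimal sketch of the rendering step described above, using a matplotlib colormap (the depth file name is a placeholder); swapping the colormap changes the rendered image and, as noted, can change SAM’s output.

```python
# Render a [H, W] depth map into an [H, W, 3] RGB image with a colormap,
# so it can be fed to SAM like any ordinary image.
import numpy as np
import matplotlib.cm as cm

depth = np.load("depth.npy").astype(np.float32)              # placeholder depth map, shape [H, W]
depth = (depth - depth.min()) / (depth.max() - depth.min())  # normalize to [0, 1]

rendered = cm.get_cmap("viridis")(depth)[..., :3]            # [H, W, 3] floats in [0, 1]
rendered = (rendered * 255).astype(np.uint8)                 # RGB image SAM can consume
# predictor.set_image(rendered)  # then prompt SAM as usual
```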

SAD

SAM-Adaptor

SAM-Adaptor incorporates domain-specific information or visual prompts into the segmentation network by using simple yet effective adaptors. Our extensive experiments show that SAM-Adaptor can significantly elevate the performance of SAM on challenging tasks, and we can even achieve state-of-the-art performance. We believe our work opens up opportunities for utilizing SAM in downstream tasks, with potential applications in various fields, including medical image processing, agriculture, remote sensing, and more.

SAM-Adaptor

samgeo

A Python package for segmenting geospatial data with the Segment Anything Model (SAM) 🗺️

The segment-geospatial package draws its inspiration from segment-anything-eo repository authored by Aliaksandr Hancharenka. To facilitate the use of the Segment Anything Model (SAM) for geospatial data, I have developed the segment-anything-py and segment-geospatial Python packages, which are now available on PyPI and conda-forge. My primary objective is to simplify the process of leveraging SAM for geospatial data analysis by enabling users to achieve this with minimal coding effort. I have adapted the source code of segment-geospatial from the segment-anything-eo repository, and credit for its original version goes to Aliaksandr Hancharenka.
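
A short sketch of the intended workflow, assuming the samgeo package exposes a SamGeo class with a generate() method as its documentation describes; the file names are placeholders and exact arguments may differ by version.

```python
# Sketch: segment a GeoTIFF with the segment-geospatial (samgeo) package.
from samgeo import SamGeo

sam = SamGeo(model_type="vit_h", checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint path
sam.generate(source="satellite_image.tif", output="segment_masks.tif")
```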

samgeo

Segment Anything in High Quality

We propose HQ-SAM to upgrade SAM for high-quality zero-shot segmentation. Refer to our paper for more details. Our code and models will be released in two weeks. Stay tuned!

Segment Anything Labelling Tool (SALT)

SALT uses the Segment Anything Model by Meta AI and adds a barebones interface to label images, saving the masks in the COCO format.
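
Not SALT’s own code, but a small sketch of how a binary SAM mask is typically stored as a COCO annotation using pycocotools run-length encoding; the mask file name is a placeholder.

```python
# Encode a binary mask as COCO RLE and assemble a minimal annotation entry.
import numpy as np
from pycocotools import mask as mask_utils

binary_mask = np.load("mask_from_sam.npy").astype(np.uint8)  # placeholder HxW mask with values 0/1
rle = mask_utils.encode(np.asfortranarray(binary_mask))
rle["counts"] = rle["counts"].decode("utf-8")                # make the RLE JSON-serializable

annotation = {
    "segmentation": rle,
    "area": float(mask_utils.area(rle)),
    "bbox": [float(v) for v in mask_utils.toBbox(rle)],      # [x, y, width, height]
    "category_id": 1,                                        # placeholder category
    "iscrowd": 0,
}
```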

Segment Anything Labelling Tool (SALT)

Semantic Segment Anything (SSA)

The Semantic Segment Anything (SSA) project enhances the Segment Anything dataset (SA-1B) with a dense category annotation engine. SSA is an automated annotation engine that serves as the initial semantic labeling for the SA-1B dataset, although human review and refinement may be required for more accurate labeling. Thanks to the combined architecture of closed-set segmentation and open-vocabulary segmentation, SSA produces satisfactory labels for most samples and can provide more detailed annotations using an image captioning method. This tool fills the gap in SA-1B’s limited fine-grained semantic labeling, while also significantly reducing the need for manual annotation and its associated costs. It has the potential to serve as a foundation for training large-scale visual perception models and more fine-grained CLIP models.

segment-anything-py

An unofficial Python package for Meta AI’s Segment Anything Model
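
A minimal sketch, assuming segment-anything-py mirrors the official segment_anything import path and its automatic mask generator; the checkpoint and image paths are placeholders.

```python
# Generate masks for a whole image automatically (no prompts needed).
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint path
generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)      # placeholder image
masks = generator.generate(image)  # list of dicts: "segmentation", "bbox", "area", "predicted_iou", ...
```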

Toronto Annotation Suite (TORAS)

Toronto Annotation Suite (TORAS) is a web-based, AI-powered labeling platform. TORAS provides a repository system that allows efficient data management and collaboration between users. Data annotation can be done in a consistent way with “recipes”, blueprints of annotation tasks that define the type of task and how an annotator should annotate the data. TORAS is equipped with human-in-the-loop AI tools as well as practical editing tools that enable productive data annotation.

Track Anything

Track-Anything is a flexible and interactive tool for video object tracking and segmentation. Developed upon Segment Anything, it can track and segment anything specified via user clicks only. During tracking, users can flexibly change the objects they want to track or correct the region of interest if there are any ambiguities.

Track Anything

Ask-Anything

Currently, Ask-Anything is a simple yet interesting tool for chatting with video. Our team is trying to build a smart and robust chatbot that can understand video.

Ask-Anything

CoDi

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis.

ImageBind: Holistic AI learning across six modalities

ImageBind is the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position. ImageBind equips machines with a holistic understanding that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move.
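
Following the usage shown in the ImageBind repository (import paths and helper names may vary by version), a sketch of embedding text, images, and audio into the shared space and comparing them looks roughly like this; the file paths are placeholders.

```python
# Embed three modalities into ImageBind's shared space and compare vision vs. text.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "car.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["bark.wav", "engine.wav"], device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarity: which caption best matches each image?
print(torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
```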

ImageBind

🌋 LLaVA: Large Language and Vision Assistant

LLaVA represents a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.

LLaVA: Large Language and Vision Assistant

MiniGPT-4

Enhancing Vision-language Understanding with Advanced Large Language Models

MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
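
A conceptual sketch of that single projection layer (not MiniGPT-4’s actual code; the dimensions are illustrative placeholders): frozen visual features are linearly mapped into the frozen LLM’s token-embedding space, and only this layer is trained in the first stage.

```python
# Conceptual sketch: the only trainable piece is one linear projection that maps
# frozen visual tokens (e.g., from BLIP-2's Q-Former) into the LLM embedding space.
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):  # illustrative dims
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable weights

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: [batch, num_query_tokens, vision_dim] from the frozen encoder;
        # the output is prepended to the text embeddings fed to the frozen Vicuna LLM.
        return self.proj(visual_tokens)

projected = VisualProjection()(torch.randn(1, 32, 768))  # -> [1, 32, 4096]
```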

We train MiniGPT-4 in two stages. The first, traditional pretraining stage is trained on roughly 5 million aligned image-text pairs in 10 hours using 4 A100s. After the first stage, Vicuna is able to understand the image, but its generation ability is heavily impacted.

To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs by the model itself and ChatGPT together. Based on this, we then create a small (3500 pairs in total) yet high-quality dataset.

The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes with a single A100.

MiniGPT-4 yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.

MiniGPT-4

mPLUG-Owl🦉: Modularization Empowers Large Language Models with Multimodality

  • A new training paradigm with a modularized design for large multi-modal language models.
  • Learns visual knowledge while supporting multi-turn conversations consisting of different modalities (images/videos/texts).
  • Observed abilities such as multi-image correlation, scene text understanding, and vision-based document comprehension.
  • Releases a visually-related instruction evaluation set, OwlEval.

mPLUG-Owl

Tracking Any Point (TAP)/TAPIR (Deepmind)

TAPIR is a two-stage algorithm: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model is fast and surpasses all prior methods by a significant margin on the TAP-Vid benchmark.

TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement (deepmind-tapir.github.io)

Vocabulary-free Image Classification

Vocabulary-free Image Classification aims to assign a class to an image without prior knowledge of the list of class names, thus operating on the semantic class space that contains all possible concepts. Our proposed method, CaSED, finds the best matching category within this unconstrained semantic space using multimodal data from large vision-language databases.

X-Decoder: Generalized Decoding for Pixel, Image and Language

We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as inputs two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.

Instructional Image Editing

I hope you have enjoyed this article. If you have any questions or comments, please provide them here.
