LAION

From HandWiki
Revision as of 18:58, 6 March 2023 by JMinHep (talk | contribs) (link)
Short description: Large-scale Artificial Intelligence Open Network, a German non-profit

The Large-scale Artificial Intelligence Open Network (LAION) is a German non-profit with a stated goal "to make large-scale machine learning models, datasets and related code available to the general public".[1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen.[2]

In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stability AI concerning Stable Diffusion.[3]

Image datasets

LAION has publicly released several large datasets of image-caption pairs, which have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They then used CLIP to identify and discard images whose content did not appear to match their captions.[4] LAION does not host the scraped images themselves; rather, the datasets contain URLs pointing to the images, which researchers must download themselves.[5]
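The caption-extraction step described above can be illustrated with a minimal sketch using only the Python standard library. This is not LAION's actual pipeline code: the class name `ImgAltExtractor`, the function `extract_pairs`, and the choice to skip images with empty alt text are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgAltExtractor(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags.

    Hypothetical helper for illustration; not taken from LAION's code.
    """

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        # A valueless attribute is reported as None, so fall back to "".
        alt = (attr.get("alt") or "").strip()
        # Assumption: an image without a usable caption is skipped.
        if src and alt:
            # Resolve relative image paths against the page URL.
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url=""):
    """Return every (absolute image URL, caption) pair found in html."""
    parser = ImgAltExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling `extract_pairs(page_html, page_url)` on a crawled page returns only URL-caption pairs, which mirrors the point that the datasets store links rather than image content. The subsequent CLIP-based filtering is a separate step and is not shown here.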

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.[6] It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs used to train the CLIP model; the company had chosen to open-source the model's code and weights, but not its training dataset.[4] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.[7]

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022.[8] At the time of its release, it was the largest freely available dataset of image-caption pairs.[4] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the company that funded the development of the Stable Diffusion text-to-image model, which was trained on LAION-5B.[9]

Example entry

An example of one of the billions of images in the LAION-5B dataset.

Below is an example of the metadata associated with one entry in the LAION-5B dataset. The image content itself is not stored in the dataset, but is only linked to via the URL field:[10]

URL: https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg
Text: Ammodorcas clarkei The book of antelopes (1894).jpg
Width: 275 (measured in pixels)
Height: 311
Similarity: 0.34972 (cosine similarity between the image and caption, as measured using CLIP. Any pairs having similarity values less than 0.3 were discarded from the dataset.)
Pwatermark: 0.30022 (estimated probability that this image bears a watermark, as determined by an AI model)
Punsafe: 0.0000001688 (estimated probability that this image is "not safe for work", as determined by an AI model)
Aesthetic: 6.02298 (estimated score that a human rater would assign the aesthetics of this image, on a scale from 1 to 10)
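As a rough illustration of how such an entry and the similarity cutoff might be represented, the sketch below models the fields listed above and applies the 0.3 threshold. The `Entry` class, its lowercase field names, and the helper functions are hypothetical and are not LAION's schema or code. In the real dataset the similarity is computed from CLIP image and text embeddings; the `cosine_similarity` function here only shows the formula on plain lists of numbers.

```python
import math
from dataclasses import dataclass

# Cutoff stated in the entry description: pairs below 0.3 were discarded.
SIMILARITY_THRESHOLD = 0.3


@dataclass
class Entry:
    """One metadata record; field names are illustrative assumptions."""
    url: str
    text: str
    width: int
    height: int
    similarity: float
    pwatermark: float
    punsafe: float
    aesthetic: float


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def keep(entry, threshold=SIMILARITY_THRESHOLD):
    """Return True if the pair survives the similarity filter."""
    return entry.similarity >= threshold


# The example entry from the text above.
example = Entry(
    url=(
        "https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/"
        "Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/"
        "275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg"
    ),
    text="Ammodorcas clarkei The book of antelopes (1894).jpg",
    width=275,
    height=311,
    similarity=0.34972,
    pwatermark=0.30022,
    punsafe=0.0000001688,
    aesthetic=6.02298,
)
```

With these definitions, `keep(example)` is true because 0.34972 is not below 0.3. The other scores (Pwatermark, Punsafe, Aesthetic) could be used for further filtering in the same way, though any thresholds for them would be a downstream choice rather than something stated in this article.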

References

  1. "About". LAION.ai. https://laion.ai/about/. 
  2. Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica. https://arstechnica.com/information-technology/2022/09/have-ai-image-generators-assimilated-your-art-new-tool-lets-you-check/. 
  3. "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". https://www.courtlistener.com/docket/66788385/getty-images-us-inc-v-stability-ai-inc/.
  4. Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ. https://www.infoq.com/news/2022/05/laion-5b-image-text-dataset/.
  5. Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica. https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/. 
  6. Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. https://laion.ai/blog/laion-400-open-dataset/. 
  7. Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Karagol Ayan, Burcu et al. (23 May 2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.
  8. Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog. https://laion.ai/blog/laion-5b/. 
  9. Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch. https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/. 
  10. "image 17024". LAION Aesthetic 6+ dataset explorer. https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images/17024.