LAION
The Large-scale Artificial Intelligence Open Network (LAION) is a German non-profit with a stated goal "to make large-scale machine learning models, datasets and related code available to the general public".[1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen.[2]
In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stability AI over Stable Diffusion.[3]
Image datasets
LAION has publicly released a number of large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They used CLIP to identify and discard images whose content did not appear to match their captions.[4] LAION does not host the content of scraped images themselves; rather, the dataset contains URLs pointing to images, which researchers must download themselves.[5]
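The caption-extraction step described above can be illustrated with a minimal sketch using Python's standard-library HTML parser. This is not LAION's actual code: the class name `ImgAltExtractor` and the helper `extract_pairs` are hypothetical, and the real pipeline ran over Common Crawl data at scale and then applied CLIP filtering, which is omitted here.

```python
from html.parser import HTMLParser


class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags with a non-empty alt text.

    Hypothetical helper for illustration; not part of any LAION tooling.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing <img/> also lands here.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # A valueless attribute (e.g. <img alt>) is reported as None.
        alt = (attr_map.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return a list of (image URL, candidate caption) pairs found in html."""
    parser = ImgAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = (
    '<p><img src="a.jpg" alt="A red bicycle">'
    '<img src="b.png">'
    '<img src="c.gif" alt="  "/></p>'
)
pairs = extract_pairs(sample)
```

In this sketch only the first image yields a pair, since the other two have a missing or blank alt attribute and therefore no usable caption.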
The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.[6] It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs used to train the CLIP model; the company had chosen to open-source the model's code and weights, but not its training dataset.[4] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.[7]
A successor containing more than 5 billion pairs, LAION-5B, was released in March 2022.[8] At the time of its release, it was the largest freely available dataset of image-caption pairs in existence.[4] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the company that funded the Stable Diffusion text-to-image model, which was trained on it.[9]
Example entry
Below is an example of the metadata associated with one entry in the LAION-5B dataset. The image content itself is not stored in the dataset; it is only linked to via the URL field:[10]
- URL: https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg
- Text: Ammodorcas clarkei The book of antelopes (1894).jpg
- Width: 275 (measured in pixels)
- Height: 311 (measured in pixels)
- Similarity: 0.34972 (cosine similarity between the image and caption, as measured using CLIP. Any pairs having similarity values less than 0.3 were discarded from the dataset.)
- Pwatermark: 0.30022 (estimated probability that this image bears a watermark, as determined by an AI model)
- Punsafe: 0.0000001688 (estimated probability that this image is "not safe for work", as determined by an AI model)
- Aesthetic: 6.02298 (estimated score that a human rater would assign the aesthetics of this image, on a scale from 1 to 10)
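As a rough illustration of the Similarity field and the 0.3 cut-off mentioned above, the sketch below computes a cosine similarity and filters metadata entries by threshold. It is a simplified assumption-laden example, not LAION's implementation: in the real pipeline the vectors would be CLIP image and text embeddings, whereas here they are toy lists, and the lowercase dictionary keys, the `keep_pair` helper and the `example.com` URLs are hypothetical.

```python
import math

MIN_SIMILARITY = 0.3  # cut-off stated in the example entry above


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def keep_pair(entry, min_similarity=MIN_SIMILARITY):
    """Keep an entry unless its similarity is below the threshold."""
    return entry["similarity"] >= min_similarity


# Toy metadata records; keys and URLs are illustrative only.
entries = [
    {"url": "https://example.com/a.jpg", "text": "an antelope", "similarity": 0.34972},
    {"url": "https://example.com/b.jpg", "text": "click here", "similarity": 0.12},
]
kept = [e for e in entries if keep_pair(e)]
```

Only the first record survives the filter here, mirroring how the example entry (similarity 0.34972) clears the 0.3 threshold.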
References
- [1] "About". LAION.ai. https://laion.ai/about/.
- [2] Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica. https://arstechnica.com/information-technology/2022/09/have-ai-image-generators-assimilated-your-art-new-tool-lets-you-check/.
- [3] "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". https://www.courtlistener.com/docket/66788385/getty-images-us-inc-v-stability-ai-inc/.
- [4] Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ. https://www.infoq.com/news/2022/05/laion-5b-image-text-dataset/.
- [5] Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica. https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/.
- [6] Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. https://laion.ai/blog/laion-400-open-dataset/.
- [7] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu et al. (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding".
- [8] Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog. https://laion.ai/blog/laion-5b/.
- [9] Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch. https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/.
- [10] "image 17024". LAION Aesthetic 6+ dataset explorer. https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images/17024.
Original source: https://en.wikipedia.org/wiki/LAION.