Cortex: The Biggest AI Dataset


Description

Problem

Deep neural networks can solve a wide variety of machine learning problems well.
Deep neural networks used in industry applications usually work best when they are trained with supervised learning, provided that:
- a lot of data is available,
- the training data comes from the same distribution as the data in the production environment, and
- the labels and data are of high quality.
Large amounts of data are available on and off the Internet, but in raw form they are not useful for building machine learning solutions.

Solution

Addressing the whole process of:
- Collecting large sets of data at the business process level
- Preparing and labeling the data for use in training and evaluating deep neural networks
- Assuring the quality of the collected data and labels
Using a multidisciplinary approach (technical, social, ethical, legal, ...)
Automating the process

Data Collection, Labeling and QA Process

Mission Statement

To create the biggest high-quality labeled dataset for building machine learning models
To create a foundation dataset for artificial general intelligence

Current Dataset Statistics

FAQ

Does Cortex respect copyright laws?

The Cortex dataset stores only URLs that reference the original images. Images are scraped from the Common Crawl database, held temporarily in RAM, and discarded once labeling is done. Anyone using the dataset must download the images they are interested in themselves; we suggest using the img2dataset tool for this.
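
As a rough illustration of that download step, here is a minimal sketch using the Python API of img2dataset. It assumes you have already selected a subset of the dataset as (URL, label) pairs. The column names `url` and `label`, the file name, and the output folder are hypothetical choices made for this example, not the actual Cortex schema.

```python
import csv
from pathlib import Path


def write_url_list(records, path):
    """Write (url, label) pairs to a CSV file with a header row.

    The column names "url" and "label" are assumptions made for this
    sketch; adapt them to the columns of the dataset file you use.
    """
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "label"])
        for url, label in records:
            writer.writerow([url, label])
    return path


def download_images(url_list, output_folder):
    """Fetch the referenced images with img2dataset.

    Requires `pip install img2dataset`. The import is done lazily so
    that write_url_list() stays usable without the package installed.
    """
    from img2dataset import download

    download(
        url_list=str(url_list),
        input_format="csv",        # the file written by write_url_list()
        url_col="url",             # assumed column name, see above
        caption_col="label",       # stores the label next to each image
        output_folder=str(output_folder),
        output_format="files",     # plain image files plus metadata
        image_size=256,            # resize target, pick what your model needs
        processes_count=1,
        thread_count=16,
    )
```

To use it, call `write_url_list(records, "cortex_subset.csv")` with your selected pairs and pass the returned path to `download_images(path, "cortex_images")`. URLs that are no longer reachable will simply be missing from the output, so expect the downloaded set to be somewhat smaller than the URL list.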