Cortex: The Biggest AI Dataset
Description
Problem
- Deep neural networks can solve a wide variety of machine learning problems well
- Deep neural networks used in industry applications usually work best when they are trained with supervised learning, given that:
  - there is a large amount of data available,
  - the training data comes from the same distribution as the data in the production environment, and
  - the data and labels are of high quality
- Large amounts of data are available on and outside the Internet, but in raw form they are not useful for building machine learning solutions
Solution
- Solving the end-to-end process of:
  - Collecting large datasets at the business-process level
  - Preparing and labeling the data for training and evaluation of deep neural networks
  - Quality assurance of the collected data and labels
- Using a multidisciplinary approach (technical, social, ethical, legal, ...)
- Automating the process
Data Collection, Labeling and QA Process
Mission Statement
- To create the biggest high-quality labeled dataset for building machine learning models
- To create a foundation dataset for artificial general intelligence
Current Dataset Statistics
FAQ
Does Cortex respect copyright laws?
The Cortex dataset only references URLs to the original images. Images are scraped from the Common Crawl database, temporarily stored in RAM, and discarded once labeling is done. Anyone using the dataset must download the images they are interested in themselves; the suggested tool is img2dataset. URLs inserted through the /upload endpoint are not exposed through the /get-labeled-data endpoint if they cannot be scraped from the Common Crawl database.
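As a minimal sketch of that workflow, the snippet below downloads the images behind a list of dataset URLs using the img2dataset Python API. The file name cortex_urls.txt and the specific parameter values are assumptions (they depend on how you export the URLs and on your img2dataset version), not part of the Cortex tooling itself.

```python
# Sketch: fetch images referenced by the Cortex dataset with img2dataset.
# Assumes a local file "cortex_urls.txt" with one image URL per line,
# exported by the user from the labeled data they retrieved.
from img2dataset import download

download(
    url_list="cortex_urls.txt",     # assumed local export of dataset URLs
    input_format="txt",             # plain text, one URL per line
    output_folder="cortex_images",  # where downloaded images are written
    output_format="files",          # store images as individual files
    image_size=256,                 # resize images during download
    processes_count=8,              # tune to available CPU cores
    thread_count=64,                # tune to available bandwidth
)
```

Downloading locally like this keeps the dataset itself free of image bytes: only the URL references are distributed, and each user resolves them on their own machine.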
Papers