Cortex: The Biggest AI Dataset

Description

Problem

  • Deep neural networks can solve a wide variety of machine learning problems well

  • Deep neural networks used in industry applications usually work best when trained with supervised learning, provided that:

    • there is a lot of data available,

    • the training data comes from the same distribution as the data in the production environment, and

    • the labels and data are of high quality

  • Large amounts of data are available on and outside the Internet, but in raw form they are not useful for building machine learning solutions

Solution

  • Handling the end-to-end process of:

    • Collecting large sets of data at the business process level

    • Preparing and labeling the data for use in training/evaluation processes of deep neural networks

    • Quality assurance of collected data and labels

  • Using a multidisciplinary approach (technical, social, ethical, legal, ...)

  • Automating the process

Data Collection, Labeling and QA Process

Mission Statement

  • To create the biggest high-quality labeled dataset for building machine learning models

  • To create a foundation dataset for artificial general intelligence

Current Dataset Statistics

FAQ

Does Cortex respect copyright laws?

 

The Cortex dataset stores only URLs that reference the original images, not the images themselves. Images are scraped from the Common Crawl database, held temporarily in RAM, and discarded once labeling is done. Anyone using the dataset must download the images they are interested in themselves; the suggested tool is img2dataset. URLs inserted through the /upload endpoint are not exposed through the /get-labeled-data endpoint if they cannot be scraped from the Common Crawl database.
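
As a rough illustration of the download step described above, the sketch below turns a list of labeled records into a CSV file of URLs and labels that a tool such as img2dataset can read. The record shape (a dict with "url" and "label" keys), the function name write_url_list, and the file name are assumptions made for this example; the actual response format of the /get-labeled-data endpoint is not documented here and may differ.

```python
import csv
from pathlib import Path


def write_url_list(records, path):
    """Write unique (url, label) rows to a CSV file and return the row count.

    `records` is assumed to be an iterable of dicts with "url" and "label"
    keys. This shape is hypothetical, not the documented Cortex format.
    """
    seen = set()
    rows = []
    for record in records:
        url = record.get("url")
        # Skip records without a URL and URLs that were already written.
        if not url or url in seen:
            continue
        seen.add(url)
        rows.append({"url": url, "label": record.get("label", "")})

    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical records, standing in for data obtained from the dataset.
example_records = [
    {"url": "https://example.com/a.jpg", "label": "cat"},
    {"url": "https://example.com/b.jpg", "label": "dog"},
    {"url": "https://example.com/a.jpg", "label": "cat"},  # duplicate URL
    {"label": "no url here"},  # skipped: missing URL
]

if __name__ == "__main__":
    count = write_url_list(example_records, "cortex_urls.csv")
    print(f"wrote {count} rows")
```

The resulting file could then be passed to img2dataset, for instance with --url_list=cortex_urls.csv --input_format=csv --url_col=url --caption_col=label --output_folder=images. Check the img2dataset documentation for the options that apply to your version.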

Papers

Copyright © 2024 Piculjan Technologies LLC