    Repositories list

    • This dataset provides a glossary of AI terms in Swahili, Zulu, Xhosa, Afrikaans, English (as the common core), and other languages widely spoken in Africa. It's…
      HTML
      Creative Commons Attribution Share Alike 4.0 International
      1320 Updated Apr 21, 2026
    • locqa

      Public
      Dataset release for "Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs", ACL 2026.
      0000 Updated Apr 19, 2026
    • Python
      Apache License 2.0
      35254 Updated Apr 8, 2026
    • Objectron

      Public
      Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-cloud…
      Jupyter Notebook
      Other
      2652.3k310 Updated Mar 6, 2026
    • An annotated dataset of 9,003 adversarial queries in seven Sub-Saharan African languages.
      Jupyter Notebook
      3400 Updated Jan 27, 2026
    • SAFARI

      Public
      The dataset consists of stereotypes collected in four Sub-Saharan African countries for the purpose of AI model evaluations.
      0100 Updated Jan 24, 2026
    • The dataset consists of globally situated cultural artifacts, covering 29 countries and many key aspects of culture.
      1100 Updated Jan 22, 2026
    • artydiqa

      Public
      ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA …
      0000 Updated Dec 18, 2025
    • MGSM-Rev2

      Public
      To improve the MGSM benchmark, we corrected two erroneous English questions and rephrased others to remove ambiguity. We then used Gemini to retranslate all que…
      1001 Updated Nov 10, 2025
    • Other
      0510 Updated Oct 13, 2025
    • The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.
      Apache License 2.0
      0200 Updated Aug 6, 2025
    • Jupyter Notebook
      Apache License 2.0
      0000 Updated Jul 30, 2025
    • Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
      Other
      1842260 Updated Jul 14, 2025
    • Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
      MIT License
      12010 Updated Jun 16, 2025
    • CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding confli…
      Apache License 2.0
      11310 Updated Jun 11, 2025
    • egotempo

      Public
      Jupyter Notebook
      Creative Commons Attribution 4.0 International
      02630 Updated Apr 26, 2025
    • Images gathered from the Internet in 2023 and some metadata
      HTML
      Other
      1300 Updated Mar 19, 2025
    • screen_qa

      Public
      The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K question-answer pairs c…
      Python
      Creative Commons Attribution 4.0 International
      1014540 Updated Feb 7, 2025
    • adversarial-nibbler

      Public archive
      This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and va…
      Creative Commons Attribution 4.0 International
      42600 Updated Feb 3, 2025
    • cube

      Public archive
      CUBE is a benchmark to evaluate the Cultural Competence of T2I models
      Creative Commons Attribution 4.0 International
      1830 Updated Jan 20, 2025
    • Jupyter Notebook
      Apache License 2.0
      176940 Updated Jan 17, 2025
    • hiertext

      Public archive
      The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word, line and paragraph level annotati…
      Jupyter Notebook
      Creative Commons Attribution Share Alike 4.0 International
      2930901 Updated Dec 2, 2024
    • scin

      Public
      The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-report…
      Jupyter Notebook
      Other
      2115620 Updated Nov 23, 2024
    • MISeD

      Public
      MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transc…
      31500 Updated Nov 20, 2024
    • uicrit

      Public archive
      UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for…
      02610 Updated Nov 19, 2024
    • WordGraph

      Public
      The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon …
      Creative Commons Zero v1.0 Universal
      1300 Updated Nov 7, 2024
    • Dataset of conversations, generated by prompting Gemini Ultra. These are conversations between a teacher and a student, where the teacher is prompted with speci…
      93610 Updated Oct 29, 2024
    • GeniL

      Public archive
      The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, …
      Creative Commons Attribution 4.0 International
      0300 Updated Oct 18, 2024
    • The Tap Typing with Touch Sensing Images (TSI) dataset contains data of user taps on a mobile touchscreen keyboard, including elliptical features and capacitive…
      Creative Commons Attribution 4.0 International
      1300 Updated Oct 15, 2024
    • mittens

      Public archive
      Datasets for measuring misgendering in translation
      Other
      0500 Updated Oct 4, 2024