google-research-datasets repositories

ssa-ai-terminologies

Public

This dataset provides a glossary of AI terms in Swahili, Zulu, Xhosa, Afrikaans, English (as the common core), and other languages widely spoken in Africa. It's…

HTML • Creative Commons Attribution Share Alike 4.0 International • 1 • 3 • 2 • 0 • Updated Apr 21, 2026

locqa

Public

Dataset release for "Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs", ACL 2026.

0 • 0 • 0 • 0 • Updated Apr 19, 2026

sanpo_dataset

Public

Python • Apache License 2.0 • 3 • 52 • 5 • 4 • Updated Apr 8, 2026

Objectron

Public

Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point-cloud…

python machine-learning ai computer-vision deep-learning neural-network tensorflow augmented-reality pytorch dataset

Jupyter Notebook • Other • 265 • 2.3k • 31 • 0 • Updated Mar 6, 2026

Amplify_SSA

Public

An annotated dataset of 9,003 adversarial queries in seven Sub-Saharan African languages.

Jupyter Notebook • 3 • 4 • 0 • 0 • Updated Jan 27, 2026

SAFARI

Public

The dataset consists of stereotypes collected in four Sub-Saharan African countries for the purpose of AI model evaluation.

0 • 1 • 0 • 0 • Updated Jan 24, 2026

SCALE-Cultural-Data

Public

The dataset consists of globally situated cultural artifacts, covering 29 countries and many key aspects of culture.

1 • 1 • 0 • 0 • Updated Jan 22, 2026

artydiqa

Public

ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA …

0 • 0 • 0 • 0 • Updated Dec 18, 2025

MGSM-Rev2

Public

To improve the MGSM benchmark, we corrected two erroneous English questions and rephrased others to remove ambiguity. We then used Gemini to retranslate all que…

1 • 0 • 0 • 1 • Updated Nov 10, 2025

wit-retrieval

Public

Other • 0 • 5 • 1 • 0 • Updated Oct 13, 2025

cultural_familiarity_annotations

Public

The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.

Apache License 2.0 • 0 • 2 • 0 • 0 • Updated Aug 6, 2025

tydiqa-wana

Public

Jupyter Notebook • Apache License 2.0 • 0 • 0 • 0 • 0 • Updated Jul 30, 2025

conceptual-12m

Public

Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

vision-and-language pre-training multimodal-dataset

Other • 18 • 422 • 6 • 0 • Updated Jul 14, 2025

common-crawl-domain-names

Public

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

MIT License • 1 • 20 • 1 • 0 • Updated Jun 16, 2025

rag_conflicts

Public

CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding confli…

Apache License 2.0 • 1 • 13 • 1 • 0 • Updated Jun 11, 2025

egotempo

Public

Jupyter Notebook • Creative Commons Attribution 4.0 International • 0 • 26 • 3 • 0 • Updated Apr 26, 2025

web-images

Public

Images gathered from the Internet in 2023, with some accompanying metadata.

HTML • Other • 1 • 3 • 0 • 0 • Updated Mar 19, 2025

screen_qa

Public

The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs c…

Python • Creative Commons Attribution 4.0 International • 10 • 145 • 4 • 0 • Updated Feb 7, 2025

adversarial-nibbler

Public archive

This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and va…

Creative Commons Attribution 4.0 International • 4 • 26 • 0 • 0 • Updated Feb 3, 2025

cube

Public archive

CUBE is a benchmark for evaluating the cultural competence of text-to-image (T2I) models.

Creative Commons Attribution 4.0 International • 1 • 8 • 3 • 0 • Updated Jan 20, 2025

global_streamflow_model_paper

Public archive

Jupyter Notebook • Apache License 2.0 • 17 • 69 • 4 • 0 • Updated Jan 17, 2025

hiertext

Public archive

The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotati…

Jupyter Notebook • Creative Commons Attribution Share Alike 4.0 International • 29 • 309 • 0 • 1 • Updated Dec 2, 2024

scin

Public

The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-report…

Jupyter Notebook • Other • 21 • 156 • 2 • 0 • Updated Nov 23, 2024

MISeD

Public

MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transc…

3 • 15 • 0 • 0 • Updated Nov 20, 2024

uicrit

Public archive

UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for…

0 • 26 • 1 • 0 • Updated Nov 19, 2024

WordGraph

Public

The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon …

Creative Commons Zero v1.0 Universal • 1 • 3 • 0 • 0 • Updated Nov 7, 2024

Education-Dialogue-Dataset

Public archive

Dataset of conversations, generated by prompting Gemini Ultra. These are conversations between a teacher and a student, where the teacher is prompted with speci…

9 • 36 • 1 • 0 • Updated Oct 29, 2024

GeniL

Public archive

The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, …

Creative Commons Attribution 4.0 International • 0 • 3 • 0 • 0 • Updated Oct 18, 2024

tap-typing-with-touch-sensing-images

Public archive

The Tap Typing with Touch Sensing Images (TSI) dataset contains data on user taps on a mobile touchscreen keyboard, including elliptical features and capacitive…

Creative Commons Attribution 4.0 International • 1 • 3 • 0 • 0 • Updated Oct 15, 2024

mittens

Public archive

Datasets for measuring misgendering in translation

Other • 0 • 5 • 0 • 0 • Updated Oct 4, 2024

Google Research Datasets

175 repositories
