Generative Image Model Benchmark for Reasoning and Representation
AAAI-EDGeS 2023: A Symposium on Challenges and Methods for Assessing the Next Generation of AI (part of the AAAI Spring Symposium Series)
Jascha Achterberg*, Ron Arel*, Tetiana Grinberg*, Adel Chaibi, Joscha Bach, Nikos Tzagkarakis

Jascha Achterberg*: [University of Cambridge], [Intel Labs]

Tetiana Grinberg*: [Intel Labs]

Adel Chaibi: [Intel Labs]

Joscha Bach: [Intel Labs]

Nikos Tzagkarakis: [Metanomic]


Recent developments in generative AI show that models are beginning to acquire broad sets of skills and can solve many tasks without having been explicitly optimized for them. Benchmarks have been a key driver of model development, but as models’ capabilities grow more complex, it becomes harder to create benchmarks that meaningfully capture the skill sets of algorithms and can inform future model development and training pipelines. While the language domain has seen a wide-ranging array of benchmarks, comparable benchmarks are currently missing for generative image algorithms. Here we introduce the Generative Image Model Benchmark for Reasoning and Representation (GIMBRR), an open-source software package that assesses generative image algorithms on 11 cognitive tasks using manual and automated evaluation pipelines. GIMBRR is built with customizability in mind, so that it can easily be extended with new tasks and assessment routines. It can therefore be adapted to the needs of research teams with specific goals in image generation, and task difficulty can be updated as generative image algorithms progress. We used GIMBRR to measure the performance of three popular generative image models (DALL-E 2, Midjourney, Stable Diffusion), demonstrating that reasoning and representation tasks pose a considerable challenge to all of them. We also demonstrate how cognitive theory can be used to systematically analyze the generative and representational capabilities of these models.

An Intel Labs Project © 2023