Introducing Apple ML-MGIE

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Tsu-Jui Fu¹, Wenze Hu², Xianzhi Du², William Yang Wang¹, Yinfei Yang², Zhe Gan²

Abstract

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Classic Examples of Apple ML-MGIE

Input	Instruction	InsPix2Pix	LGIE	MGIE	GroundTruth
	turn the day into night
	make the forest path into a beach
	make the frame red
	as if the shop was a library
	make it the vatican
	turn the sunset into a firestorm

Input	Instruction	InsPix2Pix	LGIE	MGIE	GroundTruth
	remove text
	show him on a frozen lake with snowy mountains
	increase the brightness of the entire image
	take the people out of the back in the photo
	add tiger
	change the background to purple

Input	Instruction	InsPix2Pix	LGIE	MGIE	GroundTruth
	edit out skiers on right
	make it look more professional
	remove hot air balloons
	make colors pop out
	remove boy with red shirt from picture
	lighten out yellow tone

Input	Instruction	InsPix2Pix	LGIE	MGIE	GroundTruth
	add brightness so the clouds look bright white
	make the color more green
	add more contrast to simulate more light
	remove the purple hue out of the picture
	brighten image a lot, sharpen photo
	need to clarified, more focus

Input	Instruction	InsPix2Pix	LGIE	MGIE	GroundTruth
	have there be a birthday cake on the table
	put buildings in the background of the image
	make the face happy
	let there be palm trees
	has a green web page
	replace food with soup