
Summary:
I propose that ManyChat’s integrated AI be able to understand audio (transcribe it and extract intent) and images (OCR and context recognition). Many users today prefer sending voice notes and images; letting ManyChat process them natively would reduce friction and enable new support and sales flows.

Problem:

  • Many users send voice messages and images because it’s more convenient; bots currently ask them to type out or repeat the information.

  • This causes delays and extra work for agents.

Proposal:

  • Enable processing of voice messages to produce a transcription and extract the user’s main intent.

  • Enable basic image recognition (reading text in photos such as tickets or receipts, and detecting the image type: product / receipt / ID).

  • Allow flows to combine voice + image + text to make decisions (for example: detect a complaint and automatically create a ticket); a rough sketch of this decision logic follows this list.
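
To make the last point concrete, here is a minimal Python sketch of the combined decision logic. Everything in it is hypothetical: transcribe_audio, extract_intent, classify_image, and the "create_ticket" action are placeholder names for whatever ManyChat’s AI would actually expose, and the stubs only illustrate the shape of the flow, not a real API.

# All names here are hypothetical placeholders, not ManyChat APIs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    audio_url: Optional[str] = None   # voice note, if any
    image_url: Optional[str] = None   # photo, if any
    text: Optional[str] = None        # typed text, if any

def transcribe_audio(url: str) -> str:
    """Hypothetical speech-to-text stub for a voice note."""
    return "it arrived broken"

def extract_intent(text: str) -> str:
    """Hypothetical intent extraction: map free text to a label."""
    return "complaint" if "broken" in text else "question"

def classify_image(url: str) -> str:
    """Hypothetical image classifier: product / receipt / ID."""
    return "product"

def handle_message(msg: Message) -> str:
    """Combine voice + image + text to decide the next flow step."""
    parts = [msg.text or ""]
    if msg.audio_url:
        parts.append(transcribe_audio(msg.audio_url))
    combined = " ".join(p for p in parts if p)
    intent = extract_intent(combined)
    image_kind = classify_image(msg.image_url) if msg.image_url else None

    # Rule from the proposal: complaint + product photo -> create a ticket.
    if intent == "complaint" and image_kind == "product":
        return "create_ticket"
    return "continue_flow"

print(handle_message(Message(audio_url="a.ogg", image_url="p.jpg")))
# -> create_ticket

Running it with both an audio URL and an image URL prints "create_ticket", which is exactly the complaint-plus-product-photo rule described above.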

Use cases:

  • Support: a customer sends a product photo and says in a voice note “it arrived broken” → the bot identifies the order and creates a refund proposal or ticket.

  • Commerce: a user sends a photo of a product and asks for the price by voice → the bot replies with options and a purchase button.

  • Pre-human assistance: an automatic summary of the audio + photo so the agent sees the essentials before replying (a small sketch follows this list).

  • Accessibility: people who have difficulty typing can use voice and images to complete forms.
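
In the same spirit, here is a tiny sketch of the agent-facing summary from the pre-human assistance case. The summarize function is a hypothetical stand-in for a real summarization step:

def summarize(transcript: str, ocr_text: str, image_kind: str) -> str:
    """Hypothetical: condense audio + photo into one line for the agent."""
    return (f"Customer sent a {image_kind} photo"
            + (f' reading "{ocr_text}"' if ocr_text else "")
            + f'; voice note says: "{transcript}".')

print(summarize("it arrived broken", "", "product"))
# -> Customer sent a product photo; voice note says: "it arrived broken".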

Closing / Request:
Please consider prioritizing multimodal capabilities (processing audio and images) in ManyChat’s AI. I can provide concrete flow examples if the product team wants them.
