Summary:
I propose that ManyChat’s integrated AI be able to understand audio (transcribe it and extract intent) and images (OCR and context recognition). Users today often prefer sending voice notes and images; letting ManyChat process them natively would reduce friction and enable new support and sales flows.
Problem:
- Many users send voice messages and images because it’s more convenient; bots currently ask them to type or repeat information.
- This causes delays and extra work for agents.
Proposal:
- Enable processing of voice messages to produce a transcription and the user’s main intent.
- Enable basic image recognition (read text in photos such as tickets/receipts and detect the image type: product / receipt / ID).
- Allow flows to combine voice + image + text to make decisions (for example: detect a complaint and automatically create a ticket); a minimal sketch of such a flow follows this list.
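To make the idea concrete, here is a minimal sketch of how such a flow might work today via an external webhook step. It uses the OpenAI API purely as an example backend; `handle_multimodal_message` is a hypothetical name and nothing here is an existing ManyChat feature:

```python
# Minimal sketch: a webhook handler that a flow could call for multimodal
# input. Illustrative only; these are not ManyChat APIs.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def handle_multimodal_message(audio_path: str, image_url: str) -> dict:
    # 1. Voice note -> transcription (Whisper)
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        ).text

    # 2. Image + transcript -> image type and intent in one vision call
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Customer said: " + transcript + "\n"
                    "Classify the image as product, receipt, or ID, and the "
                    "intent as complaint, purchase, or question. "
                    'Reply as JSON: {"image_type": ..., "intent": ...}'
                )},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    analysis = response.choices[0].message.content

    # 3. Return fields the flow can branch on
    return {"transcript": transcript, "analysis": analysis}
```

A flow could then branch on the returned intent (e.g. intent = "complaint" → open a ticket automatically) using ManyChat’s existing condition steps.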
Use cases:
- Support: a customer sends a product photo and says by voice “it arrived broken” → the bot identifies the order and creates a refund proposal or ticket.
- Commerce: a user sends a photo of a product and asks the price by voice → the bot replies with options and a purchase button.
- Pre-human assistance: an automatic summary of the audio + photo so the agent sees the essentials before replying (see the summary sketch after this list).
- Accessibility: people who have trouble typing can use voice and images to complete forms.
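For the pre-human assistance case, the same pieces could feed a short agent-facing brief. A sketch under the same assumptions (`summarize_for_agent` is a hypothetical helper, not an existing ManyChat feature):

```python
# Sketch of the "pre-human assistance" use case: condense the transcript and
# image analysis into a two-line brief shown to the agent before they reply.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_for_agent(transcript: str, analysis: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Summarize for a support agent in two lines: what the "
                "customer wants and what the attached image shows.\n"
                f"Transcript: {transcript}\nImage analysis: {analysis}"
            ),
        }],
    )
    return response.choices[0].message.content
```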
Closing / Request:
Please consider prioritizing multimodal capabilities (audio and image processing) in ManyChat’s AI. I can provide further concrete flow examples if the product team wants them.