
Summary:
I propose that ManyChat’s integrated AI be able to understand audio (transcribe it and extract intent) and images (OCR and context recognition). Many users today prefer sending voice notes and images; letting ManyChat process them natively would reduce friction and enable new support and sales flows.

Problem:

  • Many users send voice messages and images because it’s more convenient; bots currently ask them to type out or repeat the information.

  • This causes delays and extra work for agents.

Proposal:

  • Enable processing of voice messages to produce a transcription and extract the user’s main intent.

  • Enable basic image recognition (reading text in photos such as tickets or receipts, and detecting the image type: product / receipt / ID).

  • Allow flows to combine voice + image + text to make decisions (for example: detect a complaint and automatically create a ticket); a rough sketch of this decision logic follows this list.
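
To make the last point concrete, here is a minimal Python sketch of the combined decision logic. Everything in it is hypothetical: transcribe_audio, extract_intent, classify_image, and the "create_ticket" action are placeholder names for whatever ManyChat’s AI would actually expose, and the stubs only illustrate the shape of the flow, not a real API.

# All names here are hypothetical placeholders, not ManyChat APIs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    audio_url: Optional[str] = None   # voice note, if any
    image_url: Optional[str] = None   # photo, if any
    text: Optional[str] = None        # typed text, if any

def transcribe_audio(url: str) -> str:
    """Hypothetical speech-to-text stub for a voice note."""
    return "it arrived broken"

def extract_intent(text: str) -> str:
    """Hypothetical intent extraction: map free text to a label."""
    return "complaint" if "broken" in text else "question"

def classify_image(url: str) -> str:
    """Hypothetical image classifier: product / receipt / ID."""
    return "product"

def handle_message(msg: Message) -> str:
    """Combine voice + image + text to decide the next flow step."""
    parts = [msg.text or ""]
    if msg.audio_url:
        parts.append(transcribe_audio(msg.audio_url))
    combined = " ".join(p for p in parts if p)
    intent = extract_intent(combined)
    image_kind = classify_image(msg.image_url) if msg.image_url else None

    # Rule from the proposal: complaint + product photo -> create a ticket.
    if intent == "complaint" and image_kind == "product":
        return "create_ticket"
    return "continue_flow"

print(handle_message(Message(audio_url="a.ogg", image_url="p.jpg")))
# -> create_ticket

Running it with both an audio URL and an image URL prints "create_ticket", which is exactly the complaint-plus-product-photo rule described above.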

Use cases:

  • Support: a customer sends a product photo and says in a voice note “it arrived broken” → the bot identifies the order and creates a refund proposal or ticket.

  • Commerce: a user sends a photo of a product and asks for the price by voice → the bot replies with options and a purchase button.

  • Pre-human assistance: an automatic summary of the audio + photo so the agent sees the essentials before replying (a small sketch follows this list).

  • Accessibility: people who have difficulty typing can use voice and images to complete forms.
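
In the same spirit, here is a tiny sketch of the agent-facing summary from the pre-human assistance case. The summarize function is a hypothetical stand-in for a real summarization step:

def summarize(transcript: str, ocr_text: str, image_kind: str) -> str:
    """Hypothetical: condense audio + photo into one line for the agent."""
    return (f"Customer sent a {image_kind} photo"
            + (f' reading "{ocr_text}"' if ocr_text else "")
            + f'; voice note says: "{transcript}".')

print(summarize("it arrived broken", "", "product"))
# -> Customer sent a product photo; voice note says: "it arrived broken".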

Closing / Request:
Please consider prioritizing multimodal capabilities (processing audio and images) in ManyChat’s AI. I can provide concrete flow examples if the product team wants them.
