Solved: Convert Voice Memos from Telegram to Text using OpenAI Whisper API

🚀 Executive Summary

TL;DR: This project solves the problem of unstructured voice memos in Telegram by creating a Python bot that automatically transcribes them. It uses the Telegram Bot API to receive voice notes and the OpenAI Whisper API to convert them into searchable, copy-pasteable text, significantly boosting efficiency.

🎯 Key Takeaways

The solution integrates `python-telegram-bot` for message handling, `pydub` with `ffmpeg` for Ogg Opus to MP3 audio conversion, and the `openai` library for Whisper API transcription.
Secure management of API keys is achieved using `python-dotenv` to load `TELEGRAM_BOT_TOKEN` and `OPENAI_API_KEY` from a `config.env` file, preventing hardcoding.
Temporary audio files (OGA and MP3) are downloaded, processed, and then removed with `os.remove` inside a `finally` block, so cleanup runs even when processing fails.


Alright, team. Darian here. Let’s talk about efficiency. I used to leave myself voice memos on the go—quick thoughts, reminders, even mini-debug sessions while walking the dog. The problem? They’d pile up in my Telegram “Saved Messages,” becoming a black hole of unstructured audio. Listening back to find one specific thought was a huge time sink. This little project changed that. Now, I just send a voice note to a bot, and a few seconds later, I get a clean text transcription back. It’s searchable, copy-pasteable, and has genuinely saved me a couple of hours a week.

This isn’t just a gimmick; it’s a powerful way to bridge the gap between spoken ideas and actionable, written data. Let’s build it.

Prerequisites

Before we dive in, make sure you have the following ready. We’re all busy, so getting this sorted out first will make the process much smoother.

A Telegram Bot Token: You can get this from the BotFather on Telegram. Just start a chat with him, create a new bot, and he’ll give you an API token.
An OpenAI API Key: You’ll need an account on the OpenAI platform. Grab your API key from your account dashboard.
Python Environment: A working Python 3.8+ installation.
FFmpeg: This is a crucial dependency for audio processing. You’ll need to install it on your system. A quick search for “install ffmpeg on [your OS]” will get you there. Pydub, the library we’ll use, depends on it.
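
Since everything downstream depends on FFmpeg being reachable, it can be worth checking for it before you start the bot. Here’s a minimal sketch using only the standard library; the helper name `ffmpeg_available` is my own illustration and is not part of `pydub`.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the system PATH."""
    # shutil.which returns the full path to the executable, or None if absent
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found on PATH")
    else:
        print("ffmpeg not found; install it before running the bot")
```

If this reports that the binary is missing even though you installed it, the usual culprit is a PATH that differs between your shell and the environment the script runs in.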

The Guide: Step-by-Step

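I’ll skip the standard virtual environment setup (`venv`, etc.) since you likely have your own workflow for that. Let’s jump straight to the logic. You’ll need four Python libraries: `python-telegram-bot`, `openai`, `python-dotenv`, and `pydub`. Install them with your package manager of choice (for example, `pip install python-telegram-bot openai python-dotenv pydub`). The code in this guide uses the async `Application` API introduced in `python-telegram-bot` v20 and the `OpenAI` client class introduced in `openai` v1.0, so older releases of either library won’t work.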

Step 1: Environment and Configuration

First rule of production: never hardcode secrets. We’ll store our API keys in a `config.env` file. Create a file with that name in your project directory and add your keys like this:

TELEGRAM_BOT_TOKEN="YOUR_TELEGRAM_TOKEN_HERE"
OPENAI_API_KEY="YOUR_OPENAI_KEY_HERE"

Now, let’s start our Python script. We’ll call it `transcriber_bot.py`. We’ll begin by importing the necessary libraries and loading our environment variables.

import os
import logging
from dotenv import load_dotenv
from telegram import Update
from telegram.ext import Application, MessageHandler, filters, ContextTypes
from openai import OpenAI
from pydub import AudioSegment

# Load environment variables from config.env
load_dotenv('config.env')

# Setup basic logging
logging.basicConfig(
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
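
One thing `os.getenv` won’t do is complain when a key is missing; it quietly returns `None`, and you only find out at the first API call. A small fail-fast check can help. This is a sketch under my own assumptions: `require_env` is a hypothetical helper, not something provided by `python-dotenv`.

```python
import os


def require_env(*names: str) -> dict:
    """Return the named environment variables, raising if any is missing or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


# Intended usage, right after load_dotenv('config.env'):
#   settings = require_env("TELEGRAM_BOT_TOKEN", "OPENAI_API_KEY")
```

Calling it once at startup means a typo in `config.env` surfaces immediately, with the name of the offending variable in the error message.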

Step 2: The Telegram Bot Core Logic

Next, we’ll set up the main structure of our bot. This involves creating an `Application` instance and adding a `MessageHandler`. We specifically want to filter for voice messages, so we’ll use `filters.VOICE`.

async def handle_voice_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # This is where the magic will happen. We'll fill this in next.
    await update.message.reply_text("Processing your voice memo...")
    # (Future steps go here)

def main() -> None:
    """Start the bot."""
    telegram_token = os.getenv("TELEGRAM_BOT_TOKEN")
    if not telegram_token:
        logger.error("TELEGRAM_BOT_TOKEN not found in environment variables!")
        return

    application = Application.builder().token(telegram_token).build()

    # Add a handler for voice messages
    application.add_handler(MessageHandler(filters.VOICE, handle_voice_message))

    # Start the Bot
    logger.info("Bot is starting...")
    application.run_polling()

if __name__ == '__main__':
    main()

This boilerplate code sets up a listener. When the bot receives a voice message, it will call our `handle_voice_message` function.

Step 3: Downloading and Converting the Audio

Telegram voice messages arrive as Opus-encoded audio in an Ogg container (`.oga` files). To hand the Whisper API a widely supported format, we’ll convert the audio to MP3, and this is where `pydub` and `ffmpeg` shine. We’ll download the file, then convert it.

Let’s flesh out the `handle_voice_message` function:

async def handle_voice_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Downloads, converts, and transcribes a voice message."""
    file_id = update.message.voice.file_id
    # Define the temporary paths before the try block so the finally
    # clause can reference them even if the download fails
    oga_path = f'{file_id}.oga'
    mp3_path = f'{file_id}.mp3'
    try:
        # 1. Download the file
        voice_file = await context.bot.get_file(file_id)
        await voice_file.download_to_drive(oga_path)
        logger.info(f"Downloaded voice file to {oga_path}")

        # 2. Convert OGA to MP3
        audio = AudioSegment.from_ogg(oga_path)
        audio.export(mp3_path, format="mp3")
        logger.info(f"Converted {oga_path} to {mp3_path}")
        
        # (Transcription step comes next)

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        await update.message.reply_text("Sorry, I couldn't process that voice memo.")
    finally:
        # 4. Clean up the temporary files
        if os.path.exists(oga_path):
            os.remove(oga_path)
        if os.path.exists(mp3_path):
            os.remove(mp3_path)
        logger.info("Cleaned up temporary files.")

Pro Tip: In my production setups, I handle file paths more robustly, often using a dedicated `/tmp` or temporary directory structure. For this example, creating files in the local directory is fine, but always be mindful of where you’re writing data, especially in a containerized environment. Cleaning up files in a `finally` block ensures they get deleted even if an error occurs.
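
To make that tip concrete, here’s a minimal sketch of the temporary-directory approach using the standard library’s `tempfile` module. The helper `make_temp_audio_paths` and the `voice_memo_` prefix are hypothetical names I picked for illustration, and the placeholder bytes stand in for a real download.

```python
import tempfile
from pathlib import Path


def make_temp_audio_paths(workdir: str, file_id: str) -> tuple:
    """Build the .oga and .mp3 paths for one voice memo inside workdir."""
    base = Path(workdir)
    return base / f"{file_id}.oga", base / f"{file_id}.mp3"


# The directory and everything in it is deleted when the with block exits,
# even if an exception is raised inside it.
with tempfile.TemporaryDirectory(prefix="voice_memo_") as workdir:
    oga_path, mp3_path = make_temp_audio_paths(workdir, "example_file_id")
    oga_path.write_bytes(b"placeholder")  # stands in for the downloaded audio
    existed_inside = oga_path.exists()

removed_after = not oga_path.exists()
```

With this pattern the context manager takes over the cleanup that the `finally` block does by hand, and two memos processed at the same time can’t collide in the working directory.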

Step 4: Transcribing with OpenAI Whisper

With our MP3 file ready, sending it to OpenAI is straightforward. We’ll call the `audio.transcriptions.create` method on the `openai_client` instance we created in Step 1.
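
One caveat worth flagging: `OpenAI` is the synchronous client, so the request blocks the bot’s event loop until the transcription comes back, and the `pydub` conversion does the same. For a personal bot that is usually fine. If it becomes a problem, one option is to push the blocking work onto a worker thread. The sketch below shows the pattern with a stand-in function; `slow_transcribe` is hypothetical and only sleeps, so in the bot you would pass your own function wrapping the Whisper call.

```python
import asyncio
import time


def slow_transcribe(path: str) -> str:
    """Hypothetical stand-in for a blocking call such as the Whisper request."""
    time.sleep(0.1)  # simulate network latency
    return f"transcript of {path}"


async def transcribe_in_thread(path: str) -> str:
    """Run the blocking function in the default thread pool and await its result."""
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) uses the loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, slow_transcribe, path)


result = asyncio.run(transcribe_in_thread("memo.mp3"))
```

The `openai` package also ships an `AsyncOpenAI` client, which is another route if you’d rather `await` the request directly.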

Let’s add the transcription logic into our `handle_voice_message` function:

async def handle_voice_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Downloads, converts, and transcribes a voice message."""
    file_id = update.message.voice.file_id
    oga_path = f'{file_id}.oga'
    mp3_path = f'{file_id}.mp3'
    
    try:
        await update.message.reply_text("Processing your voice memo...")
        voice_file = await context.bot.get_file(file_id)
        await voice_file.download_to_drive(oga_path)
        
        audio = AudioSegment.from_ogg(oga_path)
        audio.export(mp3_path, format="mp3")

        # 3. Send to Whisper API for transcription
        with open(mp3_path, "rb") as audio_file:
            transcription = openai_client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        
        transcribed_text = transcription.text
        logger.info(f"Transcription successful: {transcribed_text}")
        
        # 4. Reply to the user (plain text: a transcript may contain
        # characters such as < or & that would break HTML parsing)
        await update.message.reply_text(f"Transcription:\n\n{transcribed_text}")

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        await update.message.reply_text("Sorry, I couldn't process that voice memo.")
    finally:
        # 5. Clean up
        if os.path.exists(oga_path):
            os.remove(oga_path)
        if os.path.exists(mp3_path):
            os.remove(mp3_path)
        logger.info("Cleaned up temporary files for " + file_id)

And that’s the complete loop! The bot receives a voice note, downloads it, converts it, sends it to Whisper, and replies with the text.
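
One edge case in the reply step: Telegram caps a single text message at 4096 characters, so a long memo can produce a transcript that won’t fit in one reply. Here’s a minimal sketch of a splitter, assuming a conservative default of 4000 characters to leave room for a prefix; `split_message` is a hypothetical helper, not part of `python-telegram-bot`.

```python
def split_message(text: str, limit: int = 4000) -> list:
    """Split text into pieces of at most `limit` characters, preferring whitespace breaks."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Find the last space or newline within the limit; fall back to a hard cut
        cut = max(remaining.rfind(" ", 0, limit), remaining.rfind("\n", 0, limit))
        if cut <= 0:
            cut = limit
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

In the handler, you would loop over `split_message(transcribed_text)` and call `reply_text` once per piece.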

Common Pitfalls

Here are a few places I’ve tripped up in the past. Hopefully, you can avoid them.

`ffmpeg` Not Found: The most common issue. The `pydub` library is just a Python wrapper around the `ffmpeg` command-line tool. If `ffmpeg` isn’t installed and available in your system’s PATH, `pydub` will fail. The error message is usually pretty clear about this.
API Key Errors: Double-check your `config.env` file. A typo in the variable name or a misplaced quote can lead to authentication failures. Make sure the file is in the same directory you’re running the script from, or provide an absolute path to it.
File Size Limits: The OpenAI Whisper API has a file size limit (currently 25 MB). For a simple voice memo bot, this is rarely an issue. But if you were adapting this for longer audio, you’d need to implement chunking—splitting the audio into smaller pieces and processing them sequentially.
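
For the chunking idea in the last point, the core of it is just working out where to cut. Here’s a minimal sketch of that arithmetic; `chunk_ranges` is a hypothetical helper, and the 10-minute default is an arbitrary assumption rather than a recommended value.

```python
def chunk_ranges(total_ms: int, chunk_ms: int = 10 * 60 * 1000) -> list:
    """Return (start, end) millisecond offsets that cover total_ms in order."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]
```

This pairs naturally with `pydub`: `len(audio)` gives an `AudioSegment`’s duration in milliseconds, and `audio[start:end]` slices it by millisecond offsets, so each range can be exported, transcribed in order, and the texts joined. Keep in mind that fixed-offset cuts can land in the middle of a word, so transcripts may be slightly off around the boundaries.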

Conclusion

You now have a fully functional, private transcription service. This pattern is incredibly versatile. You could modify it to save transcriptions to a database, send them to a Notion page, or create a Jira ticket. It’s a fantastic building block for automating any workflow that starts with a spoken idea.

Happy building,
Darian Vance
Senior DevOps Engineer, TechResolve

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ How does the bot convert Telegram voice messages to text?

The bot first downloads the Telegram voice message (typically Ogg Opus `.oga` format), converts it to MP3 using `pydub` (which relies on `ffmpeg`), and then sends the MP3 file to the OpenAI Whisper API for transcription, finally replying with the extracted text.

❓ What are the main benefits of this automated transcription bot?

This bot transforms unstructured voice memos into searchable, copy-pasteable text, drastically reducing the time spent listening back to audio. It provides a private, efficient, and automated way to bridge spoken ideas to actionable written data directly within Telegram.

❓ What is a common technical issue when setting up the bot, and how is it resolved?

A common issue is the “ffmpeg Not Found” error, as `pydub` is a Python wrapper that requires `ffmpeg` to be installed and accessible in the system’s PATH for audio conversion. Installing `ffmpeg` on the operating system resolves this dependency.

TechResolve – SaaS Troubleshooting & Software Alternatives
