[Python] Multilingual Text / Language Detection and Translation in Python using Google Translate

Handling multilingual text is a common challenge in social media data analysis, especially when working with user-generated content like YouTube comments, tweets, or customer reviews. In this guide, we’ll walk through a Python-based approach to detecting languages and translating text into English using Google Translate.

The full code is available at the end of this post, ready to copy in one block.

Step 1: Install and Import Required Libraries

Run the following command to install the required packages, then import them. Here, we use pandas for data handling, Translator from googletrans for translation, time for adding delays (to avoid rate limiting), and re for regex-based script detection.

Python
!pip install pandas googletrans==4.0.0-rc1

import pandas as pd
from googletrans import Translator
import time
import re

translator = Translator() # Initialize Google Translator
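
Before processing a whole dataset, it is worth a quick sanity check that the translator works; the sample string below is just an illustration (with googletrans 4.0.0-rc1, the call is synchronous):

Python
# Quick sanity check: translate a single string to English
print(translator.translate("Hola, mundo", dest="en").text)  # Expected output: "Hello, world"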

Step 2: Load the Dataset

For this tutorial, we assume that the dataset is stored as a CSV file. Replace file_path with the actual location of your file.

Python
# Define file path
file_path = "your_file_path_here.csv"

# Load the CSV file with UTF-8 encoding
df = pd.read_csv(file_path, encoding="utf-8-sig")
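
It is also worth confirming that the file loaded correctly and that the text column you plan to process exists. This sketch assumes the column is named 'title', as in the rest of this tutorial:

Python
# Inspect the first rows and confirm the text column is present
print(df.head())
assert 'title' in df.columns, "Expected a 'title' column - adjust to match your data"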

Step 3: Language Detection with Fallback Mechanism

Language detection through the API can fail or return "unknown", so we first define a fallback that guesses the language from Unicode script ranges and common diacritics.

Python
def guess_language(text):
    """Guess the language based on character patterns if detection fails."""
    # East Asian languages
    if re.search("[\uac00-\ud7af]", text):  # Korean Hangul
        return "ko"
    elif re.search("[\u3040-\u309F\u30A0-\u30FF]", text):  # Japanese Hiragana & Katakana
        return "ja"
    elif re.search("[\u4E00-\u9FFF\u3400-\u4DBF]", text):  # Chinese Han characters (CJK Unified)
        return "zh"
    
    # Cyrillic-based languages (Russian, Ukrainian, Bulgarian, etc.)
    elif re.search("[\u0400-\u04FF]", text):  
        return "ru"
    
    # South and Southeast Asian scripts
    elif re.search("[\u0E00-\u0E7F]", text):  # Thai
        return "th"
    elif re.search("[\u0980-\u09FF]", text):  # Bengali
        return "bn"
    elif re.search("[\u0900-\u097F]", text):  # Devanagari (Hindi, Marathi, etc.)
        return "hi"
    elif re.search("[\u0A80-\u0AFF]", text):  # Gujarati
        return "gu"
    elif re.search("[\u0B00-\u0B7F]", text):  # Oriya (Odia)
        return "or"
    elif re.search("[\u0B80-\u0BFF]", text):  # Tamil
        return "ta"
    elif re.search("[\u0C00-\u0C7F]", text):  # Telugu
        return "te"
    elif re.search("[\u0C80-\u0CFF]", text):  # Kannada
        return "kn"
    elif re.search("[\u0D00-\u0D7F]", text):  # Malayalam
        return "ml"
    elif re.search("[\u0F00-\u0FFF]", text):  # Tibetan
        return "bo"
    elif re.search("[\u1780-\u17FF]", text):  # Khmer (Cambodian)
        return "km"
    elif re.search("[\u1000-\u109F]", text):  # Burmese (Myanmar)
        return "my"
    elif re.search("[\u1700-\u171F]", text):  # Baybayin script
        return "tl"
    
    # Middle Eastern, Caucasian, and African scripts
    elif re.search("[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]", text):  # Arabic and extended Arabic scripts
        return "ar"
    elif re.search("[\u0590-\u05FF]", text):  # Hebrew
        return "he"
    elif re.search("[\u0700-\u074F]", text):  # Syriac
        return "syr"
    elif re.search("[\u10A0-\u10FF]", text):  # Georgian
        return "ka"
    elif re.search("[\u1200-\u137F]", text):  # Amharic (Ethiopic script)
        return "am"

    # Latin-based languages with unique diacritics
    elif re.search("[ÇĞİÖŞÜçğıöşü]", text):  # Turkish
        return "tr"
    elif re.search("[ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸàâæçéèêëîïôœùûüÿ]", text):  # French
        return "fr"
    elif re.search("[ÁÉÍÑÓÚÜáéíñóúü]", text):  # Spanish
        return "es"
    elif re.search("[ÄÖÜäöüß]", text):  # German
        return "de"
    elif re.search("[ĄĆĘŁŃÓŚŹŻąćęłńóśźż]", text):  # Polish
        return "pl"
    elif re.search("[ŘŠŽřšž]", text):  # Czech and Slovak
        return "cs"
    elif re.search("[ÐđĆćČ芚Žž]", text):  # Croatian, Serbian (Latin)
        return "hr"
    
    return "unknown"

Next, we define the primary function for language detection:

Python
# Function to detect language with retries and intelligent fallback
def detect_language(text, retries=2):
    """Try to detect language, retrying if needed. If still unknown, attempt guessing."""
    if not isinstance(text, str) or text.strip() == "":
        return "unknown"
    
    for attempt in range(retries + 1):
        try:
            detected_lang = translator.detect(text).lang
            if detected_lang != "unknown":
                return detected_lang
        except Exception:
            pass
        time.sleep(1)  # Delay before retrying

    # Final fallback: Guess the language based on script patterns
    return guess_language(text)
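
With the fallback in place, detection degrades gracefully even when the API is unavailable. For example (the comments describe the expected behavior):

Python
print(detect_language("Bonjour tout le monde"))  # "fr" when the API responds; the regex fallback alone would return "unknown" (no diacritics)
print(detect_language(None))                     # "unknown" -- non-string input is handled up front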

Now, we apply this function to the dataset. Replace 'title' with the name of the column you want to detect and translate 🙂 For example, if your column is named 'body', use df['body'] instead of df['title'].

Python
df['detected_language'] = df['title'].apply(lambda text: detect_language(text, retries=2))
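
A quick look at the distribution of detected languages helps catch obvious misdetections before translating:

Python
# Count how many rows were detected per language
print(df['detected_language'].value_counts())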

Step 4: Translate Non-English Text to English

Python
def translate_to_english(text, detected_lang, retries=2):
    """Translate text to English. If detection failed, guess the language and retry."""
    if detected_lang == "en":
        return text  # Already English -- nothing to translate

    if detected_lang != "unknown":
        for attempt in range(retries + 1):
            try:
                return translator.translate(text, src=detected_lang, dest="en").text
            except Exception:
                time.sleep(1)  # Delay before retrying

    # Final fallback: if detection was unknown (or translation kept failing), try the guessed language
    guessed_lang = guess_language(text)
    if guessed_lang != "unknown":
        try:
            return translator.translate(text, src=guessed_lang, dest="en").text
        except Exception:
            return "Translation Error"

    return text  # If all else fails, return the original text
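
A single-string example shows the full flow; the sample text and expected output are illustrative:

Python
sample = "Dies ist ein Test"
lang = detect_language(sample)             # expected: "de"
print(translate_to_english(sample, lang))  # expected: "This is a test"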

Apply the translation function to the dataset:

Python
df['translated_text'] = df.apply(lambda row: translate_to_english(row['title'], row['detected_language'], retries=2), axis=1)

Step 5: Save the Results

Finally, we save the processed dataset to a new CSV file:

Python
output_file = "translated_data.csv"
df.to_csv(output_file, encoding="utf-8-sig", index=False)

Full code block

Here is the full code block that you can copy for your own use in Google Colab. Remember to update file_path to point to your own data file.

Python
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Install required packages
!pip install pandas googletrans==4.0.0-rc1

# Step 3: Import necessary libraries
import pandas as pd
from googletrans import Translator
import time
import re

# Initialize Google Translator
translator = Translator()

# Step 4: Define file path
file_path = "your_file_path_here.csv"

# Step 5: Load the CSV file with UTF-8 encoding
df = pd.read_csv(file_path, encoding="utf-8-sig")

# Step 6: Language detection function using regex patterns
def guess_language(text):
    """Guess the language based on character patterns if detection fails."""
    # East Asian languages
    if re.search("[\uac00-\ud7af]", text):  # Korean Hangul
        return "ko"
    elif re.search("[\u3040-\u309F\u30A0-\u30FF]", text):  # Japanese Hiragana & Katakana
        return "ja"
    elif re.search("[\u4E00-\u9FFF\u3400-\u4DBF]", text):  # Chinese Han characters (CJK Unified)
        return "zh"
    
    # Cyrillic-based languages (Russian, Ukrainian, Bulgarian, etc.)
    elif re.search("[\u0400-\u04FF]", text):  
        return "ru"
    
    # South and Southeast Asian scripts
    elif re.search("[\u0E00-\u0E7F]", text):  # Thai
        return "th"
    elif re.search("[\u0980-\u09FF]", text):  # Bengali
        return "bn"
    elif re.search("[\u0900-\u097F]", text):  # Devanagari (Hindi, Marathi, etc.)
        return "hi"
    elif re.search("[\u0A80-\u0AFF]", text):  # Gujarati
        return "gu"
    elif re.search("[\u0B00-\u0B7F]", text):  # Oriya (Odia)
        return "or"
    elif re.search("[\u0B80-\u0BFF]", text):  # Tamil
        return "ta"
    elif re.search("[\u0C00-\u0C7F]", text):  # Telugu
        return "te"
    elif re.search("[\u0C80-\u0CFF]", text):  # Kannada
        return "kn"
    elif re.search("[\u0D00-\u0D7F]", text):  # Malayalam
        return "ml"
    elif re.search("[\u0F00-\u0FFF]", text):  # Tibetan
        return "bo"
    elif re.search("[\u1780-\u17FF]", text):  # Khmer (Cambodian)
        return "km"
    elif re.search("[\u1000-\u109F]", text):  # Burmese (Myanmar)
        return "my"
    elif re.search("[\u1700-\u171F]", text):  # Baybayin script
        return "tl"
    
    # Middle Eastern, Caucasian, and African scripts
    elif re.search("[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]", text):  # Arabic and extended Arabic scripts
        return "ar"
    elif re.search("[\u0590-\u05FF]", text):  # Hebrew
        return "he"
    elif re.search("[\u0700-\u074F]", text):  # Syriac
        return "syr"
    elif re.search("[\u10A0-\u10FF]", text):  # Georgian
        return "ka"
    elif re.search("[\u1200-\u137F]", text):  # Amharic (Ethiopic script)
        return "am"

    # Latin-based languages with unique diacritics
    elif re.search("[ÇĞİÖŞÜçğıöşü]", text):  # Turkish
        return "tr"
    elif re.search("[ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸàâæçéèêëîïôœùûüÿ]", text):  # French
        return "fr"
    elif re.search("[ÁÉÍÑÓÚÜáéíñóúü]", text):  # Spanish
        return "es"
    elif re.search("[ÄÖÜäöüß]", text):  # German
        return "de"
    elif re.search("[ĄĆĘŁŃÓŚŹŻąćęłńóśźż]", text):  # Polish
        return "pl"
    elif re.search("[ŘŠŽřšž]", text):  # Czech and Slovak
        return "cs"
    elif re.search("[ÐđĆćČ芚Žž]", text):  # Croatian, Serbian (Latin)
        return "hr"
    
    return "unknown"

# Function to detect language with retries and intelligent fallback
def detect_language(text, retries=2):
    """Try to detect language, retrying if needed. If still unknown, attempt guessing."""
    if not isinstance(text, str) or text.strip() == "":
        return "unknown"

    for attempt in range(retries + 1):
        try:
            detected_lang = translator.detect(text).lang
            if detected_lang != "unknown":
                return detected_lang
        except Exception:
            pass
        time.sleep(1)  # Delay before retrying

    # Final fallback: Guess the language based on script patterns
    return guess_language(text)

# Step 7: Apply function to detect language
df['detected_language'] = df['title'].apply(lambda text: detect_language(text, retries=2))

# Step 8: Translate non-English text to English with intelligent fallback
def translate_to_english(text, detected_lang, retries=2):
    """Translate text to English. If detection failed, guess the language and retry."""
    if detected_lang == "en":
        return text  # Already English -- nothing to translate

    if detected_lang != "unknown":
        for attempt in range(retries + 1):
            try:
                return translator.translate(text, src=detected_lang, dest="en").text
            except Exception:
                time.sleep(1)  # Delay before retrying

    # Final fallback: if detection was unknown (or translation kept failing), try the guessed language
    guessed_lang = guess_language(text)
    if guessed_lang != "unknown":
        try:
            return translator.translate(text, src=guessed_lang, dest="en").text
        except Exception:
            return "Translation Error"

    return text  # If all else fails, return the original text

# Step 9: Apply translation function
df['translated_text'] = df.apply(lambda row: translate_to_english(row['title'], row['detected_language'], retries=2), axis=1)

# Step 10: Save the results back to Google Drive
output_file = "translated_data.csv"
df.to_csv(output_file, encoding="utf-8-sig", index=False)

# Step 11: Display the first few rows
df.head()

Last Note

In this tutorial, we explored how to detect and translate multilingual text in Python, leveraging the googletrans library for both language detection and translation. While this approach offers a straightforward solution, it is worth knowing the alternatives below. In my personal experience they were not as accurate as Google Translate, but I guess it could be different for other use cases 🙂

Alternative Language Detection Methods:

  • langdetect Package: This is a simple probabilistic language detection tool. In practice, however, it often shows lower accuracy than Google's translation services, particularly on short or informal texts (a minimal usage sketch follows below).
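
For reference, here is a minimal langdetect sketch; it assumes the package has been installed first:

Python
!pip install langdetect

from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0  # langdetect is probabilistic; fixing the seed makes results reproducible

print(detect("Ceci est un texte en français"))  # "fr"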

On the translation side, the following services are also worth considering:

Alternative Translation Services:

  • DeepL API: DeepL is renowned for its high-quality translations, especially for European languages. While it is a paid service, many users find its accuracy superior to that of free alternatives. You can incorporate it into your code through the DeepL API (see the first sketch after this list).
  • Helsinki-NLP Models: The Language Technology Research Group at the University of Helsinki provides open-source neural machine translation models, available through platforms like Hugging Face. These models cover a wide range of language pairs and can be integrated into custom applications (see the second sketch after this list).
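
For orientation, here are two minimal sketches. The first uses the official deepl Python package; the auth key is a placeholder you must replace with your own:

Python
!pip install deepl

import deepl

deepl_translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # Placeholder -- requires your own DeepL API key
result = deepl_translator.translate_text("Guten Tag", target_lang="EN-US")
print(result.text)

The second loads an open-source Helsinki-NLP model through the Hugging Face transformers pipeline; opus-mt-fr-en is one of many available language pairs:

Python
!pip install transformers sentencepiece

from transformers import pipeline

# Load a French-to-English OPUS-MT model from the Hugging Face Hub
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(fr_en("Bonjour tout le monde")[0]["translation_text"])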

March 1, 2025