[Python] Multilingual Text / Language Detection and Translation in Python Using Google Translate
Handling multilingual text is a common challenge in social media data analysis, especially when working with user-generated content like YouTube comments, tweets, or customer reviews. In this guide, we’ll walk through a Python-based approach to detecting languages and translating text into English using Google Translate.
You can copy the full code in one block at the end of this blog post.
Step 1: Install and Import Required Libraries
Run the following commands to install and import the necessary packages. Here, we use pandas for data handling, Translator from googletrans for translation, time for adding delays (to prevent rate limiting), and re for regex-based language pattern detection.
!pip install pandas googletrans==4.0.0-rc1
import pandas as pd
from googletrans import Translator
import time
import re
translator = Translator() # Initialize Google Translator
Step 2: Load the Dataset
For this tutorial, we assume the dataset is stored as a CSV file. Replace file_path with the actual location of your file.
# Define file path
file_path = "your_file_path_here.csv"
# Load the CSV file with UTF-8 encoding
df = pd.read_csv(file_path, encoding="utf-8-sig")
Step 3: Language Detection with Fallback Mechanism
Automatic language detection can fail, especially on short or noisy text, so we define a fallback that guesses the language from Unicode script ranges and characteristic diacritics.
def guess_language(text):
    """Guess the language based on character patterns if detection fails."""
    # East Asian languages
    if re.search("[\uac00-\ud7af]", text):  # Korean Hangul
        return "ko"
    elif re.search("[\u3040-\u309F\u30A0-\u30FF]", text):  # Japanese Hiragana & Katakana
        return "ja"
    elif re.search("[\u4E00-\u9FFF\u3400-\u4DBF]", text):  # Chinese Han characters (CJK Unified)
        return "zh"
    # Cyrillic-based languages (Russian, Ukrainian, Bulgarian, etc.)
    elif re.search("[\u0400-\u04FF]", text):
        return "ru"
    # South and Southeast Asian scripts
    elif re.search("[\u0E00-\u0E7F]", text):  # Thai
        return "th"
    elif re.search("[\u0980-\u09FF]", text):  # Bengali
        return "bn"
    elif re.search("[\u0900-\u097F]", text):  # Devanagari (Hindi, Marathi, etc.)
        return "hi"
    elif re.search("[\u0A80-\u0AFF]", text):  # Gujarati
        return "gu"
    elif re.search("[\u0B00-\u0B7F]", text):  # Oriya (Odia)
        return "or"
    elif re.search("[\u0B80-\u0BFF]", text):  # Tamil
        return "ta"
    elif re.search("[\u0C00-\u0C7F]", text):  # Telugu
        return "te"
    elif re.search("[\u0C80-\u0CFF]", text):  # Kannada
        return "kn"
    elif re.search("[\u0D00-\u0D7F]", text):  # Malayalam
        return "ml"
    elif re.search("[\u0F00-\u0FFF]", text):  # Tibetan
        return "bo"
    elif re.search("[\u1780-\u17FF]", text):  # Khmer (Cambodian)
        return "km"
    elif re.search("[\u1000-\u109F]", text):  # Burmese (Myanmar)
        return "my"
    elif re.search("[\u1700-\u171F]", text):  # Baybayin script (Tagalog)
        return "tl"
    # Middle Eastern languages
    elif re.search("[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]", text):  # Arabic and extended Arabic scripts
        return "ar"
    elif re.search("[\u0590-\u05FF]", text):  # Hebrew
        return "he"
    elif re.search("[\u0700-\u074F]", text):  # Syriac (ISO 639 code "syc"; not supported by Google Translate)
        return "syc"
    elif re.search("[\u10A0-\u10FF]", text):  # Georgian
        return "ka"
    elif re.search("[\u1200-\u137F]", text):  # Amharic (Ethiopic script)
        return "am"
    # Latin-based languages with unique diacritics
    elif re.search("[ÇĞİÖŞÜçğıöşü]", text):  # Turkish
        return "tr"
    elif re.search("[ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸàâæçéèêëîïôœùûüÿ]", text):  # French
        return "fr"
    elif re.search("[ÁÉÍÑÓÚÜáéíñóúü]", text):  # Spanish
        return "es"
    elif re.search("[ÄÖÜäöüß]", text):  # German
        return "de"
    elif re.search("[ĄĆĘŁŃÓŚŹŻąćęłńóśźż]", text):  # Polish
        return "pl"
    elif re.search("[ŘŠŽřšž]", text):  # Czech and Slovak
        return "cs"
    elif re.search("[ÐđĆćČčŠšŽž]", text):  # Croatian, Serbian (Latin)
        return "hr"
    return "unknown"
Next, we define the primary function for language detection:
# Function to detect language with retries and intelligent fallback
def detect_language(text, retries=2):
    """Try to detect the language, retrying if needed. If still unknown, fall back to guessing."""
    if not isinstance(text, str) or text.strip() == "":
        return "unknown"
    for attempt in range(retries + 1):
        try:
            detected_lang = translator.detect(text).lang
            if detected_lang != "unknown":
                return detected_lang
        except Exception:
            pass
        time.sleep(1)  # Delay before retrying
    # Final fallback: guess the language based on script patterns
    return guess_language(text)
Now, we apply this function to the dataset. Change 'title' to whichever column you want to detect and translate 🙂 For example, if your column is named 'body', use df['body'] instead of df['title'].
df['detected_language'] = df['title'].apply(lambda text: detect_language(text, retries=2))
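As a quick sanity check, you can inspect how many rows were assigned to each language (an optional step, assuming the line above completed):
# Optional: inspect the distribution of detected languages
print(df['detected_language'].value_counts())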
Step 4: Translate Non-English Text to English
def translate_to_english(text, detected_lang, retries=2):
    """Translate text to English. If detection failed (or translation keeps failing), guess the language and retry."""
    if detected_lang != "en" and detected_lang != "unknown":
        for attempt in range(retries + 1):
            try:
                return translator.translate(text, src=detected_lang, dest="en").text
            except Exception:
                time.sleep(1)  # Delay before retrying
    # Final fallback: if detection was unknown (or all retries failed), try the guessed language
    guessed_lang = guess_language(text)
    if guessed_lang != "unknown":
        try:
            return translator.translate(text, src=guessed_lang, dest="en").text
        except Exception:
            return "Translation Error"
    return text  # If all else fails, return the original text
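A quick single-string check can confirm the function works before you process a large dataset (this makes a live network call, so the exact wording of the output may vary):
# Illustrative sanity check (network call; output wording may vary)
print(translate_to_english("Merci beaucoup", "fr"))  # expected: roughly "Thank you very much"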
Apply the translation function to the dataset:
df['translated_text'] = df.apply(lambda row: translate_to_english(row['title'], row['detected_language'], retries=2), axis=1)
Step 5: Save the Results
Finally, we save the processed dataset to a new CSV file:
output_file = "translated_data.csv"
df.to_csv(output_file, encoding="utf-8-sig", index=False)
Full code block
Here is the full code block that you can copy for your own use in Google Colab. Remember to change the file path to point to your data file.
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Install required packages
!pip install pandas googletrans==4.0.0-rc1

# Step 3: Import necessary libraries
import pandas as pd
from googletrans import Translator
import time
import re

# Initialize Google Translator
translator = Translator()

# Step 4: Define file path
file_path = "your_file_path_here.csv"

# Step 5: Load the CSV file with UTF-8 encoding
df = pd.read_csv(file_path, encoding="utf-8-sig")

# Step 6: Language detection function using regex patterns
def guess_language(text):
    """Guess the language based on character patterns if detection fails."""
    # East Asian languages
    if re.search("[\uac00-\ud7af]", text):  # Korean Hangul
        return "ko"
    elif re.search("[\u3040-\u309F\u30A0-\u30FF]", text):  # Japanese Hiragana & Katakana
        return "ja"
    elif re.search("[\u4E00-\u9FFF\u3400-\u4DBF]", text):  # Chinese Han characters (CJK Unified)
        return "zh"
    # Cyrillic-based languages (Russian, Ukrainian, Bulgarian, etc.)
    elif re.search("[\u0400-\u04FF]", text):
        return "ru"
    # South and Southeast Asian scripts
    elif re.search("[\u0E00-\u0E7F]", text):  # Thai
        return "th"
    elif re.search("[\u0980-\u09FF]", text):  # Bengali
        return "bn"
    elif re.search("[\u0900-\u097F]", text):  # Devanagari (Hindi, Marathi, etc.)
        return "hi"
    elif re.search("[\u0A80-\u0AFF]", text):  # Gujarati
        return "gu"
    elif re.search("[\u0B00-\u0B7F]", text):  # Oriya (Odia)
        return "or"
    elif re.search("[\u0B80-\u0BFF]", text):  # Tamil
        return "ta"
    elif re.search("[\u0C00-\u0C7F]", text):  # Telugu
        return "te"
    elif re.search("[\u0C80-\u0CFF]", text):  # Kannada
        return "kn"
    elif re.search("[\u0D00-\u0D7F]", text):  # Malayalam
        return "ml"
    elif re.search("[\u0F00-\u0FFF]", text):  # Tibetan
        return "bo"
    elif re.search("[\u1780-\u17FF]", text):  # Khmer (Cambodian)
        return "km"
    elif re.search("[\u1000-\u109F]", text):  # Burmese (Myanmar)
        return "my"
    elif re.search("[\u1700-\u171F]", text):  # Baybayin script (Tagalog)
        return "tl"
    # Middle Eastern languages
    elif re.search("[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]", text):  # Arabic and extended Arabic scripts
        return "ar"
    elif re.search("[\u0590-\u05FF]", text):  # Hebrew
        return "he"
    elif re.search("[\u0700-\u074F]", text):  # Syriac (ISO 639 code "syc"; not supported by Google Translate)
        return "syc"
    elif re.search("[\u10A0-\u10FF]", text):  # Georgian
        return "ka"
    elif re.search("[\u1200-\u137F]", text):  # Amharic (Ethiopic script)
        return "am"
    # Latin-based languages with unique diacritics
    elif re.search("[ÇĞİÖŞÜçğıöşü]", text):  # Turkish
        return "tr"
    elif re.search("[ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸàâæçéèêëîïôœùûüÿ]", text):  # French
        return "fr"
    elif re.search("[ÁÉÍÑÓÚÜáéíñóúü]", text):  # Spanish
        return "es"
    elif re.search("[ÄÖÜäöüß]", text):  # German
        return "de"
    elif re.search("[ĄĆĘŁŃÓŚŹŻąćęłńóśźż]", text):  # Polish
        return "pl"
    elif re.search("[ŘŠŽřšž]", text):  # Czech and Slovak
        return "cs"
    elif re.search("[ÐđĆćČčŠšŽž]", text):  # Croatian, Serbian (Latin)
        return "hr"
    return "unknown"

# Function to detect language with retries and intelligent fallback
def detect_language(text, retries=2):
    """Try to detect the language, retrying if needed. If still unknown, fall back to guessing."""
    if not isinstance(text, str) or text.strip() == "":
        return "unknown"
    for attempt in range(retries + 1):
        try:
            detected_lang = translator.detect(text).lang
            if detected_lang != "unknown":
                return detected_lang
        except Exception:
            pass
        time.sleep(1)  # Delay before retrying
    # Final fallback: guess the language based on script patterns
    return guess_language(text)

# Step 7: Apply function to detect language
df['detected_language'] = df['title'].apply(lambda text: detect_language(text, retries=2))

# Step 8: Translate non-English text to English with intelligent fallback
def translate_to_english(text, detected_lang, retries=2):
    """Translate text to English. If detection failed (or translation keeps failing), guess the language and retry."""
    if detected_lang != "en" and detected_lang != "unknown":
        for attempt in range(retries + 1):
            try:
                return translator.translate(text, src=detected_lang, dest="en").text
            except Exception:
                time.sleep(1)  # Delay before retrying
    # Final fallback: if detection was unknown (or all retries failed), try the guessed language
    guessed_lang = guess_language(text)
    if guessed_lang != "unknown":
        try:
            return translator.translate(text, src=guessed_lang, dest="en").text
        except Exception:
            return "Translation Error"
    return text  # If all else fails, return the original text

# Step 9: Apply translation function
df['translated_text'] = df.apply(lambda row: translate_to_english(row['title'], row['detected_language'], retries=2), axis=1)

# Step 10: Save the results back to Google Drive
output_file = "/content/drive/MyDrive/translated_data.csv"  # adjust the Drive path as needed
df.to_csv(output_file, encoding="utf-8-sig", index=False)

# Step 11: Display the first few rows
df.head()
Last Note
In this tutorial, we explored how to detect and translate multilingual text in Python, leveraging the googletrans library for both language detection and translation. While this approach offers a straightforward solution, it is worth knowing the alternatives. In my experience, the methods below were not as accurate as Google Translate, but your results may differ for other use cases 🙂
Alternative Language Detection Methods:
- langdetect package: a simple probabilistic language detection tool. In practice, however, it often shows lower accuracy than Google's translation services, particularly on short or informal texts. A minimal usage sketch follows below.
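As a quick illustration, here is a minimal langdetect sketch (assuming the package is installed as shown; detection is probabilistic, so seeding makes results reproducible across runs):
!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0  # make the probabilistic results deterministic across runs
print(detect("War doesn't show who's right, just who's left."))  # 'en'
print(detect("Ein, zwei, drei, vier"))  # 'de'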
You can also consider alternative translation services, as follows:
Alternative Translation Services:
- DeepL API: DeepL is renowned for its high-quality translations, especially for European languages. While it is a paid service, many users find its accuracy superior to that of free alternatives. You can incorporate it into your code through the DeepL API (see the first sketch after this list).
- Helsinki-NLP Models: The Language Technology Research Group at the University of Helsinki provides open-source neural machine translation models, available through platforms like Hugging Face. These models cover a wide range of language pairs and can be integrated into custom applications (see the second sketch after this list).
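For DeepL, a minimal sketch using the official deepl Python package might look like the following (the auth key is a hypothetical placeholder; you need your own key from a DeepL API plan):
!pip install deepl
import deepl

auth_key = "YOUR_DEEPL_AUTH_KEY"  # hypothetical placeholder; obtain a real key from DeepL
deepl_translator = deepl.Translator(auth_key)
result = deepl_translator.translate_text("Bonjour tout le monde", target_lang="EN-US")
print(result.text)  # e.g., "Hello everyone"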
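For the Helsinki-NLP models, a minimal sketch via the Hugging Face transformers pipeline could look like this. Note that the opus-mt models are language-pair specific, so French-to-English is shown purely as an example:
!pip install transformers sentencepiece
from transformers import pipeline

# Each Helsinki-NLP/opus-mt model handles one language pair; fr -> en here
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(mt("Bonjour tout le monde")[0]["translation_text"])  # e.g., "Hello everyone"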