I’m excited to share my journey of building a language detection tool using Python. Whether you’re a beginner or someone looking to explore natural language processing (NLP), this guide will walk you through the process, explain the key concepts, and provide practical enhancements to make the tool more versatile.
Why Language Detection?
Language detection is a fundamental task in NLP that identifies the language of a given text. It’s widely used in applications like:
- Multilingual Content Analysis: Automatically categorizing content by language.
- Translation Systems: Detecting the source language before translating.
- User Experience Personalization: Tailoring content based on the user’s preferred language.
For this project, I used the `langdetect` library, a Python port of Google's language-detection library. It's simple, accurate, and easy to integrate into your projects.
Setting Up the Environment
Before writing any code, I set up my Python environment. Here’s what I did:
- Install the `langdetect` library:

  ```bash
  pip install langdetect
  ```
- Create a Python File: I created a file named `language_detector.py` to write and test the code.
Writing the Basic Code
The core functionality of the tool is straightforward. Here’s the initial code:
```python
from langdetect import detect_langs

# Input text
text_to_detect = "This is a text. \n Este es un texto."

# Detect languages
detected_languages = detect_langs(text_to_detect)

# Print results
for language in detected_languages:
    print(f"{language.lang}: {language.prob}")
```
Explanation of the Code
Importing the Library:

```python
from langdetect import detect_langs
```

The `langdetect` library provides the `detect_langs` function, which detects the language(s) of a given text.
Input Text:

```python
text_to_detect = "This is a text. \n Este es un texto."
```

The variable `text_to_detect` contains a multi-language string with English and Spanish.
Detecting Languages:

```python
detected_languages = detect_langs(text_to_detect)
```

The `detect_langs` function returns a list of `Language` objects, each containing:
- `lang`: the language code (e.g., `en` for English, `es` for Spanish).
- `prob`: the probability (confidence) that the text is in that language.
Printing Results:

```python
for language in detected_languages:
    print(f"{language.lang}: {language.prob}")
```

This loop iterates through the detected languages and prints each code with its probability.
Enhancing the Code
While the basic code works well, I wanted to make it more practical and user-friendly. Here are the enhancements I added:
Handling Multiple Texts
If you have multiple texts to analyze, you can process them in bulk:
```python
texts = [
    "This is a text in English.",
    "Este es un texto en español.",
    "Ceci est un texte en français.",
]

for text in texts:
    detected_languages = detect_langs(text)
    print(f"Text: {text}")
    for language in detected_languages:
        print(f"Detected Language: {language.lang}, Probability: {language.prob}")
    print("-" * 30)
```
Adding Language Names
Instead of displaying language codes, I mapped them to their full names for better readability:
```python
language_names = {
    'en': 'English',
    'es': 'Spanish',
    'fr': 'French',
    'de': 'German',
    # Add more languages as needed
}

text_to_detect = "This is a text. \n Este es un texto."
detected_languages = detect_langs(text_to_detect)

for language in detected_languages:
    lang_name = language_names.get(language.lang, "Unknown Language")
    print(f"{lang_name}: {language.prob}")
```
Handling Errors
The `langdetect` library raises a `LangDetectException` for texts that are too short or too ambiguous to classify. I added error handling to manage such cases:
```python
from langdetect import detect_langs, LangDetectException

text_to_detect = "Hi"

try:
    detected_languages = detect_langs(text_to_detect)
    for language in detected_languages:
        print(f"{language.lang}: {language.prob}")
except LangDetectException as e:
    print(f"Error: {e}")
```
Setting a Confidence Threshold
To filter out languages with low probabilities, I added a confidence threshold:
```python
confidence_threshold = 0.5

detected_languages = detect_langs(text_to_detect)

for language in detected_languages:
    if language.prob >= confidence_threshold:
        print(f"{language.lang}: {language.prob}")
    else:
        print(f"{language.lang} is below the confidence threshold.")
```
Detecting Languages in Files
For analyzing text files, I added functionality to read and process each line:
```python
file_path = "sample_text.txt"

with open(file_path, "r", encoding="utf-8") as file:
    for line in file:
        line = line.strip()
        if line:
            detected_languages = detect_langs(line)
            print(f"Text: {line}")
            for language in detected_languages:
                print(f"Detected Language: {language.lang}, Probability: {language.prob}")
            print("-" * 30)
```
Creating a Reusable Function
To make the code reusable, I encapsulated the logic in a function:
```python
from langdetect import detect_langs, LangDetectException

def detect_language(text, confidence_threshold=0.1):
    """Return (language code, probability) pairs above the threshold."""
    try:
        detected_languages = detect_langs(text)
        results = []
        for language in detected_languages:
            if language.prob >= confidence_threshold:
                results.append((language.lang, language.prob))
        return results
    except LangDetectException as e:
        return f"Error: {e}"

text_to_detect = "This is a text. \n Este es un texto."
print(detect_language(text_to_detect))
```
Final Thoughts
Building this language detection tool was both fun and educational. It reminded me of the power of Python and its libraries in solving real-world problems. The `langdetect` library is incredibly easy to use, and with a few enhancements, the tool becomes even more versatile.
Whether you’re analyzing multilingual content, building a translation system, or personalizing user experiences, language detection is a valuable skill to have in your toolkit.