Translating Different Content Types Using Helsinki-NLP ML Model from Hugging Face / Blogs / Perficient

Vernon September 4, 2023


In Hugging Face, a translation model is a pre-trained deep learning model that can be used for machine translation tasks. These models are pre-trained on large amounts of multilingual data and fine-tuned on translation-specific datasets.

To use a translation model in Hugging Face, we typically load the model using the from_pretrained() function, which fetches the pre-trained weights and configuration. Then, we can use the model to translate text by passing the source language text as input and obtaining the translated text as output.

Hugging Face’s translation models are implemented in the Transformers library, a popular open-source library for natural language processing (NLP) tasks. The library provides a unified interface and a set of powerful tools for working with various NLP models, including translation models.

Let’s start by implementing a translation model using the Helsinki-NLP model from Hugging Face:

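Install the necessary libraries: Use pip to install the transformers library, which includes the translation models, together with PyTorch, sentencepiece (which the Marian tokenizer requires), and sacremoses (which the tokenizer uses when available).
```
pip install transformers torch sentencepiece sacremoses
```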
Load the translation model: Use the from_pretrained() method to load a pre-trained translation model. We need to specify the model’s name or identifier. For example, to load the English-to-French translation model, we can use the following code.
```
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
```
Tokenize the input: Before translating the text, we need to tokenize it using the appropriate tokenizer. The tokenizer splits the text into smaller units, such as words or subwords, that the model understands.
```
# Example sentence; replace it with your own source text
source_text = "Hello, how are you?"
encoded_input = tokenizer.encode(source_text, return_tensors="pt")
```
Translate the text: Pass the encoded input to the translation model to obtain the translated output.
```
translated_output = model.generate(encoded_input)
```
Decode the output: Convert the generated token IDs back into readable text with the tokenizer.
```
translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)
```
These steps provide a basic outline of implementing a translation model in Hugging Face.
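
The identifier passed to from_pretrained() in the loading step follows the naming pattern Helsinki-NLP/opus-mt-{source}-{target}. As a small sketch, a helper can build the identifier for another language pair. Note that opus_mt_model_name is a hypothetical function and not part of Transformers, and that a checkpoint is not published for every pair, so the resulting name still has to be checked on the Hugging Face Hub.

```python
def opus_mt_model_name(source_lang: str, target_lang: str) -> str:
    """Build a Helsinki-NLP OPUS-MT model identifier for a language pair.

    Hypothetical helper for illustration; it only formats the name and does
    not check that the checkpoint actually exists on the Hugging Face Hub.
    """
    source = source_lang.strip()
    target = target_lang.strip()
    if not source or not target:
        raise ValueError("both language codes are required")
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


print(opus_mt_model_name("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```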

Handling Metadata, HTML Body, and Plain Text

With the fundamentals of the Helsinki-NLP Hugging Face model in hand, let us get started translating various forms of content, including plain text, HTML body content, and metadata.

It is essential to determine the type of content we will be dealing with before we start translating.

The following functions differentiate between plain text, HTML, and metadata:

is_plain_text(content): Treats the content as plain text when it contains no html tag and is not a Python string expression.
is_html_content(content): Identifies HTML content by the presence of the html tag.
is_python_string(content): Recognizes metadata passed as a Python string expression by its delimiters, that is, surrounding quotes or curly braces.
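
The three checks above can also be combined into a single classifier. The following is a minimal sketch rather than the article's implementation: classify_content and its return labels are hypothetical names, and it assumes that metadata always arrives as a Python dict literal and that HTML always contains an html tag.

```python
import ast
import re

# Matches an opening <html> tag, with or without attributes, in any letter case.
HTML_TAG = re.compile(r"<html[\s>]", re.IGNORECASE)


def classify_content(content: str) -> str:
    """Return 'html', 'metadata', or 'plain_text' for the given string."""
    stripped = content.strip()
    if HTML_TAG.search(stripped):
        return "html"
    if stripped.startswith("{") and stripped.endswith("}"):
        try:
            # literal_eval only accepts Python literals, so it does not run code.
            if isinstance(ast.literal_eval(stripped), dict):
                return "metadata"
        except (ValueError, SyntaxError, TypeError):
            pass  # Looks like a dict but is not one; fall through to plain text.
    return "plain_text"


print(classify_content("<html><body>Hi</body></html>"))  # html
print(classify_content("{'title': 'A title'}"))          # metadata
print(classify_content("Just a sentence."))              # plain_text
```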

The following approaches demonstrate the translation of each content type:

Translating Metadata Content: Metadata often consists of structured data in the form of key-value pairs, such as name and title. This function translates only the values of the selected metadata fields and leaves the keys unchanged:

def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
    translated_metadata = {}
    # Loop through each field and perform the translation process
    for key, value in metadata.items():
        # Translate the value if it is a string and included in fields_to_translate
        if isinstance(value, str) and key in fields_to_translate:
            value_tokens = tokenizer.encode(value, return_tensors="pt")
            translated_value_tokens = model.generate(value_tokens, max_length=100)
            translated_value = tokenizer.decode(translated_value_tokens[0], skip_special_tokens=True)
        else:
            translated_value = value

        translated_metadata[key] = translated_value

    return json.dumps(translated_metadata)
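
To see which values the function touches without downloading a model, the same selection logic can be exercised with a stand-in translator. This is a sketch under assumptions: translate_selected_fields and fake_translate are hypothetical names, and fake_translate merely tags the text instead of translating it.

```python
import json


def translate_selected_fields(metadata, fields_to_translate, translate):
    """Apply `translate` to string values whose key is listed; keep the rest."""
    translated = {}
    for key, value in metadata.items():
        if isinstance(value, str) and key in fields_to_translate:
            translated[key] = translate(value)
        else:
            translated[key] = value
    return translated


def fake_translate(text):
    # Stand-in for the model call; a real translator would return French text.
    return f"[fr] {text}"


metadata = {"title": "Annual report", "id": 42, "slug": "annual-report"}
result = translate_selected_fields(metadata, ["title", "name"], fake_translate)
print(json.dumps(result))
# {"title": "[fr] Annual report", "id": 42, "slug": "annual-report"}
```

Only the title value changes here: id is not a string, and slug is a string but is not listed in the fields to translate.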

Translating Plain Text Content: Plain text translation is a simple technique. To translate plain text from one language to another, we’ll use our translation model:

def translate_plainText(content, model, tokenizer):
    # Tokenize the plain text content
    encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)

    # Translate the text
    translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
    return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
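
Because the tokenizer call uses truncation=True, input longer than the model's maximum length is cut off rather than translated. One way around this, sketched below under assumptions, is to split long text into groups of sentences and translate each group separately. split_into_chunks and its character limit are hypothetical choices, and the regular expression is a naive sentence splitter, not a linguistic one.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole in its own chunk.
    """
    # Split after ., ! or ? followed by whitespace (naive sentence boundary).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "First sentence. Second sentence! Third one?"
print(split_into_chunks(text, max_chars=32))
# ['First sentence. Second sentence!', 'Third one?']
```

Each chunk can then be passed to translate_plainText and the translated pieces joined with a space.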

Translating HTML Body Content: Because of the markup, HTML body content needs extra processing: only the text should be translated, while the tags stay intact. This function translates the text of an HTML document:

from bs4 import BeautifulSoup, NavigableString

def translate_html_content(content, model, tokenizer):
    # Parse the HTML so that only text nodes, not tags, reach the model
    soup = BeautifulSoup(content, 'html.parser')

    # Translate each text node in place; tags and attributes stay untouched
    for node in soup.find_all(string=True):
        text = node.strip()
        # Skip whitespace, comments, the doctype, and script/style contents
        if not text or type(node) is not NavigableString or node.parent.name in ('script', 'style'):
            continue
        encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
        node.replace_with(tokenizer.decode(translated_tokens[0], skip_special_tokens=True))
    return soup.prettify()
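
A quick way to check that markup survives translation is to run only the text nodes through a stand-in translator. The sketch below uses Python's built-in html.parser instead of BeautifulSoup so that it has no dependencies. TextNodeTranslator and translate_text_nodes are hypothetical names, and str.upper stands in for the model.

```python
import html
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Rebuild the HTML, passing only text nodes through `translate`."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.parts = []
        self.skipping = 0  # Depth inside <script>/<style>, which stay untouched.

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skipping:
            self.skipping -= 1
        self.parts.append(f"</{tag}>")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_data(self, data):
        text = data.strip()
        if self.skipping or not text:
            self.parts.append(data)
            return
        # Keep the surrounding whitespace and re-escape the translated text.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.parts.append(lead + html.escape(self.translate(text), quote=False) + trail)


def translate_text_nodes(content, translate):
    parser = TextNodeTranslator(translate)
    parser.feed(content)
    parser.close()
    return "".join(parser.parts)


print(translate_text_nodes('<p class="intro">Hello, <b>world</b>!</p>', str.upper))
# <p class="intro">HELLO, <b>WORLD</b>!</p>
```

The tags and the class attribute come through unchanged, while each text node is replaced. With a real model, the translate argument would be a function that wraps the tokenizer and model.generate calls.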

Putting it All Together

To bring everything together, a central function dispatches to the appropriate translation method according to the content type:

import ast
import json
import subprocess
import sys

# Install necessary packages if not already installed
try:
    import transformers
    import sacremoses
    import bs4
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           'torch', 'transformers', 'sentencepiece', 'sacremoses', 'beautifulsoup4'])
    import transformers
    import sacremoses

from transformers import MarianMTModel, MarianTokenizer
from bs4 import BeautifulSoup


def translate_content(content):
    # Load the translation model and tokenizer
    model_name = 'Helsinki-NLP/opus-mt-en-fr'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    tokenizer.src_tokenizer = sacremoses.MosesTokenizer()
    tokenizer.tgt_tokenizer = sacremoses.MosesTokenizer()

    # Check if the input is HTML or plain text or metadata
    if is_html_content(content):
        translated_content = translate_html_content(content, model, tokenizer)
    elif is_python_string(content):
        print("Content is a Python string expression.")
        fields_to_translate = ['title', 'name']
        content = content.replace("null", "None")
        # literal_eval parses Python literals without executing arbitrary code
        metadata = ast.literal_eval(content)
        if isinstance(metadata, dict):
            translated_content = translate_metadata_content(metadata, model, tokenizer, fields_to_translate)
        else:
            # A quoted string literal has no keys, so translate it as plain text
            translated_content = translate_plainText(str(metadata), model, tokenizer)
    elif is_plain_text(content):
        translated_content = translate_plainText(content, model, tokenizer)

    return translated_content
       
# Add the translate_metadata_content, translate_plainText, and
# translate_html_content functions from the sections above here.

def is_html_content(content):
    # Match "<html" so that tags with attributes, such as <html lang="en">, are detected
    return "<html" in content.lower()

def is_plain_text(content):
    return not is_html_content(content) and not is_python_string(content)

def is_python_string(content):
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))
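
translate_content loads the model and tokenizer on every call, which is slow when many pieces of content are translated in one run. A possible refinement, sketched here under assumptions, is to cache the loaded objects per model name with functools.lru_cache. cached_loader and fake_load are hypothetical names. In the real script, the loader would instead return the results of MarianMTModel.from_pretrained and MarianTokenizer.from_pretrained.

```python
from functools import lru_cache


def cached_loader(load_fn):
    """Wrap a loader so each model name is loaded only once per process."""
    return lru_cache(maxsize=None)(load_fn)


load_calls = []


def fake_load(model_name):
    # Stand-in for the expensive from_pretrained() calls; records each real load.
    load_calls.append(model_name)
    return (f"model:{model_name}", f"tokenizer:{model_name}")


get_translator = cached_loader(fake_load)

get_translator("Helsinki-NLP/opus-mt-en-fr")
get_translator("Helsinki-NLP/opus-mt-en-fr")  # Served from the cache.
print(len(load_calls))  # 1
```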

Example Usage

Here are examples of using the provided functions with different content types.

# Example usage with HTML content:
html_content = """
<html>
<head>
    <title>Example HTML</title>
</head>
<body>
    <h1>Hello, world!</h1>
    <p>This is a sample HTML content to be translated.</p>
</body>
</html>
"""
translated_html = translate_content(html_content)
print(translated_html)


# Example usage with plain text:
plain_text = "plain text content for testing translation functionality "

translated_text = translate_content(plain_text)
print(translated_text)


# Example usage with metadata
metadata ="{'title':'title for testing translation of metadata value'}"


translated_metadata = translate_content(metadata)
print(translated_metadata)

Save the script to a file and run it with the following command:

python your_file_name.py

The Helsinki-NLP model from Hugging Face is a capable tool for translating different types of content, including plain text, website text (HTML), and extra information (metadata). Using the pre-trained models in the Transformers library, we can easily translate text from one language to another.
