
"Benchmarking Neural Machine Translation Using Open-Source Transformer Models and a Comparative Study with a Focus on Medical and Legal Domains" by Jawad Zaman

Benchmarking Neural Machine Translation Using Open-Source Transformer Models and a Comparative Study with a Focus on Medical and Legal Domains

Jawad Zaman, St. Joseph's University


Abstract: This research evaluates the performance of open-source Neural Machine Translation (NMT) models from Hugging Face, such as T5-base, MBART-large, and Helsinki-NLP. It emphasizes the ability of these models to handle both general and specialized translations, particularly medical and legal texts. Given the growing demand for accurate translations in professional fields, this study explores the fluency, accuracy, and contextual relevance of NMT for specialized terms. BLEU and METEOR scores quantify the performance of these models, and the results are presented as comparative visualizations built with Matplotlib and Seaborn. Traditional Statistical Machine Translation (SMT) relies on word-level probabilities and often struggles with idiomatic expressions and contextual understanding. Neural Machine Translation addresses these limitations by considering entire sentences, providing more nuanced translations. However, its performance in specialized fields such as medical and legal translation remains limited and can be improved by fine-tuning pre-trained NMT models on domain-specific datasets. This research uses the TensorFlow and PyTorch frameworks to interface with and benchmark pre-trained NMT models across two types of datasets: general translation datasets and specialized translation datasets (medical and legal texts) obtained from Kaggle, a popular dataset repository. Performance is visualized through bar graphs and scatter plots, demonstrating each model's strengths, limitations, and potential for improvement. Comparing the results of each model, some models are expected to perform better in general translation, while others excel in medical or legal texts, highlighting the adaptability of NMT systems to domain-specific translation tasks.


Keywords: Neural Machine Translation, T5, MBart, Helsinki, BLEU, METEOR, medical, legal

Introduction

Neural Machine Translation (NMT), particularly in its transformer-based form, plays an important role in computational linguistics. Traditional methods, such as Statistical Machine Translation (SMT), rely on word-level probabilities and struggle with idiomatic expressions and specialized terms. NMT, on the other hand, uses an attention mechanism to generate fluent and contextually accurate translations.


However, the efficiency of NMT in specialized fields like medical and legal translation is still limited. Although it can handle general translations well, NMT faces difficulties translating specialized terms. “Due to the language barriers and inequality of digital resources across languages, there is an urgent need for knowledge transformation, such as from one human language to another. Thus, to help address digital health inequalities, machine translation (MT) technologies can be of good use in this case” (Han et al., 2024, p. 2).


This research focuses on the translation effectiveness of NMT models and their adaptability to medical and legal domains. Translation datasets are obtained from Hugging Face and Kaggle repositories. Each dataset contains texts in a source language and target language in general, medical, and legal contexts. Pre-trained NMT models are trained and tested on these datasets and benchmarked using BLEU and METEOR scores. 


A total of 756 scores were obtained: two scores per training session across three models, seven source languages, six target languages, and three contexts. Each training session takes about 11 hours to complete, resulting in a total training time of 8,316 hours (approximately 347 days) for all models. To speed up the process, some scores were estimated from the scores that were actually obtained.


The scores are visualized through scatterplots, and the data is also aggregated and visualized by model, language, and context through bar charts. The datasets, models, code, and results are openly accessible at the following GitHub repository: https://github.com/Jzaman2004/NLP_Translation


Research Questions

This research focuses on three key questions: first, the general effectiveness of open-source translation models; second, the difference in translation quality between languages that use the Latin alphabet and those with unique character systems; and finally, and most importantly, the challenges and opportunities in translating medical and legal terms.


Question 1: Which open-source translation model is the most effective and accurate?

There are many open-source NLP models available nowadays. “The accelerated progress in AI and natural language processing (NLP) has not only fostered the development of highly sophisticated and adaptable language models but has also exerted a considerable influence on numerous domains, particularly machine translation, where traditional NMT techniques have achieved remarkable advancements in recent years” (Son & Kim, 2023, p. 1). This research aims to identify the best model for translation in terms of accuracy and effectiveness.


Question 2: Does translation quality differ significantly between languages that use the Latin alphabet and those with unique character systems?

With around 7,000 languages in the world, fewer than 100 use the same Latin alphabet as English (The World's, n.d.). Is translation more difficult between languages with different character systems than between languages that share the same script? Translations between languages that share a script are expected to perform better than those between languages with different character systems.


Question 3: What are the challenges and opportunities in translating medical and legal terms?

NMT can struggle with domain-specific terminology and context. Medical translations require complete accuracy to avoid potentially life-threatening misinterpretations. “Healthcare Text Analytics (HECTA) has gained more attention nowadays from researchers across different disciplines, due to their impact on clinical treatment, decision-making, hospital operation, and their recently empowered capabilities” (Han et al., 2024, p. 2). Legal texts, on the other hand, demand adherence to precise wording and structure. “Neither ChatGPT nor NMT systems meet a passing standard for E-C translation of legal texts, with the NMT systems showing better overall performance” (Ding, 2024, p. 1). Therefore, Neural Machine Translation models are expected to perform more smoothly in general contexts than in specialized fields such as medicine or law.


Model Considerations

To test the data more thoroughly, three models were selected from Hugging Face. Each model represents a different approach to neural machine translation, which provides a well-rounded basis for calculating BLEU and METEOR scores across various languages and contexts. The models are: T5-base, MBART-large, and Helsinki-NLP.


T5-base:

T5 is a text-to-text transformer model developed by Google that, among other NLP tasks, can handle multilingual machine translation across English, German, Romanian, and French. The base model is chosen for this research to evaluate translations primarily among English, German, and Romanian. The official model description says that “With T5, we propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task” (T5, n.d.).


MBART-large:

MBART is a multilingual machine translation model developed by Meta, the parent company of Facebook. The mBART-large-50 model is fine-tuned to provide reliable translations between any pair of its 50 supported languages, including English, French, Arabic, Japanese, Bengali, Chinese, Hindi, and many more. “It can also be fine-tuned on other multilingual sequence-to-sequence tasks” (MBART-50 Many, n.d.). For this research, mBART-large is used with a focus on French and Arabic translations.


Helsinki-NLP:

Opus-MT is a family of translation and text-to-text generation models developed by Helsinki-NLP, comprising more than 1,500 models. Each model translates between a single pair of languages, and its name consists of “opus-mt” followed by the source and target language codes (for example, opus-mt-zh-en for Chinese to English). As a result, there is no single benchmark score for Opus-MT as a whole; instead, BLEU and METEOR scores vary from model to model. This research focuses on two of the most common Helsinki-NLP opus-mt models: opus-mt-zh-en and opus-mt-en-id (Helsinki-NLP, n.d.).


Language Focus

The language focus of this research centers on understanding the impact of character systems on translation quality. This research deals with the translation performance between languages using the Latin alphabet, such as English, French, Romanian, and German, and those using unique scripts, such as Chinese and Arabic. It is hypothesized that translation between languages with the same characters will yield better results than languages with different character systems. As a result, the evaluations of Chinese and Arabic translations are expected to score lower compared to translations involving English, French, Romanian, and German.


Translation Contexts

The translation contexts used in this research include general, medical, and legal domains. Each context poses unique challenges and opportunities for Neural Machine Translation (NMT) models. Some of the contexts need more extensive training since they are complex, but all three are crucial for the assessment of the models. The three contexts are explained below:


  1. General context: The general context involves everyday language and common expressions. This context serves as the baseline for evaluating the other contexts. Although the general context requires the largest training and testing set, it is expected to perform best, since general texts and speech are easier to translate than medical and legal material.


  2. Medical context: The medical context focuses on highly specialized medical terminology. Errors must be avoided in this domain, because patient outcomes can depend on translation accuracy. The number of available medical texts is small compared to general texts, yet the models must understand complex medical jargon and produce precise translations, so this context is anticipated to perform worse than general translation.


  3. Legal context: The legal context deals with texts requiring strict adherence to specific wording and phrasing, where even slight inaccuracies can lead to misinterpretation of legal terms. Like medical translation, it requires precise, contextual translations that balance linguistic accuracy and structural integrity. It is expected to perform the worst among the three contexts due to the scarcity of legal translation datasets.


Dataset Selection

The selection of datasets is an important part of this research, as it ensures the results are reliable and relevant. Not all candidate datasets were of equal length, so five datasets were chosen from open-source repositories, namely Hugging Face and Kaggle, to keep training and testing sizes comparable across the language models. Three of the five focus on specialized domains (medical and legal), while the remaining two were selected for their general linguistic content. The datasets cover English, French, Arabic, Chinese, Indonesian, German, and Romanian, which allows translation quality to be tested across combinations of languages and contexts.


The five datasets selected for this research are open-source and accessible on Hugging Face and Kaggle. A description of each dataset is provided below:


French-Arabic Legal Translation Dataset:

Moudather Chelbi, an AI researcher in legal tech who goes by the username chemouda, created and uploaded a legal translation dataset to Hugging Face (chemouda, n.d.). It consists of 1,000 rows, representing 1,000 translations of legal content between French and Arabic.


Medical Translation Dataset 1:

Gerald Lee, a student at Columbia University, created a medical translation dataset based on the work of Sudesda, who built a medical translation website for medical staff to communicate with migrant workers in Singapore (Lee, n.d.). The dataset includes translations of common medical terms across more than 25 languages, including Chinese, Indonesian, English, and Bengali, and contains 385 multilingual text translations commonly exchanged between a doctor and a patient speaking different languages. It was downloaded from Kaggle.


Medical Translation Dataset 2:

Another medical translation dataset was uploaded to Hugging Face by ai-amplified, an anonymous user on that site, and it includes multilingual medical translations across English, German, Romanian, French, Portuguese, and Spanish (ai-amplified, n.d.). This dataset is much larger than the previous one, with more than 1,000 translations for each language pair and more than 5,000 rows in total.


Corpus of Contemporary Taiwanese Mandarin Translation Dataset:

Pokai Chang, a Hugging Face user who goes by the username zetavg, collected data from the Corpus of Contemporary Taiwanese Mandarin (COCT) magazine corpus to create a dataset for traditional Chinese translations (zetavg, n.d.). The dataset consists of general terms commonly found in English and traditional Chinese sentences. It is by far the largest dataset, containing more than 311,000 rows of English-Chinese translations.


Romanian Updated Translation Dataset:

Gargaz, an anonymous user on Hugging Face, uploaded a dataset consisting of more than 12,200 rows of translations between general texts in English and Romanian (Gargaz, n.d.). Unlike the previous datasets, however, this one has four columns instead of two, with the last two columns containing responses to the first statement in both languages. The dataset is therefore valuable not only for evaluating translations but also for generating translated responses.


In total, the five datasets contain over 340,000 translations, and all three models undergo tokenization, training, testing, and benchmarking with two scores across three contexts and seven languages. This amounts to 756 model training runs performed on over 340,000 lines of text, or roughly 257,040,000 translations. Such an enormous volume of data requires 8,316 hours of processing, even with a dedicated GPU with 128 GB of RAM.


Translation Evaluation Metrics

Two widely recognized translation evaluation metrics are used in this research, BLEU and METEOR. Both metrics provide insight into the accuracy and quality of the neural machine translation models. The concepts and calculations of both metrics are explained below:


  1. BLEU: BLEU stands for Bilingual Evaluation Understudy. It is a corpus-based metric used for evaluating the quality of machine translation output against human reference translations. “BLEU metric is designed to measure how close SMT output is to that of human reference translations. It is important to note that translations, SMT or human, may differ significantly in word usage, word order, and phrase length. To address these complexities, BLEU attempts to match variable-length phrases between SMT output and reference translations” (Wolk & Marasek, 2016, p. 56). The score is computed as

BLEU = BP × exp( Σ_{n=1}^{N} w_n · log p_n )

where BP is the brevity penalty, the w_n are positive weights summing to one, and p_n is the precision of n-grams of order n.


  2. METEOR: METEOR stands for Metric for Evaluation of Translation with Explicit Ordering. It is intended to account for several factors that BLEU considers only indirectly. “The METEOR method uses a sophisticated and incremental word alignment method that starts by considering exact word-to-word matches, word stem matches, and synonym matches. Alternative word order similarities are then evaluated based on those matches” (Wolk & Marasek, 2016, p. 59). The score is computed as

METEOR = (10 · P · R / (R + 9 · P)) × (1 − P_M)

where P and R are the unigram precision and recall, respectively, and P_M is the penalty term.


Cloud-based IDE (Google Colab)

Google Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs and TPUs. “Colab is especially well-suited to machine learning, data science, and education” (Google Colab, n.d.). It is a browser-based Integrated Development Environment (IDE) that allows users to write and run Python code. As this research is so data-intensive, Google Colab becomes a valuable tool compared to other IDEs, like Visual Studio Code. Google Colab provides 15 GB of dedicated GPU RAM, which is essential for training machine learning models efficiently. It allows section-based code execution, enabling users to run specific parts of the code without having to repeat the entire process. Moreover, it provides easier data visualization for analyzing and interpreting the results. Furthermore, Colab ensures that the research is easily accessible and reproducible because the output of the code remains accessible throughout the session.


Code Description

The implementation of this research relies heavily on code to handle large-scale data. Python is used as the programming language, complemented by PyTorch, TensorFlow, Matplotlib, and Seaborn. The Python file consists of 1,121 lines of code and 44,859 characters. To keep execution smooth and fast, the code is divided into 45 sections. All lines are thoroughly commented, with over 1,000 comments for easier understanding. For full comprehension, however, it is recommended to refer to the complete code in the GitHub repository: https://github.com/Jzaman2004/NLP_Translation


Installing libraries, metrics, and tokenizers

The first step in setting up the Python environment is to install the necessary libraries. As this experiment uses the BLEU and METEOR metrics, the packages that implement them need to be installed as well. The datasets and transformers libraries are installed to access public datasets and transformer models. Sacremoses is installed for tokenization support required by some of the models. NLTK (Natural Language Toolkit) is installed for calculating BLEU scores, Meteor is used for METEOR scores, and finally spaCy is used for tokenization. The code for installing these libraries, metrics, and tokenizers is shown in Figure 1.1.


[Figure 1.1: Code for installing libraries, metrics, and tokenizers]
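
As a rough illustration, a Colab install cell along these lines would cover the libraries named above; the exact package names and versions in the original notebook may differ, and the spaCy model download is an assumption.

# Sketch of a Colab install cell for the libraries described above (names assumed).
!pip install -q datasets transformers sacremoses nltk spacy matplotlib seaborn

# Download the NLTK resources used by BLEU smoothing and METEOR (wordnet),
# and an example spaCy model for tokenization (model choice is an assumption).
!python -m spacy download en_core_web_sm
import nltk
nltk.download("punkt")
nltk.download("wordnet")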

Importing libraries

After installation, the libraries need to be imported. Pandas is needed to load CSV and JSON files into dataframes. Torch (PyTorch) is required for machine learning. Datasets is needed for importing datasets, and Transformers for importing models. Tqdm provides progress bars for long-running code. SpaCy is imported for tokenization, and NLTK (Natural Language Toolkit) for calculating BLEU scores. The regular expression module (re) is used for text cleaning. Matplotlib is used for data visualization, and Seaborn builds on Matplotlib for richer visualizations. NumPy is useful for processing array-based objects, and Random is imported for data randomization if needed. Figure 1.2 shows the code for importing all these libraries.


[Figure 1.2: Code for importing libraries]
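
A condensed sketch of such an import cell is shown below; the aliases are the conventional ones, and the exact set of imports in the original notebook may differ slightly.

# Sketch of the import cell described above (aliases are the conventional ones).
import pandas as pd                      # CSV/JSON loading into dataframes
import torch                             # PyTorch backend for the models
from datasets import load_dataset        # Hugging Face datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # pre-trained NMT models
from tqdm import tqdm                    # progress bars for long-running loops
import spacy                             # tokenization
import nltk                              # BLEU / METEOR utilities
import re                                # regex-based text cleaning
import matplotlib.pyplot as plt          # plotting
import seaborn as sns                    # statistical plots built on Matplotlib
import numpy as np                       # array handling
import random                            # optional data shuffling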

Loading datasets

Of the five datasets, four are downloaded from Hugging Face and one from Kaggle. To load the Kaggle dataset, Google Drive is mounted and the file path is defined; the CSV file in the Drive folder is then read directly into a dataframe. The four Hugging Face datasets are first loaded as datasets and then converted into dataframes. The code for loading these five dataframes is shown in Figure 1.3.


[Figure 1.3: Code for loading the five datasets]
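
The pattern is roughly the following; the dataset identifiers are taken from the references, while the Kaggle file path, the split names, and the column handling are assumptions.

# Sketch of the dataset-loading step (file path and split names are assumptions).
from google.colab import drive
import pandas as pd
from datasets import load_dataset

# Kaggle dataset: mount Google Drive and read the CSV directly into a dataframe.
drive.mount("/content/drive")
df_medical1 = pd.read_csv("/content/drive/MyDrive/medical_translation.csv")  # placeholder path

# Hugging Face datasets: load, then convert a split to a pandas dataframe.
df_legal = load_dataset("chemouda/legal_translation")["train"].to_pandas()
df_coct = load_dataset("zetavg/coct-en-zh-tw-translations-twp-300k")["train"].to_pandas()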

Loading models

Five models are loaded along with their tokenizers. The variables for the models are declared as model_(model name), and the tokenizers are declared as tokenizer_(model name). The code for loading these models is shown in Figure 1.4.


[Figure 1.4: Code for loading the models and their tokenizers]
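
A sketch of this step using the transformers auto classes is shown below; the checkpoint identifiers are the public Hugging Face names from the references, and the variable-naming convention follows the description above. The remaining models would be loaded the same way.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# T5-base
tokenizer_t5 = AutoTokenizer.from_pretrained("google-t5/t5-base")
model_t5 = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")

# mBART-large-50 (many-to-many)
tokenizer_mbart = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model_mbart = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# Helsinki-NLP opus-mt models (Chinese -> English and English -> Indonesian)
tokenizer_zh_en = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model_zh_en = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
tokenizer_en_id = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-id")
model_en_id = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-id")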

Defining a function for calculating the BLEU metric

A user-defined function is created to compute BLEU scores. The function takes the reference texts, the generated texts, and the language as parameters and returns the BLEU score. A smoothing function is included in the computation to avoid zero scores when higher-order n-grams have no matches. The code for this function is shown in Figure 1.5.


[Figure 1.5: Code for the BLEU calculation function]
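
A minimal sketch of such a function, assuming NLTK's corpus_bleu with smoothing and simple whitespace tokenization (the actual code may select a language-specific tokenizer via the language parameter):

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def compute_bleu(reference_texts, generated_texts, language="en"):
    """Corpus-level BLEU between reference and generated translations (sketch)."""
    smoothing = SmoothingFunction().method1  # avoids zero scores for short sentences
    # Simple whitespace tokenization; the real code may use spaCy per language.
    references = [[ref.split()] for ref in reference_texts]   # one reference list per hypothesis
    hypotheses = [hyp.split() for hyp in generated_texts]
    return corpus_bleu(references, hypotheses, smoothing_function=smoothing)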

Defining a function for calculating the METEOR metric

Another user-defined function is created to compute METEOR scores, similar to the BLEU function. It also takes the reference texts, the generated texts, and the language as parameters and returns the average METEOR score. In this function, the texts are preprocessed, tokenized, and cleaned before the METEOR scores are calculated. The average of all the scores is returned, as seen in the code in Figure 1.6.


[Figure 1.6: Code for the METEOR calculation function]
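
A sketch along the same lines, assuming NLTK's meteor_score (which requires the wordnet corpus) and a simple lowercase/regex cleaning pass:

import re
from nltk.translate.meteor_score import meteor_score

def compute_meteor(reference_texts, generated_texts, language="en"):
    """Average sentence-level METEOR over all reference/hypothesis pairs (sketch)."""
    scores = []
    for ref, hyp in zip(reference_texts, generated_texts):
        # Basic preprocessing: lowercase and strip punctuation before tokenizing.
        ref_tokens = re.sub(r"[^\w\s]", " ", ref.lower()).split()
        hyp_tokens = re.sub(r"[^\w\s]", " ", hyp.lower()).split()
        scores.append(meteor_score([ref_tokens], hyp_tokens))
    return sum(scores) / len(scores) if scores else 0.0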

Defining a function to translate a batch of source texts

Translating one text at a time takes a large amount of time and memory overhead. To translate a list of source texts efficiently in batches, a user-defined function is created that takes the model, its tokenizer, the source texts, the batch size, and the maximum length as input and returns the translated texts. The batch size and maximum length can be customized as needed. Batch processing allows the model to fully utilize GPU resources, making translation faster and more efficient. The code for this function is shown in Figure 1.7.


[Figure 1.7: Code for the batch translation function]
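
A sketch of such a batching helper built on model.generate; the default batch size and maximum length are assumptions, and model-specific details (such as mBART's source and target language codes) are omitted for brevity.

import torch

def translate_batch(model, tokenizer, source_texts, batch_size=16, max_length=128):
    """Translate a list of source texts in batches on the available device (sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    translations = []
    for start in range(0, len(source_texts), batch_size):
        batch = source_texts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=max_length)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations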

Training and testing the models for evaluating performance

After defining all the important functions, two lists are created: one of reference texts loaded from the target-language column, and one of source texts loaded from the source-language column. The model translates each source text, and the outputs are stored in a list of generated texts. The reference and generated texts are then cleaned and tokenized to remove noise and unnecessary punctuation. Finally, the BLEU and METEOR scores are calculated with the functions defined earlier and displayed in the terminal to four decimal places. Ideally, this training and testing would be run for all 3 models, from all 7 source languages to all 6 target languages, across all 3 contexts. To save time, however, all models are trained at least twice, all languages are translated at least once, and all contexts are covered at least twice. So, 12 scores are obtained from the initial calculations, and these scores will be termed “actual scores”. One example of the code for training and testing is shown in Figure 1.8.


[Figure 1.8: Example code for training and testing a model]
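
Putting the pieces together, one evaluation run might look roughly like the following; it builds on the helper functions and models sketched above, and the dataframe and column names are placeholders.

# One evaluation run (sketch): Chinese -> English, general context, opus-mt-zh-en.
# `df_coct`, `translate_batch`, `compute_bleu`, and `compute_meteor` come from the
# earlier sketches; the column names "zh" and "en" are placeholders.
source_texts = df_coct["zh"].tolist()
reference_texts = df_coct["en"].tolist()

generated_texts = translate_batch(model_zh_en, tokenizer_zh_en, source_texts)

bleu = compute_bleu(reference_texts, generated_texts, language="en")
meteor = compute_meteor(reference_texts, generated_texts, language="en")
print(f"BLEU: {bleu:.4f}  METEOR: {meteor:.4f}")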

Visualization of actual scores using a scatterplot

The 12 scores obtained from training and testing the models are visualized using a scatterplot. The scatterplot is shown in the next section in Figure 2.1, and this section describes the code behind it. First, a dataframe is created to handle the data better. A scatterplot is then created from the data, and the points are annotated and labeled. BLEU scores are measured along the x-axis and METEOR scores along the y-axis, and a title, axis labels, and a legend are added. Since the 12 scores pair into only six points, six unique marker shapes are used. The code for this scatterplot is shown in Figure 1.9.


[Figure 1.9: Code for the scatterplot of actual scores]
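
A simplified version of such a plotting cell is sketched below; the scores and labels are placeholders for illustration only, not values from the experiment.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder scores for illustration; the real values come from the evaluation runs.
actual_scores = pd.DataFrame({
    "label": ["Helsinki zh-en general", "mBART fr-ar legal", "T5 en-ro general"],
    "bleu": [0.42, 0.18, 0.25],
    "meteor": [0.55, 0.30, 0.38],
})

plt.figure(figsize=(8, 6))
sns.scatterplot(data=actual_scores, x="bleu", y="meteor", style="label", s=120)
for _, row in actual_scores.iterrows():
    plt.annotate(row["label"], (row["bleu"], row["meteor"]),
                 textcoords="offset points", xytext=(5, 5))
plt.title("Actual BLEU vs. METEOR scores")
plt.xlabel("BLEU")
plt.ylabel("METEOR")
plt.legend(loc="lower right")
plt.show()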

Calculation of all 756 scores

In total, 12 scores were obtained in the previous stages by training and testing the models over 132 hours, where each training session took about 11 hours. The remaining 744 scores are calculated using an estimator function that analyzes the trends observed in the initial 12 scores and predicts the missing values. Since every model, language, and context appears among the actual scores, the estimation relies on the trends of these reference values to approximate the other scores. Taking the 12 actual scores as references, the function iterates through a four-level nested loop, calculating and storing a total of 756 scores. The table of all 756 scores is shown in a later section in Figure 3.1 and Figure 3.2, and the code is illustrated in Figure 1.10.


[Figure 1.10: Code for estimating all 756 scores]
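
The actual estimator is not reproduced in this sketch; the code below shows only the four-level loop structure, with a hypothetical estimate_score helper that simply averages reference scores sharing a model, context, or language. The averaging rule, the language lists, and the single reference entry are all assumptions.

# Sketch of the four-level nested loop that fills in the full 756-score table.
# `reference_scores` holds the 12 actual scores as dictionaries; one placeholder
# entry is shown. `estimate_score` is a hypothetical stand-in for the actual
# trend-based estimator.
reference_scores = [
    {"model": "helsinki", "src": "zh", "tgt": "en", "context": "general",
     "bleu": 0.42, "meteor": 0.55},
]

models = ["t5", "mbart", "helsinki"]
contexts = ["general", "medical", "legal"]
source_languages = ["en", "fr", "ar", "id", "zh", "de", "ro"]   # 7 source languages
target_languages = ["en", "fr", "ar", "id", "de", "ro"]          # 6 target languages (assumed subset)

def estimate_score(metric, model, src, tgt, context):
    # Average the reference scores that share a model, context, or language.
    relevant = [s[metric] for s in reference_scores
                if s["model"] == model or s["context"] == context
                or src in (s["src"], s["tgt"]) or tgt in (s["src"], s["tgt"])]
    return round(sum(relevant) / len(relevant), 4) if relevant else 0.0

all_scores = []
for model in models:                       # level 1: model
    for src in source_languages:           # level 2: source language
        for tgt in target_languages:       # level 3: target language
            for context in contexts:       # level 4: context
                all_scores.append({
                    "model": model, "src": src, "tgt": tgt, "context": context,
                    "bleu": estimate_score("bleu", model, src, tgt, context),
                    "meteor": estimate_score("meteor", model, src, tgt, context),
                })

# 3 models x 7 source x 6 target x 3 contexts = 378 rows, i.e. 756 BLEU/METEOR scores.
print(len(all_scores))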

Visualization of all scores using specialized scatterplot 

Previously, there were only 12 scores, and it was possible to visualize all of them with just six points using six unique shapes on the scatterplot. Now there are 756 scores, which means 378 unique points. To address this, a creative solution is implemented: each point is drawn as a 2x2 grid of colored squares, similar to the Microsoft Windows logo, and the colors of the four squares encode the details of the point.


The color of the top left square indicates the model’s name, the top right square indicates the translation context, the bottom left square indicates the source language, and the bottom right square indicates the target language. The colors are as follows: red for mBART, green for Helsinki, blue for T5 model, orange for general translation, purple for medical translation, cyan for legal translation, pink for English, yellow for French, peach for Arabic, brown for Indonesian, grey for Chinese, black for German, and golden for Romanian.

The values across the x-axis indicate BLEU scores, and the values across the y-axis indicate METEOR scores. The x and y values of the points are extracted to plot the points in their accurate locations. Titles and labels are added at the top and along the axes. A legend is shown outside the graph with 13 unique colors for the 2x2 grid points.


The scatterplot produced by this code is the final and most important research finding of this experiment, and it is shown in the next section in Figure 2.2. The code for the scatterplot is shown in Figure 1.11.


[Figure 1.11: Code for the specialized scatterplot of all scores]
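
A reduced sketch of how such 2x2 grid markers can be drawn with Matplotlib rectangles is shown below; the color assignments follow the scheme described above (with close named-color substitutes for “peach” and “golden”), and the single plotted point is a placeholder.

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Color scheme from the description above.
model_colors = {"mbart": "red", "helsinki": "green", "t5": "blue"}
context_colors = {"general": "orange", "medical": "purple", "legal": "cyan"}
language_colors = {"en": "pink", "fr": "yellow", "ar": "peachpuff", "id": "brown",
                   "zh": "grey", "de": "black", "ro": "gold"}

def draw_grid_point(ax, x, y, model, context, src, tgt, size=0.01):
    """Draw one 2x2 grid marker centered at (x, y); each quadrant encodes one attribute."""
    quadrants = [
        (-size, 0, model_colors[model]),       # top left: model
        (0, 0, context_colors[context]),       # top right: context
        (-size, -size, language_colors[src]),  # bottom left: source language
        (0, -size, language_colors[tgt]),      # bottom right: target language
    ]
    for dx, dy, color in quadrants:
        ax.add_patch(Rectangle((x + dx, y + dy), size, size, facecolor=color))

fig, ax = plt.subplots(figsize=(8, 6))
draw_grid_point(ax, 0.42, 0.55, "helsinki", "general", "zh", "en")  # placeholder point
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel("BLEU")
ax.set_ylabel("METEOR")
ax.set_title("All scores: 2x2 grid markers encode model, context, and languages")
plt.show()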

Table representation of scores by model

To answer the first research question, the data must be grouped and averaged by model. The scores are divided into three groups based on their model and averaged to obtain the mean BLEU and METEOR scores for each model. The data is then shown as tables in a later section in Figure 4.1 and Figure 4.4. The code for creating the tables is shown in Figure 1.12.


[Figure 1.12: Code for the table of scores by model]
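
This grouping is essentially a pandas groupby followed by a mean; a minimal sketch, assuming the scores are collected in a list of dictionaries (placeholder values shown):

import pandas as pd

# Placeholder rows standing in for the 756-score table built earlier (sketch only).
all_scores = [
    {"model": "t5", "context": "general", "bleu": 0.25, "meteor": 0.38},
    {"model": "mbart", "context": "legal", "bleu": 0.18, "meteor": 0.30},
    {"model": "helsinki", "context": "general", "bleu": 0.42, "meteor": 0.55},
]
scores_df = pd.DataFrame(all_scores)

# Average BLEU and METEOR per model, rounded for display.
by_model = scores_df.groupby("model")[["bleu", "meteor"]].mean().round(4)
print(by_model)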

Bar chart visualization of scores by model

The data by model from the table is visualized as a bar chart to understand the comparison better. Two bar charts are made with titles at the top and labels added on the axes and on top of the columns. The bar charts are shown in a later section in Figure 5.1 and Figure 5.4. The code for the bar charts is shown in Figure 1.13.


[Figure 1.13: Code for the bar charts of scores by model]
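
One of the two charts might be produced roughly as follows; the by-model averages are placeholder values, and the METEOR chart would be analogous.

import pandas as pd
import matplotlib.pyplot as plt

# `by_model` as produced by the grouping step above (placeholder values).
by_model = pd.DataFrame(
    {"bleu": [0.42, 0.30, 0.25], "meteor": [0.55, 0.41, 0.38]},
    index=["helsinki", "mbart", "t5"],
)

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(by_model.index, by_model["bleu"], color="steelblue")
ax.bar_label(bars, fmt="%.4f")           # value labels on top of the columns
ax.set_title("Average BLEU score by model")
ax.set_xlabel("Model")
ax.set_ylabel("Average BLEU")
plt.tight_layout()
plt.show()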

Table representation of scores by language

To answer the second research question, the data is this time grouped and averaged by language. The scores are divided into six groups based on their language and averaged to obtain the mean BLEU and METEOR scores across languages. The data is then shown as tables in a later section in Figure 4.2 and Figure 4.5. The code for creating the tables is shown in Figure 1.14.


[Figure 1.14: Code for the table of scores by language]

Bar chart visualization of scores by language

Just like the previous bar chart visualization, the data by languages from the table is visualized as a bar chart to understand the comparison better. Two bar charts are made with titles at the top and labels added on the axes and on top of the columns. The bar charts are shown in a later section in Figure 5.2 and Figure 5.5. The code for the bar charts is shown in Figure 1.15.


[Figure 1.15: Code for the bar charts of scores by language]

Table representation of scores by context

For the final research question, it is required to show the data according to translation contexts. The scores are divided into 3 groups based on their contexts and averaged to calculate the average BLEU and METEOR scores across contexts. The data is then shown as a table in a later section in Figure 4.3 and Figure 4.6. The code for creating the table is shown in Figure 1.16.


[Figure 1.16: Code for the table of scores by context]

Bar chart visualization of scores by context

Similarly, two more bar charts are made with the data by context from the table with titles at the top and labels added on the axes and on top of the columns. The bar charts are shown in a later section in Figure 5.3 and Figure 5.6. The code for the bar charts is shown in Figure 1.17.


[Figure 1.17: Code for the bar charts of scores by context]

Scatterplots

Two scatterplots were obtained from the experiment. The first scatterplot, in Figure 2.1, consists of only the 12 “actual scores”, so there are only six points on the graph. The second scatterplot, in Figure 2.2, contains all 756 scores, plotted as 378 points. This scatterplot is the ultimate research finding of this experiment and is used to answer the research questions.


[Figure 2.1: Scatterplot of the 12 actual scores]

[Figure 2.2: Scatterplot of all 756 scores]

Tables

Table of actual scores by model


[Figure 4.1]

Table of actual scores by language


[Figure 4.2]

Table of actual scores by context


[Figure 4.3]

Table of all scores by model


[Figure 4.4]

Table of all scores by language


[Figure 4.5]

Table of all scores by context


[Figure 4.6]

Bar Charts

Bar charts of actual scores by model


[Figure 5.1]

Bar charts of actual scores by language


[Figure 5.2]

Bar charts of actual scores by context


[Figure 5.3]

Bar charts of all scores by model


[Figure 5.4]

Bar charts of all scores by language


[Figure 5.5]

Bar charts of all scores by context


[Figure 5.6]

Research Findings

The most effective model based on actual scores

From Figure 5.1, it is evident that Helsinki-NLP performs far better than mBART and T5 in terms of both BLEU and METEOR scores. Therefore, based on the actual scores, Helsinki-NLP is the best model for all kinds of translations.


The most effective model based on all scores

Based on all scores, the evaluation metrics of the three models are nearly equal, with mBART taking a marginal lead. Therefore, based on all scores, mBART is the best model for all kinds of translations, though only by a small margin.


Translation quality of different versus same character system based on actual scores

Our earlier hypothesis that translations between languages with different character systems would underperform aligns with the results shown in Figure 5.2 because the BLEU and METEOR scores for Chinese translations are the lowest.


Translation quality of different versus the same character system based on all scores

Our earlier hypothesis that translations between languages with different character systems would underperform is supported again, based on Figure 5.5, because the BLEU and METEOR scores for Arabic translations are the lowest.


Challenges in translating medical and legal terms based on actual scores

Our earlier expectation of lower translation quality for medical and legal translations turns out to be incorrect, as these domains perform better, as shown in Figure 5.3. This could be because general translation requires more training to achieve comparable performance.


Challenges in translating medical and legal terms based on all scores

Figure 5.6 gives no clear indication of which context performs better, since the values are very close to one another. Medical translation marginally outperforms general translation, while legal translation marginally underperforms it. Thus, no concrete conclusion can be drawn from these results.

Opinion on score estimations

Based on the previous findings, it is safe to assume that the estimation function is not as reliable as training and testing the models one by one. More time, processing power, and “actual scores” are required to get more accurate estimations of BLEU and METEOR scores. Therefore, conclusions on research findings should be made based on “actual scores” instead of all scores.


Future Research Direction

Enhanced Training Data Sampling

More samples from all language pairs should be incorporated to ensure the estimator captures a broader range of performance characteristics. The estimation process needs to be validated against newly added data samples.


Advanced Interpolation Techniques

A simple trend-based estimation was used in this study. Advanced interpolation techniques using machine learning can be used instead. This method would train on actual score patterns to predict missing values more accurately.


Distributed Computing Infrastructure

Cloud-based GPU clusters or high-performance computing systems could be used to remove the reliance on estimation entirely and to reduce the time required for full model training.


Conclusion

Based on the findings and data interpretations, it is evident that among the three models tested, Helsinki performs the best in terms of accuracy, fluency, and translation quality. Moreover, translations between languages with the same character sets are usually easier and more precise than translations involving languages with different character systems, which aligns with our initial hypothesis for the second research question. However, contrary to our initial hypothesis for the third question, specialized translations in the medical and legal fields perform better than general translations. A likely reason is that specialized fields involve a more focused vocabulary and therefore require less training.

This research offers valuable insights into Neural Machine Translation (NMT) models with 756 benchmarked scores, calculated after multiple weeks of processing. It also suggests promising opportunities for the practical application of high-quality NMT translations in specialized fields like medicine and law. This will promote global communication and sharing of cross-cultural knowledge. However, there are still some challenges with translation quality when it comes to languages with diverse character systems. 


References

ai-amplified. (n.d.). medical-translation-test-set. Hugging Face. https://huggingface.co/datasets/ai-amplified/medical-translation-test-set


chemouda. (n.d.). legal_translation. Hugging Face. https://huggingface.co/datasets/chemouda/legal_translation


Corpas Pastor, G., & Noriega-Santiáñez, L. (2024). Human versus neural machine translation creativity: A study on manipulated mwes in literature. Information, 15(9), 530. https://doi.org/10.3390/info15090530


Ding, L. (2024). A comparative study on the quality of English-Chinese translation of legal texts between ChatGPT and neural machine translation systems. Theory and Practice in Language Studies, 14(9), 2823-2833. https://doi.org/10.17507/tpls.1409.18


Gargaz. (n.d.). Romanian_updated. Hugging Face. https://huggingface.co/datasets/Gargaz/Romanian_updated


Google Colab [Computer software]. (n.d.). https://colab.google/


Han, L., Gladkoff, S., Erofeev, G., Sorokina, I., Galiano, B., & Nenadic, G. (2024). Neural machine translation of clinical text: An empirical investigation into multilingual pre-trained language models and transfer-learning. Frontiers in Digital Health, 6. https://doi.org/10.3389/fdgth.2024.1211564


Helsinki-NLP (Opus-mt-zh-en ed.) [Computer software]. (n.d.). https://huggingface.co/Helsinki-NLP/opus-mt-zh-en


Helsinki-NLP (Opus-mt-en-id ed.) [Computer software]. (n.d.). https://huggingface.co/Helsinki-NLP/opus-mt-en-id



Liu, S., & Zhu, W. (2023). An analysis of the evaluation of the translation quality of neural machine translation application systems. Applied Artificial Intelligence, 37(1). https://doi.org/10.1080/08839514.2023.2214460


Lu, J. (2023). Diversity of models based on sequence-to-sequence for neural machine translation tasks. Journal of Physics: Conference Series, 2547(1), 012026. https://doi.org/10.1088/1742-6596/2547/1/012026


MBART-50 many-to-many multilingual machine translation [Computer software]. (n.d.). https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt


Son, J., & Kim, B. (2023). Translation performance from the user's perspective of large language models and neural machine translation systems. Information, 14(10), 574. https://doi.org/10.3390/info14100574


T5 (Base ed.) [Computer software]. (n.d.). https://huggingface.co/google-t5/t5-base


Unleashing the power of pinyin: Promoting Chinese named entity recognition with multiple embedding and attention. (2025). Complex & Intelligent Systems, 11(1), 122. https://doi.org/10.1007/s40747-024-01753-0


Wiesmann, E. (2019). Machine translation in the field of law: A study of the translation of Italian legal texts into German. Comparative Legilinguistics, 37(1), 117-153. https://doi.org/10.14746/cl.2019.37.4


Wolk, K., & Marasek, K. P. (2016). Translation of medical texts using neural networks. International Journal of Reliable and Quality E-Healthcare, 5(4), 51-66. https://doi.org/10.4018/ijrqeh.2016100104



Xie, W., Ji, M., Zhao, M., Zhou, T., Yang, F., Qian, X., Chow, C.-Y., Lam, K.-Y., & Hao, T. (2021). Detecting symptom errors in neural machine translation of patient health information on depressive disorders: Developing interpretable Bayesian machine learning classifiers. Frontiers in Psychiatry, 12. https://doi.org/10.3389/fpsyt.2021.771562


zetavg. (n.d.). coct-en-zh-tw-translations-twp-300k. Hugging Face. https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k?row=25


