---
annotations_creators:
- expert-generated
- machine-generated
language_creators:
- machine-generated
languages:
- en
- th
licenses:
- cc-by-sa-4.0
multilinguality:
- translation
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- conditional-text-generation
- text-classification
task_ids:
- machine-translation
- multi-class-classification
- semantic-similarity-classification
---

# Dataset Card for `generated_reviews_enth`

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://airesearch.in.th/
- **Repository:** https://github.com/vistec-ai/generated_reviews_enth
- **Paper:** https://arxiv.org/pdf/2007.03541.pdf
- **Leaderboard:**
- **Point of Contact:** [AIResearch](http://airesearch.in.th/)

### Dataset Summary

`generated_reviews_enth` was created as part of [scb-mt-en-th-2020](https://arxiv.org/pdf/2007.03541.pdf) for the machine translation task. This dataset (referred to as `generated_reviews_yn` in [scb-mt-en-th-2020](https://arxiv.org/pdf/2007.03541.pdf)) consists of English product reviews generated by [CTRL](https://arxiv.org/abs/1909.05858), translated into Thai by the Google Translate API, and annotated by human annotators as accepted or rejected (`correct`) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

### Supported Tasks and Leaderboards

English-to-Thai translation quality estimation (binary label) is the intended use. Other uses include machine translation and sentiment analysis.

### Languages

English, Thai

## Dataset Structure

### Data Instances

```
{'correct': 0, 'review_star': 4, 'translation': {'en': "I had a hard time finding a case for my new LG Lucid 2 but finally found this one on amazon. The colors are really pretty and it works just as well as, if not better than the otterbox. Hopefully there will be more available by next Xmas season. Overall, very cute case. I love cheetah's. :)", 'th': 'ฉันมีปัญหาในการหาเคสสำหรับ LG Lucid 2 ใหม่ของฉัน แต่ในที่สุดก็พบเคสนี้ใน Amazon สีสวยมากและใช้งานได้ดีเช่นเดียวกับถ้าไม่ดีกว่านาก หวังว่าจะมีให้มากขึ้นในช่วงเทศกาลคริสต์มาสหน้า โดยรวมแล้วน่ารักมาก ๆ ฉันรักเสือชีตาห์ :)'}}
{'correct': 0, 'review_star': 1, 'translation': {'en': "This is the second battery charger I bought as a Christmas present, that came from Amazon, after one purchased before for my son. His was still working. The first charger, received in July, broke apart and wouldn't charge anymore. Just found out two days ago they discontinued it without warning. It took quite some time to find the exact replacement charger. Too bad, really liked it. One of these days, will purchase an actual Nikon product, or go back to buying batteries.", 'th': 'นี่เป็นเครื่องชาร์จแบตเตอรี่ก้อนที่สองที่ฉันซื้อเป็นของขวัญคริสต์มาสซึ่งมาจากอเมซอนหลังจากที่ซื้อมาเพื่อลูกชายของฉัน เขายังทำงานอยู่ เครื่องชาร์จแรกที่ได้รับในเดือนกรกฎาคมแตกเป็นชิ้น ๆ และจะไม่ชาร์จอีกต่อไป เพิ่งค้นพบเมื่อสองวันก่อนพวกเขาหยุดมันโดยไม่มีการเตือนล่วงหน้า ใช้เวลาพอสมควรในการหาที่ชาร์จที่ถูกต้อง แย่มากชอบมาก สักวันหนึ่งจะซื้อผลิตภัณฑ์ Nikon จริงหรือกลับไปซื้อแบตเตอรี่'}}
{'correct': 1, 'review_star': 1, 'translation': {'en': 'I loved the idea of having a portable computer to share pictures with family and friends on my big screen. It worked really well for about 3 days, then when i opened it one evening there was water inside where all the wires came out. I cleaned that up and put some tape over that, so far, no leaks. My husband just told me yesterday, however, that this thing is trash.', 'th': 'ฉันชอบไอเดียที่มีคอมพิวเตอร์พกพาเพื่อแชร์รูปภาพกับครอบครัวและเพื่อน ๆ บนหน้าจอขนาดใหญ่ของฉัน มันใช้งานได้ดีจริง ๆ ประมาณ 3 วันจากนั้นเมื่อฉันเปิดมันในเย็นวันหนึ่งมีน้ำอยู่ภายในที่ซึ่งสายไฟทั้งหมดออกมา ฉันทำความสะอาดมันแล้ววางเทปไว้ที่นั่นจนถึงตอนนี้ไม่มีรอยรั่ว สามีของฉันเพิ่งบอกฉันเมื่อวานนี้ว่าสิ่งนี้เป็นขยะ'}}
```

### Data Fields

- `translation`: 
  - `en`: English product reviews generated by [CTRL](https://arxiv.org/abs/1909.05858)
  - `th`: Thai product reviews translated from `en` by Google Translate API
- `review_star`: Star rating (1 to 5) of the generated review, given to [CTRL](https://arxiv.org/abs/1909.05858) as the generation condition
- `correct`: 1 if human annotators accepted the English-to-Thai translation based on its fluency and adequacy, otherwise 0
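
As a minimal sketch of how these fields might be consumed, the snippet below flattens one record (shaped like those in Data Instances) into a source/target/label tuple for quality estimation. The names `QEExample` and `to_qe_example` and the abridged sample record are illustrative assumptions, not part of the dataset's tooling.

```python
from typing import NamedTuple


class QEExample(NamedTuple):
    """One quality estimation example (hypothetical container)."""
    source: str  # English review, from `translation.en`
    target: str  # Thai translation, from `translation.th`
    stars: int   # `review_star`, the condition given to CTRL
    label: int   # `correct`: 1 = accepted, 0 = rejected


def to_qe_example(record: dict) -> QEExample:
    """Flatten the nested `translation` dict into a flat tuple."""
    pair = record["translation"]
    return QEExample(
        source=pair["en"],
        target=pair["th"],
        stars=int(record["review_star"]),
        label=int(record["correct"]),
    )


# Abridged, illustrative record with the same shape as the dataset's instances.
sample = {
    "correct": 1,
    "review_star": 1,
    "translation": {"en": "This thing is trash.", "th": "สิ่งนี้เป็นขยะ"},
}
example = to_qe_example(sample)
print(example.label, example.stars, example.source)
```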

### Data Splits

|                 | train  | valid | test  |
|-----------------|--------|-------|-------|
| # samples       | 141369 | 15708 | 17453 |
| # correct:0     | 99296  | 10936 | 12208 |
| # correct:1     | 42073  | 4772  | 5245  |
| # review_star:1 | 50418  | 5628  | 6225  |
| # review_star:2 | 22876  | 2596  | 2852  |
| # review_star:3 | 22825  | 2521  | 2831  |
| # review_star:4 | 22671  | 2517  | 2778  |
| # review_star:5 | 22579  | 2446  | 2767  |
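
The label counts in the table can be sanity-checked with a few lines of arithmetic. The sketch below copies the `correct` counts from the table, confirms that they add up to each split's size, and prints the share of accepted translations. The names `SPLITS` and `accepted_share` are made up for this illustration.

```python
# Counts copied from the Data Splits table.
SPLITS = {
    "train": {"samples": 141369, "correct_0": 99296, "correct_1": 42073},
    "valid": {"samples": 15708, "correct_0": 10936, "correct_1": 4772},
    "test": {"samples": 17453, "correct_0": 12208, "correct_1": 5245},
}


def accepted_share(counts: dict) -> float:
    """Fraction of pairs labelled `correct` = 1 in one split."""
    return counts["correct_1"] / counts["samples"]


for name, counts in SPLITS.items():
    # The two label counts should add up to the split size.
    assert counts["correct_0"] + counts["correct_1"] == counts["samples"]
    print(f"{name}: {accepted_share(counts):.1%} accepted")
```

Roughly 30% of the pairs in each split are labelled as accepted, so a quality estimation model should be compared against a majority-class baseline rather than judged on raw accuracy alone.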

## Dataset Creation

### Curation Rationale

`generated_reviews_enth` was created as part of [scb-mt-en-th-2020](https://arxiv.org/pdf/2007.03541.pdf) for the machine translation task. This dataset (referred to as `generated_reviews_yn` in [scb-mt-en-th-2020](https://arxiv.org/pdf/2007.03541.pdf)) consists of English product reviews generated by [CTRL](https://arxiv.org/abs/1909.05858), translated into Thai by the Google Translate API, and annotated by human annotators as accepted or rejected (`correct`) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

### Source Data

#### Initial Data Collection and Normalization

The data generation process is as follows:
- `en` is generated using conditional generation with [CTRL](https://arxiv.org/abs/1909.05858), specifying a star rating for each generated product review.
- `th` is translated from `en` using Google Translate API.
- `correct` is annotated as accepted or rejected (1 or 0) based on fluency and adequacy of the translation by human annotators

For this dataset, which targets the translation quality estimation task, we apply the following preprocessing:
- Drop duplicates on `en`, `th`, `review_star`, `correct`; duplicates might exist because the translation checking is done by annotators.
- Remove reviews whose star rating is not between 1 and 5.
- Remove reviews whose `correct` value is not 0 or 1.
- Deduplicate on `en` which contains the source sentences.
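
The preprocessing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' actual script: rows are assumed to be flat dicts with keys `en`, `th`, `review_star`, and `correct`, the function name `preprocess` is hypothetical, and keeping the first occurrence when deduplicating on `en` is a guess, since the card does not say which copy is kept.

```python
from typing import Dict, List


def preprocess(rows: List[Dict]) -> List[Dict]:
    """Apply the four filtering steps to flat rows, preserving input order."""
    seen_rows = set()  # full (en, th, review_star, correct) tuples already seen
    seen_en = set()    # English source texts already kept
    kept = []
    for row in rows:
        key = (row["en"], row["th"], row["review_star"], row["correct"])
        # Step 1: drop exact duplicates on all four fields.
        if key in seen_rows:
            continue
        seen_rows.add(key)
        # Step 2: remove reviews whose star rating is outside 1-5.
        if row["review_star"] not in {1, 2, 3, 4, 5}:
            continue
        # Step 3: remove reviews whose label is not 0 or 1.
        if row["correct"] not in {0, 1}:
            continue
        # Step 4: deduplicate on the source text (assumption: keep the first).
        if row["en"] in seen_en:
            continue
        seen_en.add(row["en"])
        kept.append(row)
    return kept


# Tiny made-up input covering each filtering rule.
rows = [
    {"en": "good", "th": "ดี", "review_star": 5, "correct": 1},
    {"en": "good", "th": "ดี", "review_star": 5, "correct": 1},     # exact duplicate
    {"en": "good", "th": "ดีมาก", "review_star": 5, "correct": 0},  # same source text
    {"en": "bad", "th": "แย่", "review_star": 0, "correct": 1},     # invalid star rating
    {"en": "ok", "th": "โอเค", "review_star": 3, "correct": 2},     # invalid label
    {"en": "fine", "th": "ดี", "review_star": 4, "correct": 0},
]
clean = preprocess(rows)
print([r["en"] for r in clean])
```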

#### Who are the source language producers?

[CTRL](https://arxiv.org/abs/1909.05858)

### Annotations

#### Annotation process

Annotators are given English and Thai product review pairs. They are asked to label each pair as an acceptable translation or not, based on the fluency and adequacy of the translation.

#### Who are the annotators?

Human annotators of [Hope Data Annotations](https://www.hopedata.org/) hired by [AIResearch.in.th](http://airesearch.in.th/)

### Personal and Sensitive Information

The authors do not expect any personal or sensitive information to be in the generated product reviews, but such information could slip through from the pretraining data of [CTRL](https://arxiv.org/abs/1909.05858).

## Considerations for Using the Data

### Social Impact of Dataset

- English-Thai translation quality estimation for machine translation
- Product review classification for Thai

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

Due to annotation process constraints, the number of one-star reviews is notably higher than the number of reviews with any other star rating. This makes the dataset slightly imbalanced.
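
To quantify this imbalance, the sketch below takes the training-split star counts from the Data Splits table (in row order, stars 1 to 5) and derives inverse-frequency class weights. Class weighting is one common way to compensate when `review_star` is used as a classification label; it is not something the dataset authors prescribe, and the names `TRAIN_STAR_COUNTS` and `inverse_frequency_weights` are hypothetical.

```python
# Training-split counts per star rating, copied from the Data Splits table
# in row order (stars 1 to 5).
TRAIN_STAR_COUNTS = {1: 50418, 2: 22876, 3: 22825, 4: 22671, 5: 22579}


def inverse_frequency_weights(counts: dict) -> dict:
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {star: total / (n_classes * n) for star, n in counts.items()}


weights = inverse_frequency_weights(TRAIN_STAR_COUNTS)
total = sum(TRAIN_STAR_COUNTS.values())
for star, n in TRAIN_STAR_COUNTS.items():
    print(f"star {star}: share={n / total:.1%}, weight={weights[star]:.2f}")
```

With these counts, one-star reviews receive a weight below 1 and the other four ratings a weight above 1, which reflects the skew described above.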

## Additional Information

### Dataset Curators

The dataset was created by [AIResearch.in.th](http://airesearch.in.th/)

### Licensing Information

CC BY-SA 4.0

### Citation Information

```
@article{lowphansirikul2020scb,
  title={scb-mt-en-th-2020: A Large English-Thai Parallel Corpus},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  journal={arXiv preprint arXiv:2007.03541},
  year={2020}
}
```

### Contributions

Thanks to [@cstorm125](https://github.com/cstorm125) for adding this dataset.