Big Data

Amazon Reviews - Big Data ETL and ML

I used two large data sets from the Amazon customer reviews collection. The data were extracted, transformed, and loaded into Amazon Relational Database Service (RDS). A machine learning model was also created to predict user ratings from the reviews.

The two data sets are:
  • amazon_reviews_us_Toys_v1_00.tsv (4,864,249 reviews)
  • amazon_reviews_us_Watches_v1_00.tsv (960,872 reviews)


Part 1 (ETL)
Both raw data sets were initially saved in the AWS cloud environment. Data cleaning was done in Google Colab. The data sets were transformed into two PostgreSQL databases and uploaded to Amazon RDS.
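
The extract, transform, and load steps described above could look roughly like the following PySpark sketch. It is not the project's actual code: the column names follow the public Amazon reviews TSV layout as I understand it, and the host, database, table, credentials, and JDBC driver version are placeholders to be replaced with real values.

```python
# Sketch only: every host, database, table, and credential value is a placeholder.

def jdbc_url(host, port, database):
    """Build a PostgreSQL JDBC URL for an RDS endpoint."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def run_etl(source_path, url, table, user, password):
    """Read a review TSV, clean it, append it to a PostgreSQL table, return the row count."""
    # Imported lazily so jdbc_url() can be used without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (
        SparkSession.builder.appName("amazon-reviews-etl")
        # Assumed driver version; use one that matches the RDS PostgreSQL instance.
        .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
        .getOrCreate()
    )

    # Extract: the review files are tab-separated with a header row.
    raw = spark.read.csv(source_path, sep="\t", header=True, inferSchema=True)

    # Transform: keep a few columns (assumed names), drop incomplete and duplicate rows.
    reviews = (
        raw.select("review_id", "customer_id", "product_id", "star_rating", "review_body")
        .withColumn("star_rating", col("star_rating").cast("int"))
        .dropna(subset=["review_id", "star_rating", "review_body"])
        .dropDuplicates(["review_id"])
    )

    # Load: append the cleaned rows to the target table over JDBC.
    (
        reviews.write.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )
    return reviews.count()
```

A call might look like run_etl(path, jdbc_url("<rds-endpoint>", 5432, "<database>"), "<table>", user, password), with the credentials read from the environment rather than written into the notebook.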

Part 2 (ML)
The second data set (Watches) was analyzed using PySpark, and a machine learning model was created to predict the user rating from the reviews. The following results were obtained:
No. of reviews: 960,680
Mean rating (out of 5): 4.14
Standard deviation: 1.29
Minimum: 1
Maximum: 5
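
Summary statistics like these, and a Naive Bayes rating classifier, can be produced in PySpark along the following lines. This is a minimal sketch under assumptions, not the project's code: the star_rating and review_body column names, the TF-IDF feature steps, the 80/20 split, and the seed are all illustrative choices that the write-up does not specify.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to a 0-4 class label (works on ints and Spark Columns)."""
    return stars - 1


def label_to_rating(label):
    """Map a 0-4 class label back to a 1-5 star rating."""
    return label + 1


def summarize_ratings(reviews):
    """Return one Row with count, mean, sample standard deviation, min, and max."""
    from pyspark.sql import functions as F  # lazy import: requires PySpark

    rating = F.col("star_rating").cast("int")
    return reviews.agg(
        F.count(rating).alias("n_reviews"),
        F.mean(rating).alias("mean"),
        F.stddev(rating).alias("stddev"),
        F.min(rating).alias("min"),
        F.max(rating).alias("max"),
    ).first()


def train_rating_model(reviews, seed=42):
    """Fit a TF-IDF + multinomial Naive Bayes pipeline; return (model, test accuracy)."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import HashingTF, IDF, StopWordsRemover, Tokenizer
    from pyspark.sql import functions as F

    # Build a 0-based numeric label and drop rows that cannot be used for training.
    labeled = reviews.withColumn(
        "label", rating_to_label(F.col("star_rating").cast("int")).cast("double")
    ).dropna(subset=["label", "review_body"])
    train, test = labeled.randomSplit([0.8, 0.2], seed=seed)

    # Assumed feature pipeline: tokenize, remove stop words, hash to TF, weight by IDF.
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="review_body", outputCol="tokens"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        HashingTF(inputCol="filtered", outputCol="raw_features"),
        IDF(inputCol="raw_features", outputCol="features"),
        NaiveBayes(featuresCol="features", labelCol="label", modelType="multinomial"),
    ])
    model = pipeline.fit(train)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return model, evaluator.evaluate(model.transform(test))
```

Predictions come back as 0-4 class labels, so label_to_rating converts them to stars when reporting results.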


Used: PySpark, Colab, ETL, Cloud Services (AWS), PostgreSQL, NaiveBayes
GitHub: https://github.com/lumindak/big-data-challenge