I used two large data sets from the Amazon customer review data. The data were extracted, transformed,
and loaded into Amazon Relational Database Service (RDS). A machine learning model was also created to
predict user ratings from the reviews.
Part 1 (ETL)
Both raw data sets were initially saved in the AWS cloud environment. Data cleaning
was done in Google Colab. The data sets were transformed into two PostgreSQL databases
and uploaded to Amazon RDS.
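
The load step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes the cleaned data is held in a PySpark DataFrame and written over JDBC, and the helper names (`build_jdbc_config`, `write_table`), the database name `reviews_db`, the user name, and the placeholder endpoint and password are all hypothetical. `DataFrameWriter.jdbc` is the PySpark call that performs the write.

```python
def build_jdbc_config(host, database, user, password, port=5432):
    """Return (url, properties) for a PostgreSQL JDBC connection.

    All argument values are supplied by the caller; nothing here is
    taken from the original project.
    """
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # standard PostgreSQL JDBC driver class
    }
    return url, properties


def write_table(df, table, url, properties, mode="append"):
    """Write a PySpark DataFrame to a PostgreSQL table on RDS.

    `df` is assumed to be a pyspark.sql.DataFrame; the PostgreSQL JDBC
    driver must be available to the Spark session for this call to succeed.
    """
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


# Placeholder values (hypothetical): replace with the real RDS endpoint and credentials.
url, props = build_jdbc_config(
    host="<rds-endpoint>",
    database="reviews_db",
    user="postgres",
    password="<password>",
)
print(url)
```

With a real DataFrame, a call such as `write_table(cleaned_df, "reviews", url, props)` would append the rows to a table named `reviews`; both of those names are illustrative only.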
Part 2 (ML)
The second data set was analyzed using PySpark, and a machine learning model was created
to predict the user rating from the reviews. The following results were obtained:
No. of reviews: 960680
Mean value of rating (out of 5): 4.14
Standard deviation: 1.29
Minimum: 1
Maximum: 5
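
For reference, the same five statistics can be computed on a small sample in plain Python. This is only a sketch of what the numbers mean: the ratings below are made up, `summarize` is a hypothetical helper, and the project itself used PySpark (where `DataFrame.describe()` reports count, mean, stddev, min, and max). The sketch uses the sample standard deviation, which is also what PySpark's `stddev` function computes.

```python
import statistics


def summarize(ratings):
    """Return count, mean, sample standard deviation, min, and max of ratings."""
    return {
        "count": len(ratings),
        "mean": round(statistics.mean(ratings), 2),
        "stddev": round(statistics.stdev(ratings), 2),  # sample (n - 1) standard deviation
        "min": min(ratings),
        "max": max(ratings),
    }


# Made-up ratings for illustration only; not taken from the Amazon data.
sample_ratings = [5, 4, 5, 1, 3, 5, 4, 5]
summary = summarize(sample_ratings)
print(summary)
```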