backpackerhh / core-set.sql


@bmwilllee

bmwilllee commented Apr 2, 2017

Thanks, very helpful!


@EmiliaDariel

EmiliaDariel commented Jul 12, 2018

Can anyone please explain this: SELECT DISTINCT Re1.name, Re2.name FROM Rating R1, Rating R2, Reviewer Re1, Reviewer Re2 WHERE R1.mID = R2.mID AND R1.rID = Re1.rID AND R2.rID = Re2.rID AND Re1.name < Re2.name ORDER BY Re1.name, Re2.name;
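For readers with the same question, the query can be run on a tiny stand-in dataset. This is a minimal sketch using Python's built-in sqlite3 module; the three reviewers and four ratings are made up for illustration and are not the course data.

```python
import sqlite3

# Hypothetical miniature of the Reviewer and Rating tables; rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Reviewer (rID INTEGER, name TEXT);
    CREATE TABLE Rating (rID INTEGER, mID INTEGER, stars INTEGER);
    INSERT INTO Reviewer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Rating VALUES (1, 100, 4), (2, 100, 2), (3, 200, 5), (1, 200, 3);
""")

pairs = conn.execute("""
    SELECT DISTINCT Re1.name, Re2.name
    FROM Rating R1, Rating R2, Reviewer Re1, Reviewer Re2
    WHERE R1.mID = R2.mID        -- both ratings are for the same movie
      AND R1.rID = Re1.rID       -- attach a name to the first rating
      AND R2.rID = Re2.rID       -- attach a name to the second rating
      AND Re1.name < Re2.name    -- keep each pair once and drop self-pairs
    ORDER BY Re1.name, Re2.name
""").fetchall()

print(pairs)  # pairs of reviewers who rated the same movie
```

The self-join on Rating pairs up every two ratings of the same movie. The condition Re1.name < Re2.name keeps only one ordering of each pair, and DISTINCT removes pairs that would otherwise repeat when two reviewers share more than one movie.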

@safwans

safwans commented Jun 6, 2019 • edited

Is there a potential issue in #9 because you may have repeating rows due to the join? I wrote the query below and got a slightly different average:

select (select avg(ratings) from( select avg(r.stars) ratings from rating r, movie m where r.mid = m.mid and m.year < '1980' group by r.mid )) - (select avg(ratings) from( select avg(r.stars) ratings from rating r, movie m where r.mid = m.mid and m.year >= '1980' group by r.mid ))rat
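One way two such queries can disagree: averaging the per-movie averages is not the same as averaging all rating rows pooled together. The sketch below shows the gap on made-up data with sqlite3. Which of the two the gist's #9 uses is not shown in this thread, so treat this only as an illustration of the distinction.

```python
import sqlite3

# Hypothetical data: movie 1 has three ratings, movie 2 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (mID INTEGER, year INTEGER);
    CREATE TABLE Rating (mID INTEGER, stars INTEGER);
    INSERT INTO Movie VALUES (1, 1970), (2, 1975);
    INSERT INTO Rating VALUES (1, 5), (1, 5), (1, 5), (2, 2);
""")

# Average over every joined rating row: heavily rated movies weigh more.
pooled = conn.execute("""
    SELECT AVG(r.stars)
    FROM Rating r JOIN Movie m ON r.mID = m.mID
    WHERE m.year < 1980
""").fetchone()[0]

# Average of per-movie averages: every movie weighs the same.
per_movie = conn.execute("""
    SELECT AVG(movie_avg) FROM (
        SELECT AVG(r.stars) AS movie_avg
        FROM Rating r JOIN Movie m ON r.mID = m.mID
        WHERE m.year < 1980
        GROUP BY r.mID
    )
""").fetchone()[0]

print(pooled, per_movie)
```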

@femiaiyeku

femiaiyeku commented Jul 27, 2019

very helpful to prepare for sql interview

@AshwinAJa

AshwinAJa commented Sep 21, 2019

how to download data

@macso95

macso95 commented Feb 22, 2020 • edited

How would I do these ones?

For each movie, display the number of times it was reviewed and the average of the number of stars it received. List only the movies that were reviewed three or more times.

Use a correlated reference to find all reviews that have occurred on the same day by different reviewers. Display the reviewer ID and date of the review. Print out the. Order by rating date. You must use the word EXISTS within query.

@bartubozkurt

bartubozkurt commented May 1, 2020

How can I do these? - How many movies have been made each year? - How many actors are there in each movie? Thank you for the exercises.

@bikashghadai3

bikashghadai3 commented Jun 3, 2020

How do I find the ratings of 1 and 2 stars for the last 5 days in a week in a table?

@wahabmemo

wahabmemo commented Apr 21, 2021

You have to display an actor name who has worked in many films. [Use join, group by, order by]

ghost commented Jun 9, 2021

How can I do these? How many movies have been made each year? How many actors are there in each movie? Thank you for the exercises.

@Windsleeper

Windsleeper commented Jun 25, 2021

Thank you, this is very helpful!

@arsalh

arsalh commented Aug 27, 2021

For the average rating of movies before and after 1980 question (movies #9), can someone help me see what I am doing wrong in my query below? Instead of getting a result of 0.0555555555555558, mine comes to 0.05555555555555536. It is a small difference, but I would like to understand what I am doing wrong. Thank you so much!

SELECT distinct (SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year<1980)

(SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year>1980)

@AlMokgalaka

AlMokgalaka commented Apr 7, 2022

EmiliaDariel, have a look at this site, it might help you: http://linyishui.top/2019090601.html. Otherwise: SELECT DISTINCT returns only distinct values, so duplicate rows are dropped from the result. A column often contains many duplicate values, and sometimes we only want to list the different ones. The FROM clause lists the tables with aliases (R1, R2, Re1, Re2), and the WHERE and AND conditions join them on their common columns (the movie id, mID, and the reviewer id, rID). This is the older (theta) join style, where the join conditions sit in the WHERE clause, and the table aliases keep those chains of names short. Also check how the tables are named in your database: if they are called ratings and reviewers rather than Rating and Reviewer, the query handler will report that it has no such table. SQL keywords are case-insensitive, but whether table names are case-sensitive depends on the database and its configuration. See an example: SELECT

Orders.OrderID, Orders.CustomerID, Orders.EmployeeID, Orders.OrderDate, Orders.RequiredDate, Orders.ShippedDate, Orders.ShipVia, Orders.Freight, Orders.ShipName, Orders.ShipAddress, Orders.ShipCity, Orders.ShipRegion, Orders.ShipPostalCode, Orders.ShipCountry,

Customers.CompanyName, Customers.Address, Customers.City, Customers.Region, Customers.PostalCode, Customers.Country

FROM Customers

INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

Note: The table names need not be repeated unless the same column names exist in both tables. The table names are only required in the FROM, JOIN, and ON clauses, and in the latter, only because the relating column, CustomerID, has the same name in both tables.

The query syntax shown above follows ANSI (American National Standards Institute) rules and should work in the latest versions of all relational databases. Older syntax includes the join condition in the WHERE clause (theta style). Note the number of rows and columns in the result set for the Orders Query and try the same example (with fewer columns), using the older style and table aliases, as follows:

SELECT o.OrderID, o.EmployeeID, o.OrderDate, o.RequiredDate, o.ShippedDate, o.ShipVia, o.Freight, c.CompanyName, c.Address, c.City, c.Region, c.PostalCode, c.Country

FROM Customers c, Orders o WHERE c.CustomerID = o.CustomerID;

Note for MS Access users: Compare this query in design view with the ANSI style query. MS Access runs the query correctly but cannot represent it in the usual way in the graphical query interface.
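The claim that both join styles return the same rows can be checked on a small stand-in. This is a sketch with sqlite3; the two customers and three orders are invented, and only the CustomerID, CompanyName, and OrderID columns from the example above are kept.

```python
import sqlite3

# Hypothetical cut-down Customers and Orders tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID TEXT, CompanyName TEXT);
    CREATE TABLE Orders (OrderID INTEGER, CustomerID TEXT);
    INSERT INTO Customers VALUES ('A1', 'Acme'), ('B2', 'Bolt');
    INSERT INTO Orders VALUES (10, 'A1'), (11, 'A1'), (12, 'B2');
""")

# ANSI style: the join condition lives in the ON clause.
ansi_rows = conn.execute("""
    SELECT o.OrderID, c.CompanyName
    FROM Customers c INNER JOIN Orders o ON c.CustomerID = o.CustomerID
    ORDER BY o.OrderID
""").fetchall()

# Older style: the join condition lives in the WHERE clause.
older_rows = conn.execute("""
    SELECT o.OrderID, c.CompanyName
    FROM Customers c, Orders o
    WHERE c.CustomerID = o.CustomerID
    ORDER BY o.OrderID
""").fetchall()

print(ansi_rows == older_rows)
```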

@shnartho

shnartho commented Jun 22, 2022

Thanks a lot man

@Antimatterr

Antimatterr commented Sep 29, 2022

For the average rating of movies before and after 1980 question (movies #9), can someone help me what I am doing wrong in my query below? Instead of getting result of 0.0555555555555558, mine comes to 0.05555555555555536. Small difference but would like to understand what I am doing wrong. Thank you so much! SELECT distinct (SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year<1980) (SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year>1980) FROM Movie

you can use this: SELECT AVG(S1)-AVG(S2) FROM( SELECT AVG(STARS) S1 FROM MOVIE M,RATING R WHERE M.MID=R.MID and year<1980 GROUP BY M.MID ), (SELECT AVG(STARS) S2 FROM MOVIE M,RATING R WHERE M.MID=R.MID and year>1980 GROUP BY M.MID);
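A likely explanation for a gap that only shows up around the 16th digit (an assumption, the thread does not confirm it) is floating-point rounding: two queries that compute the same quantity with operations in a different order can round differently. The Python sketch below shows the effect with ordinary floats and checks how close the two quoted results are.

```python
import math

# The same three numbers, added in two different orders.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first == right_first)              # exact comparison
print(math.isclose(left_first, right_first))  # comparison within rounding

# The two results quoted in the question above.
expected = 0.0555555555555558
observed = 0.05555555555555536
same_within_rounding = math.isclose(expected, observed, rel_tol=1e-12)
print(same_within_rounding)
```

Separately, note that year>1980 excludes movies released in 1980 itself, while year>=1980 includes them. That would change the result by more than a rounding error, but only if the data contains a rated movie from 1980.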

@himanshu43215

himanshu43215 commented Mar 10, 2023

Perform the Following Operations and Capture the query plan before each query.

  • Write a query in SQL to list the Horror movies
  • Write a query in SQL to find the name of all reviewers who have rated 8 or more stars
  • Write a query in SQL to list all the information of the actors who played a role in the movie ‘Deliverance’.
  • Write a query in SQL to find the name of the director (first and last names) who directed a movie that casted a role for 'Eyes Wide Shut'. (using subquery)
  • Write a query in SQL to find the movie title, year, date of release, director and actor for those movies which reviewer is ‘Neal Wruck’
  • Write a query in SQL to find all the years which produced at least one movie and that received a rating of more than 4 stars.
  • Write a query in SQL to find the name of all movies who have rated their ratings with a NULL value
  • Write a query in SQL to find the name of movies who were directed by ‘David’
  • Write a query in SQL to list the first and last names of all the actors who were cast in the movie ‘Boogie Nights’, and the roles they played in that production.
  • Find the name of the actor who have worked in more than one movie.

Please help me to solve this.

IMDB-Data-Analysis-in-SQL

This project was carried out to answer a set of analytical questions and advise a movie production house on which set of actors, directors, and production houses would be the best fit for a super-hit commercial movie.

Table of Content (TOC)

  • Database Creation for the Project
  • Table Creation
  • Data Insertion

Data Analysis

  • EXECUTIVE SUMMARY AND RECOMMENDATIONS

1. Overview

This analysis is carried out to support RSVP Movies with a well-analyzed list of global stars to plan a movie for the global audience in 2022.

With this, we will be able to answer a set of analytical questions and advise RSVP Production House on which set of actors, directors, and production houses would be the best fit for a super-hit commercial movie.

IMDB Data Analysis in MySQL

RSVP Movies is an Indian film production company that has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

Why this Analysis?

The production company wants to plan its every move analytically, based on data, and has approached us for help with this new project.

We have been provided with the data of the movies that have been released in the past three years. Let’s analyze the data set and draw meaningful insights that can help them start their new project.

We will use SQL to analyze the given data and give recommendations to RSVP Movies based on the insights.

We will carry out the entire analytics process in four segments, where each segment leads to significant insights from different combinations of tables.

2. Database Creation for the Project

a. Check the List of Databases

  • The very first step of any MySQL analysis is to access the database and check if related data is available or not.
  • Use show databases; to access the list of databases:

b. Create Database

  • Create a new database for this project.
  • Use Create database IMDB;
  • Use show databases; to confirm the list of databases:

c. Use Database

  • Instruct the system to use *IMDB Database* by running use imdb;

3. Table Creation

Steps to follow before creating the tables:

  • Download the IMDb dataset and try to understand every table and its importance.
  • Understand the ERD and the table details. Study them carefully and understand the relationships between the tables.
  • Inspect each table given in the subsequent tabs and understand the features associated with each of them.
  • Draft your tables with the correct data types and constraints on paper or in a note file.
  • Open your MySQL Workbench and start writing the DDL and DML commands to create the database.

Create Table

For this project we need a total of 6 tables:

a. Create Table Movie

b. Create Table genre

c. Create Table director_mapping

| Table Name: director_mapping | Column Description |
| --- | --- |
| movie_id | Movie ID of the movie directed by a director |
| name_id | Name ID of the director |

d. Create Table role_mapping

e. Create Table names

f. Create Table ratings

Now run show tables; to ensure that all six tables are created.

4. Data Insertion

In the previous steps, we created six tables. Now, we will insert the data into these tables. Here, we show the syntax for inserting 5 rows into each table. (The complete data insertion syntax is available in the repository.)

a. Inserting data into Movie Table

b. Inserting data into Genre Table

c. Inserting data into Director_Mapping Table

d. Inserting data into Role_Mapping Table

e. Inserting data into Names Table

f. Inserting data into Ratings Table

Checking tables for inserted values:

Select * from Movie;

Select * from Genre;

Select * from Director_Mapping;

Select * from Role_Mapping;

Select * from Names;

Select * from Ratings;

All the sample data inserted looks good, so we can go ahead with the insertion of the complete data. For the insertion to work smoothly, let's first remove all data from the tables using TRUNCATE:

Insert Complete data

Run the command to insert complete data: IMDB File 3 Insert all data

1. Find the total number of rows in each table of the schema?

Alternative 1:

Number of Rows after ignoring the Null Rows

Alternative 2:

Rows count inclusive of Null Rows:

| Table name | Row count |
| --- | --- |
| director_mapping | 3867 |
| genre | 14662 |
| movie | 8519 |
| names | 23714 |
| ratings | 8230 |
| role_mapping | 15173 |

2. Which columns in the movie table have null values?

| id_null | title_null | year_null | date_null | duration_null | country_null | world_null | language_null | production_null |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 20 | 3724 | 194 | 528 |
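The difference between counting rows with and without nulls, and the per-column null counts above, can be sketched as follows. This uses Python's sqlite3 rather than MySQL, and a made-up three-row movie table with only three columns, so it is an illustration of the idea and not the project's actual query.

```python
import sqlite3

# Hypothetical three-row stand-in for the movie table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id TEXT, title TEXT, country TEXT);
    INSERT INTO movie VALUES
        ('m1', 'First', 'India'),
        ('m2', 'Second', NULL),
        ('m3', 'Third', NULL);
""")

total_rows, non_null_country, country_null = conn.execute("""
    SELECT COUNT(*),                                          -- every row
           COUNT(country),                                    -- skips NULLs
           SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END)   -- NULLs only
    FROM movie
""").fetchone()

print(total_rows, non_null_country, country_null)
```

COUNT(*) counts every row, COUNT(column) ignores rows where that column is NULL, and the SUM over a CASE expression counts exactly the NULL rows, which is one way to produce columns such as country_null.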

3.1. Find the total number of movies released each year?

Movies per year:

3.2. Find the total number of movies released each month?

Movies per month:

4.1 Find the count of Indian movies.

4.2 Find the count of movies from USA.

4.3 Find the count of movies which are either from India or USA.

4.4 Find the count of movies that are either from India or USA and released in 2019.

5. Find the unique list of the genres present in the data set.

6.1 Find the movies count for each genre.

6.2 Find the genre with the maximum number of movies.

6.3 Find the genre with minimum number of movies.

6.4 Find the top-3 genre with the maximum number of movies.

6.4 Find the movies count for Action genre.

6.5 Find the genre count for each movie.

6.6 Find the list of Indian movies that belong to 3 genres.

6.7 Longest Indian movie tagged with 3 genres.

‘tt6200656’, ‘Kammara Sambhavam’, ‘182’, ‘3’

6.8 Which genres are tagged with ‘Kammara Sambhavam’ movie.

| genre |
| --- |
| Action |
| Comedy |
| Drama |

7.1. How many movies belong to only one genre?

Create a list of Movies with a genre count
Restrict the list to Genre count = 1
Count the total number of rows
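The three steps above can be sketched as one query with a common table expression. The example runs in Python's sqlite3 on a made-up genre table with the movie_id and genre columns assumed from the results shown earlier; the project's own MySQL query is not reproduced here.

```python
import sqlite3

# Hypothetical genre table: m2 has two genres, m1 and m3 have one each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genre (movie_id TEXT, genre TEXT);
    INSERT INTO genre VALUES
        ('m1', 'Drama'),
        ('m2', 'Drama'), ('m2', 'Comedy'),
        ('m3', 'Thriller');
""")

single_genre_movies = conn.execute("""
    WITH genre_counts AS (              -- step 1: genre count per movie
        SELECT movie_id, COUNT(genre) AS genre_count
        FROM genre
        GROUP BY movie_id
    )
    SELECT COUNT(*)                     -- step 3: count the remaining rows
    FROM genre_counts
    WHERE genre_count = 1               -- step 2: keep single-genre movies
""").fetchone()[0]

print(single_genre_movies)
```

Changing the filter to genre_count = 2 or genre_count = 3 answers the follow-up questions in the same way.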

7.2. How many movies belong to two genres?

7.3. How many movies belong to three genres?

8.1. What is the average duration of movies in each genre?

8.2. Rank the genre by the average duration of movies in each genre.

9. What is the rank of the ‘Thriller’ genre of movies among all the genres in terms of the number of movies produced?

10. Find the minimum and maximum values in each column of the rating table except the movie_id column.

11. Which are the top 10 movies based on average rating?

12. Summarize the ratings table based on the movie counts by median ratings.

13. Which production house has produced the most number of hit movies (average rating > 8)?

Create list of production house with count of movies where average rating > 8 and Ranked over “Movies count”
Applied CTE to pull the production house with Rank = 1
NOTE: applied (production_company IS NOT NULL) as there are few movies without production house name

14. How many movies released in each genre during March 2017 in the USA had more than 1,000 votes?

15. Find movies of each genre that start with the word ‘The’ and which have an average rating > 8.

16. Of the movies released between 1 April 2018 and 1 April 2019, how many were given a median rating of 8?

17. Do German movies get more votes than Italian movies?

18. Which columns in the names table have null values?

19. Who are the top three directors in the top three genres whose movies have an average rating > 8?

Pull the Top three Genre by Movie count where avg_rating > 8

Pull the Directors with Movie count where avg_rating > 8

Keeping “top_3_genres” as CTE, restrict the 2nd code to avg_rating > 8 and directors of top_3_genre

Trying Row_Number() function:

20. Who are the top two actors whose movies have a median rating >= 8?

21. Which are the top three production houses based on the number of votes received by their movies?

22. Rank actors with movies released in India based on their average ratings. Which actor is at the top of the list?

Note: The actor should have acted in at least five Indian movies.

Alternative 1 (using the RANK window function):

Alternative 2 (using CTE):

23. Find out the top five actresses in Hindi movies released in India based on their average ratings.

Note: The actresses should have acted in at least three Indian movies.

24. Select thriller movies as per avg rating and classify them in the following category:

Rating > 8: Superhit movies
Rating between 7 and 8: Hit movies
Rating between 5 and 7: One-time-watch movies
Rating < 5: Flop movies
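The four categories map naturally onto a CASE expression. The sketch below runs in Python's sqlite3 on a hypothetical thriller_ratings table. The listed ranges overlap at exactly 7, so the example assumes that a rating of 7 counts as a hit and a rating of 5 as a one-time watch; the original query may draw those boundaries differently.

```python
import sqlite3

# Hypothetical table of thriller movies with their average ratings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thriller_ratings (title TEXT, avg_rating REAL);
    INSERT INTO thriller_ratings VALUES
        ('A', 8.4), ('B', 7.5), ('C', 7.0), ('D', 5.0), ('E', 4.9);
""")

rows = conn.execute("""
    SELECT title,
           CASE
               WHEN avg_rating > 8  THEN 'Superhit movies'
               WHEN avg_rating >= 7 THEN 'Hit movies'             -- assumed: 7 is a hit
               WHEN avg_rating >= 5 THEN 'One-time-watch movies'  -- assumed: 5 is included
               ELSE 'Flop movies'
           END AS category
    FROM thriller_ratings
    ORDER BY title
""").fetchall()

for title, category in rows:
    print(title, category)
```

CASE evaluates its branches top to bottom and stops at the first match, which is why each branch only needs a lower bound.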

EXECUTIVE SUMMARY AND RECOMMENDATIONS

1. Insights

Based on 7,997 movies released and recorded on IMDB between 2017 and 2019, a summary of audience interest and the resulting recommendations are given below:

  • Average Duration: 103.89359
  • Total number of Actors: 12611 (7445 actor & 5166 Actress)

1. Year and Month wise Movie Release Pattern:

  • A year-wise record of movies shows a decline from 3052 movies in 2017 to 2001 movies in 2019, roughly a third fewer.
  • The maximum number of movies were released in March, followed by September, October, and January. The fewest movies were released in mid-year and end-of-year months, possibly because more people prefer vacation and family time at those times of year.

2. Geographical Region Distribution

  • USA and India together produced 1059 movies in 2019 alone, just over half of the total movies released that year (2001).

3. Genre Popularity

  • Movies were tagged with genre tags as Drama, Fantasy, Thriller, Comedy, Horror, Family, Romance, Adventure, Action, Sci-Fi, Crime, and Mystery.
  • Drama is the most popular genre, with 4285 tags across the three years, followed by Comedy and Thriller.
  • There were 3289 movies with only one genre tag, while the remaining movies were tagged with multiple genres.

4. The average duration of movies is around 103.89359 minutes, and even the genre-wise averages revolve around the same figure.

5. Top Production Houses

  • Marvel Studios leads the production house category with 551245 votes received by the movies it has produced, followed by Syncopy and New Line Cinema.
  • Star Cinema and Twentieth Century Fox are the top two multilingual production houses based on the number of superhit movies.

6. Top Director

  • James Mangold has directed the most superhit movies, followed by Soubin Shahir, Joe Russo, and Anthony Russo.
  • A.L. Vijay, Andrew Jones, and Chris Stokes are the top directors based on number of movies.

7. Top Actors and Actresses

  • Mammootty, with 8 superhit movies, is the most successful actor, followed by Mohanlal with 5 superhits.
  • Quite a few actors have 4 superhit movies to their name, including Amrinder Gill, Amit Sadh, Johnny Yong Bosch, Tovino Thomas, Dulquer Salmaan, Siddique, Rajkummar Rao, Fahadh Faasil, Pankaj Tripathi, Dileesh Pothan, Joju George, and Ayushmann Khurrana.
  • Vijay Sethupathi, Fahadh Faasil, and Yogi Babu are the top three Indian actors who have acted in at least five movies.
  • Taapsee Pannu, Divya Dutta, and Kriti Kharbanda are the top three Hindi-speaking actresses who have acted in at least three movies.
  • Parvathy Thiruvothu, Susan Brown, and Amanda Lawrence are the best-rated actresses in the Drama genre.

8. Top-10 movies based on average rating are: Kirket, Love in Kilnerry, Gini Helida Kathe, Runam, Fan, Android Kunjappan Version 5.25, Yeh Suhaagraat Impossible, Safe, The Brighton Miracle, and Shibu.

  • Based on median rating counts, most of the movies are rated between 5 and 8, and fall under the one-time-watch and hit movie categories.

9. Top Grossing Movies

The highest-grossing movies of each year are:

i. Thank You for Your Service, a comedy movie released in 2017

ii. The Villain, a thriller movie released in 2018

iii. Joker, a drama movie released in 2019

2. Recommendation:

Based on the insights, the recommendations for RSVP are as follows:

  • Concentrate on multi-genre drama-comedy movies with a pinch of thriller, keeping an average duration of around 104 minutes.
  • Plan the movie's release between January and March. Focus on multilingual movies that can be launched in India and the USA as the preferred audience markets.
  • Rope in either Star Cinema or Twentieth Century Fox as the production house, under the direction of James Mangold with the assistance of A.L. Vijay.
  • Mammootty and Mohanlal can be the lead actors, supported by other side actors. Including Vijay Sethupathi would add star power to the movie's promotion.
  • Parvathy Thiruvothu, one of the best-rated drama actresses, should be brought in.

thecleverprogrammer

Movie Rating Analysis using Python

Aman Kharwal

  • September 22, 2021
  • Machine Learning

We all watch movies for entertainment. Some of us never rate them, while some viewers rate every movie they watch. Those viewers help people who read reviews before watching a movie to make sure they are about to watch a good one. So, if you are new to data science and want to learn how to analyze movie ratings using the Python programming language, this article is for you. In this article, I will walk you through the task of Movie Rating Analysis using Python.

Analyzing the rating given by viewers of a movie helps many people decide whether or not to watch that movie. So, for the Movie Rating Analysis task, you first need to have a dataset that contains data about the ratings given by each viewer. For this task, I have collected a dataset from Kaggle that contains two files:

  • one file contains the data about the movie Id, title and the genre of the movie 
  • and the other file contains the user id, movie id, ratings given by the user and the timestamp of the ratings

You can download both these datasets from here.

Now let’s get started with the task of movie rating analysis by importing the necessary Python libraries and the datasets:

In the above code, I have only imported the movies dataset that does not have any column names, so let’s define the column names:

Now let’s import the ratings dataset:

The rating dataset also doesn’t have any column names, so let’s define the column names of this data also:

Now I am going to merge these two datasets into one. Both datasets have a common column, ID, which contains the movie ID, so we can use it as the key to merge the two datasets:
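To make the idea of merging on a common column concrete, here is a minimal sketch in plain Python, independent of whichever library the article's own code uses. The rows are made up, and apart from ID the column names (Title, Genre, User, Ratings) are assumptions for illustration.

```python
# Hypothetical rows standing in for the two files; real values will differ.
movies = [
    {"ID": 1, "Title": "Movie A", "Genre": "Drama"},
    {"ID": 2, "Title": "Movie B", "Genre": "Comedy"},
]
ratings = [
    {"User": 10, "ID": 1, "Ratings": 8},
    {"User": 11, "ID": 1, "Ratings": 10},
    {"User": 10, "ID": 2, "Ratings": 6},
]

# Index the movies by the shared key, then attach movie fields to each rating.
movies_by_id = {movie["ID"]: movie for movie in movies}
merged = [
    {**movies_by_id[rating["ID"]], **rating}
    for rating in ratings
    if rating["ID"] in movies_by_id  # inner join: skip ratings with no movie
]

for row in merged:
    print(row)
```

Each merged row carries both the movie's fields and the rating's fields, so later steps can group ratings by title or genre.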

As this is a beginner-level task, I will first have a look at the distribution of the ratings given by the viewers across all the movies:

Figure: Movie Rating Analysis (pie chart of the rating distribution)

So, according to the pie chart above, 8 is the most common rating given by users. From the above figure, it can be said that most of the movies are rated positively.

As 10 is the highest rating a viewer can give, let’s take a look at the top 10 movies that got 10 ratings by viewers:

So, according to this dataset, Joker (2019) got the highest number of 10 ratings from viewers. This is how you can analyze movie ratings using Python as a data science beginner.

Analyzing the ratings given by viewers of a movie helps many people decide whether or not to watch that movie. I hope you liked this article on Movie rating analysis using Python. Feel free to ask your valuable questions in the comments section below.



Data Analysis Example: Analyzing Movie Ratings with Python

64 min to complete · By CodingNomads Team

  • Introduction
  • Inspect the data
  • Set up your notebook
  • Users: users.dat
  • Ratings: ratings.dat
  • Movies: movies.dat
  • Explore your data
  • Join the datasets
  • Visualize patterns
  • Explore a question
  • Top rated sci-fi movies by decades
  • Success! Your very own data analysis example project

Introduction: Movie Ratings Data Analysis Example

Post by Cagdas Yetkin, Senior Data Scientist and CodingNomads Mentor

Do you like movies? We do too! When working with our data science & analysis students, we like to use datasets that everyone can relate to – because it makes learning more fun! In this data analysis example, you will analyze a dataset of movie ratings to draw various conclusions. You will learn how to:

  • Get and Clean the data
  • Get the overall figures and basic statistics with their interpretation
  • Join datasets, aggregate and filter your data by conditions
  • Discover hidden patterns and insights
  • Create summary tables

This tutorial teaches you to perform all of the above tasks using Python and its popular pandas and matplotlib libraries. You can download and run the Jupyter Notebook used in this data analysis example here.

You can download the data from the original GitHub repo – Movie Tweetings Project.

The data in this example consists of movie ratings from Twitter since 2013, updated daily. The data was created from people who connected their IMDB profile with their Twitter accounts. Whenever they rated a movie on the IMDB website, an automated process generated a standard, well-structured tweet.

These well-structured tweets look like this:

"I rated The Matrix 9/10 http://www.imdb.com/title/tt0133093/ #IMDb"

Because of this nice structure, we can use this data to learn and practice data analysis using Python.

Tip: You are highly encouraged to write the code for this data analysis example yourself! This will help you truly understand the contents of this tutorial, give you the practice you need to improve your data analysis "muscle memory" skills, and you may discover some additional interesting revelations for yourself!

To get started, confirm that you have these 3 files in your working directory:

  • users.dat
  • ratings.dat
  • movies.dat

If all these files are accessible to you, you can start off your investigation by checking what these files contain. Let’s start off by looking at the first three lines in users.dat directly in your terminal:

Your output will look similar to this:

At first it may be confusing that you can’t see any field names but these are documented in the README file as follows:

In users.dat the first field is the user_id and the second one is twitter_id.

You can see that there is a surprising number of colons in this data snippet. Because you already know that you are working with two data fields, this means that the creators of this dataset decided to use a double-colon :: as a field separator. Interesting choice! It is helpful to keep in mind that data fields can be divided by all sorts of different separators, and it’s good to know which one is used in the data you are working with.
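To see what the double-colon separator means in practice, here is a small sketch in plain Python that splits such lines into the two documented fields. The sample lines are hypothetical and only mimic the user_id::twitter_id layout.

```python
# Hypothetical sample lines in the users.dat layout (user_id::twitter_id).
sample_lines = ["1::1001", "2::1002", "3::1003"]

users = []
for line in sample_lines:
    # Split on the two-character separator, not on a single colon.
    user_id, twitter_id = line.strip().split("::")
    users.append({"user_id": int(user_id), "twitter_id": int(twitter_id)})

print(users[0])
```

Splitting on a single ":" instead would produce an empty string between the two fields, which is why it pays to identify the exact separator first.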

With a basic idea of what you can expect to see in users.dat, let’s next take a peek into movies.dat:

The output of this file will look like this:

In this file, you have three fields:

  • movie_title

A single movie can belong to more than one genre, and the genres are separated by pipe characters |, another interesting choice!
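A line with both separators can be taken apart in two steps: first the double colons, then the pipes. In this sketch the sample line is invented, and the field names movie_id and genres are assumptions (only movie_title is named above).

```python
# Hypothetical line in the movies.dat layout; the values are made up.
line = "0000001::Example Movie (1999)::Drama|Romance"

# First split the record into its three fields...
movie_id, movie_title, genres = line.split("::")

# ...then split the genre field into individual genres.
genre_list = genres.split("|")

print(movie_title, genre_list)
```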

After looking at movies.dat, there’s only one file left to inspect. Let’s peek into ratings.dat next:

The output you will receive should look similar to the one below:

In this third dataset, your variables are:

  • rating_timestamp

And again it comes with an interesting feature: The timestamps are in unixtime format!

Unix time is a time format widely used in computing that counts the seconds elapsed since 00:00:00 UTC on January 1st, 1970. You can use online converters to translate it to a format that is easier to read for humans. If you’re interested, read more about Unix time on Wikipedia.
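Instead of an online converter, the conversion can also be done in a couple of lines of Python with the standard datetime module. The timestamp below is a hypothetical value chosen for illustration, not one taken from ratings.dat.

```python
from datetime import datetime, timezone

# Second zero of Unix time is the start of 1970 in UTC.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)

# A hypothetical rating timestamp, in seconds since the epoch.
rating_timestamp = 1362062307
as_datetime = datetime.fromtimestamp(rating_timestamp, tz=timezone.utc)

print(epoch.isoformat())
print(as_datetime.isoformat())
```

Passing tz=timezone.utc keeps the result independent of the time zone of the machine running the code.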

Now you have an overall understanding of how the raw datasets look. Next, you will import the libraries you will need for the rest of this analysis:

Let’s look a bit closer at the options you set up in the code snippet above. You:

  • Give the filter-out-warnings command to have a cleaner notebook without warning messages.
  • Set the max rows and max columns to some big numbers, in this case 50 . This option just makes all the columns and rows in a DataFrame more readable or visible.
  • Use fivethirtyeight style to have plots like the ones on fivethirtyeight.com: a website founded by Nate Silver. If you want to explore fivethirtyeight further, I highly recommend the book: The Signal and the Noise.

These imports and adjustments create a good base setup for you to get started with your analysis. Keep in mind that while the imports are necessary, the adjustments are just to make your analysis easier and better-looking.

Read in the Data

After importing the necessary libraries, you are now ready to read the files into pandas data frames.

There are a couple of adjustments you should make while reading in the data, to make sure it will be in good shape to work with:

  • Define that the separators are double colons ::
  • Give the column names, so they will become the headers
  • Convert the UNIX time to a datetime format

With this in mind, let’s read in users.dat, ratings.dat and movies.dat one by one:

Starting with users.dat, the following code snippet will read the file into your notebook, register the double-colon as the separator between the fields, and add column names as well:

This creates a DataFrame() object, and you can check the first few entries of this table-like object with the .head() method:

You will see a nicely formatted output that shows the first 5 rows of your users data frame:

You successfully read in the data from the external file and now have access to it as a DataFrame() object. Let’s do the same with the other files as well.

Similar to before, you will want to read in the data and save it into a data frame, define the separator, and pass in the column names. Additionally, you will also call the .sort_values() method on the data frame right away, to sort your data by when the ratings have been created:

You will also want to convert the rating_timestamp values to actual datetime format, and you can do that in pandas like so:

Let’s peek into the first 5 rows of your newly created ratings data frame:

Your output should look similar to below:

With the ratings data read in, there’s only one more file left to go.

Of course you also need to have access to information about the actual movies, to find potential correlations e.g. between ratings and movie genres. So, let’s read in that data next:

Checking the successful completion of this process with the familiar movies.head() command, you will see something similar to below:

With this, the data has been read in to the notebook. What follows next is exploration.

To get a feeling for the data you are working with, it always helps to play around a little and create some quick stats and graphs for different aspects of it. This will help you have a better overview of what the data is about.

Since you want to find out how well movies are liked or disliked, the most important variable is the movie rating . Let’s see its distribution:

Your output should look similar to the one you can see below:

value_counts() is a quick but effective way of checking what values your variable takes. Here we see quickly that the rating score 8 was given 211699 times!
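To see what `value_counts()` does without the full data set, here is a small sketch on ten made-up ratings. The numbers are invented for illustration and do not come from the article's data:

```python
import pandas as pd

# Ten made-up ratings; 8 appears most often.
ratings = pd.DataFrame({"rating": [8, 8, 8, 7, 7, 10, 9, 6, 8, 5]})

# Count how often each score occurs, most frequent first.
counts = ratings["rating"].value_counts()
print(counts)

# The index label with the highest count is the most common score.
most_common = counts.idxmax()
print(f"Most frequent rating: {most_common}")
```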

Let’s keep exploring. A histogram will show you the distribution and the describe() method will give additional basic statistics . Both of them are quite helpful to get quick insights, so let’s try them out next:

As mentioned, the .describe() method will display basic statistics about a column, so here they are for the rating column:

Next, let’s look at a visual representation of the data by creating a histogram:

The data with the above settings will produce a histogram that looks like this:

Rating Histogram with Bin of 10

You’ll notice that it is skewed to the left! That means that the distribution doesn’t have a symmetrical shape around the mean, and this specific off-balance distribution has a long tail on the left-hand side.

The hist() and describe() methods complement each other: one summarizes the distribution as text and the other shows it visually.

Because both describe the same distribution, you can also conclude that the rating is left-skewed by looking only at the text output of your .describe() method. The relevant data for this conclusion are:

  • The mean is much smaller than the median and
  • 25% of the data covers only until a rating of 6
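The two checks above can be sketched on a small, made-up series of ratings with a long tail on the left. The values are invented, so only the relationship between mean and median carries over to the real data:

```python
import pandas as pd

# Made-up ratings with a few very low scores pulling the mean down.
ratings = pd.Series([1, 4, 6, 7, 8, 8, 8, 9, 9, 10], name="rating")

stats = ratings.describe()
print(stats)

# In a left-skewed distribution the mean sits below the median ("50%").
is_left_skewed = stats["mean"] < stats["50%"]
print(f"mean={stats['mean']}, median={stats['50%']}, left-skewed={is_left_skewed}")
```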

This is a bit confusing. You have seen first that the highest frequency was 8 . And then, after generating the histogram, it looked like the ratings were highest around 9 – 10 .

This difference can arise because of binning . Different amounts of bins will lead to different results. Most of the time, the person conducting the analysis decides the right number of bins after a few trials. Generally, you will have a better idea about what is the right bin size for your data set after some research and digging into it.

Playing with the bins of a histogram can have an impact on the story you are telling. The same histogram would look like this if you increase the number of bins from 10 to 30 :

Rating Histogram with Bin of 30

You can see that this can lead to a different conclusion. If you were using the first histogram you would falsely argue that the most frequent rating was 9 or maybe 10 . However, the second one makes everything crystal clear and shows that the most frequent rating lies at 8 instead. Also, note that if you use the .value_counts() method, you wouldn’t fall into that trap.
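One way this binning effect can arise is sketched below with `numpy.histogram`, so you can inspect the bin counts without drawing a plot. The sketch assumes the scale has eleven distinct scores (0 to 10), which forces two of them to share a bin when you ask for only 10 bins. The ratings are made up:

```python
import numpy as np
import pandas as pd

# Made-up ratings: 8 is the single most frequent score,
# but 9 and 10 together outnumber it.
ratings = pd.Series([0, 3, 5, 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10])

# 10 bins of width 1: the last bin is [9, 10] and swallows both 9 and 10.
counts_10, edges_10 = np.histogram(ratings, bins=10, range=(0, 10))

# 30 bins of width 1/3: every integer score gets a bin of its own.
counts_30, edges_30 = np.histogram(ratings, bins=30, range=(0, 10))

print("tallest of 10 bins starts at:", edges_10[counts_10.argmax()])
print("tallest of 30 bins holds:", counts_30.max(), "ratings")
print("most frequent score:", ratings.value_counts().idxmax())
```

With 10 bins the tallest bar is the one covering 9 and 10, while with 30 bins the tallest bar holds exactly the ratings of 8, which matches what `value_counts()` reports.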

Thanks to these methods, you now have a clearer understanding of the rating variable in your data. You will focus on the user_id column next.

How many unique user_id do you have in the users data?

‘You have 68388 unique user ids in the data’

You have seen earlier that both value_counts() and describe() are quite handy. So why not combine them to learn a little more?

For instance, how many rating tweets does a user post on average? What is the minimum, maximum and median number of tweets posted by the users? The answers to these questions will enable you to understand how active the users are: Are they frequent users, or do they disappear after posting a single tweet?

Let’s try it out:
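A minimal sketch of that combination, using a made-up `ratings` frame in which one user is far more active than the rest:

```python
import pandas as pd

# Made-up data: user 1 posted 10 ratings, the others only 1 or 2.
ratings = pd.DataFrame({"user_id": [1] * 10 + [2, 2, 3, 4, 5, 5]})

# value_counts() gives ratings per user; describe() summarizes those counts.
per_user = ratings.user_id.value_counts()
stats = per_user.describe()
print(stats)
```

In this toy example the single heavy user pulls the mean (3.2) well above the median (2), which is the same right-skew pattern you look for in the real data.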

Running the code snippet above, you will receive another block of text-based statistics as your output:

Notice that this time you accessed the column using dot notation . In this case it does the same as accessing it through the square-bracket notation you used before, but is a little bit more convenient. Check out this StackOverflow post if you want to learn more about the limitations and differences between the two notations.

See in the above output how the mean is much greater than the median (12.83 vs 2). It means that the data is skewed to the right.

This skewness is at the extreme: Look how the max value is far, far away! Could there be someone posting more than 2000 times? Not likely.

The output also tells us that half of the users posted two ratings or fewer, while the mean is almost 13 . This is because of those users with extremely high usage numbers.

Could it be possible that they are not human beings but bots instead? That could be a great investigation topic, if you want to dive deeper.

But for this data analysis example, let’s leave this aside for now and continue by joining the datasets we have.

Joining data could be really difficult, as this tweet addresses:

Joining before Pandas Twitter

Luckily, with pandas you have a user-friendly interface to join your movies data frame with the ratings data frame. This is going to be an inner join. It means that you are bringing in the movies only if there is a rating available for them:

Inspecting the first two rows with the .head(2) method shows you this:

Notice that you didn’t need the on parameter when you joined the data, because you set the index of both data frames to movie_id , so the .join() method knew which variable to join on. Be aware that .join() performs a left join unless you tell it otherwise, so you need to pass how="inner" explicitly to get an inner join.
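Here is a hedged sketch of an index-based join on two tiny, made-up frames. It passes `how="inner"` explicitly, since `DataFrame.join` uses `how="left"` when the argument is omitted. The column names other than `movie_id` are assumptions:

```python
import pandas as pd

# Made-up ratings; movie 30 has no entry in the movies table.
ratings = pd.DataFrame(
    {"user_id": [1, 2, 3], "movie_id": [10, 10, 30], "rating": [8, 7, 9]}
)
# Made-up movies; movie 20 has no ratings.
movies = pd.DataFrame(
    {
        "movie_id": [10, 20],
        "movie_title": ["Toy Story (1995)", "Heat (1995)"],
        "genres": ["Animation|Comedy", "Crime|Drama"],
    }
)

# Both frames share movie_id as their index, so no `on` argument is needed.
movies_rating = ratings.set_index("movie_id").join(
    movies.set_index("movie_id"), how="inner"
)
print(movies_rating.head(2))

# For comparison: the default (left) join keeps the unmatched rating.
left_joined = ratings.set_index("movie_id").join(movies.set_index("movie_id"))
print(len(movies_rating), "rows with inner,", len(left_joined), "rows with left")
```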

Looking at the output of the .join() operation, you have a new problem: You want to quantify the genres , but how would you count them?

One way of doing that could be creating dummies for each possible genre , such as Sci-Fi or Drama , and having a single column for each. Creating dummies means creating 0 s and 1 s just like you can see in the example below:

The data frame that gets produced by this command looks like this:

You can concatenate these dummies to the original movies_rating data frame:
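Both steps can be sketched as follows. The example assumes that each movie's genres are stored in one string separated by a pipe character, and it uses a made-up `movies_rating` frame:

```python
import pandas as pd

# Made-up joined data; the "|" separator in genres is an assumption.
movies_rating = pd.DataFrame(
    {
        "movie_title": ["Toy Story (1995)", "Heat (1995)", "Braveheart (1995)"],
        "genres": ["Animation|Comedy", "Crime|Drama", "Drama"],
        "rating": [8, 9, 8],
    },
    index=pd.Index([10, 20, 30], name="movie_id"),
)

# One 0/1 column per genre found in the strings.
dummies = movies_rating["genres"].str.get_dummies(sep="|")
print(dummies)

# Put the dummy columns next to the original columns.
tidy_movie_ratings = pd.concat([movies_rating, dummies], axis=1)
print(tidy_movie_ratings)
```

Because each dummy column holds 0 s and 1 s, summing a column later tells you how many rows belong to that genre.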

Your newly created data frame will look like this:

This is almost as tidy as you want it, but it would be much more clean and useful if you could get those production years in a separate column. That would allow you to compare film productions over the years.

To accomplish this, you will practice working with the .str attribute, which is quite popular – and a lifesaver in many cases! You will:

  • Make a new column by getting the 4 digits representing the year
  • Remove the last 7 characters from the movie names
  • Check out the result

Let’s write the code for achieving these tasks:
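A minimal sketch of the string slicing, assuming every title ends in a space followed by a four-digit year in parentheses. The column name `production_year` is an assumption, and the three sample titles are only there to make the snippet runnable:

```python
import pandas as pd

tidy_movie_ratings = pd.DataFrame(
    {"movie_title": ["Toy Story (1995)", "Braveheart (1995)", "Carmencita (1894)"]}
)

# Characters -5..-2 are the four digits between the parentheses.
tidy_movie_ratings["production_year"] = tidy_movie_ratings["movie_title"].str[-5:-1]

# Drop the last 7 characters: " (1995)" is a space, two brackets and four digits.
tidy_movie_ratings["movie_title"] = tidy_movie_ratings["movie_title"].str[:-7]

print(tidy_movie_ratings)
```

The years are still strings at this point. If you want to compare or sort them numerically, you could convert the column with `.astype(int)`.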

Before checking out the results, let’s go ahead and reset the index on this data frame first:

Now you can see that you produce a better-formatted version of the data frame:

Congratulations! With this, you have completed the most difficult part of this data analysis example: Getting and cleaning the data. Let’s quickly recap what you did so far:

  • You read the raw data into data frames
  • You learned and reported basic statistics
  • You joined data frames and created new fields

You did some great work if you followed along all the way to here! As a reward, you can now watch the first movie in your records, from 1894 🙂

Next, you are going to visualize your data and discover some patterns. When delivering a report in a professional or academic setting, this is where things start to get very interesting!

First, you will start with visualizing the total volume of films created over the years.

Next, you will count the total number of productions for each year and plot it. The record you see for the year of 2021 should be filtered out before proceeding:

Similar to the .head() method you have encountered before, .tail() shows you a subset of the rows of your data frame. However, instead of showing the first ones, it shows you the last ones:

Aside from 2021, which you filtered out, the other interesting year here is 2020. Although more than half of the year 2020 has passed at the time of writing this article, there are only 5712 rated films and movies for the year so far. Looks like 2020 is one of the most extraordinary years in history? Or maybe the movies are so new, that people didn’t have the time to watch them yet. Or both!

You can chart a 5 year moving average of the total productions:
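The counting, filtering and smoothing can be sketched like this. The frame is made up, `production_year` is an assumed column name holding integers, and the sketch counts rows per year rather than plotting them:

```python
import pandas as pd

# Made-up data: one row per rated film, plus a stray record for 2021.
tidy_movie_ratings = pd.DataFrame(
    {"production_year": [2010, 2010, 2011, 2012, 2012, 2012, 2013, 2014, 2014, 2015, 2021]}
)

# Filter out the 2021 record before counting.
before_2021 = tidy_movie_ratings[tidy_movie_ratings["production_year"] < 2021]

# Number of rows per production year, ordered by year.
per_year = before_2021.groupby("production_year").size()
print(per_year.tail())

# Average over a window of 5 consecutive rows (here: 5 consecutive years).
moving_avg = per_year.rolling(5).mean()
print(moving_avg)
```

The first four values are `NaN` because a full window of five years is not available yet. Note that `rolling(5)` counts rows, so it only equals a 5-year window if no year is missing from the index. Calling `moving_avg.plot()` would draw the line chart, provided matplotlib is installed.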

This will produce a graphic similar to the one below:

5-year moving average plot

You can see that the 5-year moving average is in a shocking decline! What is happening here? What can be the reason? Can you formulate some hypotheses? Here are some points for you to consider:

  • This was an inner join. So these are the rated movies. Perhaps site and app usage went down.
  • The filming industry is in a serious crisis! They are not producing films because of COVID-19.
  • People didn’t have time to watch the most recent movies. If they didn’t watch them, they don’t rate them, and you can see a decline in ratings. For example, I didn’t watch the Avengers series before doing this analysis. On the other hand, the movie Braveheart (1995) most probably had enough time to get high numbers.

Each of these hypotheses could warrant an investigation, and there might be other ideas that you can come up with yourself. Feel free to explore any of these hypotheses further on your own. Remember that practicing your skills by following your interests is one of the best ways to learn new skills and keep them sharp.

For this data analysis example, let’s continue by investigating a slightly different question:

What have people watched (or rated) most since 2000?

For this question, let’s focus on the genres with a high volume of movies. You are going to identify the 6 genres with the highest number of movies in them, and select only those genres to produce the next chart:
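As a sketch of how such a ranking can be built from the dummy columns, the example below sums each genre column and keeps the largest ones. The frame and the `genre_columns` list are made up, and the sketch keeps the top 2 instead of the top 6 so that the toy data stays small:

```python
import pandas as pd

# Made-up tidy data with three 0/1 genre columns.
tidy_movie_ratings = pd.DataFrame(
    {
        "movie_title": ["A", "B", "C", "D"],
        "Drama": [1, 1, 0, 1],
        "Thriller": [0, 1, 1, 0],
        "Sci-Fi": [0, 0, 1, 0],
    }
)
genre_columns = ["Drama", "Thriller", "Sci-Fi"]  # hypothetical helper list

top_genres = (
    tidy_movie_ratings[genre_columns]
    .sum()                          # rows per genre
    .sort_values(ascending=False)   # largest first
    .head(2)                        # use 6 for the analysis in the text
    .index.values                   # keep only the genre names
)
print(top_genres)
```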

Unless the movie industry changed significantly in the time between writing this article and when you are reading it, your output will probably look like this:

Now, you want to get the ratings for these genres from your tidy_movie_ratings data frame, but restrict the ratings to only the movies made between 2000 and 2019:

Finally, you can create a graph showing a 2-year moving average of the total volume of rated films:

And here is your graph output for this data:

2-year moving average plot for total rated films

This gives a nice visual representation and helps you to interpret the data to answer the question you posed before. Here are the take-aways that I took from it:

  • Drama and Thriller are the winner genres
  • Seems that Sci-Fi & Adventure are not as popular

On the other hand, some patterns can be misleading since we are only looking at the absolute numbers. Therefore, another way to analyze this phenomenon would be to look at the percentage changes . This could help your decision making if you are, let’s say, in the business of online movie streaming.

So let’s give that a try and plot the percentage changes:
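To illustrate why percentage changes tell a different story than absolute numbers, here is a small sketch using `pct_change()` on made-up yearly counts for two genres:

```python
import pandas as pd

# Made-up number of rated films per year and genre.
yearly_counts = pd.DataFrame(
    {"Drama": [100, 110, 121], "Sci-Fi": [10, 20, 30]},
    index=pd.Index([2000, 2001, 2002], name="production_year"),
)

# Year-over-year change in percent; the first year has no predecessor.
pct_changes = yearly_counts.pct_change() * 100
print(pct_changes.round(1))

# Smooth the changes with a moving average (window of 2 on this toy data).
smoothed = pct_changes.rolling(2).mean()
print(smoothed.round(1))
```

Drama is ten times larger in absolute terms, yet it grows by only about 10% per year, while the much smaller Sci-Fi column grows by 100% and then 50%.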

From this filtered data, let’s produce a 5-year moving average graph:

And the output is shown below:

5-year moving average plot for percentage changes

You notice the decline you already spotted earlier. However, it’s interesting to see the Sci-Fi & Adventure genres moving to the top.

Indeed, Sci-Fi & Adventure movies were a real hype , and you might want to bet on them, especially if your business is somewhat related to global film industry trends. These two genres have the steepest increase in the number of ratings received. This may signal an increasing demand and could be a valuable insight for your business.

Let’s stay with one of these hyped genres for a bit longer and explore yet another question you can answer through this data set.

Let’s say you’re still building out your imaginary streaming service. You understood that the interest in Sci-Fi movies is rising sharply, and you want to make it easy for your users to find the best Sci-Fi movies of all time. Which movies from each decade could you suggest to your users by default?

To answer this question, let’s start by writing the necessary steps:

  • Build a scifi base table containing only the columns you need
  • Filter for the records before 2020
  • Create a new column called decade
  • Check it out

And here’s the code to accomplish these tasks:
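The four steps above can be sketched as follows. The rows are illustrative, the ratings are invented, and `production_year` is an assumed column name holding integers:

```python
import pandas as pd

# Illustrative rows; the last one is a hypothetical non-Sci-Fi title.
tidy_movie_ratings = pd.DataFrame(
    {
        "movie_title": ["Alien", "Blade Runner", "Inception", "Tenet", "Some Drama"],
        "production_year": [1979, 1982, 2010, 2020, 1999],
        "rating": [9, 9, 8, 7, 6],
        "Sci-Fi": [1, 1, 1, 1, 0],
    }
)

columns = ["movie_title", "production_year", "rating"]
is_scifi = tidy_movie_ratings["Sci-Fi"] == 1
before_2020 = tidy_movie_ratings["production_year"] < 2020

# Base table: Sci-Fi rows before 2020, only the columns you need.
scifi = tidy_movie_ratings.loc[is_scifi & before_2020, columns].copy()

# Floor division drops the last digit of the year: 1979 -> 1970.
scifi["decade"] = scifi["production_year"] // 10 * 10
print(scifi.head())
```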

The first 5 rows of your new scifi data frame will look like this:

Next, you will filter for movies that have more than 10 ratings. But how can you find out how many times a movie was rated? Here .groupby() comes to the rescue. After getting the counts, you will build an array called movie_list that holds only the movies with more than 10 ratings. The final step is to get the index values of the filtered count_group , which you can do with the .index.values attribute:

The output looks like below:

movie_list now contains those movies that have been rated more than 10 times. Next, you will filter on your scifi base table using the movie_list . Notice the usage of the .isin() method. It is quite user-friendly and straight-forward:
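A hedged sketch of the counting and filtering, on a made-up `scifi` table. To keep the toy data small, the threshold is stored in a hypothetical `min_ratings` variable and set to 1; the analysis in the text uses 10:

```python
import pandas as pd

# Made-up ratings: two films with several ratings, one with a single rating.
scifi = pd.DataFrame(
    {
        "movie_title": ["Alien"] * 3 + ["Inception"] * 2 + ["Obscure Film"],
        "rating": [9, 8, 10, 8, 9, 3],
    }
)
min_ratings = 1  # use 10 for the analysis described in the text

# Number of ratings per movie.
count_group = scifi.groupby("movie_title")["rating"].count()

# Titles that were rated more than min_ratings times.
movie_list = count_group[count_group > min_ratings].index.values
print(movie_list)

# Keep only the rows whose title is in movie_list.
filtered_scifi = scifi[scifi["movie_title"].isin(movie_list)]
print(filtered_scifi)
```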

After you created the filtered_scifi table, you can focus on building up your metrics in order to select the best liked movies of each decade. You will look at the average rating, and you will need to .groupby() decade and movie_title .

It is important to sort the aggregated values in descending order to get the results you are expecting. You want each group to have a maximum of 5 films, so a lambda expression can help you go through the decade groups and keep only the top 5. If a decade has fewer than 5 films, you want to show only the top movie, meaning only 1 record. Finally, you will round the ratings to two decimal places.

You are encouraged to chop the code shown below into single lines and see the individual result for each of them:
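One possible way to write this is sketched below. It is not necessarily the article's exact code: the helper `top_n` is hypothetical, the data is made up, and the example calls it with `n=2` instead of 5 so that both branches of the rule show up on a tiny table:

```python
import pandas as pd

# Made-up filtered ratings: three films in the 1980s, one in the 1990s.
filtered_scifi = pd.DataFrame(
    {
        "decade": [1980, 1980, 1980, 1980, 1980, 1980, 1990, 1990],
        "movie_title": ["A", "A", "B", "B", "C", "C", "D", "D"],
        "rating": [9, 8, 7, 7, 10, 9, 6, 7],
    }
)

def top_n(group, n=5):
    """Return the n best films of a decade, or only the best one
    if the decade has fewer than n films."""
    ranked = group.sort_values(ascending=False)
    return ranked.head(n) if len(ranked) >= n else ranked.head(1)

# Average rating per (decade, movie_title) pair.
avg_rating = filtered_scifi.groupby(["decade", "movie_title"])["rating"].mean()

top_rate_by_decade = (
    avg_rating.groupby(level="decade", group_keys=False)
    .apply(lambda group: top_n(group, n=2))  # use n=5 for the real analysis
    .round(2)
)
print(top_rate_by_decade)
```

Running each chained call on its own line, as suggested above, is a good way to check the intermediate result of the first `.groupby()`, the sorting and the final rounding.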

The output of this operation will be your top-rated Sci-Fi movies by decade:

If you want to see the values starting from 1990, you can do so by slicing the data frame:

Here are the results going back to 1990:

Congratulations! You have officially completed your first movie recommendation engine! Ok, I know it’s not quite Netflix – which uses machine learning to recommend what you should watch. However in the tables you just generated, you’ve established some rule-of-thumb recommendations based on data and logic – a solid and fun first step!

What’s more, you’ve completed your own full data analysis example project:

  • You read your data as pandas data frames
  • You created basic statistics and interpreted the results
  • You joined data frames, applied conditions to filter them, and aggregated them
  • You used data visualization to find patterns and develop hypotheses
  • And you didn’t jump to conclusions about root causes. You kept your reasoning simple and skeptical
  • You created summary tables

All of the above are important and common aspects of working with data.

Source Code on GitHub

If you enjoyed this data analysis example and you want to learn more and practice your skills further:

  • Add More Data : You can search for some additional IMDB data freely available on the internet. Chances are they contain information about directors of the movies. You could join this data with your tidy_movie_ratings dataset and see which directors are getting top ratings for which movies over the years, and by decades. This way, you can practice everything you have learned here over again
  • Build Your Service : You can write a function which takes the top_rate_by_decade data frame as input and returns a random movie from the list, further simulating a movie recommendation system
  • Your Idea Here : There are limitless possibilities to practice and play with this data. Share your explorations with us if you do!
  • If you want to learn more : Check out CodingNomads’ Data Science & Machine Learning Course to dive even deeper into data analysis and run full end-to-end machine learning projects on your own!
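For the "Build Your Service" idea, a minimal sketch could look like the function below. It assumes that `top_rate_by_decade` is a Series indexed by decade and movie_title, and the function name `recommend_random` and its `seed` parameter are hypothetical:

```python
import random

import pandas as pd

def recommend_random(top_rate_by_decade, seed=None):
    """Pick one random title from the top-rated table."""
    rng = random.Random(seed)  # pass a seed for reproducible picks
    titles = top_rate_by_decade.index.get_level_values("movie_title")
    return rng.choice(list(titles))

# Made-up table in the assumed shape: (decade, movie_title) -> average rating.
top_rate_by_decade = pd.Series(
    [9.5, 8.5, 6.5],
    index=pd.MultiIndex.from_tuples(
        [(1980, "C"), (1980, "A"), (1990, "D")],
        names=["decade", "movie_title"],
    ),
)

suggestion = recommend_random(top_rate_by_decade, seed=42)
print(f"Tonight you could watch: {suggestion}")
```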

I hope you enjoyed this article and continue having fun with analyzing your datasets.

About the Author: Cagdas Yetkin is a Senior Data Scientist at Jabil, where he works on Integrated Circuit Test projects that detect anomalies using test probe measurements in Printed Circuit Board Assembly production across multiple sites, and on Supply Chain Digitalization for dynamic slotting and bin optimization in factory warehouses. He develops soccer analytics and betting applications as a hobby, and enjoys traveling. Connect with him on LinkedIn and Twitter .

Editor : Martin Breuss, @martinbreuss , martinbreuss.com
