backpackerhh / core-set.sql
bmwilllee commented Apr 2, 2017
Thanks, very helpful!
EmiliaDariel commented Jul 12, 2018
can anyone please explain this: SELECT DISTINCT Re1.name, Re2.name FROM Rating R1, Rating R2, Reviewer Re1, Reviewer Re2 WHERE R1.mID = R2.mID AND R1.rID = Re1.rID AND R2.rID = Re2.rID AND Re1.name < Re2.name ORDER BY Re1.name, Re2.name;
safwans commented Jun 6, 2019 • edited
Is there a potential issue in #9 because you may have repeating rows due to the join? I wrote the query below and got slightly different average
select (select avg(ratings) from( select avg(r.stars) ratings from rating r, movie m where r.mid = m.mid and m.year < '1980' group by r.mid )) - (select avg(ratings) from( select avg(r.stars) ratings from rating r, movie m where r.mid = m.mid and m.year >= '1980' group by r.mid ))rat
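The concern about repeated rows can be checked on a toy dataset. The sketch below is only an illustration: it uses Python's built-in sqlite3 module, made-up rows (not the exercise data), and the Movie/Rating column names that appear in this thread. It compares averaging all rating rows directly with first averaging per movie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (mID INTEGER, title TEXT, year INTEGER);
    CREATE TABLE Rating (rID INTEGER, mID INTEGER, stars INTEGER);
    -- Made-up rows for illustration only
    INSERT INTO Movie VALUES (1, 'Old A', 1950), (2, 'Old B', 1970);
    INSERT INTO Rating VALUES (10, 1, 2), (11, 1, 2), (12, 1, 2), (13, 2, 5);
""")

# Pooled average: every rating row counts once, so a movie with many
# ratings weighs more than a movie with few.
pooled = conn.execute("""
    SELECT AVG(r.stars)
    FROM Rating r JOIN Movie m ON r.mID = m.mID
    WHERE m.year < 1980
""").fetchone()[0]

# Average of per-movie averages: every movie counts exactly once.
per_movie = conn.execute("""
    SELECT AVG(movie_avg) FROM (
        SELECT AVG(r.stars) AS movie_avg
        FROM Rating r JOIN Movie m ON r.mID = m.mID
        WHERE m.year < 1980
        GROUP BY r.mID
    )
""").fetchone()[0]

print(pooled, per_movie)  # 2.75 3.5
```

With these rows the two approaches disagree (2.75 versus 3.5), which is the kind of gap that appears when the per-movie grouping step is skipped.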
femiaiyeku commented Jul 27, 2019
very helpful to prepare for sql interview
AshwinAJa commented Sep 21, 2019
how to download data
macso95 commented Feb 22, 2020 • edited
How would I do these ones?
For each movie, display the number of times it was reviewed and the average of the number of stars it received. List only the movies that were reviewed three or more times.
Use a correlated reference to find all reviews that have occurred on the same day by different reviewers. Display the reviewer ID and date of the review. Print out the. Order by rating date. You must use the word EXISTS within query.
bartubozkurt commented May 1, 2020
How can I do ? - How many movies have been made each year? - How many actors are there in each movie? thank you for the Exercises
bikashghadai3 commented Jun 3, 2020
how find the rating of 1 and 2 stars for the last 5 days in a week in a table.
wahabmemo commented Apr 21, 2021
You have to display an actor name who has worked in many films. [Use join, group by, order by]
ghost commented Jun 9, 2021
How can I do ? How many movies have been made each year? How many actors are there in each movie? thank you for the Exercises
Windsleeper commented Jun 25, 2021
Thank you, this is very helpful!
arsalh commented Aug 27, 2021
For the average rating of movies before and after 1980 question (movies #9), can someone help me what I am doing wrong in my query below? Instead of getting result of 0.0555555555555558, mine comes to 0.05555555555555536. Small difference but would like to understand what I am doing wrong. Thank you so much!
SELECT distinct (SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year<1980)
(SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year>1980)
AlMokgalaka commented Apr 7, 2022
EmiliaDariel, have a look at this site from Stanford, which might help you: http://linyishui.top/2019090601.html. Otherwise: SELECT DISTINCT returns only distinct values, so duplicate rows are removed from the result. A column often contains many duplicate values, and sometimes we only want to list the different ones. The FROM clause lists the tables with short aliases (R1, R2, Re1, Re2), and the WHERE clause with its AND conditions joins them on their common columns (mID and rID). This is the older, theta-style way of writing a join, and the aliases keep the chains of names short. Also check that the table and column names in your query are spelled exactly as in the schema. SQL keywords are case-insensitive by default, but whether table names are case-sensitive depends on the database and its configuration, and if the query refers to a table that does not exist (for example RATINGS when the table is called Rating), the query handler reports that it has no such table. See an example:
SELECT
Orders.OrderID, Orders.CustomerID, Orders.EmployeeID, Orders.OrderDate, Orders.RequiredDate, Orders.ShippedDate, Orders.ShipVia, Orders.Freight, Orders.ShipName, Orders.ShipAddress, Orders.ShipCity, Orders.ShipRegion, Orders.ShipPostalCode, Orders.ShipCountry,
Customers.CompanyName, Customers.Address, Customers.City, Customers.Region, Customers.PostalCode, Customers.Country
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
Note: The table names need not be repeated unless the same column names exist in both tables. The table names are only required in the FROM, JOIN, and ON clauses, and in the latter, only because the relating column, CustomerID, has the same name in both tables.
The query syntax shown above follows ANSI (American National Standards Institute) rules and should work in the latest versions of all relational databases. Older syntax includes the join condition in the WHERE clause (theta style). Note the number of rows and columns in the result set for the Orders Query and try the same example (with fewer columns), using the older style and table aliases, as follows:
SELECT o.OrderID, o.EmployeeID, o.OrderDate, o.RequiredDate, o.ShippedDate, o.ShipVia, o.Freight, c.CompanyName, c.Address, c.City, c.Region, c.PostalCode, c.Country
FROM Customers c, Orders o WHERE c.CustomerID = o.CustomerID;
Note for MS Access users: Compare this query in design view with the ANSI style query. MS Access runs the query correctly but cannot represent it in the usual way in the graphical query interface.
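To make the two join styles concrete for the reviewer-pair query asked about earlier in this thread, here is a small sketch using Python's built-in sqlite3 module. The rows are made up for illustration; only the table and column names come from the query itself. The condition `Re1.name < Re2.name` removes pairs of a reviewer with themselves and keeps each pair only once, in alphabetical order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Reviewer (rID INTEGER, name TEXT);
    CREATE TABLE Rating (rID INTEGER, mID INTEGER, stars INTEGER);
    -- Made-up rows: movie 100 is rated by Ann and Bob, movie 101 by Bob and Cy
    INSERT INTO Reviewer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Rating VALUES (1, 100, 4), (2, 100, 3), (2, 101, 5), (3, 101, 2);
""")

# Older theta style: all tables in FROM, join conditions in WHERE.
theta = conn.execute("""
    SELECT DISTINCT Re1.name, Re2.name
    FROM Rating R1, Rating R2, Reviewer Re1, Reviewer Re2
    WHERE R1.mID = R2.mID AND R1.rID = Re1.rID AND R2.rID = Re2.rID
      AND Re1.name < Re2.name
    ORDER BY Re1.name, Re2.name
""").fetchall()

# ANSI style: the same joins written with JOIN ... ON.
ansi = conn.execute("""
    SELECT DISTINCT Re1.name, Re2.name
    FROM Rating R1
    JOIN Rating R2 ON R1.mID = R2.mID
    JOIN Reviewer Re1 ON R1.rID = Re1.rID
    JOIN Reviewer Re2 ON R2.rID = Re2.rID
    WHERE Re1.name < Re2.name
    ORDER BY Re1.name, Re2.name
""").fetchall()

print(theta)  # [('Ann', 'Bob'), ('Bob', 'Cy')]
print(theta == ansi)  # True
```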
shnartho commented Jun 22, 2022
Thanks a lot man
Antimatterr commented Sep 29, 2022
For the average rating of movies before and after 1980 question (movies #9), can someone help me what I am doing wrong in my query below? Instead of getting result of 0.0555555555555558, mine comes to 0.05555555555555536. Small difference but would like to understand what I am doing wrong. Thank you so much! SELECT distinct (SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year<1980) (SELECT avg(rt_avg) FROM (SELECT m.mID, avg(rt.stars) as rt_avg, year FROM Rating rt JOIN Movie m ON rt.mID=m.mID GROUP BY rt.mID) temp WHERE year>1980) FROM Movie
you can use this: SELECT AVG(S1)-AVG(S2) FROM( SELECT AVG(STARS) S1 FROM MOVIE M,RATING R WHERE M.MID=R.MID and year<1980 GROUP BY M.MID ), (SELECT AVG(STARS) S2 FROM MOVIE M,RATING R WHERE M.MID=R.MID and year>1980 GROUP BY M.MID);
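The two results quoted above agree to about 15 significant digits, which is the typical size of floating-point rounding differences. One plausible explanation (not verified against the exercise data) is that the two queries add up the same numbers in a different order. The sketch below shows the general effect in plain Python:

```python
import math

# Floating-point addition is not associative: grouping changes the last digits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)   # 0.6000000000000001
print(right_to_left)   # 0.6
print(left_to_right == right_to_left)              # False
print(math.isclose(left_to_right, right_to_left))  # True
```

When comparing query results like these, a tolerance-based comparison such as `math.isclose` is usually more meaningful than exact equality.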
himanshu43215 commented Mar 10, 2023
Perform the Following Operations and Capture the query plan before each query.
- Write a query in SQL to list the Horror movies
- Write a query in SQL to find the name of all reviewers who have rated 8 or more stars
- Write a query in SQL to list all the information of the actors who played a role in the movie ‘Deliverance’.
- Write a query in SQL to find the name of the director (first and last names) who directed a movie that casted a role for 'Eyes Wide Shut'. (using subquery)
- Write a query in SQL to find the movie title, year, date of release, director and actor for those movies which reviewer is ‘Neal Wruck’
- Write a query in SQL to find all the years which produced at least one movie and that received a rating of more than 4 stars.
- Write a query in SQL to find the name of all movies who have rated their ratings with a NULL value
- Write a query in SQL to find the name of movies who were directed by ‘David’
- Write a query in SQL to list the first and last names of all the actors who were cast in the movie ‘Boogie Nights’, and the roles they played in that production.
- Find the name of the actor who have worked in more than one movie.
please help me to solve this
IMDB-Data-Analysis-in-SQL
This project was carried out to answer a set of analytical questions and to suggest to a movie production house which set of actors, directors, and production houses would be the best fit for a super hit commercial movie.
Table of Contents (TOC)
- Database Creation for the Project
- Table Creation
- Data Insertion
- Data Analysis
- EXECUTIVE SUMMARY AND RECOMMENDATIONS
1. Overview
This analysis is carried out to support RSVP Movies with a well-analyzed list of global stars to plan a movie for the global audience in 2022.
With this, we will be able to answer a set of analytical questions to suggest RSVP Production House on which set of actors, directors, and production houses would be the best fit for a super hit commercial movie.
RSVP Movies is an Indian film production company that has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.
Why this Analysis?
The production company wants to plan its every move analytically based on data and has approached for help with this new project.
We have been provided with the data of the movies that have been released in the past three years. Let’s analyze the data set and draw meaningful insights that can help them start their new project.
We will use SQL to analyze the given data and give recommendations to RSVP Movies based on the insights.
We will carry out the entire analytics process in four segments, where each segment leads to significant insights from different combinations of tables.
2. Database Creation for the Project
a. Check the list of databases
- The very first step of any MySQL analysis is to access the database and check if related data is available or not.
- Use show databases; to access the list of databases:
b. Create Database
- Create a new database for this project.
- Use Create database IMDB;
- Use show databases; to confirm the list of databases:
c. Use Database
- Instruct the system to use *IMDB Database* by running use imdb;
3. Table Creation
Steps to follow before creating the tables:
- Download the IMDb dataset and try to understand every table and its importance.
- Understand the ERD and the table details. Study them carefully and understand the relationships between the tables.
- Inspect each table given in the subsequent tabs and understand the features associated with each of them.
- Draft your tables with the correct data types and constraints on paper or in a note file.
- Open MySQL Workbench and start writing the DDL and DML commands to create the database.
Create Table
For this project we need a total of 6 tables:
a. Create Table Movie
b. Create Table genre
c. Create Table director_mapping

| Table Name: director_mapping | Column Description |
| ----------- | ----------- |
| movie_id | Movie Id of the movie directed by a director |
| name_id | Name ID of the director |
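The DDL for this table is not reproduced here. As a rough sketch of what it could look like, the example below creates the two documented columns using Python's built-in sqlite3 module as a stand-in for MySQL. The column types and the composite primary key are assumptions, not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in; the project itself uses MySQL

conn.execute("""
    CREATE TABLE director_mapping (
        movie_id TEXT NOT NULL,          -- assumed type
        name_id  TEXT NOT NULL,          -- assumed type
        PRIMARY KEY (movie_id, name_id)  -- assumed constraint
    )
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(director_mapping)")]
print(columns)  # ['movie_id', 'name_id']
```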
d. Create Table role_mapping
e. Create Table names
f. Create Table ratings
Now run show tables; to ensure that all six tables have been created.
4. Data Insertion
In the previous steps, we created six tables. Now, we will insert the data into these tables. Here we show the syntax for inserting 5 rows into each table. (The complete data insertion syntax is available in the Repository)
a. Inserting data into Movie Table
b. Inserting data into genre table
c. Inserting data into director_mapping table
d. Inserting data into role_mapping table
e. Inserting data into names table
f. Inserting data into ratings table
Checking tables for inserted values:
Select * from Movie;
Select * from Genre;
Select * from Director_Mapping;
Select * from Role_Mapping;
Select * from Names;
Select * from Ratings;
All the sample data inserted looks good, so we can go ahead with inserting the complete data. For the insertion to work smoothly, let's first remove all data from the tables using TRUNCATE:
Insert Complete data
Run the command to insert complete data: IMDB File 3 Insert all data
1. Find the total number of rows in each table of the schema?
Alternative 1:
Number of Rows after ignoring the Null Rows
Alternative 2:
Rows count inclusive of Null Rows:
| TABLE_NAME | Tables_in_imdb |
| ----------- | ----------- |
| director_mapping | 3867 |
| genre | 14662 |
| movie | 8519 |
| names | 23714 |
| ratings | 8230 |
| role_mapping | 15173 |
2. Which columns in the movie table have null values?
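| id_null | title_null | year_null | date_null | duration_null | country_null | world_null | language_null | production_null |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0 | 0 | 0 | 0 | 0 | 20 | 3724 | 194 | 528 |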
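One common way to get such per-column null counts in a single pass is `SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END)`. The sketch below illustrates the pattern with Python's built-in sqlite3 module on a made-up, cut-down `movie` table. The column subset and the rows are assumptions, and the project's own query may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id TEXT, title TEXT, country TEXT, languages TEXT);
    -- Made-up rows for illustration only
    INSERT INTO movie VALUES
        ('m1', 'A', 'India', NULL),
        ('m2', 'B', NULL,    NULL),
        ('m3', 'C', 'USA',   'English');
""")

# Each SUM adds 1 for every row where the column is NULL.
row = conn.execute("""
    SELECT
        SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END)        AS id_null,
        SUM(CASE WHEN title IS NULL THEN 1 ELSE 0 END)     AS title_null,
        SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END)   AS country_null,
        SUM(CASE WHEN languages IS NULL THEN 1 ELSE 0 END) AS language_null
    FROM movie
""").fetchone()

print(row)  # (0, 0, 1, 2)
```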
3.1. Find the total number of movies released each year?
Movies per year:
3.2. Find the total number of movies released each month
Movies per month:
4.1 Find the count of Indian movies.
4.2 Find the count of movies from USA
4.3 Find the count of movies which are either from India or USA
4.4 Find the count of movies that are either from India or USA and released in 2019.
5. Find the unique list of the genres present in the data set
6.1 Find the movies count for each genre.
6.2 Find the genre with the maximum number of movies.
6.3 Find the genre with minimum number of movies.
6.4 Find the top-3 genre with the maximum number of movies.
6.4 Find the movies count for Action genre.
6.5 Find the genre count for each movie.
6.6 Find the list of Indian movies that belongs to 3 genre.
6.7 Longest Indian movie tagged with 3 genre.
‘tt6200656’, ‘Kammara Sambhavam’, ‘182’, ‘3’
6.8 Which genres are tagged with ‘Kammara Sambhavam’ movie.
| genre |
| ----------- |
| Action |
| Comedy |
| Drama |
7.1. How many movies belong to only one genre?
Create a list of Movies with a genre count
Restrict the list to Genre count = 1
Count the total number of rows
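The three steps above can be sketched as a single query with a common table expression. The example uses Python's built-in sqlite3 module with made-up rows. The `genre` table with `movie_id` and `genre` columns is an assumption based on the table names in this write-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genre (movie_id TEXT, genre TEXT);
    -- Made-up rows: m1 and m3 have one genre, m2 has two
    INSERT INTO genre VALUES
        ('m1', 'Drama'),
        ('m2', 'Drama'), ('m2', 'Comedy'),
        ('m3', 'Thriller');
""")

single_genre_movies = conn.execute("""
    WITH genre_counts AS (            -- step 1: genre count per movie
        SELECT movie_id, COUNT(genre) AS genre_count
        FROM genre
        GROUP BY movie_id
    )
    SELECT COUNT(*)                   -- step 3: count the remaining rows
    FROM genre_counts
    WHERE genre_count = 1             -- step 2: keep single-genre movies
""").fetchone()[0]

print(single_genre_movies)  # 2
```

Changing the filter to `genre_count = 2` or `genre_count = 3` answers the follow-up questions in the same way.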
7.2. How many movies belong to two genres?
7.3. How many movies belong to three genres?
8.1. What is the average duration of movies in each genre?
8.2. Rank the genre by the average duration of movies in each genre.
9. What is the rank of the ‘Thriller’ genre of movies among all the genres in terms of the number of movies produced?
10. Find the minimum and maximum values in each column of the rating table except the movie_id column
11. Which are the top 10 movies based on average rating?
12. Summarize the ratings table based on the movie counts by median ratings.
13. Which production house has produced the most number of hit movies (average rating > 8)?
Create list of production house with count of movies where average rating > 8 and Ranked over “Movies count”
Applied CTE to pull the production house with Rank = 1
NOTE: applied (production_company IS NOT NULL) as there are few movies without production house name
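A minimal sketch of this approach, using Python's built-in sqlite3 module with made-up rows. The join columns `movie.id` and `ratings.movie_id` are assumptions. `production_company` and `avg_rating` are the names used in this write-up. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id TEXT, production_company TEXT);
    CREATE TABLE ratings (movie_id TEXT, avg_rating REAL);
    -- Made-up rows; m4 has no production house name
    INSERT INTO movie VALUES
        ('m1', 'Studio A'), ('m2', 'Studio A'), ('m3', 'Studio B'), ('m4', NULL);
    INSERT INTO ratings VALUES ('m1', 8.5), ('m2', 9.0), ('m3', 8.2), ('m4', 9.5);
""")

top = conn.execute("""
    WITH hit_counts AS (
        -- count hit movies (average rating > 8) per production house
        SELECT m.production_company, COUNT(*) AS movie_count
        FROM movie m
        JOIN ratings r ON r.movie_id = m.id
        WHERE r.avg_rating > 8
          AND m.production_company IS NOT NULL
        GROUP BY m.production_company
    ),
    ranked AS (
        -- rank production houses by that count
        SELECT production_company, movie_count,
               RANK() OVER (ORDER BY movie_count DESC) AS company_rank
        FROM hit_counts
    )
    SELECT production_company, movie_count
    FROM ranked
    WHERE company_rank = 1
""").fetchall()

print(top)  # [('Studio A', 2)]
```

Because `RANK()` gives tied rows the same rank, filtering on rank 1 returns every production house that shares the top count, not just one of them.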
14. How many movies released in each genre during March 2017 in the USA had more than 1,000 votes?
15. Find movies of each genre that start with the word ‘The’ and which have an average rating > 8
16. Of the movies released between 1 April 2018 and 1 April 2019, how many were given a median rating of 8?
17. Do German movies get more votes than Italian movies?
18. Which columns in the names table have null values?
19. Who are the top three directors in the top three genres whose movies have an average rating > 8?
Pull the Top three Genre by Movie count where avg_rating > 8
Pull the Directors with Movie count where avg_rating > 8
Keeping “top_3_genres” as CTE, restrict the 2nd code to avg_rating > 8 and directors of top_3_genre
Trying Row_Number() function:
20. Who are the top two actors whose movies have a median rating >= 8?
21. Which are the top three production houses based on the number of votes received by their movies?
22. Rank actors with movies released in India based on their average ratings. Which actor is at the top of the list?
– Note: The actor should have acted in at least five Indian movies.
Alternative 1 (using Rank window function):
Alternative 2 (using CTE):
23. Find out the top five actresses in Hindi movies released in India based on their average ratings.
– Note: The actresses should have acted in at least three Indian movies.
24. Select thriller movies as per avg rating and classify them in the following category:
Rating > 8: Superhit movies
Rating between 7 and 8: Hit movies
Rating between 5 and 7: One-time-watch movies
Rating < 5: Flop movies
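The categories above map naturally onto a SQL `CASE` expression. The sketch below uses Python's built-in sqlite3 module with made-up rows and leaves out the join that restricts the result to thriller movies. The listed ranges overlap at exactly 7 and 5, so the boundary handling shown here (7 counts as a hit, 5 as one-time-watch) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ratings (movie_id TEXT, avg_rating REAL);
    -- Made-up rows for illustration only
    INSERT INTO ratings VALUES ('m1', 8.4), ('m2', 7.5), ('m3', 6.0), ('m4', 3.2);
""")

# CASE checks its branches top to bottom and stops at the first match.
rows = conn.execute("""
    SELECT movie_id,
           CASE
               WHEN avg_rating > 8  THEN 'Superhit movies'
               WHEN avg_rating >= 7 THEN 'Hit movies'
               WHEN avg_rating >= 5 THEN 'One-time-watch movies'
               ELSE 'Flop movies'
           END AS category
    FROM ratings
    ORDER BY movie_id
""").fetchall()

for movie_id, category in rows:
    print(movie_id, category)
```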
EXECUTIVE SUMMARY AND RECOMMENDATIONS
1. Insights
Based on 7,997 movies released and recorded on IMDB between 2017 and 2019, a summary of audience interest and the resulting recommendations is given below:
- Average Duration: 103.89359 minutes
- Total number of Actors: 12611 (7445 actors & 5166 actresses)
1. Year and Month wise Movie Release Pattern:
- A year-wise record shows the number of movies decreasing from 3052 in 2017 to 2001 in 2019.
- The maximum number of movies were released in March, followed by September, October, and January. The fewest movies were released in the mid-year and end-of-year months, possibly because more people prefer vacation and family time at those times of year.
2. Geographical Region Distribution
- USA and India produced 1059 movies together in 2019 alone, way above half of total movies released (2001) in the year.
3. Genre Popularity
- Movies were tagged with genre tags as Drama, Fantasy, Thriller, Comedy, Horror, Family, Romance, Adventure, Action, Sci-Fi, Crime, and Mystery.
- Drama is the most popular genre, with 4285 tags across the three years, followed by Comedy and Thriller.
- There were 3289 movies with only one genre tag, while the remaining movies were tagged with multiple genres.
4. The average duration of movies is around 103.89359 minutes, and the genre-wise averages are close to the same figure.
5. Top Production Houses
- Marvel Studios leads the production houses with 551245 votes received by the movies it has produced, followed by Syncopy and New Line Cinema.
- Star Cinema and Twentieth Century Fox are the top 2 multilingual production houses based on the number of superhit movies.
6. Top Director
- James Mangold has directed the most Superhit movies, followed by Soubin Shahir, Joe Russo, and Anthony Russo.
- A.L. Vijay, Andrew Jones, and Chris Stokes are the top directors based on number of movies.
7. Top Actors and Actresses
- Mammootty, with 8 Superhit movies, is the most successful actor, followed by Mohanlal with 5 Superhits.
- Quite a few actors have 4 Superhit movies to their name, including Amrinder Gill, Amit Sadh, Johnny Yong Bosch, Tovino Thomas, Dulquer Salmaan, Siddique, Rajkummar Rao, Fahadh Faasil, Pankaj Tripathi, Dileesh Pothan, Joju George, and Ayushmann Khurrana.
- Vijay Sethupathi, Fahadh Faasil, and Yogi Babu are the top three Indian actors who have acted in at least five movies.
- Taapsee Pannu, Divya Dutta, and Kriti Kharbanda are the top three Hindi-speaking actresses who have acted in at least three movies.
- Parvathy Thiruvothu, Susan Brown, and Amanda Lawrence are the best-rated actresses in the Drama genre.
8. Top-10 movies based on average rating are: Kirket, Love in Kilnerry, Gini Helida Kathe, Runam, Fan, Android Kunjappan Version 5.25, Yeh Suhaagraat Impossible, Safe, The Brighton Miracle, and Shibu
- Based on median rating counts, most of the movies are rated between 5 and 8 and fall under the hit movie categories.
9. Top Grossing Movies
The highest-grossing movies of each year are:
i. Thank You for Your Service, a comedy movie released in 2017
ii. The Villain, a thriller movie released in 2018
iii. Joker, a drama movie released in 2019
2. Recommendation:
Based on the insights, the recommendations for RSVP are as follows:
- Concentrate on multi-genre drama-comedy movies with a pinch of thriller, keeping the duration around the average of 104 minutes.
- Plan the release between January and March. Focus on multilingual movies that can be launched in India and the USA as the preferred audience markets.
- Rope in either Star Cinema or Twentieth Century Fox as the production house, with James Mangold directing, assisted by A.L. Vijay.
- Mammootty and Mohanlal can be the lead actors, supported by other actors. Including Vijay Sethupathi would add star power for promoting the movie.
- Parvathy Thiruvothu, one of the best-rated drama actresses, could be brought in.
Movie Rating Analysis using Python
- September 22, 2021
- Machine Learning
We all watch movies for entertainment, some of us never rate it, while some viewers always rate every movie they watch. This type of viewer helps in rating movies for people who go through the movie reviews before watching any movie to make sure they are about to watch a good movie. So, if you are new to data science and want to learn how to analyze movie ratings using the Python programming language, this article is for you. In this article, I will walk you through the task of Movie Rating Analysis using Python.
Analyzing the rating given by viewers of a movie helps many people decide whether or not to watch that movie. So, for the Movie Rating Analysis task, you first need to have a dataset that contains data about the ratings given by each viewer. For this task, I have collected a dataset from Kaggle that contains two files:
- one file contains the data about the movie Id, title and the genre of the movie
- and the other file contains the user id, movie id, ratings given by the user and the timestamp of the ratings
You can download both these datasets from here .
Now let’s get started with the task of movie rating analysis by importing the necessary Python libraries and the datasets:
In the above code, I have only imported the movies dataset that does not have any column names, so let’s define the column names:
Now let’s import the ratings dataset:
The rating dataset also doesn’t have any column names, so let’s define the column names of this data also:
Now I am going to merge these two datasets into one. Both datasets have a common column, ID, which contains the movie ID, so we can use this column to merge the two datasets:
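The article's own code is not reproduced here. As a minimal sketch of the merge step, the example below builds two tiny made-up DataFrames and joins them on the shared ID column with `pandas.merge`. All column names other than ID are assumptions.

```python
import pandas as pd

# Made-up stand-ins for the two Kaggle files
movies = pd.DataFrame({
    "ID": [1, 2],
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Drama", "Comedy"],
})
ratings = pd.DataFrame({
    "User": [10, 11, 12],
    "ID": [1, 1, 2],
    "Ratings": [8, 10, 7],
    "Timestamp": [0, 0, 0],
})

# Inner join on the common column: one output row per rating,
# with the matching movie's title and genre attached.
data = pd.merge(movies, ratings, on="ID")
print(data.shape)  # (3, 6)
```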
As this is a beginner-level task, I will first have a look at the distribution of the ratings given by the viewers across all the movies:
So, according to the pie chart above, most movies are rated 8 by users. From the above figure, it can be said that most of the movies are rated positively.
As 10 is the highest rating a viewer can give, let’s take a look at the top 10 movies that got 10 ratings by viewers:
So, according to this dataset, Joker (2019) got the highest number of 10 ratings from viewers.
So this is how you can do movie rating analysis by using the Python programming language as a data science beginner. Analyzing the ratings given by viewers of a movie helps many people decide whether or not to watch that movie. I hope you liked this article on Movie rating analysis using Python. Feel free to ask your valuable questions in the comments section below.
Aman Kharwal
Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.
Data Analysis Example: Analyzing Movie Ratings with Python
64 min to complete · By CodingNomads Team
- Introduction
Introduction: Movie Ratings Data Analysis Example
- Inspect the data
- Set up your notebook
- Users: users.dat
- Ratings: ratings.dat
- Movies: movies.dat
- Explore your data
- Join the datasets
- Visualize patterns
- Explore a question
- Top rated sci-fi movies by decades
- Success
- Your very own data analysis example project
* Post By Cagdas Yetkin , Senior Data Scientist and CodingNomads Mentor
Do you like movies? We do too! When working with our data science & analysis students, we like to use datasets that everyone can relate to – because it makes learning more fun! In this data analysis example, you will analyze a dataset of movie ratings to draw various conclusions. You will learn how to:
- Get and Clean the data
- Get the overall figures and basic statistics with their interpretation
- Join datasets, aggregate and filter your data by conditions
- Discover hidden patterns and insights
- Create summary tables
This tutorial teaches you to perform all of the above tasks using Python and its popular pandas and matplotlib libraries. You can download and run the Jupyter Notebook used in this data analysis example here .
You can download the data from the original GitHub repo – Movie Tweetings Project .
The data in this example consists of movie ratings from Twitter since 2013, updated daily. The data was created from people who connected their IMDB profile with their Twitter accounts. Whenever they rated a movie on the IMDB website, an automated process generated a standard, well-structured tweet.
These well-structured tweets look like this:
"I rated The Matrix 9/10 http://www.imdb.com/title/tt0133093/ #IMDb"
Because of this nice structure, we can use this data to learn and practice data analysis using Python.
Tip: You are highly encouraged to write the code for this data analysis example yourself! This will help you truly understand the contents of this tutorial, give you the practice you need to improve your data analysis "muscle memory" skills, and you may discover some additional interesting revelations for yourself!
To get started, confirm that you have these 3 files in your working directory:
- users.dat
- ratings.dat
- movies.dat
If all these files are accessible to you, you can start off your investigation by checking what these files contain. Let’s start off by looking at the first three lines in users.dat directly in your terminal:
Your output will look similar to this:
At first it may be confusing that you can’t see any field names but these are documented in the README file as follows:
In users.dat the first field is the user_id and the second one is twitter_id .
You can see that there is a surprising number of colons in this data snippet. Because you already know that you are working with two data fields, this means that the creators of this dataset decided to use a double-colon :: as a field separator. Interesting choice! It is helpful to keep in mind that data fields can be divided by all sorts of different separators, and it’s good to know which one is used in the data you are working with.
With a basic idea of what you can expect to see in users.dat , let’s next take a peek into movies.dat :
The output of this file will look like this:
In this file, you have three fields:
- movie_id
- movie_title
- genres
A single movie can belong to more than one genre, and the genres are separated by pipe characters | , another interesting choice!
After looking at movies.dat , there’s only one file left to inspect. Let’s peek into ratings.dat next:
The output you will receive should look similar to the one below:
In this third dataset, your variables are:
- user_id
- movie_id
- rating
- rating_timestamp
And again it comes with an interesting feature: The timestamps are in unixtime format!
UNIX time is a time format often used in computing that counts the seconds passed since January 1st, 1970 (UTC). You can use online converters to translate it to a format that is easier to read for humans. If you’re interested, read more about Unix time on Wikipedia.
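To get a feel for what such a timestamp means, here is a small sketch using only Python's standard library. It is an illustration, not the conversion code used later in this tutorial, and the two timestamps are chosen by hand.

```python
from datetime import datetime, timezone

# 0 seconds after the epoch is midnight UTC on January 1st, 1970.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)

# One day has 86,400 seconds, so this is exactly one day later.
one_day_later = datetime.fromtimestamp(86_400, tz=timezone.utc)

print(epoch.isoformat())          # 1970-01-01T00:00:00+00:00
print(one_day_later.isoformat())  # 1970-01-02T00:00:00+00:00
```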
Now you have an overall understanding of how the raw datasets look. Next, you will import the libraries you will need for the rest of this analysis:
Let’s look a bit closer at the options you set up in the code snippet above. You:
- Give the filter-out-warnings command to have a cleaner notebook without warning messages.
- Set the max rows and max columns to some big numbers, in this case 50 . This option just makes all the columns and rows in a DataFrame more readable or visible.
- Use fivethirtyeight style to have plots like the ones on fivethirtyeight.com : a website founded by Nate Silver . If you want to explore fivethirtyeight further, I highly recommend the book: The Signal and the Noise .
These imports and adjustments create a good base setup for you to get started with your analysis. Keep in mind that while the imports are necessary, the adjustments are just to make your analysis easier and better-looking.
Read in the Data
After importing the necessary libraries, you are now ready to read the files into pandas data frames.
There are a couple of adjustments you should make while reading in the data, to make sure it will be in good shape to work with:
- Define that the separators are double colons ::
- Give the column names, so they will become the headers
- Convert the UNIX time to a datetime format
With this in mind, let’s read in users.dat , ratings.dat and movies.dat one by one:
Starting with users.dat , the following code snippet will read in the file into your notebook, register the double-colon as the separator between the fields, and add column names as well:
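The original snippet is not reproduced here. A minimal sketch of such a read, assuming pandas and using two made-up lines in place of the real users.dat file, could look like this:

```python
import io

import pandas as pd

# Two made-up lines in the user_id::twitter_id layout described above.
# With the real file you would pass its path instead of this buffer.
sample = io.StringIO("1::111111\n2::222222\n")

users = pd.read_csv(
    sample,
    sep="::",          # double-colon field separator
    engine="python",   # multi-character separators need the Python parser
    names=["user_id", "twitter_id"],  # the file itself has no header row
)

print(users.shape)  # (2, 2)
```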
This creates a DataFrame() object, and you can check the first few entries of this table-like object with the .head() method:
You will see a nicely formatted output that shows the first 5 rows of your users data frame:
You successfully read in the data from the external file and now have access to it as a DataFrame() object. Let’s do the same with the other files as well.
Similar to before, you will want to read in the data and save it into a data frame, define the separator, and pass in the column names. Additionally, you will also call the .sort_values() method on the data frame right away, to sort your data by when the ratings have been created:
You will also want to convert the rating_timestamp values to actual datetime format, and you can do that in pandas like so:
Let’s peek into the first 5 rows of your newly created ratings data frame:
Your output should look similar to the one below:
With the ratings data read in, there’s only one more file left to go.
Of course you also need to have access to information about the actual movies, to find potential correlations e.g. between ratings and movie genres. So, let’s read in that data next:
Checking the successful completion of this process with the familiar movies.head() command, you will see something similar to below:
With this, the data has been read into the notebook. What follows next is exploration.
To get a feeling for the data you are working with, it always helps to play around a little and create some quick stats and graphs for different aspects of it. This will help you have a better overview of what the data is about.
Since you want to find out how well movies are liked or disliked, the most important variable is the movie rating . Let’s see its distribution:
Your output should look similar to the one you can see below:
value_counts() is a quick but effective way of checking what values your variable takes. Here we see quickly that the rating score 8 was given 211699 times!
Let’s keep exploring. A histogram will show you the distribution and the describe() method will give additional basic statistics . Both of them are quite helpful to get quick insights, so let’s try them out next:
As mentioned, the .describe() method will display basic statistics about a column, so here they are for the rating column:
Next, let’s look at a visual representation of the data by creating a histogram:
The data with the above settings will produce a histogram that looks like this:
You’ll notice that it is skewed to the left! That means that the distribution doesn’t have a symmetrical shape around the mean, and this specific off-balanced distribution has a long tail on the left hand side.
The hist() and describe() methods complement each other: one gives a text summary and the other a visual representation of the same distribution.
Because both describe the same data, you may also be able to conclude that the rating is left-skewed by looking only at the text output of your .describe() method. The relevant data for this conclusion are:
- The mean is much smaller than the median and
- 25% of the data covers only until a rating of 6
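The mean-versus-median check can be sketched on a small, made-up series with a long left tail. The numbers here are illustrative and are not taken from the ratings data:

```python
import pandas as pd

# Made-up, left-skewed values: most are high, a few are very low.
s = pd.Series([1, 2, 5, 7, 8, 8, 8, 9, 9, 10])

stats = s.describe()
print(stats[["mean", "25%", "50%"]])

# A mean below the median (the "50%" row) hints at a left skew;
# a negative sample skewness points the same way.
is_left_skewed = stats["mean"] < stats["50%"]
skewness = s.skew()
```

The few low values drag the mean down while the median stays with the bulk of the data.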
This is a bit confusing. You saw first that the highest frequency was 8. And then, after generating the histogram, it looked like the ratings were highest around 9 to 10.
This difference can arise because of binning. Different numbers of bins will lead to different results. Most of the time, the person conducting the analysis decides the right number of bins after a few trials. Generally, you will have a better idea of the right bin size for your data set after some research and digging into it.
Playing with the bins of a histogram can have an impact on the story you are telling. The same histogram would look like this if you increase the number of bins from 10 to 30:
You can see that this can lead to a different conclusion. If you were using the first histogram you would falsely argue that the most frequent rating was 9 or maybe 10. However, the second one makes everything crystal clear and shows that the most frequent rating lies at 8 instead. Also, note that if you use the .value_counts() method, you wouldn’t fall into that trap.
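One way such a binning artifact can arise is sketched below with NumPy. The ratings are made up, and the 0 to 10 scale is an assumption for the example: with 10 equal bins over that range, the last bin covers both 9 and 10, so their counts are added together.

```python
import numpy as np

# Made-up integer ratings on an assumed 0-10 scale; 8 is the most frequent value.
ratings = np.repeat([0, 5, 6, 7, 8, 9, 10], [1, 2, 4, 8, 10, 6, 5])

counts10, edges10 = np.histogram(ratings, bins=10, range=(0, 10))
counts30, edges30 = np.histogram(ratings, bins=30, range=(0, 10))

# 10 bins: the last bin is [9, 10], so the 9s and 10s are counted together
# and that bin becomes the tallest one.
print(counts10)

# 30 bins: each bin is narrower than 1, so every integer rating has its own
# bin and the tallest bin is the one holding the 8s.
print(counts30.max())
```

Comparing the tallest bin in both results shows how the coarser binning shifts the apparent peak to the right.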
Thanks to these methods, you now have a clearer understanding of the rating variable in your data. You will focus on the user_id column next.
How many unique user_id do you have in the users data?
‘You have 68388 unique user ids in the data’
You have seen earlier that both value_counts() and describe() are quite handy. So why not combine them to learn a little more?
For instance, how many rating tweets are posted by a user on average? What is the minimum, maximum and median number of tweets posted by the users? The answers to these questions will enable you to understand how active the users are: Are they frequent users or are they disappearing after shooting one single tweet?
Let’s try it out:
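A minimal sketch of combining the two methods, on a made-up user_id column (the real data has tens of thousands of users):

```python
import pandas as pd

# Made-up ratings: user 1 is very active, the others are not.
ratings = pd.DataFrame({"user_id": [1, 1, 1, 1, 1, 1, 2, 2, 3, 4]})

n_users = ratings["user_id"].nunique()         # number of distinct users
per_user = ratings["user_id"].value_counts()   # number of ratings per user
summary = per_user.describe()                  # mean, median, min, max, ...

print(f"You have {n_users} unique user ids in the data")
print(summary)
```

Here value_counts() produces one number per user, and describe() then summarizes those per-user counts.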
Running the code snippet above, you will receive another block of text-based statistics as your output:
Notice that this time you accessed the column using dot notation. In this case it does the same as accessing it through the square-bracket notation you used before, but is a little bit more convenient. Check out this StackOverflow post if you want to learn more about the limitations and differences between the two notations.
See in the above output how the mean is much greater than the median (12.83 vs 2). It means that the data is skewed to the right.
This skewness is at the extreme: Look how the max value is far, far away! Could there be someone posting more than 2000 times? Not likely.
The output also tells us that 50% of the people posted at most twice, but the mean is almost 13. This is because of those users with extremely high usage numbers.
Could it be possible that they are not human beings but bots instead? That could be a great investigation topic, if you want to dive deeper.
But for this data analysis example, let’s leave this aside for now and continue by joining the datasets we have.
Joining data could be really difficult, as this tweet addresses:
Luckily, with pandas you have a user-friendly interface to join your movies data frame with the ratings data frame. This is going to be an inner join. It means that you are bringing in the movies only if there is a rating available for them:
Inspecting the first two rows with the .head(2) method shows you this:
Notice that you didn’t need the on parameter when you joined the data, because you set the index of both data frames to movie_id. So, the .join() method knew on which variable to join. Be aware that .join() performs a left join by default, so pass how="inner" explicitly when you want an inner join.
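To see how the how parameter changes the result of an index-based join, here is a sketch with two tiny, made-up data frames. The column names follow the article, but the rows are invented:

```python
import pandas as pd

# Made-up data: movie 3 has no rating, and rating id 4 has no movie record.
movies = pd.DataFrame({"movie_id": [1, 2, 3],
                       "movie_title": ["A (1999)", "B (2005)", "C (2010)"]})
ratings = pd.DataFrame({"movie_id": [1, 1, 2, 4], "rating": [8, 9, 7, 6]})

left = ratings.set_index("movie_id")
right = movies.set_index("movie_id")

default_join = left.join(right)              # how="left" is the pandas default
inner_join = left.join(right, how="inner")   # keep only ids present in both

print(default_join)
print(inner_join)
```

With the default, the rating for id 4 is kept with a missing title. With how="inner", only ratings that have a matching movie remain.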
Looking at the output of the .join() operation, you have a new problem: You want to quantify the genres, but how would you count them?
One way of doing that could be creating dummies for each possible genre, such as Sci-Fi or Drama, and having a single column for each. Creating dummies means creating 0s and 1s just like you can see in the example below:
The data frame that gets produced by this command looks like this:
You can concatenate these dummies to the original movies_rating data frame:
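The two steps, creating the dummies and concatenating them, can be sketched as follows. The rows are made up, and the "|" separator between genres is an assumption about how the genres column is encoded:

```python
import pandas as pd

# Made-up rows; genres are assumed to be separated by "|".
movies_rating = pd.DataFrame({
    "movie_title": ["A (1999)", "B (2005)", "C (2010)"],
    "genres": ["Drama|Sci-Fi", "Drama", "Adventure|Sci-Fi"],
    "rating": [8, 7, 9],
})

# One 0/1 column per genre found in the strings.
dummies = movies_rating["genres"].str.get_dummies(sep="|")

# Place the dummy columns next to the original columns (same row index).
tidy = pd.concat([movies_rating, dummies], axis=1)
print(tidy)
```

Because pd.concat() aligns on the row index here, both frames need to share the same index for the columns to line up.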
Your newly created data frame will look like this:
This is almost as tidy as you want it, but it would be much cleaner and more useful if you could get those production years in a separate column. That would allow you to compare film productions over the years.
To accomplish this, you will practice working with the .str attribute, which is quite popular – and a lifesaver in many cases! You will:
- Make a new column by getting the 4 digits representing the year
- Remove the last 7 characters from the movie names
- Check out the result
Let’s write the code for achieving these tasks:
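A minimal sketch of these string operations, assuming every title ends with a space and a four-digit year in parentheses. The column name production_year is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"movie_title": ["Braveheart (1995)", "Metropolis (1927)"]})

# Assumption: each title ends with " (YYYY)", which is exactly 7 characters.
# Characters -5 to -2 are the four digits of the year.
df["production_year"] = df["movie_title"].str[-5:-1].astype(int)

# Drop the trailing " (YYYY)" from the title itself.
df["movie_title"] = df["movie_title"].str[:-7]

print(df)
```

Titles that do not follow this pattern would need a more defensive approach, for example a regular expression with .str.extract().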
Before checking out the results, let’s go ahead and reset the index on this data frame first:
Now you can see that you produce a better-formatted version of the data frame:
Congratulations! With this, you have completed the most difficult part of this data analysis example: Getting and cleaning the data. Let’s quickly recap what you did so far:
- You read the raw data into data frames
- You learned and reported basic statistics
- You joined data frames and created new fields
You did some great work if you followed all the way until here! You can now: watch the first movie in your records from 1894 as a reward 🙂
Next, you are going to visualize your data and discover some patterns. When delivering a report in a professional or academic setting, this is where things start to get very interesting!
First, you will start with visualizing the total volume of films created over the years.
Next, you will count the total number of productions for each year and plot it. The record you see for the year of 2021 should be filtered out before proceeding:
Similar to the .head() method you have encountered before, .tail() shows you a subset of the rows of your data frame. However, instead of showing the first ones, it shows you the last ones:
Aside from 2021, which you filtered out, the other interesting year here is 2020. Although more than half of the year 2020 has passed at the time of writing this article, there are only 5712 rated films and movies for the year so far. Looks like 2020 is one of the most extraordinary years in history? Or maybe the movies are so new, that people didn’t have the time to watch them yet. Or both!
You can chart a 5-year moving average of the total productions:
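The counting, filtering, and smoothing steps can be sketched on a made-up table. The sketch only computes the numbers and leaves the plotting out:

```python
import pandas as pd

# Made-up table: one row per rating, with the movie's production year.
tidy = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 4, 5, 6, 7, 8],
    "production_year": [2014, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
})

productions = (
    tidy.drop_duplicates("movie_id")            # count each movie only once
        .groupby("production_year")["movie_id"]
        .count()
)
productions = productions[productions.index < 2021]   # drop the 2021 record

# Mean over the current year and the four before it; NaN until 5 years exist.
moving_avg = productions.rolling(window=5).mean()
print(moving_avg)
```

Calling .plot() on moving_avg (which requires matplotlib) would draw the corresponding line chart.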
This will produce a graphic similar to the one below:
You can see that the 5-year moving average is in a shocking decline! What is happening here? What can be the reason? Can you formulate some hypotheses? Here are some points for you to consider:
- This was an inner join. So these are the rated movies. Perhaps site and app usage went down.
- The film industry is in a serious crisis! They are not producing films because of COVID-19.
- People didn’t have time to watch the most recent movies. If they didn’t watch them, they don’t rate them, and you can see a decline in ratings. For example, I didn’t watch the Avengers series before doing this analysis. On the other hand, the movie Braveheart (1995) most probably had enough time to get high numbers.
Each of these hypotheses could warrant an investigation, and there might be other ideas that you can come up with yourself. Feel free to explore any of these hypotheses further on your own. Remember that practicing your skills by following your interests is one of the best ways to learn new skills and keep them sharp.
For this data analysis example, let’s continue by investigating a slightly different question:
What have people watched (or rated) most since 2000?
For this question, let’s focus on the genres with a high volume of movies. You are going to identify the top 6 genres with the highest number of movies in them, and select only those to produce the next chart:
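Ranking genres by volume boils down to summing the 0/1 dummy columns and sorting. Here is a sketch on made-up dummy columns that takes the top 3 instead of the top 6, because the toy data only has five genres:

```python
import pandas as pd

# Made-up 0/1 genre columns, one row per movie.
genres = pd.DataFrame({
    "Drama":     [1, 1, 1, 1, 0],
    "Thriller":  [1, 1, 1, 0, 0],
    "Sci-Fi":    [1, 1, 0, 0, 0],
    "Adventure": [1, 0, 0, 0, 0],
    "Comedy":    [0, 0, 0, 0, 0],
})

top_genres = (
    genres.sum()                         # movies per genre
          .sort_values(ascending=False)  # largest first
          .head(3)                       # top 3 here; the article uses 6
          .index.values                  # keep only the genre names
)
print(top_genres)
```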
Unless the movie industry changed significantly in the time between writing this article and when you are reading it, your output will probably look like this:
Now, you want to get the ratings for these genres from your tidy_movie_ratings data frame, but restrict the ratings to only the movies made between 2000 and 2019:
Finally, you can create a graph showing a 2-year moving average of the total volume of rated films:
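A sketch of the year filter followed by a 2-year moving average, again on made-up data with only two genre columns. The plot itself is left out:

```python
import pandas as pd

# Made-up rows: one per rating, with 0/1 genre flags.
tidy = pd.DataFrame({
    "production_year": [1999, 2000, 2000, 2001, 2002, 2002, 2020],
    "Drama":           [1,    1,    1,    0,    1,    1,    1],
    "Sci-Fi":          [0,    0,    1,    1,    0,    1,    1],
})
top_genres = ["Drama", "Sci-Fi"]

# between() includes both end points, so this keeps 2000 through 2019.
in_range = tidy[tidy["production_year"].between(2000, 2019)]

# Rated films per year and genre, then a 2-year moving average.
per_year = in_range.groupby("production_year")[top_genres].sum()
moving_avg = per_year.rolling(window=2).mean()
print(moving_avg)
```

Note that rolling() works on consecutive rows, so it only matches calendar years if no year is missing from the index.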
And here is your graph output for this data:
This gives a nice visual representation and helps you to interpret the data to answer the question you posed before. Here are the take-aways that I took from it:
- Drama and Thriller are the winning genres
- Seems that Sci-Fi & Adventure are not as popular
On the other hand, some patterns can be misleading since we are only looking at the absolute numbers. Therefore, another way to analyze this phenomenon would be to look at the percentage changes. This could help your decision making if you are, let’s say, in the business of online movie streaming.
So let’s give that a try and plot the percentage changes:
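To show why percentage changes can tell a different story than absolute numbers, here is a sketch using pct_change() on made-up yearly counts:

```python
import pandas as pd

# Made-up yearly counts: Drama is large but grows slowly, Sci-Fi is small but grows fast.
per_year = pd.DataFrame(
    {"Drama": [100, 110, 121], "Sci-Fi": [10, 15, 30]},
    index=pd.Index([2000, 2001, 2002], name="production_year"),
)

# Change relative to the previous row, expressed in percent.
# The first row is NaN because there is nothing to compare it with.
pct = per_year.pct_change() * 100
print(pct)
```

In absolute terms Drama dominates, but in relative terms Sci-Fi grows by 50% and then 100% while Drama grows by about 10% per year.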
From this filtered data, let’s produce a 5-year moving average graph:
And the output is shown below:
You notice the decline you already spotted earlier. However, it’s interesting to see the Sci-Fi & Adventure genres moving to the top.
Indeed, Sci-Fi & Adventure movies were a real hype, and you might want to play your cards into them, especially if your business is somewhat related to global film industry trends. These two genres have the sharpest slope for the increase in receiving ratings. This may signal that there is an increasing demand and could be a valuable insight for your business.
Let’s stay with one of these hyped genres for a bit longer and explore yet another question you can answer through this data set.
Let’s say you’re still building out your imaginary streaming service, you understood that the interest in Sci-Fi movies is rising sharply, and you want to make it easy for your users to find the best Sci-Fi movies of all time. Which movies from each decade could you suggest to your users by default?
To answer this question, let’s start by writing the necessary steps:
- Build a scifi base table containing only the columns you need
- Filter for the records before 2020
- Create a new column called decade
- Check it out
And here’s the code to accomplish these tasks:
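A minimal sketch of these steps on made-up rows. The column names follow the article where it mentions them, and "Future Film" and "Some Drama" are invented placeholder titles:

```python
import pandas as pd

# Made-up rows; "Sci-Fi" is a 0/1 genre flag.
tidy = pd.DataFrame({
    "movie_title": ["Metropolis", "Alien", "The Matrix", "Some Drama", "Future Film"],
    "production_year": [1927, 1979, 1999, 2005, 2021],
    "rating": [9, 9, 10, 7, 8],
    "Sci-Fi": [1, 1, 1, 0, 1],
})

cols = ["movie_title", "production_year", "rating"]
is_scifi = tidy["Sci-Fi"] == 1
before_2020 = tidy["production_year"] < 2020

# Keep only Sci-Fi records before 2020 and only the columns you need.
scifi = tidy.loc[is_scifi & before_2020, cols].copy()

# Integer division drops the last digit of the year: 1979 -> 1970.
scifi["decade"] = scifi["production_year"] // 10 * 10
print(scifi.head())
```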
The first 5 rows of your new scifi data frame will look like this:
Next, you will filter for movies that have more than 10 ratings. But how can you find how many times a movie was rated? Here .groupby() comes to the rescue. After getting the counts, you will generate a new array called movie_list that keeps only the movies with more than 10 ratings. The final operation just retrieves the index labels of the filtered count_group, which you can do with the .index.values attribute:
The output looks like below:
movie_list now contains those movies that have been rated more than 10 times. Next, you will filter on your scifi base table using the movie_list. Notice the usage of the .isin() method. It is quite user-friendly and straightforward:
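The counting and filtering steps can be sketched together on made-up data. The threshold is lowered from 10 to 1 so that the toy example still has something left after filtering:

```python
import pandas as pd

# Made-up ratings: "Obscure Film" is a placeholder title with a single rating.
scifi = pd.DataFrame({
    "movie_title": ["Alien"] * 3 + ["The Matrix"] * 2 + ["Obscure Film"],
    "rating": [9, 8, 9, 10, 9, 4],
})
min_ratings = 1   # the article uses 10; lowered here for the toy data

# How many times was each movie rated?
count_group = scifi.groupby("movie_title")["rating"].count()

# Titles (index labels) of the movies with more than min_ratings ratings.
movie_list = count_group[count_group > min_ratings].index.values

# Keep only the rows whose title is in that list.
filtered_scifi = scifi[scifi["movie_title"].isin(movie_list)]
print(filtered_scifi)
```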
After you created the filtered_scifi table, you can focus on building up your metrics in order to select the best-liked movies of each decade. You will look at the average rating, and you will need to .groupby() decade and movie_title.
It is important to sort the aggregated values in descending order to get the results you are expecting. You want each group to have a maximum of 5 films, so a lambda expression can help you to loop through the decade groups and show only the top 5. If there are fewer than 5 films in a decade, you want to show only the top movie, meaning only 1 record. Finally, you will round the ratings to two decimal places.
You are encouraged to chop the code shown below into single lines and see the individual result for each of them:
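Here is one way to sketch the selection rule described above. It uses an explicit loop over the decade groups instead of a lambda expression, which makes the intermediate results easier to inspect. The data and the "Film N" titles are made up:

```python
import pandas as pd

# Made-up ratings: one 1970s film, six 1990s films with two ratings each.
filtered_scifi = pd.DataFrame({
    "decade": [1970] * 2 + [1990] * 12,
    "movie_title": ["Alien", "Alien"]
                   + [f"Film {i}" for i in range(6) for _ in range(2)],
    "rating": [9, 8] + [10, 9, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5],
})

# Average rating per movie, best-rated first within each decade.
avg = (
    filtered_scifi.groupby(["decade", "movie_title"], as_index=False)["rating"]
    .mean()
    .sort_values(["decade", "rating"], ascending=[True, False])
)

parts = []
for decade, group in avg.groupby("decade"):
    n = 5 if len(group) >= 5 else 1   # top 5, or only the top movie
    parts.append(group.head(n))

top_by_decade = pd.concat(parts).round(2).reset_index(drop=True)
print(top_by_decade)
```

The 1970s group has fewer than 5 films, so it contributes a single record, while the 1990s group is cut down to its five best-rated films.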
The output of this operation will be your top-rated Sci-Fi movies by decade:
If you want to see the values starting from 1990, you can do so by slicing the data frame:
Here are the results going back to 1990:
Congratulations! You have officially completed your first movie recommendation engine! Ok, I know it’s not quite Netflix – which uses machine learning to recommend what you should watch. However in the tables you just generated, you’ve established some rule-of-thumb recommendations based on data and logic – a solid and fun first step!
What’s more, you’ve completed your own full data analysis example project:
- You read your data as pandas data frames
- You created basic statistics and interpreted the results
- You joined data frames, applied conditions to filter them, and aggregated them
- You used data visualization to find patterns and develop hypotheses
- And you didn’t jump to conclusions and root causes. You kept your reasoning simple and skeptical
- You created summary tables
All of the above are important and common aspects of working with data.
Source Code on GitHub
If you enjoyed this data analysis example and you want to learn more and practice your skills further:
- Add More Data : You can search for some additional IMDB data freely available on the internet. Chances are they contain information about directors of the movies. You could join this data with your tidy_movie_ratings dataset and see which directors are getting top ratings for which movies over the years, and by decades. This way, you can practice everything you have learned here over again
- Build Your Service : You can write a function which takes the top_rate_by_decade data frame as input and returns a random movie from the list, further simulating a movie recommendation system
- Your Idea Here : There are limitless possibilities to practice and play with this data. Share your explorations with us if you do!
- If you want to learn more : Check out CodingNomads’ Data Science & Machine Learning Course to dive even deeper into data analysis and run full end-to-end machine learning projects on your own!
I hope you enjoyed this article and continue having fun with analyzing your datasets.
About the Author: Cagdas Yetkin is a Senior Data Scientist at Jabil where he works on Integrated Circuit Test projects to detect anomalies using test probe measurements in Printed Circuit Board Assembly production on multiple sites, and on Supply Chain Digitalization for dynamic slotting and bin optimization in factory warehouses. He develops soccer analytics and betting applications as a hobby, and enjoys traveling. Connect with him on LinkedIn and Twitter.
Editor : Martin Breuss, @martinbreuss , martinbreuss.com