Dr. Ateendra Jha
IMDb Movie Analysis - Web Scraping and Analysis
Industry
Entertainment
Type
Web Scraping
Language
Python
Introduction
IMDb is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
Impact
Understanding audience interest is crucial for drawing viewers to newly released series or movies. Audience interest determines which genres are worth hosting, which in turn helps movie and series makers decide which genres to invest in.
It is also important to track ratings by age group to gain insight into each group's interests.
Problem Statement
To identify the most popular genres.
To identify which age group has rated which genre the most.
Approach
Solution
Related Data
# %% [markdown]
# # IMDb Top Rated Movies
# %% [markdown]
# ### Web Scraping
# %%
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
# %%
weburl = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
URLread = uReq(weburl)
htmlpage = URLread.read()
URLread.close()
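Some sites reject requests that carry urllib's default User-Agent, so the download step above may fail with an HTTP error. The following is a minimal sketch of the same fetch with an explicit header and a context manager that closes the connection automatically. The helper names `build_request` and `fetch_html`, the `"Mozilla/5.0"` value, and the 30-second timeout are assumptions for illustration, not part of the original notebook.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying an explicit User-Agent header (assumed value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30):
    """Download the page and return its raw bytes; the connection closes on exit."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Usage (needs network access, so it is left commented out here):
# htmlpage = fetch_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")
```

Whether a particular site accepts this header can change over time, so the result is worth checking before parsing.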
# %%
# Name the parser explicitly so BeautifulSoup does not guess one (and warn).
websoup = soup(htmlpage, "html.parser")
# %%
filename = "IMBD-Movie2.csv"
# Write as UTF-8 so titles with non-ASCII characters are stored safely.
f = open(filename, 'w', encoding='utf-8')
# %%
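# The chart is rendered as a table; each <tr> inside <tbody> is one movie.
containers = websoup.find("tbody")
rows = containers.find_all('tr')
headers = "Name###Year###Rating\n"
f.write(headers)
for row in rows:
    col = row.find_all('td')
    col = [x.text.strip() for x in col]
    # The rating is the third cell from the end.
    rating = (col[-3]).strip()
    # The title cell has the form "<rank>. <name> (<year>)".
    nameyearstring = col[1]
    # Drop the leading rank digits, then slice off the two characters
    # after the rank and the trailing "(year)".
    name = ((re.sub(r"^\d+", "", nameyearstring))[2:-6]).strip()
    year = (nameyearstring[-5:-1]).strip()
    print(name, ",", year, ",", rating)
    f.write(name + "###" + year + "###" + rating + "\n")
f.close()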
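The fixed slice positions in the loop above depend on the exact layout of the title cell. As a hedged alternative, the sketch below extracts the name and year with a single regular expression, assuming the cell text has the form "<rank>. <name> (<year>)". The helper `parse_title_cell` and the sample string are hypothetical and only illustrate the idea.

```python
import re

# Assumed layout: rank digits, a dot, the title, then a four-digit year in parentheses.
TITLE_CELL = re.compile(
    r"^\s*\d+\.\s*(?P<name>.+?)\s*\((?P<year>\d{4})\)\s*$",
    re.DOTALL,
)


def parse_title_cell(text):
    """Return (name, year) for a title cell, or None if the layout differs."""
    match = TITLE_CELL.match(text)
    if match is None:
        return None
    return match.group("name").strip(), match.group("year")


# Hypothetical sample in the shape produced by x.text.strip() for the title cell.
sample = "1.\n      The Shawshank Redemption\n(1994)"
print(parse_title_cell(sample))  # ('The Shawshank Redemption', '1994')
```

Returning `None` for an unexpected layout makes a change in the page structure visible instead of silently writing a mangled name.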
# %%
import pandas as pd
# %%
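# A multi-character separator such as "###" requires the Python parsing engine.
df = pd.read_csv(r"C:\Users\ateen\OneDrive\Project Files\IMBD-Movie2.csv", sep="###", engine="python")
# Use the full option name; the short form 'max_rows' can be ambiguous in newer pandas.
pd.set_option('display.max_rows', None)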
df
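The "###" separator is presumably there because movie titles can contain commas. A minimal sketch of another option, using Python's standard `csv` module, is shown below: it quotes such fields automatically, so an ordinary comma-separated file works. The helper names and the sample rows are illustrative assumptions, and the sketch writes to an in-memory buffer rather than to the project's file.

```python
import csv
import io

# Illustrative rows in the same Name/Year/Rating layout (values are examples only).
movies = [
    ("The Good, the Bad and the Ugly", "1966", "8.8"),
    ("12 Angry Men", "1957", "9.0"),
]


def write_movies(handle, rows):
    """Write a header and the rows; fields containing commas are quoted."""
    writer = csv.writer(handle)
    writer.writerow(["Name", "Year", "Rating"])
    writer.writerows(rows)


def read_movies(handle):
    """Read the rows back as dictionaries keyed by the header names."""
    return list(csv.DictReader(handle))


buffer = io.StringIO()
write_movies(buffer, movies)
buffer.seek(0)
records = read_movies(buffer)
print(records[0]["Name"])  # The Good, the Bad and the Ugly
```

A file written this way can be loaded with `pd.read_csv` using its default comma separator.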
# %%