
IMDb Movie Analysis - Web Scraping and Analysis

Industry

Entertainment

Type

Web Scraping

Language

Python

Introduction

IMDb is an online database of information related to films, television programs, home videos, video games, and streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.

Impact

Understanding audience interest is crucial for drawing viewers to newly released series and movies. Audience interest determines which genres are worth hosting, which in turn helps movie and series makers decide which genres to invest in.

It is equally important to track ratings by age group to gain insight into each group's interests.

Problem Statement

To identify the most popular genres.

To identify which age group has rated which genre the most.
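
The two questions above can be sketched as simple aggregations. This is a minimal illustration only: the records, the genre and age-group fields, and the choice of "number of ratings" as the measure of popularity are all assumptions made for the example, since the scraped file in this project holds only name, year, and rating.

```python
from collections import Counter, defaultdict

# Hypothetical records: (title, genre, age_group, number_of_ratings).
# None of these values come from IMDb; they only illustrate the aggregation.
records = [
    ("Movie A", "Drama", "18-29", 120),
    ("Movie A", "Drama", "30-44", 200),
    ("Movie B", "Action", "18-29", 300),
    ("Movie B", "Action", "30-44", 90),
    ("Movie C", "Drama", "18-29", 60),
]

votes_per_genre = Counter()                   # genre -> total number of ratings
votes_by_genre_and_age = defaultdict(Counter) # genre -> age group -> number of ratings

for _title, genre, age_group, votes in records:
    votes_per_genre[genre] += votes
    votes_by_genre_and_age[genre][age_group] += votes

# Question 1: the genre with the most ratings overall (assumed popularity measure).
most_popular_genre, _ = votes_per_genre.most_common(1)[0]

# Question 2: for each genre, the age group that rated it most often.
top_age_group_per_genre = {
    genre: counts.most_common(1)[0][0]
    for genre, counts in votes_by_genre_and_age.items()
}

print(most_popular_genre)
print(top_age_group_per_genre)
```

With real data, the same grouping could be done on a pandas DataFrame once genre and age-group columns are available.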

Approach

Solution

Related Data

# %% [markdown]

# # IMDb Top Rated Movies

# %% [markdown]

# ### Web Scraping

# %%

from urllib.request import urlopen as uReq

from bs4 import BeautifulSoup as soup

import re

# %%

weburl = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

URLread = uReq(weburl)

htmlpage = URLread.read()

URLread.close()

# %%

# Name the parser explicitly so BeautifulSoup does not have to guess one
websoup = soup(htmlpage, "html.parser")

# %%

filename = "IMBD-Movie2.csv"

# Write as UTF-8 so titles with non-ASCII characters survive the round trip
f = open(filename, 'w', encoding='utf-8')

# %%

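containers = websoup.find("tbody")
rows = containers.find_all('tr')

headers = "Name###Year###Rating\n"
f.write(headers)

for row in rows:
    # Text of every cell in the row, with surrounding whitespace removed
    col = [x.text.strip() for x in row.find_all('td')]
    rating = col[-3].strip()
    # The title cell holds the rank, the name, and the year in parentheses
    nameyearstring = col[1]
    # Drop the leading rank digits, the two characters after them, and the trailing "(year)"
    name = re.sub(r"^\d+", "", nameyearstring)[2:-6].strip()
    # The year is the four characters just before the closing parenthesis
    year = nameyearstring[-5:-1].strip()
    print(name, ",", year, ",", rating)
    f.write(name + "###" + year + "###" + rating + "\n")

# Close the file only after every row has been written
f.close()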

# %%

import pandas as pd

# %%

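# A multi-character separator is treated as a regex and needs the Python engine
df = pd.read_csv(r"C:\Users\ateen\OneDrive\Project Files\IMBD-Movie2.csv", sep="###", engine="python")

# Show every row when the DataFrame is displayed
pd.set_option('display.max_rows', None)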

df

# %%
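
The slicing used in the scraping loop depends on the exact character positions of the rank and year. As a hedged alternative, the sketch below parses the title cell with a single regular expression and writes the rows with Python's `csv` module, which quotes any title that contains a comma. The cell layout ("rank. title (year)"), the sample cell texts, and the helper names `parse_title_cell` and `rows_to_csv` are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io
import re

# Assumed shape of a title cell's text: "<rank>. <title> (<year>)",
# possibly with newlines and extra spaces between the parts.
TITLE_CELL = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*\((\d{4})\)\s*$", re.DOTALL)


def parse_title_cell(text):
    """Split a title cell into (rank, name, year); raise ValueError on any other shape."""
    match = TITLE_CELL.match(text)
    if match is None:
        raise ValueError(f"unexpected title cell: {text!r}")
    rank, name, year = match.groups()
    return int(rank), name, int(year)


def rows_to_csv(rows):
    """Serialize (name, year, rating) rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Year", "Rating"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cell texts, written to imitate the assumed layout.
sample_cells = [
    ("1.\n      Example Movie\n(1994)", "9.2"),
    ("2.\n      Another, Longer Title\n(1972)", "9.1"),
]

rows = []
for cell_text, rating in sample_cells:
    _rank, name, year = parse_title_cell(cell_text)
    rows.append((name, year, rating))

csv_text = rows_to_csv(rows)

# Reading the text back shows that the comma inside the second title is preserved.
parsed_back = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

A file written this way can be loaded with a plain `pd.read_csv(filename)`, without a custom separator.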

