How to do it:
Open the R Markdown file of this assignment (link) in RStudio.
Right under each question, insert a code chunk (hotkey: Ctrl + Alt + I) and write the code that answers the question.
Note that if eval=FALSE appears in the first line of a code chunk, the chunk will not be executed.
Knit the R Markdown file (hotkey: Ctrl + Shift + K) to export an HTML file.
Publish the HTML file to your GitHub Pages site.
Submission: Submit the GitHub link to your assignment on Canvas.
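To see what eval=FALSE does without opening RStudio, the sketch below knits a tiny two-chunk document held in a character vector. It assumes the knitr package is installed. The variable name skipped_value and the chunk contents are made up for illustration and are not part of the assignment file.

```r
library(knitr)

# Three backticks, built here so the example does not nest code fences
fence <- strrep("`", 3)

# A tiny two-chunk document: the first chunk is marked eval=FALSE,
# the second checks whether the first one actually ran.
rmd_lines <- c(
  paste0(fence, "{r, eval=FALSE}"),
  "skipped_value <- 1",
  fence,
  "",
  paste0(fence, "{r}"),
  "exists(\"skipped_value\")",
  fence
)

# With `text =`, knit() returns the knitted Markdown as a character string
knitted <- knit(text = rmd_lines, quiet = TRUE, envir = new.env())
cat(knitted)
```

In the knitted output, the code of the first chunk is still shown, but the second chunk should report that skipped_value does not exist, because the first chunk was never run.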
This assignment works with the IMDB Top 1000 data. Find out more information about this data at this link. Import the data and answer the following questions.
List the names of all columns in the data.
Which movies have the highest gross earnings (Gross)?
What is the lowest rating (IMDB_Rating)? List five movies that have this lowest rating.
Which year has the most movies released in the list? What is the total money earned in that year?
What is the average money earned per movie?
Calculate the average number of votes by year. Calculate the average number of votes of movies that have an IMDB rating greater than 9.
Calculate the average Meta score in 2020 of movies that have a number of votes in the third quartile.
(Optional - Challenging). The current Runtime variable is not numeric. Use the str_remove function to remove min from the variable, then use as.numeric to convert the variable to numeric. Calculate the average running time in the 2010s. Calculate the correlation between running time and rating (add use="complete.obs" in the cor function to ignore the missing values).
We can use select_if to select columns satisfying a condition and summarise_if to do calculations on columns satisfying a condition. Try the following to understand these functions.
library(dplyr)  # provides %>%, select_if and summarise_if

# Select only character columns (df is the imported data)
df %>% select_if(is.character)
# Calculate the mean of all numeric columns
df %>% summarise_if(is.numeric, mean, na.rm=TRUE)
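For readers who want to try these functions before importing the real data, here is a minimal sketch on a small made-up data frame. It also walks through the text-to-number conversion described in the optional question. The rows are invented, and the column name Year is a hypothetical stand-in for whatever the year column is called in the actual data. Only Runtime and IMDB_Rating are names taken from the questions above.

```r
library(dplyr)
library(stringr)

# Made-up rows for illustration only; these are not values from the real data
toy <- tibble(
  Title       = c("A", "B", "C", "D"),
  Year        = c(2009, 2012, 2015, 2018),   # hypothetical column name
  Runtime     = c("100 min", "120 min", "90 min", NA),
  IMDB_Rating = c(8.0, 8.5, 7.9, 8.2)
)

# Remove "min", then convert the remaining text to a number
# (as.numeric tolerates the leftover trailing space)
toy <- toy %>%
  mutate(Runtime = as.numeric(str_remove(Runtime, "min")))

# Average running time for the 2010s (2010 to 2019), ignoring missing values
avg_2010s <- toy %>%
  filter(Year >= 2010, Year <= 2019) %>%
  summarise(avg_runtime = mean(Runtime, na.rm = TRUE)) %>%
  pull(avg_runtime)

# Correlation between running time and rating, using complete rows only
runtime_rating_cor <- cor(toy$Runtime, toy$IMDB_Rating, use = "complete.obs")

# Column-wise helpers on the same toy data
toy %>% select_if(is.character)                       # only Title is character now
toy %>% summarise_if(is.numeric, mean, na.rm = TRUE)  # one mean per numeric column
```

Without use = "complete.obs", the missing Runtime in the last row would make cor return NA, which is why the question asks for that argument.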