Introduction:

This research uses the “National Enrollment Data Set”. The full dataset spans 13 years (2005 to 2017), covers 41 Australian universities, and contains 12896962 records in a 6.4 GB file. To make it manageable, we created a stratified sample containing 1% of the records from each institution.

The objective of this research is to gain useful predictions and insights, including:
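
The sampling code is not part of this report, so the following is only a minimal sketch of how a 1% stratified sample per institution could be drawn in base R. The data frame `full`, its sizes and the helper `sample_stratum` are hypothetical stand-ins, not the actual procedure used.

```r
# Hypothetical stand-in for the full dataset: 3 institutions of different sizes
set.seed(42)
full <- data.frame(
  Institution = rep(c("Uni A", "Uni B", "Uni C"), times = c(5000, 3000, 2000)),
  id = seq_len(10000)
)

# Sample a fixed fraction of the rows within one stratum (at least one row)
sample_stratum <- function(rows, frac = 0.01) {
  n <- max(1, round(nrow(rows) * frac))
  rows[sample(nrow(rows), n), , drop = FALSE]
}

# Split by institution, sample each part, and stack the parts again
strata  <- split(full, full$Institution)
sampled <- do.call(rbind, lapply(strata, sample_stratum))

table(sampled$Institution)
```

Sampling within each institution keeps every institution represented in proportion to its size, which a simple random sample of the whole file would only do approximately.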

  1. Popularity of universities and its relation to G8 status and Regions.
  2. Field of study that each university and region specializes in (if any).
  3. Differences in the preferences of International Students and Domestic Students (region and field of study).
  4. Indication of gender segregation in Field of Study.
  5. Yearly trends in the number of students enrolled in higher institutions.
  6. ATAR distributions in different types of institutions.

We read the CSV (Comma Separated Values) file into R and assign the data to the variable “natEnrol”.

#natEnrol <- read.csv(file="h:/Downloads/Intro to Data Science/Project 13th Sep/nat_enrol_1.csv", header=TRUE, sep=",")

library(readr)
natEnrol<- read_csv("nat_enrol_1.csv")

First, we inspect the structure of the data to get an overview.

str(natEnrol)

We can see that the file holds data on 128967 students, with 46 variables for each student.

From the names, types and example values of the variables, we can infer some attributes of these 46 variables.

X (int) looks like a row index for the student records. Only one variable looks like meaningful numerical data: noEnrol (int), which appears to be the number of enrolments of the student. The variable “id” (int) looks like a student ID. The variable “updatewhen” (Factor) is the date and time of the data update, and “updatewho” (Factor) is a record of the party who keyed in the data.

At this point, there are 4 variables that we do not yet understand well, namely:

“mceetya_regional_remote_count”, “asgs_regional_remote_count”, “onshore_indicator” and “regionalremote_value”

The other 37 variables (Factor) look like categorical data, with the number of levels specified.

The levels of the variables tell us various important facts. For example, the structure of the variables “Institution” and “Institution_Type” shows that there are 41 institutions and that these institutions are categorized into 3 types.

Some variables might need transformation later on to have more sensible levels. For example, 18 levels for age is too many.

There also appear to be some data abnormalities and missing values. For example, an ATAR of 1 does not seem sensible, and a number of “NA”s have been detected for these 4 variables:

“mceetya_regional_remote_count”, “asgs_regional_remote_count”, “onshore_indicator” and “regionalremote_value”
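
A quick way to quantify such missing values is to count the NAs in every column. The sketch below uses a small toy data frame (the values are made up; only two column names are borrowed from the report), and the same two calls could be applied to natEnrol.

```r
# Toy data frame standing in for natEnrol (values are invented)
toy <- data.frame(
  onshore_indicator    = c(NA, NA, "Y", NA),
  regionalremote_value = c(NA, 2, NA, NA),
  gender               = c("F", "M", "F", "M")
)

# Number and share of missing values in each column
naCount <- colSums(is.na(toy))
naShare <- colMeans(is.na(toy))

naCount
naShare
```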

Now, we look at the summary of the data to gain further insight.

summary(natEnrol)

From the summary, we can obtain various useful information, such as the number of international students and the number of English-speaking students. We will come back to these later when we start to answer the questions related to our objectives.

Here, we focus on highlighting data abnormalities, missing values and the links between variables.

  1. “mceetya_regional_remote_count”, “asgs_regional_remote_count”, “onshore_indicator” and “regionalremote_value” have far too many missing values and might not be useful.

  2. “atar” has various data abnormalities: atar values of 999, 998 and 1 do not make sense. Fortunately, “ATAR_Group” gives similar information on students’ ATAR scores, so we shall compare the two later to see whether a sensible transformation of “atar” is possible.
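
One possible way to handle the implausible “atar” values is to recode them as missing before any statistics are computed. This is only a sketch on a toy vector: it assumes that 999, 998 and 1 are placeholder codes, and the cut-offs 30 and 99.95 are illustrative assumptions rather than rules taken from the dataset documentation.

```r
# Toy ATAR vector; 999, 998 and 1 are assumed to be placeholder codes
atar <- c(85.5, 999, 72.3, 998, 1, 99.95)

# Treat anything outside a plausible score range as missing
# (the bounds 30 and 99.95 are illustrative assumptions)
atarClean <- ifelse(atar >= 30 & atar <= 99.95, atar, NA)

atarClean
sum(is.na(atarClean))
```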

Now, we look at the first 10 rows of the data to gain further insight.

head(natEnrol,n = 10)

X_Field_of_Education and X_Supp_field_of_education seem to be related to the course name.

For example:

X_Field_of_Education   X_Supp_field_of_education   course_name
90900                  100300                      Bachelor of Laws (Honours)/B VisualArts
90900                  91901                       Bachelor of Laws (Honours)/B Economics
80000                  91901                       B Commerce/B Economics

From these 3 rows we can see that 90900 refers to Bachelor of Laws and 91901 refers to Bachelor of Economics.

X_Field_of_Education is the course code of the main course name.

X_Supp_field_of_education is the course code of the subsidiary course name.

Since “course name” has many missing values, “X_Field_of_Education” and “X_Supp_field_of_education” give us more information about the courses in which the students are enrolled.

In fact, the meaning of “X_Field_of_Education” and “X_Supp_field_of_education” can be found at the link below:

https://heimshelp.education.gov.au/resources/field-of-education-types

Or, if we are interested in a broader perspective, “broad_foe” and “Narrow_FOE” can tell us the field studied by the students.
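
The relation between the detailed codes and the broader fields can also be sketched directly from the code values. The example assumes the six-digit field-of-education structure in which the first two digits identify the broad field and the first four the narrow field, and it assumes that codes such as 90900 have simply lost a leading zero when read as integers. The variable names are hypothetical.

```r
# Codes as they appear in the data (leading zero lost when read as integers)
foeCode <- c(90900L, 91901L, 80000L, 100300L)

# Restore the six-digit form, then take the leading digits
foe6       <- sprintf("%06d", foeCode)
broadCode  <- substr(foe6, 1, 2)   # two-digit broad field
narrowCode <- substr(foe6, 1, 4)   # four-digit narrow field

data.frame(foeCode, foe6, broadCode, narrowCode)
```

Under this assumption, both law and economics codes fall under the same broad code, which is consistent with them sharing one broad field in “broad_foe”.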

Next, we explore various relationships among these variables to gain meaningful insights.

In this report, we mainly use ggplot to visualize our data. Hence, we first load the ggplot2 package.

library(ggplot2)

Attaching package: ‘ggplot2’

The following object is masked _by_ ‘.GlobalEnv’:

    diamonds
library(gridExtra)

To simplify our ggplot code, we assign the ggplot object for the data “natEnrol” to the variable “g”.

g <- ggplot(natEnrol)

Now, we shall see if our data can provide clues for any answers related to our objectives.

Objective 1:

Popularity of universities and its relation to G8 status and Regions.

# table() aggregates according to the Institution, Institution Types and States
instSums <- table(natEnrol$Institution,natEnrol$Institution_Type,natEnrol$state )
# as.data.frame() converts table object into a data frame
instDf <- as.data.frame(instSums)
# taking only rows with non-zero frequency
instDf2 <- subset(instDf,instDf$Freq!=0)
# rename columns
colnames(instDf2)<-c("stu.Inst","inst.type","inst.state", "count")


ggplot(instDf2) + geom_bar(mapping= (aes(x= reorder(stu.Inst,count) , y = count , fill = inst.type )), stat = "identity") + coord_flip() + labs(title="Number of Students in each Institution", y ="Number of Students", x = "Institution", fill = "Types of Institution")



ggplot(instDf2) + geom_bar(mapping= (aes(x= reorder(stu.Inst,count) , y = count , fill = inst.state )), stat = "identity") + coord_flip()+ labs(title="Number of Students in each Institution", y ="Number of Students", x = "Institution", fill = "States of Institution")

Interpretation:

The figures above show the number of students at each institution, ranked by number of students.

Five of the eight G8 institutions rank at the top, hinting that G8 status might be correlated with the number of students enrolled at an institution.

Interestingly, although the University of Adelaide, the University of Western Australia (UWA) and The Australian National University (ANU) are G8 institutions, they have considerably fewer students than the others.

The second figure gives some insight into this observation. It seems that the 10 institutions with the most students are from Victoria, New South Wales and Northern Territory. In fact, only three institutions in the top 20 are from outside these three regions. Furthermore, none of the three G8 institutions with fewer students is from these three regions.

This might indicate that an institution’s location (whether or not it is in VIC, NSW or NT) is even more strongly correlated with the number of students enrolled.

Data Transformation actions:

Form a table of Institution, Institution Type and State of Institution with the respective counts. Assign the resulting data frame of 4 columns to a new variable. Subset the data frame to omit irrelevant rows (combinations with a count of zero), keeping only the combinations of the 3 columns that actually occur. Rename the columns for easier handling of the data.

Reason of Transformation:

To get a new data frame with only the 4 variables needed for these bar charts, and to count the number of students before calling geom_bar so that the bars can be ordered by number of students.
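
The need for the zero-frequency filter can be seen on a small example. In this sketch (toy data, with hypothetical institution names), table() counts every Institution by state pair, including pairs that cannot occur because each institution sits in one state only.

```r
# Toy enrolment records: each institution belongs to exactly one state
toy <- data.frame(
  Institution = c("Uni A", "Uni A", "Uni A", "Uni B", "Uni C", "Uni C"),
  state       = c("NSW",   "NSW",   "NSW",   "VIC",   "NSW",   "NSW")
)

# table() counts every Institution x state pair, including impossible ones
allPairs <- as.data.frame(table(toy$Institution, toy$state))
nrow(allPairs)

# Keep only the pairs that occur, then name the columns
realPairs <- subset(allPairs, Freq != 0)
colnames(realPairs) <- c("stu.Inst", "inst.state", "count")

# Order by count, similar to what reorder() does on the plot axis
realPairs[order(realPairs$count), ]
```

Without the filter, every institution would appear once per state with a count of zero for all but one of them, which would add empty entries to the data behind the bar chart.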

Objective 2:

Field of study that each university and region specializes in (if any).

sTheme <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999900", "#000569", "#E600E9", "#560009", "#035050")

summary(natEnrol[natEnrol$broad_foe == "" | is.na(natEnrol$broad_foe) == TRUE,"broad_foe"])
  broad_foe        
 Length:1528       
 Class :character  
 Mode  :character  
knownFoe <- natEnrol[natEnrol$broad_foe != "" & is.na(natEnrol$broad_foe) == FALSE,]

knownFoe[knownFoe$broad_foe == "Food, Hospitality And",]$broad_foe <- "Food, Hospitality And Personal Service" 

ggplot(knownFoe) + geom_bar(mapping = aes(x= Institution, fill=broad_foe), position ="fill") + coord_flip() + labs(title="Proportions of Field in each Institution", y ="Proportion", x = "Institution", fill = "Field of Study") + scale_fill_manual(values=sTheme)

Interpretation:

“foe” stands for Field of Education, as described at https://heimshelp.education.gov.au/resources/field-of-education-types.

From the figure above, it seems that most institutions have no obvious specialization in any field. However, it is worth noting that ANU has the highest proportion of students studying Society and Culture.

Data Transformation actions:

Counts and proportions are the result of the data transformation done by geom_bar. The dataset is subset to omit rows with empty data or NA in broad_foe (Field). The subset data frame is assigned to a new variable (knownFoe) for convenient access later.

The truncated value “Food, Hospitality And” in broad_foe is changed to “Food, Hospitality And Personal Service”.

Reason of Transformation:

The figures here concern the proportions of fields of study at each institution. Only 1528 of the 128967 records have NA or empty values. With such a low percentage of missing values, it is safe to omit them, since doing so will not affect the proportions much.
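
The share of omitted rows follows directly from the two figures quoted above and comes to about 1.2%. The short sketch below reproduces that arithmetic and shows, on a toy column with made-up values, the same “NA or empty” condition used to build knownFoe.

```r
# Figures quoted in the text
nMissing <- 1528
nTotal   <- 128967
pctMissing <- round(100 * nMissing / nTotal, 2)   # share of rows dropped, in percent
pctMissing

# The same check on a toy column, counting both NA and empty strings
broad_foe <- c("Health", "", NA, "Education", "Health")
missing   <- is.na(broad_foe) | broad_foe == ""
sum(missing)
```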

The value “Food, Hospitality And” is changed to “Food, Hospitality And Personal Service” for easier understanding.

ggplot(knownFoe) + geom_count(mapping = aes(x = state, y = broad_foe, size = stat(prop), group = state, colour = state), show.legend = c(colour=FALSE)) + labs(size = "Proportions by States") + ggtitle("Proportions of Field in each States") + ylab("Field of study") + xlab("States")

Interpretation:

From the figure above, it seems that most states have no obvious specialization in any field.

ACT has the highest proportion in Society and Culture. From the previous bar charts, we know that ANU is the institution with the most students in ACT. ANU having the highest percentage of students studying Society and Culture might be a contributing factor.

MUL has the highest proportions in Health and Education. Only a very small proportion of students study Food, Hospitality and Personal Services. Overall, the most popular fields of study are “Society & Culture”, “Management & Commerce”, “Health” and “Education”. These observations indicate the possibility of some form of specialization of institutions in certain states or regions.

Data Transformation actions:

Counts and proportions are the result of the data transformation done by geom_count.

Reason of Transformation:

The figure above concerns the proportions of fields of study in each state.
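
The within-state proportions that geom_count draws can also be computed as a table, which makes the exact values easier to read than dot sizes. This is a sketch on toy data; the records and counts are invented for illustration.

```r
# Toy records: state and broad field of each student
toy <- data.frame(
  state     = c("ACT", "ACT", "ACT", "ACT", "VIC", "VIC", "VIC", "VIC"),
  broad_foe = c("Society And Culture", "Society And Culture",
                "Society And Culture", "Health",
                "Health", "Education", "Society And Culture", "Health")
)

# Counts per (state, field), then proportions within each state (rows sum to 1)
counts    <- table(toy$state, toy$broad_foe)
propState <- prop.table(counts, margin = 1)

round(propState, 2)
```

Setting margin = 1 normalizes each row, which corresponds to grouping by state in the plot.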

Objective 3:

Differences in the preferences of International Students and Domestic Students (region and field of study).

g + geom_bar(mapping = aes(x="", fill=state), position ="fill") + coord_polar("y", start=0)  + facet_grid(facets= . ~ overseas, margins = TRUE) + labs (title = "Location of Studies of International and Domestic Students" , x = NULL , y = "Percentage of enrolment in different region" , fill = "States")

Interpretation:

From the figure above, it seems that domestic and international students have similar distributions across the states. A slightly higher percentage of international students than domestic students study in Victoria, and vice versa for New South Wales.

Data Transformation actions:

Counts and proportions are the result of the data transformation done by geom_bar.

Reason of Transformation:

The figure above concerns the proportions of international and domestic students studying in each state.

ggplot(knownFoe) + geom_bar(mapping = aes(x=overseas, fill=broad_foe), position ="fill") + coord_polar("y", start=0) + ggtitle("Field of Study for Domestic vs Overseas Students") + xlab("") + ylab("Percentage of enrolment in different fields") + labs(fill="Field of Study")

Interpretation:

Among international students, a higher percentage study in the fields of “Management & Commerce”, “Information Technology” and “Engineering & Related Technology”.

Among domestic students, a higher percentage study in the fields of “Society & Culture”, “Creative Arts” and “Education”.

These differences might be due to differences in the preferences or mentality of international and domestic students. It is worthwhile to examine the causes of such differences.

Data Transformation actions:

Counts and proportions are the result of the data transformation done by geom_bar.

Reason of Transformation:

The figure above concerns the proportions of field of study for International & Domestic Students.

Objective 4:

Indication of gender-segregation in Field of Study.

a <- ggplot(knownFoe) + geom_bar(mapping = aes(x= "", fill=gender), position ="fill") + coord_flip() + labs(title = "Proportion of Gender of All Students", y = "Percentage" , x = "" , fill = "Gender") 

b <- ggplot(knownFoe) + geom_bar(mapping = aes(x= broad_foe, fill=gender), position ="fill", show.legend = FALSE) + coord_flip() + labs(title = "Proportion of Gender in different Field of Studies", y = "Percentage" , x = "Field of Study") 

grid.arrange(a,b, heights=c(1/4, 3/4), nrow = 2)

Interpretation:

Overall, there are more females than males in our sample. The fields “Information Technology” and “Engineering & Related Technology” have a much higher percentage of males than females, while “Education” and “Health” have a much higher percentage of females than males. This may indicate that society views certain fields of study as “for males” or “for females”.

Data Transformation actions:

Counts and proportions are the result of the data transformation done by geom_bar.

Reason of Transformation:

The figure above concerns the proportions of field of study for male & female students.

Objective 5:

Year trends of number of students enrolled in higher institutions.

g + geom_bar(mapping = aes(x= as.numeric(as.character(Year)),fill = overseas)) + labs(title = "Number of Students from 2005~2017", y = "Total number of Students" , x = "Year", fill = "") + scale_fill_manual(values = c("#0072B2", "#D55E00"))

Interpretation:

Overall, there is an upward trend in the number of students enrolled in higher institutions, which is a good sign. There is a slight drop in the number of students in 2017. However, at first glance, such small fluctuations seem to occur naturally throughout the years, so this might simply be due to natural variation.

We can also see that the number of domestic students is much higher than the number of international students throughout the years, and that the number of domestic students increases faster as well. The increase in the total number of students seems to come mostly from the increase in domestic students. This observation is not surprising, since a government can be expected to emphasize providing higher education to its own citizens.

Data Transformation actions:

Counts are the result of the data transformation done by geom_bar.

Reason of Transformation:

The figure above concerns the number of domestic, international and all students enrolled in each year from 2005 to 2017.

# table() aggregates according to Year and gender
yearSums <- table(natEnrol$Year,natEnrol$gender)
# as.data.frame() converts table object into a data frame
yearDf <- as.data.frame(yearSums)
# taking only rows with non-zero frequency
yearDf2 <- subset(yearDf,yearDf$Freq!=0)
# rename columns
colnames(yearDf2)<-c("enrol.Year", "enrol.gender", "count")

ggplot(yearDf2, mapping = aes(x = enrol.Year, y = count, group = enrol.gender , color = enrol.gender)) + geom_point() + geom_line() + labs(title = "Number of Students from 2005~2017 - Segregated by gender", y = "Number of Students of each gender" , x = "Year", color = "Gender")

Interpretation:

Throughout the years, the number of female students is persistently higher than the number of male students. The numbers of male and female students also seem to move hand in hand and to be positively correlated. The only exception is 2017, where the number of female students increases but the number of male students drops. This also indicates that the fall in the total number of students is due to the fall in the number of male students, which is greater than the increase in female students. It might be worthwhile to examine the cause of the fall in male students in 2017.

Data Transformation actions:

Form a table of Year and Gender with the respective counts. Assign the resulting data frame of 3 columns to a new variable. Subset the data frame to omit irrelevant rows (combinations with a count of zero). Rename the columns for easier handling of the data.

Reason of Transformation:

To get a new data frame with only the 3 variables needed for this line chart, and to count the number of students explicitly, since geom_point and geom_line do not compute counts automatically the way geom_bar does.
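
Once the counts per year and gender are in a table, the year-on-year change for each gender can be read off directly, which is a simple way to check observations such as the 2017 drop. The sketch below uses invented toy records, not the actual enrolment figures.

```r
# Toy records: one row per enrolment (values are invented)
toy <- data.frame(
  Year   = c(2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017),
  gender = c("F",  "F",  "M",  "F",  "F",  "M",  "M",  "F",  "F",  "F")
)

# Count rows for each Year x gender pair; rows are years, columns are genders
yearTab <- table(toy$Year, toy$gender)
yearTab

# Year-on-year change in each gender's count (one column per gender)
change <- apply(yearTab, 2, diff)
change
```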

Objective 6:

ATAR Distributions in different types of institutions.

#Omit data with "Unknown" ATAR_Group
knownATAR <- natEnrol[natEnrol$ATAR_Group != "Unknown", ]
#Convert "atar" to numeric values (via character, so that a factor yields scores rather than level codes)
knownATAR$atar <- as.numeric(as.character(knownATAR$atar))
#Replace ATAR_Group data of ">=99" with "99=<"
knownATAR[knownATAR$ATAR_Group == ">=99",]$ATAR_Group <- "99=<"

c <- ggplot(knownATAR) + geom_boxplot(mapping = aes(x = "", y = atar),na.rm = TRUE) + labs(x= "", y = "Overall  ATAR  score")

d <- ggplot(knownATAR) + geom_boxplot(mapping = aes(x = Institution, y = atar, fill = Institution_Type),na.rm = TRUE, show.legend = FALSE) + coord_flip() + labs(x= "", y = "ATAR  score by Institutions", title = "Boxplot of ATAR Scores")

grid.arrange(d,c, widths=c(9/10, 1/10), ncol = 2)


ggplot(knownATAR) + geom_bar(mapping = aes(x= ATAR_Group, fill = Institution_Type), position ="identity", show.legend = TRUE) + coord_flip() + labs(title="Number of students in different ATAR score ranges", y ="Number of Students", x = "ATAR Range",fill = "Type of Institution") + facet_wrap(facets = (.~Institution_Type),scales="free")


ggplot(knownATAR) + geom_bar(mapping = aes(x= ATAR_Group, fill = Institution_Type), position ="fill", show.legend = FALSE) + coord_flip() + labs(title="Proportions of Institution Types among students in different ATAR range", y ="Proportions", x = "ATAR Range")

Interpretation:

Overall, most Group of Eight (G8) institutions have students with higher ATAR scores and a higher median ATAR score. The dispersion of ATAR scores among G8 students also seems narrower (judging by the respective interquartile ranges). This might be because G8 institutions have higher ATAR requirements.

It is also worth noting that the overall median ATAR score of all students is slightly above 80. However, students at six of the eight G8 universities have a median ATAR score at or above 90, and even students at the remaining two G8 universities have a median ATAR score greater than 85. Interestingly, both institutions classified as colleges have students with a median ATAR score around 83, which is higher than at many non-G8 universities.

We can see that the modal ATAR group for G8 is 90-95, while the mode for non-G8 is 80-85. The ATAR scores of both G8 and non-G8 students seem to be loosely normal. The modal ATAR group for college students is 85-90.

Non-G8 universities have many more students in total, perhaps simply because there are many more institutions in the non-G8 category.

Last but not least, G8 universities seem to be positively correlated with good ATAR scores: the higher the ATAR group, the higher the percentage of its students enrolled in G8 universities. This might indicate a strong preference for G8 universities among students with good results.
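
The medians and interquartile ranges read off the boxplots can also be computed as numbers. The sketch below does this per institution type on toy scores; the type labels and values are hypothetical stand-ins, so the output does not reproduce the figures discussed above.

```r
# Toy scores; the type labels are hypothetical stand-ins for Institution_Type
toy <- data.frame(
  Institution_Type = c("G8", "G8", "G8",
                       "Non-G8", "Non-G8", "Non-G8",
                       "College", "College", "College"),
  atar             = c(92, 95, 88,
                       78, 82, 85,
                       80, 83, 86)
)

# Median and interquartile range of ATAR within each type
medByType <- tapply(toy$atar, toy$Institution_Type, median)
iqrByType <- tapply(toy$atar, toy$Institution_Type, IQR)

medByType
iqrByType
```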

Data Transformation actions:

Omit data with “Unknown” ATAR_Group. Convert “atar” to a numeric variable. Replace the ATAR_Group value “>=99” with “99=<”.

Reason of Transformation:

Data with “Unknown” ATAR_Group also have abnormal or implausible ATAR values. Hence, it is very hard to make use of these ATAR-related data.

Converting “atar” from factor to numeric allows us to compute various statistics such as the median and the interquartile range.
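
One caveat with this conversion is worth illustrating: calling as.numeric() directly on a factor returns the internal level codes rather than the scores, so the values should go through as.character() first. The toy factor below is made up for the demonstration.

```r
# A score column that was read in as a factor
atarFactor <- factor(c("85.5", "99.9", "70"))

# as.numeric() on a factor returns the internal level codes, not the scores
codes <- as.numeric(atarFactor)
codes

# Going through character first recovers the original values
atarNum <- as.numeric(as.character(atarFactor))
median(atarNum)
```

If the column was read as character or numeric instead (as readr may do), the two-step conversion still gives the right result, so it is the safer form in either case.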

Replacing the ATAR_Group value “>=99” with “99=<” allows the bar plot of ATAR_Group to be arranged from the group with the highest ATAR score to the lowest.
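
An alternative to renaming the label is to state the order of the groups explicitly with factor levels, which ggplot then follows on the axis. This is a sketch only: apart from “>=99”, the exact group labels below are assumptions and would need to match the values in ATAR_Group.

```r
# Assumed ATAR_Group labels; only ">=99" is taken verbatim from the report
grp <- c("90-95", ">=99", "80-85", "95-99", "90-95")

# State the order explicitly instead of relying on alphabetical sorting
grpOrdered <- factor(grp, levels = c("80-85", "90-95", "95-99", ">=99"))

levels(grpOrdered)
table(grpOrdered)
```

This keeps the original label in the plot while still giving a lowest-to-highest ordering.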

