Data Mining and Data Scientist Salary Estimates in the Philippines

http://cdn.mathjax.org/mathjax/latest/MathJax.js

Motivation

As I am going back to the Philippines to pursue further studies in Statistics, it intrigues me if Data Mining and Data Science are catching up. I am seeing some positions in jobsearch websites such as Jobstreet so as a data miner, I extracted the relevant job openings that are related to the key phrases:

  • Data Mining; and
  • Data Scientist.

These may look too specific but this is just a quick draft, anyway. Also, I did not include Data Analyst as this scopes a broader job scope diversity than the two mentioned. Also any intensive text extraction using basic Information Retrieval methods is not used.

Warning: The result of the models should not be used to provide recommendations as data the is collected using a convenience sample without performing accuracy tests, only k-fold cross validations against the training set when CART is used.

Data Set

The data is collected manually by searching for relevant job openings active today, 22 May, 2015. I have an assumption that that the data set is relatively small, and so less than 30 positions is returned. Pre-processing is done externally, in Excel, to remove currency prefix, i.e. PHP and text in experience, etc.

library(RCurl)
jobstreet <- getURL("https://raw.githubusercontent.com/foxyreign/Adhocs/master/Jobstreet.csv", ssl.verifypeer=0L, followlocation=1L) # Download from github
writeLines(jobstreet, "Jobstreet.csv")
df <- read.csv('Jobstreet.csv', head=T, sep=",") # Load dataset
df <- na.omit(df) # Exclude missing data

summary(df) # Summarize
##  Expected.Salary    Experience       Education    
## Min. : 11000 Min. : 0.000 Min. :1.000
## 1st Qu.: 19000 1st Qu.: 2.000 1st Qu.:2.000
## Median : 28000 Median : 4.000 Median :2.000
## Mean : 35016 Mean : 4.919 Mean :2.179
## 3rd Qu.: 40000 3rd Qu.: 8.000 3rd Qu.:2.000
## Max. :130000 Max. :20.000 Max. :4.000
##
## Specialization Position
## IT-Software :30 Data Mining :72
## - :17 Data Scientist:51
## IT-Network/Sys/DB Admin:14
## Actuarial/Statistics :12
## Banking/Financial : 8
## Electronics : 8
## (Other) :34

As mentioned, there are only approximately 120 job applicants which applied for these two grouped positions. Since the data does not mention if an applicant applied for more than one position, I assume that these are distinct records of applicants per position and/or position group, Data Mining and Data Scientist.

Variables

  1. Expected.Salary – numerical. The expected salary of each applicant based on their profile.
  2. Experience – ordinal but treated as numerical for easier interpretation in the later algorithms used. This is the years of work experience of the applicant.
  3. Education – categorical; not used in the models because of extreme unbalance in proportions. This is labelled as:
    • 1 – Secondary School
    • 2 – Bachelor Degree
    • 3 – Post Graduate Diploma
    • 4 – Professional Degree
  4. Specialization – categorical; not used in this analysis.
  5. Position – categorical. Data Mining or Data Scientist
  6. Education.Group – categorical. Additional variable to bin the years of experience.
# Categorize education variable
df$Education <- factor(df$Education, levels = c(1,2,3,4),
labels=(c("Secondary Sch", "Bach Degree",
"Post Grad Dip", "Prof Degree")))

# Bin years of experience
df$Experience.Group <- ifelse(df$Experience < 3, "3 Years",
ifelse(df$Experience < 5, "5 Years",
ifelse(df$Experience < 10, "10 Years", "+10 Years")))
df$Experience.Group <- factor(df$Experience.Group,
levels=c("3 Years", "5 Years", "10 Years", "+10 Years"))

# Drop variables
df <- df[, !(colnames(df) %in% c("Education","Specialization"))]

# Subsets positions
mining <- subset(df, Position == "Data Mining")
scientist <- subset(df, Position == "Data Scientist")

Distribution

As expected, Data Scientists have a higher expected salary although this is so dispersed that even if I compare these two using a t-test assuming heteroskedastic distribution, there is a significant difference between the averages expected salaries of the two positions.

require(ggplot2)
require(scales)

# Boxplot
ggplot(df, aes(x=factor(0), y=Expected.Salary, fill=Experience.Group)) +
facet_wrap(~Position) + geom_boxplot() + xlab(NULL) +
scale_y_continuous(labels = comma) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position="bottom")
Boxplot
Distribution of Expected Salaries
# T-test
t.test(Expected.Salary ~ Position, paired = FALSE, data = df)
## 
## Welch Two Sample t-test
##
## data: Expected.Salary by Position
## t = -3.3801, df = 68.611, p-value = 0.001199
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24086.501 -6205.983
## sample estimates:
## mean in group Data Mining mean in group Data Scientist
## 28736.11 43882.35
# Median expected salaries of Data Mining vs Data Scientist
c(median(mining$Expected.Salary), median(scientist$Expected.Salary))
## [1] 25000 30000

Come on fellow data enthusiasts, you should do better than this! The difference of their medians is just 5,000 PHP. In my honest opinion, these center values are way below based on the prospective demand of shortage of these people who can understand data in the next 10 years.

Regression

The intercept is not included in the model because I want to see the contrast between Data Mining and Data Scientist although I already computed it beforehand. Besides, though the linear regressio model shows significant value, but when doing diagnostics, linear approach is not appropriate because the residual errors are not random and depict a funnel shape based on their errors. The quantitative measures of regressions for significance cutoff are: $r_{adj}^{2}>0.80, p<0.05$

The regression output coefficients are interpreted as follows:
$$y = \beta_{0}(12,934.9) + \beta_{1}(3,336.3) + \beta_{2}$$

# Estimate coefficients of linear regression model
summary(lm(Expected.Salary ~ Experience + Position-1, data=df))
## Call:
## lm(formula = Expected.Salary ~ Experience + Position - 1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45312 -11123 -1280 6877 66688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Experience 3336.3 446.3 7.476 1.38e-11 ***
## PositionData Mining 12934.9 3024.3 4.277 3.83e-05 ***
## PositionData Scientist 26612.0 3455.8 7.701 4.27e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18350 on 120 degrees of freedom
## Multiple R-squared: 0.8136, Adjusted R-squared: 0.809
## F-statistic: 174.6 on 3 and 120 DF, p-value: < 2.2e-16
# Scatter plot
ggplot(df, aes(x=Experience, y=Expected.Salary)) +
geom_point(aes(col=Experience.Group)) +
facet_wrap(~Position) +
scale_y_continuous(labels = comma) +
stat_smooth(method="lm", fullrange = T) +
theme(legend.position="bottom")
Scatterplot
Scatterplot of Expected Salary per Year of Experience
# Diagnose LM
par(mfrow=c(1,2))
plot(lm(Expected.Salary ~ Experience + Position-1, data=df), c(1,2))
Linear Regression Estimates and Diagnostics

CART

Information Gain is used to divide the nodes based on weighted average entropy as linear regression does not do well with the data set. Of course, years of experience is more influential than the position.

Looking at the estimated salaries from the printed tree, applicants who have years of experience lower than 1.5 are approximately expecting 17,000 PHP. While those who applied for Data Mining jobs with more than 6.5 years of experience are expecting 66,000 PHP on average.

require(rpart)
require(rattle)

cart <- rpart(formula = Expected.Salary ~ Experience + Position,
data = df,
parms = list(split = "information"), # Uses information gain
model = T) # Retains model information

# Plot tree
layout(matrix(c(1,2,3,4), nrow = 1, ncol = 2, byrow = TRUE), widths=c(1.5,2.5))
barplot(cart$variable.importance,
cex.names = 0.6, cex.axis = 0.5,
sub = "Variable Importance")
fancyRpartPlot(cart, main=NULL, sub=NULL)
Rpart
Decision Tree using CART and Variable Importance
# Estimates
print(cart); printcp(cart)
## n= 123 
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 123 66101970000 35016.26
## 2) Experience< 4.5 64 7326938000 23781.25
## 4) Experience< 1.5 27 520666700 16888.89 *
## 5) Experience>=1.5 37 4587676000 28810.81
## 10) Position=Data Mining 24 1055625000 24375.00 *
## 11) Position=Data Scientist 13 2188000000 37000.00 *
## 3) Experience>=4.5 59 41933560000 47203.39
## 6) Position=Data Mining 33 9501333000 37333.33 *
## 7) Position=Data Scientist 26 25137120000 59730.77
## 14) Experience< 6.5 8 5237500000 46250.00 *
## 15) Experience>=6.5 18 17799610000 65722.22 *
## 
## Regression tree:
## rpart(formula = Expected.Salary ~ Experience + Position, data = df,
## model = T, parms = list(split = "information"))
##
## Variables actually used in tree construction:
## [1] Experience Position
##
## Root node error: 6.6102e+10/123 = 537414370
##
## n= 123
##
## CP nsplit rel error xerror xstd
## 1 0.254780 0 1.00000 1.00888 0.19245
## 2 0.110361 1 0.74522 0.81057 0.14709
## 3 0.033563 2 0.63486 0.77285 0.13292
## 4 0.031769 3 0.60130 0.76228 0.13052
## 5 0.020333 4 0.56953 0.74207 0.12969
## 6 0.010000 5 0.54919 0.70258 0.12366
Again, fellow data miners and data scientists, ask for more! You do not realize your worth with the current demand of people who can understand data.

The full paper on this can be downloaded at https://github.com/foxyreign/Adhocs/blob/master/Jobstreet.pdf

https://www.scribd.com/embeds/266354521/content?start_page=1&view_mode=scroll&access_key=key-Yw2JSzZVa8aKLVjcumfH&show_recommendations=true

Tanita 2015

Fitness Goals for 2015

Here I am again gaining back my momentum where I fell off maintaining a healthier lifestyle two years ago. This time, for sure, I will stick to it until I reach the 3rd phase of my target and maintain it. Besides, I will be the loser again if I cannot reach nor maintain this.

Since Wordpress does not allow script embedding, you need to click on my body composition dashboard so that you can interactively look at my data.

FoxyReign's Body Composition

I used moving average to smooth the data because I do not feel that a linear trend on a time-series model is not appropriate given that I have very few data points; maybe after three months of tracking, I will include that.

What really concerns me is that I see a statistically significant correlation between the increase in my fat and fat-free muscle mass. In due time, this has to change – let it be flat or better negatively correlated variables. Ill definitely look very muscular if that happens.

My Facebook Social Network

Last year, Coursera offered a Social Network Analysis course where the instructor is one of the Computational Social Scientist of Facebook, Lada Adamic. Too bad I did not finish this course because it overlapped with another course I was finishing. When this course is offered again, I will definitely complete this since I have to visualize more graphs that deals with the network of textual data using association rule learning.

So here’s my Facebook social network. I wanted to mine more information about my friends but after Facebook retired Oauth 1.0, it really restricted a lot of information that you can mine from your friends. I guess that’s a good thing because of privacy information concerns. So I can only mine the network map though I found a way to mine my textual posts through the archive feature but I have to build some program, maybe in C++ or Python, to extract the data properly.

The size of each label is based on the mixed degree, meaning our mutual friends. The colors of each line the intensity, the purple being the highest, green is middle and blue is the least.

Facebook Social Network

It is fun to see how my friends (nodes) are clustered together based on their distance of connectivity and how each single friend can be interconnected with two clusters, putting their spot in between as it is the average distance between two clusters.

This is not really a particular big network, consisting of 2,500+ nodes (friends) and 55,000+ edges (how each node is connected). Although with a pretty decent PC running at 8GB RAM, Gephi was already performing slow in running ForceAtlas 2.

See the upper left sub giant network? I probably need to clean up my friends list because they do not make sense. These are my Facebook friends who are closely related to each other by using the recommended friend list of Facebook – a behavioral pattern where a group of men who looks at the stance of your half-naked pictures.

I wonder if there’s any platform where I can upload my big network so you can interactively see the nodes and labels.

FoxyReign Facebook Network

My Facebook Social Network

Last year, Coursera offered a Social Network Analysis course where the instructor is one of the Computational Social Scientist of Facebook, Lada Adamic. Too bad I did not finish this course because it overlapped with another course I was finishing. When this course is offered again, I will definitely complete this since I have to visualize more graphs that deals with the network of textual data using association rule learning.

So here’s my Facebook social network. I wanted to mine more information about my friends but after Facebook retired Oauth 1.0, it really restricted a lot of information that you can mine from your friends. I guess that’s a good thing because of privacy information concerns. So I can only mine the network map though I found a way to mine my textual posts through the archive feature but I have to build some program, maybe in C++ or Python, to extract the data properly.

The size of each label is based on the mixed degree, meaning our mutual friends. The colors of each line the intensity, the purple being the highest, green is middle and blue is the least.

FoxyReign Facebook Network

It is fun to see how my friends (nodes) are clustered together based on their distance of connectivity and how each single friend can be interconnected with two clusters, putting their spot in between as it is the average distance between two clusters.

This is not really a particular big network, consisting of 2,500+ nodes (friends) and 55,000+ edges (how each node is connected). Although with a pretty decent PC running at 8GB RAM, Gephi was already performing slow in running ForceAtlas 2.

See the upper left sub giant network? I probably need to clean up my friends list because they do not make sense. These are my Facebook friends who are closely related to each other by using the recommended friend list of Facebook – a behavioral pattern where a group of men who looks at the stance of your half-naked pictures.

I wonder if there’s any platform where I can upload my big network so you can interactively see the nodes and labels.