In this post, I use R to answer real-world questions: How can we see clearly the work done by consultants in a time-period? What is the distribution of workload?
Why R vs. others? This is more of a learning experience than a production code for me. I’m not interested in AI/ML at this point. I want to brush up my statistics/probability knowledge. And I believe in focused communities. Python is also great but I would have to filter a lot of information.
You can also achieve the same results using Excel, Apache Superset, SPSS, etc. But I see the R community as a teacher.
For the libraries, I have decided to go with tidyverse collection. I use dplyr instead of data.table. Since I’m just learning the language there is no argument for familiarity, and I think I will have a better understanding next time I look at some verbose/expressive code.
ggplot2 is great and you should check the cheatsheets here: https://github.com/rstudio/cheatsheets
And the guide here is very high quality: https://evamaerey.github.io/ggplot2_grammar_guide/ggplot2_grammar_guide#1
Keep in mind that with ggplot2 you draw a graph layer by layer. If you want to learn the theory behind it refer to the Wikipedia article:
“ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics—a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.”
Code & Charts
Same data in CSV:
User,Customer,Effort Hours,Date Tom,ACME Corporation,2,2020-01-23 Tom,Soylent Corp,2,2020-01-23 Joe,Soylent Corp,8,2020-01-23 Tom,Pied Piper,2,2020-01-23 John,Mom and Pop,8,2020-01-23 Harry,Newco,8,2020-01-23 Tom,ACME Corporation,4,2020-01-24 Joe,Soylent Corp,8,2020-01-24 John,ACME Corporation,6,2020-01-24 Tom,Mom and Pop,2,2020-01-24 Harry,Newco,6,2020-01-24 John,ACME Corporation,8,2020-01-25 Joe,Soylent Corp,8,2020-01-25 Tom,Pied Piper,2,2020-01-25 Harry,Mom and Pop,4,2020-01-25 Tom,Newco,2,2020-01-25 Harry,ACME Corporation,4,2020-01-25
Install these libraries:
install.packages(c("tidyverse", "readxl", "lubridate", "ggplot2", "viridis", "hrbrthemes","scales"))
Load the libraries:
library(tidyverse) library(readxl) library(lubridate) library(scales) library(viridis) library(hrbrthemes)
Import the data and do some clearing:
df <- read_excel("timesheet/example.xlsx") # You can see columns names(df) # Date column has a string class class(df$Date) # Convert it to date class df$Date <- ymd(df$Date) # This is to have clean names names(df) <- str_replace_all(names(df), c(" " = ".", "," = "")) # If Spreadsheet type for column Effort.Hours is String/Text/character df$Effort.Hours <- as.numeric(df$Effort.Hours)
Let’s see first example:
# Sum Effort Hours for each day totalhours_by_day <- df %>% group_by(Date) %>% summarise( n = n(), Effort.Hours = sum(Effort.Hours) ) # ggplot! ggplot( totalhours_by_day, aes( x = Date, y = Effort.Hours ) ) + geom_point(aes(color = Effort.Hours)) + geom_smooth() + scale_x_date(breaks = date_breaks("day"), labels = date_format("%d"))
Divide by companies . We will use facet_wrap feature. Each customer will have its own chart!
# Sum for each customer and date, ignoring User total_by_customer <- df %>% group_by(Date, Customer) %>% summarise( Hours = sum(Effort.Hours) ) # month birdview, each customer having its own small graph, thanks to facet_wrap! ggplot( total_by_customer, aes( x = Date, y = Hours ) ) + geom_col() + facet_wrap(~Customer, ncol = 5) + scale_x_date(breaks = date_breaks("day"), labels = date_format("%d")) #if your time-interval is long use week for breaks: #scale_x_date(breaks = date_breaks("week"), labels = date_format("%d"))
Here the chart gets more interesting. If you need a lot of colors, you should check this site. The easiest way is to copy “HEX json” to an array like below:
total_by_user <- df %>% group_by(User, Customer) %>% summarise( Hours = sum(Effort.Hours) ) # If you need distinct colors for a lot of categories/fills. This site is useful: # https://medialab.github.io/iwanthue/ # Just copy from "HEX json" to here colors_custom <- c( "#e470a3", "#9bc761", "#9c8fe1", "#da924e", "#5fcdb2" ) # Consultant customer diversity & workload # consultant-customer-diversity-and-workload.png ggplot(total_by_user, aes(fill = Customer, y = Hours, x = User, label = Customer)) + geom_bar(position = "stack", stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) + theme(axis.text.x = element_text(angle = -90, vjust = 0.5, hjust = 0)) + scale_fill_manual(values = colors_custom)
Make your charts beautiful with colors and themes. I use viridis colors with stacked area chart.
Stacked bar chart has problems when there is missing data for x axis. You should fill them with 0.
"complete" function to the rescue!
total_by_date <- df %>% group_by(Date, Customer) %>% summarise( Hours = sum(Effort.Hours) ) %>% ungroup() %>% # if you dont fill dates, stacked area chart will have problems. # you should have all the x axis values for all factors. You can use 0 for empty data. # complete function is very handy here complete(Date = seq(first(Date), last(Date), by = "day")) %>% complete(Customer, Date, fill = list(Hours = 0)) # viridis colors are beautiful. If you need <10 colors # Plot ggplot(total_by_date, aes(x = Date, y = Hours, fill = Customer)) + geom_area(alpha = 0.6, size = .5, colour = "white", position = "stack") + scale_fill_viridis(option = "magma", discrete = T) + # scale_fill_manual(values=colors_custom) + theme_ipsum() + ggtitle("Activity by Date and Customers") # Ref: https://www.r-graph-gallery.com/136-stacked-area-chart.html
If you only looked at the total numbers you might think Tom is an underperformer. By the way, I think you should never judge a team member/human based solely on numerical data!
Overall I think the most important thing doing data analysis is asking the right questions. It is an art that should be learned with time and should be implemented end-to-end from the data collection to visualization. Good questions lead to beautiful charts that gives meaningful answers!