Data Journalism Case Study of Self-Employment And Entrepreneurship

For a while, I was looking for a representative dataset to study the various factors (or predictors, in the terminology of statistics) influencing self-employment or entrepreneurship. Although Crunchbase and LinkedIn have some fantastic datasets, harvesting data from either of these sources is difficult, if not impossible. The only other public dataset with good demographic data is Census data.

The US has multiple "census-like" surveys to capture this kind of data -

  1. Decennial census (10 year) - census.gov
  2. ACS or American Community Survey (1 year / 5 year) - https://www.census.gov/programs-surveys/acs/
  3. CPS - Current Population Survey (monthly) - http://www.census.gov/programs-surveys/cps.html

Although I tried downloading the full (5 year) ACS data (http://www2.census.gov/programs-surveys/acs/data/pums/2014/5-Year/csv_pus.zip), it's huge (I don't want to fire up a Hadoop or Spark cluster yet). Instead I've selected CPS (smaller in scope but good enough). Even the CPS dataset, the smallest of the three, has a massive number of features/attributes. Luckily the great folks at IPUMS-CPS have already made it easy to select a few interesting columns and extract a customized dataset for your own research or analysis. Amongst many, I've selected the following columns -

  • YEAR = "Survey year"
  • SERIAL = "Household serial number"
  • HWTSUPP = "Household weight, Supplement"
  • ASECFLAG = "Flag for ASEC"
  • HFLAG = "Flag for the 3/8 file 2014"
  • OWNERSHP = "Ownership of dwelling"
  • HHINCOME = "Total household income"
  • MONTH = "Month"
  • PERNUM = "Person number in sample unit"
  • WTSUPP = "Supplement Weight"
  • RELATE = "Relationship to household head"
  • AGE = "Age"
  • RACE = "Race"
  • ASIAN = "Asian subgroup"
  • MARST = "Marital status"
  • EDUC = "Educational attainment recode"
  • OCC = "Occupation"
  • IND = "Industry"
  • CLASSWKR = "Class of worker"
  • FTOTVAL = "Total family income"
  • INCTOT = "Total personal income"
  • OFFPOV = "Official Poverty Status (IPUMS constructed)"

Thankfully they offer a CSV format for extracts, in addition to the STATA/SPSS/SAS formats.

After the download, playing with the dataset may begin.

df <- read.csv("IPUMS/cps_00003.csv")

> head(df[,1:6])
  YEAR SERIAL HWTSUPP ASECFLAG HFLAG OWNERSHP
1 2012      3  540.78        1    NA       10  
2 2012      5  569.43        1    NA       10  
3 2012      5  569.43        1    NA       10  
4 2012      5  569.43        1    NA       10  
5 2012      8  576.29        1    NA       10  
6 2012      8  576.29        1    NA       10  

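I then performed some conversions (mostly turning coded columns into factors). The columns with an _f suffix used below, such as CLASSWKR_f, MARST_f and IND_f, are these factor versions.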
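The conversion code itself is not shown in this post. A minimal sketch of what one such conversion could look like, assuming the factor labels come straight from the CPS data dictionary entries for CLASSWKR; the toy data frame and the label vector are illustrative, not the original code:

```r
# Toy stand-in for the IPUMS extract (values are illustrative only)
toy <- data.frame(CLASSWKR = c(10, 13, 14, 21, 13))

# Code -> label mapping, a subset of the CLASSWKR data dictionary
classwkr_labels <- c(
  "10" = "Self-employed",
  "13" = "Self-employed, not incorporated",
  "14" = "Self-employed, incorporated",
  "21" = "Wage/salary, private"
)

# Build the factor column: numeric codes become readable labels
toy$CLASSWKR_f <- factor(toy$CLASSWKR,
                         levels = as.integer(names(classwkr_labels)),
                         labels = unname(classwkr_labels))

table(toy$CLASSWKR_f)
```

The same pattern would apply to MARST, IND and the other coded columns, each with its own label mapping from the data dictionary.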

Let's load dplyr for further processing. Discretizing age into brackets (breaks at 15, 25, 35 and 50) will pay dividends later; ignore it for the time being. According to the CPS data dictionary, CLASSWKR values 10, 13 and 14 signify self-employment, as shown below -

  • 00 = "NIU"
  • 10 = "Self-employed"
  • 13 = "Self-employed, not incorporated"
  • 14 = "Self-employed, incorporated"
  • 20 = "Works for wages or salary"
  • 21 = "Wage/salary, private"
  • 22 = "Private, for profit"
  • 23 = "Private, nonprofit"
  • 24 = "Wage/salary, government"
  • 25 = "Federal government employee"
  • 26 = "Armed forces"
  • 27 = "State government employee"
  • 28 = "Local government employee"
  • 29 = "Unpaid family worker"
  • 99 = "Missing/Unknown"

# Discretize age into brackets; ages above 50 fall outside the breaks and become NA
df$AGE_f2 <- cut(df$AGE, breaks = c(15,25,35,50))

library(dplyr)
# Self-employed: CLASSWKR codes 10, 13 and 14
self_employed <- df %>%
                  filter(CLASSWKR %in% c(10, 13, 14))
str(self_employed)  
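
A note on the age brackets: cut() builds intervals that are open on the left and closed on the right by default, so with breaks at 15, 25, 35 and 50 an age of exactly 15 and any age above 50 get no bracket and become NA. A small sketch on a hypothetical vector of ages shows the effect:

```r
# Hypothetical ages chosen to hit the bracket boundaries
ages <- c(15, 16, 25, 26, 35, 50, 51, 85)

# Same breaks as used for AGE_f2
brackets <- cut(ages, breaks = c(15, 25, 35, 50))

# Side-by-side view of each age and the bracket it lands in
data.frame(age = ages, bracket = brackets)
```

This is why an NA row shows up when results are grouped by AGE_f2: it collects everyone older than 50.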

How does age vary across the spectrum of self-employment or entrepreneurship?

library(ggplot2)  
library(ggthemes)  
ggplot(data=self_employed, aes(x = AGE, fill = CLASSWKR_f)) +  
    geom_histogram(binwidth = 1) +
    theme_economist()

summary(self_employed$AGE)  

Here are the summary statistics for AGE:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  
15.00   39.00   49.00   48.64   58.00   85.00  

So on average, and as is evident from the histogram, the typical age of self-employed people revolves around 48-50 (not 22-25, as TechCrunch wants you to believe).

This makes sense.

How does marriage affect self-employment?

A simple horizontal bar plot will be good for the visualization, as we'll inevitably be comparing the values.

self_employed %>%
    arrange(MARST_f) %>%
    ggplot(aes(x = MARST_f)) +
    # after_stat(count) is the current spelling of the older ..count.. notation
    geom_bar(aes(fill = after_stat(count))) +
    labs(x = "Marital status", y = "# of people in self-employment") +
    coord_flip() +
    theme_economist()

Guys and gals, stay married. Beyond the beautiful gift of love and companionship, it works in favour of your future self-employment.

How do marriage and age together affect self-employment, as a combined force?

A series of box-plots (each showing distribution of age) for various marital categories will be a good start for comparison.

ggplot(data=self_employed, aes(x = MARST_f, y = AGE)) +  
    geom_point(color = "firebrick") + geom_jitter(color = "deepskyblue2") +
    geom_boxplot() +
    labs(x = "Marital status", y = "Age") +
    coord_flip() +
    theme_economist()

A couple of observations from this visualization -

  • Married (spouse present) self-employed people have a median age of around 50. Pretty late, and it's expected.
  • Widowed people are their own boss well after the age of 65. Hats off.
  • Unmarried (or single) people are in self-employment at the earliest ages (with a median age below 40) compared to people with other marital statuses.
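
The medians above are read off the box plots. They could also be computed directly; here is a minimal sketch on a hypothetical toy data frame (the marital-status labels and ages are made up, only the column names MARST_f and AGE follow this post):

```r
library(dplyr)

# Hypothetical toy rows; labels and ages are illustrative only
toy <- data.frame(
  MARST_f = c("Married, spouse present", "Married, spouse present",
              "Married, spouse present", "Widowed", "Widowed",
              "Never married/single", "Never married/single"),
  AGE     = c(45, 50, 58, 66, 72, 28, 41)
)

# One row per marital status: group size and median age, youngest first
median_by_status <- toy %>%
  group_by(MARST_f) %>%
  summarise(n = n(), median_age = median(AGE)) %>%
  arrange(median_age)

median_by_status
```

Running the same pipeline on self_employed instead of the toy data would give the exact medians behind the three observations.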

Which industries attract the most self-employment?

industry_summary <- self_employed %>%  
                      group_by(IND_f) %>%
                      summarise(count = n(), age = mean(AGE)) %>%
                      arrange(-count) %>%
                      slice(1:15)
> head(industry_summary)

Source: local data frame [6 x 3]

                                                     IND_f count      age
                                                    (fctr) (int)    (dbl)
1                                             Construction  6386 46.63263  
2                                              Real estate  1646 51.93985  
3                                        Animal production  1563 54.34229  
4                                  Child day care services  1302 43.19355  
5 Management, scientific and technical consulting services  1236 51.46036  
6                      Restaurants and other food services  1155 47.37229  

Let's invoke ggplot.

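industry_summary %>%
  # fill by count: industry_summary has no count_f column, so use count itself
  ggplot(aes(x = IND_f, y = count, fill = count)) +
    geom_bar(stat = "identity") +
    # labels are shares within the 15 industries kept by slice(1:15)
    geom_text(aes(y = count,
                  label = paste(round(count*100/sum(count),2), "%"))) +
    # continuous counterpart of the "Blues" brewer palette
    scale_fill_distiller(palette = "Blues", direction = 1) +
    labs(x = "Industries", y = "# of people in self-employment") +
    coord_flip() +
    theme_economist() +
    theme(legend.position = "none")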
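
One caveat about the percentage labels: because sum(count) is evaluated on the summary after slice(1:15), each label is a share of the top 15 industries, not of all self-employed people. A hedged sketch of the alternative, computing the share before slicing, on a hypothetical toy data frame (the rows are made up, and the slice is shortened to 2 for illustration):

```r
library(dplyr)

# Hypothetical toy rows: one row per self-employed person
toy <- data.frame(
  IND_f = c("Construction", "Construction", "Construction",
            "Real estate", "Real estate", "Animal production")
)

shares <- toy %>%
  count(IND_f, name = "count") %>%             # people per industry
  mutate(share = 100 * count / sum(count)) %>% # share of ALL rows
  arrange(desc(count)) %>%
  slice(1:2)                                   # keep the top industries only

shares
```

With the share computed first, the labels of the kept industries no longer add up to 100%, which makes clear how much of the population the plot leaves out.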

Yeah. So it looks like, beyond the Reality Distortion Field of Silicon Valley, good old businesses still rule:

  • Construction
  • Real estate
  • Animal production
  • Child day care services

How does total household income influence the age of self-employment? Indirectly: how do the privileged kids play?

First, let's inspect the entire population of self-employed people aged over 15. HHINCOME is the total household-level income (the sum of the personal incomes [INCTOT column] of all members of a given household).

library(RColorBrewer)  
palette_colors <- c("#9ECAE1", "#6BAED6", "#4292C6", "#2171B5")  
self_employed %>%  
    filter(AGE > 15) %>%
      ggplot(aes(x = HHINCOME)) +
      geom_histogram(aes(fill = AGE_f2), binwidth = 5000) +
      scale_fill_manual(values = palette_colors) +
      scale_x_continuous(labels = scales::dollar) +
      coord_cartesian(xlim = c(0,500000)) +
      theme_economist()

> mean(df$HHINCOME)
84880.6360159081  

Looks a bit cluttered to me. Let's try a density plot (geom_density) instead.

self_employed %>%  
  filter(AGE > 15) %>%
    ggplot(aes(x = HHINCOME)) +
    geom_density(aes(color = AGE_f2), size = 2) +
    scale_color_manual(values = palette_colors) +
    scale_x_continuous(labels = scales::dollar) +
    coord_cartesian(xlim = c(0,500000)) +
    theme_economist()

Let's inspect those persons who are children of the household head. This will dig out the privileged kids. They can easily be extracted using the RELATE column: RELATE = 301 | 303 signifies a parent-child relationship where the household head is the parent.

palette_colors_highlight <- c("firebrick", "#6BAED6", "#4292C6", "#2171B5")  
self_employed %>%  
  filter(AGE > 15) %>%
  filter(RELATE == 301 | RELATE == 303) %>%
    ggplot(aes(x = HHINCOME)) +
    geom_density(aes(color = AGE_f2), size = 2) +
    scale_color_manual(values = palette_colors_highlight) +
    scale_x_continuous(labels = scales::dollar) +
    coord_cartesian(xlim = c(0,500000)) +
    theme_economist()

Something interesting! Look carefully at the density graph for the 15-25 age group, highlighted in red. Notice the flat, "long-tail" nature of the graph.

In a skewed distribution with a long tail, the mean is pulled away from the median, towards the tail.

That sensitivity is useful here: the mean picks up the high-income households sitting in the tail, so it is still a good measure for comparing groups. Let's calculate some means grouped by age bracket.
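
To see how a long right tail pulls the mean away from the median, here is a tiny sketch on a hypothetical vector of household incomes (the numbers are made up for illustration):

```r
# Hypothetical household incomes: five ordinary values and one in the long tail
incomes <- c(30000, 40000, 50000, 60000, 70000, 900000)

# The single large value drags the mean far above the median
c(mean = mean(incomes), median = median(incomes))
```

A single very high income is enough to lift the mean well above the median, which is why a group mean reacts strongly to a handful of wealthy households.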

self_employed %>%  
    filter(AGE > 15) %>%
    group_by(AGE_f2) %>%
    summarize(mean_house = mean(HHINCOME, na.rm = TRUE), 
              mean_personal = mean(INCTOT, na.rm = TRUE))

   AGE_f2 mean_house mean_personal
   (fctr)      (dbl)         (dbl)
1 (15,25]   84037.82      22671.72  
2 (25,35]   87596.49      43584.75  
3 (35,50]  111096.61      60378.59  
4      NA  114710.42      67122.33  

Now let's perform the identical aggregation for the group where the person is a child of the household head (aka privileged kid).

   AGE_f2 mean_house mean_personal
   (fctr)      (dbl)         (dbl)
1 (15,25]  110611.70      15221.19  
2 (25,35]  121524.75      32108.54  
3 (35,50]   91837.06      29378.92  
4      NA   78753.95      35913.70  
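
The code for this second aggregation is not shown. A minimal sketch, assuming it is the earlier aggregation with the RELATE filter (301 or 303) added; the toy data frame below is hypothetical and only mimics the column names of the extract:

```r
library(dplyr)

# Hypothetical toy rows standing in for self_employed; numbers are made up
toy <- data.frame(
  AGE      = c(22, 24, 30, 45, 23, 60),
  RELATE   = c(301, 101, 303, 101, 301, 101),
  HHINCOME = c(120000, 40000, 90000, 70000, 100000, 80000),
  INCTOT   = c(15000, 30000, 35000, 60000, 10000, 50000)
)
toy$AGE_f2 <- cut(toy$AGE, breaks = c(15, 25, 35, 50))

# Same grouping as before, restricted to children of the household head
kids_summary <- toy %>%
  filter(AGE > 15, RELATE %in% c(301, 303)) %>%
  group_by(AGE_f2) %>%
  summarise(mean_house = mean(HHINCOME, na.rm = TRUE),
            mean_personal = mean(INCTOT, na.rm = TRUE))

kids_summary
```

Swapping toy for self_employed should reproduce a table of the same shape as the one above, provided the assumption about the filter holds.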

Observe that with this segmentation, the mean household income of the (15,25] age group suddenly jumps by about 32% to $110,611, even though mean personal income actually comes down by about 33%.
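
The percentage changes for the (15,25] bracket can be checked directly from the two tables. A small sketch, where pct_change is a helper name introduced here for illustration:

```r
# Percentage change from an old value to a new value
pct_change <- function(new, old) 100 * (new - old) / old

# (15,25] bracket: all self-employed vs. children of the household head
house    <- pct_change(110611.70, 84037.82)  # mean household income
personal <- pct_change(15221.19, 22671.72)   # mean personal income

round(c(household = house, personal = personal), 1)
```

Household income for the bracket rises by roughly a third while personal income falls by roughly a third, which is the contrast the comparison rests on.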

Let's investigate a bit more through visualization -

self_employed %>%  
  filter(AGE > 15) %>%
    ggplot(aes(x = HHINCOME)) +
    geom_histogram(aes(fill = CLASSWKR_f), binwidth = 10000) +    
    facet_grid(AGE_f2 ~ .) +
    scale_fill_manual(values = palette_colors_highlight) +
    scale_x_continuous(labels = scales::dollar) +
    coord_cartesian(xlim = c(0,500000)) +
    theme_economist()

We have split the distributions by age bracket. The top distribution, for the (15,25] bracket, looks pretty feeble.
Now, let's add the magic filtering criterion (the privileged kid criterion).

self_employed %>%  
  filter(AGE > 15) %>%
  filter(RELATE == 301 | RELATE == 303) %>%
    ggplot(aes(x = HHINCOME)) +
    geom_histogram(aes(fill = CLASSWKR_f), binwidth = 5000) +
    facet_grid(AGE_f2 ~ .) +
    scale_fill_manual(values = palette_colors_highlight) +
    scale_x_continuous(labels = scales::dollar) +
    coord_cartesian(xlim = c(0,500000)) +
    theme_economist()

Observe how much stronger the (15,25] age bracket looks relative to the other brackets in this view.

This hints:

Younger people (age: 15-25) with a high household income have a higher chance of being self-employed.


Conclusion:

Based on this dataset, coupled with my own life experience: please don't commit too much to startups when you are young, unless of course you hail from a privileged or wealthy background (a business family is the icing on the cake). Give it a shot; if it fails, pivot a few times quickly and learn as much as you can by documenting why it didn't work. Share it with others (Quora, a blog etc). Don't get burned out too quickly by working on someone else's startup. Beyond that, focus on a topic you love, are really passionate about, or that makes time freeze when you work on it. (ACID test: imagine you have built a successful company based on this topic. In typical boardroom politics, the Board of Directors kicks you out of your own company, citing some crazy reasons. It's 2 pm. You are sitting alone in your apartment. After the expected "low" of the last few days, you are again feeling passionate about the topic and feeling the itch to do something with it - not necessarily a venture.)

You don't necessarily have to work in a typical corporate environment, but if you do, focus more on relationship building and social interactions, even if you are a hardcore geek or nerd. Otherwise do something to pay the bills while gaining immense knowledge and experience in your chosen topic on the side. The 10,000-hour rule still works for all technical subjects. Just focus on becoming a well-known expert on that topic. Do whatever it takes to achieve that (do a PhD, Masters, CFA, MBA; get a dozen relevant certifications; participate in online competitions - I don't care). Build a solid Knowledge Portfolio of your work. Do yourself a favour by not reading any media posts/articles with strong Survivorship Bias (e.g. TechCrunch).

Meanwhile, hire an Investment Advisor and start building your Investment Portfolio. A good mix of stocks/ETFs, CDs and REITs/real estate will do. Build an emergency fund (6 x monthly gross salary should be enough) and set it aside. Monitor the portfolio with the help of your advisor. Try learning investment basics and equity research techniques (screening etc). Your Knowledge Portfolio and Investment Portfolio should grow side by side.

Also hire a Nutritionist and a Fitness Coach (consider this an investment in your Personal Health Portfolio). Start building your health slowly. A fit mind needs a fit body. This is super-critical to remaining in an optimally healthy state.

If you feel the itch, you may take another shot at entrepreneurship or self-employment before getting married. Try the consultancy route, utilizing the Knowledge Portfolio you have built so far. Don't listen to experts saying consultancy is a dead end. You will learn lots of valuable skills from consultancy experience - client interactions, project delivery, client satisfaction, the diverse problems faced by clients etc.

Get married and enjoy the love and companionship. Enjoy life while building three portfolios simultaneously (Knowledge, Investment and Personal Health).

The more your portfolios grow, the better positioned you will be to take multiple shots at self-employment or entrepreneurship. Some people drop the idea completely as they age. That's fine too. After all, it's your life. But if you do retain the itch, tap into your chosen topic and the associated Knowledge Portfolio to harvest new ideas for your venture. Start slow with a business model, outsource all non-critical parts and pivot quickly, documenting what you learn. Repeat until successful. Remember: the average age of self-employed people in the CPS dataset is close to 50. Life can literally start at 50, as long as you are mentally and physically fit.