Intro

I'm reading a lot of startup articles/blogposts these days. Many of them describe difficulties around finding the right people, and difficulties about building cost-effective teams. This is what this blogpost will focus on.

The problem of selecting candidates to form a team has been studied more thoroughly 1 and has been modeled in different ways.

This blogpost will focus on finding optimal teams while at the same time complying with some constraints depending on the needs for that particular team.

The scenarios described are those where a business owner wants to staff a project according to business needs and automatically select the initial team members from a talent pool of qualified candidates.

Overview

All throughout this blogpost we'll first build a candidate pool(using a PostgreSQL query) and then we'll use the GLPK modeling language (GNU Linear Programming Toolkit) to find an optimal team while taking into consideration constraints and goals/objectives for that particular situation.

The SQL query will find the right candidates for the team and it will write its output in a CSV file. The CSV file will then be used by the GLPK model to compute an optimal team.

Below, there's a diagram describing the composition of each type of team we'll cover

Web app team (minimizing the cost, and allow multiple roles for one team member)

In this example, we're aiming to build a team comprised of the following roles:

  • backend
  • frontend
  • web designer
  • sysadmin

In this particular case we want to minimize the cost. Some of the candidates that make their way into the solution might fill both roles (which they actually do when these programs are run on real data), such as one candidate might fill the roles {frontend,backend} , which sounds a lot like fullstack, or similarly, a candidate might fill the roles {sysadmin,backend} which is close to the definition of devops.

Building the candidate pool

\f ','
\pset footer off
\o '../modules/team-solver-3-exp.csv'
WITH candidates AS (
    SELECT *
    FROM odesk_freelancer
    WHERE tzoffset IS NOT NULL AND bill_rate > 0
), backend AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'backend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('backend & developer & (php | nodejs | python)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), frontend AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'frontend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('frontend & developer & (javascript | angular | backbone | reactjs)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), designer AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'design'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('web & designer & html & (css | photoshop | illustrator)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), sysadmin AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'sysadmin'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(linux | centos | redhat | debian | ubuntu) & server & (sysadmin | (system & admin) | (system & administrator) | (system & administration))')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
)
SELECT
freelancer_id,
rate,
role,
tzoffset
FROM (
    SELECT * FROM frontend
    UNION ALL
    SELECT * FROM backend
    UNION ALL
    SELECT * FROM designer
    UNION ALL
    SELECT * FROM sysadmin
) x;
\o

Finding the optimal team

## run with
##    glpsol --math team-solver-1.mod
##

param budget;
set F, dimen 4;
set SKILLS;
table data IN "CSV" "team-solver-3-exp.csv": F <- [freelancer_id,rate,role,tzoffset];

/* Indices */
set I := setof{(f,r,o,t) in F} f;

/* Assignment */
var a{i in I}, binary;

/*
   Create another set of tuples containing only the freelancer_id and the rate.
   Each freelancer will only appear once (since this is a set), and this is important
   because although he may have multiple roles, his rate is only added once
   to the total cost.
*/ 
set B := setof{(f,r,o,t) in F} (f,r);

/* The set B is used(instead of F) in order to avoid the wrong cost by adding a freelancer's rate twice
   because he has two skills */
minimize cost: sum {i in I, (f,r) in B: i == f} r * a[i];

/* 
   Constraint 1: cover all skills 
   (each skill is covered by at least one freelancer)
*/

s.t. c1 {s in SKILLS}:
	(sum {(f,r,o,t) in F: s == o} a[f]) >= 1;

/* Constraint 2: stay within budget */
s.t. c8 :
	sum{(f,r,o,t) in F, i in I: i == f} r * a[f] <= budget;

solve;

printf "Freelancers picked:\n";

# go over freelancers that have at least one skill active
# and print their skills
for {i in I: a[i] == 1} {
   printf "%s ", i;
   printf "[";
   printf {(f,r,o,t) in F: f == i} " %s", o;
   printf " ]";
   printf "\n";
}
printf "\n";

printf "Optimal cost:\n";
printf "%.2f", cost;
printf "\n";

data;

# project budget
param budget := 100;
set SKILLS := 
	frontend
	backend
	sysadmin
	design;

end;

Web app team (team members have separate roles, the cost is minimized)

This will be a similar situation to the one described above. The difference here, is we'll have the following roles:

  • 2 x backend
  • 2 x frontend
  • 2 x web designer
  • 1 x sysadmin
  • 1 x project manager

Here are the constraints we'll use here:

  • one candidate in the solution covers exactly one role (we won't allow for candidates to cover multiple roles like we did in the previous case)
  • the total cost(the sum of the hourly rates for each member of the team) is capped by the budget
  • we want all the team members to be within ± 2h of UTC+00 so they can attend a daily meeting
  • we want the total cost of the team minimized

Building the candidate pool

\f ','
\a
\pset footer off
\o 'team-solver-1-exp.csv'
WITH candidates AS (
    SELECT *
    FROM odesk_freelancer
    WHERE tzoffset IS NOT NULL AND bill_rate > 0
), backend AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'backend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('backend & developer & (php | nodejs | python)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), frontend AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'frontend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('frontend & developer & javascript')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), designer AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'design'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('web & designer & css & html')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), sysadmin AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'sysadmin'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(linux | centos | redhat | debian | ubuntu) & server & (sysadmin | (system & admin) | (system & administrator) | (system & administration))')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), project_manager AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'project_manager'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(manager | management) & team & scrum')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
)
SELECT
-- row_number() over () AS freelancer_id,
freelancer_id,
rate,
role,
tzoffset
FROM (
    SELECT * FROM frontend
    UNION
    SELECT * FROM backend
    UNION
    SELECT * FROM designer
    UNION
    SELECT * FROM sysadmin
    UNION
    SELECT * FROM project_manager
) x;

Finding the optimal team

## run with
##    glpsol --math team-solver-1.mod
##
## use-case: tyical web development team

param budget;
param tz_max;
param tz_ref;
set F, dimen 4;
table data IN "CSV" "team-solver-1-exp.csv": F <- [freelancer_id,rate,role,tzoffset];


# indices
set I := setof{(f,r,o,t) in F} f;

# assignment
var a{I}, binary;



minimize cost:
	sum{(f,r,o,t) in F} r * a[f];



## two backend
s.t. c1 :
	sum{(f,r,'backend',t) in F} a[f] = 2;

## two frontend
s.t. c2 :
	sum{(f,r,'frontend',t) in F} a[f] = 2;

## two designers
s.t. c3 :
	sum{(f,r,'design',t) in F} a[f] = 2;

## one sysadmin
s.t. c4 :
	sum{(f,r,'sysadmin',t) in F} a[f] = 1;

## one project manager
s.t. c5 :
	sum{(f,r,'project_manager',t) in F} a[f] = 1;

## timezone constraints
s.t. c6 {(f,r,o,t) in F}: (t - tz_ref) * a[f] <= tz_max;
s.t. c7 {(f,r,o,t) in F}: (tz_ref - t) * a[f] <= tz_max;

## stay within budget
s.t. c8 :
	sum{(f,r,o,t) in F} r * a[f] <= budget;


solve;

printf "Freelancers picked:\n";
printf {(f,r,o,t) in F: a[f] == 1} " %s", f;
# printf {(f,r,o,t) in F: a[f] == 1} " %i", f;
printf "\n";

printf "Optimal cost:\n";
printf "%.2f", cost;
printf "\n";

data;

# project budget
param budget := 100;
param tz_max := 7200;
param tz_ref := 0;

end;

Sysadmin team (covering a 24h cycle)

This is a similar situation, except now we want to select sysadmins to take care of the site reliability/operations area of a web product owned by a business.

If the product has a global userbase then it needs support all throughout the 24 hour day.

This is usually modeled by allocating team members to work in shifts to cover parts of the 24 hour business day, but we won't do that.

Since this is a distributed team, the actual advantage here is people can work on their local 09:00-17:00 but at the same time they can still cover the business need of offering support for the product all across the 24h cycle.

So we split the 24h day into 3 intervals relative to UTC:

  • 00:00-08:00
  • 08:00-16:00
  • 16:00-24:00

After this, we'll find 2 sysadmins whose local 09:00 is 00:00 UTC, 2 sysadmins whose local 09:00 is 08:00 UTC and 2 sysadmins whose local 09:00 is 16:00 UTC 2. We will keep this constraint as well as minimizing the total cost of the team.

Building the candidate pool

\f ','
\a
\pset footer off
\o 'team-solver-2-exp.csv'
WITH candidates AS (
    SELECT *
    FROM odesk_freelancer
    WHERE tzoffset IS NOT NULL AND bill_rate > 0
), sysadmin AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'sysadmin'::text AS role,
    -- their local 09:00 in UTC as seconds elapsed since 00:00 UTC
    -- (the start of their workday)
    (((9 * 3600) - tzoffset + (24 * 3600)) % (24 * 3600)) AS utc_start_hour
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(linux | centos | redhat | debian | ubuntu) & server & (sysadmin | (system & admin) | (system & administrator) | (system & administration))')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate, freelancer_id
    LIMIT 400
)
SELECT
freelancer_id,
rate,
role,
utc_start_hour AS start_hour
FROM sysadmin;

Finding the optimal team

## Run with
##    glpsol --math team-solver-1.mod
##
## use-case: sysadmin covering the 24hour business day
##
## timezone offsets to cover
## intervals
## - 00:00-08:00
## - 08:00-16:00
## - 16:00-24:00
##
## (the respective timezine offsets are 0, 28800,57600 for the start
##  of each interval)
##
## These are intervals in GMT+0 so they will have to be adapted to the
## freelancer's schedule (which would like to work 09:00-17:00 on his
## local timezone)
##
## The solution will be comprised of 3 groups, each covering a third 
## of the 24-hour cycle.
##
## Exactly two sysadmins will be in each group.
##

param budget;
param tz_delta;
set OFFSETS;
set F, dimen 4;
table data IN "CSV" "team-solver-2-exp.csv": F <- [freelancer_id,rate,role,start_hour];

/* freelancer indices */
set I := setof{(f,r,o,t) in F} f;

/* Assignment */
var a{i in I, t in OFFSETS}, binary;

minimize cost:
	sum{(f,r,o,t) in F, s in OFFSETS} r * a[f,s];


/* Constraint 1: each group needs exactly 2 sysadmins */
s.t. c1 {t in OFFSETS}:
	sum {i in I} a[i,t] = 2;

/* Constraint 2: each sysadmin is assigned to at most one of the 3 groups */
s.t. c2 {i in I}:
	sum {s in OFFSETS} a[i,s] <= 1;

/*
   Constraint 3: a sysadmin is a viable candidate for an interval
   if his timezone matches that interval.

   We want to ensure that |t-s| < tz_delta
*/

# t - s < tz_delta
s.t. c3 {(f,r,o,t) in F, s in OFFSETS}: 
	 a[f,s] * (s - t) <= tz_delta;

# s - t < tz_delta
s.t. c4 {(f,r,o,t) in F, s in OFFSETS}: 
	 a[f,s] * (t - s) <= tz_delta;

solve;

printf "Sysadmins picked:\n";
printf {(f,r,o,t) in F, s in OFFSETS: a[f,s] == 1} "%i => %s \n", s, f;
printf "\n";


printf "Optimal cost:\n";
printf "%.2f", cost;
printf "\n";


data;
param budget := 100; # project budget
set OFFSETS  := 0 28800 57600;
param tz_delta := 3600;

end;

Conclusion

The same type of reasoning can be adapted to build other types of teams too:

  • game development teams
  • global call-center/customer support teams
  • teams developing mobile apps

Since the profiles are acquired from an online job market, we have several other features for each profile:

  • the candidate's rating by previous clients he's worked with
  • the candidate's total work hours, and work hours spent on certain types of projects
  • the location (city/country) of the candidate
  • the candidate's spoken language
  • the candidate's availability (20-30h/week, 40h/week)

We could've also made constraints using these features and fine-tune our search in various other ways.

While searching for the best team, the fullstack and devops roles arise naturally because of certain constraints(candidates with multiple skills can be selected, they fill multiple constraints and contribute to minimizing the total cost). We've looked at some ways of building distributed teams automatically using candidate profiles from a database.

Footnotes:

1

There's an interesting paper I came across on arxiv that describes a hiring game and relates it to the set cover problem In that same paper, a greedy set-cover approximation algorithm is presented.

2

Usually teams that work in shifts would also need a supervisor, and they also need overlap to ensure issues are communicated and synchronized with the next shift. These can be modeled as well. The timezone overlap is captured very well in this small app.