About a year ago I came across this article by the brilliant Maxime Beauchemin (who has slowly become one of my favorite speakers at coding talks) describing the rise of a brand new career in the software development field - the Data Engineer. At the time it seemed like an interesting venture, but I hadn't put much thought into it. Now that I am closer to graduation, I've realized this is exactly what I want to do (despite having read his follow-up). The trouble is that I don't remember taking a course on "Data Engineering". Sure, I imagined the job was a combination of my Big Data and Database Management courses, but I figured I would take some time to see what people with the job recommended.
As with most posts like this, the first thing we want to do is define what Data Engineering is. This O'Reilly article defines a Data Engineer as someone who has specialized their skills in creating software solutions around big data, and who uses something like 10-30 different technologies (a Jesse Anderson staple!) to develop data pipelines. Robert Chang also has an incredible and in-depth three-part beginner's guide to data engineering that goes over the details of what is needed to prepare for the job. Finally, Nate Kupp wrote a short piece on what setting up a data infrastructure looks like. This handful of articles has led me to understand that a data engineer's job is to efficiently and effectively funnel data from several databases into one larger database for Data and Business analysts to glean information from. Now that I've got a strong understanding of what the job is, it's time to figure out the skills I need to get the job. To Indeed!
Instinctively, the first thing I did after typing "Data Engineer" into Indeed was grab a pen and paper and start writing down the skills needed for the first couple of job listings. By the time I made it to the third listing I realized what a mistake I was making. Why manually search for skills when I could automate it? I first began looking for an API that would give me JSON data on job listings. Indeed has a closed API, so I had to create an account. After thirty minutes of waiting for them to "approve" my account I remembered there's more than one way to skin a cat. I decided a quick Beautiful Soup powered script would help.
import requests
from bs4 import BeautifulSoup

job_urls = []
page_multiplier = 10  # each results page advances the `start` parameter by 10

# Collect the job links from each page of search results
for page in range(4):
    indeed_search = ('https://www.indeed.com/jobs?q=%22Data+Engineer%22&start='
                     + str(page * page_multiplier))
    response = requests.get(indeed_search)
    soup = BeautifulSoup(response.content, "html.parser")
    job_titles = soup.find_all('a', class_="turnstileLink")
    for job in job_titles:
        job_urls.append(job['href'])

print("Url list has", len(job_urls), "urls")

# Visit every job page and append its summary text to one file
with open("indeed_data.txt", "a") as f:
    for job_page in job_urls:
        try:
            job_url = 'https://www.indeed.com' + job_page
            response = requests.get(job_url)
            soup = BeautifulSoup(response.content, "html.parser")
            output = soup.find(class_="summary").text
            f.write(output)
        # captures ill-formed job page listings (usually ad related)
        except AttributeError:
            print("Something wrong with url:", job_page)

print("Completed the scrape!")
This little script scrapes the first four pages of an Indeed job search and writes the text of every job summary into a single text file for parsing.
from stop_words import stop_words  # assumed to be a local module holding the stop words
import string

word_dict = {}
with open("indeed_data_python.txt", "r") as f:
    for line in f:
        # Normalize the line: drop the newline, punctuation, and casing
        line = line.strip("\n")
        line = ''.join(c for c in line if c not in string.punctuation)
        line = line.lower()
        word_list = line.split(" ")
        for word in word_list:
            if word and word not in stop_words:
                if word not in word_dict:
                    word_dict[word] = 1
                else:
                    word_dict[word] += 1

# Sort the words by count, most frequent first, and show the top 100
top_100_list = [(word, word_dict[word]) for word in sorted(word_dict, key=word_dict.get, reverse=True)]
print(top_100_list[:100])
The first issue I ran into was the words coming up as the most used. Words like 'data', 'experience', and 'quality' topped the list, and they are completely useless for figuring out which skills to learn. Time to dial down my scope a bit. Instead of a "stop words" approach, which counts everything except a given set of stop words, I switched to a "start words" approach, which counts only the words that match a given list. I built that list from MasterInDataScience.org's Data Engineering skills segment and ran the second script again with it. This time the output was much more useful.
Top Data Engineer Skills (70 unique jobs)
- Analysis
- SQL
- Python
- Warehouse
- Hadoop/Spark
Top Python Data Engineer Skills (64 unique jobs)
- Python (Obviously)
- SQL
- Database/Warehouse
- Java
- Linux
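For anyone curious, the "start words" counting described above might look something like the sketch below. This is not the exact script I ran: the `START_WORDS` set, the `count_start_words` function, and the sample lines are made up for illustration, and the real list of skills was much longer.

```python
import string
from collections import Counter

# Hypothetical start words; the real list came from an external skills page.
START_WORDS = {"python", "sql", "java", "hadoop", "spark", "linux"}


def count_start_words(lines, start_words=START_WORDS):
    """Count only the words that appear in start_words."""
    # Translation table that deletes every punctuation character
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        words = line.lower().translate(strip_punct).split()
        counts.update(word for word in words if word in start_words)
    return counts


# Made-up sample text standing in for the scraped job summaries
sample = [
    "Strong SQL and Python skills required.",
    "Experience with Python, Spark, and Linux.",
]
top_skills = count_start_words(sample).most_common(3)
print(top_skills)
```

One caveat with this approach: stripping punctuation means skills like "C++" or "Node.js" collapse into "c" and "nodejs", so the start words have to be written in that stripped form to match.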