Independently Learning Data Engineering

Jun 15, 2018

About a year ago I came across this article by the brilliant Maxime Beauchemin (who has slowly become one of my favorite speakers at coding talks) describing the rise of a brand new career in software development: the Data Engineer. At the time it seemed like an interesting venture, but I hadn't put much thought into it. Now that I am closer to graduation, I've realized this is exactly what I want to do (despite having read his follow-up), but I don't remember taking a course on "Data Engineering". Sure, I imagined the job was a combination of my Big Data and Database Management courses, but I figured I would take some time to see what people with the job recommended.

As with most posts like this, the first thing we want to do is define what Data Engineering is. This O'Reilly article defines a Data Engineer as someone who has specialized their skills in creating software solutions around big data, and who uses something like 10-30 different technologies (a Jesse Anderson staple!) to develop data pipelines. Robert Chang also has an incredible and in-depth three-part beginner's guide to data engineering that goes over the details of what is needed to prepare for the job. Finally, Nate Kupp wrote a short piece on what setting up a data infrastructure looks like. This handful of articles has led me to understand that a data engineer's job is to efficiently and effectively funnel data from several databases into one larger database for data and business analysts to glean information from. Now that I've got a strong understanding of what the job is, it's time to figure out the skills I need to get the job. To Indeed!

Instinctively, the first thing I did after typing "Data Engineer" into Indeed was grab a pen and paper and start writing down the skills needed for the first couple of job listings. By the time I made it to the third listing I realized what a mistake I was making. Why manually search for skills when I could automate it? I first began looking for an API that would give me job data as JSON. Indeed has a closed API, so I had to create an account. After thirty minutes of waiting for them to "approve" my account I remembered there's more than one way to skin a cat. I decided a quick Beautiful Soup-powered script would do the job.

import requests
from bs4 import BeautifulSoup

job_urls = []
page_multiplier = 10

for pages in range(4):
    # Indeed pages its results through the "start" offset (0, 10, 20, ...)
    indeed_search = ('https://www.indeed.com/jobs?q=%22Data+Engineer%22&start='
                     + str(pages * page_multiplier))

    response = requests.get(indeed_search)
    soup = BeautifulSoup(response.content, "html.parser")
    job_titles = soup.find_all('a', class_="turnstileLink")

    for job in job_titles:
        job_urls.append(job['href'])


print("Url list has", len(job_urls), "urls")

with open("indeed_data.txt", "a") as f:
    for job_page in job_urls:
        try:
            job_url = 'https://www.indeed.com' + job_page
            response = requests.get(job_url)
            soup = BeautifulSoup(response.content, "html.parser")
            output = soup.find(class_="summary").text

            f.write(output)

        # captures ill-formed job page listings (usually ad related)
        except AttributeError:
            print("Something wrong with url:", job_page)


print("Completed the scrape!")

This little script scrapes the first four pages of an Indeed job search and writes the text of every job summary into a single text file for parsing.
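The paging in the scraper comes down to changing the start offset in the search URL. Here is a minimal sketch of just that step, assuming ten results per offset step as in the script above. The build_search_urls helper and the RESULTS_PER_PAGE name are my own for illustration and are not part of the original script; nothing here touches the network.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"
RESULTS_PER_PAGE = 10  # assumption: mirrors page_multiplier in the scraper


def build_search_urls(query, pages):
    """Return one search URL per results page for an exact-phrase query."""
    urls = []
    for page in range(pages):
        # Wrapping the query in double quotes asks for the exact phrase
        params = {"q": f'"{query}"', "start": page * RESULTS_PER_PAGE}
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


for url in build_search_urls("Data Engineer", 4):
    print(url)
```

Letting urlencode handle the quoting avoids hand-writing escapes like %22, which makes it easier to swap in another query such as "Python Data Engineer".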

Parsing is where the results became less useful. I adapted an algorithm from one of my data pipeline MOOC projects to find the most commonly used words in the job listings.

# stop_words is assumed to be a local module exposing a collection named stop_words
from stop_words import stop_words
import string

word_dict = {}
with open("indeed_data_python.txt", "r") as f:
    for line in f:
        line = line.strip("\n")
        line = ''.join(c for c in line if c not in string.punctuation)
        line = line.lower()

        word_list = line.split(" ")

        for word in word_list:
            if word and word not in stop_words:
                # start new words at 0 so the first occurrence counts as 1
                if word not in word_dict:
                    word_dict[word] = 0

                word_dict[word] += 1


# sort the words by count, highest first, and show the 100 most common
top_100_list = [(word, word_dict[word]) for word in sorted(word_dict, key=word_dict.get, reverse=True)]
print(top_100_list[:100])

The first issue I ran into was the words coming up as the most used. Words like 'data', 'experience', and 'quality' topped the list, and they are completely useless for my purposes. Time to narrow my scope a bit. Instead of a "stop words" approach, which counts everything except a given list of stop words, I switched to a "start words" approach, which only counts words that match a given list. I built that list from MastersInDataScience.org's Data Engineering skills segment and ran the second script again against it. This time the output was much more useful.
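The "start word" idea can be sketched roughly like this. It is only a sketch: the START_WORDS set below is a made-up sample rather than the list I built from the skills article, and count_start_words is a hypothetical helper name.

```python
import string
from collections import Counter

# Hypothetical sample list; the real one came from the skills article
START_WORDS = {"sql", "python", "hadoop", "spark", "warehouse", "analysis"}


def count_start_words(lines, start_words=START_WORDS):
    """Count only the words that appear in start_words."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        # Lowercase, drop punctuation, then split on any whitespace
        words = line.lower().translate(strip_punct).split()
        counts.update(word for word in words if word in start_words)
    return counts


sample = [
    "Strong SQL and Python experience required.",
    "Experience with Spark, Hadoop, and SQL.",
]
print(count_start_words(sample).most_common())
```

One caveat with stripping punctuation this way: a token like "Hadoop/Spark" collapses into "hadoopspark" and would be missed, so replacing punctuation with spaces may be the safer choice for real listings.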

Top Data Engineer Skills (70 unique jobs)

  • Analysis
  • SQL
  • Python
  • Warehouse
  • Hadoop/Spark

Analysis? Vague, but I can somewhat work with that. I've spent the last year or so playing with SQL and Python, so those are in the hat as well. I'll have to do some more reading on what a "data warehouse" is (it's just another word for a database, right?). The only technologies I didn't have experience with were Hadoop and Spark, and I've considered them the 'final boss' of Data Engineering for a while. I'll have to work on a couple of projects on my own to get familiar with them. Time to run it again with the "Python Data Engineer" query.

Top Python Data Engineer Skills (64 unique jobs)

  • Python (Obviously)
  • SQL
  • Database/Warehouse
  • Java
  • Linux

Most of these make sense. Java is probably the most surprising to me. Either way, for my ideal job as a Python Data Engineer, it looks like I'm pretty well equipped. Time to work on these skills!

After going back and looking at the script, I realized there were a couple of improvements I could make if I wanted to keep using the scraper. The initial scraper could check for duplicate URLs, and the parser could use the nltk library to search for tech-related phrases instead of individual words in a dictionary.
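Both improvements can be roughed out with the standard library alone. This is a sketch under assumptions, not what the scraper currently does: dedupe_urls and count_phrases are hypothetical helpers, and the phrase counter is a plain whitespace-split stand-in for the tokenizing that a library like nltk would handle properly.

```python
from collections import Counter


def dedupe_urls(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def count_phrases(text, n=2):
    """Count n-word phrases in lowercased, whitespace-split text."""
    words = text.lower().split()
    # Zip the word list against shifted copies of itself to form n-grams
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


print(dedupe_urls(["/job/a", "/job/b", "/job/a"]))
print(count_phrases("data warehouse and data warehouse").most_common(1))
```

Counting two-word phrases would let "data warehouse" show up as a single skill instead of being split into "data" and "warehouse".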

Code for this blog post
Referenced URLs:
  1. https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603
  2. https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
  3. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
  4. https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7
  5. https://medium.com/@natekupp/getting-started-the-3-stages-of-data-infrastructure-556dac82e825
  6. https://www.crummy.com/software/BeautifulSoup/
  7. http://www.mastersindatascience.org/careers/data-engineer/
  8. https://www.nltk.org/