Introduction
In this project, we analyze a dataset containing information about layoffs across various companies. Using PostgreSQL, we will explore the "layoffs" dataset, which provides insights into the number of layoffs at different companies in 2022 and 2023.
What We Will Be Doing
The process will entail:
Setting up the Environment
Loading the Data
Data Cleaning
Data Exploration
Prerequisites
PostgreSQL: Ensure you have PostgreSQL installed on your machine. This will be the primary database management system used for this project.
pgAdmin: A graphical user interface for managing PostgreSQL databases, which will simplify database interactions and data visualization.
Basic SQL Knowledge: Familiarity with SQL concepts such as querying, data manipulation, and understanding of table relationships is recommended.
CSV File: Download the "layoffs.csv" file, which contains the dataset we will analyze.
Setting up the Environment
Okay, let’s go! First things first: open pgAdmin and connect to the PostgreSQL server. Right-click the “Databases” node in the left sidebar and choose “Create” > “Database”. I will name mine layoffs_db, then click “Save”.
For organizational purposes, let’s create a new schema named layoffs_analysis by selecting the database, opening the Query Tool, and running the command:
CREATE SCHEMA layoffs_analysis;
Let’s initialize a new table in the layoffs_analysis Schema and check whether it was created.
CREATE TABLE layoffs_analysis.Layoffs (
    company VARCHAR(255),
    location VARCHAR(255),
    industry VARCHAR(100),
    total_laid_off VARCHAR,
    percentage_laid_off VARCHAR,
    date VARCHAR,
    stage VARCHAR(50),
    country VARCHAR(100),
    funds_raised_millions VARCHAR
);

SELECT * FROM layoffs_analysis.Layoffs;
Loading the Data
Let’s now load the downloaded CSV file into the newly created Layoffs table. I found that the easiest option is to use the Windows Command Prompt. Launch it and run the following command:
psql -U your_username -d your_database
After this you will be prompted to enter your password.
In my case it is as follows, where postgres is my username and layoffs_db is my database:
psql -U postgres -d layoffs_db
The psql shell will then open, and you can copy the data from its file location into the Layoffs table we created.
\copy layoffs_analysis.layoffs FROM 'C:\Users\Ken\Downloads\cpy_layoffs.csv' DELIMITER ',' CSV HEADER NULL 'NULL';
Let’s check if the table has been populated
SELECT * FROM layoffs_analysis.Layoffs;
The process is a success, and I have a table with 2,361 rows of data!
Data Cleaning
We are now going to clean the data. This is a multi-part process that generally involves the following steps:
Removing duplicates
Standardizing the data
Dealing with null or blank values
Removing unnecessary columns
For the purposes of safe experimentation, let’s duplicate our layoffs table into a new one, layoffs_staging.
-- Lets create a duplicate table
CREATE TABLE layoffs_analysis.layoffs_staging
(LIKE layoffs_analysis.layoffs INCLUDING ALL);
INSERT INTO layoffs_analysis.layoffs_staging
SELECT *
FROM layoffs_analysis.layoffs;
-- Check
SELECT * FROM layoffs_analysis.layoffs_staging;
I don’t know about you, but I’m getting tired of typing the schema name every time I run a query. Let’s take care of that by setting the search path, then confirm everything is running okay.
SET search_path TO layoffs_analysis;
SELECT * FROM layoffs_staging;
Removing Duplicates
Given that we don’t have a unique ID, removing the duplicates is going to be challenging. To solve this, we will use ROW_NUMBER() with PARTITION BY over all columns, creating a new column row_num. The logic is that, since we are partitioning by every column, each unique row gets a row number of 1; anything greater than 1 can be considered a duplicate.
Let’s save that query as a CTE and select the results where row_num > 1, which represent the duplicates.
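The duplicate check described above might look like this (a sketch, using the duplicate_cte name and partitioning over every column of layoffs_staging):

-- Rank each row within its group of identical rows
WITH duplicate_cte AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY company, location, industry,
                total_laid_off, percentage_laid_off, date,
                stage, country, funds_raised_millions
        ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1; -- anything above 1 is a duplicate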
Given that we cannot directly UPDATE or DELETE from a CTE (in PostgreSQL, at least), the best option we have is to create a new table and delete the duplicate values from there. In the following commands, I create a new table and copy the values from duplicate_cte into it.
-- New Table
CREATE TABLE layoffs_staging2 (
    company VARCHAR(255),
    location VARCHAR(255),
    industry VARCHAR(255),
    total_laid_off INTEGER,
    percentage_laid_off DECIMAL(5,2),
    date DATE,
    stage VARCHAR(50),
    country VARCHAR(255),
    funds_raised_millions DECIMAL(10,2),
    row_num INT
);
-- Check the new table
SELECT * FROM layoffs_staging2;
Let’s insert the data per our criteria. In my case, I had to recast some of the columns because I was getting type errors.
INSERT INTO layoffs_staging2
SELECT
    company, location, industry,
    total_laid_off::INTEGER,
    percentage_laid_off::NUMERIC,
    date::DATE, stage, country,
    funds_raised_millions::NUMERIC,
    ROW_NUMBER() OVER (
        PARTITION BY
            company, location, industry,
            total_laid_off, percentage_laid_off, date,
            stage, country, funds_raised_millions
    ) AS row_num
FROM layoffs_staging;
If we now run SELECT * FROM layoffs_staging2; we should see the same table as before, with an added row_num column.
Now that we have our table deleting the duplicate values is as simple as follows:
DELETE
FROM layoffs_staging2
WHERE row_num > 1;
Standardizing the Data
This entails finding issues in the data and fixing them. Looking at the company column, some values have stray leading or trailing spaces, so let’s trim them.
SELECT company, TRIM(company)
FROM layoffs_staging2;
UPDATE layoffs_staging2
SET company = TRIM(company);
Let’s look at the industry column. There are many distinct industries, so let’s list them and ORDER BY 1 (the first column).
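A quick way to scan the distinct values might look like this (a sketch):

-- List each distinct industry, sorted by the first selected column
SELECT DISTINCT industry
FROM layoffs_staging2
ORDER BY 1;

Scanning this list reveals inconsistencies, such as several variants of “Crypto”, which we fix next.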
-- Lets Change things like the Cryptocurrency example
SELECT *
FROM layoffs_staging2
WHERE industry LIKE 'Crypto%';
UPDATE layoffs_staging2
SET industry='Crypto'
WHERE industry LIKE 'Crypto%';
-- Check
SELECT DISTINCT(industry)
FROM layoffs_staging2;
Let’s look at location and country. There is a classic issue: “United States” appears both with and without a trailing dot. Let’s fix that.
The issue can be solved with either of these two approaches:
-- Solution 1
UPDATE layoffs_staging2
SET country='United States'
WHERE country LIKE 'United States%';
-- Solution 2: preview the trim, then apply it
SELECT DISTINCT country,
    TRIM(TRAILING '.' FROM country)
FROM layoffs_staging2
ORDER BY 1;

UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country);
-- Check
SELECT DISTINCT(country)
FROM layoffs_staging2
WHERE country LIKE 'United States%';
Let’s standardize the date values and cast the date column to the DATE type. (If you already cast date::DATE when inserting into layoffs_staging2, the column is already a DATE and this step can be skipped.)
SELECT date,
    TO_DATE(date, 'MM/DD/YYYY')
FROM layoffs_staging2;

UPDATE layoffs_staging2
SET date = TO_DATE(date, 'MM/DD/YYYY');

-- Change the data type of the date column to DATE
-- by casting the existing values
ALTER TABLE layoffs_staging2
ALTER COLUMN date TYPE DATE USING date::DATE;
Looking at Null and Blank Values
SELECT *
FROM layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;
Such rows can be removed; we will address that later.
Let’s look at the industry column.
In some instances we can fill in missing data: for example, one Airbnb row lists Travel as its industry while another Airbnb row is blank, so we can populate the blank with Travel.
Let’s figure out how to do that. First, let’s find all instances in the table where two rows share the same company, but one has an industry and the other doesn’t. We will do this using a self-join. Self-joins are useful for querying hierarchical structures and for comparing rows against each other, as in this case.
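The exploratory self-join might look like this (a sketch; it mirrors the check we run after the update):

-- Pair rows from the same company where one is missing industry
SELECT t1.company, t1.industry AS missing, t2.industry AS populated
FROM layoffs_staging2 t1
JOIN layoffs_staging2 t2
    ON t1.company = t2.company
WHERE (t1.industry IS NULL OR t1.industry = '')
    AND t2.industry IS NOT NULL;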
Let’s make it official
UPDATE layoffs_staging2 t1
SET industry = t2.industry
FROM layoffs_staging2 t2
WHERE t1.company = t2.company
AND (t1.industry IS NULL OR t1.industry ='')
AND t2.industry IS NOT NULL;
-- Check if it worked
SELECT *
FROM layoffs_staging2 t1
JOIN layoffs_staging2 t2
ON t1.company = t2.company
WHERE (t1.industry IS NULL OR t1.industry = '')
AND t2.industry IS NOT NULL;
When checking, some industry entries are still empty. This could be because they are blank strings rather than NULLs. So let’s convert the blanks to NULLs and run the update again.
-- Change the blanks to nulls
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry='';
-- Update the industries with similar titles like before
UPDATE layoffs_staging2 t1
SET industry = t2.industry
FROM layoffs_staging2 t2
WHERE t1.company = t2.company
AND (t1.industry IS NULL OR t1.industry ='')
AND t2.industry IS NOT NULL;
-- Check
SELECT *
FROM layoffs_staging2
WHERE industry IS NULL
OR industry ='';
Checking again, all is good except one row, for the company Bally’s. There is no other Bally’s row from which to populate its industry, which is why it remains NULL.
Given that we don’t have any value (such as total employees before the layoffs) we could use to populate total_laid_off and percentage_laid_off, we can leave it at that as far as null values are concerned.
Removing Unnecessary Columns
Since rows missing both total_laid_off and percentage_laid_off give us no layoff information and cannot be derived, it is best to delete them from the table entirely.
DELETE
FROM layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;
Now we are good.
Now let’s drop one final column, row_num, which we created earlier, and we will be done with the first part.
-- Drop column row_num
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
For the next part we are going to conduct exploratory analysis on this data. We are going to find trends, patterns and conduct some complex queries. It’s going to be fun.
To recap, we:
Removed duplicates
Standardized the data
Dealt with null and blank values
Removed unnecessary columns and rows