AWK commands equivalent to SQL query – Data manipulation

4. SQL JOINS in AWK

Sometime we have normalised files from source that we need to denormalized to get some report. If you have the same requirement to join two different files then this page is for you.

INNER JOIN –

Consider we have to JOIN EMP table and DEPT table to get below information:
Employee Number, Employee Name, Dept No, Dept Name

SELECT
      E.EMPNO,
      E.ENAME,
      E.DEPTNO,
      D.DNAME
FROM EMP E
INNER JOIN DEPT D
ON E.DEPTNO = D.DEPTNO;
$ awk -F, 'NR==FNR{key[$1]=$2;} $8 in key {print $1,$2,$8,key[$8]}' DEPT.csv EMP.csv | column -nt

Here we are passing two files in awk. You need to understand important part of this command is NR==FNR. NR is the total row number and FNR is the current file row number. Order of file (DEPT.csv EMP.csv) is also important as it controls NR/FNR.
So awk will process DEPT.csv first. For this file, NR is equal to FNR for all records. Next file EMP.csv, NR will increase but FNR will be reset to 1 at record number 1. As our driving file is EMP, I kept it at last.
So till NR==FNR, we are just creating an associative array with key field(DEPTNO – $1) and value = DNAME in it.
The moment reading from EMP.csv started, FNR is 1. Hence NR not equal to FNR. then $8 from EMP file will be compared to associative array created in previous step. If there is a match, print desired columns. Voila !! You are done with the INNER JOIN.
Let me know if you need more explanation on this.

LEFT OUTER JOIN –

In awk, LEFT OUTER join is not much different from INNER JOIN. We just need to change the condition “$8 in key” to “NR!=FNR”:

$ awk -F, 'NR==FNR{key[$1]=$2;} NR!=FNR{print $1,$2,$8,key[$8]}' DEPT_LJ.csv EMP.csv | column -nt

So we are printing all records from EMP file and printing along with DEPT name. I have created one DEPT_LJ.csv file with DEPTNO 20 deleted. In the left outer join we have DNAME as blank for DEPTNO = 20.

This post is only related to SQL operation in AWK program. I will post a separate tutorial on AWK for all other beautyness of awk and use cases of the same. Let me know in case you need any more explanation on this topic. Thank you for your time. Stay tuned..

Regards,
Subhadip

6 thoughts on “AWK commands equivalent to SQL query – Data manipulation”

  1. The problem is that awk is NOT a CSV parser.
    I.e. with the , separator, if we have a field value of “SURNAME,NAME” in this case the AWK threats it as two different fields.

    Use goawk instead.

    1. Yes. Good point. If the delimiter is part of the column value itself, then awk will fail to operate. We always encourage to use character like pipe or something else as delimiter which doesn’t present in the data. GoAWK is good solution for this. But yet to available as part of distributions.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top