M. Rincón Guided by asteriks

Working with biggish data files

2023-07-28

Last week I worked with over 10,000 files of tabular data, each with about 50,000 rows and 10 columns separated by a '|'. This is the kind of problem that falls in the uncanny valley between small and biggish data. My goal was to run some quick checks and, if possible, concatenate the files into a single csv file I could load into Python.

The first step was to make sure that the number and order of the columns was the same for all files:

head -q -n 1 *.dat | sort | uniq
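
Counting how many files share each variant makes the differences easier to scan; the same command with a count:

head -q -n 1 *.dat | sort | uniq -c | sort -rn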

This revealed that not all headers were the same, but the differences were insignificant. For example, some files used "NET_CHANGE" while others used "NET CHANGE". One file seemed to quote the first line. Other files were terminated with carriage return and newline pairs. But worryingly, some files seemed to start with an empty line. I didn't know if these files were empty, or if they had a different format. I made a list of them and opened a sample:

head -n 1 -v *.dat | grep -B1 -E $'^\r$'

It turned out that these files had an extra empty line, terminated by a carriage return, as their first line. Lucky. I removed the carriage returns and the empty lines from all the files:

sed -i $'s/\r//' *.dat && sed -i '/^[[:blank:]]*$/ d' *.dat
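
The two passes can also be folded into a single invocation, so each file is rewritten only once; a sketch of the same cleanup:

sed -i $'s/\r//; /^[[:blank:]]*$/ d' *.dat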

Now all the headers looked close enough, and there should have been no empty lines left. Still, I took a second look:

grep -E $'^$' *.dat

At this point I noticed that some files were only just big enough to hold a header line. Looking at their tails proved my suspicion correct. I deleted them using vim:

:r !du -ckhs * | grep "^4\.0K"
:%s/^4.*\t/rm /
:w !sh
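
The same cleanup can be done without leaving the shell, counting lines instead of guessing from block sizes; a sketch that prints the rm commands for review first:

for f in *.dat; do
    [ "$(wc -l < "$f")" -le 1 ] && echo rm "$f"
done

Piping that output to sh, or dropping the echo, actually removes the files.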

At this point I went ahead and concatenated the files. Since I only wanted to preserve one header, I used awk.

head -n1 2023_04_27.dat > all.txt
awk 'FNR != 1' *.dat >> all.txt
du -h all.txt
wc -l all.txt
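
As a quick sanity check, the line count should come out to the total across the inputs minus one header per file, plus the single header kept; a sketch (assuming every file ends with a newline):

echo $(( $(cat *.dat | wc -l) - $(ls *.dat | wc -l) + 1 ))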

The result was one file with a little under six hundred million lines. Because I wanted a csv file, I removed any existing commas and changed the separator from '|' to ',':

sed -i 's/,//g' all.txt && sed -i 's/|/,/g' all.txt

I also checked that the resulting file wasn't jagged, and it wasn't:

awk -F',' '{print NF}' all.txt | uniq

At this point I could have fired up Dask, but I thought I could do better and reduce the file to something Polars could handle. After taking a look at the original files, I suspected some redundancy. For example, one column showed the three-character product code, another the numerical code, and another a product description. This suggested I could keep one of the codes as a key and move the numerical code and description out to a second file. Using shuf I made a sample, split it, and recovered the original using join and diff. Nice. The file could be divided, and doing that reduced the size of the data by over 60% without losing any information.
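
A sketch of that round-trip check, with made-up column positions (say the product code is column 2 and the redundant numerical code and description are columns 3 and 5); the sample size and file names are placeholders:

shuf -n 100000 all.txt > sample.csv

# everything except the redundant columns, sorted on the product code
cut -d',' -f1,2,4,6-10 sample.csv | sort -t',' -k2,2 > left.csv

# one lookup row per product code
cut -d',' -f2,3,5 sample.csv | sort -t',' -k1,1 -u > lookup.csv

# join on the code and restore the original column order
join -t',' -1 2 -2 1 left.csv lookup.csv \
    | awk -F',' -v OFS=',' '{print $2,$1,$9,$3,$10,$4,$5,$6,$7,$8}' \
    | sort > rebuilt.csv

# no output from diff means the split loses nothing
diff rebuilt.csv <(sort sample.csv)

With the check passing, the actual split came down to two cuts: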

cut -d',' -f1,4-7,8 all.txt > px.csv
cut -d',' -f2-3,5-6,11 all.txt | sort | uniq > description.csv
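
A final du shows how much the split saved compared with the single file:

du -h all.txt px.csv description.csv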

Now the final file was small enough for me to inspect locally with Polars.