Imagine it's Friday afternoon. Almost everybody has left for the weekend, and you want to do the same. But the phone rings: there is a script that creates a file that needs to be imported, but the import fails. You get a reference file that can be imported successfully, and now it's up to you to find out why the script does not do what it's supposed to do.
It's a long script you are not very familiar with. You go to the lines that write the file and look at the columns; everything seems reasonable. At some point you notice some duplicated columns and briefly wonder what the point of them is, but hey, there must be a reason you are just not aware of... Half an hour later there is still no progress, so you start comparing the files. And suddenly you notice that your file has 28 columns while the comparison file has 30 columns. What? You count the columns in your script again. 30.
But in the output file there are only 28. Where did the missing two go?
Remember the duplicated columns? This is what happens. Let's take the mtcars dataset.
`%>%` <- magrittr::`%>%`
mtcars %>%
  dplyr::glimpse()
# Rows: 32
# Columns: 11
# $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10…
# $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8,…
# $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 2…
# $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65…
# $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.…
# $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3…
# $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40, 17.60, 1…
# $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
# $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,…
# $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5,…
# $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8,…
Ok, you knew that. Let's select some columns:
mtcars %>%
  dplyr::select(
    mpg,
    cyl
  ) %>%
  head(3)
#                mpg cyl
# Mazda RX4     21.0   6
# Mazda RX4 Wag 21.0   6
# Datsun 710    22.8   4
Nothing spectacular here. You can even rename your columns if you want:
mtcars %>%
  dplyr::select(
    mpg,
    some_cyl = cyl
  ) %>%
  head(3)
#                mpg some_cyl
# Mazda RX4     21.0        6
# Mazda RX4 Wag 21.0        6
# Datsun 710    22.8        4
Again, nothing spectacular. Let's assume I want the same column twice:
mtcars %>%
  dplyr::select(
    mpg,
    mpg
  ) %>%
  head(3)
#                mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8
See what's happening? No error, no warning, just that the column appears only once.
My thought: ok, let's give it another name:
mtcars %>%
  dplyr::select(
    mpg,
    some_mpg = mpg
  ) %>%
  head(3)
#               some_mpg
# Mazda RX4         21.0
# Mazda RX4 Wag     21.0
# Datsun 710        22.8
Nope, that does not work either: the column still appears only once, now under the new name. I actually have to create a column with the duplicated content:
mtcars %>%
  dplyr::mutate(some_mpg = mpg) %>%
  dplyr::select(
    mpg,
    some_mpg
  ) %>%
  head(3)
#                mpg some_mpg
# Mazda RX4     21.0     21.0
# Mazda RX4 Wag 21.0     21.0
# Datsun 710    22.8     22.8
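For what it's worth, the mutate-then-select route is probably not the only way to get a real copy of a column. The sketch below shows two alternatives that I believe behave this way: `dplyr::transmute()` (marked as superseded in recent dplyr versions, but still available) and plain base R indexing by name. The object names `with_transmute` and `with_base` are just made up for this illustration.

```r
# Option 1: transmute() keeps only the columns listed,
# and the named expression creates a real copy of mpg.
with_transmute <- dplyr::transmute(mtcars, mpg, some_mpg = mpg)

# Option 2: base R indexing by name keeps repeated columns
# and makes the duplicated names unique.
with_base <- mtcars[, c("mpg", "mpg")]

names(with_transmute)
# [1] "mpg"      "some_mpg"
names(with_base)
# [1] "mpg"   "mpg.1"
```

Note that the base R variant picks the second name for you, so you would still have to rename it if the import expects a specific header.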
Did you know that? I didn't. Otherwise I could have left earlier for the weekend...
It may be that this is a corner case. Still, I would have greatly appreciated at least a warning in such cases...
Of course, this debugging journey would also have been shorter if the importing tool had given a clear error message stating exactly which columns were missing. But that tool was not in my hands; it was not something I was able to change at that point.
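Since neither the warning nor the clear error message was available, one thing I could have done on my side is check the columns right before writing the file. The following is only a minimal sketch: `assert_columns()` is a hypothetical helper I made up for this post, not part of dplyr or any other package, and the `expected` vector stands in for whatever header the import really needs.

```r
# Hypothetical helper: stop with a clear message if the data frame
# does not contain exactly the expected columns.
assert_columns <- function(df, expected) {
  # Names that were requested but are not in the result
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Catches the case where the same name was requested twice
  if (ncol(df) != length(expected)) {
    stop("Expected ", length(expected), " columns, got ", ncol(df))
  }
  invisible(df)
}

expected <- c("mpg", "some_mpg")

# Run the check on the select() call that silently dropped a column;
# tryCatch() only captures the message so the script keeps running here.
result <- tryCatch(
  assert_columns(dplyr::select(mtcars, mpg, some_mpg = mpg), expected),
  error = function(e) conditionMessage(e)
)
result
```

Given the behaviour shown earlier, `result` should hold a message naming `mpg` as the missing column. In a real script I would drop the `tryCatch()` and let the error stop the run before the file is written.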
You know the saying that we all do our best? I know I do (at least most of the time - I naively imagine...). What I have learned in the meantime: this is not enough. It's not enough to give your best, not even if you do it 100% of the time. It's about what your best is. You can spend much less time and still be more effective. Not convinced? Check this TED talk presented by Tim Ferriss.
Make a promise. Show up. Do the work. Repeat.