You are writing a Spark DataFrame to a CSV file on HDFS, with a header line containing the column names:

df.write.csv('output_folder', header=True)

Because your DataFrame is partitioned, you get multiple CSV part files in the output folder, and each of them gets its own header line with the column names.
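For illustration, listing the output folder shows one part file per partition plus a _SUCCESS marker. The listing below is schematic; the real part file names include a unique identifier per run:

hadoop fs -ls output_folder
output_folder/_SUCCESS
output_folder/part-00000-<id>-c000.csv
output_folder/part-00001-<id>-c000.csv
...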

So far so good, but now you want to dump all of that to a single local CSV file. The straightforward way is to use HDFS's getmerge, which concatenates all the files into one local file:

hadoop fs -getmerge output_folder result.csv

Unfortunately, because getmerge just concatenates the HDFS files as-is, the CSV header will be repeated at various places in the output file. This causes quite some headaches when loading the data with tools such as pandas.
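To see the damage, load the merged file with pandas (a quick check, using the result.csv produced by the getmerge above):

import pandas as pd

df = pd.read_csv('result.csv')
# The stray header lines are parsed as ordinary data rows, which
# typically forces otherwise numeric columns to dtype 'object'.
print(df.dtypes)
# The stray rows are the ones where every value equals its own column name.
print(df[(df == df.columns).all(axis=1)])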

A typical solution is to make sure that the DataFrame has only a single partition before writing to HDFS, so that there is only a single output file:

df.coalesce(1).write.csv('output_folder', header=True)
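Note that Spark still writes a folder, now containing a single part file, so getmerge produces a clean file with exactly one header. Alternatively, you can copy that single part file to a local file directly, assuming Spark's default part file naming:

hadoop fs -copyToLocal 'output_folder/part-*.csv' result.csv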

But sometimes you cannot or don't want to repartition the DataFrame before writing to HDFS (e.g. to avoid memory issues). In those cases, you can use the following little awk snippet to get rid of the duplicated header lines in your CSV file:

awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' result.csv > postprocessed.csv
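The snippet remembers the first line (the header) and then only prints lines that differ from it, so all repeated headers are dropped. One caveat: a data row that happens to be identical to the header line would be dropped as well. If you prefer to stay in Python, here is a minimal equivalent sketch, using the same file names as the awk command:

header = None
with open('result.csv') as src, open('postprocessed.csv', 'w') as dst:
    for line in src:
        if header is None:
            header = line    # first line: remember the header
            dst.write(line)
        elif line != header:
            dst.write(line)  # keep everything except repeated header lines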