File extension histogram - Stefaan Lippens inserts content here

To get a quick overview of a git repo, I wanted to know what kind of files it contained, for example through a simple histogram of file extensions.

I figured this would be a nice test for ChatGPT:

Can you write me a bash script to walk the current directory recursively and build a histogram of the file extensions?

and it returned this bash script:

#!/bin/bash

# Initialize an associative array to store the histogram
declare -A histogram

# Function to traverse directories recursively
walk_directory() {
    local dir="$1"
    local file_extension

    # Loop through all files and directories in the current directory
    for entry in "$dir"/*; do
        if [ -f "$entry" ]; then
            # Get the file extension of the current file
            file_extension="${entry##*.}"

            # Increment the count for the file extension in the histogram
            ((histogram[$file_extension]++))
        elif [ -d "$entry" ]; then
            # If it's a directory, recursively call the function on it
            walk_directory "$entry"
        fi
    done
}

# Call the function with the current directory as an argument
walk_directory "."

# Print the histogram
echo "Histogram of File Extensions:"
for ext in "${!histogram[@]}"; do
    echo "$ext: ${histogram[$ext]}"
done

It kind of works, but it doesn't properly handle files without an extension or files in folders with an extension, because it simply takes everything after the last dot in the full path. And the size of the script doesn't feel proportional to the simplicity of the task.
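To make that failure mode concrete, here is a small sketch that mimics the bash expansion "${entry##*.}" in Python and compares it with pathlib's suffix. The helper name naive_extension and the example paths are made up for illustration; they are not part of the generated script.

```python
from pathlib import PurePosixPath


def naive_extension(path: str) -> str:
    # Mimics bash "${entry##*.}": keep everything after the last dot,
    # or the whole string when there is no dot at all.
    return path.rpartition(".")[2]


# Hypothetical example paths, shaped like what the bash loop would see.
for path in ["./src/main.py", "./README", "./conf.d/settings"]:
    print(f"{path!r}: naive={naive_extension(path)!r} pathlib={PurePosixPath(path).suffix!r}")

# './src/main.py': naive='py' pathlib='.py'
# './README': naive='/README' pathlib=''
# './conf.d/settings': naive='d/settings' pathlib=''
```

Because every path in the bash loop starts with "./", there is always at least one dot to match, so a file without an extension ends up in a bucket named after its whole path.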

I asked a follow-up question to rewrite it in Python, which improved the extension handling, but the script still felt too bloated (explicit for loops, explicit recursion, ...). I tried to nudge ChatGPT to make it more concise, unfortunately without much success. So I finally decided to go the old-fashioned way and hand-craft something that can be copy-pasted as a command-line one-liner.

Basic version:

$ python -c 'import pathlib, collections; print(collections.Counter(p.suffix for p in pathlib.Path(".").glob("**/*.*")))'
Counter({'.py': 17895, '.pyc': 14187, '.pyi': 2557, '.dat': 2493, '.mo': 949, '.po': 949, ...

Print the histogram line-by-line:

$ python -c 'import pathlib, collections; print("\n".join(f"{c:6d} {e}" for (e, c) in collections.Counter(p.suffix for p in pathlib.Path(".").glob("**/*.*")).most_common()))'
 17895 .py
 14187 .pyc
  2557 .pyi
  2493 .dat
   949 .mo
   949 .po
...
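For when a one-liner is not a hard requirement, the same idea can be spelled out as a small script. This is just a sketch under a few assumptions of my own: the function name extension_histogram, the "(none)" bucket for files without a suffix, and the skip_dirs parameter (defaulting to ".git") are all hypothetical choices. Unlike the "**/*.*" glob, which also matches directories that have a dot in their name, it only counts regular files.

```python
import collections
import pathlib


def extension_histogram(root=".", skip_dirs=(".git",)):
    """Count file suffixes of regular files under root (hypothetical helper)."""
    root = pathlib.Path(root)
    counts = collections.Counter()
    for path in root.rglob("*"):
        # Ignore anything that lives inside one of the skipped directories.
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        # Only count regular files; directories like "conf.d" are skipped.
        if path.is_file():
            counts[path.suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Same line-by-line layout as the one-liner above.
    for ext, count in extension_histogram().most_common():
        print(f"{count:6d} {ext}")
```

Skipping ".git" is a matter of taste: for a quick overview of a repo I mostly care about the checked-out files, not git's internal objects.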