A Simple Shell Script for Log Analysis: Extracting and Counting Apache Access Logs

Saumik Satapathy
Feb 4, 2025


Extracting and reading logs has always been a challenge for sysadmins and DevOps engineers. Many people depend on third-party tools such as ELK to read and analyze logs. In this demo, I will show how a simple shell script can do the job without any third-party tools.

For this demo, I installed an Apache web server on a CentOS 7 VM and added a simple index page plus one more simple web page. Without spending more time on the intro, let me share the script, followed by a line-by-line explanation. I have attached a video to make it easier to follow.

#!/bin/bash

# Path to the Apache log file
LOG_FILE="/var/log/httpd/access_log"

# Extract IP, date, time, and status code, then count occurrences
sudo awk '
{
    # Extract IP address (first field in Apache logs)
    ip_address = $1;

    # Extract the day from the timestamp field, e.g. "[31/Jan/2025:05:39:56"
    # (the three-argument match() is a GNU awk extension)
    match($4, /\[([0-9]+)/, day);

    # Split the timestamp on "/" and ":" to get month, year, hour, and minute
    split($4, time_parts, "[/:]");

    # Get the status code (ignore missing or invalid values)
    status_code = $9;
    if (status_code ~ /^[0-9]+$/) {
        # Format the output key as: "IP DD/Mon/YYYY HH:MM StatusCode"
        key = ip_address " " day[1] "/" time_parts[2] "/" time_parts[3] " " time_parts[4] ":" time_parts[5] " " status_code;

        # Count occurrences of each unique key
        count[key]++;
    }
}

# Print the results after processing all log lines
END {
    for (entry in count) {
        print count[entry], entry;
    }
}' "$LOG_FILE" | sort -nr

Explanation

1. Shebang (First Line)

#!/bin/bash

Tells the system that this script should be executed using the bash shell.

2. Define the Apache Log File Path

LOG_FILE="/var/log/httpd/access_log"
  • Stores the path of the Apache log file in the variable LOG_FILE.
  • This allows easy modification of the script to work with different log locations.
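
As a small sketch of that flexibility, the path could also be taken from the first command-line argument, falling back to the path used in this article. The helper name resolve_log_file is hypothetical and not part of the original script; on Debian or Ubuntu, for example, the Apache access log usually lives at /var/log/apache2/access.log instead.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): print the first
# argument if one was given, otherwise the default path from this article.
resolve_log_file() {
    printf '%s\n' "${1:-/var/log/httpd/access_log}"
}

# Use the script's own first argument, if any.
LOG_FILE="$(resolve_log_file "$@")"
echo "Analyzing: $LOG_FILE"
```

With this in place, the script could be called with a different log path as its first argument, without editing the file.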

3. Run awk with sudo to Process the Log File

sudo awk '
---------
---------
---------
'
  • Uses sudo to ensure access to the log file (since /var/log/httpd/access_log usually requires root access).
  • Calls awk to process the log file. awk is a powerful text-processing tool that reads and processes input line by line. (The cut command could achieve the same result, but the script would be longer and harder to maintain.)

4. Extract the IP Address

ip_address = $1;
  • Apache logs store the client’s IP address in the first field ($1).
  • This assigns the IP address to the variable ip_address.
    Example:
    49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"
    Here, $1 = 49.207.203.25, so ip_address = 49.207.203.25.
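
To see how awk numbers the fields of such a line, here is a minimal sketch that prints a few of them for the sample line above. The function name show_fields is only for illustration. awk splits on whitespace by default, so the timestamp is spread over $4 and $5.

```shell
#!/bin/bash

# Sample line from the example above, with plain ASCII quotes and hyphens.
sample_line='49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"'

# Hypothetical helper: print selected whitespace-separated fields of one line.
show_fields() {
    printf '%s\n' "$1" | awk '{
        print "$1=" $1   # client IP address
        print "$4=" $4   # date and time, with the leading "["
        print "$5=" $5   # timezone offset, with the trailing "]"
        print "$9=" $9   # HTTP status code
    }'
}

show_fields "$sample_line"
```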

5. Extract the Day from the Timestamp

match($4, /\[([0-9]+)/, day);
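  • $4 contains the first part of the timestamp, for example [31/Jan/2025:05:39:56 (awk splits on whitespace, so the timezone part +0000] ends up in $5).
  • match() extracts the day (31) using the regular expression \[([0-9]+), which matches the digits right after the opening square bracket. The three-argument form of match() used here is a GNU awk (gawk) extension.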
  • The result is stored in day[1], so day[1] = 31.
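
On systems where awk is not GNU awk, the array form of match() may not be available. As a sketch under that assumption, the same day value can be obtained with the two-argument match(), which sets the built-in variables RSTART and RLENGTH. This is a swapped-in alternative, not what the script above does, and the function name extract_day is hypothetical.

```shell
#!/bin/bash

# The value awk sees in $4 for the sample log line.
sample_ts='[31/Jan/2025:05:39:56'

# Hypothetical helper: print the day of month from a "[DD/Mon/YYYY:..." string.
extract_day() {
    printf '%s\n' "$1" | awk '{
        # Two-argument match() sets RSTART and RLENGTH for the matched text.
        if (match($1, /^\[[0-9]+/)) {
            # Skip the leading "[" and keep only the digits.
            print substr($1, RSTART + 1, RLENGTH - 1)
        }
    }'
}

extract_day "$sample_ts"
```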

6. Split the Timestamp into parts

split($4, time_parts, "[/:]");
  • split() breaks the timestamp ($4) into different parts using : and / as delimiters.
  • Example breakdown for $4 = [31/Jan/2025:05:39:56:
    time_parts[1] = "[31" # Day with the leading bracket (not used in the key)
    time_parts[2] = "Jan" # Month
    time_parts[3] = "2025" # Year
    time_parts[4] = "05" # Hour
    time_parts[5] = "39" # Minute
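
A quick way to check this breakdown is to print every element that split() produces. The sketch below does that for the sample timestamp; the function name show_parts is only for illustration.

```shell
#!/bin/bash

# The value awk sees in $4 for the sample log line.
sample_ts='[31/Jan/2025:05:39:56'

# Hypothetical helper: split a timestamp on "/" and ":" and list the parts.
show_parts() {
    printf '%s\n' "$1" | awk '{
        # split() returns the number of elements it stored in the array.
        n = split($1, time_parts, "[/:]")
        for (i = 1; i <= n; i++) {
            print i ": " time_parts[i]
        }
    }'
}

show_parts "$sample_ts"
```

The first element still carries the opening bracket, which is why the script takes the day from match() and only uses elements 2 to 5 from split().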

7. Extract the HTTP Status Code

status_code = $9;
  • The HTTP status code is in the 9th field ($9) in the log format.
  • Example:
    "GET / HTTP/1.1" 404 3460 "-" "-"
    Here, $9 = 404, so status_code = 404.

8. Ignore Invalid Status Codes (Optional)

if (status_code ~ /^[0-9]+$/) {
----------
----------
}
  • Ensures the status code is numeric using the regular expression /^[0-9]+$/.
  • This skips lines where $9 is missing or "-".
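
To illustrate the effect of this check, the sketch below counts how many input lines pass it. The second sample line is made up for this example: it represents a request with an empty request line, so the fields shift and $9 is no longer a number. The function name count_valid is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: count the lines on stdin whose 9th field is numeric.
count_valid() {
    awk '$9 ~ /^[0-9]+$/ { valid++ }
         END { print valid + 0 }'   # "+ 0" prints 0 when nothing matched
}

# First line: sample from this article. Second line: invented for the demo.
printf '%s\n' \
    '49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"' \
    '49.207.203.25 - - [31/Jan/2025:05:40:10 +0000] "-" 408 - "-" "-"' \
    | count_valid
```

Only the first line is counted, because in the second one the status code sits in $7 and $9 holds "-" with its quotes.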

9. Create a Unique Key for Counting

key = ip_address " " day[1] "/" time_parts[2] "/" time_parts[3] " " time_parts[4] ":" time_parts[5] " " status_code;
  • Formats the extracted data into a single string (key), for example:
    "49.207.203.25 31/Jan/2025 05:39 404".
  • Each distinct combination of IP, date, time (to the minute), and status code gets its own key.

10. Count Each Occurrence

count[key]++;
  • Increments the count for each unique key.
  • If the same IP gets the same status code several times within the same minute, the occurrences add up.
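
Because the key decides what gets grouped together, changing it changes the report. As a sketch of one possible variation (not part of the original script), the key below uses only the IP and the status code, so requests are counted regardless of the minute they arrived in. The function name count_by_ip_status and the second and third sample lines are made up for illustration.

```shell
#!/bin/bash

# Hypothetical variation: count requests per IP and status code only.
count_by_ip_status() {
    awk '$9 ~ /^[0-9]+$/ { count[$1 " " $9]++ }
         END { for (k in count) print count[k], k }' | sort -nr
}

# First line: sample from this article. The other two are invented.
printf '%s\n' \
    '49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"' \
    '49.207.203.25 - - [31/Jan/2025:05:41:02 +0000] "GET /missing HTTP/1.1" 404 3460 "-" "-"' \
    '203.0.113.7 - - [31/Jan/2025:05:41:30 +0000] "GET / HTTP/1.1" 200 512 "-" "-"' \
    | count_by_ip_status
```

With the per-minute key from the script, the two 404 lines would be counted separately; with this coarser key they are counted together.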

11. Process All Log Lines & Print the Final Output

END {
    for (entry in count) {
        print count[entry], entry;
    }
}
  • Runs after processing all log lines.
  • Loops through all counted entries and prints them in the format:
    COUNT IP DATE TIME STATUS_CODE

12. Sort the Output in Descending Order

"$LOG_FILE" | sort -nr
  • Sorts the output by occurrence count (-n for numeric, -r for reverse order).
  • The most frequent entries appear first. The script counts every status code, not only errors.
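
Putting the steps together, here is a minimal end-to-end sketch that can be tried without root access or a real log file. It reads log lines from standard input, avoids the gawk-only match() form by stripping the bracket from the first split() element, and keeps only the top rows with head. The function name top_entries, its limit parameter, and the second and third sample lines are assumptions made for this example.

```shell
#!/bin/bash

# Hypothetical wrapper around the same counting logic.
# $1 = number of rows to keep (defaults to 10).
top_entries() {
    local limit="${1:-10}"
    awk '
    $9 ~ /^[0-9]+$/ {
        split($4, t, "[/:]")
        sub(/^\[/, "", t[1])   # strip the leading "[" from the day
        key = $1 " " t[1] "/" t[2] "/" t[3] " " t[4] ":" t[5] " " $9
        count[key]++
    }
    END { for (k in count) print count[k], k }' | sort -nr | head -n "$limit"
}

# First line: sample from this article. The other two are invented.
sample_log='49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"
49.207.203.25 - - [31/Jan/2025:05:39:58 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"
49.207.203.25 - - [31/Jan/2025:05:40:10 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "-"'

printf '%s\n' "$sample_log" | top_entries 2
```

On a real server, the function could be fed with something like sudo cat "$LOG_FILE" | top_entries 20, assuming LOG_FILE is set as in the script above.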

Reference Materials:

https://amzn.in/d/dGcNEE0

https://youtu.be/4DqCnn9mZus
