A Simple Shell Script for Log Analysis: Extracting and Counting Apache Access Logs
Extracting and reading logs has always been a challenge for sysadmins and DevOps engineers. Many people depend on third-party tools such as ELK to read and analyze logs. In this demo I will show how a simple shell script can do the job without learning any third-party tools.
For this demo I installed an Apache web server on a CentOS 7 VM and added a simple index page plus one more web page to generate some traffic. Without spending any more time on the intro, here is the script, followed by a line-by-line explanation. I have also attached a video that walks through it.
#!/bin/bash

# Path to the Apache log file
LOG_FILE="/var/log/httpd/access_log"

# Extract IP, date, time, and status code, then count occurrences
sudo awk '
{
    # Extract IP address (first column in Apache logs)
    ip_address = $1;
    # Extract the day from the timestamp inside [ ]
    match($4, /\[([0-9]+)/, day);
    # Split the timestamp into month, year, hour, and minute
    split($4, time_parts, "[/:]");
    # Get the status code (ignore missing or invalid values)
    status_code = $9;
    if (status_code ~ /^[0-9]+$/) {
        # Format the output key as: "IP DD/Mon/YYYY HH:MM StatusCode"
        key = ip_address " " day[1] "/" time_parts[2] "/" time_parts[3] " " time_parts[4] ":" time_parts[5] " " status_code;
        # Count occurrences of each unique key
        count[key]++;
    }
}
# Print the results after processing all log lines
END {
    for (entry in count) {
        print count[entry], entry;
    }
}' "$LOG_FILE" | sort -nr
Explanation
1. Shebang (First Line)
#!/bin/bash
- Tells the system that this script should be executed using the bash shell.
2. Define the Apache Log File Path
LOG_FILE="/var/log/httpd/access_log"
- Stores the path of the Apache log file in the variable LOG_FILE.
- This allows easy modification of the script to work with different log locations.
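As a small optional variation (not part of the script above), you could accept the log path as a command-line argument so the same script works on other machines or on rotated logs:
# Use the first argument if given, otherwise fall back to the default path
LOG_FILE="${1:-/var/log/httpd/access_log}"
With that change you could run, for example, ./log_summary.sh /var/log/apache2/access.log on a Debian-style layout.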
3. Run awk with sudo to Process the Log File
sudo awk '
---------
---------
---------
'
- Uses sudo to ensure access to the log file (since /var/log/httpd/access_log usually requires root access).
- Calls awk to process the log file. awk is a powerful text-processing tool that reads and processes the file line by line. (We could use the cut command instead, but that would be lengthier and harder to maintain. See the comparison below.)
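For comparison, here is what a cut-based approach looks like. It can count a single field such as the client IP, but combining IP, date, time, and status the way the awk script does would need several extra commands:
# Count requests per client IP only (no date/time/status grouping)
sudo cut -d' ' -f1 /var/log/httpd/access_log | sort | uniq -c | sort -nr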
4. Extract the IP Address
ip_address = $1;
- Apache logs store the client’s IP address in the first field ($1).
- This assigns the IP address to the variable ip_address.
Example:
49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"
Here, $1 = 49.207.203.25, so ip_address = 49.207.203.25.
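If you want to confirm which field holds the IP in your own log format, printing just $1 for the last few lines is a quick sanity check:
# Show the first field (client IP) of the last 5 requests
sudo tail -n 5 /var/log/httpd/access_log | awk '{ print $1 }'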
5. Extract the Day from the Timestamp
match($4, /\[([0-9]+)/, day);
- $4 contains the timestamp in the format: [31/Jan/2025:05:39:56 +0000]
- match() extracts the day (31) using the regular expression \[([0-9]+), which matches the digits immediately after the opening square bracket [. (The three-argument form of match() is a gawk extension; gawk is the default awk on CentOS 7.)
- The result is stored in day[1], so day[1] = 31.
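You can test this extraction on the sample timestamp alone, without touching the real log. This assumes awk is GNU awk (gawk), which provides the three-argument match():
echo '[31/Jan/2025:05:39:56 +0000]' | awk '{ match($1, /\[([0-9]+)/, day); print day[1] }'
# Output: 31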
6. Split the Timestamp into parts
split($4, time_parts, "[/:]");
- split() breaks the timestamp ($4) into different parts using : and / as delimiters.
- Example breakdown for [31/Jan/2025:05:39:56 +0000]:
time_parts[2] = "Jan"   # Month
time_parts[3] = "2025"  # Year
time_parts[4] = "05"    # Hour
time_parts[5] = "39"    # Minute
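To see exactly how split() carves up the timestamp, you can run it on the sample value and print every piece with its index:
echo '[31/Jan/2025:05:39:56 +0000]' | awk '{ n = split($1, t, "[/:]"); for (i = 1; i <= n; i++) print i, t[i] }'
# Output:
# 1 [31
# 2 Jan
# 3 2025
# 4 05
# 5 39
# 6 56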
7. Extract the HTTP Status Code
status_code = $9;
- The HTTP status code is in the 9th field ($9) in the log format.
- Example:
"GET / HTTP/1.1" 404 3460 "-" "-"
Here, $9 = 404, so status_code = 404.
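If all you need is the overall distribution of status codes, a shorter pipeline built on the same field is enough:
# Count how many times each HTTP status code appears
sudo awk '$9 ~ /^[0-9]+$/ { print $9 }' /var/log/httpd/access_log | sort | uniq -c | sort -nr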
8. Ignore Invalid Status Codes (Optional)
if (status_code ~ /^[0-9]+$/) {
----------
----------
}
- Ensures the status code is numeric using the regular expression /^[0-9]+$/.
- This skips lines where $9 is missing or "-".
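The effect of the numeric check is easy to see with a few throwaway values (404, -, and 200 below are made-up inputs, not real log data):
printf '404\n-\n200\n' | awk '$1 ~ /^[0-9]+$/ { print "kept:", $1 }'
# Output:
# kept: 404
# kept: 200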
9. Create a Unique Key for Counting
key = ip_address " " day[1] "/" time_parts[2] "/" time_parts[3] " " time_parts[4] ":" time_parts[5] " " status_code;
- Formats the extracted data into a single string (key), for example:
"49.207.203.25 31/Jan/2025 05:39 404"
- This key uniquely identifies each combination of IP, date, time, and status code.
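Putting steps 4 to 9 together on the sample log line from earlier shows the exact key the script would build (again assuming gawk for the three-argument match()):
echo '49.207.203.25 - - [31/Jan/2025:05:39:56 +0000] "GET / HTTP/1.1" 404 3460 "-" "-"' | \
  awk '{ match($4, /\[([0-9]+)/, d); split($4, t, "[/:]"); print $1, d[1] "/" t[2] "/" t[3], t[4] ":" t[5], $9 }'
# Output: 49.207.203.25 31/Jan/2025 05:39 404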
10. Count Each Occurrence
count[key]++;
- Increments the count for each unique key.
- If the same IP generates the same error multiple times, it adds up the occurrences.
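count is an ordinary awk associative array, so the same counting trick works on any repeated strings; here is a tiny standalone illustration with made-up values:
printf 'alpha\nbeta\nalpha\nalpha\n' | awk '{ count[$1]++ } END { for (k in count) print count[k], k }'
# Output (order is not guaranteed before sorting):
# 3 alpha
# 1 beta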
11. Process All Log Lines and Print the Final Output
END {
for (entry in count) {
print count[entry], entry;
}
}
- Runs after processing all log lines.
- Loops through all counted entries and prints them in the format:
COUNT IP DATE TIME STATUS_CODE
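On a real server the result looks roughly like the lines below; the IPs, timestamps, and counts here are invented purely to show the shape of the output:
12 49.207.203.25 31/Jan/2025 05:39 404
7 203.0.113.8 31/Jan/2025 05:41 200
3 198.51.100.4 31/Jan/2025 06:02 403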
12. Sort the Output in Descending Order
"$LOG_FILE" | sort -nr
- Sorts the output by occurrence count (-n for numeric, -r for reverse order).
- Most frequent errors appear first.
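The -n flag matters because a plain text sort would put "3" ahead of "12"; a couple of throwaway lines make the difference obvious:
printf '3 a\n12 b\n7 c\n' | sort -r
# 7 c
# 3 a
# 12 b    (text sort: "3" > "12")
printf '3 a\n12 b\n7 c\n' | sort -nr
# 12 b
# 7 c
# 3 a     (numeric sort: 12 > 7 > 3)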