How To Download and Process OVH daily raw logs to extract useful information

Intro

I have a Web Hosting plans at OVH. I'm not that happy with the Website visit statistics tools. Urchin is deprecated, OVHcloud Web Statistics is still young and Awstats which I found great but it only show daily stats.

This is the reason why I decided to find a way to download logs and to treat them manually.

Configuration

Create dedicated log user account

First thing to do is to create a dedicated log user from the OVHCloud Web Control Panel.

OVH | OVH main web interface OVH | OVH login web interface OVH | OVH OVHCloud Web Control Panel OVH | Ribbon Menu OVH | Create a new user step 1 OVH | Create a new user step 2 OVH | Create a new user step 3 OVH | Create a new user step 3 OVH | Create a new user step 3

We now have everything we need to download our logs.

Download Logs

GNU/Linux

Set variables :

[user@host ~]$ USR=ovhlogsuser
[user@host ~]$ PASS=Myverycomplexpassw0rD
[user@host ~]$ URL=https://log.clusterXXX.hosting.ovh.net/shebangthedolphins.net/
[user@host ~]$ DOMAIN=$(awk -F '/' '{ print $4 }' <<< $URL)
[user@host ~]$ DOMAIN=shebangthedolphins.net

Download

[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" -A *gz -r -nd ""$URL"/logs/logs-10-2020/"
[user@host ~]$ ls -lh
total 596K
-rw------- 1 std std  55 10 déc.   2019 robots.txt.tmp
-rw-r--r-- 1 std std 24K  2 oct.  02:56 shebangthedolphins.net-01-10-2020.log.gz
-rw-r--r-- 1 std std 17K  3 oct.  02:19 shebangthedolphins.net-02-10-2020.log.gz
-rw-r--r-- 1 std std 14K  4 oct.  02:32 shebangthedolphins.net-03-10-2020.log.gz
-rw-r--r-- 1 std std 15K  5 oct.  02:34 shebangthedolphins.net-04-10-2020.log.gz
[…]
-rw-r--r-- 1 std std 16K 28 oct.  06:18 shebangthedolphins.net-27-10-2020.log.gz
-rw-r--r-- 1 std std 16K 29 oct.  06:08 shebangthedolphins.net-28-10-2020.log.gz
-rw-r--r-- 1 std std 15K 30 oct.  06:08 shebangthedolphins.net-29-10-2020.log.gz
-rw-r--r-- 1 std std 14K 31 oct.  06:08 shebangthedolphins.net-30-10-2020.log.gz
-rw-r--r-- 1 std std 52K  1 nov.  06:08 shebangthedolphins.net-31-10-2020.log.gz
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" "$URL"/logs/logs-$(/bin/date --date='1 days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date='1 days ago' '+%d-%m-%Y').log.gz
[user@host ~]$ ls -lh
total 20K
-rw-r--r-- 1 std std 18K 25 nov.  06:29 shebangthedolphins.net-24-11-2020.log.gz
[user@host ~]$ perl-rename -v 's/(.*)-(\d\d)-(\d\d)-(\d\d\d\d)(.*)/$4-$3-$2-$1$5/' *gz
[user@host ~]$ ls -lh
total 596K
-rw-r--r-- 1 std std 24K  2 oct.  02:56 2020-10-01-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 17K  3 oct.  02:19 2020-10-02-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 14K  4 oct.  02:32 2020-10-03-shebangthedolphins.net.log.gz
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" "$URL"/logs/logs-$(/bin/date --date='1 days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date='1 days ago' '+%d-%m-%Y').log.gz -O /tmp/$(/bin/date --date='1 days ago' '+%Y-%m-%d')-"$DOMAIN".log.gz
[user@host ~]$ ls -lh /tmp/*gz
-rw-r--r-- 1 std std 18K 25 nov.  06:29 /tmp/2020-11-24-shebangthedolphins.net.log.gz
[user@host ~]$ for DAY in $(seq 1 30); do wget --http-user="$USR" --http-password="$PASS" "$URL"/logs/logs-$(/bin/date --date=''$DAY' days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date=''$DAY' days ago' '+%d-%m-%Y').log.gz -O /tmp/$(/bin/date --date=''$DAY' days ago' '+%Y-%m-%d')-"$DOMAIN".log.gz; done
[user@host ~]$ ls -lh /tmp/*gz
-rw-r--r-- 1 std std 17K 26 oct.  06:12 /tmp/2020-10-25-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 14K 27 oct.  06:44 /tmp/2020-10-26-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 16K 28 oct.  06:18 /tmp/2020-10-27-shebangthedolphins.net.log.gz
[...]
-rw-r--r-- 1 std std 15K 23 nov.  06:03 /tmp/2020-11-22-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 18K 24 nov.  06:38 /tmp/2020-11-23-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 18K 25 nov.  06:29 /tmp/2020-11-24-shebangthedolphins.net.log.gz

Windows/PowerShell

Set variables :

PS C:\Users\std> $user = "ovhlogsuser"
PS C:\Users\std> $pass = "Myverycomplexpassw0rD"
PS C:\Users\std> $secpasswd = ConvertTo-SecureString $pass -AsPlainText -Force
PS C:\Users\std> $credential = New-Object System.Management.Automation.PSCredential($user, $secpasswd)
PS C:\Users\std> $domain = "shebangthedolphins.net"
PS C:\Users\std> $url = "https://log.clusterXXX.hosting.ovh.net/$domain/"

Download

PS C:\Users\std> Invoke-WebRequest -Credential $credential -Uri ("$url" + "logs/logs-" + $((Get-Date).AddDays(-1).ToString("MM-yyyy")) + "/$domain" + "-" + $((Get-Date).AddDays(-1).ToString("dd-MM-yyyy"))  + ".log.gz") -OutFile "$((Get-Date).AddDays(-1).ToString("yyyy-MM-dd"))-$domain.log"
PS C:\Users\std> dir


    Directory: C:\Users\std


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       05/12/2020     15:45         360238 2020-12-04-shebangthedolphins.net.log
PS C:\Users\std>  1..30 | ForEach-Object { Invoke-WebRequest -Credential $credential -Uri ("$url" + "logs/logs-" + $((Get-Date).AddDays(-"$_").ToString("MM-yyyy")) + "/$domain" + "-" + $((Get-Date).AddDays(-"$_").ToString("dd-MM-yyyy"))  + ".log.gz") -OutFile "$((Get-Date).AddDays(-"$_").ToString("yyyy-MM-dd"))-$domain.log" }

Extract informations

GNU/Linux

Now that we've downloaded our logs files, we can use command line to extract useful informations.

Page views statistics

[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' 2020-11-24-shebangthedolphins.net.log.gz | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n | tr -s "[ ]" | sed 's/^ //'
1 "http://shebangthedolphins.net/backup_burp.html"
1 "http://shebangthedolphins.net/backup_burp.html"
1 "http://shebangthedolphins.net/prog_autoit_backup.html"
1 "http://shebangthedolphins.net/prog_powershell_kesc.html"
1 "http://shebangthedolphins.net/vpn_ipsec_06linux-to-linux_tunnel-x509.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_execute_powershell_script.html"
1 "https://shebangthedolphins.net/gnulinux_courier.html"
1 "https://shebangthedolphins.net/gnulinux_vnc_remotedesktop.html"
1 "https://shebangthedolphins.net/vpn_openvpn_windows_server.html"
1 "https://shebangthedolphins.net/windows_icacls.html"
1 "http://www.shebangthedolphins.net/vpn_ipsec_03linux-to-windows_transport-psk.html"
2 "https://shebangthedolphins.net/windows_mssql_alwayson.html"
3 "https://shebangthedolphins.net/fr/vpn_openvpn_buster.html"
7 "https://shebangthedolphins.net/ubiquiti_ssh_commands.html"
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ for i in *.log.gz; do echo "------------"; echo "$i"; zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' "$i" | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | wc -l; done
------------
2020-11-18-shebangthedolphins.net.log.gz
19
------------
2020-11-19-shebangthedolphins.net.log.gz
19
------------
2020-11-20-shebangthedolphins.net.log.gz
24
------------
2020-11-21-shebangthedolphins.net.log.gz
8
------------
2020-11-22-shebangthedolphins.net.log.gz
16
------------
2020-11-23-shebangthedolphins.net.log.gz
15
------------
2020-11-24-shebangthedolphins.net.log.gz
13
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ YEAR=2020
[user@host ~]$ for i in $(seq -w 1 12); do echo "------------"; echo "$YEAR-$i"; zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' $YEAR-"$i"*.log.gz | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | wc -l; done 2>/dev/null
------------
2020-01
101
------------
2020-02
73
------------
2020-03
92
------------
2020-04
91
------------
2020-05
87
------------
2020-06
73
------------
2020-07
81
------------
2020-08
97
------------
2020-09
135
------------
2020-10
151
------------
2020-11
154

Page views statistics from a search engine

[user@host ~]$ zgrep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" 2020-11-24-shebangthedolphins.net.log.gz | sed 's/.*GET \(.*\.html\).*/\1/' | sort | uniq -c | sort -n
      1 /backup_burp.html
      1 /fr/windows_grouppolicy_execute_powershell_script.html
      1 /gnulinux_vnc_remotedesktop.html
      1 /prog_powershell_kesc.html
      1 /vpn_ipsec_03linux-to-windows_transport-psk.html
      1 /vpn_ipsec_06linux-to-linux_tunnel-x509.html
      1 /vpn_openvpn_windows_server.html
      1 /windows_icacls.html
      1 /windows_mssql_alwayson.html
      2 /fr/vpn_openvpn_buster.html
      6 /ubiquiti_ssh_commands.html
[user@host ~]$ zcat *.log.gz | grep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" | sed 's/.*GET \(.*\.html\).*/\1/' | sort | uniq -c | sort -n
[...]
     11 /gnulinux_vnc_remotedesktop.html
     11 /openbsd_packetfilter.html
     12 /fr/gnulinux_nftables_examples.html
     15 /windows_imule.html
     20 /backup_burp.html
     36 /fr/vpn_openvpn_buster.html
     38 /vpn_openvpn_windows_server.html
    112 /ubiquiti_ssh_commands.html
[user@host ~]$ YEAR=2020
[user@host ~]$ for i in $(seq -w 1 12); do echo "------------"; echo "$YEAR-$i"; zcat $YEAR-"$i"*.log.gz | grep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" | sed 's/.*GET \(.*\.html\).*/\1/' | sort | uniq -c | sort -n | wc -l; done 2>/dev/null
------------
2020-01
25
------------
2020-02
26
------------
2020-03
27
------------
2020-04
28
------------
2020-05
25
------------
2020-06
27
------------
2020-07
33
------------
2020-08
29
------------
2020-09
67
------------
2020-10
58
------------
2020-11
68

Script

A script that I use to quickly see evolution of the search results.

Code v1

#! /bin/sh
for LOGS in *.log.gz; do
	echo "-----------------------------"
	echo "LOGS : $LOGS"
	for i in www.google www.bing www.qwant duckduckgo; do
		RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://"$i"" $LOGS | wc -l)
		echo "$i = $RESULT"
	done
done

Output

-----------------------------
LOGS : 2020-11-21-shebangthedolphins.net.log.gz
www.google = 6
www.bing = 1
www.qwant = 0
duckduckgo = 4
-----------------------------
LOGS : 2020-11-22-shebangthedolphins.net.log.gz
www.google = 10
www.bing = 2
www.qwant = 0
duckduckgo = 6
-----------------------------
LOGS : 2020-11-23-shebangthedolphins.net.log.gz
www.google = 7
www.bing = 6
www.qwant = 2
duckduckgo = 1
-----------------------------
LOGS : 2020-11-24-shebangthedolphins.net.log.gz
www.google = 12
www.bing = 2
www.qwant = 0
duckduckgo = 3

Code v2

#! /bin/sh
for LOGS in "$1"*.log.gz; do
        TOTAL=0
        echo "-----------------------------"
        echo "LOGS : $LOGS"
        for i in www.google www.bing www.qwant duckduckgo www.ecosia.org; do
                RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://"$i"" $LOGS | wc -l)
                echo "$i = $RESULT"
                TOTAL=$(($TOTAL+$RESULT))
        done
        echo "TOTAL $(date -d "${LOGS%-shebangthedolphins.net.log.gz}" '+%A') : $TOTAL"
done

Output

[user@host ~]$ . ./scripts.sh 2021-02-1 
-----------------------------
LOGS : 2021-02-10-shebangthedolphins.net.log.gz
www.google = 31
www.bing = 15
www.qwant = 1
duckduckgo = 23
www.ecosia.org = 0
TOTAL wednesday : 70
-----------------------------
LOGS : 2021-02-11-shebangthedolphins.net.log.gz
www.google = 38
www.bing = 11
www.qwant = 2
duckduckgo = 24
www.ecosia.org = 0
TOTAL thursday : 75
-----------------------------

Windows/PowerShell

Page views statistics

PS C:\Users\std> $domain = "shebangthedolphins.net"
PS C:\Users\std> Select-String .\2020-11-05-shebangthedolphins.net.log -NotMatch -Pattern "Bytespider","Trident","bot","404","GET / HTTP","BingPreview","Seekport Crawler" | Select-String -Pattern "html" | %{"{0} {1}" -f $_.Line.ToString().Split(' ')[0],$_.Line.ToString().Split(' ')[10]} | Select-String -Pattern "$domain.*html" | Sort-Object | Get-Unique | %{"{0}" -f $_.Line.ToString().Split(' ')[1]} | group -NoElement | Sort-Object Count |  %{"{0} {1}" -f $_.Count, $_.Name }
1 "https://shebangthedolphins.net/fr/prog_introduction.html"
1 "https://shebangthedolphins.net/fr/prog_sh_check_snmp_synology.html"
1 "http://shebangthedolphins.net/openbsd_network_interfaces.html"
1 "https://shebangthedolphins.net/fr/menu.html"
1 "https://shebangthedolphins.net/fr/windows_commandes.html"
1 "https://shebangthedolphins.net/fr/windows_run_powershell_taskschd.html"
1 "http://shebangthedolphins.net/prog_sh_check_snmp_synology.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_reset.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_update_policy.html"
1 "https://shebangthedolphins.net/virtualization_kvm_windows10.html"
1 "https://shebangthedolphins.net/windows_event_on_usb.html"
1 "https://shebangthedolphins.net/prog_powershell_movenetfiles.html"
1 "http://shebangthedolphins.net/prog_autoit_backup.html"
1 "https://shebangthedolphins.net/fr/vpn_openvpn_buster.html"
1 "http://shebangthedolphins.net/prog_powershell_kesc.html"
1 "https://shebangthedolphins.net/index.html"
1 "http://shebangthedolphins.net/gnulinux_courier.html"
4 "https://shebangthedolphins.net/ubiquiti_ssh_commands.html"
6 "https://shebangthedolphins.net/vpn_openvpn_windows_server.html"

References

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Contact :