How To Download and Process OVH daily raw logs to extract useful information

Intro


I have a Web Hosting plan at OVH, and I'm not that happy with the website visit statistics tools it provides: Urchin is deprecated, OVHcloud Web Statistics is still young, and Awstats, which I find great, only shows daily stats.

This is why I decided to find a way to download the logs and process them manually.

Configuration

  • OS : GNU/Linux and Windows 10
  • wget : 1.21

Create dedicated log user account

The first thing to do is to create a dedicated log user from the OVHCloud Web Control Panel.

OVH | OVH main web interface
OVH | OVH login web interface
  • From the OVHCloud Web Control Panel, click on your hosting plan :
OVH | OVH OVHCloud Web Control Panel
  • From the Ribbon Menu click More+ then Statistics and logs :
OVH | Ribbon Menu
  • From Statistics and logs menu, click Create a new user :
OVH | Create a new user step 1
  • Set a user name, and click Next :
OVH | Create a new user step 2
  • Choose a password that meets the requirements, and click Next :
OVH | Create a new user step 3
  • Click Confirm to create the user :
OVH | Create a new user step 4
  • Copy the https://log.clusterXXX.hosting.ovh.net/YOUR_DOMAIN/ URL :
OVH | Create a new user step 5

We now have everything we need to download our logs.

Download Logs

GNU/Linux

Set variables :

  • Set your Log User variable :
[user@host ~]$ USR=ovhlogsuser
  • Set your Log Password variable :
[user@host ~]$ PASS=Myverycomplexpassw0rD
  • Set your URL variable :
[user@host ~]$ URL=https://log.clusterXXX.hosting.ovh.net/shebangthedolphins.net/
  • Set your Domain variable :
[user@host ~]$ DOMAIN=$(awk -F '/' '{ print $4 }' <<< $URL)
    • Or simply :
[user@host ~]$ DOMAIN=shebangthedolphins.net
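  • Optionally, check that the credentials and the URL work before downloading anything; wget's --spider option performs the request without saving any file :
[user@host ~]$ wget --spider --http-user="$USR" --http-password="$PASS" "$URL"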

Download

  • Download all logs for a given month to the current directory (quote the -A pattern so the shell doesn't expand it) :
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" -A "*.gz" -r -nd "${URL}logs/logs-10-2020/"
[user@host ~]$ ls -lh
total 596K
-rw------- 1 std std  55 10 déc.   2019 robots.txt.tmp
-rw-r--r-- 1 std std 24K  2 oct.  02:56 shebangthedolphins.net-01-10-2020.log.gz
-rw-r--r-- 1 std std 17K  3 oct.  02:19 shebangthedolphins.net-02-10-2020.log.gz
[…]
-rw-r--r-- 1 std std 14K 31 oct.  06:08 shebangthedolphins.net-30-10-2020.log.gz
-rw-r--r-- 1 std std 52K  1 nov.  06:08 shebangthedolphins.net-31-10-2020.log.gz
  • Download the last log to the current directory :
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" "$URL"/logs/logs-$(/bin/date --date='1 days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date='1 days ago' '+%d-%m-%Y').log.gz
[user@host ~]$ ls -lh
total 20K
-rw-r--r-- 1 std std 18K 25 nov.  06:29 shebangthedolphins.net-24-11-2020.log.gz
  • Reformat the file names from shebangthedolphins.net-dd-mm-yyyy.log.gz to yyyy-mm-dd-shebangthedolphins.net.log.gz with perl-rename :
[user@host ~]$ perl-rename -v 's/(.*)-(\d\d)-(\d\d)-(\d\d\d\d)(.*)/$4-$3-$2-$1$5/' *gz
[user@host ~]$ ls -lh
total 596K
-rw-r--r-- 1 std std 24K  2 oct.  02:56 2020-10-01-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 17K  3 oct.  02:19 2020-10-02-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 14K  4 oct.  02:32 2020-10-03-shebangthedolphins.net.log.gz
  • Download the last log to a specific directory (/tmp/) :
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" "$URL"/logs/logs-$(/bin/date --date='1 days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date='1 days ago' '+%d-%m-%Y').log.gz -O /tmp/$(/bin/date --date='1 days ago' '+%Y-%m-%d')-"$DOMAIN".log.gz
[user@host ~]$ ls -lh /tmp/*gz
-rw-r--r-- 1 std std 18K 25 nov.  06:29 /tmp/2020-11-24-shebangthedolphins.net.log.gz
  • Download the last 30 log files to /tmp/ (a small cron-ready wrapper for this daily download is sketched after this list) :
[user@host ~]$ for DAY in $(seq 1 30); do wget --http-user="$USR" --http-password="$PASS" "$URL"/logs/logs-$(/bin/date --date="$DAY days ago" '+%m-%Y')/"$DOMAIN"-$(/bin/date --date="$DAY days ago" '+%d-%m-%Y').log.gz -O /tmp/$(/bin/date --date="$DAY days ago" '+%Y-%m-%d')-"$DOMAIN".log.gz; done
[user@host ~]$ ls -lh /tmp/*gz
-rw-r--r-- 1 std std 17K 26 oct.  06:12 /tmp/2020-10-25-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 14K 27 oct.  06:44 /tmp/2020-10-26-shebangthedolphins.net.log.gz
[...]
-rw-r--r-- 1 std std 18K 24 nov.  06:38 /tmp/2020-11-23-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 18K 25 nov.  06:29 /tmp/2020-11-24-shebangthedolphins.net.log.gz
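  • To run the daily download from cron, the commands above can be wrapped in a small script. A minimal sketch, assuming a hypothetical archive directory /var/log/ovh (adapt USR, PASS and URL to your own account) :
#!/bin/sh
# Hypothetical wrapper: fetch yesterday's OVH log into an archive
# directory as yyyy-mm-dd-domain.log.gz, skipping already-downloaded files.
USR=ovhlogsuser
PASS=Myverycomplexpassw0rD
URL=https://log.clusterXXX.hosting.ovh.net/shebangthedolphins.net/
DOMAIN=$(echo "$URL" | awk -F '/' '{ print $4 }')
DEST=/var/log/ovh                             # hypothetical destination directory

MONTH=$(date --date='1 days ago' '+%m-%Y')    # OVH monthly folder, mm-yyyy
DAY=$(date --date='1 days ago' '+%d-%m-%Y')   # OVH file date, dd-mm-yyyy
OUT="$DEST/$(date --date='1 days ago' '+%Y-%m-%d')-$DOMAIN.log.gz"

mkdir -p "$DEST"
[ -f "$OUT" ] && exit 0                       # already fetched, nothing to do
wget -q --http-user="$USR" --http-password="$PASS" \
    "${URL}logs/logs-$MONTH/$DOMAIN-$DAY.log.gz" -O "$OUT"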

Windows/PowerShell

Set variables :

  • Set your Log User variable :
PS C:\Users\std> $user = "ovhlogsuser"
  • Set your Log Password variables :
PS C:\Users\std> $pass = "Myverycomplexpassw0rD"
PS C:\Users\std> $secpasswd = ConvertTo-SecureString $pass -AsPlainText -Force
PS C:\Users\std> $credential = New-Object System.Management.Automation.PSCredential($user, $secpasswd)
  • Set your Domain variable :
PS C:\Users\std> $domain = "shebangthedolphins.net"
  • Set your URL variable :
PS C:\Users\std> $url = "https://log.clusterXXX.hosting.ovh.net/$domain/"

Download

  • Download the last log in text format to the current directory :
PS C:\Users\std> Invoke-WebRequest -Credential $credential -Uri ("$url" + "logs/logs-" + $((Get-Date).AddDays(-1).ToString("MM-yyyy")) + "/$domain" + "-" + $((Get-Date).AddDays(-1).ToString("dd-MM-yyyy"))  + ".log.gz") -OutFile "$((Get-Date).AddDays(-1).ToString("yyyy-MM-dd"))-$domain.log"
PS C:\Users\std> dir


    Directory: C:\Users\std


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       05/12/2020     15:45         360238 2020-12-04-shebangthedolphins.net.log
  • Download the last 30 log files in text format :
PS C:\Users\std>  1..30 | ForEach-Object { Invoke-WebRequest -Credential $credential -Uri ("$url" + "logs/logs-" + $((Get-Date).AddDays(-"$_").ToString("MM-yyyy")) + "/$domain" + "-" + $((Get-Date).AddDays(-"$_").ToString("dd-MM-yyyy"))  + ".log.gz") -OutFile "$((Get-Date).AddDays(-"$_").ToString("yyyy-MM-dd"))-$domain.log" }

Extract information

GNU/Linux

Now that we've downloaded our log files, we can use the command line to extract useful information.
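The pipelines below rely on field positions, which implies the logs are in the classic Apache combined format (an assumption based on the fields the commands use). With awk splitting on spaces, a typical (anonymized, illustrative) line maps like this :
203.0.113.42 - - [24/Nov/2020:10:12:01 +0100] "GET /windows_icacls.html HTTP/1.1" 200 12345 "https://www.google.com/" "Mozilla/5.0 ..."
# $1 = client IP, $7 = requested path, $9 = status code, $11 = referer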

Page view statistics

  • List and count the most viewed pages for a specific log file (2020-11-24-shebangthedolphins.net.log.gz); the pipeline is broken down step by step after the output below :
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' 2020-11-24-shebangthedolphins.net.log.gz | grep html | awk '{ print $1" "$11 }' | sort | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n | tr -s "[ ]" | sed 's/^ //' | grep "$DOMAIN"
1 "http://shebangthedolphins.net/backup_burp.html"
1 "http://shebangthedolphins.net/backup_burp.html"
1 "http://shebangthedolphins.net/prog_autoit_backup.html"
1 "http://shebangthedolphins.net/prog_powershell_kesc.html"
1 "http://shebangthedolphins.net/vpn_ipsec_06linux-to-linux_tunnel-x509.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_execute_powershell_script.html"
1 "https://shebangthedolphins.net/gnulinux_courier.html"
1 "https://shebangthedolphins.net/gnulinux_vnc_remotedesktop.html"
1 "https://shebangthedolphins.net/vpn_openvpn_windows_server.html"
1 "https://shebangthedolphins.net/windows_icacls.html"
1 "http://www.shebangthedolphins.net/vpn_ipsec_03linux-to-windows_transport-psk.html"
2 "https://shebangthedolphins.net/windows_mssql_alwayson.html"
3 "https://shebangthedolphins.net/fr/vpn_openvpn_buster.html"
7 "https://shebangthedolphins.net/ubiquiti_ssh_commands.html"
  • List and count page views for all log files (*.log.gz) :
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ for i in *.log.gz; do echo "------------"; echo "$i"; zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' "$i" | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | wc -l; done
------------
2020-11-19-shebangthedolphins.net.log.gz
19
------------
2020-11-20-shebangthedolphins.net.log.gz
24
------------
2020-11-21-shebangthedolphins.net.log.gz
8
------------
2020-11-22-shebangthedolphins.net.log.gz
16
------------
2020-11-23-shebangthedolphins.net.log.gz
15
------------
2020-11-24-shebangthedolphins.net.log.gz
13
  • List and count page views for all log files (*.log.gz), grouped by month :
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ YEAR=2020
[user@host ~]$ for i in $(seq -w 1 12); do echo "------------"; echo "$YEAR-$i"; zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' $YEAR-"$i"*.log.gz | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | wc -l; done 2>/dev/null
------------
2020-01
101
------------
2020-02
73
------------
2020-03
92
------------
2020-04
91
------------
2020-05
87
------------
2020-06
73
------------
2020-07
81
------------
2020-08
97
------------
2020-09
135

Page view statistics from search engines

  • List and count page views from search engines (here google, bing, qwant and duckduckgo) for a specific log file (2021-08-20-shebangthedolphins.net.log.gz); an annotated version of the grep pattern follows the output :
[user@host ~]$ zgrep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" 2021-08-20-shebangthedolphins.net.log.gz | grep html | awk '{ print $1" "$7 }' | sort -n | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n
[…]
      8 /windows_grouppolicy_manage_searchbox.html
      9 /windows_icacls.html
     10 /vpn_openvpn_bullseye.html
     11 /gnulinux_vnc_remotedesktop.html
     11 /windows_rds_mfa.html
     12 /fr/vpn_openvpn_windows_server.html
     14 /windows_grouppolicy_shutdown.html
     15 /gnulinux_nftables_examples.html
     25 /vpn_openvpn_windows_server.html
     72 /ubiquiti_ssh_commands.html
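  • The pattern is dense; in extended-regex form (zgrep -E, equivalent match) it reads: a .html request line, then a 200 status, then a four-digit byte count, then a referer field opening with one of the four search engines :
[user@host ~]$ zgrep -E 'html HTTP.*200.*[0-9]{4} "(https://www.google|https://www.bing|https://www.qwant|https://duckduckgo)' 2021-08-20-shebangthedolphins.net.log.gz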
  • List and count page views from search engines (here google, bing, qwant and duckduckgo) for all log files (*.log.gz) :
[user@host ~]$ zcat *.log.gz | grep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" | grep html | awk '{ print $1" "$7 }' | sort -n | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n
[…]
     11 /gnulinux_vnc_remotedesktop.html
     11 /openbsd_packetfilter.html
     12 /fr/gnulinux_nftables_examples.html
     15 /windows_imule.html
     20 /backup_burp.html
     36 /fr/vpn_openvpn_buster.html
     38 /vpn_openvpn_windows_server.html
    112 /ubiquiti_ssh_commands.html
  • List and count page views from search engines (here google, bing, qwant and duckduckgo), by month :
[user@host ~]$ YEAR=2021
[user@host ~]$ for i in $(seq -w 1 12); do echo "------------"; echo "$YEAR-$i"; zcat $YEAR-"$i"*.log.gz | grep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" | grep html | awk '{ print $1" "$7 }' | sort -n | uniq | wc -l; done 2>/dev/null
------------
2021-01
1311
------------
2021-02
1650
------------
2021-03
2566
------------
2021-04
3511
------------
2021-05
5452
------------
2021-06
6922
------------
2021-07
6437
------------
2021-08
4788

Scripts

Here are the scripts I use to quickly see how search engine referrals evolve.

Script v1

  • First version: displays results by search engine.
Code
#! /bin/sh
# Count search engine referrals in each daily log file.
for LOGS in *.log.gz; do
	echo "-----------------------------"
	echo "LOGS : $LOGS"
	for i in www.google www.bing www.qwant duckduckgo; do
		# .html request, status 200, referer starting with the engine's URL
		RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://$i" "$LOGS" | wc -l)
		echo "$i = $RESULT"
	done
done
Output
-----------------------------
LOGS : 2020-11-21-shebangthedolphins.net.log.gz
www.google = 6
www.bing = 1
www.qwant = 0
duckduckgo = 4
-----------------------------
LOGS : 2020-11-22-shebangthedolphins.net.log.gz
www.google = 10
www.bing = 2
www.qwant = 0
duckduckgo = 6
-----------------------------
LOGS : 2020-11-23-shebangthedolphins.net.log.gz
www.google = 7
www.bing = 6
www.qwant = 2
duckduckgo = 1

Script v2

  • Slight improvements :
    • Accepts an argument to specify the period
    • Displays the daily total
Code
#! /bin/bash
# bash is required (the script uses <<<)
# Count search engine referrals per daily log file, for a given period prefix.
for LOGS in "$1"*.log.gz; do
        TOTAL=0
        echo "-----------------------------"
        echo "LOGS : $LOGS"
        for i in www.google www.bing www.qwant duckduckgo www.ecosia.org; do
                RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://$i" "$LOGS" | wc -l)
                echo "$i = $RESULT"
                TOTAL=$(($TOTAL+$RESULT))
        done
        # weekday name derived from the yyyy-mm-dd prefix of the file name
        echo "TOTAL $(date -d $(awk -F'-' '{ print $1"-"$2"-"$3 }' <<< $LOGS) '+%A') : $TOTAL"
done
Output
[user@host ~]$ bash ./std_ovh.sh 2021-02-1
-----------------------------
LOGS : 2021-02-10-shebangthedolphins.net.log.gz
www.google = 31
www.bing = 15
www.qwant = 1
duckduckgo = 23
www.ecosia.org = 0
TOTAL wednesday : 70
-----------------------------
LOGS : 2021-02-11-shebangthedolphins.net.log.gz
www.google = 38
www.bing = 11
www.qwant = 2
duckduckgo = 24
www.ecosia.org = 0
TOTAL thursday : 75
-----------------------------

Script v3

  • Huge improvements :
    • Can export to csv format (to a /tmp/stats.csv file) with the -c or -p argument
Code
#! /bin/bash
# bash is required (the script uses [[ ]], =~ and <<<)
# Role : Extract ovh logs stats
# Author : http://shebangthedolphins.net/

pages=false
csv=false

usage()
{
        echo "usage: ./std_ovh.sh YYYY-MM-DD"
        echo "[-c] : export total stats to /tmp/stats.csv file"
        echo "[-p <url>|<XX most viewed pages>] : export specific <url> or XX most viewed urls to /tmp/stats.csv file"
        echo "ex : ./std_ovh.sh 2021-03"
        echo "ex : ./std_ovh.sh 2021-03 -c"
        echo "ex : ./std_ovh.sh 2021-03 -p vpn_openvpn_windows_server.html"
        echo "ex : ./std_ovh.sh 2021-03 -p 10"
        exit 3
}

case "$1" in
        *)
                LOGS=$1
                shift    # Remove the first argument (which will be, for example, 2021-09-)
                while getopts "p:ch" OPTNAME; do
                        case "$OPTNAME" in
                                p)
                                        ARGP=${OPTARG}
                                        pages=true
                                        ;;
                                c)
                                        csv=true
                                        ;;
                                h)
                                        usage
                                        ;;
                                *)
                                        usage
                                        ;;
                        esac
                done
esac

# show help if no arguments or -c AND -p are set
if [[ ( -z "$LOGS" ) || ( $pages == "true" && $csv == "true" ) ]] ; then
    usage
fi

# create /tmp/stats.csv header
if $csv; then echo "date,google,bing,qwant,ddg,ecosia" > /tmp/stats.csv; fi
if $pages; then
        if [[ "$ARGP" =~ ^[0-9]+$ ]]; then
                HEAD=$ARGP
                HTML="html"
        else
                HTML=$ARGP
                HEAD=1
        fi
        for i in $(zgrep "$HTML HTTP.*200.*[0-9]\{4\} \"https://\(www.google\|www.bing\|www.qwant\|duckduckgo\|www.ecosia.org\)" $LOGS*.log.gz | sed 's/.*GET \(.*\) HTTP.*/\1/' | sort | uniq -c | sort -rn | head -n $HEAD | awk '{ print $2 }');
        do
                csv_header=$csv_header,$i
        done
        csv_header="date"$csv_header
        echo "$csv_header" > /tmp/stats.csv
        TMPFILE=$(mktemp) # create a temp file and store its path in TMPFILE. Used to improve performance (only the needed log lines are stored in it).
fi
for LOGS in $LOGS*.log.gz; do # the glob expands once, before LOGS is reused as the loop variable
        if $pages
        then
                csv_data=$(date -d $(awk -F'-' '{ print $1"-"$2"-"$3 }' <<< $LOGS) '+%Y.%m.%d')
                zgrep "html HTTP.*200.*[0-9]\{4\} \"https://\(www.google\|www.bing\|www.qwant\|duckduckgo\|www.ecosia.org\)" $LOGS > $TMPFILE #put interesting results inside TMPFILE
                for i in $(sed 's/,/\n/g' /tmp/stats.csv | grep "html")
                do
                        csv_data=$csv_data","$(zgrep "$i HTTP.*200.*[0-9]\{4\} \"https://\(www.google\|www.bing\|www.qwant\|duckduckgo\|www.ecosia.org\)" $TMPFILE | wc -l)
                done
                echo "$csv_data" >> /tmp/stats.csv

        else
                TOTAL=0
                echo "-----------------------------"
                echo "LOGS : $LOGS"
                CSV=$(awk -F"-" '{ print $1"-"$2"-"$3 }' <<< $LOGS)
                for i in www.google www.bing www.qwant duckduckgo www.ecosia.org; do
                        RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://"$i"" $LOGS | wc -l)
                        echo "$i = $RESULT"
                        TOTAL=$(($TOTAL+$RESULT))
                        if $csv; then
                                CSV=$CSV,"$RESULT"
                        fi
                done
                echo "TOTAL $(date -d $(awk -F'-' '{ print $1"-"$2"-"$3 }' <<< $LOGS) '+%A %d %b %Y') : $TOTAL"
                if $csv; then echo "$CSV" >> /tmp/stats.csv; fi
        fi
done
if $pages; then rm "$TMPFILE"; fi # remove TMPFILE
Usage / Output
  • Export stats to csv file :
[user@host ~]$ bash ./std_ovh.sh 2021- -c
[user@host ~]$ tail /tmp/stats.csv
date,google,bing,qwant,ddg,ecosia
2021-03-30,82,10,1,38,0
2021-03-31,87,26,5,32,0
[…]
2021-04-07,70,17,5,20,1
2021-04-08,71,19,5,29,0
  • Which allows us to create pretty graphs with LibreOffice :
OVH | Statistics graph with libreoffice
OVH stats under LibreOffice
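  • Or, staying in the terminal, the same csv file can be plotted with gnuplot (a sketch, assuming gnuplot is installed; the column numbers follow the date,google,bing,qwant,ddg,ecosia header) :
[user@host ~]$ gnuplot -persist <<'EOF'
set datafile separator ","
set xdata time
set timefmt "%Y-%m-%d"
set format x "%d/%m"
set key autotitle columnhead
plot "/tmp/stats.csv" using 1:2 with lines, '' using 1:3 with lines, '' using 1:5 with lines
EOF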
  • Export the three most visited pages' stats to a csv file :
[user@host ~]$ bash ./std_ovh.sh 2021-09 -p 3
[user@host ~]$ tail /tmp/stats.csv
date,/ubiquiti_ssh_commands.html,/vpn_openvpn_bullseye.html,/vpn_openvpn_windows_server.html
2021.09.01,89,28,39
2021.09.02,109,19,41
[…]
2021.09.06,82,33,44
2021.09.07,94,37,29
  • Export vpn_openvpn_bullseye.html stats to a csv file :
[user@host ~]$ bash ./std_ovh.sh 2021-09 -p vpn_openvpn_bullseye.html
[user@host ~]$ tail /tmp/stats.csv
date,/vpn_openvpn_bullseye.html
2021.09.01,28
2021.09.02,19
[…]
2021.09.06,33
2021.09.07,37

Windows/PowerShell

Page view statistics

  • List and count the most viewed pages for a specific log file (2020-11-05-shebangthedolphins.net.log) :
PS C:\ > $domain = "shebangthedolphins.net"
PS C:\ > Select-String .\2020-11-05-shebangthedolphins.net.log -NotMatch -Pattern "Bytespider","Trident","bot","404","GET / HTTP","BingPreview","Seekport Crawler" | Select-String -Pattern "html" | %{"{0} {1}" -f $_.Line.ToString().Split(' ')[0],$_.Line.ToString().Split(' ')[10]} | Select-String -Pattern "$domain.*html" | Sort-Object | Get-Unique | %{"{0}" -f $_.Line.ToString().Split(' ')[1]} | group -NoElement | Sort-Object Count |  %{"{0} {1}" -f $_.Count, $_.Name }
1 "https://shebangthedolphins.net/fr/prog_introduction.html"
1 "https://shebangthedolphins.net/fr/prog_sh_check_snmp_synology.html"
1 "http://shebangthedolphins.net/openbsd_network_interfaces.html"
1 "https://shebangthedolphins.net/fr/menu.html"
1 "https://shebangthedolphins.net/fr/windows_commandes.html"
1 "https://shebangthedolphins.net/fr/windows_run_powershell_taskschd.html"
1 "http://shebangthedolphins.net/prog_sh_check_snmp_synology.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_reset.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_update_policy.html"
1 "https://shebangthedolphins.net/virtualization_kvm_windows10.html"
1 "https://shebangthedolphins.net/windows_event_on_usb.html"
1 "https://shebangthedolphins.net/prog_powershell_movenetfiles.html"
1 "http://shebangthedolphins.net/prog_autoit_backup.html"
1 "https://shebangthedolphins.net/fr/vpn_openvpn_buster.html"
1 "http://shebangthedolphins.net/prog_powershell_kesc.html"
1 "https://shebangthedolphins.net/index.html"
1 "http://shebangthedolphins.net/gnulinux_courier.html"
4 "https://shebangthedolphins.net/ubiquiti_ssh_commands.html"
6 "https://shebangthedolphins.net/vpn_openvpn_windows_server.html"

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
