
How to download and process OVH daily raw logs to extract useful information

Intro


I have a web hosting plan at OVH, and I'm not that happy with the website visit statistics tools: Urchin is deprecated, OVHcloud Web Statistics is still young, and AWStats, which I find great, only shows daily stats.

This is the reason why I decided to find a way to download the logs and process them manually.

Configuration

  • OS : GNU/Linux and Windows 10
  • wget : 1.21

Create a dedicated log user account

The first thing to do is to create a dedicated log user from the OVHcloud Web Control Panel.

  • From the OVHcloud Web Control Panel, click on your hosting plan.
  • From the ribbon menu, click More+ then Statistics and logs.
  • From the Statistics and logs menu, click Create a new user.
  • Set a user name and click Next.
  • Set a password that meets the requirements and click Next.
  • Click Confirm to create the user.
  • Copy the https://log.clusterXXX.hosting.ovh.net/YOUR_DOMAIN/ URL.

We now have everything we need to download our logs.
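Before going further, you can check that the new account works. As a quick sanity check (replace the user, password and URL with your own values), wget's --spider option requests the URL without downloading anything and should report a 200 OK response :

[user@host ~]$ wget --spider --http-user=ovhlogsuser --http-password=Myverycomplexpassw0rD https://log.clusterXXX.hosting.ovh.net/YOUR_DOMAIN/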

Download Logs

GNU/Linux

Set variables :

  • Set your Log User variable :
[user@host ~]$ USR=ovhlogsuser
  • Set your Log Password variable :
[user@host ~]$ PASS=Myverycomplexpassw0rD
  • Set your URL variable :
[user@host ~]$ URL=https://log.clusterXXX.hosting.ovh.net/shebangthedolphins.net/
  • Set your Domain variable :
[user@host ~]$ DOMAIN=$(awk -F '/' '{ print $4 }' <<< $URL)
  • Or simply :
[user@host ~]$ DOMAIN=shebangthedolphins.net
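To reuse these variables across shells and scripts, one option is to keep them in a small file and source it. A minimal sketch (the ~/.ovhlogs.conf path is just an example; keep the file private since it contains the password) :

[user@host ~]$ cat > ~/.ovhlogs.conf << 'EOF'
USR=ovhlogsuser
PASS=Myverycomplexpassw0rD
URL=https://log.clusterXXX.hosting.ovh.net/shebangthedolphins.net/
DOMAIN=shebangthedolphins.net
EOF
[user@host ~]$ chmod 600 ~/.ovhlogs.conf
[user@host ~]$ . ~/.ovhlogs.conf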

Download

  • Download all logs for a given month into the current directory :
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" -A "*.gz" -r -nd "$URL"logs/logs-10-2020/
[user@host ~]$ ls -lh
total 596K
-rw------- 1 std std  55 10 Dec.  2019 robots.txt.tmp
-rw-r--r-- 1 std std 24K  2 Oct. 02:56 shebangthedolphins.net-01-10-2020.log.gz
-rw-r--r-- 1 std std 17K  3 Oct. 02:19 shebangthedolphins.net-02-10-2020.log.gz
[…]
-rw-r--r-- 1 std std 14K 31 Oct. 06:08 shebangthedolphins.net-30-10-2020.log.gz
-rw-r--r-- 1 std std 52K  1 Nov. 06:08 shebangthedolphins.net-31-10-2020.log.gz
  • Download the last log into the current directory :
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" "$URL"logs/logs-$(/bin/date --date='1 days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date='1 days ago' '+%d-%m-%Y').log.gz
[user@host ~]$ ls -lh
total 20K
-rw-r--r-- 1 std std 18K 25 Nov. 06:29 shebangthedolphins.net-24-11-2020.log.gz
  • Reformat the file names from shebangthedolphins.net-dd-mm-yyyy.log.gz to yyyy-mm-dd-shebangthedolphins.net.log.gz with perl-rename :
[user@host ~]$ perl-rename -v 's/(.*)-(\d\d)-(\d\d)-(\d\d\d\d)(.*)/$4-$3-$2-$1$5/' *gz
[user@host ~]$ ls -lh
total 596K
-rw-r--r-- 1 std std 24K  2 Oct. 02:56 2020-10-01-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 17K  3 Oct. 02:19 2020-10-02-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 14K  4 Oct. 02:32 2020-10-03-shebangthedolphins.net.log.gz
  • Download the last log to a specific directory (/tmp/) :
[user@host ~]$ wget --http-user="$USR" --http-password="$PASS" "$URL"logs/logs-$(/bin/date --date='1 days ago' '+%m-%Y')/"$DOMAIN"-$(/bin/date --date='1 days ago' '+%d-%m-%Y').log.gz -O /tmp/$(/bin/date --date='1 days ago' '+%Y-%m-%d')-"$DOMAIN".log.gz
[user@host ~]$ ls -lh /tmp/*gz
-rw-r--r-- 1 std std 18K 25 Nov. 06:29 /tmp/2020-11-24-shebangthedolphins.net.log.gz
  • Download the last 30 log files into /tmp/ :
[user@host ~]$ for DAY in $(seq 1 30); do wget --http-user="$USR" --http-password="$PASS" "$URL"logs/logs-$(/bin/date --date="$DAY days ago" '+%m-%Y')/"$DOMAIN"-$(/bin/date --date="$DAY days ago" '+%d-%m-%Y').log.gz -O /tmp/$(/bin/date --date="$DAY days ago" '+%Y-%m-%d')-"$DOMAIN".log.gz; done
[user@host ~]$ ls -lh /tmp/*gz
-rw-r--r-- 1 std std 17K 26 Oct. 06:12 /tmp/2020-10-25-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 14K 27 Oct. 06:44 /tmp/2020-10-26-shebangthedolphins.net.log.gz
[...]
-rw-r--r-- 1 std std 18K 24 Nov. 06:38 /tmp/2020-11-23-shebangthedolphins.net.log.gz
-rw-r--r-- 1 std std 18K 25 Nov. 06:29 /tmp/2020-11-24-shebangthedolphins.net.log.gz
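To fetch yesterday's log automatically, the previous command can be wrapped in a small script run from cron. A minimal sketch, assuming the ~/.ovhlogs.conf file suggested above and a /var/log/ovh/ destination directory (both paths are examples, adapt them) :

#! /bin/sh
# Download yesterday's OVH raw log (sketch, adapt the paths to your setup)
. "$HOME"/.ovhlogs.conf
DEST=/var/log/ovh
mkdir -p "$DEST"
MONTH=$(/bin/date --date='1 days ago' '+%m-%Y')
OVHDATE=$(/bin/date --date='1 days ago' '+%d-%m-%Y')
ISODATE=$(/bin/date --date='1 days ago' '+%Y-%m-%d')
wget -q --http-user="$USR" --http-password="$PASS" "$URL"logs/logs-"$MONTH"/"$DOMAIN"-"$OVHDATE".log.gz -O "$DEST"/"$ISODATE"-"$DOMAIN".log.gz

A crontab entry such as 30 6 * * * /path/to/ovh_download.sh then runs it every morning (the 06:30 time is arbitrary).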

Windows/PowerShell

Set variables :

  • Set your Log User variable :
PS C:\Users\std> $user = "ovhlogsuser"
  • Set your Log Password variable :
PS C:\Users\std> $pass = "Myverycomplexpassw0rD"
PS C:\Users\std> $secpasswd = ConvertTo-SecureString $pass -AsPlainText -Force
PS C:\Users\std> $credential = New-Object System.Management.Automation.PSCredential($user, $secpasswd)
  • Set your Domain variable :
PS C:\Users\std> $domain = "shebangthedolphins.net"
  • Set your URL variable :
PS C:\Users\std> $url = "https://log.clusterXXX.hosting.ovh.net/$domain/"

Download

  • Download the last log in text format into the current directory :
PS C:\Users\std> Invoke-WebRequest -Credential $credential -Uri ("$url" + "logs/logs-" + $((Get-Date).AddDays(-1).ToString("MM-yyyy")) + "/$domain" + "-" + $((Get-Date).AddDays(-1).ToString("dd-MM-yyyy")) + ".log.gz") -OutFile "$((Get-Date).AddDays(-1).ToString("yyyy-MM-dd"))-$domain.log"
PS C:\Users\std> dir

    Directory: C:\Users\std

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         05/12/2020     15:45        360238 2020-12-04-shebangthedolphins.net.log
  • Download the last 30 log files in text format :
PS C:\Users\std> 1..30 | ForEach-Object { Invoke-WebRequest -Credential $credential -Uri ("$url" + "logs/logs-" + $((Get-Date).AddDays(-"$_").ToString("MM-yyyy")) + "/$domain" + "-" + $((Get-Date).AddDays(-"$_").ToString("dd-MM-yyyy")) + ".log.gz") -OutFile "$((Get-Date).AddDays(-"$_").ToString("yyyy-MM-dd"))-$domain.log" }

Extract information

GNU/Linux

Now that we've downloaded our log files, we can use the command line to extract useful information.
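For reference, the fields used below follow from the Apache combined log format of the raw logs, split on spaces by awk: $1 is the client IP, $7 the requested path, $9 the status code and $11 the referer. An illustrative (made-up) line :

203.0.113.42 - - [24/Nov/2020:10:15:32 +0100] "GET /ubiquiti_ssh_commands.html HTTP/1.1" 200 12345 "https://www.google.com/" "Mozilla/5.0 (X11; Linux x86_64)"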

Page view statistics

  • List and count the most viewed pages for a specific log file (2020-11-24-shebangthedolphins.net.log.gz) :
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' 2020-11-24-shebangthedolphins.net.log.gz | grep html | awk '{ print $1" "$11 }' | sort | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n | tr -s "[ ]" | sed 's/^ //' | grep "$DOMAIN"
1 "http://shebangthedolphins.net/backup_burp.html"
1 "http://shebangthedolphins.net/prog_autoit_backup.html"
1 "http://shebangthedolphins.net/prog_powershell_kesc.html"
1 "http://shebangthedolphins.net/vpn_ipsec_06linux-to-linux_tunnel-x509.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_execute_powershell_script.html"
1 "https://shebangthedolphins.net/gnulinux_courier.html"
1 "https://shebangthedolphins.net/gnulinux_vnc_remotedesktop.html"
1 "https://shebangthedolphins.net/vpn_openvpn_windows_server.html"
1 "https://shebangthedolphins.net/windows_icacls.html"
1 "http://www.shebangthedolphins.net/vpn_ipsec_03linux-to-windows_transport-psk.html"
2 "https://shebangthedolphins.net/windows_mssql_alwayson.html"
3 "https://shebangthedolphins.net/fr/vpn_openvpn_buster.html"
7 "https://shebangthedolphins.net/ubiquiti_ssh_commands.html"
  • List and count page views for all log files (*.log.gz) :
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ for i in *.log.gz; do echo "------------"; echo "$i"; zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' "$i" | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | wc -l; done
------------
2020-11-19-shebangthedolphins.net.log.gz
19
------------
2020-11-20-shebangthedolphins.net.log.gz
24
------------
2020-11-21-shebangthedolphins.net.log.gz
8
------------
2020-11-22-shebangthedolphins.net.log.gz
16
------------
2020-11-23-shebangthedolphins.net.log.gz
15
------------
2020-11-24-shebangthedolphins.net.log.gz
13
  • List and count page views for all log files (*.log.gz) and by month :
[user@host ~]$ DOMAIN=shebangthedolphins.net
[user@host ~]$ YEAR=2020
[user@host ~]$ for i in $(seq -w 1 12); do echo "------------"; echo "$YEAR-$i"; zgrep -viE 'Bytespider|Trident|bot|404|GET \/ HTTP|BingPreview|Seekport Crawler' $YEAR-"$i"*.log.gz | grep html | awk '{ print $1" "$11 }' | grep "$DOMAIN" | sort | uniq | awk '{ print $2 }' | wc -l; done 2>/dev/null
------------
2020-01
101
------------
2020-02
73
------------
2020-03
92
------------
2020-04
91
------------
2020-05
87
------------
2020-06
73
------------
2020-07
81
------------
2020-08
97
------------
2020-09
135
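The same exclusion filter can be reused for other metrics. For example, a rough count of unique client IPs per log file (a sketch: IP counting under-represents visitors behind shared NATs, and crawlers not caught by the exclusion list are still included) :

[user@host ~]$ for i in *.log.gz; do printf '%s : ' "$i"; zgrep -viE 'Bytespider|Trident|bot|BingPreview|Seekport Crawler' "$i" | awk '{ print $1 }' | sort -u | wc -l; done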

Page view statistics from search engines

  • List and count page views from search engines (here google, bing, qwant and duckduckgo) for a specific log file (2021-08-20-shebangthedolphins.net.log.gz) :
[user@host ~]$ zgrep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" 2021-08-20-shebangthedolphins.net.log.gz | grep html | awk '{ print $1" "$7 }' | sort -n | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n
[…]
8 /windows_grouppolicy_manage_searchbox.html
9 /windows_icacls.html
10 /vpn_openvpn_bullseye.html
11 /gnulinux_vnc_remotedesktop.html
11 /windows_rds_mfa.html
12 /fr/vpn_openvpn_windows_server.html
14 /windows_grouppolicy_shutdown.html
15 /gnulinux_nftables_examples.html
25 /vpn_openvpn_windows_server.html
72 /ubiquiti_ssh_commands.html
  • List and count page views from search engines (here for google, bing, qwant and duckduckgo) and for all log files (*.log.gz) :
[user@host ~]$ zcat *.log.gz | grep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" | grep html | awk '{ print $1" "$7 }' | sort -n | uniq | awk '{ print $2 }' | sort | uniq -c | sort -n
[…]
11 /gnulinux_vnc_remotedesktop.html
11 /openbsd_packetfilter.html
12 /fr/gnulinux_nftables_examples.html
15 /windows_imule.html
20 /backup_burp.html
36 /fr/vpn_openvpn_buster.html
38 /vpn_openvpn_windows_server.html
112 /ubiquiti_ssh_commands.html
  • List and count page views from search engines (here google, bing, qwant and duckduckgo), viewed by month :
[user@host ~]$ YEAR=2021
[user@host ~]$ for i in $(seq -w 1 12); do echo "------------"; echo "$YEAR-$i"; zcat $YEAR-"$i"*.log.gz | grep "html HTTP.*200.*[0-9]\{4\} \"\(https://www.google\|https://www.bing\|https://www.qwant\|https://duckduckgo\)" | grep html | awk '{ print $1" "$7 }' | sort -n | uniq | wc -l; done 2>/dev/null
------------
2021-01
1311
------------
2021-02
1650
------------
2021-03
2566
------------
2021-04
3511
------------
2021-05
5452
------------
2021-06
6922
------------
2021-07
6437
------------
2021-08
4788

Scripts

Scripts that I use to quickly see the evolution of search engine results.

Script v1

  • First version, displays results by search engine.
Code
#! /bin/sh
for LOGS in *.log.gz; do
    echo "-----------------------------"
    echo "LOGS : $LOGS"
    for i in www.google www.bing www.qwant duckduckgo; do
        RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://"$i"" $LOGS | wc -l)
        echo "$i = $RESULT"
    done
done
Output
-----------------------------
LOGS : 2020-11-21-shebangthedolphins.net.log.gz
www.google = 6
www.bing = 1
www.qwant = 0
duckduckgo = 4
-----------------------------
LOGS : 2020-11-22-shebangthedolphins.net.log.gz
www.google = 10
www.bing = 2
www.qwant = 0
duckduckgo = 6
-----------------------------
LOGS : 2020-11-23-shebangthedolphins.net.log.gz
www.google = 7
www.bing = 6
www.qwant = 2
duckduckgo = 1

Script v2

  • Slight improvements :
    • Accepts an argument to specify the period
    • Displays the total
Code
#! /bin/sh
for LOGS in "$1"*.log.gz; do
    TOTAL=0
    echo "-----------------------------"
    echo "LOGS : $LOGS"
    for i in www.google www.bing www.qwant duckduckgo www.ecosia.org; do
        RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://"$i"" $LOGS | wc -l)
        echo "$i = $RESULT"
        TOTAL=$(($TOTAL+$RESULT))
    done
    echo "TOTAL $(date -d $(awk -F'-' '{ print $1"-"$2"-"$3 }' <<< $LOGS) '+%A') : $TOTAL"
done
Output
[user@host ~]$ sh ./std_ovh.sh 2021-02-1
-----------------------------
LOGS : 2021-02-10-shebangthedolphins.net.log.gz
www.google = 31
www.bing = 15
www.qwant = 1
duckduckgo = 23
www.ecosia.org = 0
TOTAL wednesday : 70
-----------------------------
LOGS : 2021-02-11-shebangthedolphins.net.log.gz
www.google = 38
www.bing = 11
www.qwant = 2
duckduckgo = 24
www.ecosia.org = 0
TOTAL thursday : 75
-----------------------------

Script v3

  • Huge improvements :
    • Can export to csv format (to a /tmp/stats.csv file) with the -c or -p argument
Code
#! /bin/sh
# Role : Extract stats from OVH logs
# Author : http://shebangthedolphins.net/

pages=false
csv=false

usage() {
    echo "usage: ./std_ovh.sh YYYY-MM-DD"
    echo "[-c] : export total stats to /tmp/stats.csv file"
    echo "[-p <url>|<XX most viewed pages>] : export specific <url> or XX most viewed urls to /tmp/stats.csv file"
    echo "ex : ./std_ovh.sh 2021-03"
    echo "ex : ./std_ovh.sh 2021-03 -c"
    echo "ex : ./std_ovh.sh 2021-03 -p vpn_openvpn_windows_server.html"
    echo "ex : ./std_ovh.sh 2021-03 -p 10"
    exit 3
}

case "$1" in
    *)
        LOGS=$1
        shift # remove the first argument (which will be, for example, 2021-09-)
        while getopts "p:ch" OPTNAME; do
            case "$OPTNAME" in
                p) ARGP=${OPTARG}; pages=true ;;
                c) csv=true ;;
                h) usage ;;
                *) usage ;;
            esac
        done
        ;;
esac

# show help if no argument is given or if -c AND -p are both set
if [[ ( -z "$LOGS" ) || ( $pages == "true" && $csv == "true" ) ]]; then
    usage
fi

# create the /tmp/stats.csv header
if $csv; then echo "date,google,bing,qwant,ddg,ecosia" > /tmp/stats.csv; fi

if $pages; then
    if [[ "$ARGP" =~ ^[0-9]+$ ]]; then # -p got a number : export the XX most viewed pages
        HEAD=$ARGP
        HTML="html"
    else # -p got a url : export only that page
        HTML=$ARGP
        HEAD=1
    fi
    for i in $(zgrep "$HTML HTTP.*200.*[0-9]\{4\} \"https://\(www.google\|www.bing\|www.qwant\|duckduckgo\|www.ecosia.org\)" $LOGS*.log.gz | sed 's/.*GET \(.*\) HTTP.*/\1/' | sort | uniq -c | sort -rn | head -n $HEAD | awk '{ print $2 }'); do
        csv_header=$csv_header,$i
    done
    csv_header="date"$csv_header
    echo "$csv_header" > /tmp/stats.csv
    TMPFILE=$(mktemp) # create a temp file and store its path in the TMPFILE variable; used to improve performance (we store only the log lines we need in it)
fi

for LOGS in $(ls -1 $LOGS*.log.gz); do
    if $pages; then
        csv_data=$(date -d $(awk -F'-' '{ print $1"-"$2"-"$3 }' <<< $LOGS) '+%Y.%m.%d')
        zgrep "html HTTP.*200.*[0-9]\{4\} \"https://\(www.google\|www.bing\|www.qwant\|duckduckgo\|www.ecosia.org\)" $LOGS > $TMPFILE # put the interesting lines inside TMPFILE
        for i in $(sed 's/,/\n/g' /tmp/stats.csv | grep "html"); do
            csv_data=$csv_data","$(zgrep "$i HTTP.*200.*[0-9]\{4\} \"https://\(www.google\|www.bing\|www.qwant\|duckduckgo\|www.ecosia.org\)" $TMPFILE | wc -l)
        done
        echo "$csv_data" >> /tmp/stats.csv
    else
        TOTAL=0
        echo "-----------------------------"
        echo "LOGS : $LOGS"
        CSV=$(awk -F"-" '{ print $1"-"$2"-"$3 }' <<< $LOGS)
        for i in www.google www.bing www.qwant duckduckgo www.ecosia.org; do
            RESULT=$(zgrep "html HTTP.*200.*[0-9]\{4\} \"https://"$i"" $LOGS | wc -l)
            echo "$i = $RESULT"
            TOTAL=$(($TOTAL+$RESULT))
            if $csv; then CSV=$CSV,"$RESULT"; fi
        done
        echo "TOTAL $(date -d $(awk -F'-' '{ print $1"-"$2"-"$3 }' <<< $LOGS) '+%A %d %b %Y') : $TOTAL"
        if $csv; then echo "$CSV" >> /tmp/stats.csv; fi
    fi
done

if $pages; then rm $TMPFILE; fi # remove TMPFILE
Usage / Output
  • Export stats to csv file :
[user@host ~]$ sh ./std_ovh.sh 2021- -c
[user@host ~]$ tail /tmp/stats.csv
date,google,bing,qwant,ddg,ecosia
2021-03-30,82,10,1,38,0
2021-03-31,87,26,5,32,0
[…]
2021-04-07,70,17,5,20,1
2021-04-08,71,19,5,29,0
  • This allows us to create pretty graphs with LibreOffice (a gnuplot alternative is sketched after this list) :
OVH stats under LibreOffice
  • Export the stats of the three most viewed pages to a csv file :
[user@host ~]$ sh ./std_ovh.sh 2021-09 -p 3
[user@host ~]$ tail /tmp/stats.csv
date,/ubiquiti_ssh_commands.html,/vpn_openvpn_bullseye.html,/vpn_openvpn_windows_server.html
2021.09.01,89,28,39
2021.09.02,109,19,41
[…]
2021.09.06,82,33,44
2021.09.07,94,37,29
  • Export vpn_openvpn_bullseye.html stats to csv file :
[user@host ~]$ sh ./std_ovh.sh 2021-09 -p vpn_openvpn_bullseye.html
[user@host ~]$ tail /tmp/stats.csv
date,/vpn_openvpn_bullseye.html
2021.09.01,28
2021.09.02,19
[…]
2021.09.06,33
2021.09.07,37
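To stay on the command line, the same /tmp/stats.csv can also be plotted with gnuplot instead of LibreOffice. A minimal sketch, assuming gnuplot is installed and that the file was produced with -c (dates in YYYY-MM-DD form, first line being a header) :

[user@host ~]$ gnuplot << 'EOF'
set datafile separator ','
set key autotitle columnhead  # take the curve titles from the csv header
set xdata time
set timefmt '%Y-%m-%d'
set format x '%d/%m'
set terminal png size 1000,400
set output '/tmp/stats.png'
plot for [col=2:6] '/tmp/stats.csv' using 1:col with lines
EOF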

Windows/PowerShell

Page view statistics

  • List and count the most viewed pages for a specific log file (2020-11-05-shebangthedolphins.net.log) :
PS C:\ > $domain = "shebangthedolphins.net"
PS C:\ > Select-String .\2020-11-05-shebangthedolphins.net.log -NotMatch -Pattern "Bytespider","Trident","bot","404","GET / HTTP","BingPreview","Seekport Crawler" | Select-String -Pattern "html" | %{"{0} {1}" -f $_.Line.ToString().Split(' ')[0],$_.Line.ToString().Split(' ')[10]} | Select-String -Pattern "$domain.*html" | Sort-Object | Get-Unique | %{"{0}" -f $_.Line.ToString().Split(' ')[1]} | group -NoElement | Sort-Object Count | %{"{0} {1}" -f $_.Count, $_.Name }
1 "https://shebangthedolphins.net/fr/prog_introduction.html"
1 "https://shebangthedolphins.net/fr/prog_sh_check_snmp_synology.html"
1 "http://shebangthedolphins.net/openbsd_network_interfaces.html"
1 "https://shebangthedolphins.net/fr/menu.html"
1 "https://shebangthedolphins.net/fr/windows_commandes.html"
1 "https://shebangthedolphins.net/fr/windows_run_powershell_taskschd.html"
1 "http://shebangthedolphins.net/prog_sh_check_snmp_synology.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_reset.html"
1 "https://shebangthedolphins.net/fr/windows_grouppolicy_update_policy.html"
1 "https://shebangthedolphins.net/virtualization_kvm_windows10.html"
1 "https://shebangthedolphins.net/windows_event_on_usb.html"
1 "https://shebangthedolphins.net/prog_powershell_movenetfiles.html"
1 "http://shebangthedolphins.net/prog_autoit_backup.html"
1 "https://shebangthedolphins.net/fr/vpn_openvpn_buster.html"
1 "http://shebangthedolphins.net/prog_powershell_kesc.html"
1 "https://shebangthedolphins.net/index.html"
1 "http://shebangthedolphins.net/gnulinux_courier.html"
4 "https://shebangthedolphins.net/ubiquiti_ssh_commands.html"
6 "https://shebangthedolphins.net/vpn_openvpn_windows_server.html"


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
