User:Dsimic/Traffic stats calculation

__TOC__

Automated monthly statistics calculation

{{See also
| User talk:Dsimic#Wikiviewstats utility
| User talk:Dsimic#An alternative data source for traffic stats calculation
| User:Dsimic#CREATED-ARTICLES
| l3 = User:Dsimic § Articles I've created
}}

{{Details
| Wikipedia:Statistics
| WP:VIEWSSTATS{{!}}Wikipedia:Statistics § Page views
| Wikipedia:Pageview statistics
}}

Below is a fairly simple PHP program that fetches the monthly page view statistics provided in JSON format by the Pageview API (a public API developed and maintained by the Wikimedia Foundation; see also its detailed [https://wikimedia.org/api/rest_v1/?doc#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end REST API documentation]) for a specified list of articles, and calculates their total monthly views and average views per day. The fetched page view statistics don't include spider- or bot-generated traffic. The program is intended to be run interactively from a command-line interface (CLI); instead of running it locally, on a machine capable of executing PHP scripts, you can also use one of the freely available online PHP development environments.
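To illustrate the underlying API call, here's a minimal sketch that fetches and sums one article's daily view counts for January 2016; the article title, project, and date range are example values only, and the sketch assumes PHP's OpenSSL support is enabled so that file_get_contents() can retrieve HTTPS URLs:

<syntaxhighlight lang="php">
<?php
// Minimal single-article request against the Pageview API; the article
// title, project, and date range below are example values only.
$url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/' .
       'en.wikipedia.org/all-access/user/' .
       rawurlencode('SATA Express') . '/daily/20160101/20160131';

// a descriptive User-Agent header is expected when querying Wikimedia APIs;
// the value below is just a placeholder
$context = stream_context_create(array('http' => array(
    'header' => "User-Agent: example-pageviews-fetcher/1.0\r\n")));
$json = json_decode(file_get_contents($url, false, $context), true);

$views = 0;
foreach ($json['items'] as $item) // one entry per day
    $views += $item['views'];
echo('Total views: ' . number_format($views) . "\n");
?>
</syntaxhighlight>

Each element of the returned items array describes one day, with the actual number of page views stored under the views key; the full program below does the same processing, just for many articles in parallel and with error handling.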

Initially, this program used the page view statistics provided by {{URL|http://stats.grok.se/}} in JSON format, but that web service unfortunately stopped being updated around mid-January 2016, and it remains defunct {{As of|2016|06|lc=yes}}. If needed, you can also have a look at that older version of the program code and documentation.

As with pretty much everything else here on Wikipedia, I'm releasing this program code under the terms of the CC BY-SA 3.0 license, so please feel free to use it and modify it according to your needs. Of course, feel free to use my talk page to {{Edit|User talk:Dsimic|section=new|leave me a message}} in case you have any questions, suggestions, bug reports, etc.

= Source code =

Before running this program, you need to modify the list of articles contained in the $articles variable (the code below contains the list of articles [https://tools.wmflabs.org/sigma/created.py?name=Dsimic&server=enwiki&max=100&ns=,,&redirects=none I've created] or started), and to set the month and year for which statistics are to be fetched and calculated, which are specified through the FETCH_MONTH and FETCH_YEAR constants, respectively. When the program is configured to calculate statistics for the current month, it takes into account only the whole (elapsed) days; as a result, running the program on the first day of a month to calculate current-month statistics isn't supported. Also, in case whole days are missing from the statistics data available through the Pageview API, the program doesn't count such zero-page-views days when calculating the averages. The FETCH_PROJECT constant selects the encyclopedia: en.wikipedia.org is for the English Wikipedia, de.wikipedia.org is for the German Wikipedia, etc.
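As a small worked example of that whole-days logic, the following sketch hard-codes a hypothetical run date of June 15, 2016, and reproduces how the program derives the fetched date range when configured for the current month:

<syntaxhighlight lang="php">
<?php
// Worked example of the whole-days logic, with a hypothetical run date
// of June 15, 2016; the "current date" values are hard-coded here.
define('FETCH_MONTH', '06');
define('FETCH_YEAR', '2016');
$day_of_month = 15;    // hypothetical "today"
$current_month = true; // FETCH_MONTH matches the current month

$days_total = !$current_month
              ? cal_days_in_month(CAL_GREGORIAN, FETCH_MONTH, FETCH_YEAR)
              : ($day_of_month - 1); // only whole/elapsed days: 14
$fetch_range = FETCH_YEAR . FETCH_MONTH . '01/' .
               FETCH_YEAR . FETCH_MONTH . sprintf('%02d', $days_total);
echo("{$fetch_range}\n"); // prints "20160601/20160614"
?>
</syntaxhighlight>

For a past month, the program instead covers the whole month, using cal_days_in_month() to determine its length.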

Just as a note, getting the ready-to-run PHP code of this program is as easy as viewing the wiki source of this page and copying what's between the <syntaxhighlight> and </syntaxhighlight> tags. The program code below is the latest available version, and it is updated on this page whenever improvements or bugfixes are implemented.

<syntaxhighlight lang="php">
<?php
define('FETCH_MONTH', '01');                 // MM
define('FETCH_YEAR', '2016');                // YYYY
define('FETCH_PROJECT', 'en.wikipedia.org'); // "en.wikipedia.org", "de.wikipedia.org", etc.

$articles = array('Stagefright (bug)',
                  'Row hammer',
                  'Address generation unit',
                  'UniDIMM',
                  'kdump (Linux)',
                  'kernfs (BSD)',
                  'kernfs (Linux)',
                  'ftrace',
                  'Android Runtime',
                  'WebScaleSQL',
                  'Intel X99',
                  'HipHop Virtual Machine',
                  'kpatch',
                  'kGraft',
                  'CoreOS',
                  'ARM Cortex-A17',
                  'Solid-state storage',
                  'Port Control Protocol',
                  'zswap',
                  'Emdebian Grip',
                  'ThinkPad 8',
                  'Laravel',
                  'OpenLMI',
                  'Open vSwitch',
                  'Distributed Overlay Virtual Ethernet',
                  'Management Component Transport Protocol',
                  'Buildroot',
                  'dm-cache',
                  'bcache',
                  'SATA Express',
                  'OpenZFS',
                  'List of Eurocrem packages',
                  'M.2',
                  'Eurocrem');

// ---------------------------------------------
// obviously, configurable stuff ends here
// ---------------------------------------------

define('CHUNK_SIZE', 10);  // articles, imposed by the Pageview API rate limit (see below)
define('CHUNK_SLEEP', 1);  // seconds, also related to the API rate limit

define('EXIT_SUCCESS', 0); // program exit codes
define('EXIT_FAILURE', 1);

set_time_limit(0);
ini_set('memory_limit', 67108864); // 64 MB
ini_set('default_socket_timeout', 90);

// a few short helper functions
function plural_output($value, $unit) {
    return (number_format($value) . " {$unit}" . ((abs($value) != 1) ? 's' : ''));
}

function progress_message($message = '.') {
    static $last_message = null;

    $now = microtime(true);
    $ret_val = false;
    if (($last_message === null) ||
        (($now - $last_message) > 0.5)) { // one message every 0.5 seconds
        echo($message);
        $last_message = $now;
        $ret_val = true; // the message has been printed
    }
    return ($ret_val);
}

// prepare the cURL handles for all articles
echo("\nFetching statistics data: ");
$start_time = microtime(true);
$handles = array();
$articles_total = count($articles);
$day_of_month = @date('j');
$current_month = (FETCH_MONTH == @date('m'));

if ($articles_total == 0) { // a small sanity check
    echo("no articles specified!\n");
    exit(EXIT_FAILURE);
}
if ($current_month && ($day_of_month == 1)) {    // account only the whole days, also knowing
    echo("no elapsed days in current month!\n"); // that the Pageview API rejects invalid dates
    exit(EXIT_FAILURE);
}

$days_total = !$current_month
              ? cal_days_in_month(CAL_GREGORIAN, FETCH_MONTH, FETCH_YEAR)
              : ($day_of_month - 1);
$fetch_range = FETCH_YEAR . FETCH_MONTH . '01/' .
               FETCH_YEAR . FETCH_MONTH . sprintf('%02d', $days_total);

for ($id = 0; $id < $articles_total; $id++) {
    $handles[$id] = curl_init();
    curl_setopt($handles[$id], CURLOPT_URL, 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/' .
                                            FETCH_PROJECT . '/all-access/user/' .
                                            rawurlencode(ucfirst($articles[$id])) . "/daily/{$fetch_range}");
    curl_setopt($handles[$id], CURLOPT_HEADER, false);
    curl_setopt($handles[$id], CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handles[$id], CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($handles[$id], CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($handles[$id], CURLOPT_TIMEOUT, 60);
    curl_setopt($handles[$id], CURLOPT_DNS_CACHE_TIMEOUT, 3600);
    curl_setopt($handles[$id], CURLOPT_FORBID_REUSE, false);
    curl_setopt($handles[$id], CURLOPT_FRESH_CONNECT, false);
    curl_setopt($handles[$id], CURLOPT_MAXCONNECTS, 10);
    curl_setopt($handles[$id], CURLOPT_USERAGENT, 'https://en.wikipedia.org/wiki/User_talk:Dsimic');
}
progress_message();

// run the cURL handles in chunks because the Pageview API imposes a rate limit,
// which, as of June 1, 2016, is specified at 10 requests per second, although
// it seems to be happily handling *much* higher rates
$handle_all = curl_multi_init();
$chunks = ceil(1.0 * $articles_total / CHUNK_SIZE);
$output = array();
$error_messages = array('Parsing JSON data failed' => -1);
$views_total = 0;
$failures = 0;
$days_available = array();

if (version_compare(PHP_VERSION, '5.5.0', '>=')) { // curl_multi_setopt() is available since PHP 5.5.0
    curl_multi_setopt($handle_all, CURLMOPT_PIPELINING, true);
    curl_multi_setopt($handle_all, CURLMOPT_MAXCONNECTS, 10);
}

for ($chunk = 0; $chunk < $chunks; $chunk++) { // fetch one chunk at a time
    $id_limit = min(($chunk + 1) * CHUNK_SIZE, $articles_total);
    for ($id = $chunk * CHUNK_SIZE; $id < $id_limit; $id++) // all articles in this chunk
        curl_multi_add_handle($handle_all, $handles[$id]);

    do { // fetch the articles stats data in JSON format...
        $status = curl_multi_exec($handle_all, $running);
        progress_message();
    } while (($status == CURLM_CALL_MULTI_PERFORM) ||
             ($running > 0));

    for ($id = $chunk * CHUNK_SIZE; $id < $id_limit; $id++) { // ... and process it
        $json = curl_multi_getcontent($handles[$id]);
        if (($json == '') || // is the JSON Ok?
            (($json = json_decode($json, true)) === null) ||
            !array_key_exists('items', $json) ||
            !is_array($json['items'])) {
            ++$failures;
            if (($message = curl_error($handles[$id])) != '') {     // for some reason, curl_errno()
                if (!array_key_exists($message, $error_messages)) { // always returns zero here
                    $errno = -1 * count($error_messages) - 1;
                    $error_messages[$message] = $errno;
                }
                else // already seen
                    $errno = $error_messages[$message];
            }
            else // below -1 are the cURL errors
                $errno = -1;
            $output[$id] = $errno;
        }
        else { // fetched JSON data is Ok
            $views = 0;
            foreach ($json['items'] as $json_item) {
                $views += $json_item['views'];
                if ($json_item['views'] > 0) // complete days may be missing
                    $days_available[$json_item['timestamp']] = true;
            }
            $views_total += $views;
            $output[$id] = $views;
        }
        curl_multi_remove_handle($handle_all, $handles[$id]);
        curl_close($handles[$id]);
        progress_message(); // done with this chunk
    }

    if ($chunk != ($chunks - 1)) { // don't sleep after the last chunk
        $message = '#';            // all this results in smooth progress messages
        $limit = CHUNK_SLEEP * 4;
        for ($i = 0; $i <= $limit; $i++) {
            if (progress_message($message) === true) // print only one "marker"
                $message = '.';
            usleep(250000);
        }
    }
}
curl_multi_close($handle_all);
echo(" done.\n\n");

// done fetching all chunks of the stats data, generate and print the output...
arsort($output, SORT_NUMERIC);
$error_messages = array_flip($error_messages);
$articles_ok = $articles_total - $failures;
$first_error = true;

foreach ($output as $id => $views)
    if ($views >= 0)
        echo("- {$articles[$id]}: total " . plural_output($views, 'view') . "\n");
    else {
        if ($first_error && ($articles_ok > 0)) { // display an empty line before
            echo("\n");                           // the first failure message
            $first_error = false;
        }
        echo("> {$articles[$id]}: failure ({$error_messages[$views]})\n");
    }

// ... and the final summary
$days_missing = $days_total - count($days_available);
$month_name = @date('F', @strtotime(FETCH_YEAR . '-' . FETCH_MONTH . '-01'));
$elapsed_time = microtime(true) - $start_time;
$elapsed_min = intval($elapsed_time / 60);
$elapsed_sec = round($elapsed_time - $elapsed_min * 60);

echo("\nDone, {$month_name} " . FETCH_YEAR . ' statistics for ' . plural_output($articles_ok, 'article') .
     ' fetched in ' . (($elapsed_min > 0)
                       ? (plural_output($elapsed_min, 'minute') . ' and ')
                       : '') .
     plural_output($elapsed_sec, 'second') . ".\n" .
     (($failures > 0)
      ? ('Fetching the views statistics failed for ' . plural_output($failures, 'article') . ".\n")
      : ''));

if ($days_total > $days_missing) {                                       // it's entirely possible that
    $views_daily = intval($views_total / ($days_total - $days_missing)); // all days were missing
    echo('Total ' . plural_output($views_total, 'view') . ', averaging in ' .
         plural_output($views_daily, 'view') . ' per day (' .
         plural_output($days_total, ($current_month ? 'whole ' : '') . 'day') .
         ' in ' . ($current_month ? 'the current' : 'that') . ' month' .
         (($days_missing > 0)
          ? (', with the statistics unavailable for ' . plural_output($days_missing, 'day'))
          : '') .
         ").\n");
} else { // no statistics data
    echo('Sorry, no statistics data is available at the moment for ' .
         ($current_month ? 'the current' : 'that') . " month.\n");
    $errno = ((($days_total != $days_missing) ? 10 : 0) + // just in case, perform some additional
              (($views_total != 0) ? 20 : 0));            // sanity checks on the internal logic
    if ($errno > 0) {
        echo("\nInternal errors detected (error code: {$errno}), please report on " .
             "https://en.wikipedia.org/wiki/User_talk:Dsimic by providing complete program output.\n");
        exit(EXIT_FAILURE);
    }
}
exit(EXIT_SUCCESS);
?>
</syntaxhighlight>

= Output example =

Below is an example of the output produced when the program above is run. The program sorts the articles by their total page views in descending order, so the article that has received the largest number of page views comes first in the printed list. In the "Fetching statistics data" line, dots (.) represent progress updates during the processing of each chunk of articles, while hash marks (#) mark the beginning of each new chunk. This chunking is necessary because the Pageview API imposes a [https://wikimedia.org/api/rest_v1/?doc#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end rate limit] on the API queries it receives, which, {{As of|2016|06|01|df=US|lc=yes}}, is specified at 10 requests per second.

<pre>
Fetching statistics data: ...#.#.#. done.

- M.2: total 64,598 views
- SATA Express: total 21,724 views
- Laravel: total 16,115 views
- Stagefright (bug): total 12,717 views
- CoreOS: total 11,593 views
- Android Runtime: total 9,493 views
- Intel X99: total 7,928 views
- HipHop Virtual Machine: total 5,944 views
- Row hammer: total 3,896 views
- Open vSwitch: total 3,769 views
- Solid-state storage: total 3,006 views
- dm-cache: total 2,044 views
- OpenZFS: total 2,011 views
- kpatch: total 1,927 views
- UniDIMM: total 1,924 views
- ARM Cortex-A17: total 1,758 views
- Port Control Protocol: total 1,621 views
- Buildroot: total 1,397 views
- bcache: total 1,323 views
- kdump (Linux): total 1,184 views
- zswap: total 1,052 views
- Eurocrem: total 1,032 views
- Management Component Transport Protocol: total 961 views
- ftrace: total 921 views
- Address generation unit: total 723 views
- kGraft: total 630 views
- kernfs (Linux): total 598 views
- ThinkPad 8: total 427 views
- Distributed Overlay Virtual Ethernet: total 409 views
- WebScaleSQL: total 317 views
- Emdebian Grip: total 284 views
- kernfs (BSD): total 280 views
- OpenLMI: total 229 views
- List of Eurocrem packages: total 99 views

Done, January 2016 statistics for 34 articles fetched in 7 seconds.
Total 183,934 views, averaging in 5,933 views per day (31 days in that month).
</pre>
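Stripped of the cURL handling, the chunk-and-pause pacing behind those progress markers boils down to a short loop. The following simplified sketch uses 34 stand-in items in place of the article list and prints one marker per item, instead of the time-throttled progress messages produced by the real program:

<syntaxhighlight lang="php">
<?php
// Simplified sketch of rate-limit-friendly chunking: handle at most
// CHUNK_SIZE items, then pause for CHUNK_SLEEP seconds before the next chunk.
define('CHUNK_SIZE', 10); // requests allowed per...
define('CHUNK_SLEEP', 1); // ...this many seconds

$items = range(1, 34);    // 34 stand-in "requests", like the article list above
$chunks = ceil(count($items) / CHUNK_SIZE);
for ($chunk = 0; $chunk < $chunks; $chunk++) {
    $slice = array_slice($items, $chunk * CHUNK_SIZE, CHUNK_SIZE);
    foreach ($slice as $item)
        echo('.');        // here the real program issues the API requests
    if ($chunk != ($chunks - 1)) { // don't sleep after the last chunk
        echo('#');
        sleep(CHUNK_SLEEP);
    }
}
echo(" done.\n");
?>
</syntaxhighlight>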

[[Category:Wikipedia statistics]]
[[Category:Wikipedia pageviews]]