10 Processing Rules for Underlying Reporting Data
Data used to generate usage reports should record only intended usage; all requests not intended by the user must be excluded.
Because the way usage records are generated can differ across platforms, it is impractical to describe all the possible filters and techniques used to clean up the data. This Code of Practice therefore specifies only the requirements to be met by data used for building usage reports.
10.1 Return codes
Return codes in this Code of Practice for Research Data Usage Metrics do not differ from the specifications in the COUNTER Code of Practice Release 5. Successful and valid requests MUST be counted. Successful requests are those with specific HTTP status codes indicating successful retrieval of the content (200 and 304). HTTP status codes are defined and maintained by the IETF (Fielding & Reschke, 2014).
10.2 Double-click Filtering
The intent of double-click filtering is to prevent over-counting which may occur when a user clicks the same link multiple times in succession, e.g. when frustrated by a slow internet connection. Double-click filtering applies to all metric types. The double-click filtering rule is as follows:
A “double-click” is defined as repeated access to a web-accessible resource by the same user within a session and within a set time period. Double-clicks on a link by the same user within a 30-second period MUST be counted as one action. For the purposes of the Code of Practice for Research Data Usage Metrics, the time window for a double-click on any page is set at a maximum of 30 seconds between the first and second mouse clicks. For example, a click at 10:01:00 and a second click at 10:01:29 would be considered a double-click (one action); a click at 10:01:00 and a second click at 10:01:35 would count as two separate single clicks (two actions).
A double-click may be triggered by a mouse-click or by pressing a refresh or back button. When two actions are made for the same URL within 30 seconds the first request MUST be removed and the second retained.
Any additional requests for the same URL, each arriving within 30 seconds of the previous click, MUST be treated identically: the earlier request is removed and the later one retained.
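The rule above can be sketched as a small log-processing step. This is an illustrative sketch, not a normative implementation: the event-tuple layout and the `filter_double_clicks` helper are hypothetical, while the 30-second window and the remove-first/retain-second behaviour follow the rule above.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=30)  # per the Code of Practice

def filter_double_clicks(events):
    """events: list of (timestamp, user_key, url) tuples sorted by timestamp.
    Collapses double-clicks: when the same user requests the same URL
    again within 30 seconds, the earlier request is removed and the
    later one retained."""
    kept = []
    last_kept = {}  # (user_key, url) -> index into `kept`
    for ts, user, url in events:
        key = (user, url)
        if key in last_kept and ts - kept[last_kept[key]][0] <= WINDOW:
            # Double-click: replace the earlier request with this one.
            kept[last_kept[key]] = (ts, user, url)
        else:
            kept.append((ts, user, url))
            last_kept[key] = len(kept) - 1
    return kept
```

Because the retained timestamp is updated on each collapse, a run of clicks spaced under 30 seconds apart (e.g. at 0 s, 20 s, 40 s) collapses to a single action, matching the "always remove the first and retain the second" rule.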
There are different ways to track whether two requests for the same URL are from the same user and session. These options are listed in order of increasing reliability, with Option 4 being the most reliable.
Option 1: If the user is identified only through their IP address, that IP address combined with the browser’s user agent (presented in the HTTP header) MUST be used to trace double-clicks. Multiple users on a single IP address with the same browser user agent can occasionally lead to separate clicks from different users being logged as a double-click from one user. This will only happen if the multiple users are clicking on exactly the same content within a few seconds of each other. One-hour slices MUST be used as sessions.
Option 2: When a session cookie is implemented and logged, the session cookie MUST be used to identify double-clicks.
Option 3: When a user cookie is available and logged, the user cookie MUST be used to identify double-clicks.
Option 4: When an individual has logged in with their own profile, their username MUST be used to trace double-clicks.
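The four options above amount to choosing the most reliable identifier available for each request. A minimal sketch, assuming requests arrive as dictionaries; the field names are hypothetical:

```python
def double_click_key(request):
    """Pick the most reliable available identifier for double-click
    tracing, per Options 1-4 above (dictionary keys are hypothetical)."""
    if request.get("username"):          # Option 4: logged-in profile
        return ("user", request["username"])
    if request.get("user_cookie"):       # Option 3: user cookie
        return ("ucookie", request["user_cookie"])
    if request.get("session_cookie"):    # Option 2: session cookie
        return ("scookie", request["session_cookie"])
    # Option 1: IP address + user agent, with one-hour slices as sessions
    return ("ip", request["ip"], request["user_agent"])
```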
10.3 Counting Unique Datasets
Some metric types count the number of unique items that had a certain activity, such as Unique_Dataset_Requests or Unique_Dataset_Investigations.
For the purpose of metrics, a dataset is the typical unit of content being accessed by users. The dataset MUST be identified using a unique identifier such as a DOI, regardless of format.
The rules for calculating the unique dataset counts are as follows:
Multiple activities qualifying for the metric type in question that represent the same dataset and occur in the same user session MUST be counted as only one “unique” activity for that dataset.
A “User Session” is defined as activity by a user within a period of one hour. It may be identified in any of the following ways:
a logged session ID + transaction date;
a logged user ID (if users log in with personal accounts) + transaction date + hour of day (the day is divided into 24 one-hour slices);
a logged user cookie + transaction date + hour of day;
a combination of IP address + user agent + transaction date + hour of day.
To allow for simplicity in calculating User Sessions when a session ID is not explicitly tracked, the day will be divided into 24 one-hour slices and a surrogate session ID will be generated by combining the transaction date + hour time slice + one of the following: user ID, cookie ID, or IP address + user agent. For example, consider the following transaction:
Transaction date/time: 2017-06-15 13:35
IP address: 192.1.1.168
User agent: Mozilla/5.0
Generated session ID: 192.1.1.168|Mozilla/5.0|2017-06-15|13
The above surrogate session ID does not provide an exact analogy to a session. However, statistical studies show that unique counts generated with such a surrogate session ID are within 1–2% of unique counts generated with actual sessions.
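Under these rules, the surrogate session ID and the unique counts can be computed as follows. Only the `IP|user agent|date|hour` format and the once-per-session counting rule come from the text above; the function names and record layout are hypothetical:

```python
from datetime import datetime

def surrogate_session_id(ip, user_agent, dt):
    """Build the surrogate session ID from IP address + user agent +
    transaction date + one-hour time slice."""
    return f"{ip}|{user_agent}|{dt:%Y-%m-%d}|{dt:%H}"

def unique_dataset_requests(records):
    """records: iterable of (session_id, dataset_id) pairs for qualifying
    Request activities. Multiple requests for the same dataset within
    the same user session count as one unique activity."""
    return len({(session, dataset) for session, dataset in records})
```

Applied to the example transaction above, `surrogate_session_id` reproduces the generated session ID shown in the text.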
10.4 Attributing Usage when Item Appears in More Than One Database
Content providers that offer multiple databases in which a given dataset may appear MUST attribute the Investigations and Requests metrics to just one database. They may use a consistent method of prioritizing databases or select the database at random.
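One way to satisfy this requirement is a fixed priority order applied deterministically, so the same dataset is always attributed to the same database. A sketch, with hypothetical database names:

```python
def attribute_database(candidate_dbs, priority):
    """Attribute usage of a dataset that appears in several databases to
    exactly one of them, using a consistent priority order. Databases
    absent from the priority list rank last."""
    rank = {db: i for i, db in enumerate(priority)}
    return min(candidate_dbs, key=lambda db: rank.get(db, len(priority)))
```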
10.5 Internet Robots and Crawlers
The intent is to exclude web robots and spiders but include usage by humans accessing content through a scripting language or automated tool, whether interactively or standalone.
Web robots and crawlers intended for search indexing and related applications SHOULD be excluded via the application of a blacklist of known user agents for these robots. This blacklist MUST NOT include general purpose user agents that are commonly used by researchers (e.g., python, curl, wget, and Java), and the blacklist will be maintained as a subset of the COUNTER Code of Practice Release 5 list of internet robots and crawlers (COUNTER-Robots, 2017). Generally, user agents reflecting programmatic access to specific datasets will not be included in the blacklist.
Usage counts by scripted and automated processes MUST NOT be excluded unless they can demonstrably be shown to originate from a blacklisted agent, such as an IP address of a known search agent. New or unknown user agents SHOULD be counted unless there is demonstrable evidence that they represent solely a web indexing agent.
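A minimal sketch of such a blacklist check, assuming a simplified pattern list (the authoritative list is the COUNTER-Robots subset referenced above):

```python
import re

# Hypothetical, simplified subset of the robots blacklist. Note that
# general-purpose agents such as python, curl, wget and Java are
# deliberately NOT listed, per the rule above.
ROBOT_PATTERNS = [r"bot", r"crawler", r"spider"]
ROBOT_RE = re.compile("|".join(ROBOT_PATTERNS), re.IGNORECASE)

def is_robot(user_agent):
    """Return True only when the user agent matches the blacklist;
    new or unknown agents are counted (return False)."""
    return bool(ROBOT_RE.search(user_agent))
```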
10.6 Machine Access
Many researchers access and analyze data, especially large datasets, using scripts or automated tools; excluding those uses would be inaccurate and would bias the counts. The Access_Method of type Machine is used to distinguish this kind of access.
Principles for reporting usage
The Code of Practice for Research Data Usage Metrics does not record machine use itself, as most of this activity takes place after a dataset has been downloaded. It can only track the number of datasets downloaded by machine processes.
Usage associated with machine access activity MUST be tracked by assigning an Access_Method of Machine.
Usage associated with machine activity MUST be reported using the Dataset Master Report by identifying such usage as “Access_Method=Machine”.
Detecting machine activity
For the purpose of reporting usage according to the Code of Practice for Research Data Usage Metrics, machine access does not require prior permission and/or the use of specific endpoints or protocols. This is in contrast to the COUNTER Code of Practice Release 5.
The distinction between legitimate machine use and robot or web crawler traffic is made based on the user agent (see Section 10.5).
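Sections 10.5 and 10.6 together can be sketched as a single classification step over the user agent. This is illustrative only: the `SCRIPT_AGENTS` list is a hypothetical example, and the robot decision is assumed to come from a Section 10.5 blacklist check:

```python
# Hypothetical examples of general-purpose tools used by researchers.
SCRIPT_AGENTS = ("python", "curl", "wget", "java", "libwww-perl")

def classify_access(user_agent, is_robot):
    """Classify a request: blacklisted robots are excluded (None);
    script-like agents are reported with Access_Method=Machine;
    everything else is Regular."""
    if is_robot:
        return None
    ua = user_agent.lower()
    if any(agent in ua for agent in SCRIPT_AGENTS):
        return "Machine"
    return "Regular"
```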