CSIC 2010 HTTP Dataset in CSV Format (for Weka Analysis)

Description

Below is a series of CSV-formatted datasets based on the CSIC 2010 HTTP dataset (http://isi.csic.es/dataset/ or WayBackWhenMachine or PDF of webpage). CSIC created the original dataset of HTTP/1.1 packets, which includes web application penetration testing packets. It has two labels (normal and anomalous).

The initial purpose of these reformatted CSIC datasets was feature selection and instance selection analysis prior to classification. Here are some preliminary results. Feel free to download and use them as you see fit; however, please acknowledge that the creators of the original datasets are the researchers at CSIC (Carmen Torrano Giménez, Alejandro Pérez Villegas, Gonzalo Álvarez Marañón), who produced them during 2009-10.

If you have any questions, feel free to get in contact with me.

Downloads

Download “Full” dataset v02 (RAW)

  • Labelled as “norm” and “anom”.
    Origin text files: (1) anomalousTrafficTest.txt and (2) normalTrafficTraining.txt
    HTTP Packet Objects: 61065
    Instances: 223585
    Attributes: 18
    File size: ~99.3 MB (uncompressed)
    Note: Created by appending the normal training dataset to the anomalous test.

Download Anomalous test v02 (RAW) 

  • Labelled “anom”.
    Origin text file: anomalousTrafficTest.txt
    HTTP Packet Objects: 25065
    Instances: 119586
    Attributes: 18
    File size: ~56.4 MB

Download Normal training v02 (RAW) 

  • Labelled “norm”.
    Origin text file: normalTrafficTraining.txt
    HTTP Packet Objects: 36000
    Instances: 104001
    Attributes: 18
    File size: ~48.4 MB

Download Normal test v02 (RAW) 

  • Labelled “norm”.
    Origin text file: normalTrafficTest.txt
    HTTP Packet Objects: 36000
    Instances: 104001
    Attributes: 18
    File size: ~48.4 MB

Download Options:

  • RAW = The payloads are not decoded from the raw URL encoding, but are copied directly from the source CSIC dataset.

See discussion below for details on v02 vs v01 dataset versions.

The UTF8 dataset download links have been removed, as the classification performance results seemed to be worse than with the original raw versions. For posterity: in the UTF8 version, the payloads were decoded from the raw RFC 2616 URL encodings and re-encoded in UTF8 format with the java.net.URLDecoder library.
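For readers who want to reproduce something like the UTF8 variant from the RAW files, here is a minimal sketch in Python. It is not the code used to build the removed datasets (those used java.net.URLDecoder). The function name decode_payload and the sample value are made up for illustration, and Python's unquote_plus is only an approximate stand-in: for example, it leaves malformed percent sequences in place rather than raising an error.

```python
from urllib.parse import unquote_plus


def decode_payload(raw: str) -> str:
    """Percent-decode a RAW payload value and turn '+' into spaces.

    Assumption: the bytes behind the percent escapes are UTF-8. Bytes that
    are not valid UTF-8 become U+FFFD replacement characters.
    """
    return unquote_plus(raw, encoding="utf-8", errors="replace")


# Hypothetical payload value, not taken from the dataset.
raw_value = "login=caf%C3%A9+con+leche"
decoded_value = decode_payload(raw_value)
print(decoded_value)
```

Decoding changes the character distribution of the payload column, so results on decoded and RAW data are not directly comparable.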

Dataset Properties

The original dataset records are in an extended form of the HTTP/1.1 protocol (RFC 2616).

We have added two columns. The first is the “index” number, which tracks the HTTP packet number. We chose to split a packet into separate records where its source URL had multiple key=value payloads, so the index lets us trace a CSV record back to its original source HTTP packet. The second is the classification “label”.

The column names are:
“index”, “method”, “url”, “protocol”, “userAgent”, “pragma”, “cacheControl”, “accept”, “acceptEncoding”, “acceptCharset”, “acceptLanguage”, “host”, “connection”, “contentLength”, “contentType”, “cookie”, “payload”, “label”

The “index” attribute is not unique.
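To illustrate why the index repeats, the sketch below splits one packet's payload string into one record per key=value pair and then regroups the records by index. This is a simplified illustration, not the parser used to build the CSV files: the helper names split_payloads and regroup, the "&" separator assumption, and the sample values are all hypothetical.

```python
def split_payloads(index, query, label):
    """Return one (index, payload, label) record per key=value pair.

    Assumption: pairs are separated by '&'. A packet with no payload still
    yields a single record whose payload is None.
    """
    pairs = query.split("&") if query else [None]
    return [(index, pair, label) for pair in pairs]


def regroup(records):
    """Collect payloads per index to recover the per-packet view."""
    packets = {}
    for index, payload, _label in records:
        packets.setdefault(index, []).append(payload)
    return packets


# Hypothetical packet number 7 carrying three key=value payloads.
records = split_payloads(7, "id=2&nombre=Vino&precio=85", "norm")
packets = regroup(records)
print(len(records), packets)
```

All three records share index 7, which is how a CSV record can be traced back to its source packet.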

The “cookie” and “payload” attribute values are formatted as KEY=VALUE.

Null values (e.g. in “contentLength” or “contentType” attributes) are represented as “null”.

Value separator is double-quote comma double-quote: ","

Double quotes in the “payload” attribute column are escaped with a backslash: \"
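The properties above can be turned into a small reader. The following is a sketch using Python's standard csv module, under a few assumptions that the page does not confirm: every field is wrapped in double quotes, the file may or may not start with a header row (the sketch skips one if present), and the sample record is invented. The name read_records is hypothetical.

```python
import csv
import io

COLUMNS = [
    "index", "method", "url", "protocol", "userAgent", "pragma",
    "cacheControl", "accept", "acceptEncoding", "acceptCharset",
    "acceptLanguage", "host", "connection", "contentLength",
    "contentType", "cookie", "payload", "label",
]


def read_records(lines):
    """Yield one dict per CSV record, mapping the literal "null" to None."""
    reader = csv.reader(lines, delimiter=",", quotechar='"',
                        escapechar="\\", doublequote=False)
    for row in reader:
        if not row or row == COLUMNS:  # skip blank lines and a header row
            continue
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        yield {k: (None if v == "null" else v) for k, v in zip(COLUMNS, row)}


# Invented sample record; note the backslash-escaped quotes in the payload.
sample = io.StringIO(
    '"0","GET","http://localhost:8080/a.jsp","HTTP/1.1","UA","no-cache",'
    '"no-cache","*/*","gzip","utf-8","en","localhost:8080","close",'
    '"null","null","JSESSIONID=ABC","name=\\"x\\"","norm"\n'
)
records = list(read_records(sample))
print(records[0]["payload"], records[0]["contentLength"])
```

For a real file, pass an open file handle instead of the StringIO object, for example open(path, newline="", encoding="utf-8"); the encoding is also an assumption.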

Known Issues

(Last updated 24th April 2015)

A number of issues with the dataset have been raised. They are addressed by the corrected (v02) dataset and by the discussion below.

  • Correction:   Version v02 of the CSV dataset corrects a payload key-value parsing error found in the parsed v01 CSV datasets.
    • The v02 CSV datasets should replace v01 CSV datasets.
    • The parser source code implements the correction.
  • Similarities:   The entropy analysis shows very little difference between the normal training and normal test datasets. Due to this entropy similarity, I would recommend not using the normal test dataset file. It is provided for completeness.
  • K-Fold Cross Validation Usage of “Full”:   Because the “Full” dataset is a concatenation of the anomalous and normal files, k-fold cross validation with a small k is not going to perform well: the learner will fail to receive large batches of either normal or anomalous labelled inputs. To work around this:
    • Increase the value of k.
    • An alternative is to shuffle the HTTP objects (while keeping records with the same index together) in the “Full” CSV v02 dataset. We then lose any order context, although, depending on how the original CSIC dataset documentation is interpreted, that context may already have been lost.
  • Noise:   In the v01 dataset, a number of instances appeared in both the anomalous and normal datasets. This may have been an effect of noise during the initial labelling or the parsing error.
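The shuffling alternative mentioned in the list above can be sketched as follows. This is one possible approach, not a tested recipe for the published files. It assumes each record is a dict with “index” and “label” keys, and it groups on the (label, index) pair because I am assuming, without having confirmed it here, that index values may repeat between the anomalous and normal parts of “Full”. The function name shuffle_by_packet and the sample rows are hypothetical.

```python
import random


def shuffle_by_packet(rows, seed=0):
    """Shuffle whole HTTP objects while keeping each object's rows adjacent."""
    groups = {}
    for row in rows:
        key = (row["label"], row["index"])  # assumed to identify one packet
        groups.setdefault(key, []).append(row)
    keys = list(groups)
    random.Random(seed).shuffle(keys)  # seeded so the order is reproducible
    return [row for key in keys for row in groups[key]]


# Invented rows: two anomalous and two normal packets.
rows = [
    {"index": "0", "label": "anom", "payload": "a=1"},
    {"index": "0", "label": "anom", "payload": "b=2"},
    {"index": "1", "label": "anom", "payload": "c=3"},
    {"index": "0", "label": "norm", "payload": "d=4"},
    {"index": "1", "label": "norm", "payload": "e=5"},
    {"index": "1", "label": "norm", "payload": "f=6"},
]
shuffled = shuffle_by_packet(rows, seed=42)
for row in shuffled:
    print(row["label"], row["index"], row["payload"])
```

When splitting the shuffled rows into folds, cutting only at packet boundaries avoids placing records from one packet in both the training and test folds.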

References & Reading

Readings on the CSV Dataset:

Find further text on the dataset in section 6.7 of my PhD thesis in the British Library and also available here. If you use this reformatted dataset for academic works, please cite that text.

Other Usages of the CSIC Dataset

See also papers referenced on CSIC’s own description of their original dataset. These are written by the original CSIC researchers from 2009-2010. Other undergraduate projects and MSc theses exist using this data.

Please note, this page and linked datasets are published here for posterity with the hope of becoming useful for future researchers. Please contact me directly with any enquiries.
