Description

Below is a series of CSV-formatted datasets based on the CSIC 2010 HTTP dataset (http://isi.csic.es/dataset/ or WayBackWhenMachine or PDF of webpage). CSIC created the original dataset of HTTP/1.1 packets, which includes web application penetration testing packets. It has two labels (normal and anomalous).

The initial purpose of these reformatted CSIC datasets was feature selection and instance selection analysis prior to classification. Here are some preliminary results. Feel free to download and use them as you see fit; however, please acknowledge that the creators of the original datasets are the researchers at CSIC (Carmen Torrano Giménez, Alejandro Pérez Villegas, Gonzalo Álvarez Marañón), who produced them during 2009-10.

If you have any questions, feel free to get in contact with me.

Downloads

Download “Full” dataset v02 (RAW)

Labelled as “norm” and “anom”.
Origin text files: (1) anomalousTrafficTest.txt and (2) normalTrafficTraining.txt
HTTP Packet Objects: 61065
Instances: 223585
Attributes: 18
File size: ~99.3 MB (uncompressed)
Note: Created by appending the normal training dataset to the anomalous test.

Download Anomalous test v02 (RAW)

Labelled “anom”.
Origin text file: anomalousTrafficTest.txt
HTTP Packet Objects: 25065
Instances: 119586
Attributes: 18
File size: ~56.4 MB

Download Normal training v02 (RAW)

Labelled “norm”.
Origin text file: normalTrafficTraining.txt
HTTP Packet Objects: 36000
Instances: 104001
Attributes: 18
File size: ~48.4 MB

Download Normal test v02 (RAW)

Labelled “norm”.
Origin text file: normalTrafficTest.txt
HTTP Packet Objects: 36000
Instances: 104001
Attributes: 18
File size: ~48.4 MB

Download Options:

RAW = The payloads are not decoded from the raw URL encoding, but are copied directly from the source CSIC dataset.

See discussion below for details on v02 vs v01 dataset versions.

The UTF8 dataset download links have been removed because classification performance on those files seemed to be worse than on the original RAW versions. For posterity: in the UTF8 version, the payloads were decoded from the raw URL encoding (RFC 2616) and re-encoded as UTF-8 using java.net.URLDecoder.
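
For readers who still want decoded payloads, a similar step can be approximated outside Java. The sketch below uses Python's urllib.parse.unquote_plus, which, like java.net.URLDecoder, turns "+" into a space and resolves %XX escapes. It is an illustration only, not the code used to build the removed UTF8 files; the function name decode_payload and the sample payload are assumptions.

```python
from urllib.parse import unquote_plus


def decode_payload(raw: str, encoding: str = "utf-8") -> str:
    """Decode a URL-encoded KEY=VALUE payload string.

    Byte sequences that are not valid in `encoding` are replaced instead of
    raising, so a malformed attack payload does not abort a batch run.
    """
    return unquote_plus(raw, encoding=encoding, errors="replace")


# Hypothetical payload in the RAW style (not copied from the dataset).
sample = "id=2%27+OR+%271%27%3D%271"
print(decode_payload(sample))
```

Decoding changes the token distribution a classifier sees, so results on decoded payloads should be compared against the RAW files rather than assumed to be better.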

Dataset Properties

The original dataset records are in an extended form of the HTTP/1.1 protocol (RFC 2616).

We have added two columns. The first is the “index” number, which tracks the HTTP packet number. We chose to split a packet into separate records where the source URL had multiple key=value payloads, so the index lets us trace a CSV record back to its original source HTTP packet. The second is the classification “label”.
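
As a rough illustration of that trace-back, the sketch below regroups the records of a single file by “index” and collects each packet's key=value payloads. The function name payloads_by_packet is hypothetical, the rows are assumed to be already parsed into dicts, and treating a “null” payload as "no payload" is an assumption about how payload-free packets are stored.

```python
def payloads_by_packet(rows):
    """Map each index value to the list of (key, value) payload pairs.

    `rows` is an iterable of dicts with at least "index" and "payload" keys,
    all taken from one CSV file. A payload equal to the literal string
    "null" (assumed to mean "no payload") contributes no pair.
    """
    packets = {}
    for row in rows:
        pairs = packets.setdefault(row["index"], [])
        if row["payload"] != "null":
            # Split on the first "=" only; values may themselves contain "=".
            key, _, value = row["payload"].partition("=")
            pairs.append((key, value))
    return packets


# Hypothetical rows: packet 1 was split into two records, packet 2 has no payload.
rows = [
    {"index": "1", "payload": "id=2"},
    {"index": "1", "payload": "nombre=a=b"},
    {"index": "2", "payload": "null"},
]
print(payloads_by_packet(rows))
```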

The column names are:
“index”, “method”, “url”, “protocol”, “userAgent”, “pragma”, “cacheControl”, “accept”, “acceptEncoding”, “acceptCharset”, “acceptLanguage”, “host”, “connection”, “contentLength”, “contentType”, “cookie”, “payload”, “label”

The “index” attribute is not unique.

The “cookie” and “payload” attribute values are formatted as KEY=VALUE.

Null values (e.g. in “contentLength” or “contentType” attributes) are represented as “null”.

Value separator is double-quote comma double-quote: “,”

Double quotes in the “payload” attribute column are escaped: \”
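
The properties above can be turned into a small reader. The following is a minimal sketch using Python's standard csv module, under several assumptions that should be checked against the actual files: every field is wrapped in double quotes, a backslash is used only to escape a double quote, and the presence of a header row is unknown (hence the has_header flag). The function name read_records and the sample line are hypothetical.

```python
import csv
import io

COLUMNS = [
    "index", "method", "url", "protocol", "userAgent", "pragma",
    "cacheControl", "accept", "acceptEncoding", "acceptCharset",
    "acceptLanguage", "host", "connection", "contentLength",
    "contentType", "cookie", "payload", "label",
]


def read_records(lines, has_header=False):
    """Yield one dict per CSV record, mapping "null" cells to None."""
    reader = csv.reader(
        lines,
        delimiter=",",
        quotechar='"',
        escapechar="\\",    # a quote inside a field is written as \"
        doublequote=False,  # so "" is not read as an escaped quote
    )
    for i, fields in enumerate(reader):
        if has_header and i == 0:
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"record {i}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        yield {name: (None if cell == "null" else cell)
               for name, cell in zip(COLUMNS, fields)}


# Hypothetical record in the described format (not copied from the dataset).
sample = (
    r'"1","GET","http://localhost:8080/tienda1/index.jsp","HTTP/1.1",'
    r'"Mozilla/5.0","no-cache","no-cache","text/html","gzip","utf-8","en",'
    r'"localhost:8080","close","null","null","JSESSIONID=ABC123",'
    r'"q=say \"hi\"","norm"'
)
for record in read_records(io.StringIO(sample)):
    print(record["index"], record["payload"], record["contentLength"], record["label"])
```

Note that with escapechar set, the csv module drops any backslash and keeps the character after it, so payloads containing unescaped backslashes would be altered; if that matters, a custom split on the “,” separator may be safer.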

Known Issues

(Last updated 24th April 2015)

A number of issues regarding the dataset have been raised and discussed. They are addressed by the corrected (v02) dataset and the discussion below.

Correction: Version v02 of the CSV dataset corrects a payload key-value parsing error found in the parsed v01 CSV datasets.
- The v02 CSV datasets should replace v01 CSV datasets.
- The parser source code implements the correction.
Similarities: The entropy analysis shows very little difference between the normal training and normal test files. Because of this entropy similarity, I would recommend not using the normal test dataset file. It is provided for completeness.
K-Fold Cross Validation Usage of “Full”: The “Full” dataset is a concatenation of the normal and anomalous files, so each label forms one large contiguous block. k-Fold Cross Validation with a small k is therefore not going to perform well, as the learner will fail to receive large batches of either normal or anomalous labelled inputs. To work around this:
- Increase the value of k.
- An alternative is to shuffle the HTTP objects (while keeping records that share an index together) in the “Full” CSV v02 dataset, but then we lose any order context, which, depending on how the original CSIC dataset documentation is interpreted, may have already been lost.
Noise: In the v01 dataset, a number of instances appeared in both the anomalous and normal datasets. This may have been an effect of noise during the initial labelling or the parsing error.
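
The shuffling alternative mentioned under the k-Fold item can be sketched as follows. This is a minimal example, not the procedure used for any published results: the function name shuffle_by_packet is hypothetical, rows are assumed to be dicts with “index” and “label” keys, and grouping on the (index, label) pair is a precaution in case the anomalous and normal parts of “Full” reuse the same index values.

```python
import random


def shuffle_by_packet(rows, seed=0):
    """Return rows in shuffled packet order, keeping each packet's rows adjacent.

    Rows are grouped on (index, label); within a group the original row
    order is preserved. A fixed seed makes the shuffle repeatable.
    """
    groups = {}
    for row in rows:
        groups.setdefault((row["index"], row["label"]), []).append(row)
    packets = list(groups.values())
    random.Random(seed).shuffle(packets)
    return [row for packet in packets for row in packet]


# Hypothetical miniature of "Full": anomalous block first, then normal block.
rows = [
    {"index": "1", "label": "anom", "payload": "id=1"},
    {"index": "1", "label": "anom", "payload": "x=2"},
    {"index": "2", "label": "anom", "payload": "id=3"},
    {"index": "1", "label": "norm", "payload": "id=4"},
    {"index": "2", "label": "norm", "payload": "id=5"},
]
for row in shuffle_by_packet(rows, seed=42):
    print(row["index"], row["label"], row["payload"])
```

Fold boundaries chosen by row count can still cut a packet in two, so a framework that supports group-aware folds may be preferable where available.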

References & Reading

Readings on the CSV Dataset:

Find further text on the dataset in section 6.7 of my PhD thesis in the British Library and also available here. If you use this reformatted dataset for academic works, please cite that text.

Other Usages of the CSIC Dataset

See also papers referenced on CSIC’s own description of their original dataset. These are written by the original CSIC researchers from 2009-2010. Other undergraduate projects and MSc theses exist using this data.

Please note, this page and linked datasets are published here for posterity with the hope of becoming useful for future researchers. Please contact me directly with any enquiries.

2 thoughts on “CSIC 2010 HTTP Dataset in CSV Format (for Weka Analysis)”

How could we know the actual class to prepare the confusion matrix for the CSIC HTTP 2010 dataset. I get the predicted class set by my own classifier. Please let me know the actual class set.

Pete Scully PhD (UK) says:

September 15, 2020 at 2:11 am

Hi Usha, thanks for your question.

The “actual class” can be found, per row, under the “label” column of each CSV file. In this dataset, there are two labels and two classes. One label corresponds to one (and only one) class. The actual class set is {“anom”, “norm”}.

During your classification testing phase, your classifier (and evaluation framework) will produce the TP, TN, FP, FN matrix by comparing each instance’s predicted class to the row’s “label” cell value (from the CSV file).

That said, precisely how Positive and Negative are assigned to “anom” and “norm” depends on how the framework implements this, or on how you define it.
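
To make the comparison concrete, here is a minimal sketch of the counting step. It assumes “anom” is treated as the positive class, which is a common choice for intrusion detection but only a convention; the function name confusion_counts and the example predictions are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="anom"):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {name: counts[name] for name in ("TP", "TN", "FP", "FN")}


# `actual` comes from the CSV "label" column; `predicted` from your classifier.
actual = ["anom", "anom", "norm", "norm", "norm"]
predicted = ["anom", "norm", "norm", "anom", "norm"]
print(confusion_counts(actual, predicted))
```

Passing positive="norm" instead swaps the roles of the two classes, which is exactly the framework-dependent choice described above.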
