INFORMATION SECURITY
MACHINE LEARNING DATASETS FOR CYBER SECURITY APPLICATIONS
Computer Science Department, Military Technical Academy “Ferdinand I”, Romania
Abstract
The main objective of this study is not to identify the best machine learning model, but to review the main publicly available datasets used to train and test security solutions that employ modern classification algorithms for anomaly detection. DARPA 1998 and KDD were studied because they were the first initiatives in this direction, while NSL-KDD, ISCXIDS2012 and CICIDS2017 are taken into consideration for future research because of their advantages. Personalized datasets will always carry a reasonable amount of uncertainty, especially since some of the feature vectors used for training remain unknown. Nevertheless, training on data specific to the protected infrastructure is more efficient, from a security point of view, than training on old attack signatures.
Keywords