Many security technologies apply anomaly detection on top of a normality model constructed from previously seen traffic data. However, when the traffic originates from unreliable sources, the learning process must mitigate potential reliability issues to avoid including malicious traffic patterns in this normality model. In this talk, we will present the challenges of learning from dirty data, with a focus on web traffic (probably the dirtiest data in the world), and explain different approaches to learning from such data. We will also discuss a mundane but no less important aspect of learning, time and memory complexity, and present a robust learning scheme optimized to work efficiently on streamed data. We will give examples from the web security arena, with robust learning of URLs, parameters, character sets, cookies and more.
 
        
Over the last 20 years, I have researched and innovated in many security domains, including web application security, APT, DRM systems, automotive systems, data security and more. While thinking as an attacker is my second nature, my first nature is problem solving and algorithm development - in the past in cryptography and watermarking, and today mostly around harnessing ML/AI technology to solve security-related problems. While I am fascinated by bleeding-edge technologies like AI and federated learning and the opportunities they unlock, as a security veteran I am also continuously asking what can go wrong, and the answer is never NULL. I am the inventor of 20 patents in the security, cryptography, data science and privacy-preserving computation arenas. I hold an M.Sc. in Applied Math and Computer Science from the Weizmann Institute.