Scraper Release (Copied from LinkedIn)

I open-sourced part of a pipeline for traditional phishing detection yesterday. I was on the fence about this one for a long time, but I saw no point in keeping the repository private.

The scraper library is meant to be used in conjunction with another internal library / endpoint for model predictions. I have the model weights somewhere in a local Gitea instance, and I could push them to Hugging Face if I can locate them. The models are from the DETR family (from Facebook Research), and I had already converted them to ONNX for edge deployment in browsers (I had to make a PR to Hugging Face for it since they didn’t support it at the time).

The scraper was written in the pre-LLM era (with my own 10 fingers). It is basically Node.js, Puppeteer and pm2 on steroids. I fingerprinted 27,000 websites in ~24 hours using the library on my workstation. I had initially written the scraper with Python and Selenium, but after working with Node.js on a project, I thought it was the best tool for the job (native async). Puppeteer also has a lot of good add-on projects for anti-anti-scraping/fingerprinting.
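To illustrate why native async suits this kind of workload, here is a minimal sketch of a bounded-concurrency runner in plain Node.js. It is not the code from the repository: `mapWithConcurrency` and its `worker` callback are hypothetical names, and the worker stands in for whatever visits a single site (with Puppeteer, that would typically be opening a page and navigating to the URL).

```javascript
// Hypothetical helper (not from the released scraper): run `worker` over
// `items` with at most `limit` tasks in flight at any time.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner keeps claiming items until the list is exhausted.
  // Node.js runs this on a single thread, so `next++` needs no lock.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await worker(items[i], i) };
      } catch (err) {
        // One failing site should not abort the whole crawl.
        results[i] = { ok: false, error: String(err) };
      }
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results;
}
```

Results come back in input order, and failures are recorded per item instead of rejecting the whole batch, which matters when a crawl covers thousands of unreliable sites. The right `limit` depends on the machine and on how heavy each page visit is, so treat it as a tuning knob.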

The detection pipeline replicated the work in the Phishpedia USENIX paper. I had been playing with the idea of logo-based phishing detection pipelines and took this work as a validation of the idea. I wanted to work on it as part of my master’s thesis at Islington circa 2020-2021, but I didn’t have the resources to manually collect and label webpages at scale, so I decided to pursue periodicity detection using one-dimensional convolutional neural networks instead (not as glamorous, but it helped me flag periodic activity in entity pairs such as IP-Domain pairs for malware beaconing detection at Vairav).

Phishpedia (and PhishIntention) author Ruofan Liu is a brilliant researcher from Singapore. Her team’s work in visual-analytics-based phishing detection is unparalleled. Tip: some of the examples in the Phishpedia / PhishIntention dataset need cleaning.

I decided to open-source the project because I hadn’t used it in a long time, and closed-source code isn’t going to help anyone. Plus, I have already open-sourced another project, an Azure consent audit-type thing that I did for my master’s capstone at RIT, which is a lot more relevant now in 2026 than traditional phishing.

GitHub link

Authors

Ashim Mahara

Security Analytics

My research interests include Cyber Security, Security Analytics and Autonomous Agents.
