Scraper Release (Copied from LinkedIn)
I open-sourced part of a pipeline for traditional phishing detection yesterday. I was on the fence about this one for a long time, but I saw no point in keeping the repository private.
The scraper library is meant to be used in conjunction with another internal library / endpoint for model predictions. I have the model weights somewhere in a local Gitea instance, and I could push them to Hugging Face if I can locate them. The models are from the DETR family (from Facebook Research), and I had already converted them to ONNX for edge deployment in browsers (I had to make a PR to Hugging Face for it since they didn’t support it at the time).
The scraper was written in the pre-LLM era (with my own 10 fingers). It is basically Node.js, Puppeteer and pm2 on steroids. I fingerprinted 27,000 websites in ~24 hours using the library on my workstation. I had initially written the scraper with Python and Selenium, but after working with Node.js on a project, I thought it was the best tool for the job (native async). Puppeteer also has a lot of good add-on projects for anti-anti-scraping/fingerprinting.
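For a sense of scale, 27,000 sites in ~24 hours (86,400 seconds) averages out to roughly 3.2 seconds per site, which is only realistic when many page visits are in flight at once. Here is a minimal sketch of that "native async" idea: a fixed pool of workers pulling URLs from a shared list. This is not the released library's code. The names `Fingerprint`, `VisitFn`, `crawlAll` and `fakeVisit` are hypothetical, and the page visit is a stand-in so the sketch runs without a browser. It also covers only in-process concurrency, not the pm2 process management.

```typescript
// Hypothetical types and names for illustration; not the released library's API.
type Fingerprint = { url: string; ok: boolean; detail: string };
type VisitFn = (url: string) => Promise<string>;

// Visit every URL with at most `concurrency` visits in flight at a time.
// Results are returned in the same order as the input URLs.
async function crawlAll(
  urls: string[],
  visit: VisitFn,
  concurrency: number,
): Promise<Fingerprint[]> {
  const results = new Array<Fingerprint>(urls.length);
  let next = 0; // shared cursor; safe because the event loop runs one callback at a time

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      const url = urls[i];
      try {
        results[i] = { url, ok: true, detail: await visit(url) };
      } catch (err) {
        // One failing site should not take down the whole crawl.
        results[i] = { url, ok: false, detail: String(err) };
      }
    }
  }

  const poolSize = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results;
}

// Stand-in for a real page visit (assumption: a real one would drive a browser).
const fakeVisit: VisitFn = async (url) => {
  await new Promise((resolve) => setTimeout(resolve, 5));
  if (url.includes("bad")) throw new Error("navigation failed");
  return `visited ${url}`;
};

crawlAll(["https://a.example", "https://bad.example", "https://c.example"], fakeVisit, 2)
  .then((rows) => rows.forEach((r) => console.log(r.url, r.ok, r.detail)));
```

In a real run, `visit` would presumably wrap Puppeteer calls such as `browser.newPage()` and `page.goto(url)` and return whatever fingerprint data is needed. Capping the pool size keeps the number of open tabs bounded on a single workstation.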
The detection pipeline replicated the work in the Phishpedia USENIX paper. I had been playing with the idea of logo-based phishing detection pipelines and took this work as a validation of the idea. I wanted to work on it as part of my master’s thesis at Islington circa 2020-2021, but I didn’t have the resources to manually collect and label webpages at scale, so I decided to pursue periodicity detection using one-dimensional Convolutional Neural Networks instead (not as glamorous, but it helped me flag periodic activity in entity pairs such as IP-Domain pairs for malware beaconing detection at Vairav).
Phishpedia (and PhishIntention) author Ruofan Liu is a brilliant researcher from Singapore. Her team’s work in visual-analytics-based phishing detection is unparalleled. Tip: some of the examples in the Phishpedia / PhishIntention dataset need cleaning.
I decided to open-source the project because I hadn’t used it in a long time, and closed-source code isn’t going to help anyone. Plus, I have already open-sourced another Azure consent audit-type-thing project that I did for my master’s capstone at RIT, which is a lot more relevant now in 2026 than traditional phishing.