An improved Simhash algorithm based malicious mirror website detection method
Article 2021 en
Authors
Guangxuan Chen
Guangxiao Chen
Di Wu
Abstract
The Internet contains a large number of similar or even identical webpages. These pages waste network resources: they consume storage space, slow down web search, and degrade the user experience. Some malicious mirror websites also become tools for criminals to carry out illegal activities such as phishing attacks. In this paper, the authors analyze the mainstream text similarity detection and webpage deduplication algorithms and propose an improved webpage deduplication algorithm based on Simhash. The algorithm maps each text collection to a Simhash fingerprint for storage and computes the Hamming distance between two fingerprints to obtain the similarity of the corresponding webpages. Experiments show that the proposed algorithm achieves higher accuracy and recall, and is better suited to identifying and detecting malicious mirror websites.
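The fingerprint-and-compare pipeline described in the abstract can be illustrated with a minimal sketch of the classic Simhash scheme. This is not the authors' improved variant, whose details the abstract does not give. The 64-bit fingerprint width, the MD5-based token hash, the word-level tokenizer, the term-frequency weights, and all function names below are assumptions chosen for illustration.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def token_hash(token):
    """Map a token to a 64-bit integer from the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    """Classic Simhash: each token votes on every bit, weighted by its frequency."""
    weights = Counter(tokenize(text))
    totals = [0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    # A fingerprint bit is set when the weighted votes for it are positive.
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b):
    """Similarity in [0, 1]: the fraction of fingerprint bits that agree."""
    return 1.0 - hamming_distance(a, b) / FINGERPRINT_BITS


if __name__ == "__main__":
    page_a = "Welcome to Example Bank. Please sign in to view your account."
    page_b = "Welcome to Example Bank. Please sign in to see your account."
    fp_a, fp_b = simhash(page_a), simhash(page_b)
    print(hamming_distance(fp_a, fp_b), similarity(fp_a, fp_b))
```

In a deduplication or mirror-detection setting, two pages would typically be flagged as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is a tuning parameter and is not specified in the abstract.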