Website correlation

Website correlation, or website matching, is a process used to identify websites that are similar or related. Websites are inherently easy to duplicate (Search: [https://www.google.com/search?q=website+replication "website replication"], Google; Search: [https://www.google.com/search?q=website+clone+script "website clone script"], Google). This has led to a proliferation of identical websites (Fetterly, D., Manasse, M., Najork, M., "[http://www.cwr.cl/la-web/2003/stamped/05_fetterly_d.pdf On the Evolution of Clusters of Near Duplicate Web Pages]", Proceedings of the First Conference on Latin American Web Congress, pp. 37, 2003) or very similar websites for purposes ranging from translation to Internet marketing, especially affiliate marketing (I've Got a Domain Name--Now What???: A Practical Guide to Building a Website and Web Presence, {{ISBN|1-60005-109-X}}, 2008), to Internet crime (Shane McGlaun, [http://www.dailytech.com/Microsoft+Granted+Permanent+Ownership+of+276+Botnet+Domains/article19579.htm "Microsoft Granted Permanent Ownership of 276 Botnet Domains"], Daily Tech, 2010/9/9). Locating similar websites is inherently problematic because they may be in different languages, on different servers, or in different countries (different top-level domains).

Uses

Website correlation is used in:

Internet investigations (Investigations Involving the Internet and Computer Networks [http://www.ncjrs.gov/pdffiles1/nij/210798.pdf], National Institute of Justice (U.S.), 2007) to determine the overall scope of an investigation
Market research to locate competitors or determine the market reach of competing companies or for cluster sampling
Web filtering systems (J. Prasanna Kumar, P. Govindarajulu, "Duplicate and Near Duplicate Documents Detection: A Review", European Journal of Scientific Research, {{ISSN|1450-216X}}, Vol. 32, No. 4 (2009), pp. 514-527) to ensure that all websites of a specific type are blocked from view
Data mining systems to maximize input or output data
Risk management programs to ensure websites are being monitored for problems that introduce fiscal risk
Compliance monitoring as part of a compliance and ethics program or policy to ensure websites follow established guidelines

Correlation types

There are several known types of correlation, each demonstrating different strengths and weaknesses. A practical website correlation process may require combining two or more of these methods.

=Similar structure=

To save time and effort, website owners often duplicate major portions of website code across many domains. Similarity of code structure can provide enough information for correlation. Organizations known to have publicly searchable databases for this kind of correlation include:

http://www.delineal.com

Note: Websites can share the same structure yet have no relationship to each other (as when websites coincidentally use the same content management system).
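One way to make "similarity of code structure" concrete is to compare the sequence of HTML tags on two pages while ignoring their text. The sketch below is only an illustration under stated assumptions, not the method used by any of the services listed above: it takes overlapping runs of three tag names ("shingles") and scores two pages by Jaccard similarity. The function names, the shingle size, and the sample pages are all hypothetical.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects the names of opening tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, size=3):
    """Return the set of consecutive tag-name runs of length `size`."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {tuple(tags[i:i + size]) for i in range(len(tags) - size + 1)}


def structure_similarity(html_a, html_b, size=3):
    """Jaccard similarity of the two pages' tag shingles, from 0.0 to 1.0."""
    a, b = tag_shingles(html_a, size), tag_shingles(html_b, size)
    if not a and not b:
        return 0.0  # too little markup to compare
    return len(a & b) / len(a | b)


# Hypothetical pages: a and b share a template but differ in language.
page_a = "<html><body><div><h1>Shoes</h1><p>Buy now</p></div></body></html>"
page_b = "<html><body><div><h1>Zapatos</h1><p>Compre ya</p></div></body></html>"
page_c = "<html><body><table><tr><td>Unrelated</td></tr></table></body></html>"

print(structure_similarity(page_a, page_b))
print(structure_similarity(page_a, page_c))
```

Because only tag names are compared, a translated copy of a page scores as high as an exact copy. Two unrelated sites built on the same content management system would also score high, which is the weakness described in the note above.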

=Same server or subnet=

Also known as correlated reverse DNS lookup. Websites may be served from the same server, from one or more IP addresses, or from one or more subnets. Several organizations retain archives of IP address data and correlate the data. Examples include:

http://www.domaintools.com

Note: Correlation via this method may be misleading because websites frequently exist on the same server (shared hosting) but have no relationship to each other.
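As a rough sketch of subnet-based correlation, the example below groups domains whose already-resolved IPv4 addresses fall in the same network. It assumes the domain-to-address mapping has been collected elsewhere; the domains, the addresses (taken from documentation ranges), the function name, and the /24 prefix length are illustrative assumptions.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(domain_to_ip, prefix_len=24):
    """Group domains by the network (of `prefix_len` bits) containing their IP."""
    groups = defaultdict(list)
    for domain, ip in domain_to_ip.items():
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(domain)
    # Only subnets hosting more than one domain are candidates for correlation.
    return {net: sorted(names) for net, names in groups.items() if len(names) > 1}


# Hypothetical, pre-resolved data.
resolved = {
    "shop-one.example": "192.0.2.10",
    "shop-two.example": "192.0.2.77",
    "other.example": "198.51.100.5",
}

print(group_by_subnet(resolved))
```

A shared subnet is weak evidence on its own: with shared hosting, many unrelated sites land in the same group, so the result is best treated as a list of candidates to check with another method.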

=Same owner=

Websites may be authored by the same person or organization. Website owners are required to provide contact information to a registrar to obtain a domain name. Domain ownership can be determined via the WHOIS protocol, which itself provides no mechanism for searching or correlating ownership. Several organizations retain archives of WHOIS information and provide searching and correlation services. Examples include:

http://whoisology.com
http://www.domaintools.com

Note: Website ownership information can be falsified, outdated, or hidden from public view. Website correlation via this method can be accurate, misleading, or impossible depending on the information contained in WHOIS records.
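Since WHOIS itself offers no search, correlation by owner amounts to collecting records and grouping them on a shared field. The sketch below assumes the raw WHOIS text for each domain has already been retrieved and that it uses "Key: value" lines with a "Registrant Email" field. Real WHOIS output varies by registry and registrar, so both the field name and the sample records are assumptions for illustration only.

```python
from collections import defaultdict


def parse_whois(text):
    """Parse 'Key: value' lines into a dict with lowercased keys.

    The first occurrence of a key wins; lines without a value are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip().lower(), value.strip())
    return fields


def group_by_registrant(records, field="registrant email"):
    """Group domains whose WHOIS text shares the same value for `field`."""
    groups = defaultdict(list)
    for domain, text in records.items():
        value = parse_whois(text).get(field)
        if value:
            groups[value.lower()].append(domain)
    return {owner: sorted(names) for owner, names in groups.items() if len(names) > 1}


# Hypothetical WHOIS responses, keyed by domain.
records = {
    "a.example": "Domain Name: a.example\nRegistrant Email: owner@example.org\n",
    "b.example": "Domain Name: b.example\nRegistrant Email: OWNER@example.org\n",
    "c.example": "Domain Name: c.example\nRegistrant Email: REDACTED FOR PRIVACY\n",
}

print(group_by_registrant(records))
```

With real data, privacy placeholders such as the one on c.example would need to be filtered out. Otherwise every domain behind the same privacy service would appear to share an owner, which is the "misleading" case mentioned in the note above.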

=Similar content=

Search engines provide searchable databases of indexed website content. Search engine result lists are correlated by content similarity.

==Google==

On Google.com, type 'related:website_name_here.com' to find websites related by name or phrases.
Find a unique-sounding phrase on the website, then use one or more search engines to locate that exact phrase on other websites.
In the search box, place quotes around the phrase to do a literal phrase search: instead of copyright 2010 xyzcompany, use "copyright 2010 xyzcompany".

Note: This method of correlation is inherently slow because one must guess which phrases to search for. Also, related websites may not contain literally similar content (as when a site is translated into another language).
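The two query forms described above can be generated programmatically. The following sketch only builds search URLs and does not send any requests. The helper names are hypothetical, and the base URL is the same public Google search address used in this article's citations.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.google.com/search"


def related_query(domain):
    """Query text using the 'related:' operator described above."""
    return f"related:{domain}"


def exact_phrase_query(phrase):
    """Wrap a phrase in double quotes for a literal phrase search."""
    # Drop embedded double quotes so the phrase stays a single quoted unit.
    return '"' + phrase.replace('"', "") + '"'


def search_url(query):
    """Return a search URL with the query safely percent-encoded."""
    return SEARCH_URL + "?" + urlencode({"q": query})


print(search_url(related_query("website_name_here.com")))
print(search_url(exact_phrase_query("copyright 2010 xyzcompany")))
```

Generating the queries does not remove the guesswork noted above. Someone still has to choose which phrases are distinctive enough to be worth searching for.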

=Same category=

Websites are frequently categorized or tagged similarly via automated or manual means. Examples of publicly accessible website categorization databases include:

http://www.similarsitesearch.com/
http://similarsites.com
DMOZ

Note: Manual categorization and tagging (metadata) methods are inherently subjective (Bruce & Wiebe, "[https://www.cambridge.org/core/journals/natural-language-engineering/article/recognizing-subjectivity-a-case-study-in-manual-tagging/131A4DC84D785E612DDC1AB3A823BD76 Recognizing subjectivity: a case study in manual tagging]", Natural Language Engineering, 1999). Automated categorization and tagging methods are inherently subject to the varying weaknesses and strengths of the underlying categorization algorithms (Fabrizio Sebastiani, [https://arxiv.org/abs/cs.ir/0110053 Machine learning in automated text categorization], ACM Computing Surveys, 34(1):1–47, 2002).

=Same tracking ID=

Tracking IDs, used for analytics or affiliate identification, are frequently embedded in website code. These IDs can be used for correlation because they imply common management of websites. Publicly available websites for correlating by tracking ID include:

http://ewhois.com
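A minimal sketch of tracking-ID correlation is to extract ID-like strings from each page's source and group the domains that share one. The patterns below are assumptions chosen for illustration: they match strings shaped like the "UA-" property IDs historically used by Google Analytics and like "pub-" publisher IDs. The article names no specific ID format, and other analytics or affiliate systems would need their own patterns. The sample pages and function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed ID shapes (not taken from the article): "UA-<digits>-<digits>"
# and "pub-<16 digits>".
TRACKING_ID_PATTERN = re.compile(r"\b(UA-\d{4,10}-\d{1,4}|pub-\d{16})\b")


def extract_tracking_ids(html):
    """Return the set of tracking-ID-like strings found in the page source."""
    return set(TRACKING_ID_PATTERN.findall(html))


def group_by_tracking_id(pages):
    """Map each tracking ID to the sorted domains that embed it (two or more)."""
    groups = defaultdict(set)
    for domain, html in pages.items():
        for tracking_id in extract_tracking_ids(html):
            groups[tracking_id].add(domain)
    return {tid: sorted(names) for tid, names in groups.items() if len(names) > 1}


# Hypothetical page sources, keyed by domain.
pages = {
    "site-a.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "site-b.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "site-c.example": "<script>ga('create', 'UA-7654321-1', 'auto');</script>",
}

print(group_by_tracking_id(pages))
```

A shared ID suggests common management, but it is not proof: a page copied wholesale from another site may carry the original owner's ID along with it.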
