Google has published a new podcast for webmasters that explains how duplicates and canonical pages are determined.
Canonicalization and duplicate detection are not the same thing. Google first finds duplicates and groups them into a cluster, then picks the leading page in that cluster. This second step is called canonicalization.
To identify duplicates, Google creates a checksum for each page. This can be compared to a unique fingerprint based on the words on the page. If Google finds two pages with the same checksum, it treats them as duplicates.
This method is suitable for detecting both full and partial duplicates.
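The grouping step can be illustrated with a minimal Python sketch. It is not Google's algorithm, whose details are not public. The helper names page_checksum and group_duplicates, the SHA-256 hash, and the example URLs are all assumptions made for illustration. The sketch only groups pages whose text is identical after normalization; catching partial duplicates would need a more tolerant fingerprint, which is not shown here.

```python
import hashlib
import re
from collections import defaultdict


def page_checksum(text: str) -> str:
    """Return a hex digest computed from the page's words only (illustrative)."""
    # Lowercase and keep only word characters so that case, punctuation,
    # and whitespace differences do not change the fingerprint.
    words = re.findall(r"\w+", text.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose page text yields the same checksum."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[page_checksum(text)].append(url)
    # Only clusters with more than one URL contain duplicates.
    return [urls for urls in clusters.values() if len(urls) > 1]


# Hypothetical pages: the first two differ only in case, punctuation, and spacing.
pages = {
    "https://example.com/a": "Hello, world! This is a page.",
    "http://example.com/a?ref=1": "hello world   this is a page",
    "https://example.com/b": "A completely different page.",
}
duplicate_clusters = group_duplicates(pages)
print(duplicate_clusters)
```

Here the first two URLs end up in one cluster because their word sequences match, while the third page stays on its own.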
A checksum is a value computed from a block of digital data in order to detect errors that may have been introduced during its transmission or storage. Programmers often use checksums to verify data integrity.
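As a small illustration of this general use of checksums, the sketch below uses CRC-32 from Python's standard zlib module. The byte strings and the helper name crc32_of are made up for the example, and the podcast does not say which checksum function Google uses.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """Compute the CRC-32 checksum of a block of bytes."""
    return zlib.crc32(data)


# Checksum recorded when the data was first stored or sent.
original = b"canonical page content"
stored_checksum = crc32_of(original)

# One copy arrives unchanged, the other with a single corrupted byte.
received_ok = b"canonical page content"
received_bad = b"canonical page c0ntent"

intact = crc32_of(received_ok) == stored_checksum
corrupted = crc32_of(received_bad) != stored_checksum
print(intact, corrupted)
```

A matching checksum suggests the data is unchanged, while a mismatch shows that it was altered somewhere along the way.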
The canonical page is the main page in the cluster. Google takes more than 20 signals into account when selecting it, including content, the page's PageRank, whether the page is served over HTTPS, redirects, and the rel=canonical attribute.
Google uses machine learning to assign weights to all of these signals.
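The idea of weighted signals can be sketched as a simple weighted sum. Everything concrete in this example is hypothetical: the signal names, the weight values, the signal scores, and the functions canonical_score and pick_canonical. Google's real weights are learned by its machine learning system and are not published, and its actual model may not be a plain weighted sum.

```python
# Hypothetical signal weights, chosen by hand for illustration only.
WEIGHTS = {
    "https": 1.0,
    "rel_canonical_target": 3.0,
    "redirect_target": 2.5,
    "pagerank": 2.0,
}


def canonical_score(signals: dict[str, float]) -> float:
    """Combine a page's signal values into one score; missing signals count as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def pick_canonical(cluster: dict[str, dict[str, float]]) -> str:
    """Return the URL with the highest score in a duplicate cluster."""
    return max(cluster, key=lambda url: canonical_score(cluster[url]))


# A made-up cluster of two duplicate URLs with made-up signal values.
cluster = {
    "http://example.com/a": {"https": 0.0, "pagerank": 0.4},
    "https://example.com/a": {
        "https": 1.0,
        "rel_canonical_target": 1.0,
        "pagerank": 0.3,
    },
}
canonical = pick_canonical(cluster)
print(canonical)
```

In this made-up cluster the HTTPS URL wins: although its PageRank value is slightly lower, the HTTPS and rel=canonical signals outweigh that difference.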
It is important to note that canonicalization has nothing to do with ranking: the selected page is ranked based on signals other than those used in the canonicalization process.