image/svg+xml $ $ ing$ ing$ ces$ ces$ Res Res ea ea Res->ea ou ou Res->ou r r ea->r ch ch ea->ch r->ces$ r->ch ch->$ ch->ing$ T T T->ea ou->r

Introduction

My PhD dissertation is entitled "Looking for Similarity in Source Code" ("recherche de similarité dans du code source" in French). It describes more deeply some themes (factorization of functions of token sequences, usage of abstract subtrees represented by hash values...) already evoked in some publications and introduces new ones (meta-tokenization, similarity metrics...). I defended it on November 25th 2010 to get my PhD degree in the Gaspard Monge Computer Science Laboratory of the University Paris-Est in France. The jury was composed of Prof. Gilles Roussel (my PhD supervisor), Prof. Thierry Lecroq from the University of Rouen, Dr. Didier Parigot from INRIA Sophia Antipolis as reporters and Dr. Julien Allali from the University of Bordeaux as the examiner.

Small abstract in English

Several phenomena cause source code duplication like inter-project copying and adaptation or cloning inside a same project. Looking for code matches allows to factorize them inside a project or to highlight plagiarism cases. We study statical similarity retrieval methods on source code that may be transformed via edit operations like insertion, deletion, transposition, in- or out-lining of functions.

Sequence similarity retrieval methods inspired from genomics are studied and adapted to find common chunks of tokenized source. After an explanation on alignment and $k$-grams lookup techniques, we present a factorization method that merges function call graphs of projects to a single graph with the creation of synthetic functions modeling nested matches. It relies on the use of suffix indexation structures to find repeated token factors.

Syntax tree indexation is explored to handle huge code bases allowing to lookup similar sub-trees with their hash values computed via heterogeneous abstraction profiles. Exact copies of sub-trees close in their host trees may be merged to get approximate and extended matches.

Before and after match retrieval, we define similarity metrics to preselect interesting code spots, refine the search process or enhance the human understanding of results.

Online version

A (never definitive) version of the manuscrit is available here in French.
The slides used during the defense are here (also in French).