Commit eab1dfa

Merge pull request python-geeks#324 from abhinand5/main
Added Simple Plagiarism Checker
2 parents 6321ea2 + 923715b commit eab1dfa

4 files changed, +83 -0 lines changed


simple_plagiarism_detector/README.md

+27
@@ -0,0 +1,27 @@
# Simple Plagiarism Checker

This script computes the cosine similarity between two texts. Strictly speaking, it is a similarity measure rather than a plagiarism detector, but the score can serve as a rough first check when comparing two texts for similarity or plagiarism.

## Requirements
Uses only the Python standard library.

## Usage

Pass the two text files you want to compare as command-line arguments when running the script.

`$ python simple-plagiarism-detector.py file1.txt file2.txt`

**Sample Output:**

`file1.txt` contents
> This is an example sentence

`file2.txt` contents
> This sentence is similar to an example sentence


`$ python simple-plagiarism-detector.py file1.txt file2.txt`

Similarity Score: 0.8485281374238569

**Contributor:** [abhinand5](https://github.com/abhinand5)
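
The sample score in the README can be reproduced by hand if the two sentences are compared by whole-word counts. The following is a minimal sketch under that assumption; `word_vector` and `cosine` are illustrative names, not part of the commit. Note that the script in this commit counts individual characters and rounds to two decimal places, so its printed value for the same inputs would differ.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Count whole words (assumed tokenizer: runs of letters and digits)."""
    return Counter(re.findall(r"[A-Za-z0-9]+", text))


def cosine(vec1, vec2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(vec1[k] * vec2[k] for k in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


v1 = word_vector("This is an example sentence")
v2 = word_vector("This sentence is similar to an example sentence")

# Shared words: This, is, an, example (1*1 each) and sentence (1*2), so dot = 6.
# Norms are sqrt(5) and sqrt(10), giving 6 / sqrt(50), roughly 0.8485.
score = cosine(v1, v2)
print("Similarity Score:", score)
```

Counting whole words makes the score sensitive to vocabulary overlap, whereas counting characters tends to push most natural-language pairs toward high scores.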

simple_plagiarism_detector/file1.txt

+1
@@ -0,0 +1 @@
:) :(

simple_plagiarism_detector/file2.txt

+1
@@ -0,0 +1 @@
:( :)
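
The two sample files contain the same five characters in a different order. Because cosine similarity over counts ignores position, a per-character comparison should treat them as essentially identical. The sketch below assumes per-character counting with `Counter(text)`, which should be equivalent to combining the two regular expressions used by the detector script in this commit; `char_cosine` is a hypothetical helper.

```python
import math
from collections import Counter


def char_cosine(text1, text2):
    """Cosine similarity over per-character counts (hypothetical helper)."""
    vec1, vec2 = Counter(text1), Counter(text2)
    dot = sum(vec1[ch] * vec2[ch] for ch in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


# Contents of file1.txt and file2.txt from this commit: same characters, reordered.
same_chars = char_cosine(":) :(", ":( :)")
# A text with different characters pulls the score below 1.
different = char_cosine(":) :(", ":D :D")
print(round(same_chars, 2), round(different, 2))
```

This also illustrates a limitation of the approach: any reordering of the same characters scores as a near-perfect match.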

@@ -0,0 +1,54 @@
# Simple plagiarism detector
# Note: This is a very simple example and might not work perfectly
# Author: Abhinand

import math
import re
import sys
from collections import Counter

# Despite its name, WORD matches one alphanumeric character at a time,
# so the vectors built below count characters rather than whole words.
WORD = re.compile(r"[A-Za-z0-9]")
# Every other character (spaces, punctuation) is also counted individually.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9]")


def get_cosine(vec1, vec2):
    """Compute the cosine similarity between two count vectors."""
    # Only keys present in both vectors contribute to the dot product.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Squared Euclidean norm of each vector.
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero norm; return 0.0 instead of dividing by zero.
    if not denominator:
        return 0.0
    return numerator / denominator


def text_to_vector(text):
    """Convert text to a Counter of its individual characters."""
    words = WORD.findall(text)  # single alphanumeric characters
    special = SPECIAL_CHARS.findall(text)  # all other characters, one at a time
    return Counter(words) + Counter(special)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: python simple-plagiarism-detector.py file1.txt file2.txt")

    # Read each file, close it afterwards, and flatten newlines to spaces.
    with open(sys.argv[1], "r") as file1:
        text1 = file1.read().replace("\n", " ")

    with open(sys.argv[2], "r") as file2:
        text2 = file2.read().replace("\n", " ")

    if len(text1) == 0 and len(text2) == 0:
        print("Given text files were empty, imputing '1' as a placeholder")
        text1 = "1"
        text2 = "1"

    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)

    # Round to two decimal places for display.
    cosine = round(get_cosine(vector1, vector2), 2)

    print("Similarity Score:", cosine)
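
The empty-input handling in the script has consequences that are easy to overlook: two empty files are given the placeholder `1` and therefore compare as identical, while a single empty file falls through to the zero-denominator guard and scores 0.0. The sketch below mirrors that flow under stated assumptions; `char_vector`, `cosine`, and `score` are illustrative names, and `Counter(text)` is assumed to be equivalent to the script's two regular expressions combined.

```python
import math
from collections import Counter


def char_vector(text):
    """Count every character (assumed equivalent to the script's two regexes)."""
    return Counter(text)


def cosine(vec1, vec2):
    """Cosine similarity with the same zero-denominator guard as the script."""
    dot = sum(vec1[k] * vec2[k] for k in vec1.keys() & vec2.keys())
    denominator = math.sqrt(sum(c * c for c in vec1.values())) * math.sqrt(
        sum(c * c for c in vec2.values())
    )
    return dot / denominator if denominator else 0.0


def score(text1, text2):
    """Mirror the script's flow: placeholder for two empty inputs, then round."""
    if not text1 and not text2:
        text1 = text2 = "1"
    return round(cosine(char_vector(text1), char_vector(text2)), 2)


both_empty = score("", "")
one_empty = score("", "some text")
identical = score("same text", "same text")
print(both_empty, one_empty, identical)
```

Whether two empty files should count as a perfect match is a design choice; a caller that needs to distinguish that case would have to check for empty input before scoring.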
