9

I have checked the originality of my PhD thesis in mathematics using Turnitin. The similarity was 31%. Is this percentage acceptable by most committees?

10
  • 1
    I would imagine that this would vary from university to university.
    – user21984
    Commented Oct 10, 2014 at 10:45
  • 7
    Provided that we are speaking about the ratings provided by automated tools (somewhat implied by the "similarity"): the answer should probably be that it does not matter. Any single copied paragraph that is beyond coincidence is a reason for rejecting the thesis. At the same time, for a thesis, someone should definitely check all potential cases of plagiarism that an automated tool reports. Otherwise the department would be using the automated checking tool in a plainly wrong way. If the rating is "90% plagiarism", but all cases found by the tool are false positives, then this should be fine.
    – DCTLib
    Commented Oct 10, 2014 at 11:08
  • 1
    A different (but similar) question with good answers and comments is asked here: What is the range of percentage similarity of plagiarism for a review article?
    – enthu
    Commented Oct 10, 2014 at 11:14
  • 13
    Surely you know whether or not you've plagiarised. If you have, you'll be removing that plagiarism before you submit, regardless of the Turnitin score. So why are you checking your own thesis with Turnitin?
    – 410 gone
    Commented Oct 10, 2014 at 14:19
  • 1
    @FranckDernoncourt: I do not think anybody will need a link to Turnitin. See also this Meta discussion.
    – Wrzlprmft
    Commented Oct 10, 2014 at 16:53

5 Answers

17

Is this percentage acceptable by most committees?

This is the wrong question to be asking, since academic decisions are not made based on a numerical measure of similarity from a computer program. The purpose of this software is to flag suspicious cases for humans to examine more carefully. It will identify passages that appear similar to other writings, but it can't decide whether that constitutes plagiarism.

For example, part of your thesis might be based on previous papers you have written. In some circumstances, it may be reasonable to copy text from these papers. (You need to check that your advisor approves and that it doesn't conflict with any university regulations or the publishing agreement with the publisher.) Of course you would need to cite the papers and clearly indicate the overlap. It's not plagiarism if you do that, but Turnitin doesn't understand what you've written well enough to distinguish it from plagiarism. So it's possible that Turnitin would flag lots of suspicious sections, but that your committee would look at them and see that everything is cited appropriately.

If you haven't committed any plagiarism, then you don't need to worry about this at all. If you genuinely write everything yourself (or carefully quote and cite anything you didn't write), then there's no way you could accidentally write something that looks like proof of plagiarism. There's just too much possible variation, and the probability of matching someone else's words by chance is negligible. The worst case scenario is that Turnitin flags something due to algorithmic limitations or a poor underlying model, but human review shows that it is not actually worrisome. (Nobody trusts Turnitin more than they trust their own judgment.)

I'll assume you don't know of any plagiarism you've committed, but is it possible that you honestly wouldn't know? Unfortunately, the answer is yes if you have certain bad writing habits. For example, it's dangerous to write while having another reference open in front of you to compare with. Even if you don't copy anything verbatim, it's easy to write something that's just an adaptation of the original source (maybe rewording sentences or rearranging things slightly, but clearly based on the original).

If that's what worries you, then you should take a look at the most suspicious passages found by Turnitin. If they look like an adaptation of another source, then it's worth rewriting them. If they don't, then maybe Turnitin is worrying you unnecessarily.

But in any case a plagiarism finding won't just come down to a percentage of similarity. Any percentage greater than 0 is too much for actual plagiarism, and no percentage is too high if it reflects limitations of the software rather than actual plagiarism.

1
  • 5
    Something that may be different for math papers is that many definitions are standard enough that the wording is almost exactly the same in all papers. I would not try to get "creative" with the definition of a complete metric space for example. Commented Oct 24, 2014 at 22:31
11

TurnItIn uses a complicated algorithm to determine whether a piece of text within a larger body of work matches something in its database. TurnItIn's database is limited to openly accessible sources and therefore has huge gaps in its ability to detect things. Further, while TurnItIn can in some cases exclude things like references and quotes from the similarity index, it sometimes fails. Overall, when my department's academic misconduct committee looks at TurnItIn reports, we essentially ignore the overall similarity index. We do not completely ignore it, though, in that it guides how we further examine the document.

We employ four different strategies based on whether the similarity index is 0, between 1 and 20 percent, between 20 and 40 percent, or over 40 percent. A piece of work with a similarity index of 0 is pretty rare and generally means that students have manipulated the document in a way that TurnItIn cannot process (e.g., if a paper is converted to an image file and then converted to a pdf, there is no text for TurnItIn to analyse). A similarity index less than 20 percent can arise from work that contains no plagiarism at all, with the similarity being quotes, references, and small meaningless sentences. The key here is "meaningless". For example, there are only so many ways of saying "we did a t-test between the two groups", and it is reasonable to assume that someone else has used exactly the same wording. A piece of work with a similarity index less than 20 percent can also, however, include a huge amount of plagiarised material. A similarity index between 20 and 40 percent generally means there is a problem, unless a large portion of text that should have been skipped was not (e.g., block quotes, reference lists, or appendices of common tables). A similarity index in excess of 40 percent is almost always problematic.

You really should not depend on the overall similarity index. First and foremost, you should depend on your own good academic practices. If you have followed good academic practices, there really is no need for TurnItIn. If you want to use the TurnItIn report, you should look at what is being matched and ask yourself why it is matching. If it found something you "accidentally" cut and pasted, or "inadvertently" did not reword appropriately, fix it and use that as a wake-up call to improve your academic practice. If everything it finds is properly attributed quotes or common tables (or questionnaires, etc.) and references, then there is no problem.

3

I have some familiarity with Turnitin, though that was way back in undergrad. The thing about similarity engines is that they aren't perfect.

It's important to consider exactly how Turnitin describes itself on its FAQ.

What does TurnItIn actually do?

Turnitin determines if text in a paper matches text in any of the Turnitin databases. By itself, Turnitin does not detect or determine plagiarism — it just detects matching text to help instructors determine if plagiarism has occurred. Indeed, the text in the student’s paper that is found to match a source may be properly cited and attributed.
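The kind of text matching the FAQ describes can be illustrated with n-gram ("shingle") overlap. To be clear, this is only a sketch of the general idea, not Turnitin's actual, proprietary algorithm; the function names here are invented for illustration.

```python
# Sketch of the general idea behind text-matching tools: report the
# fraction of a submission's overlapping word n-grams ("shingles")
# that also occur in a reference source. Not Turnitin's real algorithm.

def shingles(text: str, n: int = 5) -> set:
    """All overlapping n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def matching_fraction(submission: str, source: str, n: int = 5) -> float:
    """Fraction of the submission's shingles found in the source."""
    sub = shingles(submission, n)
    if not sub:
        return 0.0
    return len(sub & shingles(source, n)) / len(sub)
```

Even this toy version shows why the output needs human interpretation: a properly attributed block quote matches its source at 100%, exactly as verbatim plagiarism would.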

When we were testing Turnitin in high school (probably a decade ago) on a short writing prompt (a page or two) with a single source, the entire class ended up with a 15 to 20% similarity score, because not only did our sources match, but our quotes matched. No surprise there, really.

Now, consider how large Turnitin's database has grown. If this FAQ is to be trusted, you're comparing your paper to more than 80 thousand journals.

Turnitin’s proprietary software then compares the paper’s text to a vast database of 12+ billion pages of digital content (including archived internet content that is no longer available on the live web) as well as over 110 million papers in the student paper archive, and 80,000+ professional, academic and commercial journals and publications. We’re adding new content through new partnerships all the time. For example, our partner CrossRef boasts 500-plus members that include publishers such as Elsevier and the IEEE, and has already added hundreds of millions of pages of new content to our database.

If I recall correctly, you can see exactly where your paper has similarity with others, so you can pull that up.

Sources of Similarity

My bet is that your paper cites papers almost identically to how another paper cites theirs. The great benefit of common citation styles like APA and MLA is that they're consistent.

If you cite, for example, the general APA format from Purdue, and someone else cites it, they're going to match at almost 100%.

Angeli, E., Wagner, J., Lawrick, E., Moore, K., Anderson, M., Soderlund, L., & Brizee, A. (2010, May 5). General format. Retrieved from http://owl.english.purdue.edu/owl/resource/560/01/

The chance that you are citing a source that has never been cited before, measured against the whole world of science, is, let's face it, probably 0%. Someone out there has cited your sources at some point. With references at times making up 10% of the paper's length, that's an easy portion we can knock out.

The other portion likely has to do with the vernacular that is used to describe a situation. Let's go with the following statement, written entirely off the top of my head.

Java is an object-oriented programming language.

Pretty simple statement, and true enough that it has been mentioned 260,000 times already, in that exact wording.

Similarity for that statement is 100% if it were to check for that. But when you make it loosely checked for similarity (i.e. remove the quotes from the search), you get several million hits.

Does that mean I plagiarized? Nope. Would TurnItIn flag it? Definitely. Consider how often everyday people greet each other with "How was your weekend?" Are we plagiarizing each other's greetings? Nope. We pick up similarities in how we use language to understand each other, and that shows in papers, where we describe confidence intervals, methodologies, and processes the same way.

Perhaps even more terrifying, when considering the similarity score, is that it will likely rate the two following statements as similar:

Statement 1

The double helix of DNA was first discovered by the combined efforts of Watson and Crick. Watson and Crick would later get a Nobel Prize for their efforts.

Statement 2

The double helix of DNA was not first discovered by the combined efforts of Watson and Crick, but by Franklin. Watson and Crick would later get a Nobel Prize for her efforts.

Two very similar sentences: 80-90% similarity word-wise. Meaning-wise? Completely different. That's why the human element is required. We can tell those two statements tell entirely different stories when read. These small similar sets of wording add up quite quickly, and a 30% similarity in your case, given the level of research probably done in whatever your field is and the number of sources you have probably cited (100+?), is unlikely to be anything to fret about in this day and age.
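That word-level overlap is easy to quantify. As a rough illustration (a sketch using word-set Jaccard similarity, not Turnitin's method), the two statements above score around 80% despite saying opposite things:

```python
import re

# Illustrative only: word-level Jaccard similarity between the two
# DNA statements above. High lexical overlap, opposite meanings.

def jaccard_words(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets."""
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    return len(wa & wb) / len(wa | wb)

s1 = ("The double helix of DNA was first discovered by the combined "
      "efforts of Watson and Crick. Watson and Crick would later get "
      "a Nobel Prize for their efforts.")
s2 = ("The double helix of DNA was not first discovered by the combined "
      "efforts of Watson and Crick, but by Franklin. Watson and Crick "
      "would later get a Nobel Prize for her efforts.")

print(f"{jaccard_words(s1, s2):.0%}")  # roughly 80% word overlap
```

A purely lexical score cannot see that the second statement negates the first, which is exactly why a human has to read the flagged passages.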

2
  • 1
    Nice analysis, but I disagree with your conclusions. Do you have any evidence that "30% similarity is unlikely to be anything to fret about in this day and age." I haven't calculated the numbers for my department, although it might be worth doing, but I would estimate that over 3/4 of the cases of academic misconduct I have seen have an overall similarity index of less than 30%. Harder for me to estimate is the percentage of work that has a similarity index in excess of 30 percent that did not involve academic misconduct.
    – StrongBad
    Commented Oct 10, 2014 at 14:23
  • @StrongBad I mean in his case, not in general, sorry D: I'm sure if we really wanted to, we could definitely break TurnItIn by forceful plagiarism at a <10% rating, and I know students will likely do that. I'll edit it to reflect that.
    – Compass
    Commented Oct 10, 2014 at 14:25
-1

From my experience with Ithenticate (the version of turnitin for journals and conference proceedings), I'd say that 30% similarity most likely indicates significant plagiarism or self-plagiarism (recycling of text.) I would certainly investigate further to understand exactly where the similar text was coming from.

If the similar text is taken from sources written by other authors, then I would investigate further by reading the text carefully and comparing it with the sources. There are certainly false alarms raised by this type of software. For example, common phrases like "Without loss of generality, we can assume that..." and "Partial differential equation boundary value problem" will be flagged. Standard definitions are also commonly flagged. However, if I see long narrative paragraphs with significant copying, that's clearly plagiarism.

It's traditional at many universities to staple together a bunch of papers and call it a dissertation. Conversely, it's also very common to slightly rewrite chapters of a dissertation and turn them into papers. Either way, this is "text recycling."

Now that text recycling can be easily detected, commercial publishers are cracking down on it for a variety of reasons. First, the publisher might get sued for copyright violation if the holder of the copyright on the previously published text objects. A different objection is that the material shouldn't be published because it isn't original. As a result, text recycling between two published papers (in conference proceedings or journal articles) is rapidly becoming a thing of the past. This has upset many academics who have made a habit of reusing text from one paper to the next. Some feel that if the reused text is from a methods section or literature review, then the copying is harmless. Publishers typically take a harder line.

The situation with dissertations is somewhat different. In one direction journals have always been willing to accept papers that are substantially based on dissertation chapters with minimal rewriting. Since the student usually retains copyright on the thesis itself, there's no particular problem with copyright violation. Since dissertations traditionally weren't widely distributed, publishers didn't care that the material had been "previously published." I don't really expect this to change much in the near future.

In the other direction, there are two issues: First, will the publisher of journal articles object to reuse of the text in the dissertation as a copyright violation? You'd need to check with the publisher. Second, will the university be willing to accept a dissertation (and perhaps publish it through Proquest or its own online dissertation web site) that contains material that has been separately published? That really depends on the policy of your university and the particular opinions of your advisor and committee.

-2

I have used websites in the past to help with similar content; they will give you a report of what was found online and help you remove or reword the similar content, so you don't have to worry about your document being marked as plagiarism.

1
  • He doesn't have to worry about the document being marked as plagiarism, no matter what a program says. Commented Oct 24, 2014 at 17:57
