A Discussion of Problems Encountered while Creating the ChemFinder Database

Typos
Many of the most common problems we've found involve typos. References to "mehtyl" groups and "napthalene" derivatives are particularly common, but they are far from the only ones. We think we've caught most of these.

Chemistry by the Non-Chemist
These sorts of errors could be considered typos, but they clearly arise from a specific cause: non-chemists (including off-the-shelf OCR packages) trying to transcribe chemistry. These mistakes are chemically absurd, but grammatically reasonable. Common mistakes in this category include "1-tryptophan" (instead of L-tryptophan), references to "H20" (aitch-two-zero instead of aitch-two-oh), 3s becoming 8s and so on. We think we've caught most of these where they involve chemical names, but have probably not done as well when it comes to physical property information.

Nomenclature Changes
Some of the chemical "errors" we've found were completely correct at one point. The best example of this is the name for the element with atomic number 16. Until recently, this element was commonly called "sulphur" and its derivatives "sulphur compounds"; the preferred spelling is now "sulfur". We believe we have caught most of these, and ChemFinder will treat sulphur/sulfur interchangeably when evaluating search queries.

Regional Variations
As its name implies, the WWW is a world-wide phenomenon. Unfortunately, as with the metric system, the US has its own nomenclature when it comes to some chemistry. For example, the name of the element with atomic number 13 is "aluminum" in the US, but "aluminium" in most of the rest of the world. The name of the element with atomic number 15 is variously "phosphorus" or "phosphorous". We have tried to standardize on the American spellings for ChemFinder, and it will treat aluminium/aluminum and phosphorous/phosphorus interchangeably when evaluating search queries.
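As a rough illustration of treating such spellings interchangeably, a query can be rewritten to one standard spelling before it is looked up. The following is a minimal Python sketch, not ChemFinder's actual code; the `SPELLING_VARIANTS` table and the `normalize_spelling` function are hypothetical names, and the table holds only the three pairs mentioned on this page.

```python
import re

# Hypothetical variant table: alternative spelling -> standard (American) spelling.
# Only the pairs discussed in the text are listed; a real table would be larger.
SPELLING_VARIANTS = {
    "sulphur": "sulfur",
    "aluminium": "aluminum",
    "phosphorous": "phosphorus",
}

# One case-insensitive pattern that matches any of the variant spellings,
# including inside longer words such as "sulphuric".
_VARIANT_RE = re.compile(
    "|".join(re.escape(variant) for variant in SPELLING_VARIANTS),
    re.IGNORECASE,
)


def normalize_spelling(query: str) -> str:
    """Rewrite known variant spellings in a query to the standard form."""
    return _VARIANT_RE.sub(
        lambda match: SPELLING_VARIANTS[match.group(0).lower()], query
    )


print(normalize_spelling("sulphuric acid"))      # sulfuric acid
print(normalize_spelling("Aluminium chloride"))  # aluminum chloride
```

Applying the same function to both the indexed names and the incoming query means either spelling finds the same entry. Note that this sketch would not touch "sulphate", which does not contain "sulphur"; covering such stems would need additional table entries.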

IUPAC vs. Popular Nomenclature
This is a tough one. Ask any chemist, and you will get ready agreement that

  • 1-propanol
  • propan-1-ol
  • n-propanol
  • 1-propyl alcohol
  • n-propyl alcohol
  • ...and so on
are all the same compound. Ask what the preferred name is and you will likely get five (or more!) different answers, and this is only a simple example.

We have taken several steps to try to address this problem. In building the ChemFinder database, we tried our best to retain all synonyms we found, without imparting any value judgements as to their appropriateness (with exceptions noted elsewhere on this page). In doing so, there are two sorts of errors that we could have easily made:
  1. Errors of inclusion: for example, 2-methyl-hydroxybenzene and 2-hydroxy-methylbenzene clearly refer to the same compound. However, add a third substituent (3-chloro-2-methyl-hydroxybenzene and 3-chloro-2-hydroxy-methylbenzene), and you no longer have identical compounds. If we had mistakenly treated these as identical, it would have been an error of inclusion. We believe we have minimized this type of error.
    2-methyl-hydroxybenzene
    2-hydroxy-methylbenzene
    3-chloro-2-methyl-hydroxybenzene
    3-chloro-2-hydroxy-methylbenzene

  2. Errors of omission: "cetyl alcohol" and "hexadecanol" refer to the same compound (CH3(CH2)15OH). Similarly, "ferrous sulfate", "iron (II) sulfate", and "sulfuric acid, iron (II) salt" are all different names for the same chemical. If we did not recognize that these were all the same, we could have listed them as separate compounds. Again, we have tried to minimize this type of error, but due to the large number of popular synonyms for many compounds (especially for consumer compounds that have multiple brand names), it is not clear how well we have done. We will make corrections to the ChemFinder database as we find these errors.
In searching the database, we have tried to anticipate many of the most common types of nomenclature variations that chemists might use. ChemFinder should be completely forgiving of simple punctuation variations ("propyl amine" vs. "propylamine"). It will also be fairly forgiving for most of the other problems listed above ("ferrous sulfate" vs. "iron (II) sulfate"). Even certain more complicated cases ("p-nitroaniline" vs. "4-nitrobenzene amine") will be recognized as equivalent. Of course, the more unusual the nomenclature, the less likely ChemFinder will recognize it.
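The two simpler kinds of forgiveness described above (punctuation variations and stored synonyms) can be sketched as a lookup key plus a synonym table. This is a minimal Python illustration under stated assumptions, not ChemFinder's actual implementation: `name_key`, `SynonymIndex`, and the `record-N` identifiers are all hypothetical, and structure-based equivalences such as "p-nitroaniline" vs. "4-nitrobenzene amine" are outside what this sketch handles unless both names are stored explicitly.

```python
import re


def name_key(name: str) -> str:
    """Reduce a name to a key that ignores case, whitespace and hyphens."""
    return re.sub(r"[\s\-]+", "", name.casefold())


class SynonymIndex:
    """Hypothetical synonym table: many names map to one record identifier."""

    def __init__(self):
        self._by_key = {}

    def add(self, record_id, *names):
        # Every synonym is stored under its normalized key.
        for name in names:
            self._by_key[name_key(name)] = record_id

    def lookup(self, query):
        # Returns the record identifier, or None if the name is unknown.
        return self._by_key.get(name_key(query))


index = SynonymIndex()
# Record identifiers below are placeholders, not real database keys.
index.add("record-1", "propylamine")
index.add("record-2", "ferrous sulfate", "iron (II) sulfate",
          "sulfuric acid, iron (II) salt")
index.add("record-3", "cetyl alcohol", "hexadecanol")

print(index.lookup("propyl amine"))      # record-1
print(index.lookup("iron(II) sulfate"))  # record-2
```

Because every stored name and every query passes through the same `name_key` function, "propyl amine" and "propylamine" collapse to one key, while "ferrous sulfate" and "iron (II) sulfate" match only because both were recorded as synonyms of the same record.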

Invalid CAS RNs
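CAS Registry Numbers have a built-in checksum: the last digit is a check digit computed from all the others. To calculate the checksum, number each digit from 1 to N, starting with the digit immediately before the check digit and progressing to the left. Then, multiply each digit by the number you assigned. Add the products, and take the units digit of the total. This units digit should match the last digit of the CAS number. For 26471-62-5, for instance, the total is (2×1) + (6×2) + (1×3) + (7×4) + (4×5) + (6×6) + (2×7) = 115, and its units digit, 5, matches the final digit.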



If the calculated checksum doesn't match, the CAS number is invalid. For example, 26471-62-4 and 26471-62-6 would both be invalid. We found many examples of invalid CAS numbers across the WWW. We do not have any CAS Registry Numbers at ChemFinder that have invalid checksums, and if you enter an invalid CAS number as a search query, ChemFinder will tell you so (rather than saying that it found no hits, which would technically be accurate, but not helpful in identifying the problem).
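The checksum rule described above is short enough to express in code. The following Python sketch is an illustration only, not ChemFinder's actual validation routine; the function names are hypothetical, and the pattern assumes the customary layout of two to seven digits, two digits, and one check digit separated by hyphens.

```python
import re

# Assumed layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
_CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def cas_check_digit(body_digits: str) -> int:
    """Weight the digits 1, 2, 3, ... from right to left, sum the products,
    and return the units digit of the total."""
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """Return True if the string has the CAS RN layout and its final digit
    matches the calculated checksum."""
    match = _CAS_RE.fullmatch(cas.strip())
    if match is None:
        return False
    first, second, check = match.groups()
    return cas_check_digit(first + second) == int(check)


print(is_valid_cas("26471-62-5"))  # True
print(is_valid_cas("26471-62-4"))  # False
print(is_valid_cas("26471-62-6"))  # False
```

A passing checksum only shows that the number is well formed; it says nothing about whether the number has been assigned to the compound it is listed with.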

You can also see the official word from CAS on this subject.

Mismatched CAS RNs
A problem more insidious than an invalid CAS RN is when a valid CAS RN is assigned to the wrong compound. Often this happens as a result of casual nomenclature. For example, "copper sulfate" is a popular chemical used in many high school and college labs. It has a very vivid blue color, and makes nice crystals. The CAS RN for "copper sulfate" is 7758-98-7, but that number belongs to the anhydrous salt, which is in fact white, not to the blue compound used in labs. The blue compound, despite its casual name, is really "copper sulfate, pentahydrate"; it has a CAS RN of 7758-99-8. The extra five waters make the difference between the two compounds. We have dealt with these mis-assignments ("copper sulfate is a blue solid with CAS RN 7758-98-7") on a case-by-case basis, sometimes indexing the compound under how we think it was intended, and sometimes indexing it under all possibilities. We will certainly correct any mis-indexing as we are made aware of it.
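The "index it under all possibilities" approach can be pictured as a name table in which one casual name points to several CAS RNs. This is a minimal Python sketch using only the two copper sulfate numbers quoted above; the `name_to_cas` table and the helper functions are hypothetical and are not a description of how ChemFinder stores its data.

```python
from collections import defaultdict

# Hypothetical index: lower-cased name -> set of candidate CAS RNs.
name_to_cas = defaultdict(set)


def index_name(name, *cas_numbers):
    """Record one or more candidate CAS RNs for a name."""
    name_to_cas[name.casefold()].update(cas_numbers)


def candidates(name):
    """Return every CAS RN recorded for a name, in sorted order."""
    return sorted(name_to_cas.get(name.casefold(), ()))


# Strict assignments, as given in the text.
index_name("copper sulfate", "7758-98-7")
index_name("copper sulfate, pentahydrate", "7758-99-8")

# Casual usage: a source describing blue "copper sulfate" probably means
# the pentahydrate, so the casual name is indexed under both numbers.
index_name("copper sulfate", "7758-99-8")

print(candidates("Copper Sulfate"))                # ['7758-98-7', '7758-99-8']
print(candidates("copper sulfate, pentahydrate"))  # ['7758-99-8']
```

Returning every candidate leaves the final choice to the person searching, which is the safer option when the intended compound cannot be determined from the source.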

Capitalization
One of the first decisions we made was to make ChemFinder case-insensitive. There are a lot of cases of strange capitalization on the WWW and on ChemFinder -- things like "N-pentane" and "n,n-Dimethylformamide". We have made virtually no attempt to address capitalization issues, and they will not affect searching.



©2004 CambridgeSoft Corporation. All Rights Reserved.