Parsing pre-2007 SMILES string

Question

How would one parse the following SMILES string?

 BrC[2]:C[3]:C(:CH:CH:CH:@2):CH:CH:CH:CH:@3

I rely on tools like RDKit and OpenBabel to parse SMILES, but neither tool is able to parse this string.

More specifically, this SMILES string comes from the supplementary information of a 2007 paper on solubility prediction. After some searching, it seems that the paper was received before the 2007 OpenSMILES standardization, which might be why standard parsers like RDKit have trouble with it.

After searching a bit more, I found an older PubChem sketcher that was able to convert the SMILES in question into a more recognizable format:

 BrC1=C(C(=C(C2=C1C(=C(C(=C2[H])[H])[H])[H])[H])[H])[H]

RDKit parsed this successfully (see molecule image below), but I still need a way to convert similarly formatted SMILES strings into the OpenSMILES standard/canonical format en masse.

Are the brackets dynamic? That is, does one structure use one bracket pattern and a different structure a different one, or are they always as in the second expression? Hope that makes sense; it does make a difference to the answer. — M__, Apr 19 '22 at 01:46
I don't think they're dynamic but am not very familiar with the SMILES format specification. — Ryan Park, Apr 19 '22 at 02:03
"Welcome to Jurassic Park" (theme music starts playing). Wow, that is proper ancient, and more similar to SMARTS than to SMILES! Your output version is kekulised and a machine-made mess; I don't think that's a good option. It ought to be something like `[Br]c1c2c(ccc1)cccc2` if keeping your input order, but I think the canonical order is `[Br]c1cccc2ccccc12` (unchecked, simply typed out while counting carbons aloud). As you can see, the standard isn't too dissimilar, so one could write a conversion schema at a push, although I am not sure it's worth it. — Matteo Ferla, Apr 19 '22 at 10:02
@MatteoFerla Do you know of any software tools that can execute this conversion? If not, how would you go about converting the older format to a more standard one? — Ryan Park, Apr 19 '22 at 12:43
I have not encountered pre-2007 SMILES, nor do I know of any tool that handles it. Few sites would still exist 15 years later. C++ pre-11 generally compiles; pre-98 is the nightmare. Scripting modules may exist, but in Perl, not Python, so CPAN would be the place to search. But the question is whether this is worth it; after all, your example molecule is bromonaphthalene. — Matteo Ferla, Apr 19 '22 at 12:59
I see, thanks. My problem is not parsing this particular molecule but rather a whole host of similarly-formatted ones (around 2k), so an automated parser would definitely help! — Ryan Park, Apr 19 '22 at 13:02
I assumed as much, I meant it is a rather small compound and there are hordes of catalogues. However, I had a look at your source and it has a name field. Why not just search the name with PubChem PUG API to get the SMILES? PS. If you do find a Perl module in CPAN that does do the conversion or a similar converter do post the answer! — Matteo Ferla, Apr 20 '22 at 10:04
I couldn't find a conversion module, but the name field trick worked - thanks! Am posting more details as an answer. — Ryan Park, Apr 23 '22 at 16:17

Ryan Park · Accepted Answer · 2022-04-27T13:21:35.187

I couldn't find a way to directly convert these SMILES strings to a more canonical format. However, thanks to @Matteo Ferla, I was able to get around the problem entirely by looking up the chemical names and resolving the canonical SMILES programmatically.

More specifically, I used the CAS number column given in the supplementary information of the referenced paper to pull the SMILES via the Chemical Identifier Resolver. See this answer for an explanation of how to do so.
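
The lookup can be sketched roughly as follows. This is a minimal illustration, not the exact script I ran: the function names (`is_valid_cas`, `cir_url`, `resolve_smiles`) are made up for this example, and the URL pattern is my understanding of the resolver's public `structure/<identifier>/smiles` scheme, so check it against the service's current documentation. The check-digit test is optional, but it filters out mistyped CAS numbers before any request is sent.

```python
import re
import urllib.parse
import urllib.request

# Assumed base URL of the Chemical Identifier Resolver (CIR).
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights run 1, 2, 3, ... starting from the digit next to the check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def cir_url(identifier: str, representation: str = "smiles") -> str:
    """Build the request URL for one identifier (hypothetical helper)."""
    quoted = urllib.parse.quote(identifier.strip())
    return f"{CIR_BASE}/{quoted}/{representation}"


def resolve_smiles(cas: str, timeout: float = 10.0):
    """Return the SMILES text the resolver sends back, or None on any failure."""
    if not is_valid_cas(cas):
        return None
    try:
        with urllib.request.urlopen(cir_url(cas), timeout=timeout) as response:
            return response.read().decode("utf-8").strip() or None
    except OSError:
        # URLError, HTTPError (e.g. unknown identifier) and timeouts
        # are all OSError subclasses.
        return None
```

Calling `resolve_smiles("90-11-9")` (the CAS number of 1-bromonaphthalene) inside a loop over the CAS column, with a short pause between requests, is enough for a dataset of about 2k entries. Entries that come back as `None` are the ones that need manual handling.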

EDIT: Some of the CAS numbers couldn't be resolved, so I converted the corresponding SMILES strings manually using the PubChem sketcher tool mentioned in the question. There are quite a few such molecules (~160), so if anyone needs the full parsed dataset but doesn't want to spend the time converting those molecules by hand, let me know.
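
For anyone who would rather attempt the direct conversion suggested in the comments, here is a rough sketch of what such a rewrite might look like. It is based only on the single example string in the question, so every rule in it is an assumption: `[n]` is taken to label the preceding atom as a ring-opening point, `@n` to close a ring back to that atom, `:` to be an aromatic bond, and `CH`, `CH2`, etc. to be explicit hydrogen counts. The function name `legacy_to_smiles` is made up for this illustration; I have not checked it against the rest of the dataset.

```python
import re

_TOKEN = re.compile(
    r"(?P<atom>Br|Cl|[BCNOPSFI])(?P<h>H\d*)?"  # element, optional explicit H count
    r"|\[(?P<label>\d+)\]"                      # [n]: labels the preceding atom
    r"|@(?P<close>\d+)"                         # @n: ring closure back to atom n
    r"|(?P<bond>[:=#-])"
    r"|(?P<paren>[()])"
)


def _ring_digit(n: int) -> str:
    # Modern SMILES writes two-digit ring numbers as %nn.
    return str(n) if n < 10 else f"%{n}"


def _render(atom) -> str:
    symbol, has_h, aromatic = atom
    if not aromatic:
        return symbol  # hydrogens are left implicit
    if symbol == "N" and has_h:
        return "[nH]"  # pyrrole-type nitrogen needs its hydrogen spelled out
    return symbol.lower()


def legacy_to_smiles(text: str) -> str:
    atoms = []    # [symbol, has_explicit_h, aromatic] per atom
    out = []      # int = index into atoms, str = literal output text
    labels = {}   # ring label -> index of the labelled atom
    stack = []    # atoms to return to when a branch closes
    prev = None   # atom the next bond starts from
    pending = None  # bond symbol waiting for its second atom
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = m.end()
        if m.group("atom"):
            atoms.append([m.group("atom"), bool(m.group("h")), False])
            idx = len(atoms) - 1
            if pending == ":" and prev is not None:
                atoms[prev][2] = atoms[idx][2] = True  # both ends are aromatic
            elif pending:
                out.append(pending)
            out.append(idx)
            prev, pending = idx, None
        elif m.group("label"):
            n = int(m.group("label"))
            labels[n] = prev
            out.append(_ring_digit(n))
        elif m.group("close"):
            n = int(m.group("close"))
            if pending == ":":
                atoms[prev][2] = atoms[labels[n]][2] = True
            elif pending:
                out.append(pending)
            out.append(_ring_digit(n))
            pending = None
        elif m.group("bond"):
            pending = m.group("bond")
        elif m.group("paren") == "(":
            stack.append(prev)
            out.append("(")
        else:
            prev = stack.pop()
            out.append(")")
    return "".join(_render(atoms[p]) if isinstance(p, int) else p for p in out)


print(legacy_to_smiles("BrC[2]:C[3]:C(:CH:CH:CH:@2):CH:CH:CH:CH:@3"))
# prints Brc2c3c(ccc2)cccc3
```

The result keeps the input atom order and matches the first form suggested in the comments apart from ring numbering. The sketch ignores charges, isotopes, stereochemistry and anything else the example does not show, so each converted string should still be passed through a standard parser such as RDKit before being trusted.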

