USPTO Chemical Compounds Extraction Proof-of-concept

Chemical compounds extracted with RDKit from the .CDX attachments of USPTO patents published since 2001.

Browse index (contents.json)

What's in here

Each week, the USPTO uploads two tar files: one for patent grants (Tuesday) and one for patent applications (Thursday). For each file, patents are first filtered by IPC/CPC codes relevant to the pharmaceutical sector. Patent metadata is then extracted, every .CDX attachment is located, and its compounds are extracted alongside the metadata. Compounds are parsed with RDKit and saved as CXSMILES. Each USPTO tar file produces a corresponding JSON file with the extracted data.
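The classification filter described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the prefixes in `FOCUS_PREFIXES`, the `classifications` field name, and the example records are all assumptions made for the sketch. The classes actually used are the ones listed under "Focus IPC/CPC classes".

```python
# Hypothetical prefixes, chosen for illustration only; the real list is
# the one given under "Focus IPC/CPC classes".
FOCUS_PREFIXES = ("A61K", "A61P", "C07D")


def is_focus_patent(codes, prefixes=FOCUS_PREFIXES):
    """Return True if any IPC/CPC code starts with one of the prefixes."""
    # Normalize spacing and case so "a61k 31/00" and "A61K31/00" match alike.
    normalized = (code.replace(" ", "").upper() for code in codes)
    return any(code.startswith(prefixes) for code in normalized)


def filter_patents(patents, prefixes=FOCUS_PREFIXES):
    """Keep only the patents that carry at least one focus classification."""
    return [
        patent
        for patent in patents
        if is_focus_patent(patent.get("classifications", []), prefixes)
    ]


# Made-up records; the "classifications" field name is an assumption.
patents = [
    {"id": "EXAMPLE-1", "classifications": ["A61K 31/00", "C07D 401/04"]},
    {"id": "EXAMPLE-2", "classifications": ["H04L 9/32"]},
]

selected = filter_patents(patents)
print([patent["id"] for patent in selected])
```

Matching on a prefix rather than the full code keeps the filter at the subclass level, so every group below a focus subclass is included without having to be enumerated.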

Focus IPC/CPC classes