TY  - JOUR
AU  - Rajagopal, Balaji Ganesh
AU  - C, Amarnath
AU  - Rajan, Chidambaram Sawri
AU  - Ramalingam, Deebalakshmi
PY  - 2026
TI  - Enhanced RoBERTa Model for OCR-Based EHR Parsing and Information Extraction
JF  - Journal of Computer Science
VL  - 22
IS  - 4
DO  - 10.3844/jcssp.2026.1434.1447
UR  - https://thescipub.com/abstract/jcssp.2026.1434.1447
AB  - Electronic Health Records (EHRs) are a valuable source of data for medical research and health analytics. However, their complex layouts and largely unstructured content make it difficult to extract important information from digitized documents. This paper proposes a new approach to interpreting EHRs obtained through Optical Character Recognition (OCR), built on a refined RoBERTa architecture. The method efficiently extracts key elements such as section headings and bold words, which often carry substantial clinical significance. RoBERTa is used for semantic understanding, going beyond straightforward text recognition. In our tests, the model achieved an accuracy of 89.2%. The paper also presents a comprehensive benchmark of the strengths and weaknesses of the deep learning techniques currently used for parsing EHRs. Our model specifically addresses the problem of accurately extracting bold section headings from unstructured EHR data. The system uses a two-phase approach that combines natural language processing and image processing techniques. First, thinning and normalization operations are applied to separate bold text based on pixel intensity above a preset threshold. By removing unneeded text from the paragraphs, the method significantly improves the accuracy of bold word extraction, which reaches 98%.