TY  - JOUR
AU  - Rajagopal, Balaji Ganesh
AU  - C, Amarnath
AU  - Rajan, Chidambaram Sawri
AU  - Ramalingam, Deebalakshmi
PY  - 2026
TI  - Enhanced RoBERTa Model for OCR-Based EHR Parsing and Information Extraction
JF  - Journal of Computer Science
VL  - 22
IS  - 4
DO  - 10.3844/jcssp.2026.1434.1447
UR  - https://thescipub.com/abstract/jcssp.2026.1434.1447
AB  - Electronic Health Records (EHRs) are a rich source of data for medical research and health analytics in general. However, because of their complex layout and, especially, their unstructured nature, extracting important information from these digital documents is not easy. This paper proposes a new approach to interpreting EHRs obtained by Optical Character Recognition (OCR) that uses a refined RoBERTa foundation architecture. Our method efficiently extracts key elements, such as section headings and bold words, which most often carry clinical significance. Using RoBERTa for semantic understanding goes beyond straightforward text recognition. In the tests we performed, the model reaches an accuracy of 89.2%. The paper also presents an exhaustive benchmarking of the pros and cons of the deep learning techniques currently used for parsing EHRs. Our model specifically addresses the problem of accurately extracting bold section headings from unstructured EHR data. The system uses a two-phase approach that combines natural language and image processing techniques. Thinning and normalization operations are performed first to separate bold text based on pixel intensity above a preset threshold. By removing unneeded text from the paragraphs, our method significantly improves the accuracy of bold word extraction, reaching 98%.