12-03, 22:00–22:30 (UTC), General Track
To identify a production-ready, open-source OCR model capable of handling sensitive, non-English content with highly technical language, we evaluated available open-source OCR models on accuracy, memory efficiency, and processing speed. This presentation will share our findings and the key insights gained from this research.
Medical data often comes in various formats, such as images and PDF scans. Using an OCR model to convert these documents into machine-readable text is a crucial first step in healthcare data analysis. However, due to the sensitive nature of this data, sending documents to be processed by a commercial OCR endpoint hosted on a third-party server is often prohibited by local privacy regulations.
We tested numerous available models, and this talk focuses on three open-source options: Tesseract, EasyOCR, and PaddleOCR. The talk will take a deep dive into the models' architectures and compare their precision, memory efficiency, and processing speed when analyzing Japanese medical data. The presentation will also introduce the metrics developed to benchmark the models, discuss the types of documents where each model excels or falls short, and share strategies you can use to improve the results.
Note: All three models support a multitude of languages, so the insights from this talk can be applied to language requirements beyond Japanese.
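As an illustration of the kind of accuracy metric such a benchmark might use, here is a minimal sketch of character error rate (CER), a common OCR measure based on edit distance. This is an assumption for illustration only; the abstract does not say which metrics the talk introduces, and the function names below are hypothetical. A character-level metric suits Japanese because the text has no whitespace word boundaries.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("診断結果", "診断結呆")` counts one substituted character out of four in the ground-truth text. Lower is better, and a value above 1.0 is possible when the OCR output is much longer than the reference.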
No previous knowledge expected
Bing Wang is a Software Engineer at Flatiron, a healthcare company specializing in building large-scale databases for oncology research. She is passionate about developing data pipelines to enhance cancer care and is constantly exploring ways to further automate these processes.
Wang holds an M.S. in Computer Science from the University of Chicago, an M.Ed. in Developmental Psychology from Harvard, and a B.A. in Linguistics and Education from ECNU. Her interdisciplinary educational background blends social sciences and technology, fueling her interest in applying machine learning techniques to the analysis of human language.