Reading Between the Lines: Incorporating Text Mining and Machine Learning in Financial Fraud Detection

Authors

  • Agung Septia Wibowo Universitas Gadjah Mada
  • Iis Istianah Universitas Gadjah Mada

DOI:

https://doi.org/10.21532/apfjournal.v10i1.382

Keywords:

Financial Statement Fraud, Text Mining, Machine Learning (ML), Natural Language Processing (NLP), Financial Ratios

Abstract

Despite rigorous oversight of the Indonesian capital market, manipulation of financial reports continues to occur. This study examines whether machine learning (ML) models that combine linguistic features with financial ratios can effectively detect deception or manipulation. Using a sample of publicly listed Indonesian companies, the research validates the predictive capability of the Beneish M-Score, confirms the prevalence of negative language in fraudulent reports, and demonstrates the superiority of the Gradient Boosting ML model in identifying anomalies within financial and textual data. The study is distinctive in adapting the approach to Indonesian-language annual reports, thereby addressing a gap in the linguistic-based fraud detection literature. These findings advance our understanding of how linguistic features and financial ratios can serve as practical tools for fraud detection, and they equip the academic and professional communities working in this domain.
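For readers unfamiliar with the Beneish M-Score mentioned above, the following is a minimal sketch of the eight-variable score as it is commonly reported from Beneish (1999). The class and function names are hypothetical, and the flagging threshold of -1.78 is an assumption taken from common practice; the abstract does not state which specification or cutoff the study itself uses.

```python
from dataclasses import dataclass


@dataclass
class BeneishInputs:
    """Year-over-year indices of the eight-variable model (names are illustrative)."""
    dsri: float  # days' sales in receivables index
    gmi: float   # gross margin index
    aqi: float   # asset quality index
    sgi: float   # sales growth index
    depi: float  # depreciation index
    sgai: float  # SG&A expense index
    tata: float  # total accruals to total assets
    lvgi: float  # leverage index


def beneish_m_score(x: BeneishInputs) -> float:
    """Linear combination with the commonly reported Beneish (1999) coefficients."""
    return (
        -4.84
        + 0.920 * x.dsri
        + 0.528 * x.gmi
        + 0.404 * x.aqi
        + 0.892 * x.sgi
        + 0.115 * x.depi
        - 0.172 * x.sgai
        + 4.679 * x.tata
        - 0.327 * x.lvgi
    )


def is_flagged(score: float, threshold: float = -1.78) -> bool:
    """Flag a firm-year when the score exceeds the cutoff (the cutoff is an assumption)."""
    return score > threshold


# A firm with no year-over-year change (all indices 1.0) and zero accruals.
neutral = BeneishInputs(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0)
neutral_score = beneish_m_score(neutral)
```

In a pipeline of the kind the abstract describes, a score like this could be supplied to a classifier as one financial feature alongside textual features, rather than being used as a standalone decision rule.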

References

Aghghaleh, S. F., Mohamed, Z. M., & Rahmat, M. M. (2016). Detecting Financial Statement Frauds in Malaysia: Comparing the Abilities of Beneish and Dechow Models. Asian Journal of Accounting and Governance, 7, 57–65. https://doi.org/10.17576/ajag-2016-07-05.

Aronoff, S. (1982). Classification Accuracy: A User Approach. Photogrammetric Engineering and Remote Sensing, 48(8), 1299–1307.

Ashtiani, M., & Raahemi, B. (2022). Intelligent Fraud Detection in Financial Statements Using Machine Learning and Data Mining: A Systematic Literature Review. IEEE Access, 10(6), 72504–72525. https://doi.org/10.1109/ACCESS.2021.3096799.

Association of Certified Fraud Examiners (ACFE). (2024). Occupational Fraud 2024: A Report to the Nations. Association of Certified Fraud Examiners (ACFE).

Beneish, M. (1999). The Detection of Earnings Manipulation. Financial Analysts Journal, 55(5), 24–36. https://doi.org/10.2469/faj.v55.n5.2296.

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324.

Chan, S., & Chong, M. (2016). Sentiment Analysis in Financial Texts. Decision Support Systems, 94(2017), 53-64. https://doi.org/10.1016/j.dss.2016.10.006.

Chicco, D., & Jurman, G. (2023). The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Mining, 16(4), 1–23. https://doi.org/10.1186/s13040-023-00322-4.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. https://doi.org/10.1109/TIT.1967.1053964.

Craja, P., Kim, A., & Lessmann, S. (2020). Deep learning for detecting financial statement fraud. Decision Support Systems, 139(1), 113421. https://doi.org/10.1016/j.dss.2020.113421.

Dechow, P., Ge, W., Larson, C., & Sloan, R. (2010). Predicting Material Accounting Misstatements. Contemporary Accounting Research, 28(1), 17–82. https://doi.org/10.1111/j.1911-3846.2010.01041.x.

Dong, W., Liao, S., & Liang, L. (2016). Financial Statement Fraud Detection Using Text Mining: A Systemic Functional Linguistics Theory Perspective. Proceedings of the 20th Pacific Asia Conference on Information Systems (PACIS 2016).

Faccia, A., McDonald, J., & George, B. (2024). NLP Sentiment Analysis and Accounting Transparency: A New Era of Financial Record Keeping. Computers, 13(5), 1-18. https://doi.org/10.3390/computers13010005.

Freund, Y., & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), 119–139. https://doi.org/10.1006/jcss.1997.1504.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451.

Fuller, C., Biros, D., & Delen, D. (2011). An investigation of data and text mining methods for real world deception detection. Expert Systems with Applications, 38(7), 8392–8398. https://doi.org/10.1016/j.eswa.2011.01.032.

Gotelaere, S., & Paoli, L. (2022). Prevention and Control of Financial Fraud: a Scoping Review. European Journal on Criminal Policy and Research, 31(1), 1–21. https://doi.org/10.1007/s10610-022-09532-8.

Hajek, P., & Henriques, R. (2017). Mining Corporate Annual Reports For Intelligent Detection of Financial Statement Fraud – A comparative study of machine learning methods. Knowledge-Based Systems, 128(C), 139–152. https://doi.org/10.1016/j.knosys.2017.05.001.

Halliday, M. A. K., & Matthiessen, C. M. I. M. (2014). An Introduction to Functional Grammar (Third Edition). Hodder Arnold.

Healy, P. M., & Wahlen, J. M. (1999). A Review of the Earnings Management Literature and Its Implications for Standard Setting. Accounting Horizons, 13(4), 365–383. https://doi.org/10.2308/acch.1999.13.4.365.

Hicks, S., Strumke, I., Thambawita, V., Hammou, M., Riegler, M., Halvorsen, P., & Parasa, S. (2022). On evaluation metrics for medical applications of artificial intelligence. Scientific Reports, 12(1), 1-9. https://doi.org/10.1038/s41598-022-09954-8.

Hogan, C., Rezaee, Z., Riley, J., & Velury, U. (2008). Financial Statement Fraud: Insights from the Academic Literature. Auditing, 27(2), 231–252. https://doi.org/10.2308/aud.2008.27.2.231.

Jaballi, S., Zrigui, S., Nicolas, H., & Zrigui, M. (2024). Analyzing Multilingual Conversations During COVID-19: An Imbalanced Class-Ensemble Learning Approach with Reweighted AdaBoost-SVM for Code-Switched Text Classification. https://doi.org/10.21203/rs.3.rs-3978507/v1

Kanapickienė, R., & Grundienė, Ž. (2015). The Model of Fraud Detection in Financial Statements by Means of Financial Ratios. Procedia - Social and Behavioral Sciences, 213(2015), 321–327. https://doi.org/10.1016/j.sbspro.2015.11.545.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539.

Lobo, J. M., Jiménez-Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17(2), 145–151. https://doi.org/10.1111/j.1466-8238.2007.00358.x.

Lokanan, M., & Sharma, S. (2024). The use of machine learning algorithms to predict financial statement fraud. The British Accounting Review, 56(6), 101441. https://doi.org/10.1016/j.bar.2024.101441.

Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x.

Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing A Word–Emotion Association Lexicon. Computational Intelligence, 29(3), 436–465. https://doi.org/10.1111/j.1467-8640.2012.00460.x.

Perols, J. (2010). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Auditing: A Journal of Practice & Theory, 30(2), 19–50. https://doi.org/10.2308/ajpt-50009.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust You?” Explaining the predictions of any classifier. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-August-2016, 1135–1144. https://doi.org/10.1145/2939672.2939778

Saquete, E., Tomás, D., Moreda, P., Martínez-Barco, P., & Palomar, M. (2020). Fighting post-truth using natural language processing: A review and open challenges. Expert Systems with Applications, 141(7), 112943. https://doi.org/10.1016/j.eswa.2019.112943.

Soltani, M., Kythreotis, A., & Roshanpoor, A. (2023). Two decades of financial statement fraud detection literature review; combination of bibliometric analysis and topic modeling approach. Journal of Financial Crime, 30(5), 1367-1388. https://doi.org/10.1108/JFC-09-2022-0227.

Throckmorton, C., Mayew, W., Venkatachalam, M., & Collins, L. (2015). Financial Fraud Detection Using Vocal, Linguistic and Financial Cues. Decision Support Systems, 74(2015), 78-87. https://doi.org/10.1016/j.dss.2015.04.006.

Zhou, W., & Kapoor, G. (2011). Detecting Evolutionary Financial Statement Fraud. Decision Support Systems, 50(3), 570–575. https://doi.org/10.1016/j.dss.2010.08.007.

Published

2025-07-17

How to Cite

Septia Wibowo, A., & Istianah, I. (2025). Reading Between the Lines: Incorporating Text Mining and Machine Learning in Financial Fraud Detection. Asia Pacific Fraud Journal, 10(1), 73–93. https://doi.org/10.21532/apfjournal.v10i1.382