Comparison of Data Mining and Statistical Techniques for Prediction Model

امجد عبد المنعم محمود حرب; AMJAD A. M. HARB

Files

Comparison of Data Mining and Statistical Techniques for Prediction Model.pdf (1.6 MB)

Date

2012-05-10

Authors

امجد عبد المنعم محمود حرب

AMJAD A. M. HARB

Publisher

Al-Quds University
جامعة القدس

Abstract

The aim of this research is to compare the efficiency and effectiveness of statistical and data mining techniques for building classification and prediction models. The techniques studied are statistical Logistic Regression and two data mining techniques, Decision Tree and Neural Network. Their performance was measured and compared using a common measure, the overall prediction accuracy percentage of each technique. The models were trained on eight training dataset samples drawn using two different statistical sampling techniques. The effect of the distribution of the dependent variable values in the training dataset was also examined, both on the overall prediction accuracy and on the prediction accuracy for the individual values “0” and “1” of the dependent variable. For the given dataset, the results show that the three techniques performed comparably in general, with the Neural Network slightly outperforming the others. One factor that causes the prediction accuracy to vary is the distribution of the dependent variable values, “0” and “1”, in the training dataset. For all three techniques, the overall prediction accuracy was high when the ratio of the dependent variable values in the training data was greater than 1:1, but at the same time the techniques failed to predict the individual values of the dependent variable successfully or at an acceptable accuracy. If the individual dependent variable values need to be predicted with high and comparable accuracy, the ratio of the dependent variable values in the training data should be exactly 1:1.
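The gap the abstract describes between overall accuracy and the accuracy on the individual values “0” and “1” can be illustrated with a small arithmetic sketch. Everything below is an assumption made for illustration: the function name `accuracy_report`, the 90:10 label split, and the always-predict-“0” classifier are hypothetical and are not taken from the thesis, its data, or its models.

```python
from collections import Counter


def accuracy_report(y_true, y_pred):
    """Return overall accuracy and per-class accuracy for labels 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Number of cases of each true label.
    totals = Counter(y_true)
    # Number of correctly predicted cases, keyed by true label.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    overall = sum(correct.values()) / len(y_true)
    per_class = {
        label: correct[label] / totals[label]
        for label in (0, 1)
        if totals[label]
    }
    return overall, per_class


# Hypothetical labels (not data from the thesis): 90 zeros and 10 ones.
y_true = [0] * 90 + [1] * 10
# Hypothetical degenerate classifier that always predicts the majority value 0.
y_pred = [0] * 100

overall, per_class = accuracy_report(y_true, y_pred)
print(f"overall={overall:.2f}, class0={per_class[0]:.2f}, class1={per_class[1]:.2f}")
```

With these made-up labels the overall accuracy is 0.90, yet the accuracy on “1” is 0.00, which is the kind of outcome a single overall percentage can hide when the dependent variable values are unevenly distributed.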

Keywords

Computer Science

Citation

HARB, AMJAD Abed Almenem. (2012). Comparison of Data Mining and Statistical Techniques for Prediction Model [A published thesis, Al-Quds University, Palestine]. Al-Quds University digital repository. https://arab-scholars.com/2ea6c5

URI

https://dspace.alquds.edu/handle/20.500.12213/1465

Collections

Computer Science
