Veri Madenciligi – Data Mining SOLUTION HW1 Veri Onisleme
Data Preprocessing (Veri Onisleme)
Dr. Cengiz Orencik
October 11, 2015

1. Suppose that the data for analysis includes the attribute age. The age values for the data are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What are the mean and median of the data?

Number of items: N = 27.

$\text{mean} = \frac{13 + 15 + 16 + 16 + 19 + 20 + 20 + \dots + 35 + 36 + 40 + 45 + 46 + 52 + 70}{27} = \frac{809}{27} \approx 29.96$

As there are 27 data items, the median is the 14th one, which is 25.

(b) What is the mode of the data? Comment on its modality (i.e., unimodal, trimodal, etc.).

25 and 35 each occur 4 times and every other value occurs at most twice, so both 25 and 35 are modes. Having two different modes means the data is bimodal (or, more generally, multimodal).

(c) What is the midrange of the data?

$\text{midrange} = \text{avg}(\min, \max) = \frac{13 + 70}{2} = 41.5$

(d) Find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data.
As we have 27 elements, we may roughly take the first quartile as the 7th smallest element and the third quartile as the 7th largest (21st smallest) element. The 6th or 8th elements are also fine, or you may take the average of the 6th and 7th. So the first quartile (Q1) is 20 and the third quartile (Q3) is 35.

(e) Does the data contain any outlier values? Explain.

First calculate 1.5 IQR, which is 1.5 × (35 − 20) = 22.5.
Lower bound: Q1 − 1.5 IQR = 20 − 22.5 = −2.5
Upper bound: Q3 + 1.5 IQR = 35 + 22.5 = 57.5
Any value smaller than −2.5 or larger than 57.5 is an outlier, so 70 is the only outlier value in the data.

(f) Show a boxplot of the data.

Five-value representation: min (bounded below by Q1 − 1.5 IQR), max (bounded above by Q3 + 1.5 IQR), Q1, median, Q3.

[Figure: boxplot of age on an axis from 10 to 60, with whisker ends labeled 13 and 57.5.]

(g) What is the standard deviation (σ) of the data?

$\sigma^2 = \frac{\sum_i (x_i - 29.96)^2}{27} = 161.3$

$\sigma = \sqrt{161.3} = 12.7$

2. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following results:

age   % fat     age   % fat
23     9.5      52    34.6
23    26.5      54    42.5
27     7.8      54    28.8
27    17.8      56    33.4
39    31.4      57    30.2
41    25.9      58    34.1
47    27.4      58    32.9
49    27.2      60    41.2
50    31.2      61    35.7

(a) Draw the boxplots for age and % fat (separately).
Age is already sorted; the fat values need to be sorted:

sorted fat = {7.8, 9.5, 17.8, 25.9, 26.5, 27.2, 27.4, 28.8, 30.2, 31.2, 31.4, 32.9, 33.4, 34.1, 34.6, 35.7, 41.2, 42.5}

median_age = avg(50, 52) = 51
median_fat = avg(30.2, 31.2) = 30.7

The quartiles are taken (roughly) as the 5th element from each end:

Q1_age = 39, Q3_age = 57
Q1_fat = 26.5, Q3_fat = 34.1

There are no outliers for age, but there are outliers for fat rate: 1.5 IQR = (34.1 − 26.5) × 1.5 = 11.4, so the lower bound is 26.5 − 11.4 = 15.1. Anything lower is an outlier, which flags 7.8 and 9.5.

[Figure: boxplot of AGE on an axis from 20 to 60, with whisker ends labeled 23 and 61.]
[Figure: boxplot of FAT RATE on an axis from 15 to 45, with whisker ends labeled 15.1 and 42.5.]

(b) Calculate the correlation coefficient r. Are these two attributes positively or negatively correlated?

$r_{age,fat} = \frac{\sum_i (age_i - mean_{age})(fat_i - mean_{fat})}{N \, \sigma_{age} \, \sigma_{fat}}$

mean_age = 46.44, mean_fat = 28.78
σ_age = 12.85, σ_fat = 8.99
N = 18

$r_{age,fat} = \frac{1700.33}{18 \times 12.85 \times 8.99} = 0.818$

As 0.818 is positive and close to 1, we can claim that age and fat rate are strongly positively correlated.
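The hand calculations in both problems can be cross-checked with a short script. The following is a minimal sketch using only the Python standard library, not part of the original solution. The helper names (`rough_quartiles`, `iqr_outliers`, `pearson_r`) are hypothetical, and the rule k = ceil(n / 4) is an assumption chosen to reproduce the rough quartile picks used in the solutions (7th element for n = 27, 5th element for n = 18). Standard deviations are population values (division by N), as in the solutions.

```python
from math import ceil
from statistics import mean, median, multimode, pstdev

# Problem 1 ages and Problem 2 (age, % fat) pairs, copied from the problem statements.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
age2 = [23, 23, 27, 27, 39, 41, 47, 49, 50,
        52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def rough_quartiles(values):
    """Return (Q1, Q3) as the k-th smallest and k-th largest value.

    k = ceil(n / 4) is an assumption that matches the rough rule in the
    solutions; library quantile functions may interpolate differently.
    """
    s = sorted(values)
    k = ceil(len(s) / 4)
    return s[k - 1], s[-k]


def iqr_outliers(values):
    """Return (lower fence, upper fence, outliers) using the 1.5 * IQR rule."""
    q1, q3 = rough_quartiles(values)
    step = 1.5 * (q3 - q1)
    lower, upper = q1 - step, q3 + step
    return lower, upper, [v for v in sorted(values) if v < lower or v > upper]


def pearson_r(xs, ys):
    """Correlation coefficient with population standard deviations."""
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cross / (len(xs) * pstdev(xs) * pstdev(ys))


if __name__ == "__main__":
    print("mean, median:", round(mean(ages), 2), median(ages))
    print("modes:", multimode(ages))
    print("midrange:", (min(ages) + max(ages)) / 2)
    print("Q1, Q3:", rough_quartiles(ages))
    print("fences, outliers:", iqr_outliers(ages))
    print("sigma:", round(pstdev(ages), 1))
    print("fat fences, outliers:", iqr_outliers(fat))
    print("r(age, fat):", round(pearson_r(age2, fat), 3))
```

Library routines such as `statistics.quantiles` use interpolation and can give slightly different quartiles than the rough positional rule, which is why the sketch picks elements by position instead.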