iris 데이터 설명

본 포스트에서는 XAI 중 Model-Agnostic Method를 적용하기 위한 sklearn의 iris 데이터를 살펴보겠습니다.

Iris 데이터

Iris 데이터는 Sklearn 라이브러리에서 제공해주는 데이터 중 하나로 붓꽃의 품종 분류 문제입니다. Setosa, Versicolor, Virginica 총 3개로 분류하는 다중 분류 문제이며 instance는 총 150개, feature는 꽃받침의 길이와 폭, 꽃잎의 길이와 폭 총 4개의 수치 값으로 가지고 있습니다. 비교적 feature 수가 적기 때문에 Iris 데이터를 사용하게 되었습니다.

Iris 데이터 분석 코드

분석 코드는 “Introduction to Machine Learning with Python” 책을 참고하여 작성하였습니다.

Introduction_to_Machine_Learning_with_Python

iris 데이터 가져오기

 from sklearn.datasets import load_iris
 iris_dataset = load_iris()

iris 데이터를 가져오기위해 sklearn의 load_iris 함수를 import 해줍니다. 그리고 load_iris 함수를 이용하여 iris_dataset을 불러옵니다.

iris_dataset의 key 알아보기

 print(f"keys: {iris_dataset.keys()}")

keys: dict_keys([‘data’, ‘target’, ‘target_names’, ‘DESCR’, ‘feature_names’, ‘filename’])

iris_dataset은 data, target, target_names, DESCR, feature_names, filname 의 키를 가지는 딕셔너리입니다.

target 클래스 이름 알아보기

 print(f"target names: {iris_dataset['target_names']}")

target names: [‘setosa’ ‘versicolor’ ‘virginica’]

setosa, bersicolor, virginica 총 3개의 클래스로 이루어져 있습니다.

feature 이름 알아보기

 print(f"feature names: {iris_dataset['feature_names']}")

feature names: [‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’, ‘petal width (cm)’]

꽃받침의 길이, 폭, 꽃잎의 길이, 폭 총 4개의 feature로 이루어져 있습니다.

각 feature 범위 확인하기

 print("range of features")
 print(f"{iris_dataset['feature_names'][0]} : {min(iris_dataset['data'][:, 0])} ~ {max(iris_dataset['data'][:, 0])}")
 print(f"{iris_dataset['feature_names'][1]} : {min(iris_dataset['data'][:, 1])} ~ {max(iris_dataset['data'][:, 1])}")
 print(f"{iris_dataset['feature_names'][2]} : {min(iris_dataset['data'][:, 2])} ~ {max(iris_dataset['data'][:, 2])}")
 print(f"{iris_dataset['feature_names'][3]} : {min(iris_dataset['data'][:, 3])} ~ {max(iris_dataset['data'][:, 3])}")      

range of features
sepal length (cm) : 4.3 ~ 7.9
sepal width (cm) : 2.0 ~ 4.4
petal length (cm) : 1.0 ~ 6.9
petal width (cm) : 0.1 ~ 2.5

보유중인 데이터들의 feature 범위를 알 수 있습니다. 꽃받침의 길이는 4.3 ~ 7.9cm, 꽃받침의 너비는 2.0 ~ 4.4cm, 꽃잎의 길이는 1.0 ~ 6.9cm, 꽃잎의 너비는 0.1 ~ 2.5cm 의 범위를 가지는 것을 알 수 있습니다.

data sample 확인하기

 print(f"data samples: {iris_dataset['data'][:5]}")

data samples: [[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]

각 feature는 수치형 데이터입니다.

target 확인하기

 print(f"target: {iris_dataset['target']}")

target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

타겟은 0, 1, 2 이렇게 3가지 클래스로 이루어져 있습니다. 0은 setosa, 1은 versicolor, 2는 virginica를 의미합니다.

instance 갯수 확인하기

 print(f"data shape: {iris_dataset['data'].shape}")
 print(f"target shape: {iris_dataset['target'].shape}")

data shape: (150, 4)
target shape: (150,)

feature와 target의 정보를 담고있는 array의 shape을 확인함으로 써 instance의 수가 150개라는 것을 알 수 있습니다.