
What methods are there for removing outlier data?

轉載 2021-12-08 09:29:09 6833

The four methods for removing outlier data are: 1. isolation forest; 2. DBSCAN; 3. OneClassSVM; 4. Local Outlier Factor (LOF), which computes a numeric score that reflects how anomalous a sample is.


Outlier detection methods

1. Isolation forest

1.1 Test sample example

File: test.pkl


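1.2 Isolation forest demo

How isolation forest works

Isolation forest builds a forest of random trees by repeatedly splitting the data on a randomly chosen feature at a random value. Points that can be separated from the rest after only a few splits are treated as outliers.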
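To make the idea concrete, the sketch below counts how many random splits it takes to isolate a point. It is a simplified illustration, not scikit-learn's implementation: the helper names `isolation_depth` and `mean_depth` and the toy data are assumptions, and each "tree" is grown on the full data set rather than on a subsample.

```python
import numpy as np


def isolation_depth(X, point, rng, max_depth=50):
    """Count random splits until `point` is alone in its partition."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])          # pick a random feature
        lo, hi = X[:, feature].min(), X[:, feature].max()
        if lo == hi:
            break                                   # simplification: stop if this feature cannot be split
        split = rng.uniform(lo, hi)                 # pick a random split value
        # Keep only the side of the split that contains `point`
        if point[feature] < split:
            X = X[X[:, feature] < split]
        else:
            X = X[X[:, feature] >= split]
        depth += 1
    return depth


rng = np.random.default_rng(0)
inliers = 0.3 * rng.standard_normal((200, 2))       # hypothetical dense cluster
outlier = np.array([5.0, 5.0])                      # hypothetical isolated point
X = np.vstack([inliers, outlier])


def mean_depth(point, n_trees=200):
    """Average isolation depth of `point` over many random trees."""
    return np.mean([isolation_depth(X, point, rng) for _ in range(n_trees)])


depth_outlier = mean_depth(outlier)
depth_inlier = mean_depth(inliers[0])
print("mean depth, outlier:", depth_outlier)
print("mean depth, inlier: ", depth_inlier)
```

The isolated point should need far fewer splits on average than a point inside the cluster, which is the signal isolation forest turns into an anomaly score.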

    # Reference: https://blog.csdn.net/ye1215172385/article/details/79762317
    # Official example: https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)

    # Build the training samples
    n_samples = 200           # total number of samples
    outliers_fraction = 0.25  # fraction of outliers
    n_inliers = int((1. - outliers_fraction) * n_samples)
    n_outliers = int(outliers_fraction * n_samples)

    X = 0.3 * rng.randn(n_inliers // 2, 2)
    X_train = np.r_[X + 2, X - 2]  # inliers
    # Inliers followed by outliers (drawn from rng so the run is reproducible)
    X_train = np.r_[X_train, rng.uniform(low=-6, high=6, size=(n_outliers, 2))]

    # Build and fit the model
    clf = IsolationForest(max_samples=n_samples, random_state=rng, contamination=outliers_fraction)
    clf.fit(X_train)
    # Compute the scores and set the threshold
    scores_pred = clf.decision_function(X_train)
    # Threshold taken from the outlier fraction of the training samples; used for plotting
    threshold = np.percentile(scores_pred, 100 * outliers_fraction)

    # Evaluate the decision function on a grid
    xx, yy = np.meshgrid(np.linspace(-7, 7, 50), np.linspace(-7, 7, 50))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.title("IsolationForest")
    # plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
    # Outlier region: values from the minimum up to the threshold
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)
    # Boundary between the outlier region and the inlier region
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # Inlier region: values from the threshold up to the maximum
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='palevioletred')

    b = plt.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1], c='white', s=20, edgecolor='k')
    c = plt.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1], c='black', s=20, edgecolor='k')
    plt.axis('tight')
    plt.xlim((-7, 7))
    plt.ylim((-7, 7))
    # legend_elements() replaces a.collections[0], which is deprecated since Matplotlib 3.8
    plt.legend([a.legend_elements()[0][0], b, c],
               ['learned decision function', 'true inliers', 'true outliers'],
               loc="upper left")
    plt.show()

1.3 Modified version: X_train can be replaced with your own data

The data is not standardized here. You can standardize it first and then remove the outliers from the standardized data, using from sklearn.preprocessing import StandardScaler.
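A minimal sketch of that workflow, assuming hypothetical data with two features on very different scales (the data, the planted outliers and the variable names are illustrative, not from the original). The detector is fitted on the standardized copy and the resulting mask is applied to the raw data, so the cleaned samples keep their original units.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)

# Hypothetical data: feature 0 has unit scale, feature 1 is about 1000 times larger
X_raw = np.c_[rng.normal(0, 1, 300), rng.normal(0, 1000, 300)]
# Plant five obvious outliers in the first rows
X_raw[:5] = [[8, 0], [-8, 0], [0, 9000], [0, -9000], [7, 7000]]

# Standardize, then detect outliers on the standardized data
X_scaled = StandardScaler().fit_transform(X_raw)
clf = IsolationForest(contamination=0.05, random_state=42)
is_inlier = clf.fit_predict(X_scaled) == 1   # 1 = inlier, -1 = outlier

# Apply the mask to the raw data so the result stays in the original units
X_clean = X_raw[is_inlier]
print(X_raw.shape, "->", X_clean.shape)
```

Isolation forest splits one feature at a time, so scaling matters less for it than for distance-based methods such as DBSCAN and LOF, where features on a larger scale would otherwise dominate the distances.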

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import IsolationForest
    from scipy import stats

    rng = np.random.RandomState(42)

    X_train = X_train_demo.values  # X_train_demo: your own data as a DataFrame (not defined in this snippet)
    outliers_fraction = 0.1
    n_samples = 500
    # Build and fit the model
    clf = IsolationForest(max_samples=n_samples, random_state=rng, contamination=outliers_fraction)
    clf.fit(X_train)
    # Compute the scores and set the threshold
    scores_pred = clf.decision_function(X_train)
    # Threshold taken from the outlier fraction of the training samples; used for plotting
    threshold = stats.scoreatpercentile(scores_pred, 100 * outliers_fraction)

    # Evaluate the decision function on a grid that extends 20% beyond the data range
    range_max_min0 = (X_train[:, 0].max() - X_train[:, 0].min()) * 0.2
    range_max_min1 = (X_train[:, 1].max() - X_train[:, 1].min()) * 0.2
    xx, yy = np.meshgrid(
        np.linspace(X_train[:, 0].min() - range_max_min0, X_train[:, 0].max() + range_max_min0, 500),
        np.linspace(X_train[:, 1].min() - range_max_min1, X_train[:, 1].max() + range_max_min1, 500))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.title("IsolationForest")
    # plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
    # Outlier region: values from the minimum up to the threshold
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)
    # Boundary between the outlier region and the inlier region
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # Inlier region: values from the threshold up to the maximum
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='palevioletred')

    is_in = clf.predict(X_train) > 0
    b = plt.scatter(X_train[is_in, 0], X_train[is_in, 1], c='white', s=20, edgecolor='k')
    c = plt.scatter(X_train[~is_in, 0], X_train[~is_in, 1], c='black', s=20, edgecolor='k')
    plt.axis('tight')
    plt.xlim((X_train[:, 0].min() - range_max_min0, X_train[:, 0].max() + range_max_min0))
    plt.ylim((X_train[:, 1].min() - range_max_min1, X_train[:, 1].max() + range_max_min1))
    # legend_elements() replaces a.collections[0], which is deprecated since Matplotlib 3.8
    plt.legend([a.legend_elements()[0][0], b, c],
               ['learned decision function', 'inliers', 'outliers'],
               loc="upper left")
    plt.show()

1.4 Core code

1.4.1 Example samples

    import numpy as np

    rng = np.random.RandomState(42)

    # Build the training samples
    n_samples = 200           # total number of samples
    outliers_fraction = 0.25  # fraction of outliers
    n_inliers = int((1. - outliers_fraction) * n_samples)
    n_outliers = int(outliers_fraction * n_samples)

    X = 0.3 * rng.randn(n_inliers // 2, 2)
    X_train = np.r_[X + 2, X - 2]  # inliers
    # Inliers followed by outliers
    X_train = np.r_[X_train, rng.uniform(low=-6, high=6, size=(n_outliers, 2))]

1.4.2 Core code implementation

clf = IsolationForest(max_samples=0.8, contamination=0.25)

    from sklearn.ensemble import IsolationForest

    # Fit the model.
    # max_samples: number of samples used to build each tree. An integer is used directly
    #   as the sample count; a float in (0, 1] is used as a fraction of the training set.
    # contamination: fraction of the samples that may be treated as outliers
    #   (0.25 here, the same value as outliers_fraction in 1.4.1).
    clf = IsolationForest(max_samples=0.8, contamination=0.25)
    clf.fit(X_train)
    # y_pred_train = clf.predict(X_train)
    scores_pred = clf.decision_function(X_train)
    # Threshold taken from the outlier fraction of the training samples; used for plotting
    threshold = np.percentile(scores_pred, 100 * outliers_fraction)

    # The two selections below should pick the same samples (ties at the threshold aside)
    X_train_predict1 = X_train[clf.predict(X_train) == 1]
    X_train_predict2 = X_train[scores_pred >= threshold, :]

    # 1 marks an inlier, -1 marks an outlier
    clf.predict(X_train)

Output (abbreviated; there is one label per training sample, with most of the leading inlier samples labeled 1 and most of the trailing outlier samples labeled -1):

    array([ 1,  1,  1, ..., -1, -1, -1])

2. DBSCAN

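How DBSCAN (Density-Based Spatial Clustering of Applications with Noise) works

Take each point as a center and fix a neighborhood radius and the number of points that the neighborhood must contain. If the neighborhood holds at least that many points, the point is a core point and belongs to the same cluster as the points in its neighborhood. If it holds fewer, but the point lies inside the neighborhood of a core point, it is a border point. A point that is neither is noise (label -1), and these noise points are the ones treated as outliers.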
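The three kinds of points can be seen on a tiny hand-made data set. This is a minimal sketch in which the six points and the values of `eps` and `min_samples` are chosen for illustration; scikit-learn counts the point itself toward `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data on a line (the second coordinate is always 0)
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0],  # dense run
    [0.5, 0.0],                                      # close to the end of the run
    [5.0, 0.0],                                      # far from everything
])

model = DBSCAN(eps=0.25, min_samples=3).fit(X)
labels = model.labels_                               # cluster id per point, -1 = noise

is_core = np.zeros(len(X), dtype=bool)
is_core[model.core_sample_indices_] = True

# core: enough neighbors; border: in a cluster but not core; noise: label -1
kinds = np.where(is_core, "core", np.where(labels >= 0, "border", "noise"))
for point, label, kind in zip(X, labels, kinds):
    print(point, label, kind)
```

The first four points each have at least three points (themselves included) within 0.25, so they are core points. The point at 0.5 has only two, but it lies within 0.25 of the core point at 0.3, so it is a border point of the same cluster. The point at 5.0 is noise.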

2.1 DBSCAN demo

    # Reference: https://blog.csdn.net/hb707934728/article/details/71515160
    # Official example: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.datasets as ds
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler


    def expand(a, b):
        """Widen the interval [a, b] by 10% on each side."""
        d = (b - a) * 0.1
        return a - d, b + d


    if __name__ == "__main__":
        N = 1000
        centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
        # make_blobs is commonly used to generate test data for clustering algorithms.
        # It generates several groups of points from the requested number of features,
        # number of centers, value range and so on.
        # Signature: sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3,
        #     cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)
        # n_samples: total number of samples to generate
        # n_features: number of features per sample
        # centers: number of clusters, or the list of cluster centers
        # cluster_std: standard deviation of each cluster; for two clusters where one is
        #     more spread out than the other, pass e.g. [1.0, 3.0]
        data, y = ds.make_blobs(N, n_features=2, centers=centers,
                                cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
        data = StandardScaler().fit_transform(data)
        # Parameter pairs: (epsilon, min_samples)
        params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))

        plt.figure(figsize=(12, 8), facecolor='w')
        plt.suptitle('DBSCAN clustering', fontsize=20)

        for i in range(6):
            eps, min_samples = params[i]
            # eps: radius of the circular neighborhood centered on a point P
            # min_samples: minimum number of points inside that neighborhood
            # P is a core point if its eps-neighborhood holds at least min_samples points
            model = DBSCAN(eps=eps, min_samples=min_samples)
            model.fit(data)
            y_hat = model.labels_

            core_indices = np.zeros_like(y_hat, dtype=bool)  # boolean mask with the same shape as y_hat
            core_indices[model.core_sample_indices_] = True  # core_sample_indices_: positions of the core points in y_hat

            # Count the clusters; label -1 marks unclustered (noise) samples
            y_unique = np.unique(y_hat)
            n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
            print(y_unique, 'number of clusters:', n_clusters)

            plt.subplot(2, 3, i + 1)  # 2 rows, 3 columns, draw panel i+1
            # plt.cm.Spectral: https://blog.csdn.net/robin_Xu_shuai/article/details/79178857
            clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))  # one color per label
            for k, clr in zip(y_unique, clrs):
                cur = (y_hat == k)
                if k == -1:
                    # Unclustered (noise) samples
                    plt.scatter(data[cur, 0], data[cur, 1], s=20, c='k')
                    continue
                # All points of the cluster
                plt.scatter(data[cur, 0], data[cur, 1], s=30, c=[clr], edgecolors='k')
                # Core points of the cluster, drawn larger
                plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1],
                            s=60, c=[clr], marker='o', edgecolors='k')
            x1_min, x2_min = np.min(data, axis=0)
            x1_max, x2_max = np.max(data, axis=0)
            x1_min, x1_max = expand(x1_min, x1_max)
            x2_min, x2_max = expand(x2_min, x2_max)
            plt.xlim((x1_min, x1_max))
            plt.ylim((x2_min, x2_max))
            plt.grid(True)
            plt.title(r'$\epsilon$ = %.1f  m = %d  clustering num %d' % (eps, min_samples, n_clusters), fontsize=16)
        plt.tight_layout()
        plt.subplots_adjust(top=0.9)
        plt.show()

Output:

    [-1 0 1 2 3] number of clusters: 4
    [-1 0 1 2 3] number of clusters: 4
    [-1 0 1 2 3 4] number of clusters: 5
    [-1 0] number of clusters: 1
    [-1 0 1] number of clusters: 2
    [-1 0 1 2 3] number of clusters: 4

2.2 Using a custom test sample

    # Reference: https://blog.csdn.net/hb707934728/article/details/71515160
    # Official example: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN


    def expand(a, b):
        """Widen the interval [a, b] by 10% on each side."""
        d = (b - a) * 0.1
        return a - d, b + d


    if __name__ == "__main__":
        data = X_train_demo.values  # X_train_demo: your own data as a DataFrame (not defined in this snippet)
        # Parameter pairs: (epsilon, min_samples)
        params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.2, 20), (0.2, 25), (0.2, 30))

        plt.figure(figsize=(12, 8), facecolor='w')
        plt.suptitle('DBSCAN clustering', fontsize=20)

        for i in range(6):
            eps, min_samples = params[i]
            # eps: radius of the circular neighborhood centered on a point P
            # min_samples: minimum number of points inside that neighborhood
            # P is a core point if its eps-neighborhood holds at least min_samples points
            model = DBSCAN(eps=eps, min_samples=min_samples)
            model.fit(data)
            y_hat = model.labels_

            core_indices = np.zeros_like(y_hat, dtype=bool)  # boolean mask with the same shape as y_hat
            core_indices[model.core_sample_indices_] = True  # core_sample_indices_: positions of the core points in y_hat

            # Count the clusters; label -1 marks unclustered (noise) samples
            y_unique = np.unique(y_hat)
            n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
            print(y_unique, 'number of clusters:', n_clusters)

            plt.subplot(2, 3, i + 1)  # 2 rows, 3 columns, draw panel i+1
            # plt.cm.Spectral: https://blog.csdn.net/robin_Xu_shuai/article/details/79178857
            clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))  # one color per label
            for k, clr in zip(y_unique, clrs):
                cur = (y_hat == k)
                if k == -1:
                    # Unclustered (noise) samples
                    plt.scatter(data[cur, 0], data[cur, 1], s=20, c='k')
                    continue
                # All points of the cluster
                plt.scatter(data[cur, 0], data[cur, 1], s=30, c=[clr], edgecolors='k')
                # Core points of the cluster, drawn larger
                plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1],
                            s=60, c=[clr], marker='o', edgecolors='k')
            x1_min, x2_min = np.min(data, axis=0)
            x1_max, x2_max = np.max(data, axis=0)
            x1_min, x1_max = expand(x1_min, x1_max)
            x2_min, x2_max = expand(x2_min, x2_max)
            plt.xlim((x1_min, x1_max))
            plt.ylim((x2_min, x2_max))
            plt.grid(True)
            plt.title(r'$\epsilon$ = %.1f  m = %d  clustering num %d' % (eps, min_samples, n_clusters), fontsize=14)
        plt.tight_layout()
        plt.subplots_adjust(top=0.9)
        plt.show()

Note: at the two ends of the test sample, DBSCAN classifies the samples at the "tips" well, compared with isolation forest.

2.3 Core code

model = DBSCAN(eps=eps, min_samples=min_samples) # build the model

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN

    data = X_train_demo.values  # X_train_demo: your own data as a DataFrame (not defined in this snippet)
    eps, min_samples = 0.2, 10
    # eps: size of the neighborhood; min_samples: minimum number of points in the neighborhood
    model = DBSCAN(eps=eps, min_samples=min_samples)  # build the model
    model.fit(data)                                   # fit
    labels = model.labels_                            # cluster labels, -1 means unclustered

    # Get the core points
    core_indices = np.zeros_like(labels, dtype=bool)  # boolean mask with the same shape as labels
    core_indices[model.core_sample_indices_] = True   # core_sample_indices_: positions of the core points in labels
    core_point = data[core_indices]

    # Get the non-outlier points
    normal_point = data[labels >= 0]

    # Plot the data with the outliers removed
    plt.scatter(normal_point[:, 0], normal_point[:, 1], edgecolors='k')
    plt.show()

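2.4 Building a filter function

The function standardizes the data first, so that a fixed set of parameters can be used for the analysis.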

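2.4.1 Filter function

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler


    def filter_data(data0, params):
        """Return the rows of data0 that DBSCAN does not label as noise."""
        # Standardize so that the same (eps, min_samples) can be reused across data sets
        scaler = StandardScaler()
        scaler.fit(data0)
        data = scaler.transform(data0)

        eps, min_samples = params
        # eps: size of the neighborhood; min_samples: minimum number of points in the neighborhood
        model = DBSCAN(eps=eps, min_samples=min_samples)  # build the model
        model.fit(data)                                   # fit
        labels = model.labels_                            # cluster labels, -1 means unclustered

        # Get the core points (computed for reference, not used in the result)
        core_indices = np.zeros_like(labels, dtype=bool)  # boolean mask with the same shape as labels
        core_indices[model.core_sample_indices_] = True   # core_sample_indices_: positions of the core points in labels
        core_point = data[core_indices]

        # Get the non-outlier points, in the original (unscaled) units
        normal_point = data0[labels >= 0]
        return normal_point

2.4.2 Measuring the classification result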
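One possible way to measure a DBSCAN result is sketched below. It is an assumption about what such a measurement could look like, not the author's method: it reports the number of clusters, the share of points labeled as noise, and the silhouette score of the remaining points, using `sklearn.metrics.silhouette_score` on hypothetical `make_blobs` data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

kept = labels >= 0                        # points that were not labeled as noise
n_clusters = len(set(labels[kept]))       # number of clusters found
noise_ratio = 1 - kept.mean()             # share of points removed as outliers

# The silhouette score needs at least two clusters; noise points are left out
if n_clusters >= 2:
    score = silhouette_score(X[kept], labels[kept])
else:
    score = float("nan")

print("clusters:", n_clusters)
print("noise ratio:", round(noise_ratio, 3))
print("silhouette:", round(score, 3))
```

A high noise ratio suggests that `eps` is too small or `min_samples` too large for the data, while a low silhouette score suggests that the remaining clusters overlap.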

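(The author did not convert this part to markdown format and used a screenshot instead.)

Before removing the outliers:

After removing the outliers:

    plt.scatter(X_train_normal[:, 0], X_train_normal[:, 1])
    plt.show()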
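OneClassSVM, the third method named in the summary, has no code of its own in this text. The sketch below shows how it could be used in the same way as the other methods; the data, the `nu` value and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Hypothetical data: 190 clustered points plus 10 scattered points
X = np.r_[0.3 * rng.randn(190, 2),
          rng.uniform(low=-6, high=6, size=(10, 2))]

# The RBF kernel is distance based, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# nu is roughly the fraction of samples allowed to fall outside the learned region
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
labels = clf.fit_predict(X_scaled)   # 1 = inlier, -1 = outlier

# Keep the inliers, in the original units
X_normal = X[labels == 1]
print(X.shape, "->", X_normal.shape)
```

Unlike `contamination` in isolation forest and LOF, `nu` is a bound rather than an exact quota, so the share of removed points is only approximately `nu`.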

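4. Local Outlier Factor (LOF)

LOF computes a numeric score that reflects how anomalous a sample is. Roughly, the score means:

The average density at the locations of the sample's neighbors, divided by the density at the sample's own location. The further this ratio rises above 1, the lower the density at the sample's location compared with that of its neighbors, and the more likely the sample is an outlier.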
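The score can be inspected directly. In scikit-learn the fitted `LocalOutlierFactor` stores the negated score in `negative_outlier_factor_`, so flipping the sign gives the ratio described above. The data in this minimal sketch is hypothetical: one dense cluster and a single planted point far away from it.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Hypothetical data: 100 clustered points and one isolated point (last row)
X = np.r_[0.3 * rng.randn(100, 2), [[4.0, 4.0]]]

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # 1 = inlier, -1 = outlier

# negative_outlier_factor_ holds -LOF, so negate it to get the LOF score
scores = -lof.negative_outlier_factor_

print("LOF of the isolated point:", scores[-1])
print("median LOF of the cluster:", np.median(scores[:-1]))
```

Points inside the cluster should score close to 1, because their density matches that of their neighbors, while the isolated point should score far above 1.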

    # Reference: https://blog.csdn.net/hb707934728/article/details/71515160
    # Official example: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import LocalOutlierFactor


    def expand(a, b):
        """Widen the interval [a, b] by 10% on each side."""
        d = (b - a) * 0.1
        return a - d, b + d


    if __name__ == "__main__":
        data = X_train_demo.values  # X_train_demo: your own data as a DataFrame (not defined in this snippet)
        # Parameter pairs: (outliers_fraction, min_samples)
        params = ((0.01, 5), (0.05, 10), (0.1, 15), (0.15, 20), (0.2, 25), (0.25, 30))

        plt.figure(figsize=(12, 8), facecolor='w')
        plt.suptitle('Local Outlier Factor', fontsize=20)

        for i in range(6):
            outliers_fraction, min_samples = params[i]
            # n_neighbors: number of neighbors used to estimate the local density
            # contamination: fraction of the samples to label as outliers
            model = LocalOutlierFactor(n_neighbors=min_samples, contamination=outliers_fraction)
            y_hat = model.fit_predict(data)  # fit on the same data that is plotted below

            # Labels present in the result: 1 for inliers, -1 for outliers
            y_unique = np.unique(y_hat)

            plt.subplot(2, 3, i + 1)  # 2 rows, 3 columns, draw panel i+1
            # plt.cm.Spectral: https://blog.csdn.net/robin_Xu_shuai/article/details/79178857
            clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))  # one color per label
            for k, clr in zip(y_unique, clrs):
                cur = (y_hat == k)
                if k == -1:
                    # Outliers
                    plt.scatter(data[cur, 0], data[cur, 1], s=20, c='k')
                    continue
                # Inliers
                plt.scatter(data[cur, 0], data[cur, 1], s=30, c=[clr], edgecolors='k')
            x1_max, x2_max = np.max(data, axis=0)
            x1_min, x2_min = np.min(data, axis=0)
            x1_min, x1_max = expand(x1_min, x1_max)
            x2_min, x2_max = expand(x2_min, x2_max)
            plt.xlim((x1_min, x1_max))
            plt.ylim((x2_min, x2_max))
            plt.grid(True)
            # %.2f so that fractions such as 0.01 and 0.05 are not rounded away
            plt.title('outliers_fraction = %.2f  min_samples = %d' % (outliers_fraction, min_samples), fontsize=12)
        plt.tight_layout()
        plt.subplots_adjust(top=0.9)
        plt.show()

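4.1 Core code

    import matplotlib.pyplot as plt
    from sklearn.neighbors import LocalOutlierFactor

    X_train = X_train_demo.values  # X_train_demo: your own data as a DataFrame (not defined in this snippet)
    # Build the model: 25 neighbors per sample, outlier fraction 0.2
    clf = LocalOutlierFactor(n_neighbors=25, contamination=0.2)
    # Predict; each result is -1 or 1
    labels = clf.fit_predict(X_train)
    # Keep the inliers
    X_train_normal = X_train[labels > 0]

Before removing the outliers:

    plt.scatter(X_train[:, 0], X_train[:, 1])
    plt.show()

After removing the outliers:

    plt.scatter(X_train_normal[:, 0], X_train_normal[:, 1])
    plt.show()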

Notice: this article was reposted from the internet. In case of infringement, contact service@Juming.com to have it removed.
