TSQL Fuzzy Grouping

Disons que j'ai des lignes avec des descriptions comme collées ci-dessous:

DAS0_CIP.TRA_HVAC01_073.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5. DAS0_CIP.TRA_HVAC01_072.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5. 

Ce sont très similaires mais pas tout à fait la même chose. Malheureusement, Soundex ne travaillera pas à les regrouper et je ne pense pas que je puisse utiliser une fonction de comparaison d'loggings à deux parce que j'ai une table avec 7 millions d'loggings que je dois regrouper par la description. Je veux regrouper des descriptions similaires. Existe-t-il un moyen de marquer ou de classr la similarité des lignes? Comment pourrais-je les regrouper d'une meilleure manière?

En ce moment, je groupais par la gauche ([description], 50) ou quelque chose comme ça pour get un regroupement approximatif, mais ce n'est pas vraiment précis. Je pense qu'il doit y avoir un meilleur moyen.

Exemple de données étendues:

 [GREDS2UA1381TFT][MR_Data_2]MRSaveData Script: Sproc failed to Update !ERR Invalid hpItemArray in AlmLinkedList::AddItemAtEnd !ERR Invalid hpItemArray in AlmLinkedList::GetItemPtrAt !ERR Semaphore wait failed in CBThrd_CallbackThreadMain [hCBThreadSemaphore=000001AC] !WNAL_Register(AlarmPriority10b) - failed(501) !WNALderegister AlarmMgr not running !WNAPalarm AlarmMgr not running .Initialize(Label Details ToteStacker Manual): EntID or AreaName is bad: EntID= 0| AreaName= .RefreshGrid(Reprint): No production labels found with filters .RefreshMaterialsGrid [PRODUCE]: - Fact.Common.XMLReadBuilder {1a} MES_CheeseMakeVatL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 16 ) failed, hr = 0x8000FFFF {1a} MES_CheeseMakeVatL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 3 ) failed, hr = 0x80070057 [GRE_CM_REC_OS3][MR_Data_2]MRSaveData Script: objResult Not Successful Proxy failed to connect to Middleware Server service endpoint. This can happen if your Middleware Server is not available or if the service endpoint portqueue is not accessible. The endpoint is [net.tcp:GRECMESDB1:8099FactMES.Server.HostEventBrokerSvc]. {1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 8 ) failed, hr = 0x80070057 {1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 9 ) failed, hr = 0x8000FFFF 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2542, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2560, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2560, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 3782] [GREDS2UA1381TFT][MR_Data_2]MRSaveData Script: Unable to get session instance 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 3651] {1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 10 ) failed, hr = 0x80070057 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:35:36.037, 20160313 03:35:36.036) [aahMDASSF.cpp; 3315; 2380] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:39:50.371, 20160313 03:39:50.371) [aahMDASSF.cpp; 3315; 1] {1a} MES_StarterL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 3 ) failed, hr = 0x80070057 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:52:29.264, 20160313 03:52:29.264) [aahMDASSF.cpp; 3315; 3598] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:56:41.972, 20160313 03:56:41.971) [aahMDASSF.cpp; 3315; 776] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2588, 20160313 03:44:03.194, 20160313 03:44:03.194) [aahMDASSF.cpp; 3315; 1] {1a} MES_SweetWheyL1_AppEngine: StopHostedObjects() - m_AccessManagerProvider->UnregisterAutomationObjectNoUpdate( objectId = 7 ) failed, hr = 0x8000FFFF 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2588, 20160313 03:44:03.194, 20160313 03:44:03.194) [aahMDASSF.cpp; 3315; 3199] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:27:07.759, 20160313 03:27:07.758) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2542, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 3472] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:35:36.037, 20160313 03:35:36.036) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:27:07.759, 20160313 03:27:07.758) [aahMDASSF.cpp; 3315; 332] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:31:20.290, 20160313 03:31:20.290) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:35:35.019, 20160313 03:35:35.018) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:35:35.019, 20160313 03:35:35.018) [aahMDASSF.cpp; 3315; 233] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:39:50.959, 20160313 03:39:50.959) [aahMDASSF.cpp; 3315; 253] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:44:04.936, 20160313 03:44:04.936) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:44:04.936, 20160313 03:44:04.936) [aahMDASSF.cpp; 3315; 289] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:52:30.967, 20160313 03:52:30.967) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:39:50.371, 20160313 03:39:50.371) [aahMDASSF.cpp; 3315; 2725] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:52:30.967, 20160313 03:52:30.967) [aahMDASSF.cpp; 3315; 326] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:48:17.312, 20160313 03:48:17.312) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:48:17.312, 20160313 03:48:17.312) [aahMDASSF.cpp; 3315; 3683] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:52:29.264, 20160313 03:52:29.264) [aahMDASSF.cpp; 3315; 1] 0:A2C Attempt to store values in the future; timestamps were overwritten with current time (2579, 20160313 03:56:41.972, 20160313 03:56:41.971) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:22:55.228, 20160313 03:22:55.227) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:22:55.228, 20160313 03:22:55.227) [aahMDASSF.cpp; 3315; 331] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:31:20.290, 20160313 03:31:20.290) [aahMDASSF.cpp; 3315; 325] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:39:50.959, 20160313 03:39:50.959) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:48:18.899, 20160313 03:48:18.899) [aahMDASSF.cpp; 3315; 1] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:48:18.899, 20160313 03:48:18.899) [aahMDASSF.cpp; 3315; 316] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:56:44.589, 20160313 03:56:44.588) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2100, 20160313 03:35:40.381, 20160313 03:35:40.380) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2101, 20160313 03:22:59.993, 20160313 03:22:59.993) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2101, 20160313 03:22:59.993, 20160313 03:22:59.993) [aahMDASSF.cpp; 3315; 1097] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:27:13.618, 20160313 03:27:13.618) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:27:13.618, 20160313 03:27:13.618) [aahMDASSF.cpp; 3315; 1092] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:31:27.118, 20160313 03:31:27.118) [aahMDASSF.cpp; 3315; 992] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:39:53.131, 20160313 03:39:53.131) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:39:53.131, 20160313 03:39:53.131) [aahMDASSF.cpp; 3315; 798] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:44:07.930, 20160313 03:44:07.930) [aahMDASSF.cpp; 3315; 1008] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:48:21.578, 20160313 03:48:21.577) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 419] 0:A58 Attempt to store values in the future; timestamps were overwritten with current time (2233, 20160313 03:56:44.589, 20160313 03:56:44.588) [aahMDASSF.cpp; 3315; 66] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2100, 20160313 03:35:40.381, 20160313 03:35:40.380) [aahMDASSF.cpp; 3315; 725] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:35:37.598, 20160313 03:35:37.597) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:31:27.118, 20160313 03:31:27.118) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:44:07.930, 20160313 03:44:07.930) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:48:21.578, 20160313 03:48:21.577) [aahMDASSF.cpp; 3315; 1143] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:52:37.035, 20160313 03:52:37.034) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:52:37.035, 20160313 03:52:37.034) [aahMDASSF.cpp; 3315; 1161] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:39:53.494, 20160313 03:39:53.493) [aahMDASSF.cpp; 3315; 306] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:56:50.535, 20160313 03:56:50.535) [aahMDASSF.cpp; 3315; 1] 0:A5C Attempt to store values in the future; timestamps were overwritten with current time (2103, 20160313 03:56:50.535, 20160313 03:56:50.535) [aahMDASSF.cpp; 3315; 183] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:44:05.698, 20160313 03:44:05.698) [aahMDASSF.cpp; 3315; 355] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:22:55.884, 20160313 03:22:55.883) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:48:18.453, 20160313 03:48:18.452) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:48:18.453, 20160313 03:48:18.452) [aahMDASSF.cpp; 3315; 390] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:27:09.509, 20160313 03:27:09.508) [aahMDASSF.cpp; 3315; 416] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:31:23.009, 20160313 03:31:23.008) [aahMDASSF.cpp; 3315; 393] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:56:48.585, 20160313 03:56:48.584) [aahMDASSF.cpp; 3315; 71] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:35:37.598, 20160313 03:35:37.597) [aahMDASSF.cpp; 3315; 291] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:39:53.494, 20160313 03:39:53.493) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:44:05.698, 20160313 03:44:05.698) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2288, 20160313 03:42:55.241, 20160313 03:42:55.240) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:52:32.720, 20160313 03:52:32.719) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2288, 20160313 03:42:55.241, 20160313 03:42:55.240) [aahMDASSF.cpp; 3315; 639] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2294, 20160313 03:38:43.218, 20160313 03:38:43.218) [aahMDASSF.cpp; 3315; 442] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:52:32.720, 20160313 03:52:32.719) [aahMDASSF.cpp; 3315; 401] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2249, 20160313 03:56:48.585, 20160313 03:56:48.584) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2338, 20160313 03:55:38.184, 20160313 03:55:38.184) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2294, 20160313 03:38:43.218, 20160313 03:38:43.218) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2336, 20160313 03:51:24.094, 20160313 03:51:24.094) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2344, 20160313 03:34:28.353, 20160313 03:34:28.352) [aahMDASSF.cpp; 3315; 455] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2344, 20160313 03:59:50.704, 20160313 03:59:50.703) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2336, 20160313 03:51:24.094, 20160313 03:51:24.094) [aahMDASSF.cpp; 3315; 841] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2338, 20160313 03:55:38.184, 20160313 03:55:38.184) [aahMDASSF.cpp; 3315; 670] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2343, 20160313 03:47:09.092, 20160313 03:47:09.091) [aahMDASSF.cpp; 3315; 1] 0:A6C Attempt to store values in the future; timestamps were overwritten with current time (2343, 20160313 03:47:09.092, 20160313 03:47:09.091) [aahMDASSF.cpp; 3315; 658] A write to InTouch was skipped with a value of because the data quality was bad or initializing. A write to InTouch was skipped with a value of CM__Ctrl because the data quality was bad or initializing. A write to InTouch was skipped with a value of false because the data quality was bad or initializing. 

Une approche pourrait être la suivante: elle testera, combien de parties sont identiques … Vous pouvez continuer avec cette idée dans chaque partie en la divisant par _ .

 CREATE FUNCTION dbo.TestSimilarity(@s1 VARCHAR(100),@s2 VARCHAR(100)) RETURNS INT AS BEGIN DECLARE @x1 XML=CAST('<x>' + REPLACE(@s1,'.','</x><x>') + '</x>' AS XML); DECLARE @x2 XML=CAST('<x>' + REPLACE(@s2,'.','</x><x>') + '</x>' AS XML); DECLARE @retVal INT=0; IF @x1.value('/x[1]','varchar(100)')= @x2.value('/x[1]','varchar(100)') SET @retVal=@retVal+1; IF @x1.value('/x[2]','varchar(100)')= @x2.value('/x[2]','varchar(100)') SET @retVal=@retVal+1; IF @x1.value('/x[3]','varchar(100)')= @x2.value('/x[3]','varchar(100)') SET @retVal=@retVal+1; IF @x1.value('/x[4]','varchar(100)')= @x2.value('/x[4]','varchar(100)') SET @retVal=@retVal+1; IF @x1.value('/x[5]','varchar(100)')= @x2.value('/x[5]','varchar(100)') SET @retVal=@retVal+1; RETURN @retVal; END; GO SELECT dbo.TestSimilarity('DAS0_CIP.TRA_HVAC01_073.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5.','DAS0_CIP.TRA_HVAC01_072.TRA_HVAC_BCK.TRA_HVAC_CLX.Program:AHU_5.'); GO DROP FUNCTION dbo.TestSimilarity; 

Le résultat serait 4 …