Preliminary Data Analysis
In an earlier post, we described the first part of the Entytle process: Data Clean Up. After the customer data has been cleaned, standardized and enriched, several further analytical steps are needed before we start the core quantitative analysis of the transaction histories (to be described in a later post). The steps described in this post have two goals: 1) to defragment the data by separately deduplicating purchasing entities and items, and 2) to find associations between equipment and parts algorithmically.
Step 1 Customer Deduplication
In this step, records that are likely to refer to the same entity are identified and assigned a single shared label. For example, the following entries may all occur in the field “Company Name” in different records: Walmart, Wal-mart, Walmart Inc. etc. These three names likely refer to the same entity and are all mapped to “Walmart”. Deduplication is carried out on customer names and addresses. This allows us to merge different transaction histories, each of which may be very sparse, and so obtain a more complete picture of the installed base at each location.
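As a rough illustration of the name-matching idea, here is a minimal Python sketch. It is not the Entytle implementation: the suffix list and the function names `normalize_name` and `group_duplicates` are assumptions made for this example, and a real system would also compare addresses and tolerate typos.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore; a real list would be longer.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "corporation", "co"}


def normalize_name(name):
    """Reduce a company name to a crude matching key."""
    # Lowercase, then drop everything except letters, digits and spaces.
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    # Remove legal suffixes and join the remaining tokens.
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return "".join(tokens)


def group_duplicates(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


groups = group_duplicates(["Walmart", "Wal-mart", "Walmart Inc.", "Target"])
```

In this sketch the three Walmart variants from the example above fall under one key, while “Target” gets its own. Exact-key matching is the simplest possible choice and misses misspellings, so a fuzzy comparison would be needed in practice.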
Step 2 Item Deduplication, Categorization & Classification
Data referring to items forms a large part of the purchase history, for example in fields like Item Number, Item Description, Product Line, Product Category, Quantity and Unit Price. The first step is to identify records that probably refer to the same item (say “Blue washer”), eliminating variations due to typos, alternative spellings or extraneous characters. Next, functionally similar items are clustered together (for example “Blue washer” and “Red washer”) and assigned to the same Item Category, the goal being to coalesce the histories and generate opportunities for functionally interchangeable items. Finally, item categories are classified as equipment, part or consumable. For example, the data may contain 10-100 distinct item descriptions for a single item, all of which are labelled “blue motor”. “Blue motor” and “red motor” are interchangeable and are categorized as “Motors”, and “Motors” are in turn classified as a part. This information is shared with the customer, and their feedback is incorporated to improve the results of this process. Not all domain knowledge can be learned by natural language processing or other machine learning algorithms: for example, that over time and by different people “bladders” have also been referred to as “air pads” and “backup rubbers”. Recall that the main goal here is to group together functionally similar or interchangeable items to increase data density and the reliability of our predictions.
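The first two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in production: string similarity from the standard library's `difflib.SequenceMatcher` stands in for the deduplication step, a naive "last word" rule stands in for the clustering step, and the `SYNONYMS` table is a hypothetical stand-in for customer feedback. The 0.85 threshold is likewise an arbitrary choice for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical synonym map standing in for domain knowledge from the customer.
SYNONYMS = {"air pad": "bladder", "backup rubber": "bladder"}


def clean(description):
    """Lowercase and replace extraneous characters with single spaces."""
    text = re.sub(r"[^a-z0-9 ]", " ", description.lower())
    return " ".join(text.split())


def same_item(a, b, threshold=0.85):
    """Treat two descriptions as the same item if they are nearly identical."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio() >= threshold


def category(description):
    """Assign a crude item category: a known synonym, else the last word."""
    text = clean(description)
    if not text:
        return ""
    for alias, canonical in SYNONYMS.items():
        if alias in text:
            return canonical
    last = text.split()[-1]
    # Naive singularization so "washers" and "washer" share a category.
    return last[:-1] if last.endswith("s") else last
```

With these definitions, `same_item("Blue washer", "Blue  washr#")` holds while `same_item("Blue washer", "Red washer")` does not, yet both washers fall into the category `washer`. The synonym lookup shows where customer feedback enters: no string-similarity measure would link “air pads” to “bladder” on its own.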
Step 3 Item Association
After the classification of items into equipment and parts/consumables, we find Equipment-Part pairs using collaborative filtering and other machine learning techniques. The Equipment-Part pairs we find are shared with the customer for confirmation, and their feedback is incorporated into the algorithms.
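One simple form of collaborative filtering scores an equipment item and a part by how strongly their sets of buyers overlap. The sketch below uses cosine similarity over customer sets. The post does not say which variant is actually used, so the scoring rule, the function name `equipment_part_scores` and the sample data are all assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt


def equipment_part_scores(purchases, equipment, parts):
    """Score Equipment-Part pairs by cosine similarity of their buyer sets.

    purchases is an iterable of (customer, item) pairs.
    """
    buyers = defaultdict(set)
    for customer, item in purchases:
        buyers[item].add(customer)

    scores = {}
    for e in equipment:
        for p in parts:
            shared = len(buyers[e] & buyers[p])
            if shared:  # pairs with no common buyer are left out
                scores[(e, p)] = shared / sqrt(len(buyers[e]) * len(buyers[p]))
    return scores


# Made-up purchase history: customers A, B, C and four items.
purchases = [
    ("A", "pump"), ("A", "seal"), ("A", "belt"),
    ("B", "pump"), ("B", "seal"),
    ("C", "fan"), ("C", "belt"),
]
scores = equipment_part_scores(purchases, ["pump", "fan"], ["seal", "belt"])
```

Here the pair (pump, seal) scores highest because exactly the same customers bought both, which makes it the sort of candidate that would then be sent to the customer for confirmation.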