When I first heard about this particular project, I was immediately excited. Not only would I get a chance to work with some cool technologies doing work that mattered, but I would also be spearheading the project by myself. It felt great that my manager believed in me enough to let me run with a project of this magnitude.
The IP Manager project was essentially a tool that allowed Polar to collect data from various existing Internet number registries and pull it into our database for reference. Clients using Polar's applications send in various pieces of data, including their IP address and, if applicable, their Mobile Country Code (MCC) and Mobile Network Code (MNC). Previous testing found that these pieces of information are not always entirely accurate, so the purpose of the project was two-fold.
First, we wanted to query the appropriate Internet number registry for additional information about the IP address being used, such as a street address, city, region, country, and postal/zip code. This information is stored so that Polar can better tailor its applications and offerings to specific geographic areas. The second purpose of the project was to confirm and fix any incorrect MCC/MNC pairs being sent from the client.
When it came to implementation, I first set up a fork in our existing analytics infrastructure so that each incoming packet we wanted to analyze would be dumped onto a separate Beanstalk queue. The IP Manager consumes the packets on this queue and processes each one by first determining which Internet registry the IP address is associated with. This step is necessary so that we know which API to use to gather further information. Next, using Tornado, we query that API with the address and write the returned data to tables created for this purpose, backed by a set of Django models I had built earlier.
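The registry-lookup step can be sketched roughly as follows. The `/8`-to-registry table here is a small hypothetical excerpt (the real mapping comes from IANA's IPv4 address space registry), and `registry_for` is an illustrative name, not the function the project actually used:

```python
import ipaddress

# Hypothetical excerpt of the /8 block -> regional registry mapping;
# the authoritative table is IANA's IPv4 address space registry.
REGISTRY_BLOCKS = {
    "23.0.0.0/8": "ARIN",
    "51.0.0.0/8": "RIPE NCC",
    "101.0.0.0/8": "APNIC",
    "177.0.0.0/8": "LACNIC",
    "196.0.0.0/8": "AFRINIC",
}

def registry_for(ip):
    """Return the regional Internet registry responsible for `ip`,
    or None if the address falls outside the known blocks."""
    addr = ipaddress.ip_address(ip)
    for block, registry in REGISTRY_BLOCKS.items():
        if addr in ipaddress.ip_network(block):
            return registry
    return None
```

Once the registry is known, the matching API endpoint can be queried (in our case asynchronously, via Tornado's HTTP client) and the response written back through the Django models.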
Once this step is complete, the packet moves on to the second stage of processing, which checks the correctness of the MCC/MNC pair. First, a simple exact-match check is performed against known correct values in our database. If there is no match, we check whether an alias for the pair exists; an alias maps an incorrect pair to a correct one. To create these aliases, we check whether we have processed any other addresses in the same IP block. If we have processed other IP addresses in that block with enough certainty, we can create the alias and use it for subsequent incorrect pairs.
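The two-stage check can be sketched as below. The in-memory sets and dicts stand in for the database tables, and the sample MCC/MNC values are made up for illustration:

```python
# Stand-ins for the database tables of known-correct pairs and aliases.
KNOWN_PAIRS = {("302", "720"), ("310", "260")}
ALIASES = {("302", "72"): ("302", "720")}  # incorrect pair -> correct pair

def resolve_pair(mcc, mnc):
    """Resolve an incoming MCC/MNC pair to a known-correct pair.

    Stage 1: exact match against known correct values.
    Stage 2: fall back to an alias, if one has been created.
    Returns None if the pair cannot be resolved yet.
    """
    pair = (mcc, mnc)
    if pair in KNOWN_PAIRS:
        return pair
    return ALIASES.get(pair)
```

An unresolved pair is exactly the case where the alias-learning step described next kicks in.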
One challenging aspect of this project was determining 'certainty' when creating the aliases. To do this, Redis counters keyed on an MCC/MNC pair and an IP block are incremented each time the pair is observed. Once the number of confirmations reaches a predefined threshold, we take that pair as the most likely correct one. In testing, this method worked extremely well. It acts as a light form of learning, in that the correct MCC/MNC pair is learned from the frequency of its appearance.
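A minimal sketch of that confirmation-counting idea, assuming a plain in-memory `Counter` in place of the Redis counters the real system used, and a made-up threshold of 3:

```python
from collections import Counter

class AliasLearner:
    """Counts confirmations of each MCC/MNC pair per IP block and treats a
    pair as correct once it passes a threshold. The Counter stands in for
    Redis INCR operations, and the default threshold is hypothetical."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()   # (ip_block, pair) -> confirmation count
        self.confirmed = {}       # ip_block -> pair taken as correct

    def observe(self, ip_block, pair):
        """Record one sighting of `pair` within `ip_block`."""
        self.counts[(ip_block, pair)] += 1
        if self.counts[(ip_block, pair)] >= self.threshold:
            self.confirmed[ip_block] = pair

    def alias_for(self, ip_block, incorrect_pair):
        """Map an incorrect pair to its block's confirmed pair, if any."""
        correct = self.confirmed.get(ip_block)
        if correct is not None and correct != incorrect_pair:
            return correct
        return None
```

With Redis, `observe` would be an `INCR` on a key like `block:pair`, which keeps the counting atomic across multiple consumer processes.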
After all was said and done, I learned more than I could have imagined while working on this project. There were a multitude of challenges to overcome, and I did get stuck along the way, but all of the members of the Analytics team helped me out tremendously, and for that I can't thank them enough.