“SolutionInc manages public access Wi-Fi in some of the world's busiest places. We contacted the IBM jStart team to help us analyze our large datasets using the Apache Spark and iPython Notebook running on the IBM Bluemix platform. By making use of the IBM Spark technology, we were able to obtain insights on device traffic patterns. These analytics can help our customers leverage their investment in a Wi-Fi solution into a valuable business tool.”
Glen Lavigne, SolutionInc President, Chief Executive Officer
Generating Business Insights from large Wi-Fi Datasets
SolutionInc, an established leader in managing public access Wi-Fi in some of the world’s busiest places, approached the IBM jStart team for assistance in analyzing large data sets. SolutionInc collected Wi-Fi presence data at multiple venues and wanted to identify trends that were hiding within 241 million rows of Wi-Fi log information. The goal was to provide valuable business insights that SolutionInc could share with their clients. The jStart team used the IBM Analytics for Apache Spark managed service running on IBM Bluemix to visualize customer trends such as peak volume times, busiest locations, route patterns, dwell times and device types (iOS vs Android) from the location-based data sets.
Since 1997, SolutionInc has been a leader in managing public Internet access in hotels, convention centers, airports and other public venues around the world. SolutionInc offers both on-premise and cloud-based solutions for managing high demand public Wi-Fi so customers can seamlessly access the Internet, thereby enhancing their overall venue experience.
Wi-Fi data logs are being collected all the time, everywhere
Smartphones and other mobile devices periodically broadcast packets called Probe Requests that contain the unique MAC address of the client. By actively scanning for nearby Wi-Fi networks, a phone can initiate a Wi-Fi connection faster than if it waits for a Wi-Fi access point to send out a Beacon Frame. Since many mobile devices use this technique, it is possible to track many individuals from one location to another and determine how long a person stays at a particular location. Wi-Fi access providers can collect location-based data, which consists of individual log records created whenever a mobile device is near a Wi-Fi access point and broadcasts a probe request. Each location-based log record contains multiple data elements such as the access point ID, signal strength, mobile device MAC address, and date/time.
Spark on IBM Bluemix to the Rescue
SolutionInc collected over 23GB of presence data from Wi-Fi access points at various locations such as restaurants, coffee shops, and shopping malls. The sheer size of the dataset made it impractical to analyze with traditional spreadsheet tools, so the jStart and SolutionInc teams decided to use Spark technology on Bluemix to perform the data analysis. Spark is an open source framework and engine designed to process large amounts of data quickly and efficiently. While it is somewhat similar to Hadoop, the advantage of Spark is that it keeps the data to be analyzed in memory, which can make it up to 100x faster for some workloads.
An iPython notebook can be used as a front-end interface to the Spark processing engine. Essentially, an iPython notebook combines text, executable Python code, and graphics/visualizations into a single document that captures the flow of data analysis and exploration. This document can be exported as a formatted report or an executable script. The combination of IPython Notebooks, Apache Spark, and Object Storage delivers a complete and integrated experience for data scientists and analysts performing interactive analytics.
By using IBM Bluemix cloud infrastructure to host both Spark and iPython notebook technologies, the SolutionInc team did not need to set up their own servers or learn how to install and configure Spark. It only takes a few minutes for users to get a working Spark/iPython notebook environment running on IBM Bluemix.
First step: clean/filter the data
Real-world data is rarely “clean”. When more than one Wi-Fi access point is deployed within the same location, a mobile device can be “seen” at two or more access points at once. In these instances, Spark and iPython notebooks were used to sort the data by timestamp, look for clients that were registered simultaneously at different access points, and remove the log entries that had the lower signal strength.
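The cleaning rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's notebook code: the record layout (`mac`, `ap_id`, `rssi`, `ts`) is hypothetical, "simultaneous" is taken to mean an identical timestamp, and signal strength is assumed to be a dBm-style value where a higher number means a stronger signal.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are assumptions for illustration.
Sighting = namedtuple("Sighting", ["mac", "ap_id", "rssi", "ts"])

def keep_strongest(sightings):
    """For each (device, timestamp) pair, keep only the strongest sighting."""
    best = {}
    for s in sightings:
        key = (s.mac, s.ts)
        # Assumes a higher rssi value means a stronger signal.
        if key not in best or s.rssi > best[key].rssi:
            best[key] = s
    # Return the surviving records sorted by timestamp, then device.
    return sorted(best.values(), key=lambda s: (s.ts, s.mac))

# Made-up sample data: device 01 is seen by two access points at ts=100.
logs = [
    Sighting("aa:aa:aa:00:00:01", "ap-1", -70, 100),
    Sighting("aa:aa:aa:00:00:01", "ap-2", -55, 100),
    Sighting("aa:aa:aa:00:00:02", "ap-1", -60, 100),
    Sighting("aa:aa:aa:00:00:01", "ap-1", -65, 160),
]
cleaned = keep_strongest(logs)
```

In a Spark job the same idea would typically be expressed as a grouping on device and timestamp followed by a reduction that keeps the record with the maximum signal strength.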
Interesting insights derived from data
SolutionInc wanted to compare the number of Wi-Fi registrations versus raw location-based data at various sites. In other words, how many people were in an area serviced by a Wi-Fi access point, but did not actually register to access the Internet at that location? For the particular dataset supplied by SolutionInc, it was found that for most venues, only 1% to 10% of the mobile devices that stayed longer than 5 minutes at a particular location actually registered to use the Wi-Fi network. As seen in the chart below, some venues had a significantly higher percentage of engagement than others. In general, hotels and coffee shops have a higher percentage of patrons accessing the site’s Wi-Fi network than other locations such as a doctor’s or lawyer’s office.
Analytics reports, such as the one shown above, only take a few minutes to generate using the IBM Spark/iPython notebook environment.
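As a rough illustration of the engagement metric discussed above, the percentage can be computed from two sets of device identifiers per venue. The function name and the sample identifiers below are hypothetical; the sketch assumes the set of "present" devices has already been limited to those that stayed longer than 5 minutes.

```python
def engagement_rate(present_devices, registered_devices):
    """Percentage of present devices that also registered on the Wi-Fi network."""
    present = set(present_devices)
    if not present:
        return 0.0
    registered_and_present = present & set(registered_devices)
    return 100.0 * len(registered_and_present) / len(present)

# Made-up example: 20 devices stayed over 5 minutes, one of them registered.
present = {"dev-%02d" % i for i in range(20)}
registered = {"dev-03", "dev-99"}  # dev-99 registered but is not in the present set
rate = engagement_rate(present, registered)  # 5.0
```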
Another insight was the number of patrons per day at an establishment who had smartphones or tablets. In order to generate a report for this metric, the Wi-Fi location-based data first needed to be cleansed of “passer-by” and “stationary” devices. For example, a person walking outside on the sidewalk or driving in a nearby street might be picked up by an establishment’s access point. We defined these occurrences as a person passing by and did not count those individuals as customers at that location. To filter out “passer-by” log entries, our team established a rule that only mobile devices that pinged an access point for 5 minutes or more would be counted as a “patron” of the establishment. We also found that there were several stationary devices at many locations. For example, a point-of-sale terminal or tablet could be connected to the establishment’s Wi-Fi network, but those devices should not be counted as patrons. Therefore, a device was filtered out if it was detected at only one location and pinged the access point for more than 6 hours a day on average.
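The two filtering rules above can be sketched as follows. This is a minimal illustration rather than the team's implementation: the tuple layout `(device, location, day, second_of_day)` is hypothetical, dwell time is approximated as the span between a device's first and last ping on a given day, and the daily average is taken over the days on which the device was seen.

```python
from collections import defaultdict

MIN_PATRON_SECONDS = 5 * 60      # 5-minute passer-by threshold from the text
MAX_DAILY_SECONDS = 6 * 60 * 60  # 6-hour stationary threshold from the text

def daily_dwell(pings):
    """Map (device, location, day) -> seconds between first and last ping."""
    spans = {}
    for device, location, day, second in pings:
        key = (device, location, day)
        lo, hi = spans.get(key, (second, second))
        spans[key] = (min(lo, second), max(hi, second))
    return {key: hi - lo for key, (lo, hi) in spans.items()}

def stationary_devices(dwell):
    """Devices seen at one location only, averaging over 6 hours per day."""
    locations = defaultdict(set)
    seconds_per_day = defaultdict(list)
    for (device, location, _day), seconds in dwell.items():
        locations[device].add(location)
        seconds_per_day[device].append(seconds)
    return {
        device
        for device, values in seconds_per_day.items()
        if len(locations[device]) == 1
        and sum(values) / len(values) > MAX_DAILY_SECONDS
    }

def patron_visits(pings):
    """Return (device, location, day) visits that count as patrons."""
    dwell = daily_dwell(pings)
    stationary = stationary_devices(dwell)
    return sorted(
        key for key, seconds in dwell.items()
        if seconds >= MIN_PATRON_SECONDS and key[0] not in stationary
    )

# Made-up sample: a one-hour visit, a one-minute passer-by, a 14-hour terminal.
pings = [
    ("phone-a", "bar", "2014-04-11", 72000), ("phone-a", "bar", "2014-04-11", 75600),
    ("phone-b", "bar", "2014-04-11", 72000), ("phone-b", "bar", "2014-04-11", 72060),
    ("pos-1", "bar", "2014-04-11", 28800), ("pos-1", "bar", "2014-04-11", 79200),
]
visits = patron_visits(pings)
```

Only `phone-a` survives in this sample: `phone-b` falls below the 5-minute threshold and `pos-1` matches the stationary rule.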
The graphs to the right show results from a bar for the weeks of April 7, 2014 and April 14, 2014. The data pattern for the week of April 7th reflects the expected scenario of the busiest days being Friday and Saturday. The week of April 14th had an irregular distribution: the busiest day was Thursday, while Friday, Saturday, and Sunday of that week each had very low traffic. This pattern reflects the change in traffic caused by the Easter holiday. Thursday night became “Friday night” and patron registrations were much reduced over the weekend.
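A per-day patron count of the kind plotted in these graphs could be produced along the following lines. The input layout of `(device, date)` pairs and the sample values are made up for illustration; the sketch assumes passer-by and stationary devices have already been removed.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def patrons_per_day(visits):
    """Count distinct devices per calendar day, labeled with the weekday."""
    devices_by_day = {}
    for device, day in visits:
        devices_by_day.setdefault(day, set()).add(device)
    return {
        "%s %s" % (WEEKDAYS[day.weekday()], day.isoformat()): len(devices)
        for day, devices in sorted(devices_by_day.items())
    }

# Made-up visits; a repeated (device, day) pair is counted only once.
sample = [
    ("d1", date(2014, 4, 17)), ("d2", date(2014, 4, 17)),
    ("d1", date(2014, 4, 17)), ("d1", date(2014, 4, 18)),
]
counts = patrons_per_day(sample)
```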
Another analytics exercise was to classify the mix of mobile devices being used by patrons at an establishment. The type of mobile device being used can be identified by cross-referencing the MAC address of the client with the IEEE-assigned list of MAC address prefixes.
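The lookup described above relies on the first three octets of a MAC address, which identify the organization the address block was assigned to. A minimal sketch follows; the prefix table is a placeholder with made-up prefixes and vendor names, since a real analysis would load the registry published by the IEEE.

```python
from collections import Counter

HEX_DIGITS = set("0123456789abcdefABCDEF")

def oui_prefix(mac):
    """Return the first three octets of a MAC address as uppercase hex."""
    digits = "".join(ch for ch in mac if ch in HEX_DIGITS).upper()
    if len(digits) != 12:
        raise ValueError("not a 48-bit MAC address: %r" % (mac,))
    return digits[:6]

# Placeholder table: these prefixes and vendor names are hypothetical.
HYPOTHETICAL_OUI_TABLE = {"AABBCC": "Vendor A", "DDEEFF": "Vendor B"}

def vendor_of(mac, table=HYPOTHETICAL_OUI_TABLE):
    """Look up the vendor for a MAC address, or 'Unknown' if not listed."""
    return table.get(oui_prefix(mac), "Unknown")

# Made-up addresses in two common notations.
macs = ["aa:bb:cc:11:22:33", "AA-BB-CC-00-00-01", "dd:ee:ff:00:00:02",
        "12:34:56:00:00:00"]
device_mix = Counter(vendor_of(m) for m in macs)
```

Dividing each count in `device_mix` by the total number of devices gives the percentage breakdown by vendor.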
Once users log into the Wi-Fi network, additional information about each device, such as its operating system (OS), can also be obtained. One would expect the mix of device types found in the location-based data to parallel the mix of operating systems on the devices that actually logged into Wi-Fi access points. As shown in the chart, Apple Mac/iOS devices accounted for 62.62% of the devices that registered to use the Wi-Fi network. This percentage is similar to the share of Apple devices detected in the location-based data (58%), as shown by the chart above.
Collecting and analyzing Wi-Fi access point data can present several data processing challenges, but the results can be valuable to both Wi-Fi access providers and their clients. One challenge is the sheer size of the dataset. To minimize processing time, Apache Spark running on Bluemix was used to speed the filtering, sorting, analysis, and reporting of the SolutionInc data. Wi-Fi data records are also often “noisy”, containing records for stationary and passer-by devices that could distort the analysis, so access log records must be filtered before trends can be analyzed. This project demonstrated that the volume of data points generated by Wi-Fi technology requires a powerful analytics tool to produce meaningful results. Valuable information can be distilled from the raw data, and that information can help Wi-Fi providers and their clients understand and use passive and active Wi-Fi activity to create business insights and opportunities.
Start Small, Grow Fast
Learn how the jStart Team can help your business get started using our "start small, grow fast" engagement process. Today's business challenges aren't just about huge amounts of information; they are about leveraging the valuable insights and opportunities living within that data. jStart is a highly skilled team focused on providing fast, smart, and valuable business solutions leveraging the latest technologies. The team typically focuses on emerging technologies that have commercial potential within 12-18 months. This allows the team to keep ahead of the adoption curve while being prepared for client engagements and partnerships. The team’s focus includes predictive and prescriptive analytics, cognitive computing, cloud technologies, big data, social data, and mobile platforms.