My research focuses on using data-driven methods to study human behavior. In particular, I use machine learning and statistical modeling techniques to design agents that are grounded in empirical data. I also use data mining techniques to uncover behavioral patterns in publicly available social media data. My application domains are cybersecurity and urban systems.
Active Research Projects
Social Media Analytics
This study proposes a sentiment-based approach to investigate temporal and spatiotemporal effects on tourists’ emotions when they visit a city’s tourist destinations. Our approach consists of four steps: data collection and preprocessing from social media; visitor origin identification; visit sentiment identification; and temporal and spatiotemporal analysis. The temporal and spatiotemporal dimensions include day of the year, season of the year, day of the week, location sentiment progression, enjoyment measure, and multi-location sentiment progression. We apply this approach to the city of Chicago using over eight million tweets. Results show that seasonal weather, as well as special days and activities like concerts, affect tourists’ emotions. In addition, our analysis suggests that tourists experience greater levels of enjoyment in places such as observatories than in zoos. Finally, we find that local and international visitors tend to convey negative sentiment when visiting more than one attraction in a day, whereas the opposite holds for out-of-state visitors. Below you will see some interesting results we gathered.
This study develops a machine learning classifier that determines Twitter users' home locations at a resolution of 100 meters. Our results suggest an overall accuracy of up to 0.87 in predicting home location for the City of Chicago. We explore the influence of the data collection time span and of a user's location-sharing habits. Classifier accuracy varies with the length of the data collection period, but time spans longer than one month do not significantly increase prediction accuracy. An individual's home location can be ascertained with an accuracy of over 0.8 from as few as 0.6 to 1.4 tweets/day, or 75 to 225 tweets. Our results shed light on how accurately home location can be predicted and how long data needs to be collected. On the flip side, our results point to potential privacy issues with publicly available social media data.
The following image shows the distribution of inter-tweet times for different Twitter user groups, grouped by number of tweets.
As can be seen on the log-log scale, the data for all groups follow a log-normal-like distribution for periods of up to 24 hours.
Further, these groups tend to have differently shaped tails, resembling power-law distributions with different exponents.
While the log-normal-looking side of the graph has very similar shapes across groups, the tails show that Twitter users who post more frequently have shorter inter-tweet times.
Individuals tend to revisit places that they have previously visited, such as home or work locations. Moreover, these visits are periodic (see Gonzalez, Hidalgo, and Barabasi). The following image visualizes the periodic visiting behavior of Twitter users from Washington, DC. The blue dotted line shows the probability of visiting the same location again after a given number of hours, also known as the first passage time of a place. Even with this voluntarily shared Twitter data, the periodic visiting behavior is clearly present. Periodicity appears at 24-hour intervals. The red line shows what the probability distribution would be if individuals visited places randomly. In other words, this graph shows that we are not random at all, at least when it comes to mobility.
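As a rough illustration of how such a return-time curve could be computed, the sketch below measures the hours elapsed between successive observations of one user at the same place and turns them into an empirical probability. The function names and the input format (time-sorted `(hour, location_id)` pairs) are assumptions made for this example, and the definition of a "return" is one simple reading rather than the exact one used for the figure.

```python
from collections import Counter


def first_passage_times(visits):
    """Hours elapsed between successive observations at the same location.

    `visits` is a time-sorted list of (hour, location_id) pairs for one user
    (a hypothetical input format chosen for this sketch).
    """
    last_seen = {}
    gaps = []
    for hour, location in visits:
        if location in last_seen:
            gaps.append(hour - last_seen[location])
        last_seen[location] = hour
    return gaps


def return_probability(gaps):
    """Empirical probability of each observed return time."""
    counts = Counter(gaps)
    total = sum(counts.values())
    return {gap: count / total for gap, count in sorted(counts.items())}


# Toy user who alternates between home and work on a daily rhythm.
visits = [(0, "home"), (9, "work"), (24, "home"), (33, "work"), (48, "home")]
gaps = first_passage_times(visits)
print(return_probability(gaps))
```

With real data, peaks of this distribution near multiples of 24 hours would correspond to the periodicity described above.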
Zipf's law, in general terms, states that the frequency of a quantity is inversely proportional to its rank. Applied to Twitter data, the following graph shows that Zipf's law holds in geo-located Twitter data for Washington, DC regardless of the number of unique locations a person visits.
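A minimal sketch of the rank-frequency check behind such a plot is given below. It ranks one person's locations by visit count and estimates the exponent with an ordinary least-squares fit in log-log space. The function names and the toy data are assumptions for illustration, and the least-squares fit is a simple stand-in rather than the procedure used for the graph.

```python
import math
from collections import Counter


def rank_frequency(location_ids):
    """Return (rank, relative frequency) pairs, most visited location first."""
    counts = Counter(location_ids)
    total = len(location_ids)
    ranked = sorted(counts.values(), reverse=True)
    return [(rank, count / total) for rank, count in enumerate(ranked, start=1)]


def zipf_exponent(rank_freq):
    """Least-squares slope of log(frequency) against log(rank), sign flipped.

    A value near 1 means frequency is roughly inversely proportional to rank.
    Requires at least two ranks.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy visit log whose counts (6, 3, 2) are exactly proportional to 1/rank.
visits = ["home"] * 6 + ["work"] * 3 + ["gym"] * 2
pairs = rank_frequency(visits)
print(pairs)
print(zipf_exponent(pairs))
```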
This model relies on Twitter data to understand how people move between attractions. Attraction visits are extracted from a person's tweet location and the venues close to it. Venues are gathered from Google's Places API by scanning map locations that cover the whole Washington, DC area. A person's proximity to an attraction is the main factor in determining whether that attraction was visited. Below, you can see a network of attractions built from individuals' same-day visits. Link weight indicates the frequency of hops between places, while node intensity indicates the number of visits.
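The network construction described above can be sketched roughly as follows. The sketch assumes that visits have already been matched to attractions and arrive as `(user_id, day, time, attraction)` records; that record layout, the function name, and the choice to treat hops as undirected are all assumptions for this example.

```python
from collections import Counter, defaultdict


def build_attraction_network(visits):
    """Build node and edge weights from same-day attraction visits.

    `visits` is an iterable of (user_id, day, time, attraction) records
    (a hypothetical layout). Returns (node_visits, edge_weights), where
    edge keys are alphabetically sorted attraction pairs (undirected hops).
    """
    node_visits = Counter()
    by_user_day = defaultdict(list)
    for user, day, time, attraction in visits:
        node_visits[attraction] += 1
        by_user_day[(user, day)].append((time, attraction))

    edge_weights = Counter()
    for sequence in by_user_day.values():
        sequence.sort()  # order each person's day chronologically
        for (_, origin), (_, destination) in zip(sequence, sequence[1:]):
            if origin != destination:
                edge_weights[tuple(sorted((origin, destination)))] += 1
    return node_visits, edge_weights


# Toy records: two users hop between the same pair of attractions.
records = [
    ("u1", "d1", 10, "Zoo"),
    ("u1", "d1", 14, "Museum"),
    ("u2", "d1", 9, "Museum"),
    ("u2", "d1", 12, "Zoo"),
    ("u2", "d2", 9, "Zoo"),
]
nodes, edges = build_attraction_network(records)
print(nodes)
print(edges)
```

Keeping direction (using `(origin, destination)` as the key without sorting) would be an equally reasonable choice if the order of visits matters.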
Data-Driven Modeling of Agents
We have recently witnessed the proliferation of large-scale behavioral data that can be used to empirically develop agent-based models (ABMs). Despite this opportunity, the literature has neglected to offer a structured agent-based modeling approach to produce agents or their parts directly from data. In this paper, we present initial steps towards an agent-based modeling approach that focuses on individual-level data to generate agent behavioral rules and initialize agent attribute values. We present a structured way to integrate Big Data and machine learning techniques at the individual agent level. We also describe a conceptual use case study of an urban mobility simulation driven by millions of geo-tagged Twitter messages. We believe our approach will advance the state of the art in developing empirical ABMs and conducting their validation. Further work is needed to assess data suitability, to compare with other approaches, to standardize data collection, and to serve all these features in near-real time.
This study revisits a Wi-Fi malware spread model by Hu et al. [2009, PNAS, 106(5)] with current Wi-Fi router data from WiGLE.net and a refined data selection method. We examine the temporality and scale of the malware spread after applying these two updates. Despite a WPA adoption rate of ≈88%, we see rapid malware spread within a week, infecting ≈34% of all insecure routers (≈5.4% of all routers) after two weeks. This result is significantly higher than the projection of the original study. It stems from the increased use of Wi-Fi routers, which produces a more tightly connected graph. We argue that this projected risk can increase further when recently introduced vulnerabilities and connected devices are considered. Ultimately, thorough consideration is needed to assess cybersecurity risks in the Wi-Fi ecosystem and to evaluate interventions to stop epidemics.
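To make the role of graph connectivity concrete, here is a deliberately simplified susceptible-infected sketch. It is not the model of Hu et al. or the one used in this study: routers are hypothetical `(x, y, is_insecure)` tuples with coordinates in meters, any two routers within an assumed `radius` are linked, and in each step every infected router infects all of its insecure neighbors.

```python
import math


def proximity_graph(routers, radius):
    """Link every pair of routers whose distance is at most `radius`."""
    neighbors = {i: set() for i in range(len(routers))}
    for i in range(len(routers)):
        for j in range(i + 1, len(routers)):
            if math.dist(routers[i][:2], routers[j][:2]) <= radius:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def spread(routers, neighbors, seed, steps):
    """Worst-case spread: infected routers infect all insecure neighbors.

    `seed` is assumed to be the index of an insecure router.
    """
    infected = {seed}
    frontier = {seed}
    for _ in range(steps):
        newly_infected = {
            j
            for i in frontier
            for j in neighbors[i]
            if routers[j][2] and j not in infected
        }
        if not newly_infected:
            break
        infected |= newly_infected
        frontier = newly_infected
    return infected


# Toy street of four routers; the third one is secured and blocks the chain.
routers = [(0, 0, True), (30, 0, True), (60, 0, False), (90, 0, True)]
graph = proximity_graph(routers, radius=45)
infected = spread(routers, graph, seed=0, steps=10)
insecure_total = sum(1 for r in routers if r[2])
print(sorted(infected), len(infected) / insecure_total)
```

In this toy layout the secured router isolates the last insecure one; adding more routers within range of each other removes such gaps, which is the intuition behind a denser graph allowing wider spread.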
This study summarizes current research efforts on using social media data to provide empirical grounding for agent-based simulations. Three examples of how data from social media can be used in agent-based modeling are presented: 1) using large data set processing and sentiment analysis to identify preferences of a population (initialization of an agent population), 2) using agents with machine learning capabilities to learn mobility patterns from individuals in a population (initialization of individual agents in a population), and 3) identifying preferences and communication patterns based on graph analysis (agent relation). Current research indicates that these techniques show promise for creating smart agents to complement those based on complex rule-based behavior, especially using a simulation's what-if capabilities.
Simulation Data Analytics
Urban life is a complex phenomenon affected by human preferences, human behavior, and urban geography, among other factors. Agent-based models allow us to study urban life from a bottom-up perspective by capturing individuals, their actions, and interactions. In this study, we report our development of an agent-based model that simulates the patterns of urban life including daily commutes and recreational activities. We base our model on well-known theories of human behavior. We show that our model re-creates stylized facts about movement patterns and social network degree distributions. Such a model opens the door to study urban phenomena such as housing market fluctuations.
Verification and Validation (V&V) is one of the main processes in simulation development and is essential for increasing the credibility of simulations. Due to the extensive time requirement and the lack of common V&V practices, simulation projects often conduct ad-hoc V&V checks using informal methods. In this study, we propose a novel Verification and Validation platform that can handle large-scale simulation output data and allows tests to be conducted on such data. The platform relies on a seamless integration of web technologies, data management, discovery and analysis techniques pertaining to V&V, and cloud computing. A proof-of-concept implementation that automatically makes simulation results available for V&V tests is under development. We believe that this data platform will be an indispensable tool for modelers, from novice to expert, in evaluating and conveying the credibility of their simulations.
Verification and validation (V&V) techniques commonly require modelers to collect and statistically analyze large amounts of data which require specific methods for ordering, filtering, or converting data points. Modelers need simple, intuitive, and efficient techniques for gaining insight into unexpected behaviors to help in determining if these behaviors are errors or if they are artifacts resulting from the model's specifications. We present an approach to begin addressing this need by applying heat maps and spatial plots to visually observe unexpected behaviors within agent-based models. Our approach requires the modeler to specify hypotheses about expected model behavior. Agent level outputs of interest are then used to create graphical displays to visually test the hypotheses. Visual identification of unexpected behaviors can direct focus for additional V&V efforts and inform the selection process of follow-on V&V techniques. We apply our approach to a model of obesity.
Past Research Projects
Cloudes is a cloud-based discrete-event simulation development tool that runs entirely in the browser on the front end and on cloud-based infrastructure on the back end. I designed the initial software architecture in 2013. A master’s student from the Computer Science Department at ODU helped build the initial interface. Later, Anthony M. Barraco took the lead on development and made significant improvements to the project. This project is active and led by Dr. Jose J Padilla. I still contribute to different parts of the project. Dr. Saikou Y. Diallo and Chris J. Lynch are other members of the team. You can test the tool at cloudes.me.
Simulation of Cybersecurity
- Current Status and Future Challenges
- Assessing the Impact of Cyberloafing on Cyber Risk
- Towards Modeling Factors that Enable an Attacker
- A characterization of cybersecurity simulation scenarios
M&S Cube is a smartphone and tablet app that serves as a gentle introduction to the emerging field of modeling and simulation. I developed the first version of the iPad app in 2012 and ported the app to the iPhone platform in 2013. Other contributors are Anthony M. Barraco, who developed the second version of the iPad app and the Android version, and Anitam, who helped port the app to the iPhone platform. The project was led by Dr. Jose J Padilla and Dr. Saikou Y. Diallo. You can download the app using the links below.
Web-based simulations and tools