Bike UAE
June 2025
This summer I’ve been assisting Alexander Christou, a Transportation Expert with the Center for Interacting Urban Networks @ NYU Abu Dhabi, with a research project tackling ‘Micro Mobility Usage & Challenges in the UAE’. The goal of this research is to assist the people and governments throughout the United Arab Emirates in taking back their cities from cars, and to help them make educated decisions that improve mobility within each community.
The UAE is set to spend millions of dollars in the next few years creating and improving micro mobility infrastructure. With a plan to build 1200+ km of bike lanes by 2028, it is our job to make sure they’re implemented in the most effective way possible.
Throughout this project, we’ll be focusing on micro mobility, which at its core is simply traveling from one place to another using small forms of transportation (i.e. bikes, scooters, skateboards, etc.).
I’ve spent the first half of this internship setting the foundation for some great analysis. I started by processing and cleaning the survey data collected this spring, and began drawing some early insights from the provided ‘Strava Metro’ data.
Next came the online questionnaire that Mr. Christou put together. The survey had more than 1,000 respondents, all UAE residents aged 18+ who used some form of micro mobility. Responses were collected from April to May of this year, with the survey distributed UAE-wide by Wolfi’s. The data was collected and stored with Qualtrics, an online service for collecting and analyzing survey results. You can see the format of the Qualtrics data in the following dashboard.
Survey Data through Qualtrics
The questionnaire was complex, in that it used almost every type of response possible, including multiple choice, text entry, multi select, numeric input, etc. It also featured nested questions, which make sense for the respondent and the writer but make understanding the data challenging. Therefore, my first step was to rewrite the survey in a road map style, helping me understand which questions were asked in what order. This resulted in a list of 100+ variables of varying types, with a complete response requiring 38 questions to be answered.
Before the coding started I knew I needed to be extremely organized throughout the development process. I brushed up on my GitHub fundamentals and ran through some practice projects for a technical refresh. A tool like this is crucial to the safety and completeness of your data analysis, preventing the loss of work and/or files. The repository is public so anyone can see the work I’ve completed so far here.
Now that I had established a safe and reliable place for managing my work, I was ready to start the first step in data analysis: cleaning. From Qualtrics I downloaded the .csv file of the survey responses. Below you can see the raw output that was produced.
The file included 100+ variables with little to no organization as the default output. I knew I would need to clean up inputs, column names, and a multitude of other issues it had.
The first and easiest step was to trim the fat and drop some columns that were not needed. Empty columns were selected and removed, including a nested question tree on ‘Skateboarding’ that seems to have been hidden from the survey entirely, or at least not a single person selected it.
Next came the removal of unnecessary context columns that won’t be helpful in analysis. These included items like StartDate, ResponseID, DistributionChannel, etc. This data isn’t gone, but it won’t be helpful for what we’re trying to accomplish.
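Roughly, those two trimming steps might look something like this in pandas (a sketch only; the file name and dataframe setup are placeholders, not the repository code):

```python
import pandas as pd

# Load the raw Qualtrics export (placeholder file name)
df = pd.read_csv("survey_responses.csv")

# Drop columns no respondent answered, e.g. the hidden 'Skateboarding' question tree
df = df.dropna(axis=1, how="all")

# Drop Qualtrics metadata columns that won't be used in the analysis
metadata_cols = ["StartDate", "ResponseID", "DistributionChannel"]
df = df.drop(columns=[c for c in metadata_cols if c in df.columns])
```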
Then, using the road map I had created earlier, I renamed all the columns to help with readability and cut down on lengthy or confusing names. For example, Q11 isn’t helpful as a column name without referencing further documentation, but ‘Emirate’, representing which Emirate the user resides in, is far more effective. Other examples include column names that are simply the question asked in its entirety: ‘Pedal power! Based on your previous response, you don't need a motor to get places. Why do you cycle? (select all that apply)’ can simply be written as ‘Bike-Why’. This isn’t the fault of Mr. Christou; it is simply how the data was formatted when it was exported.
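The renaming itself is just a mapping from old names to new ones; a small illustrative sketch (only two of the 100+ entries shown):

```python
# Map cryptic or verbose Qualtrics names to readable ones (illustrative subset)
rename_map = {
    "Q11": "Emirate",
    "Pedal power! Based on your previous response, you don't need a motor to get places. "
    "Why do you cycle? (select all that apply)": "Bike-Why",
}
df = df.rename(columns=rename_map)
```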
A common source of error in survey data is spam responses. Upon completion, participants could enter a giveaway, so there are almost certainly going to be individuals who click through the survey quickly without reading any of the questions.
An established way to deal with this is to check whether a user’s responses are all extreme in one direction or the other and flag the row as potential ‘fraud’. These questions, however, aren’t suited to such a method, but thankfully we do have percent-completion and duration variables to work with.
Before removing spam responses, let’s take a look at the distribution of durations taken in the survey. Below you can find a boxplot (a representation of the data’s quartiles) of the durations as they come straight from Qualtrics.
Boxplot of raw data.
This is impossible to read due to a few extremely long durations (likely respondents who started the survey and left it open). These will be fine for feature analysis, but for visualization let’s temporarily remove the outliers so we can see the data more clearly.
Boxplot of narrowed data.
That’s better; now we can see a right skew in the data with a couple of outliers on the far right. Let’s plot this narrowed data with a histogram to get a better grasp of what we’re dealing with.
Narrowed Survey Duration Histogram
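For reference, the narrowing used in these views could be something as simple as a standard IQR fence; this is an assumption on my part, and the outliers are only hidden for plotting, not removed from the data:

```python
import matplotlib.pyplot as plt

# Hide extreme durations for plotting only, using a 1.5 * IQR fence
durations = df["Duration (in seconds)"]  # Qualtrics' default duration column
q1, q3 = durations.quantile([0.25, 0.75])
fence = q3 + 1.5 * (q3 - q1)
narrowed = durations[durations <= fence]

narrowed.plot(kind="hist", bins=40)
plt.xlabel("Survey duration (seconds)")
plt.show()
```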
With 1000+ responses, we’d expect something similar to a normal distribution with a single hump in the middle (with perhaps a slight right skew to account for slow/afk folks). But above we can see a second hump in the data, with a large percentage of responses happening in under 2 minutes. With 38 questions, some with significant text, these have to be spam responses.
To solve this, I wrote a function to handle these answers we don’t want to use. We need a way to mathematically sort out those who simply click through the survey without reading. These values can be tweaked later, but for now I set:
Minimum Questions for 100% Completion = 38
Minimum Seconds per Question = 5
Now with those constants established, we can write a function of progress and duration to figure out who submitted a spam response. This is done by comparing the logged duration to the minimum total time to complete the survey multiplied by the percent completed.
Code Snippet to handle spam responses
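Since the snippet above is shown as an image, here is a rough sketch of the same idea; the ‘Progress’ and ‘Duration (in seconds)’ names follow the Qualtrics defaults, and the helper name is my own:

```python
MIN_QUESTIONS = 38            # questions required for 100% completion
MIN_SECONDS_PER_QUESTION = 5  # minimum believable time per question

def is_spam(row):
    # Scale the minimum total time by how much of the survey was completed
    questions_answered = MIN_QUESTIONS * (row["Progress"] / 100)
    min_duration = questions_answered * MIN_SECONDS_PER_QUESTION
    return row["Duration (in seconds)"] < min_duration

# Keep only the responses that took a believable amount of time
df = df[~df.apply(is_spam, axis=1)]
```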
If the user submitted a response too fast, then we filter it out and we’re left with just the honest answers. To avoid false positives, I’ve set the minimum time per question at only 5 seconds to account for fast readers. After running the filter, you can see below a much more reasonable distribution of durations.
Filtered Survey Duration Histogram
There is still a slight hump present under 2 minutes, but a portion of these are incomplete responses; if someone only answered 4 questions and took over 20 seconds, that is fine for our purposes. Some spam may still have slipped through, but this is a great improvement regardless.
Once I had done some basic cleaning and filtering, I thought I was ready to jump into some analysis. I started with an ANOVA in an attempt to see if Gender affected the type of bikes owned. However, I quickly realized an issue in how the data was collected/formatted.
The survey had many multi-select questions (ex: ‘Select which types of bikes you own: Road, BMX, Mountain, etc.’), which are great for collecting data by essentially asking six questions in one, but they make analysis messy. Rows become cluttered with similar yet hard-to-work-with responses. For example:
Which months do you ride in?
January, February, April, May
vs
February, March, April
vs
January, March, April, May
All of these responses differ only slightly but show up as a different kind of response for each combination of months. For a question like this with 12 months, there are 2^12 (over 4,000) theoretical combinations a response could take! This was a serious problem that I had never dealt with before in raw data. The solution was to write a function allowing me to expand these questions into multiple boolean variables that are simply true or false depending on whether the user selected that option. For example, months cycled become 12 new boolean columns (January, February, etc.) and the original column is dropped.
Transform Multiselect Column Function Code Snippet
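The core of that transformation can be sketched with pandas’ string dummies; the function and column names here are illustrative rather than the repository’s exact code:

```python
def expand_multiselect(df, column, prefix, sep=","):
    # One boolean column per answer option; True if the respondent selected it
    dummies = df[column].str.get_dummies(sep=sep).astype(bool)
    dummies = dummies.add_prefix(f"{prefix}-")
    # Drop the original comma-separated column and attach the new booleans
    return df.drop(columns=[column]).join(dummies)

# e.g. the months-cycled question becomes 12 boolean columns (hypothetical column name)
df = expand_multiselect(df, "Bike-Months", prefix="Month")
```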
This solved my problem, but I now had 100+ extra variables to deal with and organize after running the function for each multi-select feature. Once again referencing the road map I made earlier (it pays to be thorough), I renamed each newly created variable with a more digestible name. ‘A reliable and trustworthy place for maintenance-repairs-and upgrades’ under cycler shop benefits can simply be written as ‘CSB-ReliableMaintence’.
Renaming Multiselect Columns Code Snippet
When I first realized several questions were nested, I didn’t think about how this would affect the column count. For example, after being asked which Emirate they reside in, users were asked to rate how safe they felt riding. This results in one column with their answer (ex: Abu Dhabi-3) and 8 empty columns with N/A values.
This is messy and bloated, so I wrote a function to consolidate columns that could be merged. Now, instead of having 9 columns for the Emirate safety rating, there is only one, and you can still see the user’s zone in the ‘Emirate’ column.
Merge Compatible String Columns Code Snippet
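Since that snippet is also an image, here is a simplified sketch of the coalescing idea; the real function additionally works out which differently named columns belong together, whereas here they are passed in explicitly:

```python
def merge_columns(df, source_cols, new_name):
    # Each respondent answered at most one of source_cols, so taking the first
    # non-null value left-to-right collapses them into a single column
    df[new_name] = df[source_cols].bfill(axis=1).iloc[:, 0]
    return df.drop(columns=source_cols)

# e.g. nine per-Emirate safety-rating columns collapse into one (hypothetical names)
safety_cols = [c for c in df.columns if c.endswith("-Safety")]
df = merge_columns(df, safety_cols, "Safety")
```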
This consolidation is the more complicated of the two merges, since it has to account for column names that differ depending on previous responses; in addition to it, I wrote a function to merge duplicate columns.
For example, each Emirate had columns for months ridden in, so these could simply be consolidated by matching identical names.
Merge Duplicate Columns Code Snippet
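The duplicate-name case is simpler; a sketch of how it could work:

```python
import pandas as pd

def merge_duplicate_columns(df):
    merged = {}
    for name in df.columns.unique():
        block = df.loc[:, df.columns == name]
        # Take the first non-null value across the identically named columns
        merged[name] = block.bfill(axis=1).iloc[:, 0]
    return pd.DataFrame(merged)

df = merge_duplicate_columns(df)
```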
We’re almost done! At this point the data is becoming great to work with, but it is still hard to understand in table form. After all this cleaning, and due to the messy export itself, the output data frame had no structure and was incredibly difficult to follow.
To solve this, I just reordered the columns into logical and readable sections. This has made things much easier when it comes to viewing neighboring sections in the data.
Reorder Columns Code Snippet
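Conceptually the reorder is just selecting the columns in a new order; a sketch with hypothetical section names:

```python
# Hypothetical section groupings; the real ordering follows the cleaned data's layout
sections = {
    "Demographics": ["Age", "Gender", "Emirate"],
    "Cycling": ["Bike-Why", "Safety"],
}
ordered = [col for group in sections.values() for col in group if col in df.columns]
# Keep anything not listed at the end so no column is silently dropped
remaining = [c for c in df.columns if c not in ordered]
df = df[ordered + remaining]
```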
Finally, we have not only clean but extremely well formatted data. After all that processing there are now nearly 200 variables we can use. This took a lot of time and effort, but I’ve learned that it is usually worth it to be kind to your future self and put some work in early.
The only remaining kind of data processing I can think to do would be handling the ‘Other’ responses. A decent number of them are just people who did not read the answer options and wrote in an answer that was already listed. Some people input slight variations of already-present answers; for example, ‘Track Riding’ and ‘Circuit Cycling’ are essentially the same thing.
There are not a lot of these, however, so I will tackle them at a later date. I would still like to handle the unique ‘Other’ responses, though. I think my approach will be based around using word clouds to establish which phrases/responses were the most common; I could go through them manually, but that would be poor practice.
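One possible starting point would be the wordcloud package; this is just an idea at this stage, and the column name below is hypothetical:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Pool the free-text 'Other' responses for one question and surface the common phrases
text = " ".join(df["Bike-Why-Other"].dropna())  # hypothetical column name
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```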
I also need to consult with Mr. Christou about several responses that were in Arabic, which I would very much appreciate help translating.
Once again, I want this work to be seen by more than just me and Xander, so I’ve written a companion document that lists every feature in the cleaned data (organized by similar variables) along with its data type, the factors for single-select multiple choice questions, and a short description. Each item also appears in the exact order that you’ll find in the cleaned excel file.
This document, in addition to the readme on GitHub, should allow anyone to look over the data and analysis for their own use. I also went this far for my future self, because there is nothing more frustrating than forgetting why you did something you once understood.
Basic Visualization
After all that processing, let’s take a look at just some of what we have to work with. I plan to get into more complex visualization, but for now this is sufficient for understanding who took this survey: some wealthier millennials, apparently.
There is certainly some selection bias, with the sampling frame only being those with access to the survey, but we can still get a decent picture of who uses micro mobility in the UAE.
Gender Distribution of Riders
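A chart like the one above comes almost for free from the cleaned frame; a quick sketch (column name assumed):

```python
import matplotlib.pyplot as plt

# Bar chart of respondent gender (column name assumed from the cleaned data)
df["Gender"].value_counts().plot(kind="bar")
plt.ylabel("Respondents")
plt.title("Gender Distribution of Riders")
plt.show()
```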
I plan to do all sorts of data analysis in July, but the GIS and data processing took longer than expected, in addition to getting access to the data. So for now, here is just a quick contingency table to show what direction I’d like to go in for the analysis.
Gender Analysis Code Snippet
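Since the snippet is shown as an image, here is the general shape of the test; the ‘Gender’ and ‘OwnsBike’ column names are my assumptions:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Cross-tabulate gender against bike ownership and test for independence
table = pd.crosstab(df["Gender"], df["OwnsBike"])
chi2, p, dof, expected = chi2_contingency(table)

print(f"Chi-Square Statistic: {chi2}")
print(f"p-value: {p}")
```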
Differences in cycling between genders are important, because if only men bike then the system is failing. Especially in a country where it is less common for women to drive, alternative forms of transport are crucial.
Gender vs Owning a Bike Contingency Table
Chi-Square Statistic: 37.77868517966452
p-value: 6.258391774749355e-09
From this chi-squared contingency table, we can see a statistically significant difference in bike ownership between women and men. A chi-square statistic of nearly 38 and a p-value of 0.0000000063 all but guarantee an imbalance.
With a cleaned and organized data set, I plan to do all sorts of analysis in July, building models and random forests to get to the bottom of why residents of the UAE do and don’t ride their bikes. With this mathematical backing, the survey becomes a truly citable and reliable justification for better and more efficient bike infrastructure.