Notes: tw
Revision as of 18:31, 17 March 2014
(One intermediate revision by the same user not shown)
Line 215:
Line 215:
User sessions on Wikipedia meet this definition, because they can be seen as a user coming to Wikipedia in their first request (a B-action), browsing successive pages on the site through internal links (M-actions) and either ending their session by closing the tab or by navigating to a site outside the Wikimedia ecosystem (an E-action). Lea's definition will thus be used.
−
Identifying a session within a dataset is more difficult. Existing papers and work on web sessions fall largely into two camps. Jansen & Spink's work uses an arbitrary "episode", or "session", which is "a period from the first recorded time stamp to the last recorded time stamp...from a particular [user] on a particular day".<ref name = "Jansen2006"/> Eickhoff et al (2014),<ref name = "Eickhoff"/> on the other hand, draw boundaries after a 30 minute period of inactivity for that user. Both of these identification methods have the problem of being arbitrary, and seem to be measuring something that doesn't resemble what we'd consider a "session"; Jansen's method is measuring by-day user activity and Eickhoff's seems completely arbitrary. In fact, it ''is'' completely arbitrary, although it has some pedigree behind it,<ref>The idea of a 30 minute session stems from a 1994 paper that claimed to find a 25.5 minute timeout. See {{cite journal|last=Catledge|first=L.|coauthors=Pitkow, J.|date=1995|title=Characterizing browsing strategies in the world-wide web|journal=Proceedings of the Third International World-Wide Web Conference on Technology, tools and applications|volume=27}}</ref> and Jones & Klinkner (2008)<ref name = "Rosie"/> found that "this threshold is no better than random for identifying boundaries between user search tasks". Indeed,
Montgomery & Faloutsos (2001)<ref>{{cite journal|last=Montgomery|first=A.L.|coauthors=Faloutsos, C.|date= July 2001 |journal=Computer|publisher=IEEE|volume=34|issue=7|issn=0018-9162|title=Identifying Web browsing trends and patterns}}</ref> tested multiple different arbitrary cutoffs, and found none that were reliable, something Jones & Klinkner further validated.
+
Identifying a session within a dataset is more difficult. Existing papers and work on web sessions fall largely into two camps. Jansen & Spink's work uses an arbitrary "episode", or "session", which is "a period from the first recorded time stamp to the last recorded time stamp...from a particular [user] on a particular day".<ref name = "Jansen2006"/> Eickhoff et al. (2014),<ref name = "Eickhoff"/> on the other hand, draw boundaries after a 30-minute period of inactivity for that user. Both identification methods are arbitrary, and neither seems to measure what we'd consider a "session": Jansen's method measures by-day user activity, and Eickhoff's cutoff seems completely arbitrary. In fact, it ''is'' completely arbitrary, although it has some pedigree behind it,<ref>The idea of a 30-minute session stems from a 1995 paper that claimed to find a 25.5-minute timeout. See {{cite journal|last=Catledge|first=L.|coauthors=Pitkow, J.|date=1995|title=Characterizing browsing strategies in the world-wide web|journal=Proceedings of the Third International World-Wide Web Conference on Technology, tools and applications|volume=27}}</ref> and Jones & Klinkner (2008)<ref name = "Rosie"/> found that "this threshold is no better than random for identifying boundaries between user search tasks". Indeed, Jones & Klinkner tested multiple different arbitrary cutoffs, and found none that were reliable.
So, 24-hour 'episodes' don't tell us what we want, and arbitrary inactivity cutoffs don't either. What's left?
+
+
Mehrzadi & Feitelson (2012)<ref>{{cite book|last=Mehrzadi|first=David|coauthors=Feitelson, Dror G.|title=Proceedings of the 5th Annual International Systems and Storage Conference|publisher=ACM|date=2012|series=SYSTOR '12|chapter=On Extracting Session Data from Activity Logs|isbn=978-1-4503-1448-0}}</ref> investigated a variety of ways of identifying sessions, including the arbitrary cutoffs described above. They also considered the intervals between successive actions. The assumption is that a unique user would not engage in multiple sessions at the same time, so there would be a gap between one session and the next. As a result, if we take the time between successive requests for each client and aggregate these intervals across clients, we can expect to see a 'drop' at the point by which most or all users have ended their sessions. Using this data, we identify a local minimum - the first point at which the number of requests drops particularly low - and use that as the cutoff for a 'session'. Mehrzadi & Feitelson actually found no clear dropoff in their example dataset, but Halfaker & Geiger used the approach successfully with a dataset of Wikipedia ''editors'', so we can at least experiment with it.
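The inter-request approach can be sketched roughly as follows. This is a minimal illustration, not the method used by any of the cited authors: the function names, the <code>(client_id, timestamp)</code> input shape, and the fixed 60-second bin width are all assumptions made for the example, and a real dataset would probably need smoothing or logarithmic bins before a minimum is meaningful.

```python
from collections import Counter


def inter_request_gaps(events):
    """Return the gaps (in seconds) between successive requests per client.

    `events` is an iterable of (client_id, timestamp_seconds) pairs;
    this input shape is an assumption for the sketch.
    """
    by_client = {}
    for client, ts in events:
        by_client.setdefault(client, []).append(ts)
    gaps = []
    for stamps in by_client.values():
        stamps.sort()
        gaps.extend(b - a for a, b in zip(stamps, stamps[1:]))
    return gaps


def first_local_minimum_cutoff(gaps, bin_seconds=60):
    """Histogram the gaps and return the lower edge of the first local minimum.

    Returns None when the histogram never dips and rises again, i.e. when
    there is no clear dropoff to use as a cutoff.
    """
    if not gaps:
        return None
    counts = Counter(int(g // bin_seconds) for g in gaps)
    n_bins = max(counts) + 1
    hist = [counts.get(i, 0) for i in range(n_bins)]
    i = 1
    while i < n_bins - 1:
        if hist[i] < hist[i - 1]:
            # Walk across any flat stretch at this level.
            j = i
            while j < n_bins - 1 and hist[j + 1] == hist[i]:
                j += 1
            # It is a minimum only if the counts rise again afterwards.
            if j < n_bins - 1 and hist[j + 1] > hist[i]:
                return i * bin_seconds
            i = j + 1
        else:
            i += 1
    return None


def split_sessions(timestamps, cutoff_seconds):
    """Group one client's timestamps into sessions using the cutoff."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < cutoff_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

If <code>first_local_minimum_cutoff</code> returns <code>None</code>, the data shows no usable dropoff at that bin width, which is the situation Mehrzadi & Feitelson reported for their example dataset.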
+
===Results===
Line 361:
Line 364:
==Notes==
−
{{reflist}}
+
{{reflist|2}}