开发者

Visited pages for particular user from log file

开发者 https://www.devze.com 2023-04-12 05:39 出处:网络
How to find visited pages for a particular user from a big log file that contains list of sessionId and PageId combination in each separate line?

How to find visited pages for a particular user from a big log file that contains list of sessionId and PageId combination in each separate line?

File is big enough not to fi开发者_JAVA技巧t in memory. It means find out page that is being visited most in same session(user).

for e.g.

My file is (order is sessionId, PageID)

usera  page1
userb  page2
userb  page1
usera  page3
....

It should print

usera visits page1 most followed by page3.

If the number of pages visited is equal, it is up to you how to handle the case (Can print both, or can print any one of them)

Which data structure/algorithm will you use for this? Since this is an interview-question, efficient algorithm/data structure would be appreciated. The interviewer did not specify what order of algorithm he was looking for.

I came up with std::map<string,std::pair<string,int> > solution. The interviewer asked if I can do anything better than this or if the key set is so large it won't be efficiently handled by map, what should be done?


I think the first step would be to remove all "non-usera" lines since you're doing per-user parsing. This would be a one-time job separating all users into different files. After that you can do a line-by line analysis keeping only a couple of lines in "history". You can do this using a simple line parser without having to store the whole file in-memory.

If it's going to be need something like a data structure necessarily, you might want to look into map-reduce paradigm -- Hadoop would be ideal for files on the scale of 10GB +.


As for me, I see that the keyword is: same session. In this case, what we need to do is not to read the entire log file, instead try to make a guess for this particular session.

We need to remember, that a session will be alive for specific period of time. The timing is set by the server, e.g. 20 minutes. So, after we find out when the user is logged in, what we need to do is to get the pointer to the specific location on the file on when the user logged out or last active session.


You can first think of sorting the file using external sort (break the file into chunks that fit into the memory and sort each chunk and merge them back ) you can break the sorted file again into same number of chunks but keeping track of range each chunk corresponds to. With this, a binary search can be done to find the relevant chunk, load it and search for any user.

Since it is a sorted file, similar user entries will be consecutive, infact one can also count the occurences during merging of chunks and write the value appended to the line, for later use.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号