I'm trying to select a maximum of 10 related articles, where a related article is an article that has 3 or more of the same keywords as the other article.
My table structure is as follows:
articles[id, title, content, time]
tags[id, tag]
articles_tags[article_id, tag_id]
Can I select the related articles' id and title all in one query?
Any help is greatly appreciated.
Assuming that title is also unique:
SELECT fA.ID, fA.Title
FROM
    articles bA,
    articles_tags bAT,
    articles_tags fAT,
    articles fA
WHERE
    bA.title = 'some name' AND
    bA.id = bAT.Article_Id AND
    bAT.Tag_ID = fAT.Tag_ID AND
    fAT.Article_ID = fA.ID AND
    fA.title != 'some name'
GROUP BY
    fA.ID, fA.Title
HAVING
    COUNT(*) >= 3
Where to exclude the 'seed' article
Because I don't care exactly WHICH tags I match on, just THAT I match on at least 3 tags, I only need tag_id and can avoid the join to the tags table completely. So I join the many-to-many table to itself to find the articles which have an overlap.
The problem is that the article will match itself 100% so we need to eliminate that from the results.
You can exclude that record in 3 ways. You can filter it from the table before joining, you can have it fall out of the join, or you can filter it when you're finished.
If you eliminate it before you begin the join, you're not gaining much of an advantage. You've got thousands or millions of articles and you're only eliminating 1. I also believe this will not be useful given the best index for the articles_tags mapping table.
If you do it as part of the join, the inequality cannot be used to drive the index scan; it is applied as a filter after the index scan.
Consider an index on articles_tags of (Tag_ID, Article_ID). If I join the index to itself on tag_id = tag_id, I immediately define the slice of the index to process by walking the index to each tag_id my 'seed' article has. If I add the clause article_id != article_id, that clause can't use the index to define the slice to be processed, so it will be applied as a filter.
For example, say my first tag is "BLUE". I walk the index to get all the articles which have "BLUE" (by ID, of course). Say there are 50 rows. We know that 1 is my seed article and 49 are matches. If I don't include the inequality, I include all 50 records and move on. If I do include the inequality, I then have to check each of the 50 records to see which is my seed and which isn't. The next tag is "Jupiter" and it matches 20,000 articles. Again I have to check each row in that slice of the index to exclude my seed article.
After I go through this 2, 5, or 20 times (depending on how many tags the seed article has), I have a completely clean set of articles to do the COUNT(*) and HAVING on. If I don't include the inequality as part of my join but instead filter the seed ID out after the GROUP BY and HAVING, then I only apply that filter once, on a very short list.
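To make the "filter the seed out at the end" option concrete, here is a minimal sketch that runs it against toy data. It uses Python's built-in sqlite3 module purely as a stand-in for whatever database the question is about, so it only shows that the result is right, not how any particular optimizer uses the index. The helper names (make_demo_db, related_articles), the index name idx_tag_article and the sample rows are all made up for illustration, and it assumes each (article_id, tag_id) pair appears only once so that COUNT(*) equals the number of shared tags.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with the question's tables and toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT,
                               content TEXT, time TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, tag TEXT);
        CREATE TABLE articles_tags (
            article_id INTEGER NOT NULL,
            tag_id     INTEGER NOT NULL,
            PRIMARY KEY (article_id, tag_id)  -- assumption: no duplicate pairs
        );
        -- Index with tag first, then article (hypothetical name).
        CREATE INDEX idx_tag_article ON articles_tags (tag_id, article_id);
    """)
    conn.executemany(
        "INSERT INTO articles (id, title) VALUES (?, ?)",
        [(1, "seed"), (2, "three shared"), (3, "two shared"), (4, "four shared")],
    )
    conn.executemany(
        "INSERT INTO tags (id, tag) VALUES (?, ?)",
        [(1, "blue"), (2, "jupiter"), (3, "red"), (4, "mars"), (5, "green")],
    )
    tags_by_article = {1: [1, 2, 3, 4], 2: [1, 2, 3], 3: [1, 2], 4: [1, 2, 3, 4, 5]}
    conn.executemany(
        "INSERT INTO articles_tags (article_id, tag_id) VALUES (?, ?)",
        [(a, t) for a, tag_ids in tags_by_article.items() for t in tag_ids],
    )
    return conn


def related_articles(conn, seed_id, min_shared=3, limit=10):
    """Return (id, title) rows sharing at least min_shared tags with seed_id."""
    sql = """
        SELECT fA.id, fA.title
        FROM articles_tags AS bAT
        JOIN articles_tags AS fAT ON fAT.tag_id = bAT.tag_id
        JOIN articles AS fA ON fA.id = fAT.article_id
        WHERE bAT.article_id = ?
        GROUP BY fA.id, fA.title
        -- The seed is dropped here, after grouping, not inside the join.
        HAVING COUNT(*) >= ? AND fA.id <> ?
        ORDER BY COUNT(*) DESC, fA.id
        LIMIT ?
    """
    return conn.execute(sql, (seed_id, min_shared, seed_id, limit)).fetchall()


if __name__ == "__main__":
    print(related_articles(make_demo_db(), seed_id=1))
```

With this toy data, article 3 shares only two tags with the seed and drops out, and the seed itself is removed by the HAVING clause even though it matches all four of its own tags.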
Updated to exclude the searched article itself!
Something along these lines:
select *
from articles
inner join (
    select at2.article_id, COUNT(*) cnt
    from articles a
    inner join articles_tags at on at.article_id = a.id
    # find all matching tags to get the article ids
    inner join articles_tags at2 on at2.tag_id = at.tag_id
        and at2.article_id != at.article_id
    where a.id = 1234 # the base article to find matches for
    group by at2.article_id
    having count(*) >= 3 # at least 3 matching keywords
) matches on matches.article_id = articles.id
order by matches.cnt desc
limit 10; # up to 10 matches required
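A quick way to sanity-check a query of this shape is to run it on a handful of rows. The sketch below does that with Python's built-in sqlite3 module, used only as a convenient stand-in for the real database. The sample rows are invented, the alias at1 replaces at, and the inner join to articles is dropped because the seed id can be matched on articles_tags directly; treat these as illustrative variations rather than the answer's exact query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE articles_tags (article_id INTEGER, tag_id INTEGER,
                                PRIMARY KEY (article_id, tag_id));
    -- Invented sample data: 1234 is the base article.
    INSERT INTO articles VALUES (1234, 'seed'), (1, 'strong match'),
                                (2, 'weak match'), (3, 'exact threshold');
    INSERT INTO articles_tags VALUES
        (1234, 10), (1234, 11), (1234, 12), (1234, 13),
        (1, 10), (1, 11), (1, 12), (1, 13),
        (2, 10), (2, 11),
        (3, 11), (3, 12), (3, 13);
""")

rows = conn.execute("""
    SELECT articles.id, articles.title, matches.cnt
    FROM articles
    JOIN (
        SELECT at2.article_id, COUNT(*) AS cnt
        FROM articles_tags AS at1
        JOIN articles_tags AS at2
          ON at2.tag_id = at1.tag_id
         AND at2.article_id <> at1.article_id  -- seed excluded in the join
        WHERE at1.article_id = ?
        GROUP BY at2.article_id
        HAVING COUNT(*) >= 3                   -- at least 3 shared tags
    ) AS matches ON matches.article_id = articles.id
    ORDER BY matches.cnt DESC
    LIMIT 10
""", (1234,)).fetchall()

for row in rows:
    print(row)
```

Selecting id, title and cnt explicitly instead of * returns just the columns the question asks for, plus the shared-tag count used for ordering.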
If you can write a query to get ids of records that have matches, then you can certainly have that same query return you the titles. If your real question is 'how do I write the query to return the matches?', then please say so and I'll edit this answer with more details along those lines.