How to filter a table based on email address suffix

Question

I have a table of over 100K names and addresses. I would like to filter the table to keep only those emails I think are not spam.

For example, I have addresses such as:

[email protected]
[email protected]
[email protected]

I would now like to filter out the addresses that have only digits before the @ symbol, as well as those that have only digits after the @ but before the suffix .com.

I know I can extract them using str_split and grepl, but I can't fit them into a filter query to remove them from the table.

library(stringr)

pattern <- "[email protected]"
str_split(pattern, '@') # this will split the address at the @ symbol

str_split(string = str_split(pattern, '@')[[1]][2], pattern = "\\.") # this will split the domain name at the dot separating the suffix from the numbers

as.numeric(str_split(string = str_split(pattern, '@')[[1]][2], pattern = "\\.")[[1]][1]) # this checks whether the string extracted above contains only numbers; if not, it returns NA (with a warning)

But how do I combine this in a tidyverse query?

thanks

P.S. I know this is a far-fetched question, but is there some kind of spam filter for email addresses that one can use within R?

Ronak Shah · Accepted Answer · 2023-11-03 09:20:49Z

3

I think this pattern should help you identify the spam emails as per your conditions.

^\\d+@|@\\d+\\.com

To use it in filter you may use grepl or str_detect from stringr.

data %>% filter(grepl('^\\d+@|@\\d+\\.com', email))

To get the rows which are not spam, negate the condition using !.

data %>% filter(!grepl('^\\d+@|@\\d+\\.com', email))

Example:

x <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]')
grepl('^\\d+@|@\\d+\\.com', x)
#[1]  TRUE  TRUE  TRUE FALSE
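As a self-contained sketch of the two filters above, the snippet below builds a small table and applies the same pattern with str_detect. The addresses and the names `data`, `email` and `spam_pattern` are made up for illustration and are not taken from the question.

```r
library(dplyr)
library(stringr)

# Hypothetical addresses, invented for illustration only
data <- data.frame(
  email = c("12345@example.com",  # digits only before the @
            "alice@98765.com",    # digits only between @ and .com
            "bob42@example.com",  # letters and digits: kept
            "carol@mail99.com")   # letters and digits: kept
)

# Same pattern as in the answer above
spam_pattern <- "^\\d+@|@\\d+\\.com"

spam     <- data %>% filter(str_detect(email, spam_pattern))
not_spam <- data %>% filter(!str_detect(email, spam_pattern))

not_spam$email
#> [1] "bob42@example.com" "carol@mail99.com"
```

Note that the pattern only covers the .com suffix; other suffixes would need a broader pattern.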

edited Nov 3, 2023 at 9:20

answered Nov 3, 2023 at 8:34


1

Note that (\\d+)@ has the same effect as (\\d)@ because there is nothing preceding it in the regex. The + within @\\d+\\.com cannot be removed, as there is something (i.e. the @) before the \\d+.
– AdrianHHH
Commented Nov 3, 2023 at 8:46
Thanks, this is truly simple, but is there a way to test whether there are ONLY digits at the start of the address?
– Assa Yeroslaviz
Commented Nov 3, 2023 at 9:16
If you reduce the regex to ^\\d+@, it will only match addresses that have exclusively digits before the @.
– DuesserBaest
Commented Nov 3, 2023 at 9:42
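
To make the points in these comments concrete, the following sketch compares the variants of the pattern. The addresses in `x` are invented for illustration.

```r
# Hypothetical addresses, invented for illustration only
x <- c("12345@example.com", "bob42@example.com", "alice@98765.com")

# Unanchored: one digit directly before the @ is enough, so + changes nothing
grepl("\\d+@", x)
#> [1]  TRUE  TRUE FALSE
grepl("\\d@", x)
#> [1]  TRUE  TRUE FALSE

# Anchored with ^: everything before the @ must be digits
grepl("^\\d+@", x)
#> [1]  TRUE FALSE FALSE

# After the @ the + matters, because the digits are bounded on both sides
grepl("@\\d+\\.com", x)
#> [1] FALSE FALSE  TRUE
grepl("@\\d\\.com", x)
#> [1] FALSE FALSE FALSE
```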


marc_s · Accepted Answer · 2023-11-03 14:29:34Z

It's a rather simple solution and I think there might be a cleaner way without creating all these extra columns:

library(stringr)
library(dplyr)

adress <- c("[email protected]","[email protected]","[email protected]")

adf <- data.frame(adress)

adf[c("Before","After")] <- str_split_fixed(adf$adress, '@', 2) # split each address at the @ into the part before and the part after

adf[c("After2","com")] <- str_split_fixed(adf$After, "\\.", 2) # split the part after the @ at the first dot into domain name and suffix

adf <- adf %>% filter(grepl('[a-zA-Z]', Before)) # keep rows whose part before the @ contains a letter

adf <- adf %>% filter(grepl('[a-zA-Z]', After2)) # keep rows whose domain name contains a letter

adf$adress

[1] "[email protected]"
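
One possible way to avoid the helper columns, offered as a sketch and not as the answerer's own method, is to test both parts of the address inside a single filter call. The addresses and the name `kept` are invented for illustration. As in the code above, a row is kept only if both the part before the @ and the domain name contain at least one letter.

```r
library(dplyr)
library(stringr)

# Hypothetical addresses, invented for illustration only
adf <- data.frame(
  adress = c("12345@example.com", "alice@98765.com", "bob42@example.com")
)

kept <- adf %>%
  filter(
    str_detect(adress, "^[^@]*[a-zA-Z]"),  # a letter somewhere before the @
    str_detect(adress, "@[^.]*[a-zA-Z]")   # a letter between the @ and the first dot
  )

kept$adress
#> [1] "bob42@example.com"
```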
