Discussion:
Development of an object assessment format/protocol
Rich Kulawiec
2013-03-04 13:29:24 UTC
I've been thinking about this for a long time, and would like to find
out what others have been doing in this area (if anything) and whether
this is a topic we can or should collectively pursue.

Here's the problem statement:

We've been using DNS to communicate information about the assessment
of certain objects -- IP addresses, host names, domain names -- and
while that has its advantages (notably that most of the software already
exists, is already installed, is reasonably well-understood, etc.) it
also limits the vocabulary we can use. Moreover, there are objects
that we might want to talk about whose information isn't easily
communicated via DNS -- e.g., web pages, email addresses. We use other
kinds of methods for communicating those, including downloadable files,
APIs, etc.

We need, I think, a mechanism via which we can ask more complex questions
and get more comprehensive answers. We need a mechanism which isn't
a hack on top of DNS, but which has been developed from the ground up
specifically for this purpose.

At the moment, there are a number of ad hoc ways that this happens:
for example, Joe Wein maintains a rather large list of spammer/phisher
email addresses. (And domains, too.) The Malwaredomains folks have
lists of domains. The Stopforumspam folks have lists of domains and
IP addresses. There are DNSBLs and RHSBLs like the ones run by Spamhaus.
There are various projects to identify malicious web pages. And so on.

And all of these are great, except: they all use different ways to
express information. Some of them can be queried; some can't. Some
of them carry metadata like "how did we decide this?" or "when did
we decide this?" or "for further reference, see:" and some don't.
Some of them support methods for asking narrower/broader questions,
some of them don't.

What I'm suggesting, therefore, is that we need (a) a standardized
way to express these things and (b) a standardized protocol by which
we can ask questions and get answers. For instance:

Does the web page at http://example.com/foo.html contain malware?

Is the address ***@example.net associated with phishing?

What can you tell me about the domain example.com?

Has the IP address 192.168.0.20 sent spam recently?

Certainly all of these things are possible today, by asking various
information sources in various ways. But not in an integrated,
unified fashion which would yield results that could be compared
to each other or integrated with each other programmatically.

(For example, I might wish to ask 5 different information sources
about 192.168.0.20 and weight their opinions. Or I might want
to ask an open-ended question like "what do you know about example.com?")

In all these instances, opinions come with metadata: whose opinion
is this? At what time was it rendered? Is there a time at which
it should be considered no longer valid? Is there a confidence
level associated with this opinion? Is the answer specific to
the object that was asked about or does it apply more broadly?
(e.g., I asked about 192.168.0.20 but got back an opinion that
applies not only to that, but to all of 192.168.0.0/24.)

Where I'm going, probably predictably, is that the format for
both questions and answers may be XML-based in order to provide
sufficient expressive power. (Yes, that's verbose. Very much
the antithesis of the terse Q/A format we use with DNS. I haven't
been able to decide if that's a good, bad or neutral thing,
other than noting that using XML has the advantage of making
information immediately palatable to a wide range of software.)
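
(To make that concrete, here's a purely hypothetical sketch of what an
XML-encoded answer carrying the metadata described above might look
like. Every element name here is invented for illustration, not a
proposal:

<assessment-answer version="1.0">
  <from>rep.example.com</from>
  <object type="ipv4">192.168.0.20</object>
  <applies-to type="cidr">192.168.0.0/24</applies-to>
  <opinion query="spam source?">yes</opinion>
  <rendered>2013-03-04T16:06:38Z</rendered>
  <expires>2013-03-08T13:05:00Z</expires>
  <confidence>0.9</confidence>
</assessment-answer>

Note that the answer can identify whose opinion it is, when it was
rendered, when it expires, how confident the source is, and that it
applies to a broader object than the one asked about.)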

So let me see if I can phrase the questions this way:

1. Is such a format needed?
2. Is a query-response protocol needed to transmit it?
3. If so, does anything already exist which would lend itself
to (1) and (2) with minimal changes? If so, is it desirable
to run that experiment?
4. If not, then is there sufficient utility in this approach
that it's worth pursuing?
5. If this exists, will it be used? Is there sufficient reason
for changes from what already exists?

(I'm aware of draft-dskoll-reputation-reporting, but it doesn't
cover all kinds of objects I have in mind here. I'll note in
passing though that it attempts to be as concise as possible,
which is a good thing and a fine thing, but does limit the
scope of both questions and answers.)

---rsk
Martijn Grooten
2013-03-04 15:46:14 UTC
Post by Rich Kulawiec
And all of these are great, except: they all use different ways to express
information. Some of them can be queried; some can't. Some of them carry
metadata like "how did we decide this?" or "when did we decide this?" or
"for further reference, see:" and some don't.
Some of them support methods for asking narrower/broader questions,
some of them don't.
Is the reason different sources use different ways to express
information the fact that there is no suitable protocol? Or is it a
mere consequence of the fact that sources have different things they
are willing and able to share?

I think the idea is nice. Whether such a format is really needed I'm not sure. I can see how having more information available makes for better decisions, but I am worried the accuracy gained isn't worth the performance lost.

Perhaps you can come up with examples of where such a protocol would be useful?

Martijn.


Emanuele Balla (aka Skull)
2013-03-04 16:59:48 UTC
On 3/4/13 4:46 PM, Martijn Grooten wrote:
Post by Martijn Grooten
Is the reason different sources use different ways to express
information the fact that there is no suitable protocol? Or is it a
mere consequence of the fact that sources have different things they
are willing and able to share?
I think the idea is nice. Whether such a format is really needed I'm
not sure. I can see how having more information available makes for
better decisions, but I am worried the accuracy gained isn't worth
the performance lost.
As someone who's been thinking about and experimenting with this for
maybe the last 3 or 4 years: yes (IMHO) the protocol would be useful.

IMHO (read this post as if I had filled it with "IMHO" all over the
place), the reason why such a protocol won't be a big performance
loss is that DNS caching in DNSBL-like lookups is probably not as
useful as most people would expect...

The reason it hasn't been built so far is that those who are most
interested in such a protocol being available (data providers, as they
have data they could provide through it) rarely have the resources to
create something like that.

Partly because creating protocols is not their main job.

Partly because they would have to provide both server and client
implementations, otherwise nobody would be able to use their data.

Partly because - to be as smart as DNS - the client needs to be as
"complex" as DNS: redundancy, load balancing, lowest-latency server
selection, etc. All that logic has to go in the client, not in the
server layout (as in "let's run anycast and forget about the
client"), for several reasons.

Partly because, if the rest of the industry doesn't follow their lead,
the service will be unusable for most users and will simply never get
traction.

Partly because there are lots of issues around providing/looking up
data at extreme granularity (privacy, exposure, etc.), so these data
providers are not sure the service would be so useful after all, and
prefer to just provide the data in raw format to those who are
smart enough to figure out on their own how to use it.
Post by Martijn Grooten
Perhaps you can come up with examples of where such a protocol would be useful?
Straight to the point: abusive URLs on legit domains. There's no
(easy/effective) way to encode an entire URL in a DNS request.
At least, that's the reason why I've been thinking about this topic for
the last 4 years... :-\
Martijn Grooten
2013-03-04 18:28:36 UTC
Post by Emanuele Balla (aka Skull)
Straight to the point: abusive URLs on legit domains. There's no
(easy/effective) way to encode an entire URL in a DNS request.
At least, that's the reason why I've been thinking about this topic for the last
4 years... :-\
Can't you just use HTTP for that? There is an easy and effective way to encode URLs in HTTP - and HTTP is pretty good at returning all sorts of responses: a single character (0=good, 1=bad), some XML, some JSON, something else. There is obviously some overhead from the TCP connection and the request and response headers, but I wonder if there are many cases in which:
- this overhead is a huge problem;
- the request can't easily be 'encoded' into DNS.
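
To illustrate - the host name, path, and response format below are all
invented - percent-encoding makes an entire URL safe to carry in an
HTTP query string:

import urllib.parse
import urllib.request

def check_url(url):
    # Percent-encode the full URL so it survives as a query parameter.
    qs = urllib.parse.urlencode({"url": url})
    req = "https://rep.example.com/check?" + qs
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()  # e.g. "0" = good, "1" = bad

print(check_url("http://example.net/some/page.html?id=1&x=2"))

For bulkier queries, an HTTP POST body would work just as well.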

Rich's examples all seem pretty easy to encode into DNS, but more importantly, to me they shout for HTTP POST. When Rich's idea of asking for context (expiration time, range to which the answer applies) is used well, it could actually save you a lot of further requests.

Note: some web proxies are already using HTTP to make requests about whether a particular URL is bad. In web proxies time really does matter (delaying all web pages by a second seriously affects perceived performance).

Martijn.


Paul Smith
2013-03-04 18:43:47 UTC
Post by Martijn Grooten
Straight to the point: abusive URLs on legit domains. There's no
(easy/effective) way to encode an entire URL in a DNS request.
At least, that's the reason why I've been thinking about this topic for the last
4 years... :-\
Can't you just use HTTP for that?
Well, HTTP seems a bit 'heavyweight' for this to me. That's one of the
advantages of DNS - it's UDP, so there are no packets spent setting up
short-lived sessions. (Other advantages, AFAICS, are distributed
caching and widespread support.)

I suppose you could keep an HTTP session open for a while, but you'd
need a beefy server to handle the zillions of sessions you'd have to
have open at once. DNS doesn't have 'sessions', so you don't have this
problem.

OTOH, a disadvantage of DNS is that it's UDP, so you have to handle
retries etc. yourself.
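
For example, here's a minimal sketch of the retry logic a UDP client
has to supply itself (the port number and wire format are invented):

import socket

def udp_query(server, payload, retries=3, timeout=2.0):
    # UDP gives no delivery guarantee, so the client resends on timeout.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for attempt in range(retries):
            try:
                sock.sendto(payload, (server, 7357))
                data, _ = sock.recvfrom(4096)
                return data
            except socket.timeout:
                continue  # lost query or lost answer: try again
        raise TimeoutError("no answer after %d tries" % retries)
    finally:
        sock.close()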

So, if you're looking at something like this, you first of all need to
decide: UDP or TCP? UDP makes it easy and quick to have lots of packets
flying around, but you have extra work to handle retries, and some of
the benefit of UDP could be gained by just having long-lived sessions
between reputation source and reputation checker. But this may cause
issues for servers and firewalls (could a typical server/firewall have
hundreds of thousands of active TCP sessions? A NAT firewall would die
quickly, but could a non-NAT firewall cope?)

If you decide UDP is the most efficient, then DNS is very attractive,
because you already have distributed caching 'built in' to the Internet
infrastructure; but if we're willing to give that capability up, then
I'm fairly sure we could come up with something with suitable
capabilities that would fit in a UDP packet - once we can decide
what the 'suitable capabilities' are...

If TCP is the way to go, then the world is your oyster, but I'd be
concerned about speed and the server requirements. Anyone know how many
queries someone like Spamhaus gets an hour?



-

Paul Smith Computer Services
Tel: 01484 855800
Vat No: GB 685 6987 53
Emanuele Balla (aka Skull)
2013-03-04 20:41:35 UTC
Post by Martijn Grooten
Straight to the point: abusive URLs on legit domains. There's no
(easy/effective) way to encode an entire URL in a DNS request.
At least, that's the reason why I've been thinking about this topic for the last
4 years... :-\
Can't you just use HTTP for that?
You could, for sure.
But you won't have redundancy/load_balancing/best_peer_selection in the
client: you'd need to wrap something around it (SRV records for the
client, plus clustering, anycast, geoDNS, etc. to direct the client to
the best server).
This will increase the requirements for running such services significantly.
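
For illustration, a rough sketch of client-side SRV-based server
selection, using the dnspython library (the service name here is
invented):

import dns.resolver  # dnspython

def pick_server(service="_repquery._tcp.rep.example.com"):
    answers = dns.resolver.resolve(service, "SRV")
    # RFC 2782: lowest priority wins; a real client would randomize
    # among equal priorities in proportion to weight. Here we just
    # prefer the highest weight deterministically.
    best = min(answers, key=lambda r: (r.priority, -r.weight))
    return str(best.target).rstrip("."), best.port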

Also, you'd move the entire thing to TCP, requiring sessions/sockets,
which are much more expensive to scale properly and also much more
susceptible to DDoS than UDP-based protocols.

Then take into account the number of queries major DNSBLs satisfy at the
moment (over DNS, where there's at least some caching in place): >100Kqps.

All in all, I'm quite confident there are not many entities willing to
provide service to the internet at large over such an infrastructure...
Rich Kulawiec
2013-03-04 17:10:10 UTC
Post by Martijn Grooten
Is the reason different sources use different ways to express
information the fact that there is no suitable protocol? Or is it a
mere consequence of the fact that sources have different things they
are willing and able to share?
That's a pair of great questions, and I can see reasons to answer "yes"
to both.

On the one hand: there's no standardized way to do this (beyond DNSBLs
and RHSBLs, which we've piggybacked on DNS). On the other hand, you're
right, different people are making different statements about different
entities -- IP addresses, domains, web pages, email addresses, etc. --
so *if* there existed some standardized way to express this, it would
have to let them say the same things they're saying now...because otherwise
they'd probably have no reason to use it.

So I dunno.
Post by Martijn Grooten
Perhaps you can come up with examples of where such a protocol would be useful?
Sure. Let me show these using some pseudocode, just to illustrate the
concept. Let's presume that example.org is asking questions of example.com.

Question:
    query-proto-version = 1.0
    query-to = blah.example.com
    query-time = Mon Mar 4 16:06:37 UTC 2013
    object type = ipv4
    object value = 192.168.0.3
    object query = spam source?

Answer:
    answer-proto-version = 1.1
    answer-from = blah.example.com
    answer-time = Mon Mar 4 16:06:38 UTC 2013
    answer-valid-time = Fri Mar 1 13:05:00 UTC 2013
    answer-expiration-time = Fri Mar 8 13:05:00 UTC 2013
    answer = yes

This is the equivalent of a DNSBL check -- except that the answer
also contains two more items. It includes an "answer-valid-time",
which could be "the time that we started giving out this answer",
and "answer-expiration-time", which could be the time that this
answer is scheduled to expire. Thus the former could mean "we listed
this IP address at 1:05 PM last Friday, because that's when our sensors
told us to" and the latter could mean "unless we see a reason to
extend the listing, we're going to drop it at 1:05 PM this Friday".
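
(For illustration only, a throwaway sketch in Python of how these
key/value pairs might be produced and parsed; the field names are just
the ones used in the examples here:

def serialize(fields):
    # One "key = value" pair per line, as in the examples above.
    return "\n".join("%s = %s" % (k, v) for k, v in fields.items())

def parse(text):
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

answer = parse("answer = yes\n"
               "answer-expiration-time = Fri Mar 8 13:05:00 UTC 2013")
print(answer["answer"])

The point being that the format is trivially machine-readable.)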

Question:
    query-proto-version = 1.0
    query-to = blah.example.com
    query-time = Mon Mar 4 16:06:37 UTC 2013
    object type = URL
    object value = http://example.net/some/page.html
    object query = infected with malware?

Answer:
    answer-proto-version = 1.1
    answer-from = blah.example.com
    answer-time = Mon Mar 4 16:06:38 UTC 2013
    answer-valid-time = Fri Mar 1 14:05:00 UTC 2013
    answer-expiration-time = Fri Mar 8 14:05:00 UTC 2013
    answer = no

This is a very similar Q/A: in this case the answer is negative,
but it also has an expiration time. (Let's presume that example.com
is crawling sites at weekly intervals, thus there is no reason for
this answer to [possibly] change until the next crawl is done.
The requestor might be okay with this answer, or it might want
a more recent one -- in which case it will need to ask someone else.)

Question:
    query-proto-version = 1.0
    query-to = blah.example.com
    query-time = Mon Mar 4 16:06:37 UTC 2013
    object type = ASN
    object value = 123456789
    object query = hijacked?

Answer:
    answer-proto-version = 1.1
    answer-from = blah.example.com
    answer-time = Mon Mar 4 16:06:38 UTC 2013
    answer-valid-time = Fri Mar 1 14:10:00 UTC 2013
    answer-expiration-time = Fri Apr 5 14:10:00 UTC 2013
    answer = yes
    answer-additional = http://example.com/hijacks/123456789

Also very similar. I posited a much longer expiration time because
this is probably not going to be a quickly-remediated problem. I've
also shown an addition to the answer, which in this case is just a URL
where something consumable by humans might be found.

To expand on those just a little bit: "object type" could probably
encompass things like:

IPv4/IPv6 addresses
networks (by handle?) (by CIDR?)
ASNs
domains, subdomains, hosts
URLs
email addresses

"object query" could include the examples above, and much more obviously,
but should exclude those things that we already have ways to find out,
e.g., this should not be a way to query for a DNS A record, because
that's just kinda silly.
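
(Sketched as code purely for illustration -- these type names are
invented for the sake of the sketch, not a proposal:

from enum import Enum

class ObjectType(Enum):
    IPV4 = "ipv4"
    IPV6 = "ipv6"
    NETWORK = "network"            # by handle or by CIDR
    ASN = "asn"
    DOMAIN = "domain"              # also subdomains and hosts
    URL = "url"
    EMAIL_ADDRESS = "email-address"

The query side would carry one of these plus an object value and an
object query, as in the examples above.)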

There are two (at least two) ways to go with this: one would be to
make it concise and use UDP. Another would be to make it verbose
and use TCP. (Insert long discussion here about performance tradeoffs.)
I'm not sure that this is worth getting into unless the high-level
idea flies: if we don't actually need a standard format and a standard
protocol that uses it, then those tradeoffs don't matter.

---rsk
Dave Crocker
2013-03-04 17:06:55 UTC
Post by Rich Kulawiec
We need, I think, a mechanism via which we can ask more complex questions
and get more comprehensive answers. We need a mechanism which isn't
a hack on top of DNS, but which has been developed from the ground up
specifically for this purpose.
By way of exploring a baseline, how does what you are asking for differ
from the Repute work:


http://datatracker.ietf.org/wg/repute/charter/

Can it be a value-add enhancement to it? If not, why?

d/
--
Dave Crocker
Brandenburg InternetWorking
bbiw.net
Barry Shein
2013-03-04 19:58:44 UTC
My only thought is how much the coming era of 1,000+ new top-level
domains, internationalization of the DNS space, and IPv6 will motivate
new thinking vis-à-vis spam and related.

I'm sometimes quoted, from back around 1990, as joking that I
remembered the good old days when you could read all the Usenet
newsgroup topics in one day as they breached 100,000 groups -- an echo
of an older comment by someone else that they remembered when you
could read all the content every day.

Soon, "I remember the good old days when you could recognize all the
TLDs, including the ccTLDs, and had all the gTLDs memorized and could
even say why the were each created!" (other than perhaps .COOP :-)?

So, for example, if your (Joe or Jane End-User) risk assessment of a
.COM or a recognizable string in .COM is lower than some others, like
maybe random-noise.country-i-never-heard-of, what does it mean when
that spreads to 1,000 or more, possibly closer to 2,000, TLDs,
let alone internationalization? How can filters even be kept up
without automation? What will users make of punycode domains? Etc.

Same question for 128-bit addressing, which, if nothing else, may let
miscreants be much more highly motile.

And I'll toss in the progress of smartphone technology for free: who
thought even a few years ago that I could carry around so much spam in
my pants pocket! Now 64GB is de rigueur. And how go the smartphone
botnets? Shudder.

Maybe it's just the calm before the storm?
--
-Barry Shein

The World | ***@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: 800-THE-WRLD | Dial-Up: US, PR, Canada
Software Tool & Die | Public Access Internet | SINCE 1989 *oo*
John Levine
2013-03-05 01:28:39 UTC
Post by Rich Kulawiec
I've been thinking about this for a long time, and would like to find
out what others have been doing in this area (if anything) and whether
this is a topic we can or should collectively pursue.
Have you followed what the REPUTE working group did? Its goal was to
come up with a way to publish reputation information more general than
the handful of bits you can get from a DNSBL. There were some
interesting ideas but it's run out of gas and is about to be wound up.

draft-ietf-repute-considerations-00.txt
draft-ietf-repute-email-identifiers-06.txt
draft-ietf-repute-model-03.txt
draft-ietf-repute-media-type-05.txt
draft-ietf-repute-query-http-04.txt

If people had stuff they wanted to do, I expect it could be spun up
again. If they were sufficiently researchy and people had concrete
plans, I could probably persuade Lars to let the ASRG continue until
they were done.
